DM(X)

DMX Pitfalls : Natural Prediction Joins

Posted on April 10th, 2008 by bogdan

Some very interesting, although time-consuming project kept me away from posting in the last month. Mea maxima culpa, I will try to catch up and put some new and interesting content here more often.

Recently, trying to make sense of some prediction results, I realized that the DMX NATURAL PREDICTION JOIN statement has some obvious and some not so obvious dangers, which deserve some clarification.

To summarize, the PREDICTION JOIN operator is used to apply patterns from a trained mining model against new data. The general syntax is

SELECT EducationLevel

FROM MyModel

PREDICTION JOIN

OPENQUERY(MyDataSource, ”SELECT Height, , HairColor, ShoeSize FROM CustomerInfo’) AS T

ON

T.Height = MyModel.Height AND

T.HairColor = MyModel.HairColor AND

T.ShoeSize= MyModel.ShoeSize

In the standard PREDICTION JOIN statement, the ON clauses specify how columns from the new data are to mapped to the input columns of the mining model.

The statement supports a simplified format:

SELECT EducationLevel

FROM MyModel

NATURAL PREDICTION JOIN

OPENQUERY(MyDataSource, ”SELECT Height, HairColor, ShoeSize FROM CustomerInfo’) AS T

In the simplified format, the columns from the new data are mapped to the input columns of the mining model by name. Much easier to type, resulting in a more readable statement!

Now, the natural mapping may end up in three different scenarios. In the first scenario, all the model’s input columns find a corresponding data column in the input. The mapping is performed correctly and everything works as expected. Nothing to blog about in this case!

The second case: some of the inputs are mapped correctly to model columns, some are not, although you would expect them to be mapped correctly. More than a simple user error, this situation is actually “facilitated” by the way the BI Dev Studio works. Assume you start with a table containing 4 columns: Height, HairColor, ShoeSize, EducationLevel and use BI Dev Studio to create a mining structure and a model. BI Dev Studio will generate more friendly names for the columns and your mining model ends up with the structure below:

MyModel

(

Height

[Hair Color],

[Shoe Size],

[Education Level]

)

Knowing the schema of your relational table, you open Management Studio and execute the NATURAL PREDICTION QUERY above. The result of the query may be surprising: it is exactly the set of values in the input table. Well, the DMX statement returns absolutely perfect results, but they are not actually predictions! The EducationLevel column in the query is NOT a model column, therefore it is collected from the input query. The model column has the modified [Education Level] name (space included), so the DMX engine will not resolve the selected column as a model prediction but as a pass-through from the input data.

The fix is easy: SELECT [Education Level] … and it almost works. Almost, because the same “extra-space” problem applies to other input columns, The new result set, after the fix, is also problematic — only the Height column is correctly mapped to a model column, the others are not mapped by name (they do not contain the extra space in the model column names) and the model will only execute predictions based on Height.

For the whole query to work, you need to change it either to the full PREDICTION JOIN statement (with an ON clause) or to alias your query columns to match exactly the model columns.

The third case: Same example as above, just remove the Height column — now none of the inputs matches the model input columns.

The behavior in this case is unexpected:

- if you did not correct the EducationLevel to [Education Level], then the query returns the data from the input query, without using the model

- if you did correct the selected column to [Education Level], then the query returns an empty result set (no row). Now, this is actually helpful — if you know your table has 100 rows and your prediction query returns 0 rows, there must be a problem. DMX figures out that nothing from the input is actually used in the prediction query and it does not execute the query at all.

To make things more complex, the 0-result-if-nothing-matches rule does not apply to singleton statements. There are some reasons for this, but they do not change the fact that the statement behaves differently when used against OPENQUERY vs singleton.

SELECT [Education Level]

FROM MyModel

NATURAL PREDICTION JOIN

(SELECT ‘brown’ AS HairColor, 11 AS ShoeSize UNION

SELECT ‘black AS HairColor, 10 AS ShoeSize ) AS T

This query actually returns 2 predictions, although nothing in the input singleton matches the model columns. The mode returns predictions without an input — so-called marginal predictions, based on the prior statistics of the training data.

Conclusion

In general, use fully qualified column names and the full syntax (including the ON clause) in a PREDICTION JOIN statement, particularly when you write a tool that generates the DMX statement.

A query like below will either work correctly or fail with an error (column not found)

SELECT MyModel.EducationLevel

FROM MyModel

PREDICTION JOIN

OPENQUERY(MyDataSource, ”SELECT Height, , HairColor, ShoeSize FROM CustomerInfo’) AS T

ON

T.Height = MyModel.Height AND

T.HairColor = MyModel.HairColor AND

T.ShoeSize= MyModel.ShoeSize

Use NATURAL PREDICTION JOIN in the following situations:

- you know exactly how your model columns are named and are 100% confident that your prediction data table matches exactly the model schema. This is typically the case when you created your model with a CREATE MINING STRUCTURE/MODEL statement, starting directly from the schema of the training and prediction data.

- you are building a DataTable in a .Net application, starting directly from the model definition (and populate the table with application specific values). You will see such queries in the Excel add-ins’ trace window — many prediction statements are using rowset parameter, where the rowset schema is built from the model schema, and the rowset data is fetched from a spreadsheet

- you are executing a prediction against structure’s or model’s cases (typically the holdout set of cases

SELECT [Education Level] FROM MyModel NATURAL PREDICTION JOIN

(SELECT * FROM MyStructure.CASES WHERE IsTestCase() ) AS T

Tags: DMX //

One Response to “DMX Pitfalls : Natural Prediction Joins”

www.ssas-info.com // Apr 13, 2008 at 2:52 pm

Pingback. A link to this article was added on our website in the [SSAS Articles]/[Data Mining] section:
http://www.ssas-info.com/analysis-services-articles/56-data-mining/811-dmx-pitfalls-natural-prediction-joins

bogdanc on SQL Server Data Mining

DMX Pitfalls : Natural Prediction Joins

One Response to “DMX Pitfalls : Natural Prediction Joins”

Discussion Area - Leave a Comment

Categories

Blogroll

Archives