DMX in 2008: Model Filtering (Part 2 of 2)

This second post on model filtering deals with nested table filters. The most common use of nested tables in SQL Server Data Mining is to model transaction tables. That is, model Customer entities together with the “bag” of products they purchased.

Nested table filters primarily serve two purposes:

  • - Use (in the model) only those nested table rows with a certain property (e.g. consider only products that are not “Coca Cola”, because most people do buy “Coca Cola”, so it is not that interesting)
  • - Use (in the model) only those cases where the associated nested tables have certain properties (e.g. - build a prediction model on customers that have purchased Coca Cola)

Notice the difference between the filters: while the first filter removes products from shopping baskets before modeling, the second removes customers from the model (those that have not purchased Coca Cola).

Let’s start with a mining structure containing customers, some demographic information and a list of products they bought

CREATE MINING STRUCTURE TestStruct2
(
 OrderNumber TEXT KEY,
 IncomeGroup TEXT DISCRETE,
 Products TABLE
 (
      Product TEXT KEY,
      LineNumber LONG DISCRETE
 )
)

Next step, train the mining structure:

INSERT INTO MINING STRUCTURE TestStruct2
(
 OrderNumber,
 IncomeGroup,
 Products
 (
      SKIP,
      Product,
      LineNumber
 )
)
SHAPE {OPENQUERY([Adventure Works DW], ‘SELECT
 OrderNumber,
 IncomeGroup
 FROM dbo.vAssocSeqOrders ORDER BY OrderNumber‘)
}
APPEND ({ OPENQUERY([Adventure Works DW], ‘SELECT
 OrderNumber,
 Model,
 LineNumber
 FROM dbo.vAssocSeqLineItems ORDER BY OrderNumber‘) }
RELATE OrderNumber TO OrderNumber) AS Products

Now, the modeling part: for the first kind of filter described in the beginning, the goal is to build a model that predicts some, for example, demographics (IncomeGroup) based on the products a customer purchased, without considering a certain product (say, the ‘Mountain-500′ bike):

ALTER MINING STRUCTURE TestStruct2
ADD MINING MODEL IncomeGroupFromProducts
(
 OrderNumber,
 IncomeGroup PREDICT,
 Products
 (
      Product
 ) WITH FILTER(Product<>’Mountain-500′)
)USING Microsoft_Decision_Trees

The syntax of the nested table filter is very similar with the syntax of model filters - it effectively specifies a WHERE clause to be applied on the rows of the nested table. The training data for the model can be obtained with the following drillthrough query:

SELECT OrderNumber,  IncomeGroup,
 (SELECT Product FROM Products WHERE Product<>’Mountain-500′)
FROM MINING STRUCTURE TestStruct2.CASES WHERE IsTrainingCase()

The syntax changes for the second filter described before. Now the goal is to build a model predicting IncomeGroup only for those customers that did buy a certain product:

ALTER MINING STRUCTURE TestStruct2
ADD MINING MODEL IncomeGroupFromMountain500
(
 OrderNumber,
 IncomeGroup PREDICT,
 Products
 (
      Product
 )
)USING Microsoft_Decision_Trees WITH
FILTER(
EXISTS (SELECT * FROM Products WHERE Product=’Mountain-500′)
)

The new syntax element is the EXISTS predicate, which evaluates to TRUE if the query specified as argument returns at least one row.

Note that the list of columns for the EXISTS query is never used, so * is as good as anything else. The WHERE clause of the EXISTS predicate supports, in the case of nested table filters, only the simple <Column> <Operator> <Value> predicates (or Boolean combinations of such predicates) that I presented in my previous post

The training data for the model can be obtained with the following drillthrough query:

SELECT OrderNumber,  IncomeGroup,
 (SELECT Product FROM Products)
FROM MINING STRUCTURE TestStruct2.CASES
WHERE
IsTrainingCase() AND
EXISTS( SELECT * FROM Products WHERE Product=’Mountain-500′)

The query actually works in DMX, which implies that DMX contains now the EXISTS function which evaluates to TRUE if the sub-query argument returns at least one row. Note that, when used in a regular DMX query (i.e. not in a filter), the EXISTS sub-query can be applied to any nested table (including function results, such as PredictHistogram) and can take any WHERE clause that is valid in that context, including UDFs (therefore, outside of a filter, the EXISTS clause is not limited to structure table columns and simple predicates).

Combinations of filters and other SQL Server 2008 DMX features can be used to have very specialized data views, allowing complex modeling scenarios.

Notice how the structure above contains a LineNumber column in the nested table, a numeric field indicating the category of the product on the same line. The model below detects cross-sales rules that lead from products in line 1 (bikes) to products in any other category:

ALTER MINING STRUCTURE TestStruct2
ADD MINING MODEL LineNumber1Recommendations
(
 OrderNumber,
 Products AS LineNumber1Products
 (
      Product
 ) WITH FILTER(LineNumber=1),
 Products AS OtherLineNumberProducts PREDICT_ONLY
 (
      Product
 ) WITH FILTER(LineNumber<>1)
)USING Microsoft_Association_Rules
WITH FILTER(
      EXISTS(SELECT * FROM Products WHERE LineNumber=1) AND
      EXISTS(SELECT * FROM Products WHERE LineNumber<>1)
)

Here is a breakout of the filters and constructs:

  • LineNumber1Products is an input nested table, based on data in the Products structure column, and containing only products in the category denoted by LineNumber 1
  • OtherLineNumberProducts is a predictable (PREDICT_ONLY) nested table, also based on data in the Products structure column, but containing only products in the categories other than the one denoted by LineNumber 1
  • The model is trained only on those customers that have both at least one product in the LineNumber 1category and at least one product in a different category, because any other customers are not relevant for the problem the model is trying to solve.
  • All the filters (both the nested table row filters and the model case filter) are applied on mining structure columns which do not necessarily show in the model (the case of LineNumber). When they do show in the model under an alias, the filter is applied on the source structure column name.

So, download the November CTP of SQL Server 2008, try to use the filters to model your complex business problems and send us your feedback! Make sure to post any issues/questions on the Data Mining Forums . 

Happy New Year!

One Response to “DMX in 2008: Model Filtering (Part 2 of 2)”

  1. […] of selling e-books (publishers sell to retailers, who then sell to readers at a price that the …DM(X) DMX in 2008: Model Filtering (Part 2 of 2)… table, also based on data in the Products structure column, but containing only products in the […]

Discussion Area - Leave a Comment