(More) DMX features in 2008: Better use of structure columns in models

In a previous post I presented some of the new DMX features in the November CTP of SQL Server 2008. So, here are some new cool DMX tricks.

Assume you want to use Microsoft Association Rules and Microsoft Decision Trees on the same data. Also, assume that the data contains one numeric column (say, Age). You may have noticed that Decision Trees supports continuous columns, while Association Rules does not. Not a big deal, we have discretization, and Age can be added twice to the same mining structure as, say Age (Continuous) and Age Disc (Discretized). However, the two different names raise a problem in the case of NATURAL PREDICTION JOIN (where input columns are bound by name to the model columns).

In SQL Server 2005, the mining model columns typically had the same name as the mining structure’s ones. There was no way in DMX to change the model column names (well, there is a way in BI DevStudio).

Another problem: one would not include the email, name or phone number of a customer in a mining model, because, at best, this would increase training time and, in the worst case, would unnecessarily complicate the model with fake patterns. But this makes it hard to link the training cases leading to one pattern (available with the drillthrough feature) to information that would perhaps make that pattern actionable (like contact information, if the pattern suggest strong probability to buy a product).

In 2008, these problems are much easier to solve, with mode column aliasing and structure columns drillthrough, and here is how these work.

I’ll make the same assumptions as previously, primarily that you have already installed the Adventure Works DW sample database, coming with SQL Server 2005.

The first thing, as always, is to create a new mining structure.
 

CREATE MINING STRUCTURE TestStruct2
(
 CustomerKey LONG KEY,
 FirstName TEXT DISCRETE,
 LastName TEXT DISCRETE,
 YearlyIncome DOUBLE CONTINUOUS,
 YearlyIncomeDisc DOUBLE DISCRETIZED
 EnglishOccupation TEXT DISCRETE,
 CommuteDistance TEXT DISCRETE,
 BikeBuyer BOOLEAN DISCRETE
)
Notice the first two non-key columns in the structure, contact information that should not be used in analysis, and the two YearlyIncome columns, a continuous one and a discretized one. Next, model creation:

ALTER MINING STRUCTURE TestStruct2
ADD MINING MODEL BikeBuyerClassDT
(
 CustomerKey,
 YearlyIncomeDisc AS YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH DRILLTHROUGH
The novelty is the AS fragment - it creates a YearlyIncome model column which is bound to the YearlyIncomeDisc structure column, and this solves the first problem.

Besides allowing similar NATURAL PREDICTION JOIN statements for all the models in a mining structure, aliasing allows some other very interesting constructs. For example, it allows a single nested table from the structure twice in the same mining model, opening new modeling possibilities which deserve a separate post (likely, my next).

Now, the model can be easily trained with an INSERT INTO statement like below:

INSERT INTO MINING STRUCTURE TestStruct2
(
 CustomerKey,
 FirstName,
 LastName,
 YearlyIncome,
 YearlyIncomeDisc,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer
)
OPENQUERY([Adventure Works DW], ‘SELECT CustomerKey,
 FirstName,
 LastName,
 YearlyIncome,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer  FROM dbo.vTargetMail
‘)

Almost nothing special about INSERT INTO, just note how both YearlyIncome and YearlyIncomeDisc columns appear in the structure’s column list and how, for this reason, YearlyIncome appears twice in the relational query.

Now, the model is trained and ready to use, therefore ready for drillthrough. This operation allows now selecting columns from mining structure that do NOT appear in the mining model, and here is the syntax for it:

SELECT
  CustomerKey,
  YearlyIncome,
  EnglishOccupation,
  CommuteDistance,
  BikeBuyer,
  StructureColumn(‘FirstName’) AS [FirstName],
  StructureColumn(‘LastName’) AS [FirstName]
  FROM BikeBuyerClassDT.CASES
// WHERE IsInNode(‘…..’) — optional content node identifier
The StructureColumn functions can be used to include, in drillthrough, the values associated with the respective cases in columns that are part of the structure but not of the mining model.

The BI Dev Studio viewers take advantage of this feature, providing a nice UI for selecting structure columns to be included in the drillthrough.

There’s one more thing to say about StructureColumn when used inside nested tables. Assume you have a structure nested table containing product names, and their categories and, for whatever reason, you did not include the category column in the model’s nested table. Here’s how to fetch that structure category column in a drillthrough statement:

SELECT 
          CustomerKey,
          YearlyIncome,
          (SELECT ProductName,  StructureColumn(‘Category’)  AS [Category]  FROM Products) AS Products
FROM BikeBuyerClassDT.CASES
// WHERE IsInNode(‘…..’) — optional content node identifier

One Response to “(More) DMX features in 2008: Better use of structure columns in models”

  1. […] was published on February 25, 2010 in Sports and tagged with Column, Issue 5, NBA, Sports …DM(X) (More) DMX features in 2008: Better use of structure …Assume you have a structure nested table containing product names, and their categories and, for […]

Discussion Area - Leave a Comment