New DMX features in 2008 : Test and Training data

As Raman mentioned in the Data Miner newsletter, the November CTP of SQL Server 2008 is now available.

It includes many really cool new features in Analysis Services. Among them: Holdout support, Model filtering, DMX Column aliasing, Drillthrough enhancements, Cross validation and practically two new forecasting algorithms under the Microsoft_Time_Series umbrella.
I intend to present all of them briefly, and I start today with the Holdout support.

Most of the data mining tasks require a validation step, performed right after modeling. This validation step
consist (typically) in evaluating model’s performance against data that was not seen during training (test data). 

The test data ideally has the same statistical properties as the training data (data seen by the mining model during training). An easy way to achieve statistical similarity between the training and test set is to use random sampling. This method is not guaranteed to give correct results (statistical similar populations) but, assuming that the random sampling mechanism is independent of the data, it will work in most common scenarios.

SQL Server Integration Services has a Random Sample transform, which extracts a random sample (with a certain percentage) using a mechanism independent of the actual data being samples. This is why we strongly recommended using Integration Services to generate test/training partitions for SQL Server Data Mining.

However, there are a few problems:
- Integration Services will have to save at least one of the sample sets in a relational table (or some form of output destination)
- This sampling method can only be applied to data coming from a relational source (or, in general, a source that can be used with IS). That means it is difficult to use IS sampling with application that do data mining on in-memory data
- Integration Services is rather hard to use to sample data for models with nested tables. It can be done, but it takes around 11 simple steps and 14 transforms :-) to do this for a single nested table (an example is available here: Sampling Nested Tables )

Now, there is a simpler way to do this. You may remember that, in the SQL Server Data mining architecture, a Mining Structure object acts as a data space while a mining model is a problem to be addressed in that data space. As the mining structure describes the data space, it is  natural for the structure to partition the data into training and testing sets.

In SQL Server 2008, the new wizard for creating a model (or structure — yes, there is a wizard now for creating a structure with no models!) allows specifying the desired size of the “Holdout” dataset — that is, data to be stored in the mining structure for testing purposes, without being available for models training. By default, the holdout size is 30% (leaving 70% for training). You may choose to specify a fixed number of rows instead of a percentage, or both (in this case the semantic is “use 30%, but no more than 10000 rows for test data”).

The rest of this post shows how to express the holdout wizardry in DMX.

I assume that you have already installed the Adventure Works DW sample database, coming with SQL Server 2005 (or 2008 CTP). If not, you can download the sample databases from here. I also assume that:
- you have created an Analysis Services database
- that database contains a data source, named [Adventure Works DW], pointing to the Adventure Works DW relational database
- you have started Sql Server Management Studio and created a new DMX query, being now connected to the Analysis Services 2008 instance, to the database mentioned above
The first step is to create a new mining structure. The DMX looks very much like in 2005:
CREATE MINING STRUCTURE TestStruct
(
 CustomerKey LONG KEY,
 FirstName TEXT DISCRETE,
 LastName TEXT DISCRETE,
 YearlyIncome DOUBLE DISCRETIZED,
 EnglishOccupation TEXT DISCRETE,
 CommuteDistance TEXT DISCRETE,
 BikeBuyer BOOLEAN DISCRETE
) WITH HOLDOUT( 30 PERCENT OR 5500 CASES)
The new element is the “WITH HOLDOUT( 30 PERCENT OR 5500 CASES)” part — it instructs the mining structure to preserve 30% of the data (but no more than 5500 cases) for testing. Note how the WITH HOLDOUT option appears at the mining structure level and has nothing to do with the internal layout of the structure. The nested tables problem is now solved!

The next step, is to train the mining structure. At this point, any training mechanism may be used - including application rowsets as parameters
INSERT INTO MINING STRUCTURE TestStruct
(
 CustomerKey,
 FirstName,
 LastName,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer
)
OPENQUERY([Adventure Works DW], ‘SELECT CustomerKey,
 FirstName,
 LastName,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer  FROM dbo.vTargetMail
‘)

The Analysis Services engine performs the following steps:
- executes the query (same as in SQL 2005)
- it samples the data randomly, as it reads the query results
- a new (hidden) entity is created and associated with the mining structure : a Holdout store. This store maintains, for each data point, the partition it belongs to (test or training). The Holdout store is preserved under the mining structure’s data folder - it is the file named *.TestStruct.hld.partition, which is not directly accessible, or readable. However, if you were
wondering what this file is, here is the answer!

Another cool thing about the mining structure in SQL Server 208 — it allows you to query the data directly.
Here is an example of such query:
SELECT * FROM TestStruct.CASES
Or, even better:
SELECT *, IsTestCase() AS [TestCase] FROM TestStruct.CASES

The new function, IsTestCase(), returns true for cases (data points) belonging to the holdout (test) partition and false for cases belonging to the training partition. For ease of use, there is also a IsTrainingCase function (which means exactly NOT IsTestCase()). Moving the IsTestCase function call to the WHERE clause of the query will return all the test cases for the mining structure.
Now, drillthrough at the mining structure level may seem like an elevation of privileges when compared with SQL Server 2005 (remember, there is a model Drillthrough permission!). However, structure drillthrough only works for structure admins (i.e. users that can create a new model with drillthrough permission at any time, and that are model admins by definition). The model drillthrough permission is still in place, for users that are NOT admins but have a good reason to perform model drillthrough.
Whenever training a mining model inside this structure, it will only see the training cases from the mining structure. This is by design (and by definition, after all this is what the training partition is about). (However, if you really need to create a model based on the training data, you will have to create it in a separate structure and train that structure with a query like the one above)

To show a couple more tricks allowed by the structure holdout, I need to create a mining model inside this structure. In the same Management Studio window, execute the following statements:

ALTER MINING STRUCTURE TestStruct
ADD MINING MODEL BikeBuyerClass
(
 CustomerKey,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer PREDICT
) USING Microsoft_Decision_Trees
GO
INSERT INTO BikeBuyerClass
GO

Now, you have a BikeBuyerClass model which predicts the BikeBuyer column. The whole point of the test/training split was to evaluate the accuracy of the mining model against the data in the holdout partition (test data). If you digged into the internals of Analysis Services 2005, you might have noticed that the accuracy stored procedure (Lift chart, classification matrix, scatter plot) were managed stored procedures executing PREDICTION JOIN queries against external data sources and computing the accuracy results.

With SQL Server 2008 (and the test data inside the mining structure), we added some new (system) stored procedure-like calls. Here is how to use the new stored procedures on the newly created model:

CALL SystemGetLiftTable ( BikeBuyerClass, 2, ‘BikeBuyer’, true) // - to get the lift chart for the TRUE state of the BikeBuyer attribute

CALL SystemGetClassificationMatrix( BikeBuyerClass, 2, ‘BikeBuyer’) // - to get the full classification matrix for the BikeBuyer attribute

(there is, of course, a SystemGetScatterTable procedure, for continuous targets)

These new calls deserve a post of their own, so I will not get in much detail here. However, here are enough details on the parameters to make these calls usable:
- The first parameter, BikeBuyerClass, is the mining model name
- The second parameter (2) is a bitmask identifying the data partition(s) to be used in the accuracy stored procedure. Here are some bit values:
 - 1 identifies the training partition
 - 2 identifies the test partition
 - 3 ( bitwise OR combination of 1 and 2) identifies ALL the data in the mining structure (training + test)
- the 3rd parameter, ‘BikeBuyer’, is the name of the predictable attribute to evaluate
- the 4th [optional] parameter of the SystemGetLiftTable call is the target state for the lift chart.

Having the test partition inside the mining structure has another important advantage — the test data is stored in the internal format understood by the mining model. For those of you writing plug-in algorithms, the data is maintained in the
Attribute/Value format used by algorithms. That means that executing predictions does not require mappings between the actual data values and internal representations. Therefore, accuracy measurements on structure test data are much faster than on external data. Both SQL Server Management Studio 2008 and BI Dev Studio 2008 allow you now to get the evaluation charts on top of the structure test data or on top of external data.
Have fun with these new DMX constructs, and come back in a few days for the next post on DMX enhancements in SSDM 2008!

One Response to “New DMX features in 2008 : Test and Training data”

  1. […] between the actions and the results -Feedback, people need to receive continuous feedback …DM(X) New DMX features in 2008 : Test and Training data(there is, of course, a SystemGetScatterTable procedure, for continuous targets) These new calls […]

Discussion Area - Leave a Comment