How much training data is enough?

This is an actual question I had to answer at one of the previous Tech Ready conferences: how does one know when feeding more data to the mining model will stop improving accuracy? Or: how much training data is enough? With the new accuracy procedures in SQL Server 2008, I think I now have a better answer.

Algorithms learn from data and, generally, feeding them more data will improve their accuracy. This improvement, however, is asymptotic. If the training data is statistically representative of the problem space, then a classifier will stop learning after a certain limit: new data points will support existing patterns and will help reduce the risk of over-training, but will not add more information to the model. The solution I suggest for SQL Server 2008 allows you to figure out whether your mining model is likely to gain from adding new data or is likely to maintain the same accuracy.

The solution is based on the cross-validation mechanism, so I’ll start with a brief description of that mechanism. Cross-validation takes a data set, splits it into a number of roughly equal-size partitions, called folds, then builds a set of mining models (as many as there are folds). Each mining model is built on top of the whole data set minus one of the folds, and the accuracy of that model is then measured on the fold that was left out. Effectively, all the data is used in training and all the data is also held out (each fold by exactly one mining model). Also, each model is validated against data that was not seen during training. The accuracy measures for each model are then compared across all the folds. A good mining solution would provide compact accuracy measurements (i.e. no big differences between the measurements on different folds). Spikes in the measurements may indicate problems with the data. Note that cross-validation does not improve the accuracy of the model; it just provides an “honest” estimate of it, reducing the risk of over-training.
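As a side note, the leave-one-fold-out idea is easy to sketch in a few lines of C#. The sketch below is purely illustrative (it is not the server implementation, and it assigns cases to folds round-robin rather than using the stratified sampling described later); trainModel and evaluate are placeholders for whatever algorithm and accuracy measure are used:

using System;
using System.Collections.Generic;
using System.Linq;

// Conceptual sketch of k-fold cross-validation, not the actual server code.
static class CrossValidationSketch
{
    public static List<double> CrossValidate<T>(IList<T> cases, int folds,
        Func<IList<T>, object> trainModel, Func<object, IList<T>, double> evaluate)
    {
        var scores = new List<double>();
        for (int fold = 0; fold < folds; fold++)
        {
            // Each case goes to exactly one fold (simple round-robin assignment).
            var holdOut  = cases.Where((c, i) => i % folds == fold).ToList();
            var training = cases.Where((c, i) => i % folds != fold).ToList();

            var model = trainModel(training);       // train on everything except this fold
            scores.Add(evaluate(model, holdOut));   // measure accuracy on the held-out fold
        }
        return scores;                              // one accuracy value per fold
    }
}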

In SQL Server 2008, cross validation allows the user to specify how many cases should be used for the procedure. This is mainly for performance purposes — it allows a quick comparison between different mining models/algorithms on the same data set.

Now back to the original problem: would adding new data benefit the accuracy?

The x-validation stored procedure supports multiple syntax flavors. The one that is interesting for my purpose is described below:

CALL SystemGetCrossValidationResults( <structure>, <model>, <NumberOfFolds>, <NumberOfCases>, '<TargetAttribute>')

The first two parameters are evident: the mining structure containing the data and the mining model to be cross-validated. Note that the mining structure must be processed for the procedure to work, but the model does not have to be processed. As I mentioned, the procedure creates a number of models, and the source model is used as a template.

The Number of Folds parameter specifies how many folds should be used in the procedure, while the Number of Cases parameter specifies the size of the data set to be used by the procedure. The Target Attribute parameter is used for:

- ensuring stratified sampling for the folds (the distribution of the Target Attribute’s states is almost the same in all the folds)

- directing the accuracy measurements to be performed on the mining models built by the procedure.

A typical call would set NumberOfFolds to 10 and NumberOfCases to, say, 1000. Such an invocation builds a randomly selected sample of 1000 cases and splits it into 10 folds of 100 cases each. Ten models are built, each trained on 900 cases and evaluated against the remaining 100.

To see if new data would improve the accuracy, I could build models starting with a small training set, then incrementally increase (for example, double) the training set size until the accuracy flattens.

Therefore, here are the steps of the algorithm:

1. Start with an existing mining structure, using a reasonably large data set

I started with a mining structure, MyStruct, built on top of the [v Target Mail] view of the Adventure Works BI sample database.

2. Define a mining model in the structure.

I created a simple model, ClassifyBikeBuyer, predicting Bike Buyer and also English Occupation. I used Microsoft Decision Trees with all the wizard’s default suggestions (but I changed Bike Buyer to a discrete Boolean).

3. Start with a rather small training set size (say, 10 — it is highly unlikely that 10 is a good training set size, but it makes the charts look good :-) )

4. Execute the cross validation procedure using 2 as the number of folds and 2*TrainingSetSize as the number of cases. This step is the interesting part of the algorithm: I have to use at least 2 folds, or the procedure would fail. On the other hand, I do not want to use more than 2 folds: with the data split in half, each model is trained on exactly TrainingSetSize cases and evaluated on a held-out fold of the same size, so the accuracy measurements are not biased.

CALL SystemGetCrossValidationResults( MyStruct, ClassifyBikeBuyer, 2, 20, 'Bike Buyer')

5. Record the accuracy results for the current training size. The stored procedure result contains multiple accuracy measurements for each model built by the procedure. As the number of folds argument was 2, there are 2 sets of results.

[Image: XValRes, the cross-validation result set for the 2-fold run]

Now, there are a few things to be recorded in the result set:

- first, the values of the accuracy measurements for each fold. I used a single measure, the Likelihood Log Score, a value that, by itself, does not depend directly on the training or test set size (the Pass/Fail classification scores do depend on it)

- then, the partition size for each fold. This is very important, because it provides the termination condition for the algorithm (I do not know ahead of time how much data is in my mining structure). The folds are almost the same size (the stratified sampling mechanism may lead to minor differences). If at least one fold has a size greater than or equal to my training set size, the procedure had enough data (see more details on Partition Size at point 6).

To simplify the results for this execution (training set size = 10 cases), I stored the μ (average) and σ (standard deviation) of the Likelihood Log Score measurement across the folds.

6. Double the training set size and repeat from step 4 until there is not enough data left.

Note that the Partition Size in the result is not the size of the training data for the model indicated by Partition Index; it is the size of the fold that was held out. Every row marked with Partition Index = 1 describes the model trained with partition 1 left out (and, therefore, evaluated on partition 1).

The procedure above will stop when step 4 asks for a data set size which exceeds the data available in the mining structure.
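For the record, here is a rough client-side sketch of this loop, using ADOMD.NET to execute the cross-validation calls. It only illustrates the algorithm above and is not the stored procedure I ended up writing: the connection string is hypothetical, and the column names used when reading the result set ("PartitionSize", "Measure", "Value") are assumptions that should be checked against the actual output of SystemGetCrossValidationResults.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.AnalysisServices.AdomdClient;

// Client-side sketch of the doubling loop (steps 3 through 6).
class TrainingSizeProbe
{
    static void Main()
    {
        // Hypothetical connection string; point it at the database hosting MyStruct.
        using (var conn = new AdomdConnection("Data Source=localhost;Catalog=Adventure Works DW 2008"))
        {
            conn.Open();
            int trainingSize = 10;                       // step 3: start small
            bool enoughData = true;

            while (enoughData)
            {
                // Step 4: 2 folds, 2 * trainingSize cases
                string dmx = string.Format(
                    "CALL SystemGetCrossValidationResults(MyStruct, ClassifyBikeBuyer, 2, {0}, 'Bike Buyer')",
                    2 * trainingSize);

                var scores = new List<double>();
                long minFoldSize = long.MaxValue;

                using (var cmd = new AdomdCommand(dmx, conn))
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Column names are assumptions based on the result set shown above.
                        long foldSize = Convert.ToInt64(reader["PartitionSize"]);
                        minFoldSize = Math.Min(minFoldSize, foldSize);

                        if (Convert.ToString(reader["Measure"]) == "Log Score")
                            scores.Add(Convert.ToDouble(reader["Value"]));
                    }
                }

                if (scores.Count == 0)
                    break;                               // no matching measure rows; stop

                // Step 5: record the mean and standard deviation across the folds.
                double mean = scores.Average();
                double stdDev = Math.Sqrt(scores.Average(s => (s - mean) * (s - mean)));
                Console.WriteLine("{0}\t{1:F4}\t{2:F4}", trainingSize, mean, stdDev);

                // Step 6: double and repeat while the folds are still as large as requested.
                enoughData = minFoldSize >= trainingSize;
                trainingSize *= 2;
            }
        }
    }
}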

My mining structure has around 13k cases (the view contains 18k and I preserved 30% as holdout). Therefore, the procedure worked correctly for data set sizes of 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 and 10240 cases (so it stopped with a training size of 5120 cases; remember, the data set is split into 2 folds!). When asking for 20480 cases, the procedure only found enough data to create 2 partitions of 6470 cases each. To evaluate the accuracy of a model trained on (almost) the whole data set in the structure, I added one more call:

CALL SystemGetCrossValidationResults( MyStruct, ClassifyBikeBuyer, 10, 0, 'Bike Buyer')

Effectively, I asked the procedure to perform 10-fold cross-validation, so I could evaluate the accuracy of a model trained with 90% of the data in the structure. Not exactly the whole data set, but close enough!

I repeated the same steps for the ‘English Occupation’ target and plotted the results in Excel. The chart below presents the evolution of the accuracy (Likelihood Log Score) with the size of the training set; the error bar associated with each data point is the standard deviation of the Likelihood Log Score among the folds of that size.

[Image: Likelihood Log Score versus training set size, with standard-deviation error bars, for the Bike Buyer and English Occupation targets]

So, what does the chart tell us?

The Likelihood Log Score is a measure that cannot exceed 0. The closer it is to 0, the better the accuracy of the model. The chart above suggests that:

- my structure data is pretty good for predicting Bike Buyer. The accuracy grows minimally after 1280 training cases

- however, the data does not seem to be enough to predict English Occupation. As you can see, the slope is still pretty steep on the last segment of the English Occupation line

The algorithm seems interesting enough to apply again later, so I encapsulated it in a small C# Analysis Services stored procedure.

All the code is available in this file: AccuracyEx.cs. To use it, you will need to download the file, create a new C# class library project, add a reference to the Microsoft.AnalysisServices.AdomdServer library, and include the file in the project.

Once you build and deploy the stored procedure (Safe permission set, impersonate current user), you can invoke it with a call like the one below:

CALL AccuracyEx.EvaluateModelingData('ClassifyBikeBuyer', 'English Occupation', 'Log Score')

- The first parameter is the mining model (which does not need to be processed, but must belong to a processed structure).

- The second parameter is the target predictable attribute.

- The last parameter is the measure to be used in the result. If an empty string is passed as the argument, all the measures are returned.
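If you want to drive the procedure from client code rather than from a query window, a minimal ADOMD.NET call might look like the sketch below. The connection string is hypothetical, and the code simply dumps whatever columns the procedure returns (here with an empty measure string, so all measures come back):

using System;
using Microsoft.AnalysisServices.AdomdClient;

class RunEvaluateModelingData
{
    static void Main()
    {
        // Hypothetical connection string; point it at the database hosting the model.
        using (var conn = new AdomdConnection("Data Source=localhost;Catalog=Adventure Works DW 2008"))
        {
            conn.Open();
            using (var cmd = new AdomdCommand(
                "CALL AccuracyEx.EvaluateModelingData('ClassifyBikeBuyer', 'English Occupation', '')", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Print every column of every row; the layout is whatever the
                    // stored procedure returns (training set size plus the selected measures).
                    for (int i = 0; i < reader.FieldCount; i++)
                        Console.Write("{0}={1}\t", reader.GetName(i), reader.GetValue(i));
                    Console.WriteLine();
                }
            }
        }
    }
}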

Here is the output I used in the chart above:

[Image: output of EvaluateModelingData used for the chart above]

The procedure can be easily modified to work for multiple predictable targets (no input target attribute) and even for Clustering models.
