
How much training data is enough?

This is an actual question I had to answer at one of the previous Tech Ready conferences: when does one know that feeding more data to the mining model will stop improving its accuracy? Or: how much training data is enough? With the new accuracy procedures in SQL Server 2008, I think I now have a better answer.

Algorithms learn from data and, in general, feeding them more data improves their accuracy. This improvement, however, is asymptotic. If the training data is statistically representative of the problem space, then a classifier will stop learning after a certain point: new data points will support existing patterns and will help reduce the risk of over-training, but will not add more information to the model. The solution I suggest for SQL Server 2008 allows you to figure out whether your mining model is likely to gain from additional data or likely to maintain the same accuracy.

The solution is based on the cross-validation mechanism, so I’ll start with a short presentation of that mechanism. Cross validation takes a data set, splits it into a number of partitions of roughly equal size, called folds, then builds a set of mining models (as many as there are folds). Each mining model is built on top of the whole data set minus one of the folds, and the accuracy of that model is then measured on the fold that was left out. Effectively, all data is used in training and all data is left out (by at least one mining model). Also, each model is validated against data it did not see during training. The accuracy measures for each model are then evaluated across all the folds. A good mining solution provides compact accuracy measurements (i.e., no big differences between the measurements on different folds). Spikes in the measurements may indicate problems with the data. Note that cross validation does not improve the accuracy of the model; it just provides an “honest” estimate of it, reducing the risk of over-training.

In SQL Server 2008, cross validation allows the user to specify how many cases should be used for the procedure. This is mainly for performance purposes — it allows a quick comparison between different mining models/algorithms on the same data set.
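As a sketch, this is roughly how the procedure is invoked from DMX. The structure, model, and target names here are hypothetical placeholders:

```sql
// Hypothetical structure/model/target names; 10 folds, at most 5000 cases.
CALL SystemGetCrossValidationResults(
    [Customers],        // mining structure holding the training data
    [Customer Trees],   // mining model(s) built on that structure
    10,                 // fold count
    5000,               // max cases to use (0 means use all cases)
    'Bike Buyer'        // target attribute to measure accuracy for
)
```

The result set contains one row per model, fold, and accuracy measure, which makes it easy to spot how much the measurements vary across folds.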

Now back to the original problem: would adding new data benefit the accuracy?
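Without giving away the details below, the max-cases parameter suggests one way to probe the question. This is my own hedged sketch (same hypothetical object names as above): run the same cross-validation procedure while increasing the cases limit and watch whether the accuracy measures keep improving:

```sql
// Same hypothetical structure/model; only the max-cases argument varies.
CALL SystemGetCrossValidationResults([Customers], [Customer Trees], 10, 2500,  'Bike Buyer')
CALL SystemGetCrossValidationResults([Customers], [Customer Trees], 10, 5000,  'Bike Buyer')
CALL SystemGetCrossValidationResults([Customers], [Customer Trees], 10, 10000, 'Bike Buyer')
// If the measures stop improving between runs, feeding the model more
// training data is unlikely to improve its accuracy.
```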


DMX in 2008: Model Filtering (Part 2 of 2)

This second post on model filtering deals with nested table filters. The most common use of nested tables in SQL Server Data Mining is to model transaction tables. That is, model Customer entities together with the “bag” of products they purchased.

Nested table filters primarily serve two purposes:

  • Use (in the model) only those nested table rows with a certain property (e.g., consider only products other than “Coca Cola”, because most people do buy “Coca Cola”, so it is not that interesting)
  • Use (in the model) only those cases where the associated nested tables have certain properties (e.g., build a prediction model on customers that have purchased “Coca Cola”)

Notice the difference between the filters: while the first filter removes products from shopping baskets before modeling, the second removes customers from the model (those that have not purchased Coca Cola).
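As a hedged sketch of what the two filters might look like in DMX (the structure, model, and column names are hypothetical), the first applies a filter to the nested table column, while the second applies a case-level filter with an EXISTS subquery over the nested table:

```sql
// Nested-row filter: drop "Coca Cola" rows from each shopping basket.
ALTER MINING STRUCTURE [Customers]
ADD MINING MODEL [Baskets Without Coke]
(
    [Customer Id],
    [Age],
    [Products] PREDICT
    (
        [Product]
    ) WITH FILTER([Product] <> 'Coca Cola')
) USING Microsoft_Association_Rules

// Case filter: keep only customers whose basket contains "Coca Cola".
ALTER MINING STRUCTURE [Customers]
ADD MINING MODEL [Coke Buyers]
(
    [Customer Id],
    [Age] PREDICT,
    [Products]
    (
        [Product]
    )
) USING Microsoft_Decision_Trees
WITH FILTER(EXISTS (SELECT * FROM [Products] WHERE [Product] = 'Coca Cola'))
```

Note where each WITH FILTER clause sits: attached to the nested table column in the first model, and attached to the model as a whole in the second.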
