## Regression accuracy: Excel’s regression vs. the SSDM algorithm

A recent post on the MSDN Forums raised an interesting issue: Excel Data Analysis’s Linear Regression and SSDM were returning different results. Specifically, the SSDM results were much worse.

The issue turned out to be a data modeling issue (columns were not mapped properly).  However, during the investigation I had to compare Excel’s linear regression with the SSDM regression algorithm(s). Thought this might be interesting, so here is one way to compare the results from the two implementations.

I started with some simple (X,Y) data (available for download as a CSV file). First step - run Excel’s Data Analysis regression tool. The results are displayed typically in a separate spreadsheet, and the interesting part  is the Coefficients sections: Therefore, the Excel regression formula is
Y = 2.37486095661847*X -0.310344827586214

Next thing — apply Excel’s regression coefficients to the existing data. I did this by adding a column in my data spreadsheet and populating it with a formula: Next thing, I created a Microsoft Linear Regression mining model on the same data. There is a variety of ways to do this, such as exporting data to a table, connecting directly to the Excel spreadsheet or, the simplest way, by using the Excel add-ins.

To get the model’s predictions in Excel, I used one of the functions exposed by the Excel add-ins, DMPREDICT. If you do not have the add-ins you can always execute a prediction query in SQL Server Management Studio or BI Dev Studio.
However, with the add-ins’s function, getting the prediction results is really easy:

The function syntax:
DMPREDICT(”", “TestLinReg”, “[Y]”, A2, “X”)

- the first parameter (empty string) is the Analysis Services connection to be used. An empty string refers the current connection.
- the second parameter, “TestLinReg”, is the name of the mining model that will predict
- the third parameter, “[Y]”, is the requested predicted entity (predictable column, in this case, but could also be any prediction function)
- the last two parameters define the prediction input: the value in the A2 cell should be mapped as “X”.

The DMPREDICT call resolves to a DMX query like below:
SELECT [Y] FROM TestLinReg NATURAL PREDICTION JOIN
(SELECT [A2] AS X ) AS T
where A2 is replaced with the actual value. With Excel’s auto-fill feature, getting all the predictions is really easy.

Now, with both result sets at hand, we can compare them. I created two absolute error columns, one for the Excel’s regression results and the other for SSDM’s prediction results. The column for SSDM absolute error has a slightly different formula: =ABS(D2-B2) (to use the column populated with DMPREDICT)

The last steps: use the absolute error columns to compute the Mean Absolute Error and Root Mean Squared Error for both result sets:
Again, Excel’s calculation engine does most of the work:  The final results suggest that there is no significant difference between the results returned by the two regression algorithms for this data set: ### 2 Responses to “Regression accuracy: Excel’s regression vs. the SSDM algorithm”

1. […] know from sqlserverdatamining.com and the DM Forums has started his own data mining blog - with a first entry to answer many of the questions around "Excel does this and DM does that".  In this […]

2. Great new look for the site.