If you happen to be at the SQL PASS conference this week in Seattle, you will find us presenting the new version of the Predixion Insight package. We added rich collaborative features and completely new visualizations.
The web site has a new look, and our support forums are now up and running. Feel free to post your questions, whether about Predixion Insight or about data mining in Excel in general.
Predixion Insight now has a splash screen with links to video tutorials, our online documentation, and a rich sample dataset.
One feature that I love in this release is the Data Profiling feature. With a couple of clicks, this feature will compute common statistics for all columns in your PowerPivot or Excel dataset, and render a report allowing you to explore these statistics.
To try it out, go to Insight Analytics \ Explore Data and select Profile Data.
Like all other Predixion Insight tasks, this allows you to select an Excel range or table or a Power Pivot table as your data source for profiling.
The second pane of the wizard lets you select the type of statistics to collect: Basic or Advanced. Advanced is a bit slower (think seconds instead of milliseconds), but so much more useful.
The Data Profiling result is rendered in a new Excel sheet, with a few sections:
The first section, Data Summary, tells you what kinds of columns were detected in your data. It looks like this:
CustomerID is identified as a likely key (a unique row identifier), so statistics will not be collected for this column.
The next section, Continuous Statistics, contains information about numeric columns. Here are the profile columns generated under Advanced analysis:
- # - the ordinal of the column in the source table. Useful if, after sorting by any other measure, one wants to restore the table to the original layout
- Column Name
- “Looks Like” – the result of Predixion Insight’s heuristic analysis of the column. A column may be regarded as Interval (a range of values), Multinomial (a few distinct values, hence categorical or discrete), Binomial (two values), or Constant (a single value)
- Count of Blanks, Minimum, Maximum, Mean, Sample Variance, Standard Deviation, Range, Kurtosis, Skewness, Standard Error of Mean, Approximate Median, Mode, the 95% confidence interval (assuming normal distribution). If you feel rusty about any of these metrics, just click on the column name and your browser will give you a friendly refresher
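For the curious, the continuous metrics above can be sketched in a few lines of pandas. This is purely an illustration of what each metric means, not Predixion Insight's actual implementation; the sample data and the 1.96 z-value (for the 95% interval under a normal assumption) are my own choices.

```python
import pandas as pd

def profile_numeric(s: pd.Series) -> dict:
    """Illustrative 'Continuous Statistics' profile for one numeric column."""
    mean = s.mean()
    sem = s.sem()  # standard error of the mean (ddof=1)
    return {
        "Blanks": int(s.isna().sum()),
        "Minimum": s.min(),
        "Maximum": s.max(),
        "Mean": mean,
        "Sample Variance": s.var(),      # ddof=1 by default in pandas
        "Std Deviation": s.std(),
        "Range": s.max() - s.min(),
        "Kurtosis": s.kurt(),            # excess kurtosis
        "Skewness": s.skew(),
        "Std Error of Mean": sem,
        "Median": s.median(),
        "Mode": s.mode().iloc[0],
        # 95% confidence interval for the mean, assuming a normal distribution
        "CI 95%": (mean - 1.96 * sem, mean + 1.96 * sem),
    }

ages = pd.Series([23, 35, 35, 41, 52, 29, None, 35])
print(profile_numeric(ages))
```

The blank (NaN) value is excluded from every computation except the blank count, which matches how a profiler typically treats missing data.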
In the Discrete Statistics section, Ordinal, Column Name, “Looks Like”, and Count of Blanks appear again, just like in Continuous Statistics. For discrete columns, the number of distinct states is added.
Other columns:
- Top 80% states – number of distinct states covering 80% of the data
- The Top 3 most/least popular values, with their respective counts
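These discrete metrics are also simple to sketch. Again, this is an illustrative stand-in rather than the product's code; the 80% coverage threshold follows the report, while the sample data is invented.

```python
import pandas as pd

def profile_discrete(s: pd.Series, coverage: float = 0.80) -> dict:
    """Illustrative 'Discrete Statistics': distinct states, number of states
    covering 80% of rows, and the most/least popular values with counts."""
    counts = s.value_counts()  # sorted most-frequent first
    cumulative = counts.cumsum() / counts.sum()
    # smallest number of top states whose combined frequency reaches `coverage`
    top_states = int((cumulative < coverage).sum()) + 1
    return {
        "Distinct States": len(counts),
        "Top 80% States": top_states,
        "Most Popular": list(counts.head(3).items()),
        "Least Popular": list(counts.tail(3).items()),
    }

colors = pd.Series(["red"] * 6 + ["blue"] * 2 + ["green", "black"])
print(profile_discrete(colors))
```

In this sample, "red" and "blue" together account for 8 of 10 rows, so 2 states cover 80% of the data.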
The report can be filtered by column name. You may notice that certain columns appear in both the Continuous and the Discrete section of the report. Such columns may have a numeric type but a distribution that suggests they should be regarded as categorical. The Data Profile report should help you decide whether to use Classify or Estimate for such columns. (If in doubt, this may help: no model for Constant, Estimate for Interval, Classify for all others.)
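The rule of thumb above, together with a toy version of the “Looks Like” classification, fits in a small function. The 10-distinct-values cutoff for treating a numeric column as Interval is an assumption for illustration; Predixion Insight's actual heuristic is not documented here.

```python
def looks_like(distinct: int, is_numeric: bool, threshold: int = 10) -> str:
    """Toy 'Looks Like' heuristic: Constant, Binomial, Interval, or Multinomial.
    The `threshold` cutoff is an illustrative assumption."""
    if distinct <= 1:
        return "Constant"
    if distinct == 2:
        return "Binomial"
    if is_numeric and distinct > threshold:
        return "Interval"
    return "Multinomial"

def suggested_task(kind: str) -> str:
    """No model for Constant, Estimate for Interval, Classify for all others."""
    return {"Constant": "No model", "Interval": "Estimate"}.get(kind, "Classify")

print(looks_like(2, False), "->", suggested_task(looks_like(2, False)))    # Binomial -> Classify
print(looks_like(500, True), "->", suggested_task(looks_like(500, True)))  # Interval -> Estimate
```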
For numeric columns, the report also includes correlation and covariance matrices. While not very visible in this dataset, the correlation matrix uses a heat map ranging from 1 (green, perfect positive correlation) to –1 (red, perfect negative correlation); white cells show no correlation (a value of 0). The correlation matrix can be sorted by correlation for each individual column; sorting by Column Index, #, restores the original shape of the matrix.
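The correlation matrix itself is a one-liner in pandas, and the green/white/red heat map is just a mapping over the coefficients. The ±0.5 color thresholds and the sample columns below are my own illustrative choices, not the report's actual cutoffs.

```python
import pandas as pd

def corr_color(r: float) -> str:
    """Map a correlation coefficient to heat-map colors: green for strongly
    positive, red for strongly negative, white near zero. Thresholds are
    illustrative assumptions."""
    if r > 0.5:
        return "green"
    if r < -0.5:
        return "red"
    return "white"

df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 67, 74, 83],
    "errors": [9, 7, 5, 3, 1],   # decreases as height increases
})
corr = df.corr()  # Pearson correlation matrix
print(corr.round(2))
print(corr_color(corr.loc["height", "weight"]))  # strongly positive
print(corr_color(corr.loc["height", "errors"]))  # strongly negative
```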
The rest of this post is just me bragging about how cool Predixion Insight is and how one can have fun with it.
A few years ago, I was working with my adviser on a paper trying to offer biochemists a predictive tool for the bioactivity of HIV-1 protease inhibitors. I am a bit ashamed to admit this, but to this day an “HIV-1 protease inhibitor” is, for me (just like the day I first met one), mainly a vector of 4 numbers in one dataset and 25 numbers in another.
In analyzing those datasets, I used a few different tools, including the Microsoft data mining add-ins. Excel, empowered by the data mining add-ins, was by far the best data playground for me at the time (naturally, no trace of bias here!). What made this experience memorable is that, after working as a developer on those add-ins for a few years, I was using them as a user and not for a test or demo nor on a customer’s dataset.
Today, I played a bit with those datasets and tried to replicate that work using Excel 2010, PowerPivot and Predixion Insight. I tried to use the combo as a real user, not as a developer, and see how I can solve a specific problem.
Here are a few things that marked the experience. I will try to blog about each individual feature in more detail in the upcoming weeks, so for now, here is just an enumeration:
- Asynchronous task execution. I can launch a task (say, a neural network processing task) and then, while the clouds are crunching the 25-dimensional protease inhibitors, I can schedule additional tasks or use Excel. Even better, the processing results are cached in the cloud and available for retrieval when the job completes, from wherever I log in. Naturally, result sets can be retrieved multiple times.
- Power Pivot integration: on one hand, it is nice to know that you can analyze more than 1M rows, although I found sampling pretty useful for that, at least for modeling. More interesting to me is the fact that I can synthesize very rich datasets using Power Pivot. Some transformations are native there (cross-table RELATED calls, aggregations); others are added by Predixion Insight (outlier and missing-value handling, binning of numeric data). All these transformations are wonderfully refreshable and can be neatly applied to new data.
- Visual Macros: these are executable reports generated by Predixion Insight for Excel that make repetitive tasks much easier. In brief, after going through a wizard (for instance, specifying which of 25 input columns should be used in a regression model), a script is generated that can be modified and re-executed. Multiple such scripts can also be chained on the same worksheet and executed as a package. This way, creating 5 regression models on the same 25 columns with different algorithm parameters stops being boring. These macros can be used for queries as well, and I intend to show some Visual Macro-based application worksheets in a future post.
Add the fact that I ran the experiments on a tiny laptop running just Office and you’ll see why the 2010 version of the Excel-based advanced analytics tools is my new preferred data playground. Hence, the point at the beginning of this post. Enjoy!
Technorati Tags: Predixion,Insight,Excel,PowerPivot
Two years ago I spent my summer developing a cloud prototype, and I have been hooked on the potential of predictive analytics in the cloud ever since. Predixion Insight is unlocking that potential. I hope and believe that you will find our product a powerful instrument for analyzing your data, anywhere, anytime. We certainly put a lot of work into making it so.
Some information about the product features is already available on Jamie’s blog. I will follow up with more posts describing some product features in detail (after the public Beta becomes available).
The company aims to bring predictive analytics technology within reach of every information worker. My past professional incarnation taught me that it can be quite a challenge to make statistics and data mining appealing and intuitive. It also taught me that this is exactly what many users expect from an advanced analytics solution, so I jumped at this challenge. (By the way, if you find these kinds of problems interesting and appealing, then check out our job posting at stackoverflow.)
The fact that Jamie MacLennan is the CTO of this company also helped. I worked with Jamie for quite a few years on Microsoft’s SQL Server Data Mining, and I am really looking forward to building more cool products.
So, until Predixion has a product, I will keep posting about SQL Server Data Mining, but also about my trip into the startup world. Meanwhile, if you use SQL Server Data Mining, take this survey and win a copy of Jamie and Bogdan’s book!
The next, very-soon-to-show-up-here post will discuss where I am going and what I will be doing there. Talking about the future, though, is talking about work, which would spoil this weekend.
This blog will come back to life (although it might move to a different platform). The main topic remains Data Mining and predictive analytics, the core technology remains SQL Server Analysis Services.
Regards,
bogdan
PS – The Cloud Table Analysis Tools service completed its prototype mission and has been retired. Thanks a lot to all the users! More in my next post.
So, here is how the rest of the Association Rules viewer works.
Once the viewer is loaded, the first call is something like:
CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetStatistics('Customers')
The single parameter of this stored procedure is the mining model name. The result is a one-row table containing the following columns:
| Column | Sample Value | Comments |
| --- | --- | --- |
| MAX_PAGE_SIZE | 2000 | The maximum page size the server supports for fetching rules and itemsets. This parameter ensures that the viewer will not make requests that would run the server out of memory (details later) |
| MIN_SUPPORT | 89 | Minimum actual support for rules detected by the model |
| MAX_SUPPORT | 2439 | Maximum actual support for rules detected by the model |
| MIN_ITEMSET_SIZE | 0 | Minimum itemset size |
| MAX_ITEMSET_SIZE | 3 | Maximum itemset size |
| MIN_RULE_PROBABILITY | 0.401529636711281 | Minimum actual rule probability |
| MAX_RULE_PROBABILITY | 0.993975903614458 | Maximum actual rule probability |
| MIN_RULE_LIFT | 0.514182044237125 | Minimum actual rule importance |
| MAX_RULE_LIFT | 2.13833283242171 | Maximum actual rule importance |
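The role of MAX_PAGE_SIZE can be illustrated with a generic client-side paging loop: the viewer never asks for more than MAX_PAGE_SIZE rules per request, so the server never has to materialize the full rule set at once. The `fetch_page(start, count)` function below is a hypothetical stand-in for the viewer's actual rule-fetching call, and the simulated data is invented.

```python
def fetch_all(fetch_page, max_page_size: int, total: int) -> list:
    """Retrieve `total` rows in chunks no larger than `max_page_size`.
    `fetch_page(start, count)` is a hypothetical stand-in for the
    viewer's rule-fetching stored procedure call."""
    rows = []
    for start in range(0, total, max_page_size):
        rows.extend(fetch_page(start, min(max_page_size, total - start)))
    return rows

# Simulate a server holding 4500 rules with MAX_PAGE_SIZE = 2000
rules = list(range(4500))
calls = []

def fake_fetch(start, count):
    calls.append((start, count))
    return rules[start:start + count]

result = fetch_all(fake_fetch, 2000, len(rules))
print(len(result), calls)  # all 4500 rules retrieved in 3 requests
```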
- in the browser: keep using http://sqlserverdatamining.com/cloud
- in Excel (add-in): go to the Connections button in the Analyze (in the Cloud) ribbon and type the following URL: https://131.107.181.99/CloudDM/TATServices/
With this change, your Excel add-in will once again use HTTPS to secure the data transmission.
Thanks for your patience!
- For the web interface, use the http://www.sqlserverdatamining.com/cloud URL
- For the Excel add-in, please change the services connection URL. To do that, click the Connections button in the "Analyze (in the Cloud)" ribbon and change the destination URL to
http://131.107.181.101/CloudDM/TATServices/
NOTE: This temporary solution does not support SSL. Your data is transmitted in the clear.
I’ll post an update here as soon as the servers are up again.