Querying Rules and Itemsets (like the Data Mining Viewers do)

I will try to continue the series started by Jamie, presenting the other set of queries issued by the Microsoft Association Rules viewer. Recently, a question on these queries appeared on the MSDN Data Mining Forums and the poster raised a very good point: while the stored procedures were intended as internal calls for the built-in viewers, external applications and viewers may want to employ them.

So, here is how the rest of the Association Rules viewer works.

Once the viewer is loaded, the first call is something like:

CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetStatistics(’Customers’)

The single parameter of this stored procedure is the mining model name.  The result is a one-row table containing the following columns:

Column

Sample Value

Comments

MAX_PAGE_SIZE 2000 The maximum server supported page size for fetching rule and itemsets. This parameter ensures the the viewer will not make requests which will make the server go out of memory, details later.
MIN_SUPPORT 89 Minimum actual support for rules detected by the model
MAX_SUPPORT 2439 Maximum actual support for rules detected by the model
MIN_ITEMSET_SIZE 0 Minimum itemset size
MAX_ITEMSET_SIZE 3 Maximum itemset size
MIN_RULE_PROBABILITY 0.401529636711281 minimum actual rule probability
MAX_RULE_PROBABILITY 0.993975903614458    maximum actual rule probability
MIN_RULE_LIFT 0.514182044237125 minimum actual rule importance
MAX_RULE_LIFT 2.13833283242171 maximum actual rule importance

Now, this information is cached by the algorithm during training and saved as a text field in the root content node. It can be retrieved with a simple DMX statement like:

SELECT TOP 1 NODE_DESCRIPTION FROM Customers.CONTENT

The query will return a big string like below:
“Association Rules Model; ITEMSET_COUNT=378; RULE_COUNT=119; MIN_SUPPORT=89; MAX_SUPPORT=2439; MIN_ITEMSET_SIZE=0; MAX_ITEMSET_SIZE=3; MIN_PROBABILITY=0.401529636711281; MAX_PROBABILITY=0.993975903614458; MIN_LIFT=0.514182044237125; MAX_LIFT=2.13833283242171”

 

The stored procedure guarantees that the string information is parsed according to the whatever method the server uses to cache the information. So, for instance, if Microsoft decides to add some new statistics in the description string, you can safely rely on the stored procedure to fill the right columns with the appropriate information.

Once these statistics are retrieved, some pieces of the Rules viewer are already populated:

image

For instance, the Up-Down controls for Minimum probability an minimum importance have a minimum and a maximum value.

 

Extracting the rules

The next viewer call fetches some rules for the rules viewer.

The call looks like below:

CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetRules(’Customers’, 0, 1999, 1, 0.4, 0.51, ”, True)

Here is a list of parameters:

Parameter ordinal

Comment

1 string — The name of the mining model
2 integer — The index of the first rule to be retrieved (0 means start with the first rule)
3 integer — The index of the last rule to be retrieved (1999, together with 0 for the previous parameter, makes a total of 2000 rules, i.e. the rules page size from statistics)
4 integer — The sort order for the rules, more details below
5 double –the minimum probability for returned rules
6 double — the minimum importance for returned rules
7 a string filter to be applied on returned rules (may be empty)
8 a boolean flag (true/false) flag indicating whether the long name or short name of  a rule should be returned – the state of the “Show Long Name” checkbox in the viewer

 

The parameter order is somehow confusing. The stored procedure works like this:

- scans all the rules detected during the model training. These rules can be obtained with a query like 

SELECT * FROM Customers.CONTENT WHERE NODE_TYPE = 8

Such a query will typically take long time to execute and even longer time to interpret the results. More details on the content information for rule will be presented in a future post. If you don’t want to wait, look at page 355 in the “Data Mining with Microsoft SQL Server 2008“ book. Browsing the rules using the server side Adomd.net object model is far more efficient than executing the query, and this is one of the major reasons for the viewers to use the stored procedures.

- for each rule, apply the filters described by the other parameters (minimum probability, minimum importance and, if not empty, the string filter).

- insert the qualified rules (which pass all filters) in a heap data structure for sorting, based on the sort order argument

- once the rules browsing is complete, traverse the heap and return the rules starting with the start index, up to the end index

The sort order parameter is an integer representation of the enumeration below:

public enum AssociationSortOrder
     {
      AscendingProbability   = 0,
      DescendingProbability   = 1,
      AscendingLift     = 2,
      DescendingLift     = 3,
      AscendingSupport    = 4,
      DescendingSupport    = 5,
      AscendingItemsetLength    = 6,
      DescendingItemsetLength   = 7,
      AscendingLexicographical  = 8,
      DescendingLexicographical = 9,
     }

The default sort order is 1 (AscendingProbability). You can change the sort order by clicking one of the rules table headers. The only enumeration options that are valid for rules are 0, 1, 2, 3, 8 and 9 (respectively, ascending and descending order by probability, importance or lexicographically). The other values in the enumeration are reserved for retrieving the itemsets.

The string filter parameter can be any .Net regular expression to be applied to the rule description. In particular, it can also be an item name (which is a valid .Net regular expression). If this filter is not empty, then only rules matching the regular expression will be returned. With a bit of experimentation, you can build rather sophisticated rule filters using the RegEx mechanism.

Now, the result of the stored procedure is a non-normalized representation of lots of information describing the rules. It has the following columns, which have different semantics depending on the table row:

Column

What it may hold

NODE_UNIQUE_NAME  If this is not empty, then the current row is a rule. Otherwise, the current row contains additional information
NODE_CAPTION For rule rows, this is the rule caption
NODE_SUPPORT For the first row in the result, the total number of returned rules
For rule rows, the rule support.
Empty for rows containing additional information
NODE_PROBABILITY For rule rows, the rule probability. Empty otherwise
NODE_LIFT For rule rows, the rule importance. Empty otherwise
NODE_SIZE For rule rows, the size of the itemset defining the rule. An example: for a rule like (A,B)->C, this column will contain 3
ATTRIBUTE_NAME For rule rows, as well as for the first row, this is empty. For the other rows, this contains the name of an attribute that participates in a rule
ATTRIBUTE_VALUE For rule rows, as well as for the first row, this is empty. For the other rows, this contains the value of an attribute that participates in a rule

  Here is a result sample:

NODE_UNIQUE_NAME

NODE_CAPTION

NODE_SUPPORT

NODE_PROBABILITY

NODE_LIFT

NODE_SIZE

ATTRIBUTE_NAME

ATTRIBUTE_VALUE

    119          
1378

Attack of the Clones = Existing, Return of the Jedi = Existing -> Empire Strikes Back = Existing

165

0.993975903614458

1.35796…   

3    
            Attack of the Clones Existing
            Return of the Jedi Existing
            Empire Strikes Back Existing
1485

Alien = Existing –> Aliens = Existing           

103

0.515

1.1578…

2    
            Alien Existing
            Aliens Existing

The first row tells the viewer that 119 rows have been returned in this result set. This row always populates only the NODE_SUPPORT column with the number of returned rules.

The second row is the first actual rule being returned:

“Attack of the Clones = Existing, Return of the Jedi = Existing -> Empire Strikes Back = Existing”

This rule corresponds to the content node with the NODE_UNIQUE_NAME of 1378, has a support of 165, a probability of 0.993… and an importance of 1.35796…. It consists of 3 attributes.

The next 3 rows will contain the attribute names and their values. By convention, the last row in this batch will be the rule result. Therefore, the next 3 rows contain, respectively:

- the two predicates forming the left hand side of the rule : “Attack of the Clones” = “Existing” and “Return of the Jedi” = “Existing”

- the predicate on the right hand side of the rule: “Empire Strikes Back” = “Existing”

 

Row  5 of the result table contains the next rule (from the content node with the ID of 1485) and the next two rows contain the components of this rule. And so on, and so forth.

The decomposition of rule attributes and values allows changing on the fly the rendering:”Show attribute and value”, “Show attribute only” or “Show value only”.

You are probably already familiar with the resulting view:

image

 

 

Extracting the itemsets

The itemset extraction is very similar to the rule extraction.

The stored procedure call looks like:

CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetItemsets(’Customers’, 0, 1999, 5, 0, 89, ”, True)

Here is a list of parameters:

Parameter ordinal

Comment

1 string — The name of the mining model
2 integer — The index of the first itemset to be retrieved (0 means start with the first itemset)
3 integer — The index of the last itemset to be retrieved (1999, together with 0 for the previous parameter, makes a total of 2000 itemsets, i.e. the rules page size from statistics)
4 integer — The sort order for the itemsets, more details below
5 integer –the minimum itemset size
6 integer — the minimum itemset support
7 a string filter to be applied on returned itemsets (may be empty)
8 a boolean flag (true/false) flag indicating whether the long name or short name of  an itemset should be returned – the state of the “Show Long Name” checkbox in the viewer

The sort order parameter is an integer representation of the enumeration described in the rules section,AssociationSortOrder.

The default sort order is 5 (DecendingSupport). You can change the sort order by clicking one of the itemset table headers. The only enumeration options that are valid for rules are 4,5,6,7, 8 and 9 (respectively, ascending and descending order by support, itemset size or lexicographically).

The result of the GetItemset stored procedure is very similar to the result of the GetRules stored procedure.

 

That’s all! If you write a nice viewer using the stored procedures, send us a demo (or at least a snapshot!)

3 Responses to “Querying Rules and Itemsets (like the Data Mining Viewers do)”

  1. Bogdan.
    Excellent. This is exactly the kind of stuff of which I would like to see more.

  2. Can i ask? What is query to get this:

    Attack of the Clones Existing
    Return of the Jedi Existing
    Empire Strikes Back Existing

    I always get’s only first line in NODE_DISTRIBUTION:

    Movies (Star Wars Episode V: Empire Strikes Back) Existing

    Thank you

  3. Got it :)

Discussion Area - Leave a Comment