Microsoft Neural Network — Step-by-step Predictions

A recent post on the MSDN Forums raises the issue of reconstructing the Microsoft Neural Network and reproducing its prediction behavior based on the model content. The forum format did not allow a very detailed reply and, as I believe this is an interesting topic, I will give it another try in this post. As an example, I use a small neural network trained on two inputs, X (with states A and B) and Y (with states C and D), to predict a discrete variable Z with states E and F.

This post is a bit large, so here is what you will find:

  • a description of the topology of the Microsoft Neural Network, particularized for my example
  • a step-by-step description of the prediction phases
  • a set of DMX statements that exemplify how to extract network properties from the content of a Microsoft Neural Network mining model
  • a spreadsheet containing the sample data I used, as well as a sheet that contains the model content and uses cell formulae to reproduce the calculations that lead to predictions (the sheet can be used to execute predictions for this network)

The network topology

Once trained, a Microsoft Neural Network model looks more or less like below:

[Image: the topology of a trained Microsoft Neural Network, with input, hidden and output layers]

During prediction, the input nodes are populated with values derived from the input data. These values are linearly combined, using the weights of the edges leading from the input nodes to the middle (hidden) layer of nodes, so the input vector is translated to a new vector with the same dimension as the hidden layer. The translated vector is then “activated” by applying the tanh function to each component. The resulting vector goes through a similar transformation, this time from the hidden layer to the output layer: it is linearly converted to the output layer dimensionality (using the weights of the edges linking hidden nodes to output nodes). The result of this transformation is activated using the sigmoid function (softmax, in the multinomial case detailed in Phase 3 below) and the final result is the set of output probabilities. These probabilities are normalized before being returned as a result (in a call like PredictHistogram).

The input layer contains one node for each state of each input attribute: more precisely, one node for each state that appears in the training set with a probability strictly greater than 0.0 and strictly lower than 1.0. For example, a discrete or discretized input X with states A and B has at most 3 associated input nodes: one for A, one for B and, if the data contains NULLs, one for the Missing state. A continuous input has at most 2 associated input nodes: one for Missing (if it appears) and one for Existing (the non-null state).

The output layer contains one node for each state of each output attribute, following the same rules as the input layer.

The number of nodes in the hidden layer depends on the number of input and output nodes, as well as on parameters set by the user. In our example, X (A, B) and Y (C, D) lead to 4 input nodes, while Z (E, F) leads to 2 output nodes. The number of nodes in the hidden layer is computed as below:

#Hidden = HIDDEN_NODE_RATIO * sqrt(#InputNodes*#OutputNodes)

The default HIDDEN_NODE_RATIO value is 4.0, therefore: 4 * sqrt(4*2) = 4 * sqrt(8) ≈ 11.31, which yields the 11 hidden nodes in my example network.
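As a quick sanity check, here is the computation in Python (a sketch; the truncation to an integer is my assumption, consistent with the 11 hidden nodes observed in the model content):

import math

HIDDEN_NODE_RATIO = 4.0                 # default value of the algorithm parameter
input_nodes, output_nodes = 4, 2        # X(A,B) + Y(C,D) inputs, Z(E,F) outputs
hidden = int(HIDDEN_NODE_RATIO * math.sqrt(input_nodes * output_nodes))
print(hidden)                           # 4 * sqrt(8) = 11.31... -> 11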
 

Prediction phase 1: Map the Input

I’ll go through the calculations that are involved when executing a DMX statement like:

SELECT PredictHistogram(Z) FROM DemoModel NATURAL PREDICTION JOIN
(SELECT 'B' AS X, 'D' AS Y) AS T

First, a vector is prepared to hold the input layer of the network. The vector has as many values as there are nodes in the input layer, and the values depend on the prediction input. The input nodes are associated, respectively, with (X=A, X=B, Y=C, Y=D). Each of the corresponding values is initialized with a score which depends on whether the attribute value indicated by the node is present (ScoreON) or absent (ScoreOFF). The scores for a state depend on the attribute type. In general, z-score normalization is used (i.e. (val - Mu) / StdDev), where Mu and StdDev are observed over the training set. The scores are computed as below:

ScoreON = (1 - Mu) / StdDev

ScoreOFF = (0 - Mu) / StdDev = -Mu / StdDev

For discrete/discretized variables:

Mu = p, the prior probability of the state

StdDev = sqrt(p * (1 - p))

Here is the distribution of input attributes in my training set:

Total Training Cases = 252
p(X=Missing) = 0 (0/252)
p(X=A) = 0.6944 (175/252)
p(X=B) = 0.30555 (77/252)
p(Y=Missing) = 0 (0/252)
p(Y=C) = 0.4801 (121/252)
p(Y=D) = 0.5198 (131/252)
p(Z=Missing) = 0 (0/252)
p(Z=E) = 0.5238 (132/252)
p(Z=F) = 0.4761 (120/252)

The distribution of all the attributes in the model can be collected from the model content, by querying the marginal distribution node (NODE_TYPE=24) like below:

// Q1: Training set distributions
SELECT FLATTENED
(SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE, [SUPPORT], [PROBABILITY] FROM NODE_DISTRIBUTION)
FROM DemoModel.CONTENT WHERE NODE_TYPE=24

Note that this returns the distribution for all attributes, input or predictable. With the formulae above and the distributions from the content, the scores for the input become:

Input Node    ScoreON        ScoreOFF
X=A           0.663324958    -1.507556723
X=B           1.507556723    -0.663324958
Y=C           1.040502104    -0.961074462
Y=D           0.961074462    -1.040502104

The input vector for the case (X=B, Y=D) now looks like:
(ScoreOff(X=A), ScoreOn(X=B), ScoreOff(Y=C), ScoreOn(Y=D))
therefore:
(-1.507556723, 1.507556723, -0.961074462, 0.961074462)
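The same scores and input vector can be reproduced in Python (a small sketch, not the server code, using the training set priors listed above):

import math

def scores(p):
    """Return (ScoreON, ScoreOFF) for a discrete state with prior probability p."""
    std_dev = math.sqrt(p * (1.0 - p))
    return (1.0 - p) / std_dev, -p / std_dev

# The case (X=B, Y=D): the nodes for X=A and Y=C are OFF, those for X=B and Y=D are ON
input_vector = [scores(175 / 252)[1],   # ScoreOFF(X=A) = -1.5075...
                scores(77 / 252)[0],    # ScoreON(X=B)  =  1.5075...
                scores(121 / 252)[1],   # ScoreOFF(Y=C) = -0.9610...
                scores(131 / 252)[0]]   # ScoreON(Y=D)  =  0.9610...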

Each input node also has a unique identifier, which is used in the hidden layer to map coefficients. The identifiers of the input nodes can be extracted from a model with a DMX query like below:

// Q2: Unique identifiers for input level nodes
SELECT NODE_UNIQUE_NAME, NODE_DESCRIPTION FROM DemoModel.CONTENT WHERE NODE_TYPE=21

These identifiers are used in the next phase.

Phase 2: The Hidden Layer

The hidden layer consists, as mentioned above, of 11 nodes. Each of these nodes has a set of coefficients to be applied to each input node as well as an intercept.
Therefore, each hidden node can be described by a vector:
(Hi0, Hi1, Hi2, Hi3, Hi4)

where:
- i is the index of the hidden node, 0-10
- Hij is the coefficient to be applied to input node j (0-3)
- Hi4 is the intercept
A DMX statement like below returns the hidden node coefficients and their mappings to the input layer:
 

// Q3: Hidden layer nodes and coefficients
SELECT FLATTENED NODE_UNIQUE_NAME,
(SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE FROM NODE_DISTRIBUTION)
FROM DemoModel.CONTENT WHERE NODE_TYPE=22

In the result of this statement:

- The NODE_UNIQUE_NAME column identifies the hidden node
- The ATTRIBUTE_NAME identifies the input layer node (it is the node unique name returned by Q2)
- The ATTRIBUTE_VALUE is the coefficient
- The coefficient associated with an empty ATTRIBUTE_NAME is the intercept (Hi4 in the vector definition)
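How you execute Q3 is up to you (ADOMD.NET, OLE DB, etc.). Assuming its rows have been fetched as (node_unique_name, attribute_name, value) tuples, here is a minimal Python sketch that assembles the coefficient vectors; all names and the row shape are my own, for illustration:

def build_coefficients(rows, input_order):
    """rows: (node_unique_name, attribute_name, value) tuples from Q3;
    input_order: the input node unique names from Q2, in input-vector order.
    An empty attribute_name marks the intercept."""
    per_node = {}
    for node, attr, value in rows:
        per_node.setdefault(node, {})[attr] = value
    # one vector per hidden node: input coefficients in order, intercept last
    return [[c.get(name, 0.0) for name in input_order] + [c.get('', 0.0)]
            for c in per_node.values()]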

The first step in applying a hidden node consists of computing the linear function of the input, for each hidden node:
H(i) = Hi0*Input0 + Hi1*Input1 + Hi2*Input2 + Hi3*Input3 + Hi4

The second step is activation of the hidden node, performed with tanh:
expA = exp(H(i))
H(i) = (expA - 1/expA) / (expA + 1/expA)
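In Python, both steps can be sketched as below, assuming hidden_coeffs is the 11 x 5 matrix assembled from the Q3 results (4 input coefficients Hi0..Hi3 plus the intercept Hi4 per hidden node):

import math

def hidden_layer(input_vector, hidden_coeffs):
    activations = []
    for coeffs in hidden_coeffs:
        # step 1: linear combination of the inputs, plus the intercept
        h = sum(w * x for w, x in zip(coeffs[:-1], input_vector)) + coeffs[-1]
        # step 2: tanh activation (equivalent to the expA formula above)
        activations.append(math.tanh(h))
    return activations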

Phase 3: Apply the Output Layer Coefficients

For my target attribute, Z (E, F), the output layer consists of 2 nodes. Each of these has a set of coefficients to be applied to the hidden layer nodes. Therefore, each output node can be described by a vector:
(Oi0, Oi1, Oi2, … Oi10, Oi11)

where:
- i is the index of the output node, 0-1 (two output nodes)
- Oij is the coefficient to be applied to hidden node j (0-10)
- Oi11 is the intercept
Here is the DMX statement which extracts the coefficients of the output nodes:

// Q4: Output layer nodes and coefficients
SELECT FLATTENED NODE_UNIQUE_NAME, NODE_DESCRIPTION,
(SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE FROM NODE_DISTRIBUTION WHERE VALUETYPE=7)
FROM DemoModel.CONTENT WHERE NODE_TYPE=23

- The NODE_DESCRIPTION column contains the natural space description of the output attribute value represented by the node
- The ATTRIBUTE_NAME identifies the hidden layer node (it is the node unique name returned by Q3)
- The ATTRIBUTE_VALUE is the coefficient for that respective node

Once more, the coefficient associated with an empty ATTRIBUTE_NAME is the intercept (Oi11 above). The restriction (VALUETYPE=7) filters out some other node information which is included for viewer purposes and is not relevant to the computations.

The first step consists of computing a linear function:
O(i) = Oi0*H0 + Oi1*H1 + … + Oi10*H10 + Oi11

The second step is the activation of the output nodes which, for multinomial targets (remember that the target has 3 states, including Missing), is the softmax function:
O(i) = exp(O(i)) / sum_j(exp(O(j))), where j ranges over all the output nodes of the target attribute

Now, O[] contains all the output results.
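A matching Python sketch for the output layer, assuming output_coeffs holds one row per output node (the 11 hidden-node coefficients Oi0..Oi10 plus the intercept Oi11):

import math

def output_layer(hidden, output_coeffs):
    # step 1: linear combination of the hidden activations, plus the intercept
    raw = [sum(w * h for w, h in zip(coeffs[:-1], hidden)) + coeffs[-1]
           for coeffs in output_coeffs]
    # step 2: softmax activation over the output nodes
    total = sum(math.exp(o) for o in raw)
    return [math.exp(o) / total for o in raw]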
 

Phase 4: Normalization

Normalization is performed under the assumption that a state that was constant in the training set (either always On or always Off, such as Missing) is not actually a certainty. The normalization for the actual states (those computed from the output nodes) is done by multiplying with a normalization factor. If the Missing state is the only constant state, it takes the rest of the probability.

The normalization factor is:

NormalizationFactor = (TrainingCases + NumStates - ConstantStates) / (TrainingCases + NumStates)

In our example, the default value of the HOLDOUT_SIZE parameter (0.7) determines the number of training cases actually used by the network: 177 (70% of 252). Consequently:
- TrainingCases = 177 (70% of the actual data)
- NumStates = 3 (Missing, E and F)
- ConstantStates = 1 (the Missing state)
Therefore, the normalization factor is (177 + 3 - 1) / (177 + 3) = 0.994444444.

The output probabilities are computed for the output states as below:

P(E) = O(0) * NormalizationFactor // first output node
P(F) = O(1) * NormalizationFactor // second output node
P(Missing) = (TrainingP(Missing) * TrainingCases + 1) / (TrainingCases + NumStates) = 1 / (177 + 3)

(TrainingP(Missing), the prior probability of Missing over the training set, is 0 here, so P(Missing) is the complement of P(E) and P(F).)
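The same arithmetic in Python (o0 and o1 stand for the two softmax outputs from Phase 3 and are placeholders here; they always sum to 1):

training_cases, num_states, constant_states = 177, 3, 1

norm = (training_cases + num_states - constant_states) / (training_cases + num_states)
# norm = 179 / 180 = 0.99444...

o0, o1 = 0.8, 0.2                               # placeholder softmax outputs
p_e = o0 * norm                                 # P(Z=E), first output node
p_f = o1 * norm                                 # P(Z=F), second output node
p_missing = 1 / (training_cases + num_states)   # 1/180, since the prior P(Missing) is 0
assert abs(p_e + p_f + p_missing - 1.0) < 1e-9  # the probabilities sum to 1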

 

A spreadsheet with all the calculations

An Excel 2007 workbook is available here (nnetdemo.xlsx) for download. Update: for some reason, it shows up as a ZIP file when you try to download it — make sure you save it with the XLSX Excel 2007 extension!

The workbook contains two spreadsheets. One, named Source Data, contains the data set I used to build the model above. If you use the Data Mining Add-ins for Excel, you can build the model with the Advanced \ Create Mining Model task on the Data Mining Client ribbon. Just select Microsoft Neural Network, mark X and Y as Inputs and Z as Predict Only, and leave the parameters at their default values.

The second spreadsheet, named NNet Demo, contains all the network edges extracted from the model with the DMX statements mentioned previously, as well as the prior probabilities of all the attributes. It uses a set of cell formulae to reproduce the calculations that happen inside the network. At the top of the page there are two distinct sections: one for entering your input, and a second containing the prediction results.

[Image: the input and prediction results sections at the top of the NNet Demo sheet]

Just click on the light blue cells (they may have a different color, depending on your color scheme). Use the cells under the X and Y labels to select your inputs and notice how the prediction results (the probabilities for the different states of Z) change.

You can get a nice graph of the calculations, which closely matches the network’s topology, if you switch to the Formulas ribbon in Excel and, with the Prediction Results selected, click the Trace Precedents button multiple times (the page size is really small below, don’t try to read the actual values :-) )

[Image: the Trace Precedents graph of the spreadsheet calculations]

6 Responses to “Microsoft Neural Network — Step-by-step Predictions”

  1. Thank you. This was very helpful. I’m still not sure how the coefficients for the linear functions (the weights on the arcs) are calculated, but it’s not important. I understand all of the other calculations and understand the general idea. I appreciate the time it took to do this.

  2. Very good article. I have converted this information into a SQL script to extract my NNet trained model into a formula for use in code. The problem I have run into is the use of non-binary (on or off) discrete inputs and outputs.

    My other point of confusion is you never predict a probability for a specific input case. I want to convert my trained model into code to use specific inputs to output a probability of the output case.

    Can you point me to anything to solve this issue? Thanks.

  3. Well, I should have reread my post more carefully before submitting.

    Correction to my Post: Output is discrete with 5 possible states. My inputs are continuous! They are not discrete.

  4. DS — “I want to convert my trained model into code to use specific inputs to output a probability of the output case.”
    Could you please provide a few more details? I am not sure I understand what you are trying to do.
    Also, I’d recommend posting any questions to http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=81&SiteID=1 . You will probably get a quicker and sometimes more complete answer

  5. […] Applying logistic model coefficients Does this help? http://www.bogdancrivat.net/dm/archives/36 — – — This posting is provided "AS IS" with no warranties, and confers no rights. […]

  6. Hi

    I am struggling to verify the formula for logistic regression. I have read the article above and SQL manuals, but always seem to be out by approximately 1% to 2% in the probability.

    Would it be possible to explain how the formula above simplifies for logistic regression?

    Regards
    Carl
