The origin of the neural network can be traced to the 1940s, when two researchers, Warren McCulloch and Walter Pitts, tried to build a model to simulate how biological neurons work. Though the focus of this research was on the anatomy of the brain, it turns out that this model introduced a new approach for solving technical problems outside neurobiology.
During the 1960s and 1970s, with the advance of computer technology, researchers implemented prototypes of models based on the work of McCulloch and Pitts. In 1982, John Hopfield's work on recurrent networks renewed interest in the field, and in 1986 David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized backpropagation, a method to adjust the weights of a neural network in the backward direction based on the learning error, as is explained later in this chapter.
Since the 1980s, the theories of neural networks have matured, and the computing power of modern computers has enabled the processing of large neural networks within a reasonable time frame. Neural network technologies are applied to more and more commercial applications, for example, voice and handwriting recognition, fraud detection of credit card transactions, and customer churn analysis.
Neural networks mainly address the classification and regression tasks of data mining. Like decision trees, neural networks can find nonlinear relationships between input attributes and predictable attributes. Neural networks, however, find smooth rather than discontinuous nonlinearities. On the negative side, it usually takes longer to learn to use a neural network than it does to use decision trees and Naïve Bayes. Another drawback of neural networks is the difficulty in interpreting results. A neural network model contains no more than a set of weights for the network. It is difficult to see the relationships in the model and why they are valid.
Neural networks support discrete and continuous outputs. When the outputs are continuous, the task is regression. In fact, classic regression techniques, such as logistic regression, can be represented as special cases of neural networks. Although typically used for classification and regression, feed-forward neural networks can also be applied to segmentation, when used with a bottleneck configuration (small hidden layer).
What Is a Neural Network?
Neural networks are more sophisticated than decision trees and Naïve Bayes. Figure (Example of neural network) displays a couple of examples. A neural network contains a set of nodes (neurons) and edges that form a network. There are three types of nodes: input, hidden, and output. Each edge links two nodes with an associated weight. The direction of an edge represents the data flow during the prediction process. Each node is a unit of processing. Input nodes form the first layer of the network. In most neural networks, each input node is mapped to one input attribute such as age, gender, or income. The original value of an input attribute needs to be converted to a floating-point number on a common scale (often between –1 and 1) before processing.
Hidden nodes are the nodes in the intermediate layers. A hidden node receives input from nodes in the input layer or the preceding hidden layer. It combines all the inputs based on the weights of the associated edges, performs some calculations, and emits the resulting value to the following layer.
Output nodes usually represent the predictable attributes. A neural network may have multiple output attributes, as displayed in Figure (Example of neural network). It is possible to separate the output nodes into several different networks. In most cases, however, combining them reduces the processing time, because the networks can share the common cost of scanning the source data. The result of an output node is often a floating-point number between 0 and 1.
Prediction with a neural network is straightforward. The attribute values of an input case are normalized and mapped to the neurons of the input layer. Then each hidden layer node processes its inputs and triggers an output for the layers that follow. At the end, the output neurons process their inputs and generate an output value. This value is then mapped back to the original scale (for a continuous attribute) or to the original category (for a discrete attribute). While training a neural network is time-consuming, making predictions against a trained neural network is rather efficient.
As displayed in Figure (Example of neural network), the topologies of neural networks may vary. The first example is a very simple network. It has one output attribute and no hidden layer. All the input neurons connect to the output neuron directly. Such a neural network is equivalent to logistic regression. The second example is a network with three layers: input, hidden, and output. There are three neurons in the hidden layer. Each neuron of the hidden layer is connected to every neuron of the preceding layer. The hidden layer is a very important aspect of a neural network. It enables the network to learn nonlinear relationships.
Non-feed-forward networks have directed cycles in their topology or “architecture.” That is, while following the direction of edges in a neural network, you can return to the same node. The Microsoft Neural Network is a feed-forward network.
After the topology of a neural network is configured, that is, after the number of hidden nodes is specified, the training process involves finding the best set of weights for the edges in the network. This is a time-consuming task. Initially, the weights are randomly assigned. During each training iteration, the network processes the training cases to generate predictions on the output layer based on the current weights. It then calculates the error for the outputs. Based on these errors, it adjusts the weights of the network using backpropagation. We will go over the details of the neural network learning process in the following sections.
Example of neural network


Combination and Activation
Each neuron in the neural network is a basic processing unit. A neuron has a number of inputs and one output. It combines all the input values (combination), does certain calculations, and then triggers an output value (activation). The process is very similar to the biological neuron.
Figure (A basic processing unit) displays the structure of a neuron. It contains two functions: a combination of inputs and a calculation of outputs. The combination function combines the input values into a single value. There are different ways to combine inputs. The most popular method is the weighted sum, meaning that each input value is multiplied by its associated weight and the products are added together. Other combination functions include the mean, max, logical OR, and logical AND of the input values. The Microsoft Neural Network uses the weighted sum approach. The output of the combination function is then passed through the activation function.
Similar to the way that a biological neuron works, when using the activation function, small changes of the input value sometimes trigger large output changes, and sometimes large changes of the input value have an insignificant impact on the output. In particular, the output is sensitive to the input only when the input is in its midrange. This property enhances the neural network’s ability to learn because it introduces nonlinearity into the network. Several math functions satisfy this property. The best-known functions are sigmoid (logistic) and tanh. These are nonlinear functions and result in nonlinear behavior. The definitions of sigmoid and tanh are:
sigmoid: $O = \frac{1}{1 + e^{-a}}$
tanh: $O = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
where $a$ is the input value and $O$ is the output value.
Figure (Activation function) displays the curves of the sigmoid and tanh functions. The x-axis is the input value and the y-axis represents the output it triggers. The output value of the sigmoid function is between 0 and 1, whereas the output value of tanh is between –1 and 1. When the input value is close to 0, the output is very sensitive to slight changes in the input. When the absolute value of the input gets larger, the output becomes less sensitive.
The Microsoft Neural Network uses tanh as the activation function in the hidden nodes. For output nodes, it uses the sigmoid function.
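The combination and activation steps can be sketched in a few lines of Python. This is a minimal illustration rather than the Microsoft implementation; the function names are hypothetical, and no bias term is included because the text does not mention one.

```python
import math

def weighted_sum(inputs, weights):
    """Combination function: multiply each input by its weight and add the products."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(a):
    """Logistic activation; the output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(inputs, weights, activation=math.tanh):
    """One neuron: combine the inputs, then apply the activation function."""
    return activation(weighted_sum(inputs, weights))

# The output reacts strongly to input changes near 0 and weakly far from 0.
for a in (0.0, 0.5, 4.0, 4.5):
    print(a, round(sigmoid(a), 4), round(math.tanh(a), 4))
```

Printing the values at 0.0 and 0.5, then at 4.0 and 4.5, shows the same half-unit change in input producing a much smaller change in output once the input leaves its midrange.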
A basic processing unit

Activation function

Backpropagation, Error Function, and Conjugate Gradient
The core part of training a neural network is backpropagation. The training of a neural network is an iterative process. At each iteration, the algorithm compares the output values with the actual known values to get the errors for each output neuron. The weights pointing to the output neurons are modified based on the error calculations. These modifications are then propagated from the output layer through the hidden layers down to the input layer. All the weights in the neural network are adjusted accordingly. The core process of neural network training is described in the following steps:
- The algorithm randomly assigns values for all the weights in the network at the initial stage (usually ranging from –1.0 to 1.0).
- For each training example (or each set of training examples), it calculates the outputs based on the current weights in the network.
- The output errors are calculated, and the backpropagation process calculates the errors for each output and hidden neuron in the network. The weights in the network are updated.
- Repeat the previous two steps until the stop condition is satisfied.
Some neural networks update the weights after examining each case. This is called case (online) updating. Other neural networks update the weights only after all the sample cases are analyzed. This is called epoch (batch) updating. One iteration through the training dataset is called an epoch. The Microsoft Neural Network uses epoch updating because it is more robust for regression models.
The neural network needs a measure to indicate the quality of the training. This measure is the error function (also called a loss function). The whole purpose of neural network training is to minimize the training error.
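The difference between case updating and epoch updating can be sketched as follows. This is a simplified illustration, not the Microsoft implementation: the names are hypothetical, and `delta_rule` assumes a single linear unit with no activation function so that the arithmetic stays easy to follow.

```python
def case_update(weights, cases, delta_fn):
    """Case (online) updating: apply the weight change after every case."""
    for case in cases:
        delta = delta_fn(weights, case)
        weights = [w + d for w, d in zip(weights, delta)]
    return weights

def epoch_update(weights, cases, delta_fn):
    """Epoch (batch) updating: accumulate the changes and apply them once per pass."""
    total = [0.0] * len(weights)
    for case in cases:
        delta = delta_fn(weights, case)  # computed from the unchanged weights
        total = [t + d for t, d in zip(total, delta)]
    return [w + t for w, t in zip(weights, total)]

def delta_rule(weights, case, rate=0.5):
    """Hypothetical change for one linear unit: rate * (target - output) * input."""
    inputs, target = case
    output = sum(w * x for w, x in zip(weights, inputs))
    return [rate * (target - output) * x for x in inputs]

cases = [([1.0], 1.0), ([1.0], 1.0)]
print(case_update([0.0], cases, delta_rule))   # [0.75]
print(epoch_update([0.0], cases, delta_rule))  # [1.0]
```

With case updating, the second case sees the weight already corrected by the first, so its change is smaller. With epoch updating, both cases are evaluated against the same starting weight and their changes are summed.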
There are many different choices for error functions, for example, the squared residual (the square of the delta between the predicted value and the actual value) or a binary threshold for binary classification (if the delta between the output and the actual value is less than 0.5, then the error is 0; otherwise, it is 1). The Microsoft Neural Network uses sum-of-squares error for continuous attributes and cross-entropy for discrete attributes. The following formula gives one common method for calculating the error for neurons at the output layer, using the derivative of the logistic function:
$Err_i = O_i (1 - O_i)(T_i - O_i)$
In this case, $O_i$ is the output of the output neuron unit $i$, and $T_i$ is the actual value for this output neuron based on the training sample. The error calculation for a hidden neuron is based on the errors of the neurons in the following layer and the associated weights. The following is the formula:
$Err_i = O_i (1 - O_i) \sum_j Err_j w_{ij}$
Here, $O_i$ is the output of the hidden neuron unit $i$, which sends its output to the neurons $j$ in the following layer. $Err_j$ is the error of neuron unit $j$, and $w_{ij}$ is the weight between these two neurons. Once the error of each neuron is calculated, the next step is to adjust the weights in the network accordingly, using the following method:
$w_{ij} = w_{ij} + l \cdot Err_j \cdot O_i$
Here l is a value between 0 and 1. The variable l is called the learning rate. If the value of l is smaller, the changes to the weights after each iteration are smaller, and learning is therefore slower. The value of l usually decreases during the training process. At the initial stage of training, l is large, which allows the neural network to move quickly toward the optimum solution. Afterward it decreases, so that the network can be fine-tuned in its search for the best solution.
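The three formulas above translate directly into code. The following is a minimal sketch with hypothetical function names; it assumes the sigmoid activation, which is where the O(1 - O) factor comes from.

```python
def output_error(o, t):
    """Error of an output neuron: O(1 - O)(T - O)."""
    return o * (1.0 - o) * (t - o)

def hidden_error(o, next_errors, next_weights):
    """Error of a hidden neuron: O(1 - O) times the weighted sum of the
    errors of the neurons it feeds in the following layer."""
    return o * (1.0 - o) * sum(e * w for e, w in zip(next_errors, next_weights))

def updated_weight(w, rate, err_j, o_i):
    """Weight update: w + l * Err_j * O_i, where rate is the learning rate l."""
    return w + rate * err_j * o_i

err_out = output_error(0.5, 1.0)            # 0.125
err_hid = hidden_error(0.5, [err_out], [0.4])
print(err_out, err_hid, updated_weight(0.4, 0.8, err_out, 0.5))
```

Note that the error of a hidden neuron depends only on quantities already computed for the layer after it, which is why the calculation proceeds backward from the output layer.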
Many neural networks apply a method called the conjugate gradient when adjusting the weights after each iteration. The conjugate gradient method is an algorithm for finding the nearest local minimum. A plain gradient method uses the derivative (gradient) to find the next direction. The conjugate gradient method also takes the previous direction into account when it calculates the next direction, so that it avoids the zigzag problem and follows a more direct path toward the minimum.
Because the search space for the best set of weights is huge, with many local optimal points, researchers apply different nonlinear optimization methods to guide the training process. There are many optimization algorithms, such as genetic algorithms, simulated annealing, iterative improvement, and so on.
A Simple Example of Processing a Neural Network
The best way to explain the neural network training process is to walk through the weight update for a single training case. In this example, we use the weighted sum as the combination function and the sigmoid as the activation function. Figure 10.4 shows the topology of a simple neural network with six neurons. The initial weights of the edges are displayed in the figure.
This example has three input nodes and one output node, which are mapped to the four attributes of a sample case. Suppose that the sample case is (1, 1, 0, 1), where the last value is the output. The first step is to calculate the outputs of each hidden and output neuron, as shown in Table.
Calculation of Outputs for Hidden and Output Neurons

We get the output value of neuron 6 which is 0.667. The actual value is given by the sample as 1. We can thus calculate the error of the output neuron. Using the backpropagation method, we can derive all the errors for all the output and hidden neurons as listed in Table.
An example of neural network training

Calculation of Errors for Hidden and Output Neurons

The sample neural network uses the case updating method. Once the errors are calculated, we can adjust the weights accordingly. Table gives the new set of weights after the first training case. The learning rate l is a constant, with a value of 0.8.
Calculation of New Weights
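The forward pass, error calculation, and case update for a network of this shape can be sketched as follows. The initial weights below are hypothetical stand-ins, not the values from the figure, so the numbers this code produces will differ from the 0.667 quoted above. The sketch assumes neurons 1 to 3 are inputs, 4 and 5 are hidden, and 6 is the output, with no bias terms.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical initial weights, keyed by (source neuron, target neuron).
w = {
    (1, 4): 0.2, (2, 4): 0.4, (3, 4): -0.5,
    (1, 5): -0.3, (2, 5): 0.1, (3, 5): 0.2,
    (4, 6): -0.3, (5, 6): -0.2,
}
out = {1: 1.0, 2: 1.0, 3: 0.0}   # input part of the sample case (1, 1, 0, 1)
target, rate = 1.0, 0.8          # last value of the case; learning rate l

# Forward pass: weighted sum, then sigmoid, layer by layer.
for j, sources in ((4, (1, 2, 3)), (5, (1, 2, 3)), (6, (4, 5))):
    out[j] = sigmoid(sum(out[i] * w[(i, j)] for i in sources))

# Backward pass: output error first, then the hidden errors.
err = {6: out[6] * (1 - out[6]) * (target - out[6])}
for i in (4, 5):
    err[i] = out[i] * (1 - out[i]) * err[6] * w[(i, 6)]

# Case updating: adjust every weight right after this single case.
new_w = {(i, j): w_ij + rate * err[j] * out[i] for (i, j), w_ij in w.items()}

for edge in sorted(new_w):
    print(edge, round(w[edge], 4), "->", round(new_w[edge], 4))
```

Weights leaving input neuron 3 do not change in this step, because its output is 0 and the update is proportional to the output of the source neuron.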

Normalization and Mapping
The neural network requires the values of the input variables to be normalized to the same scale; otherwise, variables with a large value scale will dominate the training process. There are many different methods to normalize continuous input attributes, including z-score, z-axis, log score, and so on. The simplest method is the following:
$V = \frac{A - A_{min}}{A_{max} - A_{min}}$
where $A$ is the value of the attribute, $A_{min}$ is its minimum value, and $A_{max}$ is its maximum value. However, this simple method has some issues. For example, if extreme minimum or maximum values exist in the distribution, the normalized result will be skewed. Suppose that the attribute you want to normalize is income, and the majority of the households have an income of less than $200,000. If there is a household with an income over $1,000,000, the majority of the families will be mapped to the first 10–20% of the range. In this case, the log score is a better solution because it maps all the values to the log space first to reduce the scale issue.
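The skew caused by an extreme value is easy to see in a small sketch. The income figures below are made up for illustration, the function names are hypothetical, and the log variant shown here (take the logarithm, then apply min-max) is one plausible reading of "log score" rather than a definition taken from the text.

```python
import math

def min_max(values):
    """Min-max normalization: (A - Amin) / (Amax - Amin).
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_min_max(values):
    """Map positive values to log space first, then apply min-max."""
    return min_max([math.log(v) for v in values])

incomes = [20_000, 50_000, 90_000, 150_000, 1_200_000]  # hypothetical data

print([round(v, 3) for v in min_max(incomes)])
print([round(v, 3) for v in log_min_max(incomes)])
```

With plain min-max, the four ordinary incomes are squeezed into the bottom of the range by the single large one. After the log transform they spread across roughly half of it.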
For discrete variables, the easiest method is to map them to equally spaced points from 0 to 1. For example, there are five states for Education: partial high school, high school, undergraduate, graduate, and Ph.D. These values can be mapped to 0, 0.25, 0.50, 0.75, and 1.0, respectively. Working with the Microsoft Neural Network, you would use the following method for input attribute normalization:
$V = \frac{A - \mu}{\sigma}$
where, for continuous input, $\mu$ is the mean and $\sigma$ is the standard deviation; for discrete input, $\mu = p$ (the probability of a state) and $\sigma^2 = p(1 - p)$.
The relationship between an attribute and neurons is 1 to n: an attribute is mapped to n neurons. The Microsoft Neural Network maps a continuous attribute to two nodes: one representing the value and the other representing the missing state. It maps a discrete attribute to n + 1 nodes, n being the number of distinct states and 1 representing the missing state. If the attribute is binary with two states (Missing or Existing), it is modeled as a single node.
Figure (Input normalization and mapping) shows an example of input normalization and mapping. The top table is the training input data. The bottom table displays the data after the normalization and mapping process. You can see from the figure that the four input columns (not counting the ID) are mapped to 10 input neurons. If Gender, Income, and IQ are the input attributes, and Plan is the predictable attribute, there are seven input neurons and three output neurons.
Input normalization and mapping
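The normalization and the attribute-to-neuron mapping can be sketched as follows. This is an illustration under stated assumptions, not the Microsoft implementation: the function names are hypothetical, the normalization is taken to be a z-score built from the mean and standard deviation described above, and Gender is assumed to have two states, which is what makes the counts match the seven input neurons in the figure.

```python
import math

def zscore_continuous(value, mean, std):
    """Normalize a continuous value: (A - mu) / sigma."""
    return (value - mean) / std

def zscore_discrete_state(is_state, p):
    """Normalize a 0/1 state indicator with mu = p and sigma^2 = p * (1 - p)."""
    indicator = 1.0 if is_state else 0.0
    return (indicator - p) / math.sqrt(p * (1.0 - p))

def neuron_count(attributes):
    """Count neurons: 2 per continuous attribute (value + missing) and
    n + 1 per discrete attribute with n states (states + missing).
    The single-node Missing/Existing special case is not handled here."""
    total = 0
    for kind, n_states in attributes:
        total += 2 if kind == "continuous" else n_states + 1
    return total

# Gender (assumed two states), Income, and IQ as inputs.
inputs = [("discrete", 2), ("continuous", None), ("continuous", None)]
print(neuron_count(inputs))              # 7
print(zscore_continuous(70.0, 50.0, 10.0))
print(zscore_discrete_state(True, 0.5))
```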

Topology of the Network
The topology of the neural network needs to be fixed before training. The number of input and output neurons is determined by the training dataset. The options are mainly related to the configuration of the hidden layers, such as the number of hidden layers and the number of hidden neurons in each hidden layer.
A neural network could have any number of hidden layers. The capacity of a network is a complicated function of the number of nodes and the number of layers, so multiple hidden layers may increase the learning capacity. They also increase the processing time. The other drawback is the risk of overtraining. With too many hidden layers and hidden nodes, the network tends to memorize the training cases instead of generalizing the patterns (similar to the oversplit issue in decision trees). It has been shown that in most cases, one hidden layer is sufficient. The Microsoft Neural Network doesn’t allow more than one hidden layer.
The number of neurons in the hidden layer is also very important. Using too few will starve the network of the resources it needs to solve the problem. Using too many will increase the training time. Researchers propose a rough guideline for choosing the number of hidden neurons: c*sqrt(m*n), where n is the number of input neurons, m is the number of output neurons, and c is a constant. The optimal number varies from problem to problem: you should experiment with the number of nodes. In the Microsoft Neural Network, the default value for c is 4.
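The guideline is simple enough to compute directly. In this sketch the function name is hypothetical, and the result is left as a floating-point number because the text does not say how the value is rounded.

```python
import math

def hidden_neuron_guideline(n_inputs, n_outputs, c=4.0):
    """Rough guideline for the hidden layer size: c * sqrt(m * n)."""
    return c * math.sqrt(n_inputs * n_outputs)

# Seven input neurons and three output neurons, with the default c of 4.
print(round(hidden_neuron_guideline(7, 3), 2))  # 18.33
```

For a network with seven input neurons and three output neurons, the guideline suggests roughly 18 hidden neurons as a starting point for experimentation.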
Similar to other Microsoft algorithms, a mining model based on the Microsoft Neural Network can have multiple predictable attributes. This results in multiple sub-neural-networks. For example, if there are two predict attributes, Age and Home Ownership, you have to create two separate neural networks, one for each predictable attribute. However, if these two attributes are predict_only, they can share the same network. Each input attribute is mapped to multiple input neurons. Sometimes, this can result in a large number of input neurons if there are many discrete attributes with many distinct values. By default, the total number of output neurons per subnetwork is limited to 500 in the Microsoft Neural Network algorithm. If the number of output neurons exceeds 500, the algorithm builds multiple neural networks. When there are many input attributes, the Microsoft Neural Network algorithm invokes the feature selection process, which selects the 255 most important input attributes.
The Ending Condition for Training
The training process of a neural network is iterative. Depending on the complexity of the patterns in the sample dataset, it may take hundreds or even thousands of iterations through the data. What is the stop condition for a neural network? The following is a list of possible stop criteria:
- Sufficient accuracy on a holdout set: The misclassification rate is below a given threshold.
- Maximum iteration: The training process has reached the high limit of the number of iterations.
- Convergence of the weights: The change on the weights after each iteration falls below a threshold.
- Time out: The elapsed training time exceeds the limit.
The Microsoft Neural Network uses the first three conditions as the stop criteria. The training stops when any of these three conditions is satisfied.
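The stop criteria listed above can be combined in a single training loop, sketched below. Every name and default threshold here is hypothetical. `run_epoch` stands for one pass over the training data and is assumed to return the size of the weight change, and `holdout_error` is assumed to return the misclassification rate on a holdout set.

```python
import time

def train(run_epoch, holdout_error, max_iterations=1000, error_threshold=0.05,
          weight_change_threshold=1e-4, time_limit_seconds=None):
    """Run epochs until one stop criterion is met.
    Returns (iterations used, name of the criterion that stopped training)."""
    start = time.monotonic()
    for iteration in range(1, max_iterations + 1):
        weight_change = run_epoch()
        if holdout_error() < error_threshold:
            return iteration, "accuracy"
        if weight_change < weight_change_threshold:
            return iteration, "convergence"
        if (time_limit_seconds is not None
                and time.monotonic() - start > time_limit_seconds):
            return iteration, "timeout"
    return max_iterations, "max_iterations"

# Stub callbacks: the weight change shrinks while the holdout error stays high.
changes = iter([1.0, 0.5, 0.00001])
print(train(lambda: next(changes), lambda: 1.0))  # (3, 'convergence')
```

Leaving `time_limit_seconds` at `None` disables the time-out check, so only the first three criteria apply, as described for the Microsoft Neural Network.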