next up previous contents
Next: k-Nearest Neighbor Classifiers Up: Choice of Classifiers Previous: Classification Trees

Back-Propagation Neural Networks

The back-propagation neural network (BPNN, see Figure 1.9) was developed by Rumelhart et al. [12] as a solution to the problem of training multi-layer perceptrons. The fundamental advances represented by the BPNN were the inclusion of a differentiable transfer function at each node of the network and the use of error back-propagation to modify the internal network weights after each training epoch.

Figure 1.9: A schematic of a back-propagation neural network. The back-propagation neural networks used in this work all have three layers of neurons, or nodes (input, hidden, and output). Each node in the input and hidden layers is connected to each of the nodes in the next layer (hidden or output). All connections between nodes are directed (i.e. the information flows only one way), and there are no connections between the nodes within a particular layer. Each connection between nodes has a weighting factor associated with it. These weights are modified using the back-propagation algorithm during the training process to produce ``learning''.

The BPNN was chosen as a classifier primarily because of its ability to generate complex decision boundaries in the feature space [13]. There is even work suggesting that a BPNN, under appropriate circumstances, can approximate Bayesian posterior probabilities at its outputs [14]. This is significant because a Bayesian classifier provides the best performance possible (i.e., lowest error rate) for a given distribution of the feature data. As with other non-parametric approaches to pattern classification, it is not possible to predict the performance of a BPNN a priori. Furthermore, there are several parameters of the BPNN that must be chosen, including the number of training samples, the number of hidden nodes, and the learning rate.

Based on the work of Baum and Haussler [15], it is possible to place a bound (m) on the number of training samples needed to guarantee a particular level of performance on a set of test samples drawn from the same distribution as the training data. Specifically, if at least m samples are used to train a network with W weights and N nodes such that a fraction equal to $1-\frac{\epsilon}{2}$ of them are classified correctly, then one can be confident that a fraction $1-\epsilon$ of future (test) samples from the same distribution will be classified correctly, where

\begin{displaymath}m \geq O\left(\frac{W}{\epsilon}\log\frac{N}{\epsilon}\right)
\end{displaymath} (1.3)

As a specific example, to guarantee no more than a 10% error in classifying the test data, the number of training samples should be equal to roughly 10 times the number of weights in the network. For a typical network generated below, this represents a requirement for 5000-10000 training samples. It is simply not tractable to generate that many images. Fortunately, this bound does not preclude the possibility of generating a successful classifier using fewer training samples, as many studies have empirically demonstrated.

The theoretical basis for selecting the number of hidden nodes to use in a single hidden layer network is not well developed. The only general method available to optimize this parameter is to test the network with various numbers of hidden nodes and select the one that performs best.

next up previous contents
Next: k-Nearest Neighbor Classifiers Up: Choice of Classifiers Previous: Classification Trees
Copyright ©1999 Michael V. Boland