The back-propagation neural network (BPNN, see Figure
1.9) was developed by Rumelhart *et al.*
[12] as a solution to the problem of training
multi-layer perceptrons. The fundamental advances represented by the
BPNN were the inclusion of a differentiable transfer function at each
node of the network and the use of error back-propagation to modify
the internal network weights after each training epoch.
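
The two advances described above can be sketched concretely. Below is a minimal one-hidden-layer BPNN trained by gradient descent with error back-propagation; the sigmoid is the differentiable transfer function at each node. The layer sizes, learning rate, epoch count, and the XOR toy problem are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Differentiable transfer function applied at every node
    return 1.0 / (1.0 + np.exp(-z))

def train_bpnn(X, y, n_hidden=4, lr=0.5, epochs=5000):
    # Illustrative sketch: one hidden layer, batch gradient descent
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    W2 = rng.normal(0.0, 0.5, (n_hidden, y.shape[1]))
    for _ in range(epochs):
        h = sigmoid(X @ W1)                       # forward: hidden activations
        o = sigmoid(h @ W2)                       # forward: network outputs
        delta_o = (o - y) * o * (1 - o)           # output-layer error signal
        delta_h = (delta_o @ W2.T) * h * (1 - h)  # error back-propagated to hidden layer
        W2 -= lr * h.T @ delta_o                  # weight updates after each epoch
        W1 -= lr * X.T @ delta_h
    return W1, W2

# XOR: a problem a single-layer perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_bpnn(X, y)
pred = sigmoid(sigmoid(X @ W1) @ W2)
mse = float(np.mean((pred - y) ** 2))
```

The back-propagated term `delta_h` is what distinguishes the BPNN from the single-layer perceptron: the chain rule carries the output error through the differentiable transfer function down to the hidden-layer weights.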

The BPNN was chosen as a classifier primarily because of its ability
to generate complex decision boundaries in the feature space
[13]. There is even work suggesting that a BPNN, under
appropriate circumstances, can approximate Bayesian posterior
probabilities at its outputs [14]. This is significant
because a Bayesian classifier provides the best performance possible
(i.e., lowest error rate) for a given distribution of the feature
data. As with other non-parametric approaches to pattern
classification, it is not possible to predict the performance of a
BPNN *a priori*. Furthermore, there are several parameters of
the BPNN that must be chosen, including the number of training
samples, the number of hidden nodes, and the learning rate.

Based on the work of Baum and Haussler [15], it is
possible to place a bound (*m*) on the number of training samples
needed to *guarantee* a particular level of performance on a
set of test samples drawn from the same distribution as the training
data. Specifically, if at least *m* samples are used to train a
network with *W* weights and *N* nodes such that a fraction
$1 - \epsilon/2$ of them are classified correctly, then one can be
confident that a fraction $1 - \epsilon$ of future (test) samples
from the same distribution will be classified correctly, where

$$m \ge \frac{32W}{\epsilon} \ln\frac{32N}{\epsilon} \tag{1.3}$$

As a specific example, to guarantee no more than a 10% error in classifying the test data, the number of training samples should be roughly 10 times the number of weights in the network. For a typical network generated below, this represents a requirement for 5000-10000 training samples. It is simply not tractable to generate that many images. Fortunately, this bound does not preclude the possibility of building a successful classifier from fewer training samples, as many studies have demonstrated empirically.
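
The arithmetic behind the "10 times the number of weights" rule of thumb can be checked with a short calculation. The layer sizes below (25 inputs, 20 hidden nodes, 4 outputs) are hypothetical stand-ins for the "typical network" mentioned above, and the rule deliberately drops the logarithmic factor and constants in the full bound.

```python
def weight_count(n_inputs, n_hidden, n_outputs):
    # Weights in a fully connected one-hidden-layer network,
    # counting one bias weight per hidden and output node.
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

def training_samples_rule_of_thumb(n_weights, test_error):
    # Rough rule of thumb: m ~ W / epsilon
    # (weights divided by the tolerated test-error rate)
    return n_weights / test_error

W = weight_count(25, 20, 4)   # hypothetical layer sizes, not from the text
m = training_samples_rule_of_thumb(W, 0.10)
```

With these illustrative sizes the network has 604 weights, so a tolerated 10% test error calls for about 6000 training samples, consistent with the 5000-10000 range quoted above.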

The theoretical basis for selecting the number of hidden nodes to use in a single hidden layer network is not well developed. The only general method available to optimize this parameter is to test the network with various numbers of hidden nodes and select the one that performs best.
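
The exhaustive search just described can be sketched as follows. Here `train_and_score` is a hypothetical helper that, in a real experiment, would train a BPNN with the given number of hidden nodes and return its accuracy on a held-out validation set; the lambda below is only a stand-in so the sketch runs.

```python
def select_hidden_nodes(candidates, train_and_score):
    # Exhaustive search over hidden-layer sizes: train a network for
    # each candidate and keep the size with the best validation score.
    best_n, best_score = None, float("-inf")
    for n in candidates:
        score = train_and_score(n)
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score

# Usage with a stand-in scorer that peaks at 10 hidden nodes
# (a real scorer would train and evaluate an actual network):
best_n, best_score = select_hidden_nodes(
    range(2, 21, 2), lambda n: 1.0 - abs(n - 10) / 20.0)
```

Because each candidate requires a full training run, the search is costly; in practice the candidate list is kept short and coarse.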