Back-Propagation Neural Network

Back-Propagation Neural Networks (BPNNs) were implemented using the
Netlab scripts for Matlab (`http://www.ncrg.aston.ac.uk/netlab/`) and
the code in Section 5.5.2. Instances from each class were randomly
assigned to the training set (40 per class), the stop set (20 per
class), and the test set (the remaining 13 to 38 per class). The mean
and standard deviation of each feature were calculated from the
training instances alone and used to normalize the training data to a
mean of 0 and a variance of 1. The same training-set means and
standard deviations were also used to normalize the stop and test
sets.
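
The split-and-normalize step can be illustrated with the following
minimal NumPy sketch. It is not the Matlab/Netlab code used in the
study, and the names `X`, `y`, and `split_and_normalize` are
assumptions introduced for this example.

```python
import numpy as np

def split_and_normalize(X, y, n_train=40, n_stop=20, rng=None):
    """Stratified random split into train/stop/test sets, then z-score
    normalization using statistics from the training set only.
    Illustrative sketch; not the original Netlab code."""
    rng = np.random.default_rng(rng)
    train_idx, stop_idx, test_idx = [], [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        train_idx.extend(idx[:n_train])                  # 40 per class
        stop_idx.extend(idx[n_train:n_train + n_stop])   # 20 per class
        test_idx.extend(idx[n_train + n_stop:])          # remaining 13-38
    X_tr, X_st, X_te = X[train_idx], X[stop_idx], X[test_idx]
    mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
    # Normalize all three sets with the training-set statistics.
    return ((X_tr - mu) / sigma, y[train_idx],
            (X_st - mu) / sigma, y[stop_idx],
            (X_te - mu) / sigma, y[test_idx])
```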

The normalized training and stop sets were used to train a back-propagation neural network with one input per feature being evaluated, 20 hidden nodes, and 10 output nodes. The momentum and learning rate were 0.9 and 0.001, respectively. The target outputs for each instance were defined to be 0.9 for the node representing the correct class of that instance and 0.1 for the other nodes. After each epoch of training, the stop data were passed through the network and a sum-of-squares error was calculated between the actual network outputs and the target outputs. When this error term for the stop data reached a minimum, training was halted. At that point, the test data were applied to the network and the outputs recorded. All of the steps above (starting with the random assignment of the feature data) were repeated 10 times for each set of features being evaluated. The result was therefore 10 networks, each trained with a different subset of the data, and 10 corresponding sets of network outputs.
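
A self-contained sketch of this training loop follows, written in
NumPy rather than the Netlab/Matlab routines actually used. The
function name `train_bpnn`, the weight initialization, and the
`patience` cutoff for detecting the stop-set error minimum are all
assumptions introduced for this example.

```python
import numpy as np

def train_bpnn(X_tr, T_tr, X_st, T_st, n_hidden=20, lr=0.001,
               momentum=0.9, max_epochs=5000, patience=50, rng=None):
    """Train a one-hidden-layer sigmoid network by gradient descent
    with momentum, halting when the stop-set sum-of-squares error
    stops improving. Targets T are 0.9 for the correct class and 0.1
    elsewhere. Illustrative sketch; the study used Netlab's routines."""
    rng = np.random.default_rng(rng)
    n_in, n_out = X_tr.shape[1], T_tr.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_out)); b2 = np.zeros(n_out)
    vel = [np.zeros_like(p) for p in (W1, b1, W2, b2)]
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))

    def forward(X):
        H = sig(X @ W1 + b1)
        return H, sig(H @ W2 + b2)

    best_err, best, since = np.inf, None, 0
    for epoch in range(max_epochs):
        H, Y = forward(X_tr)
        # Back-propagate the sum-of-squares error through both layers.
        d2 = (Y - T_tr) * Y * (1 - Y)
        d1 = (d2 @ W2.T) * H * (1 - H)
        grads = [X_tr.T @ d1, d1.sum(0), H.T @ d2, d2.sum(0)]
        params = [W1, b1, W2, b2]
        for i, (p, g) in enumerate(zip(params, grads)):
            vel[i] = momentum * vel[i] - lr * g   # momentum update
            p += vel[i]
        # Early stopping: monitor sum-of-squares error on the stop set.
        stop_err = np.sum((forward(X_st)[1] - T_st) ** 2)
        if stop_err < best_err:
            best_err, since = stop_err, 0
            best = [p.copy() for p in params]
        else:
            since += 1
            if since >= patience:   # stop-set error has bottomed out
                break
    return best  # weights at the stop-set error minimum
```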

A classification for each test instance was determined in one of
two ways: 1) the network output with the highest value was taken as
the classification of the corresponding test instance, or 2) a
threshold value was used such that only instances with a single
output above the threshold were classified. Instances with either
zero or more than one output above the threshold were assigned to an
'unknown' category. The best threshold for each network was
determined by classifying the stop-set instances with threshold
values ranging from 0.05 to 0.95 in steps of 0.05, assigning a
classification only when a single output exceeded the threshold. The
*accuracy* of the classification at a particular threshold was
defined as the number of correct classifications divided by the
number of attempted classifications (total instances minus unknowns).
The *recall* of the same classifier was defined as the number of
correct classifications divided by the total number of instances. The
best threshold for each network was then defined as the one that
maximized *accuracy*^{2} + *recall*^{2}. Figure 3.6 includes a recall
vs. accuracy plot; the point at which *accuracy*^{2} + *recall*^{2}
is maximized (a threshold of 0.55) is marked.
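
The threshold search can be illustrated with the following sketch,
where `outputs` holds the stop-set network outputs (one row per
instance) and `labels` the true class indices; both names are
assumptions for this example.

```python
import numpy as np

def best_threshold(outputs, labels, thresholds=np.arange(0.05, 1.0, 0.05)):
    """Pick the threshold maximizing accuracy^2 + recall^2 on the stop
    set. An instance is classified only when exactly one network
    output exceeds the threshold; otherwise it is 'unknown'.
    Illustrative sketch of the procedure described in the text."""
    n = len(labels)
    best_t, best_score = None, -np.inf
    for t in thresholds:
        above = outputs > t                      # boolean, per output node
        single = above.sum(axis=1) == 1          # exactly one node fires
        predicted = outputs.argmax(axis=1)
        correct = np.sum(single & (predicted == labels))
        attempted = single.sum()                 # total minus unknowns
        accuracy = correct / attempted if attempted else 0.0
        recall = correct / n
        score = accuracy ** 2 + recall ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```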

To summarize network performance, the test-set outputs from each of the 10 networks were converted to confusion matrices whose elements were the number of instances from a given class assigned to each of the 10 output classes. These matrices were summed across all 10 network trials and the entries converted to percentages. If an 'unknown' class was included in a particular confusion matrix, a second summary number was generated for each diagonal element: the percentage of classification attempts (the number of test instances minus those classified as unknown) that were correct. The performance of each of the 10 individual BPNN classifiers used with a feature set was summarized as the average of the diagonal elements (the percentage of correct classifications) of the confusion matrix for that trial alone. All 10 trials were then summarized by the mean and variance of these 10 classification rates.
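
A minimal sketch of this bookkeeping follows, assuming each trial
provides arrays of true labels and predictions, with -1 standing in
for the 'unknown' class; these conventions are illustrative rather
than the original Matlab code.

```python
import numpy as np

def summarize_trials(trials, n_classes=10):
    """Sum per-trial confusion matrices, convert rows to percentages,
    and report the mean and variance of the per-trial classification
    rates. `trials` is a list of (true_labels, predictions) pairs;
    prediction -1 denotes 'unknown'. Illustrative sketch only."""
    total = np.zeros((n_classes, n_classes + 1))  # last column = unknown
    rates = []
    for true, pred in trials:
        cm = np.zeros_like(total)
        for t, p in zip(true, pred):
            cm[t, p if p >= 0 else n_classes] += 1
        total += cm
        # Per-trial rate: average of the diagonal percentages.
        rates.append(np.mean(np.diag(cm) / cm.sum(axis=1)) * 100)
    percent = total / total.sum(axis=1, keepdims=True) * 100
    return percent, np.mean(rates), np.var(rates)
```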

Ninety-five percent confidence intervals were calculated as described
in Section 1.7.2, with the number of trials (*N* in Equation 1.5)
equal to the number of training or test cases used in a single trial.
The number of samples in a single trial was used rather than the sum
of samples across the 10 trials because the latter does not represent
strictly independent events. Calculated this way, these intervals
represent an upper bound on the confidence interval for the
classification rate.
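
Equation 1.5 is not reproduced in this section; assuming it is the
usual normal approximation to the binomial confidence interval, the
calculation would look like this sketch:

```python
import numpy as np

def ci95(p, n):
    """95% confidence interval for a classification rate p measured on
    n cases, via the normal approximation to the binomial; assumed
    here to match Equation 1.5, which is not shown in this section."""
    half_width = 1.96 * np.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width
```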