
Back-Propagation Neural Network

Back-propagation neural networks (BPNNs) were implemented using the Netlab scripts for Matlab and the code in Section 5.5.2. A fixed number of instances from each class was randomly assigned to the training set (40 instances per class), the stop set (20 instances per class), and the test set (the remaining 13 to 38 instances per class). The mean and standard deviation of each feature were calculated from the instances assigned to the training set only. These values were used to normalize the training data to a mean of 0 and a variance of 1, and the same training-set values were then used to normalize the stop and test sets.
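The split and normalization steps can be sketched in Python as follows. This is a minimal illustration and not the Netlab/Matlab code used in this work: the function names are hypothetical, the sample standard deviation (dividing by n - 1) is an assumption, and features with zero variance in the training set are not handled.

```python
import math
import random

def split_class(instances, n_train=40, n_stop=20, rng=random):
    """Randomly split one class's instances into training, stop, and test lists."""
    shuffled = list(instances)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    stop = shuffled[n_train:n_train + n_stop]
    test = shuffled[n_train + n_stop:]  # whatever remains for this class
    return train, stop, test

def fit_normalizer(train_rows):
    """Per-feature mean and standard deviation, from training rows only."""
    n = len(train_rows)
    n_features = len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        # Assumption: sample variance (n - 1 in the denominator).
        var = sum((row[j] - means[j]) ** 2 for row in train_rows) / (n - 1)
        stds.append(math.sqrt(var))
    return means, stds

def apply_normalizer(rows, means, stds):
    """Normalize any data set (training, stop, or test) with training statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

The point of the design is that `fit_normalizer` sees only the training rows, while `apply_normalizer` is reused unchanged for the stop and test sets, so no information from those sets enters the normalization.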

The normalized training and stop sets were used to train a back-propagation neural network with a number of inputs equal to the number of features being evaluated, 20 hidden nodes, and 10 output nodes. The momentum and learning rate were 0.9 and 0.001, respectively. The target outputs of the network for each instance were defined to be 0.9 for the node representing the correct class of that instance and 0.1 for the other outputs. After each epoch of training, the stop data were passed through the network and the sum of squared errors between the actual network outputs and the target outputs was calculated. When this error term for the stop data reached a minimum, training was halted. At that point, the test data were applied to the network and the outputs recorded. All of the steps above (starting with the random assignment of the feature data) were repeated 10 times for each set of features being evaluated. The result of this iteration was therefore 10 networks, each trained with a different subset of the data, and 10 corresponding sets of network output data.
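The target encoding and the stopping rule can be sketched as below. This is an illustration only: `train_one_epoch` and `evaluate_stop` are hypothetical callbacks standing in for the actual network training and stop-set evaluation, and the `patience` parameter (how many epochs to wait before declaring that the minimum has passed) is an assumption, since the text does not say how the minimum was detected.

```python
def make_targets(class_index, n_classes=10, on=0.9, off=0.1):
    """Target vector: 0.9 at the correct class node, 0.1 elsewhere."""
    return [on if k == class_index else off for k in range(n_classes)]

def sum_squared_error(outputs, targets):
    """Sum of squared differences over all instances and output nodes."""
    return sum((o - t) ** 2
               for out_row, tgt_row in zip(outputs, targets)
               for o, t in zip(out_row, tgt_row))

def train_with_early_stopping(train_one_epoch, evaluate_stop,
                              max_epochs=1000, patience=10):
    """Train until the stop-set error has not improved for `patience` epochs.

    train_one_epoch() -> network state after one more epoch (hypothetical)
    evaluate_stop()   -> stop-set sum of squared errors (hypothetical)
    Returns the epoch, error, and state at the stop-set minimum.
    """
    best_error = float("inf")
    best_epoch = -1
    best_state = None
    for epoch in range(max_epochs):
        state = train_one_epoch()
        error = evaluate_stop()
        if error < best_error:
            best_error, best_epoch, best_state = error, epoch, state
        elif epoch - best_epoch >= patience:
            break  # assume the minimum has been passed
    return best_epoch, best_error, best_state
```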

A classification for each test instance was determined in one of two ways: 1) The network output with the highest value was defined to be the classification of the corresponding test instance. 2) A threshold value was used such that only instances for which there was a single output above the threshold were classified. Instances for which either no output or more than one output was above the threshold were assigned to an `unknown' category. The best threshold for each network was determined by classifying the stop set instances using threshold values ranging from 0.05 to 0.95 in steps of 0.05. The accuracy of the classification for a particular threshold was defined to be the number of correct classifications divided by the number of attempted classifications (total instances minus unknowns). The recall of the same classifier was defined as the number of correct classifications divided by the total number of instances. The best threshold for each network was then defined to be the one that maximized $\mathrm{accuracy}^2+\mathrm{recall}^2$. Figure 3.6 includes a recall vs. accuracy plot. The point at which $\mathrm{accuracy}^2+\mathrm{recall}^2$ is maximized (a threshold of 0.55) is marked with a $\circ$.

Figure 3.6: Recall is plotted versus accuracy for threshold values ranging from 0.1 to 0.9. The threshold that maximizes $\mathrm{accuracy}^2+\mathrm{recall}^2$ (0.55) is marked with a $\circ$.
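The two classification rules and the threshold search can be sketched as follows. The names are hypothetical, `UNKNOWN = -1` is an arbitrary marker for the `unknown' category, and two details are assumptions not stated in the text: accuracy is taken to be 0 when no classification is attempted, and ties between thresholds are resolved in favor of the lowest threshold.

```python
UNKNOWN = -1  # hypothetical marker for the `unknown' category

def classify_max(outputs):
    """Rule 1: the class of the output node with the highest value."""
    return max(range(len(outputs)), key=lambda k: outputs[k])

def classify_threshold(outputs, threshold):
    """Rule 2: classify only if exactly one output exceeds the threshold."""
    above = [k for k, value in enumerate(outputs) if value > threshold]
    return above[0] if len(above) == 1 else UNKNOWN

def accuracy_and_recall(predictions, labels):
    """accuracy = correct / attempted; recall = correct / all instances."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    attempted = sum(1 for p in predictions if p != UNKNOWN)
    accuracy = correct / attempted if attempted else 0.0  # assumption
    recall = correct / len(labels)
    return accuracy, recall

def best_threshold(stop_outputs, stop_labels):
    """Search 0.05..0.95 in steps of 0.05 for the largest accuracy^2 + recall^2."""
    thresholds = [round(0.05 * i, 2) for i in range(1, 20)]
    best_t, best_score = None, -1.0
    for t in thresholds:
        preds = [classify_threshold(o, t) for o in stop_outputs]
        acc, rec = accuracy_and_recall(preds, stop_labels)
        score = acc ** 2 + rec ** 2
        if score > best_score:  # strict: the lowest threshold wins ties
            best_t, best_score = t, score
    return best_t
```

Because the threshold is chosen on the stop set, the test set plays no part in selecting it.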

To summarize network performance, the test-set outputs from each of the 10 networks were converted to confusion matrices in which the elements were the number of instances from a given class that had been assigned to each of the 10 output classes. These matrices were summed across all 10 network trials and the entries were then converted to percentages. If an `unknown' class was included in a particular confusion matrix, a second summary number was generated for each diagonal element: the percentage of the classification attempts (the number of instances in the test set minus those that were classified as unknown) that were correct. The performance of each of the 10 individual BPNN classifiers used with each feature set was summarized as the average of the diagonal elements (percentage of correct classifications) of a confusion matrix generated from that trial only. All 10 trials were then summarized by the mean and variance of these 10 classification rates.
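The summary statistics can be sketched as below. The function names are hypothetical, and two choices are assumptions: percentages are computed per row (that is, relative to the number of test instances of each true class), and the variance across trials is the sample variance.

```python
import statistics

def confusion_matrix(labels, predictions, n_classes, unknown=None):
    """Rows are true classes; columns are assigned classes.

    If `unknown` is given, an extra last column counts predictions equal to it.
    """
    n_cols = n_classes + (1 if unknown is not None else 0)
    matrix = [[0] * n_cols for _ in range(n_classes)]
    for y, p in zip(labels, predictions):
        col = n_classes if (unknown is not None and p == unknown) else p
        matrix[y][col] += 1
    return matrix

def sum_matrices(matrices):
    """Element-wise sum of equally sized confusion matrices."""
    total = [row[:] for row in matrices[0]]
    for m in matrices[1:]:
        for i, row in enumerate(m):
            for j, v in enumerate(row):
                total[i][j] += v
    return total

def to_row_percentages(matrix):
    """Convert counts to percentages of each true class (assumed per row)."""
    result = []
    for row in matrix:
        n = sum(row)
        result.append([100.0 * v / n if n else 0.0 for v in row])
    return result

def percent_correct_of_attempts(matrix, n_classes):
    """Per class: correct / (row total minus unknowns), as a percentage."""
    rates = []
    for i, row in enumerate(matrix):
        attempted = sum(row[:n_classes])  # excludes the unknown column
        rates.append(100.0 * row[i] / attempted if attempted else 0.0)
    return rates

def mean_diagonal(percent_matrix):
    """Average of the diagonal: the classification rate for one trial."""
    n = len(percent_matrix)
    return sum(percent_matrix[i][i] for i in range(n)) / n

def summarize_trials(rates):
    """Mean and (assumed sample) variance of the per-trial rates."""
    return statistics.mean(rates), statistics.variance(rates)
```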

Ninety-five percent confidence intervals were calculated as described in Section 1.7.2, with the number of trials ($N$ in Equation 1.5) equal to the number of training or test cases used in a single trial. The number of samples in a single trial was used rather than the sum of samples used across the 10 trials because the latter number does not represent strictly independent events. Calculated this way, these intervals represent an upper bound on the confidence interval for the classification rate.
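The effect of the choice of $N$ can be illustrated with a short sketch. Equation 1.5 is not reproduced in this section, so the code below assumes the common normal approximation to the binomial, $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/N}$, which may differ from the equation actually used. The function name and the example values of $N$ are hypothetical.

```python
import math

def binomial_ci(p_hat, n, z=1.96):
    """Approximate 95% confidence interval for a proportion.

    Assumption: normal approximation to the binomial; this may not be
    the form of Equation 1.5.
    """
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical numbers: the same observed rate with N from a single trial
# versus N pooled over 10 trials. The smaller N gives the wider interval.
single = binomial_ci(0.8, 100)
pooled = binomial_ci(0.8, 1000)
```

Under this assumption, the interval narrows in proportion to $1/\sqrt{N}$, so using the single-trial $N$ gives the wider, more conservative interval.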

Copyright ©1999 Michael V. Boland