Two methods of classification were then applied to the feature data. The first of these was a classification tree [11]. The goal of this classifier is to divide the multidimensional feature space with decision boundaries (in this case linear and parallel to the feature axes) such that images of each class are largely separated from each other. An appealing characteristic of this classifier is that it generates an interpretable tree structure as output that includes rules for correctly recognizing each class of input. By following a particular sample down the tree, it is also possible to determine which features discriminate that sample from those of other classes. A classification tree was generated using all of the training data with the default options of the S-Plus tree() function.
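The original trees were built in S-Plus, which is no longer widely available; purely as an illustrative stand-in, the same procedure can be sketched with scikit-learn's DecisionTreeClassifier. The synthetic arrays, default parameter choices, and Z0-Z48 feature names below are assumptions for the sketch, not details of the original analysis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: in the real analysis each row would hold the 49 Zernike
# features computed for one training image, labeled with its true pattern.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 49))
y_train = rng.choice(["giantin", "hoechst", "lamp2", "nop4", "tubulin"], size=150)

# Axis-parallel decision boundaries: one feature/threshold test per internal node.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Each root-to-leaf path reads as a rule over individual Zernike features,
# which is the interpretability advantage described above.
print(export_text(clf, feature_names=[f"Z{i}" for i in range(49)]))

# Cost-complexity pruning (the role played by prune.tree() in S-Plus) is
# controlled here by ccp_alpha; larger values give smaller trees.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)
```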
Once the tree was generated, the test data were applied to it and performance was assessed by generating a confusion matrix from the resulting classifications (Table 2.4). A confusion matrix shows where a classifier is `confused' about the classification of particular images: the row of an entry indicates the true class of those images, while the column indicates the class to which they were assigned by the classifier. Non-zero values in the off-diagonal elements of the matrix therefore indicate mistakes made by the classifier. The average performance of the classification tree using Zernike features was 65%, where average performance is calculated as the mean of the values along the diagonal of the confusion matrix. The performance is acceptable for all classes except tubulin, which was frequently confused with LAMP2. This suggests either that the classification tree was fit too closely to the training data or that the tubulin and LAMP2 classes are similar to one another when described with Zernike moments. Using the prune.tree() function of S-Plus and an empirically chosen cost-complexity measure, it was possible to adjust the classification tree so that all classes were recognized at a rate greater than 50%, although the average classification rate fell slightly to 63% (data not shown). Because the performance was not entirely satisfactory and because the pruning required was not systematic, a more sophisticated classifier was investigated and implemented.
Table 2.4. Output of the classification tree on the test data. Rows give the true class of the images; columns give the class assigned by the classifier.

True class | Giantin | Hoechst | LAMP2 | NOP4 | Tubulin | Unknown
Giantin    |     80% |      3% |    7% |   7% |      0% |      3%
Hoechst    |     20% |     80% |    0% |   0% |      0% |      0%
LAMP2      |     10% |      3% |   62% |  10% |     15% |      0%
NOP4       |      0% |      0% |   25% |  75% |      0% |      0%
Tubulin    |      0% |      0% |   69% |   0% |     27% |      4%
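As a concrete illustration of how the entries of Table 2.4 (and the 65% average) are obtained, the sketch below computes a row-normalized confusion matrix and the mean of its diagonal. The scikit-learn call and the tiny made-up label vectors are stand-ins for the original S-Plus analysis.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["giantin", "hoechst", "lamp2", "nop4", "tubulin"]

# y_true: known class of each test image; y_pred: class assigned by the
# classifier.  Small made-up vectors stand in for the real test set here.
y_true = ["giantin", "giantin", "hoechst", "lamp2", "nop4", "tubulin", "tubulin"]
y_pred = ["giantin", "hoechst", "hoechst", "lamp2", "nop4", "lamp2", "tubulin"]

cm = confusion_matrix(y_true, y_pred, labels=classes)

# Row-normalize so each row gives the percentage of that true class assigned
# to each output class, as in Tables 2.4 and 2.5.
cm_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)

# "Average performance" is the mean of the diagonal of the normalized matrix.
avg_rate = np.mean(np.diag(cm_pct))
print(np.round(cm_pct, 1), avg_rate)
```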
A classifier that is widely used and is implemented in various commercial and freely distributed software packages is the back-propagation neural network (BPNN) [42]. The BPNN was chosen as the second classifier because it is able to generate decision boundaries that are significantly more complex than the rectilinear boundaries of the classification tree [13]. A disadvantage of the BPNN is that the ready interpretability of the classification tree is lost; it is not possible, for example, to easily determine which features are being used to discriminate which classes. The BPNN was implemented in PDP++ as a single-hidden-layer network, with 49 inputs (one for each Zernike feature), 20 hidden nodes, and 5 output nodes (one for each class of image). The network was fully connected between layers.
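The network itself was built and trained in PDP++; purely for illustration, an analogous single-hidden-layer, fully connected 49-20-5 network can be declared with scikit-learn's MLPClassifier. Only the layer sizes come from the text; the logistic activation, SGD solver, and learning rate below are assumptions.

```python
from sklearn.neural_network import MLPClassifier

# 49 inputs (one per Zernike feature) -> 20 hidden nodes -> 5 output nodes
# (one per class); successive layers are fully connected, as in the PDP++ network.
bpnn = MLPClassifier(hidden_layer_sizes=(20,),
                     activation="logistic",   # sigmoid hidden units
                     solver="sgd",            # weights updated by back-propagation
                     learning_rate_init=0.1)
```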
The network was trained using the train/stop/evaluate partitions of
the data such that only results from the training data set were used
to modify the network weights. To visualize the training process, the
sum of squared error between the desired value of the output nodes and
the actual value of those nodes was calculated at regular intervals.
The error for the training data was calculated after each training
epoch and the error for the stop data was determined after every third
training epoch. One epoch of training is defined as a single pass
through all of the samples in the training data. In order to prevent
overtraining and therefore "memorization" of the training data,
training was stopped when the sum of squared error value for the stop
data was at a minimum. At this point, the evaluation data were
applied to the network and the output node of the network with the
largest value was defined as the classification result for each
evaluation example. Results are shown in Table
2.5, along with an average classification rate
and its 95% confidence interval. The average rate of correct
classification for this method, 87% (the mean of the diagonal entries
of Table 2.5), is better than the same
measure for the classification tree. It can be concluded that the
BPNN is an improvement over the classification tree in terms of its
ability to classify the images in this data set.
Table 2.5. Output of the BPNN on the evaluation data. Rows give the true class of the images; columns give the class assigned by the network.

True class | Giantin | Hoechst | LAMP2 | NOP4 | Tubulin
Giantin    |     97% |      0% |    3% |   0% |      0%
Hoechst    |      3% |     93% |    0% |   3% |      0%
LAMP2      |     12% |      2% |   70% |  10% |      7%
NOP4       |      0% |      0% |    0% |  88% |     13%
Tubulin    |      0% |      0% |   12% |   4% |     85%
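The train/stop/evaluate procedure described above can be sketched as follows: train one epoch at a time, track the training-set error each epoch and the stop-set error every third epoch, keep the network at the minimum of the stop-set error, and classify the evaluation set by the largest output node. The data partitions (X_train/y_train, X_stop/y_stop, X_eval/y_eval), the 500-epoch cap, and the use of scikit-learn in place of PDP++ are all assumptions for illustration.

```python
import copy
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelBinarizer

classes = np.array(["giantin", "hoechst", "lamp2", "nop4", "tubulin"])
lb = LabelBinarizer().fit(classes)              # desired outputs: one-hot targets

def sum_squared_error(net, X, y):
    """Sum of squared error between desired and actual output-node values."""
    return float(np.sum((lb.transform(y) - net.predict_proba(X)) ** 2))

net = MLPClassifier(hidden_layer_sizes=(20,), activation="logistic", solver="sgd")
train_errors, stop_errors = [], []
best_err, best_net = np.inf, None
for epoch in range(1, 501):                     # epoch cap is an arbitrary choice
    # One pass (epoch) over the training data; only these samples update weights.
    net.partial_fit(X_train, y_train, classes=classes)
    train_errors.append(sum_squared_error(net, X_train, y_train))
    if epoch % 3 == 0:                          # stop-set error every third epoch
        err = sum_squared_error(net, X_stop, y_stop)
        stop_errors.append(err)
        # Rather than halting exactly at the minimum, this sketch simply keeps
        # the snapshot of the network with the lowest stop-set error.
        if err < best_err:
            best_err, best_net = err, copy.deepcopy(net)

# Evaluate the retained network: predict() assigns each evaluation image to
# the class whose output node is largest.
y_pred = best_net.predict(X_eval)
cm = confusion_matrix(y_eval, y_pred, labels=classes)
print(100.0 * cm / cm.sum(axis=1, keepdims=True))   # row percentages, as in Table 2.5
```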
If this average classification rate seems inadequate, the following
should be noted. First, a random classifier (one that is completely
unable to discriminate between the image classes) would be expected to
produce an average classification rate of only 20%. Second, it is
possible to take advantage of the nature of the samples used for
imaging to improve on this result. Specifically, if one prepares a
homogeneous sample from a single class (i.e., identically prepared
cells) and uses a majority-rule classification scheme, in which the
sample is assigned the class returned for the majority of the individual
cells studied, the classification rate for the sample as a whole can be
improved.
Treating the single-cell classifications as Bernoulli random variables
(i.e., each cell is either classified correctly or it is not) gives the
following expression for the probability that the majority
classification is correct:

\[
P(\text{majority correct}) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n}
\binom{n}{k} \, p^{k} (1-p)^{n-k}
\tag{2.16}
\]

where n is the number of cells classified (taken to be odd so that ties
cannot occur) and p is the single-cell classification rate. For p > 0.5
this probability approaches 1 as n increases, which is the source of the
requirement for at least 50% correct single-cell classification referred
to below.
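A few lines suffice to evaluate this expression; the function below is only a sketch of Equation 2.16, and the 87% single-cell rate and nine-cell sample in the example are illustrative choices.

```python
from math import comb

def p_majority_correct(p, n):
    """Probability that the majority of n independent single-cell
    classifications, each correct with probability p, is correct (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With an 87% single-cell rate, a majority vote over 9 cells from a
# homogeneous sample is correct roughly 99.7% of the time.
print(p_majority_correct(0.87, 9))   # ~0.997
```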
The results above utilized more than 300 images generated using five different labels. The acquisition of these images therefore represents a significant investment of time and resources. To gain some insight into how few images per class might be needed to train a useful classifier, smaller training sets were generated by taking the first N samples from each class. Ten samples per class proved to be too few, as the LAMP2 images were classified correctly only 40% of the time (data not shown), violating the requirement for a minimum of 50% correct classification discussed in the analysis surrounding Equation 2.16. Using 15 samples per class, however, provided reasonable results. In this case the correct classification rates were 83% for giantin, 93% for Hoechst, 62% for LAMP2, 63% for NOP4, and 73% for tubulin, for an average rate of 75%. These values, while not ideal for single-cell classification, can certainly be useful with the majority-rule classification approach described above. For the localization patterns discussed here, these results also indicate that it is possible to train useful classifiers using fewer than 100 total images for five classes.
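The reduced training sets were formed simply by keeping the first N samples of each class; a minimal sketch of that subsetting step is shown below (the array names follow the earlier sketches and are assumptions).

```python
import numpy as np

def first_n_per_class(X, y, n):
    """Keep the first n samples of each class, preserving original order."""
    y = np.asarray(y)
    keep = np.sort(np.concatenate(
        [np.flatnonzero(y == c)[:n] for c in np.unique(y)]))
    return X[keep], y[keep]

# e.g. 15 images per class -> 75 training images in total for five classes
X_small, y_small = first_n_per_class(X_train, y_train, 15)
```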