Classification Using Zernike Features

Two methods of classification were then applied to the feature data. The first of these was a classification tree [11]. The goal of this classifier is to divide the multidimensional feature space with decision boundaries (in this case linear and parallel to the feature axes) such that images of each class are largely separated from each other. The appealing characteristic of this classifier is that it generates an interpretable tree structure as output that includes rules for correctly recognizing each class of input. By following a particular sample down the tree, it is also possible to determine which features are able to discriminate that sample from those of other classes. A classification tree was generated using all of the training data with the default options of the S-Plus tree function.
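The axis-parallel splits described above can be illustrated with a small sketch. This is not the S-Plus tree fit in the original work; the feature indices, thresholds, and class names below are invented for illustration of how each internal node thresholds a single feature, producing decision boundaries parallel to the feature axes and a rule path that can be read off for any sample.

```python
# Hypothetical two-split classification tree: each node tests one feature
# against a threshold, so every decision boundary is axis-parallel.
# Feature indices, thresholds, and class labels are illustrative only.

def tree_classify(features):
    """Classify one feature vector by following it down the tree."""
    if features[0] < 0.5:        # root node: split on feature 0
        return "class_A"
    elif features[3] < 1.2:      # second node: split on feature 3
        return "class_B"
    else:
        return "class_C"

print(tree_classify([0.2, 0.0, 0.0, 0.0]))  # class_A
print(tree_classify([0.9, 0.0, 0.0, 2.0]))  # class_C
```

Reading the path a sample takes (e.g., "feature 0 >= 0.5 and feature 3 < 1.2") is exactly the kind of interpretable rule that makes this classifier appealing.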

Once the tree was generated, the test data were applied to it and performance was assessed by generating a confusion matrix from the resulting classifications (Table 2.4). A confusion matrix is generated by determining where a classifier is `confused' about the classification of particular images. The row of a particular entry indicates the true classification of those images while the column represents the class to which those images were assigned by the classifier. Non-zero values in the off-diagonal elements of the matrix therefore indicate mistakes made by the classifier. The average performance of the classification tree using Zernike features was 65%, where average performance is calculated as the mean of the values along the diagonal of the confusion matrix. The performance is acceptable for all classes except tubulin, which was frequently confused with LAMP2. These results suggest that the classification tree either overfit the training data or that the tubulin and LAMP2 classes are similar to one another when described with Zernike moments. Using the prune.tree() function of S-Plus and an empirically chosen cost-complexity measure, it was possible to improve the classification tree such that all classes were recognized at a rate greater than 50%, with an average classification rate of 63% (data not shown). Because the performance was not entirely satisfactory and because the pruning required was not systematic, a more sophisticated classifier was researched and implemented.

Table 2.4: Confusion matrix generated from the output of a classification tree trained and tested with the Zernike features. The average classification rate is $65\pm7.5\%$ (mean $\pm$ 95% confidence).
                       Output of the Classification Tree
True Classification  Giantin  Hoechst  LAMP2  NOP4  Tubulin  Unknown
Giantin                80%       3%      7%     7%     0%      3%
Hoechst                20%      80%      0%     0%     0%      0%
LAMP2                  10%       3%     62%    10%    15%      0%
NOP4                    0%       0%     25%    75%     0%      0%
Tubulin                 0%       0%     69%     0%    27%      4%
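The confusion-matrix bookkeeping and the average classification rate (the mean of the diagonal) can be made concrete with a short sketch. This is not code from the original work; it simply reproduces the arithmetic using the percentages of Table 2.4 (the "Unknown" column is omitted since it never contributes to the diagonal).

```python
# confusion[true_class][predicted_class] = percent of images (Table 2.4).
# Rows are the true class; columns are the class assigned by the tree.

classes = ["Giantin", "Hoechst", "LAMP2", "NOP4", "Tubulin"]

confusion = {
    "Giantin": {"Giantin": 80, "Hoechst": 3, "LAMP2": 7, "NOP4": 7, "Tubulin": 0},
    "Hoechst": {"Giantin": 20, "Hoechst": 80, "LAMP2": 0, "NOP4": 0, "Tubulin": 0},
    "LAMP2":   {"Giantin": 10, "Hoechst": 3, "LAMP2": 62, "NOP4": 10, "Tubulin": 15},
    "NOP4":    {"Giantin": 0, "Hoechst": 0, "LAMP2": 25, "NOP4": 75, "Tubulin": 0},
    "Tubulin": {"Giantin": 0, "Hoechst": 0, "LAMP2": 69, "NOP4": 0, "Tubulin": 27},
}

# Average classification rate: mean of the diagonal entries.
average_rate = sum(confusion[c][c] for c in classes) / len(classes)
print(average_rate)  # 64.8, reported as 65%
```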

A classifier that is widely used and is implemented in various commercial and freely distributed software packages is the back-propagation neural network (BPNN) [42]. The BPNN was chosen as the second classifier because it is able to generate decision boundaries that are significantly more complex than the rectilinear boundaries of the classification tree [13]. A disadvantage of the BPNN is that the ready interpretability of the classification tree is lost. It is not possible, for example, to easily determine which features are being used to discriminate which classes. The BPNN was implemented in PDP++ as a single hidden layer network, with 49 inputs (one for each Zernike feature), 20 hidden nodes, and 5 output nodes (one for each class of image). The network was fully connected between layers.

The network was trained using the train/stop/evaluate partitions of the data such that only results from the training data set were used to modify the network weights. To visualize the training process, the sum of squared error between the desired and actual values of the output nodes was calculated at regular intervals. The error for the training data was calculated after each training epoch and the error for the stop data was determined after every third training epoch. One epoch of training is defined as a single pass through all of the samples in the training data. To prevent overtraining and therefore "memorization" of the training data, training was stopped when the sum of squared error for the stop data was at a minimum. At this point, the evaluation data were applied to the network and the output node of the network with the largest value was defined as the classification result for each evaluation example. Results are shown in Table 2.5, along with an average classification rate and its 95% confidence interval. The average rate of correct classification for this method, $87\pm5.4\%$, is better than the same measure for the classification tree. It can be concluded that the BPNN is an improvement over the classification tree in terms of its ability to classify the images in this data set.
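The early-stopping rule described above can be sketched as follows. The error values below are fabricated for illustration (in the actual work they came from the PDP++ network); the point is the logic: the stop-set sum of squared error typically falls, reaches a minimum, and then rises again as the network begins to memorize the training data, and training is halted at that minimum.

```python
# Illustrative early stopping: pick the epoch at which the stop-set
# sum-of-squared error (SSE) is minimal. The SSE values are invented.

def stop_epoch(stop_errors):
    """Return the index (epoch) of the minimum stop-set error."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(stop_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
    return best_epoch

# SSE on the stop partition, sampled every third epoch: it falls, bottoms
# out, then rises again -- the signature of overtraining.
stop_sse = [9.1, 6.4, 4.8, 4.1, 3.9, 4.0, 4.6, 5.3]
print(stop_epoch(stop_sse))  # 4
```

The weights saved at that epoch are then used to classify the evaluation data.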

Table 2.5: Confusion matrix generated from the output of a back-propagation neural network trained and tested with the Zernike features. The average classification rate is $87\pm5.4\%$ (mean $\pm$ 95% confidence interval). The performance across the 8 test sets ranged from 67-97%. Average performance on the training data was $94\pm3.6\%$.
                            Output of the BPNN
True Classification  Giantin  Hoechst  LAMP2  NOP4  Tubulin
Giantin                97%       0%      3%     0%     0%
Hoechst                 3%      93%      0%     3%     0%
LAMP2                  12%       2%     70%    10%     7%
NOP4                    0%       0%      0%    88%    13%
Tubulin                 0%       0%     12%     4%    85%

If this average classification rate seems inadequate, the following should be noted. First, a random classifier (one that is completely unable to discriminate between the image classes) would be expected to produce an average classification rate of only 20%. Second, it is possible to take advantage of the nature of the samples used for imaging to improve on this result. Specifically, if one prepares a homogeneous sample from a single class (i.e., identically prepared cells) and uses a majority rule classification scheme where the sample is classified using the result obtained for the majority of individual cells studied, it is possible to improve the classification rate. Treating the single-cell classifications as Bernoulli random variables (i.e., each cell is classified correctly or it is not) results in the following formula for the probability that the majority classification is correct

\begin{displaymath}
P_{majority}(n) = \sum_{x=\left\lfloor\frac{n}{2}\right\rfloor+1}^{n} {n \choose x} p^x (1-p)^{(n-x)}
\end{displaymath} (2.16)

where n is the number of cells examined and p is the probability of a correct classification of a single image. This analysis relies on the fact that there is no class for which the classifier achieves less than 50% correct classification. This level of performance is easily met by the back-propagation neural network classifier. For a sample size of 10 cells with a single-cell classification rate of 87%, a majority rule classifier will result in a 99% correct classification rate for the sample as a whole.
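Equation 2.16 is straightforward to evaluate directly; the sketch below (not part of the original work) sums the binomial tail, treating each single-cell classification as an independent Bernoulli trial with success probability p, and reproduces the quoted figure for 10 cells at an 87% single-cell rate.

```python
# Probability that the majority of n independently classified cells carry
# the correct label, given single-cell accuracy p (Equation 2.16).
from math import comb, floor

def p_majority(n, p):
    return sum(comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(floor(n / 2) + 1, n + 1))

# 10 cells at an 87% single-cell rate gives roughly the 99% quoted above.
print(round(p_majority(10, 0.87), 3))  # ~0.995
```

Note that the formula requires p > 0.5; below that, adding more cells makes the majority vote worse, not better, which is why the 50% per-class floor matters.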

The results above utilized more than 300 images generated using five different labels. The acquisition of these images therefore represents a significant investment of time and resources. To gain some insight into how few images per class might be needed to train a useful classifier, smaller training sets were generated by taking the first N samples from each class. Ten samples per class proved to be too few, as the LAMP2 images were classified correctly only 40% of the time (data not shown), violating the requirement for a minimum of 50% correct classification discussed in the analysis surrounding Equation 2.16. Using 15 samples per class, however, provided reasonable results. In this case the correct classification rates were 83% for giantin, 93% for Hoechst, 62% for LAMP2, 63% for NOP4, and 73% for tubulin, for an average rate of 75%. These values, while not ideal for single-cell classification, can certainly be useful with the majority rule classification approach described above. For the localization patterns discussed here, these results also indicate that it is possible to train useful classifiers using fewer than 100 total images for five classes.

Copyright ©1999 Michael V. Boland