
Discussion

At this point, it is possible to assess progress toward the overall goal of this work, as set forth in Chapter 1: Is it possible to numerically describe protein localization patterns in cultured eukaryotic cells in a way that is biologically useful?

Given the results presented above, it is indeed possible to describe protein localization numerically, and then to use that information to recognize previously unseen instances of known patterns. More specific conclusions can be drawn regarding the data set used to produce these results, the features used to describe the localization patterns, and the classifiers used to classify the patterns.

While the CHO data set proved adequate for an initial proof-of-concept, it has some limitations that should be addressed in future work. First, the five classes are visually too dissimilar. Although the patterns could be recognized reliably and automatically, members of the biological community frequently questioned the utility of a system that could distinguish patterns they could distinguish themselves (implying an expectation that the automated system should do better than a trained expert). Regardless of these expectations, the systems developed thus far have clear application to some classes of biological problems. Chief among these is high-throughput screening, where the ability to discriminate among a handful of localization patterns is useful. A second, related improvement to future data sets is to include more classes. Since it is a long-term goal to describe a large number of localization patterns (the exact number is not known), the methods need to be tested against more of those patterns. A third limitation of the CHO data relates to the cells themselves. Because they tend to be small and do not spread well on coverslips, CHO cells do not provide good images of protein localization patterns. Furthermore, because hamsters are not among the most common species used in research, relatively few monoclonal antibodies are available against CHO proteins. This scarcity limits the number of localization patterns that can readily be studied in CHO cells. Finally, the results above clearly indicate that more data are needed for each class. The large ranges of BPNN performance across the 8 test sets (see Tables 2.5, 2.6, and 2.8, for example) are due to the very small numbers of images that were available in each class.

In terms of the feature sets explored with these data, the Haralick texture features are preferable for at least two reasons. First, they result in a lower dimension feature space (13-dimensional vs. 49-dimensional for the Zernike moments) without degrading classifier performance. Lower dimensional systems are always preferable to higher dimensional ones because they help to avoid the ``curse of dimensionality''[10, p. 95]. Second, the Haralick features can be used with less complex back-propagation neural networks. Even when the number of hidden nodes was reduced to five, the Haralick features could still discriminate the five classes nearly as well as they could with 20 hidden nodes. This aspect of the Haralick features is important because it means that there are fewer BPNN weights to calculate during the training process. One way to improve on the Haralick and Zernike features is to supplement them with features motivated by more intuitive descriptions of protein localization. For example, a feature that captures the amount by which a particular pattern overlaps with the nucleus of the cell may prove more useful than knowing the value of the Zernike moments for that same pattern.
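To make these two kinds of features concrete, the following is a minimal sketch (not the implementation used in this work) of how texture statistics of the Haralick type are computed from a gray-level co-occurrence matrix, together with a hypothetical "nuclear overlap" feature of the intuitive kind proposed above. The function names, the choice of only two of the 13 statistics, and the binary-mask formulation of the overlap feature are all assumptions made for illustration:

```python
import numpy as np

def cooccurrence(img, levels, dx=1, dy=0):
    """Normalized, symmetric gray-level co-occurrence matrix for one pixel
    offset. img must already be quantized to integer levels in [0, levels)."""
    P = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            i, j = img[y, x], img[y + dy, x + dx]
            P[i, j] += 1
            P[j, i] += 1  # count each pair in both orders (symmetric matrix)
    return P / P.sum()

def haralick_subset(P):
    """Two of Haralick's 13 texture statistics, computed from the matrix P:
    angular second moment (uniformity) and contrast."""
    i, j = np.indices(P.shape)
    asm = float((P ** 2).sum())                # high for uniform textures
    contrast = float((P * (i - j) ** 2).sum()) # high for abrupt level changes
    return asm, contrast

def nuclear_overlap(pattern_mask, nucleus_mask):
    """Hypothetical intuitive feature: the fraction of above-threshold
    fluorescence that falls inside a nuclear mask."""
    return float((pattern_mask & nucleus_mask).sum()) / float(pattern_mask.sum())
```

As a sanity check on the statistics, a perfectly flat image gives an angular second moment of 1 and a contrast of 0, while a checkerboard (maximally abrupt texture) gives a lower second moment and a higher contrast.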

Finally, these results indicate that the BPNN is preferable to the classification tree. Although the BPNN does not provide the interpretability of the classification tree, its better performance makes it the preferred choice. The major drawback of the BPNN is the impracticality of optimizing all of its parameters: learning rate, momentum, number of hidden nodes, etc. Empirically, this limitation is not terribly important, since the BPNN could be trained and could generalize well after simply selecting reasonable values for each parameter.
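For reference, a one-hidden-layer back-propagation network of the kind discussed here can be sketched in a few lines of numpy. This is not the implementation used in this work: the logistic activation, squared-error cost, batch gradient descent, absence of a momentum term, and the toy AND problem standing in for the real feature vectors are all simplifying assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    # append a constant-1 column so each layer carries a bias weight
    return np.hstack([a, np.ones((a.shape[0], 1))])

def train_bpnn(X, y, hidden=5, lr=1.0, epochs=10000, seed=0):
    """One-hidden-layer back-propagation network with logistic units,
    trained by batch gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1] + 1, hidden))
    W2 = rng.normal(0.0, 0.5, (hidden + 1, y.shape[1]))
    Xb = add_bias(X)
    for _ in range(epochs):
        H = sigmoid(Xb @ W1)          # hidden-layer activations
        Hb = add_bias(H)
        out = sigmoid(Hb @ W2)        # network outputs
        # delta rule: error times the derivative of the logistic function
        d_out = (out - y) * out * (1.0 - out)
        d_hid = (d_out @ W2[:-1].T) * H * (1.0 - H)  # drop the bias row
        W2 -= lr * Hb.T @ d_out
        W1 -= lr * Xb.T @ d_hid
    return W1, W2

def predict(W1, W2, X):
    return sigmoid(add_bias(sigmoid(add_bias(X) @ W1)) @ W2)
```

Even this stripped-down version illustrates the parameter-selection point made above: reasonable fixed values for the learning rate and the number of hidden nodes suffice to fit a simple two-class problem without any tuning.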

Each of these lessons (data set content and complexity, feature set selection, and classifier choice) was learned and in turn applied to the next phase of this work.


Copyright ©1999 Michael V. Boland
1999-09-18