One fundamental change in the features calculated for the 10-class HeLa data, as compared to the 5-class CHO data, is that some of the new ad hoc features require an image depicting the localization of DNA in each cell. Because the nucleus is a relatively homogeneous structure in the cell, it is a good candidate for providing some degree of normalization to any features that are intended to capture information about the localization of a cellular protein. Furthermore, because the nucleus is the dominant structure in the cell, many other organelles are organized around it. In HeLa cells, for instance, the Golgi apparatus resides next to the nucleus, the ER tends to be concentrated to one side of the nucleus, the mitochondria and microtubules (tubulin) are most concentrated near the nucleus, and nucleoli, by definition, reside entirely within the nucleus. Based on this knowledge, features that describe the localization of proteins relative to the nucleus were included in the ad hoc set.
To determine whether the DNA-related features were useful for classifying the HeLa data, a subset of the 84 features was created by removing all DNA-related features. The result was a 78-element feature set. These features were used to train and test a BPNN, and the results are summarized in Table 3.15.
Rows give the true classification; columns give the output of the classifier.
True Classification | DNA | ER | Giant. | GPP | LAMP | Mito. | Nucle. | Actin | TfR | Tubul. |
---|---|---|---|---|---|---|---|---|---|---|
DNA | 99% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
ER | 1% | 83% | 3% | 0% | 0% | 5% | 0% | 0% | 0% | 6% |
Giantin | 0% | 2% | 74% | 20% | 1% | 1% | 1% | 0% | 0% | 0% |
GPP130 | 2% | 0% | 17% | 75% | 1% | 0% | 3% | 0% | 2% | 0% |
LAMP2 | 0% | 1% | 4% | 2% | 71% | 2% | 5% | 0% | 15% | 0% |
Mito. | 0% | 9% | 2% | 0% | 3% | 74% | 0% | 0% | 2% | 9% |
Nucleolin | 1% | 0% | 2% | 2% | 0% | 2% | 93% | 0% | 0% | 0% |
Actin | 0% | 0% | 0% | 0% | 0% | 2% | 0% | 91% | 1% | 6% |
TfR | 0% | 5% | 2% | 0% | 24% | 3% | 0% | 5% | 53% | 7% |
Tubulin | 0% | 6% | 0% | 0% | 1% | 6% | 1% | 4% | 7% | 75% |
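As a quick check on how an overall classification rate can be read off the confusion matrix above, the following minimal sketch averages the diagonal entries. It assumes that the overall rate is the unweighted mean of the per-class rates, which the text does not state explicitly; the variable and function names are illustrative only.

```python
# Confusion matrix transcribed from the table above (percent).
# Rows: true class; columns: classifier output.
classes = ["DNA", "ER", "Giantin", "GPP130", "LAMP2",
           "Mito.", "Nucleolin", "Actin", "TfR", "Tubulin"]

confusion = [
    [99, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 83, 3, 0, 0, 5, 0, 0, 0, 6],
    [0, 2, 74, 20, 1, 1, 1, 0, 0, 0],
    [2, 0, 17, 75, 1, 0, 3, 0, 2, 0],
    [0, 1, 4, 2, 71, 2, 5, 0, 15, 0],
    [0, 9, 2, 0, 3, 74, 0, 0, 2, 9],
    [1, 0, 2, 2, 0, 2, 93, 0, 0, 0],
    [0, 0, 0, 0, 0, 2, 0, 91, 1, 6],
    [0, 5, 2, 0, 24, 3, 0, 5, 53, 7],
    [0, 6, 0, 0, 1, 6, 1, 4, 7, 75],
]


def overall_rate(matrix):
    """Unweighted mean of the diagonal (assumed definition of the overall rate)."""
    diagonal = [matrix[i][i] for i in range(len(matrix))]
    return sum(diagonal) / len(diagonal)


def largest_confusion(matrix, names):
    """Return (true class, predicted class, percent) for the largest off-diagonal entry."""
    best = max(
        ((i, j) for i in range(len(matrix)) for j in range(len(matrix)) if i != j),
        key=lambda ij: matrix[ij[0]][ij[1]],
    )
    return names[best[0]], names[best[1]], matrix[best[0]][best[1]]


rate = overall_rate(confusion)
worst = largest_confusion(confusion, classes)
print(f"overall rate: {rate:.1f}%")
print(f"largest confusion: {worst[0]} -> {worst[1]} ({worst[2]}%)")
```

Under this assumption the mean of the diagonal is 78.8%, which agrees with the 79% quoted in the text after rounding.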
Interestingly, there is only a small difference in the overall classification rate between the feature sets with (81%) and without (79%) DNA information. Nor is this drop due to decreased performance on a small number of classes; rather, it reflects 1-3% drops in all classes except DNA and actin. As can be seen in Table 3.16, the BPNN with thresholded outputs behaves similarly in that its performance is only slightly degraded without the DNA features. The kNN performance (see Table 3.17) follows the same trend as the BPNN results, but does show a larger drop (10 percentage points) in the classification rate for nucleolin.
One conclusion that can be drawn from all of these results is that the contribution of the DNA features to discrimination of these 10 classes is small. This is somewhat surprising, however, given that three of the 37 "best" features listed in Table 3.11 require a DNA image. Apparently, other features in the complete feature set can capture some of the discriminating information contained in the DNA-requiring features. Based on the stepwise discriminant analysis results, however, these other features must not be as good at separating classes. Although there is no overwhelming evidence for retaining the DNA features, their continued use is justifiable. First, they do provide an increase in overall classification rate, however modest. Second, three of them were among the top 37 features selected via stepwise discriminant analysis, indicating that they have some ability to separate classes. Finally, even though their utility is in question for these 10 classes, the DNA features should be retained as descriptors of protein localization because they contain information that is expected to be more useful for future patterns.
Rows give the true classification; columns give the output of the classifier.
True Classification | DNA | ER | Giant. | GPP | LAMP | Mito. | Nucle. | Actin | TfR | Tubul. | Unk. |
---|---|---|---|---|---|---|---|---|---|---|---|
DNA | 99% (99%) | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1% |
ER | 0% | 76% (90%) | 2% | 0% | 0% | 2% | 0% | 0% | 0% | 5% | 15% |
Giantin | 0% | 1% | 66% (80%) | 14% | 0% | 0% | 1% | 0% | 0% | 0% | 18% |
GPP130 | 0% | 0% | 14% | 69% (80%) | 1% | 0% | 1% | 0% | 1% | 0% | 14% |
LAMP2 | 0% | 0% | 3% | 1% | 55% (77%) | 1% | 1% | 0% | 10% | 0% | 28% |
Mito. | 0% | 7% | 2% | 0% | 1% | 62% (79%) | 0% | 0% | 1% | 5% | 22% |
Nucleolin | 0% | 0% | 0% | 2% | 0% | 0% | 88% (97%) | 0% | 0% | 0% | 10% |
Actin | 0% | 0% | 0% | 0% | 0% | 1% | 0% | 84% (95%) | 1% | 3% | 12% |
TfR | 0% | 3% | 0% | 0% | 16% | 1% | 0% | 2% | 43% (62%) | 3% | 31% |
Tubulin | 0% | 3% | 0% | 0% | 0% | 4% | 1% | 1% | 2% | 60% (84%) | 29% |
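The parenthesized values in the table above are not defined in this section. One plausible reading, offered here as an assumption rather than as the author's stated method, is that each is the correct-classification rate computed over only those cells that were not assigned to the unknown class, i.e. correct / (100 - unknown). The sketch below tests that reading against the tabulated numbers; all names are illustrative.

```python
# Diagonal (correct) and "Unk." percentages transcribed from the table above.
correct = {"DNA": 99, "ER": 76, "Giantin": 66, "GPP130": 69, "LAMP2": 55,
           "Mito.": 62, "Nucleolin": 88, "Actin": 84, "TfR": 43, "Tubulin": 60}
unknown = {"DNA": 1, "ER": 15, "Giantin": 18, "GPP130": 14, "LAMP2": 28,
           "Mito.": 22, "Nucleolin": 10, "Actin": 12, "TfR": 31, "Tubulin": 29}
# Parenthesized values as printed in the table.
reported = {"DNA": 99, "ER": 90, "Giantin": 80, "GPP130": 80, "LAMP2": 77,
            "Mito.": 79, "Nucleolin": 97, "Actin": 95, "TfR": 62, "Tubulin": 84}


def rate_excluding_unknown(correct_pct, unknown_pct):
    """Correct rate among classified cells only (assumed definition)."""
    return 100.0 * correct_pct / (100.0 - unknown_pct)


recomputed = {name: rate_excluding_unknown(correct[name], unknown[name])
              for name in correct}

for name in correct:
    print(f"{name:10s} recomputed {recomputed[name]:5.1f}%  reported {reported[name]}%")
```

Because the table entries are already rounded to whole percentages, the recomputed values can only be expected to match the printed ones to within about one percentage point.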
Rows give the true classification; columns give the output of the classifier.
True Classification | DNA | ER | Giant. | GPP | LAMP | Mito. | Nucle. | Actin | TfR | Tubul. | Unk. |
---|---|---|---|---|---|---|---|---|---|---|---|
DNA | 99% (99%) | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1% |
ER | 0% | 85% (86%) | 0% | 0% | 3% | 3% | 0% | 0% | 0% | 7% | 2% |
Giantin | 0% | 2% | 64% (75%) | 16% | 1% | 1% | 1% | 0% | 0% | 0% | 15% |
GPP130 | 0% | 0% | 16% | 64% (72%) | 5% | 0% | 3% | 0% | 1% | 0% | 11% |
LAMP2 | 0% | 1% | 2% | 3% | 63% (80%) | 0% | 2% | 0% | 6% | 0% | 21% |
Mito. | 0% | 14% | 0% | 0% | 3% | 48% (57%) | 0% | 1% | 4% | 15% | 16% |
Nucleolin | 2% | 0% | 2% | 4% | 10% | 0% | 70% (75%) | 0% | 4% | 0% | 7% |
Actin | 0% | 0% | 0% | 0% | 0% | 2% | 0% | 72% (83%) | 0% | 12% | 14% |
TfR | 0% | 7% | 0% | 1% | 25% | 5% | 0% | 4% | 28% (36%) | 8% | 22% |
Tubulin | 0% | 8% | 0% | 2% | 2% | 7% | 0% | 2% | 5% | 62% (71%) | 13% |