
Classification Without a DNA Image

One fundamental change in the features calculated for the 10-class HeLa data, as compared to the 5-class CHO data, is that some of the new ad hoc features require an image depicting the localization of DNA in each cell. Because the nucleus is a relatively homogeneous structure in the cell, it is a good candidate for providing some degree of normalization to any features that are intended to capture information about the localization of a cellular protein. Furthermore, because the nucleus is the dominant structure in the cell, many other organelles are organized around it. In HeLa cells, for instance, the Golgi apparatus resides next to the nucleus, the ER tends to be concentrated on one side of the nucleus, the mitochondria and microtubules (tubulin) are most concentrated near the nucleus, and nucleoli, by definition, reside entirely within the nucleus. Based on this knowledge, features that describe the localization of proteins relative to the nucleus were included in the ad hoc set.

To determine whether the DNA-related features were useful for classifying the HeLa data, a subset of the 84 features was created by removing all six DNA-related features. The result was a 78-element feature set. These features were used to train and test a BPNN, and the results are summarized in Table 3.15.
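The construction of the reduced feature set can be sketched as follows. This is only an illustration: the column positions in `DNA_DEPENDENT` and the helper name `drop_dna_features` are hypothetical, since the positions of the DNA-dependent features within the 84-element vector are not given in this section.

```python
# Hypothetical layout: 84 features per cell. The positions of the six
# DNA-dependent features are NOT given in the text; these are made up.
N_FEATURES = 84
DNA_DEPENDENT = {78, 79, 80, 81, 82, 83}  # assumed column indices


def drop_dna_features(row):
    """Return a copy of one feature vector without the DNA-dependent entries."""
    if len(row) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(row)}")
    return [value for index, value in enumerate(row)
            if index not in DNA_DEPENDENT]


# Toy feature vector whose values are simply their own column indices.
example_row = list(range(N_FEATURES))
reduced_row = drop_dna_features(example_row)
print(len(reduced_row))  # 78
```

Removing columns by index, rather than recomputing features, keeps the remaining 78 values identical to those used with the full set, so any change in classifier performance can be attributed to the missing DNA information.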


  
Table 3.15: Performance on the test data of a BPNN using all features except those dependent on the DNA image, and no thresholding of the network outputs. The average rate of correct classification is $79\pm5\%$ (mean $\pm$ 95% CI) with a variance of 2.8 across all 10 networks. The average performance on the training data is $94\%$ with a variance of 3.0. ( 19990527)
True            Output of the Classifier
Classification      DNA     ER Giant.    GPP   LAMP  Mito. Nucle.  Actin    TfR Tubul.
DNA                 99%     1%     0%     0%     0%     0%     0%     0%     0%     0%
ER                   1%    83%     3%     0%     0%     5%     0%     0%     0%     6%
Giantin              0%     2%    74%    20%     1%     1%     1%     0%     0%     0%
GPP130               2%     0%    17%    75%     1%     0%     3%     0%     2%     0%
LAMP2                0%     1%     4%     2%    71%     2%     5%     0%    15%     0%
Mito.                0%     9%     2%     0%     3%    74%     0%     0%     2%     9%
Nucleolin            1%     0%     2%     2%     0%     2%    93%     0%     0%     0%
Actin                0%     0%     0%     0%     0%     2%     0%    91%     1%     6%
TfR                  0%     5%     2%     0%    24%     3%     0%     5%    53%     7%
Tubulin              0%     6%     0%     0%     1%     6%     1%     4%     7%    75%
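As a consistency check, the average rate of correct classification quoted in the caption of Table 3.15 can be recovered from the diagonal of the table. The sketch below assumes that the average is an unweighted mean over the 10 classes; that assumption is not stated in the text, but it reproduces the reported 79% after rounding.

```python
# Diagonal of Table 3.15: percent of each true class classified correctly.
diagonal_3_15 = {
    "DNA": 99, "ER": 83, "Giantin": 74, "GPP130": 75, "LAMP2": 71,
    "Mito.": 74, "Nucleolin": 93, "Actin": 91, "TfR": 53, "Tubulin": 75,
}

# Unweighted mean across classes (an assumption about how the average is formed).
mean_correct = sum(diagonal_3_15.values()) / len(diagonal_3_15)
print(f"{mean_correct:.1f}")  # 78.8
```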

Interestingly, there is only a small difference in the overall classification rate between the feature sets with (81%) and without (79%) DNA information. Moreover, this drop is not due to decreased performance on a small number of classes, but rather to drops of 1-3 percentage points in all classes except DNA and actin. As can be seen in Table 3.16, the BPNN with thresholded outputs behaves similarly, in that its performance is only slightly degraded without the DNA features. The kNN performance (see Table 3.17) follows the same trend as the BPNN results, but does show a more significant drop (10 percentage points) in the classification rate for nucleolin.
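For readers unfamiliar with output thresholding, the general idea can be sketched as follows. The exact rule and threshold used for the BPNN are not described in this section, so the function `classify_with_threshold`, the rule "largest output must reach the threshold", and the value 0.5 are all assumptions made purely for illustration.

```python
UNKNOWN = "Unknown"


def classify_with_threshold(outputs, class_names, threshold=0.5):
    """Pick the class with the largest output, or UNKNOWN if that output
    falls below the threshold (assumed rule, assumed threshold value)."""
    best_index = max(range(len(outputs)), key=lambda i: outputs[i])
    if outputs[best_index] < threshold:
        return UNKNOWN
    return class_names[best_index]


names = ["DNA", "ER", "Giantin"]
print(classify_with_threshold([0.90, 0.05, 0.05], names))  # confident case
print(classify_with_threshold([0.40, 0.35, 0.25], names))  # ambiguous case
```

Under any rule of this kind, ambiguous samples are moved to an unknown category instead of being forced into a class, which lowers the rate over all samples but raises the rate among the samples that are actually assigned a class.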

One conclusion that can be drawn from all of these results is that the contribution of the DNA features to discrimination of these 10 classes is small. This is somewhat surprising, however, given that three of the 37 ``best'' features listed in Table 3.11 require a DNA image. Apparently, other features in the complete feature set can capture some of the discriminating information contained in the DNA-requiring features. Based on the stepwise discriminant analysis results, however, these other features must not be as good at separating classes. Although there is no overwhelming evidence for retaining the DNA features, their continued use is justifiable. First, they do provide an increase in overall classification rate, however modest. Second, three of them were among the top 37 features selected via stepwise discriminant analysis, indicating that they have some ability to separate classes. Finally, even though their utility is in question for these 10 classes, the DNA features should be retained as descriptors of protein localization because they contain information that is expected to be more useful for future patterns.


  
Table 3.16: Performance on the test data of a BPNN using all features except those dependent on a DNA image, and with thresholding of the network outputs. The average rate of correct classification is $70\pm5.5\%$ (mean $\pm$ 95% CI) with a variance of 12.1 across all 10 networks for all samples and $84\%$ (variance of 10) for samples that are not classified as unknown (average of values in parentheses, below). The average performance on the corresponding training data is $88\pm3.2\%$ with a variance of 17 for all samples and $96\%$ (variance of 1.5) for those not placed in the unknown category. The percentages of non-unknown samples that were classified correctly are included in parentheses. ( 19990527)
True            Output of the Classifier
Classification      DNA     ER Giant.    GPP   LAMP  Mito. Nucle.  Actin    TfR Tubul.   Unk.
DNA                 99%     1%     0%     0%     0%     0%     0%     0%     0%     0%     1%
                  (99%)
ER                   0%    76%     2%     0%     0%     2%     0%     0%     0%     5%    15%
                         (90%)
Giantin              0%     1%    66%    14%     0%     0%     1%     0%     0%     0%    18%
                                (80%)
GPP130               0%     0%    14%    69%     1%     0%     1%     0%     1%     0%    14%
                                       (80%)
LAMP2                0%     0%     3%     1%    55%     1%     1%     0%    10%     0%    28%
                                              (77%)
Mito.                0%     7%     2%     0%     1%    62%     0%     0%     1%     5%    22%
                                                     (79%)
Nucleolin            0%     0%     0%     2%     0%     0%    88%     0%     0%     0%    10%
                                                            (97%)
Actin                0%     0%     0%     0%     0%     1%     0%    84%     1%     3%    12%
                                                                   (95%)
TfR                  0%     3%     0%     0%    16%     1%     0%     2%    43%     3%    31%
                                                                          (62%)
Tubulin              0%     3%     0%     0%     0%     4%     1%     1%     2%    60%    29%
                                                                                 (84%)
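The relationship between the two sets of numbers in Table 3.16 can be illustrated with a short calculation. The sketch assumes that each parenthesized value is the number of correct classifications divided by the number of samples not placed in the unknown category; the helper name `rate_among_classified` is hypothetical. Because the table entries are rounded, the estimates agree with the reported values only to within about one percentage point. The unweighted means of the two columns, 70.2 and 84.3, are consistent with the 70% and 84% quoted in the caption.

```python
# Per class from Table 3.16:
# (correct % of all samples, unknown %, reported % correct among non-unknown)
rows_3_16 = {
    "DNA": (99, 1, 99), "ER": (76, 15, 90), "Giantin": (66, 18, 80),
    "GPP130": (69, 14, 80), "LAMP2": (55, 28, 77), "Mito.": (62, 22, 79),
    "Nucleolin": (88, 10, 97), "Actin": (84, 12, 95), "TfR": (43, 31, 62),
    "Tubulin": (60, 29, 84),
}


def rate_among_classified(correct_pct, unknown_pct):
    """Percent correct among samples that were not labelled unknown."""
    return 100.0 * correct_pct / (100.0 - unknown_pct)


for name, (correct, unknown, reported) in rows_3_16.items():
    estimate = rate_among_classified(correct, unknown)
    print(f"{name:10s} estimate {estimate:5.1f}%   reported {reported}%")

# Unweighted class means, to compare with the caption.
mean_all = sum(c for c, _, _ in rows_3_16.values()) / len(rows_3_16)
mean_classified = sum(r for _, _, r in rows_3_16.values()) / len(rows_3_16)
print(f"{mean_all:.1f} {mean_classified:.1f}")
```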


  
Table 3.17: Performance on the test data of a kNN classifier using all features except those dependent on a DNA image. The average rate of correct classification is $66\pm5.7\%$ (mean $\pm$ 95% CI) with a variance of 6.3 across all 10 classifiers for all samples and $75\%$ (variance of 2.6) for those samples not classified as unknown. The percentages of non-unknown samples that were classified correctly are included in parentheses. (19990607)
True            Output of the Classifier
Classification      DNA     ER Giant.    GPP   LAMP  Mito. Nucle.  Actin    TfR Tubul.   Unk.
DNA                 99%     1%     0%     0%     0%     0%     0%     0%     0%     0%     1%
                  (99%)
ER                   0%    85%     0%     0%     3%     3%     0%     0%     0%     7%     2%
                         (86%)
Giantin              0%     2%    64%    16%     1%     1%     1%     0%     0%     0%    15%
                                (75%)
GPP130               0%     0%    16%    64%     5%     0%     3%     0%     1%     0%    11%
                                       (72%)
LAMP2                0%     1%     2%     3%    63%     0%     2%     0%     6%     0%    21%
                                              (80%)
Mito.                0%    14%     0%     0%     3%    48%     0%     1%     4%    15%    16%
                                                     (57%)
Nucleolin            2%     0%     2%     4%    10%     0%    70%     0%     4%     0%     7%
                                                            (75%)
Actin                0%     0%     0%     0%     0%     2%     0%    72%     0%    12%    14%
                                                                   (83%)
TfR                  0%     7%     0%     1%    25%     5%     0%     4%    28%     8%    22%
                                                                          (36%)
Tubulin              0%     8%     0%     2%     2%     7%     0%     2%     5%    62%    13%
                                                                                 (71%)


Copyright ©1999 Michael V. Boland
1999-09-18