Classification with the ``best'' Features

Given that the performance of a BPNN with just the ad hoc features is nearly as good as with all of the features, it is a reasonable goal to find an ``optimal'' subset of the features. Optimal is placed in quotes because identifying the truly best subset would require an exhaustive search of all possible subsets. As one might expect, this search has been shown to be NP-complete, and therefore only suboptimal solutions are practical. Such solutions typically define a best subset using some criterion other than, but related to, the classification rate. One of these approaches is stepwise discriminant analysis (SDA). The goal of SDA is to identify those variables in a system containing several classes that best separate the classes from one another while keeping the classes themselves as tightly clustered as possible (see Section 2.2.5, p. [*] for details).
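To make the selection criterion concrete, the sketch below ranks features by a one-way ANOVA F-statistic, which measures between-class separation relative to within-class scatter. This is a simplification of true stepwise discriminant analysis, which uses partial F-tests that account for features already selected; the function names and toy data here are illustrative only.

```python
import numpy as np

def anova_f(x, y):
    """One-way ANOVA F-statistic for a single feature x across classes y:
    between-class mean square divided by within-class mean square."""
    classes = np.unique(y)
    grand_mean = x.mean()
    ss_between = sum(len(x[y == c]) * (x[y == c].mean() - grand_mean) ** 2
                     for c in classes)
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum()
                    for c in classes)
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

def rank_features(X, y):
    """Order feature indices from most to least class-separating."""
    scores = np.array([anova_f(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

# Toy data: feature 0 separates two classes; feature 1 is pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 2))
X[:, 0] += 3.0 * y          # shift class 1 along feature 0
order, scores = rank_features(X, y)
```

With this toy data, feature 0 lands at the top of the ranking; the actual analysis applied stepwise selection to all 84 features.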

Stepwise discriminant analysis was applied to the complete 10-class HeLa data set with all 84 Zernike, Haralick, and ad hoc features. Using the default significance level (p=0.15) for the tests on the F-statistics used by SDA, 54 features were returned as contributing significantly to the separation of the classes. To further reduce the number of features used in classification, the features returned by the stepwise discriminant analysis were treated as an ordered list with the ``best'' features at the top. A BPNN was trained and tested using subsets of these features, and the results are summarized in Table 3.10. The number 37 in Table 3.10 was not chosen arbitrarily; it is the number of features with p-values less than 0.0001. In other words, these features are very unlikely to produce the F-statistic values they do if the null hypothesis is true (i.e., the class means are identical). The 37 best features are listed in Table 3.11.
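The p < 0.0001 cutoff translates an F-statistic into an upper-tail probability of the F distribution. A minimal sketch of that thresholding step (the degrees of freedom and scores below are illustrative, not the actual values from the analysis):

```python
import numpy as np
from scipy.stats import f as f_dist

def select_by_pvalue(f_scores, df_between, df_within, alpha=1e-4):
    """Keep indices of features whose F-statistic has an upper-tail
    p-value below alpha under the F(df_between, df_within) distribution."""
    pvals = f_dist.sf(f_scores, df_between, df_within)
    return np.flatnonzero(pvals < alpha), pvals

# Illustrative scores: only the first is extreme enough to pass p < 0.0001.
scores = np.array([120.0, 2.0, 0.5])
keep, pvals = select_by_pvalue(scores, df_between=9, df_within=800)
```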


  
Table 3.10: Average classification rate for BPNNs trained and tested with the indicated number of features taken from the list returned by a stepwise discriminant analysis. The stepwise discriminant analysis identified a total of 54 variables as significantly contributing to the separation of the underlying classes at the p=0.15 level.
Number of Features   Classification Rate (mean $\pm$ 95% CI)
 5                   $73.6\pm5.3\%$
10                   $81.4\pm4.7\%$
15                   $80.8\pm4.8\%$
20                   $82.5\pm4.6\%$
37                   $83.4\pm4.5\%$
54                   $81.8\pm4.7\%$


  
Corrigendum - 10 April 2001, Michael Boland: An error was made in creating Table 3.11 as originally published: item 26 (Z8,6) should have been deleted, items 27-37 shifted up one position, and Z8,8 added as item 37. The corrected table is below. Note that all analysis was done correctly; only the entries in the originally published table were incorrect.

Table 3.11: List of the 37 best features from the Zernike (Z), Haralick (H), and ad hoc (A) sets as determined using stepwise discriminant analysis.
  1. A: The average number of pixels per object
  2. Z: Z4,0
  3. A: The fraction of the protein fluorescence that co-localizes with DNA
  4. Z: Z2,0
  5. H: Information measure of correlation 1
  6. A: The average object distance to the center of fluorescence
  7. A: The Euler number of the image
  8. H: Sum entropy
  9. A: The fraction of the convex hull occupied by protein fluorescence
  10. A: The fraction of non-zero pixels that are along an edge
  11. A: The ratio of the largest to the smallest object to DNA COF distance
  12. Z: Z8,0
  13. Z: Z12,2
  14. Z: Z12,0
  15. H: Information measure of correlation 2
  16. H: Correlation
  17. Z: Z7,1
  18. Z: Z4,2
  19. A: The ratio of the size of the largest object to the smallest
  20. A: The ratio of the largest to smallest value in a histogram of gradient direction
  21. A: The average object distance from the DNA COF
  22. H: Angular second moment
  23. H: Contrast
  24. H: Sum variance
  25. H: Sum average
  26. A: The number of fluorescent objects in the image
  27. H: Difference variance
  28. A: The ratio of the largest to smallest object--COF distance
  29. Z: Z10,0
  30. Z: Z1,1
  31. A: The variance of object distances from the COF
  32. Z: Z11,1
  33. A: The variances of the number of above-threshold pixels per object
  34. H: Sum of squares
  35. H: Difference entropy
  36. H: Inverse difference moment
  37. Z: Z8,8


Since the top 37 features from SDA provided the best classification rate of any subset size tested, their performance was compared to previous results. The results of training and testing a BPNN with the 37 best features are summarized in Table 3.12. Overall, these 37 features provide slightly better performance than the complete 84-element feature set (83% vs. 81%). This improvement comes largely from increases in the correct classification of actin (96% vs. 91%), transferrin receptor (62% vs. 55%), and tubulin (81% vs. 77%). Despite the improved performance, however, these features are still unable to completely distinguish giantin from GPP130 and transferrin receptor from LAMP2.


  
Table 3.12: Performance on the test data of a BPNN using the 37 best features as determined using stepwise discriminant analysis, and with no thresholding of the network outputs. The average rate of correct classification is $83\pm4.6\%$ (mean $\pm$ 95% CI) with a variance of 4.2 across all 10 networks. The average performance on the training data is $95\pm2.2\%$ with a variance of 2.8. ( 19990608).
True            Output of the Classifier
Classification  DNA    ER     Giant.  GPP    LAMP   Mito.  Nucle.  Actin  TfR    Tubul.
DNA             99%    1%     0%      0%     0%     0%     0%      0%     0%     0%
ER              0%     87%    2%      0%     1%     7%     0%      0%     2%     2%
Giantin         0%     1%     77%     19%    1%     0%     1%      0%     1%     0%
GPP130          0%     0%     16%     78%    2%     1%     1%      0%     1%     0%
LAMP2           0%     1%     5%      2%     74%    1%     1%      0%     16%    1%
Mito.           0%     8%     2%      0%     2%     79%    0%      1%     2%     6%
Nucleolin       1%     0%     1%      2%     0%     0%     95%     0%     0%     0%
Actin           0%     0%     0%      0%     0%     1%     0%      96%    0%     2%
TfR             0%     5%     1%      1%     20%    3%     0%      2%     62%    6%
Tubulin         0%     4%     0%      0%     0%     8%     0%      1%     5%     81%
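Each row of Table 3.12 is a row of a row-normalized confusion matrix. A minimal sketch of how such a table is computed from true and predicted labels (the labels below are toy values, not thesis data):

```python
import numpy as np

def confusion_rates(true_labels, pred_labels, n_classes):
    """Row-normalized confusion matrix: entry [i, j] is the fraction of
    class-i samples that the classifier assigned to class j."""
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        counts[t, p] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy example: 3 of 4 class-0 samples and 3 of 4 class-1 samples correct.
true_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred_y = np.array([0, 0, 0, 1, 1, 1, 0, 1])
rates = confusion_rates(true_y, pred_y, n_classes=2)
```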

Given the performance of the 37 best features using a BPNN without thresholding, the results obtained using thresholded outputs are not surprising (see Table 3.13). Again, the overall performance is increased slightly, and some classes (mitochondria, actin, transferrin receptor, and tubulin) show modest gains compared to the all-features result.
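The thresholding scheme assigns a sample to the ``unknown'' category when the network is not sufficiently confident. The exact rule used in the thesis is described earlier; the sketch below assumes a simple variant in which a sample is unknown unless its largest output reaches a fixed threshold.

```python
import numpy as np

UNKNOWN = -1

def classify_with_threshold(outputs, threshold=0.5):
    """Pick the class with the largest network output per sample, or
    UNKNOWN when no output reaches the threshold (assumed rule)."""
    best = outputs.argmax(axis=1)
    confident = outputs.max(axis=1) >= threshold
    return np.where(confident, best, UNKNOWN)

outputs = np.array([[0.90, 0.05, 0.05],   # clearly class 0
                    [0.40, 0.35, 0.25]])  # no confident winner
labels = classify_with_threshold(outputs, threshold=0.5)
```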


  
Table 3.13: Performance on the test data of a BPNN using the 37 best features as determined using stepwise discriminant analysis and with thresholding of the network outputs. The average rate of correct classification is $74\pm5.3\%$ (mean $\pm$ 95% CI) with a variance of 15.3 across all 10 networks for all samples and $88\%$ (variance of 3.4) for samples that are not classified as unknown. The average performance on the corresponding training data is $89\pm3.1\%$ with a variance of 15 for all samples and $97\%$ (variance of 1.3) for those not placed in the unknown category. The percentage of non-unknown samples that were classified correctly is shown in parentheses after each diagonal entry. ( 19990608).
True            Output of the Classifier
Classification  DNA        ER         Giant.     GPP        LAMP       Mito.      Nucle.     Actin      TfR        Tubul.     Unk.
DNA             98% (99%)  1%         0%         0%         0%         0%         0%         0%         0%         0%         1%
ER              0%         79% (94%)  0%         0%         0%         3%         0%         0%         0%         1%         16%
Giantin         0%         0%         68% (81%)  15%        0%         0%         0%         0%         1%         0%         16%
GPP130          0%         0%         12%        70% (82%)  1%         1%         1%         0%         1%         0%         14%
LAMP2           0%         0%         4%         1%         57% (81%)  0%         1%         0%         6%         0%         30%
Mito.           0%         5%         2%         0%         1%         71% (88%)  0%         0%         1%         2%         20%
Nucleolin       0%         0%         0%         2%         0%         0%         90% (97%)  0%         0%         0%         7%
Actin           0%         0%         0%         0%         0%         0%         0%         92% (98%)  0%         2%         6%
TfR             0%         1%         0%         0%         15%        1%         0%         0%         49% (73%)  1%         33%
Tubulin         0%         2%         0%         0%         0%         4%         0%         0%         2%         69% (90%)  23%

The performance of the kNN classifier using the 37 best features is again similar to that obtained with all of the features (see Table 3.14). The overall classification rate is 4-5% higher, with a significant increase in the rate for the actin patterns.
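For reference, the core of a kNN classifier is a majority vote among the nearest training samples. The sketch below uses Euclidean distance and a plain majority vote on toy data; the value of k, the distance metric, and the rule for assigning samples to the unknown category used in the actual experiments are not restated here.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples
    (Euclidean distance); ties break toward the smallest class label."""
    dists = np.linalg.norm(train_X - x, axis=1)
    votes = train_y[np.argsort(dists)[:k]]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[counts.argmax()]

# Toy 1-D data: two well-separated clusters.
train_X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
train_y = np.array([0, 0, 0, 1, 1, 1])
label = knn_predict(train_X, train_y, np.array([0.05]), k=3)
```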

Based on both the BPNN and kNN results, it is possible to conclude that the first 37 features returned by stepwise discriminant analysis are a better feature set than the complete 84-feature set tested above. This conclusion rests not so much on the overall classification rate, which improves only slightly with 37 features, but on the reduction in the total number of features used for classification. The ``curse of dimensionality'' is a real effect [10, p. 95], and it is therefore desirable to reduce the dimensionality of a feature-based classification problem whenever possible. These results indicate that reducing the dimensionality of this problem is, in fact, beneficial.


  
Table 3.14: Performance on the test data of a kNN classifier using the 37 best features as determined using stepwise discriminant analysis. The average rate of correct classification is $73\pm5.4\%$ (mean $\pm$ 95% CI) with a variance of 3.2 across the 10 classifiers for all samples and $81\%$ (variance of 4.4) for those samples not classified as unknown. The percentages of non-unknown samples that were classified correctly are included in parentheses. ( 19990607)
True            Output of the Classifier
Classification  DNA        ER         Giant.     GPP        LAMP       Mito.      Nucle.     Actin      TfR        Tubul.     Unk.
DNA             97% (99%)  1%         0%         0%         0%         0%         0%         0%         0%         0%         2%
ER              0%         84% (91%)  0%         0%         1%         4%         0%         0%         0%         3%         8%
Giantin         0%         1%         71% (81%)  13%        1%         1%         0%         0%         0%         0%         11%
GPP130          0%         0%         15%        69% (74%)  5%         0%         1%         0%         2%         0%         8%
LAMP2           0%         1%         3%         2%         58% (76%)  1%         2%         0%         8%         1%         23%
Mito.           0%         11%        2%         0%         3%         67% (71%)  0%         2%         1%         9%         6%
Nucleolin       0%         0%         2%         2%         3%         0%         90% (93%)  0%         0%         0%         3%
Actin           0%         0%         0%         0%         0%         0%         0%         91% (95%)  0%         5%         4%
TfR             0%         4%         1%         1%         19%        5%         0%         7%         33% (42%)  10%        21%
Tubulin         0%         5%         0%         1%         1%         8%         0%         4%         1%         67% (77%)  12%


Copyright ©1999 Michael V. Boland
1999-09-18