Feature Selection

To choose a subset of features from the combined Haralick and
Zernike sets described above, two feature selection methods were
applied to the training data. The first was the `STEPDISC`
procedure in SAS (SAS Institute, Cary, NC, USA), an implementation of
stepwise discriminant analysis (SDA) [30], run with its default
parameters.

The goal of stepwise discriminant analysis is to sequentially identify
those variables (features) that optimize a criterion describing their
ability to separate the classes from one another while at the same time
keeping each individual class as tightly clustered as possible. The
criterion used is Wilks' $\Lambda$, which is defined as

$$\Lambda(\mathbf{x}) = \frac{|\mathbf{W}|}{|\mathbf{T}|}$$

where $\mathbf{x}$ is a vector of the features that are currently included in the system,

$$\mathbf{W} = \sum_{j=1}^{q} \sum_{i=1}^{n_j} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_{j})(\mathbf{x}_{ij} - \bar{\mathbf{x}}_{j})^{T} \tag{2.8}$$

is the matrix of within-groups sums of squares and cross products for the features under consideration (here $\mathbf{x}_{ij}$ is the $i$th sample of class $j$, $\bar{\mathbf{x}}_{j}$ is the mean of class $j$, $q$ is the number of classes, and $n_j$ is the number of samples in class $j$), and

$$\mathbf{T} = \sum_{j=1}^{q} \sum_{i=1}^{n_j} (\mathbf{x}_{ij} - \bar{\mathbf{x}})(\mathbf{x}_{ij} - \bar{\mathbf{x}})^{T} \tag{2.9}$$

is the matrix of total sums of squares and cross products, with $\bar{\mathbf{x}}$ the mean over all samples.
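As a concrete sketch (not part of the original text), the quantities in equations (2.8) and (2.9) can be computed directly with NumPy; the function name and toy data below are illustrative assumptions.

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' lambda |W|/|T| for feature matrix X (n_samples x n_features)
    and integer class labels y."""
    X = np.asarray(X, dtype=float)
    # Total sums-of-squares and cross-products matrix T (eq. 2.9):
    # deviations from the overall mean.
    D = X - X.mean(axis=0)
    T = D.T @ D
    # Within-groups SSCP matrix W (eq. 2.8): deviations from each class mean.
    W = np.zeros_like(T)
    for c in np.unique(y):
        Xc = X[y == c]
        Dc = Xc - Xc.mean(axis=0)
        W += Dc.T @ Dc
    return np.linalg.det(W) / np.linalg.det(T)

# Toy example: two well-separated classes give a lambda close to 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
lam = wilks_lambda(X, y)
print(lam)  # small value: the features separate the classes well
```

When all samples belong to one class, **W** equals **T** and the ratio is exactly 1, which matches the interpretation that lower values indicate better separation.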

Low values of $\Lambda$
indicate features that better discriminate the
classes. To accommodate the stepwise nature of the process, the
partial $\Lambda$
statistic is used. This statistic describes the
increase in the discrimination ability of a system after adding a new
feature, $x_{p+1}$:

$$\Lambda_{\text{partial}} = \frac{\Lambda(x_1, \ldots, x_p, x_{p+1})}{\Lambda(x_1, \ldots, x_p)}$$

To facilitate the ability to decide whether adding a new feature to
the system will increase discrimination *significantly*, Wilks'
partial-$\Lambda$
is converted to an F-statistic for which it is
possible to assign a level of statistical significance: what is the
probability, given the null hypothesis that there is no separation
between groups, that one would obtain a value larger than the observed

$$F = \frac{1 - \Lambda_{\text{partial}}}{\Lambda_{\text{partial}}} \cdot \frac{n - p - q}{q - 1}$$

where $n$ is the number of training samples, $p$ is the number of features already included in the system, and $q$ is the number of classes.
The process of stepwise discriminant analysis involves the following steps [31,32]:

1. Calculate the within-groups (**W**) and total (**T**) sums of squares and cross-products matrices for all features. Let *w*_{ii} = **W**(*i*, *i*), *t*_{ii} = **T**(*i*, *i*), and *V*_{i} = *w*_{ii}/*t*_{ii}. *V*_{i} is therefore analogous to Wilks' $\Lambda$.

2. Calculate the F-to-remove statistic for each feature *i* already included in the system:

   $$F_{\text{remove}}(i) = \frac{1 - V_i}{V_i} \cdot \frac{n - p - q + 1}{q - 1}$$

   The value of *F*_{remove}(*i*) is used to calculate a significance level for an F random variable with degrees of freedom (*n* - *p* - *q* + 1) and (*q* - 1). The feature with the lowest *F*_{remove} value that also corresponds to a significance level (p) greater than an assigned threshold (p = 0.15 in `STEPDISC`) is removed from the list of features. Note that this step is skipped when entering the first feature (i.e., when no features have yet been entered).

3. If a feature was removed in step 2, the **W** and **T** matrices must be updated to reflect that change. Jennrich [31] describes "sweep" operators that can be applied to these matrices such that a feature is changed from a dependent variable to a predictor variable (on entry), or from a predictor to a dependent variable (on removal). This alleviates the need to recalculate the **W** and **T** matrices at each step of the process. The modified matrices are then passed to the next step.

4. Calculate the F-to-enter statistic for each feature *j* not already included:

   $$F_{\text{enter}}(j) = \frac{1 - V_j}{V_j} \cdot \frac{n - p - q}{q - 1}$$

   This time, a significance level for an F random variable with degrees of freedom (*n* - *p* - *q*) and (*q* - 1) is calculated, and the feature with the largest *F*_{enter} whose significance is below a threshold (p = 0.15 in `STEPDISC`) is entered.

5. Again, the **W** and **T** matrices are updated using the sweep operator to change any entered feature from dependent to predictor status.

6. Return to step 2 with the modified **W** and **T** matrices. When no features can be removed or entered (i.e., the significance tests all fail to achieve their respective thresholds), the process stops.
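The steps above can be sketched in Python as follows. This is a minimal illustration, not the `STEPDISC` implementation: the names and toy data are my own, and it recomputes $\Lambda$ from determinants at every step rather than using Jennrich's sweep operators, which a production implementation would use for efficiency.

```python
import numpy as np
from scipy.stats import f as f_dist

def wilks_lambda(X, y, cols):
    """Wilks' lambda |W|/|T| restricted to the feature columns in cols."""
    Xs = np.asarray(X, float)[:, cols]
    D = Xs - Xs.mean(axis=0)
    T = D.T @ D
    W = np.zeros_like(T)
    for c in np.unique(y):
        Dc = Xs[y == c] - Xs[y == c].mean(axis=0)
        W += Dc.T @ Dc
    return np.linalg.det(W) / np.linalg.det(T)

def stepwise_select(X, y, p_enter=0.15, p_remove=0.15, max_iter=50):
    """Greedy stepwise selection driven by partial-lambda F tests."""
    n, n_feat = X.shape
    q = len(np.unique(y))
    included = []
    for _ in range(max_iter):
        changed = False
        # Removal step (skipped while no features are entered): drop the
        # least significant included feature if its p-value exceeds p_remove.
        if included:
            p = len(included)
            lam_full = wilks_lambda(X, y, included)
            worst, worst_p = None, p_remove
            for i in included:
                rest = [k for k in included if k != i]
                lam_rest = wilks_lambda(X, y, rest) if rest else 1.0
                lam_part = lam_full / lam_rest
                F = (1 - lam_part) / lam_part * (n - p - q + 1) / (q - 1)
                pv = f_dist.sf(F, q - 1, n - p - q + 1)
                if pv > worst_p:
                    worst, worst_p = i, pv
            if worst is not None:
                included.remove(worst)
                changed = True
        # Entry step: add the candidate with the largest F-to-enter whose
        # significance level falls below p_enter.
        p = len(included)
        lam_in = wilks_lambda(X, y, included) if included else 1.0
        best, best_F = None, -np.inf
        for j in range(n_feat):
            if j in included:
                continue
            lam_part = wilks_lambda(X, y, included + [j]) / lam_in
            F = (1 - lam_part) / lam_part * (n - p - q) / (q - 1)
            pv = f_dist.sf(F, q - 1, n - p - q)
            if pv < p_enter and F > best_F:
                best, best_F = j, F
        if best is not None:
            included.append(best)
            changed = True
        if not changed:  # neither test fired: the process stops
            break
    return included

# Toy data: feature 0 depends on the class, feature 1 is pure noise.
rng = np.random.default_rng(1)
X = np.column_stack([
    np.r_[rng.normal(0, 1, 20), rng.normal(5, 1, 20)],
    rng.normal(0, 1, 40),
])
y = np.array([0] * 20 + [1] * 20)
sel = stepwise_select(X, y)
print(sel)  # the informative feature (index 0) should be selected
```

The determinant recomputation makes each iteration O(number of features) times more expensive than the sweep-based update, which is why the sweep operators matter for large feature sets such as the combined Haralick and Zernike features.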

The second method for identifying a subset of features used a modified
version of the multiple discriminant analysis criterion described by
Duda and Hart [10, p. 120]. The features selected were those with the
largest ratio of the variance of the feature calculated
over all samples in the training set to the sum of the variances of
that feature calculated within each class (i.e., image type) in the
training set:

$$J(i) = \frac{\sigma_i^2}{\sum_{j=1}^{q} \sigma_{ij}^2}$$

where $\sigma_i^2$ is the variance of feature $i$ over all training samples and $\sigma_{ij}^2$ is the variance of feature $i$ over the training samples belonging to class $j$.
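This criterion is simple to compute per feature; the sketch below (function name and toy data are illustrative, not from the original) scores and ranks features with NumPy.

```python
import numpy as np

def variance_ratio(X, y):
    """For each feature: variance over all samples divided by the sum of
    the per-class variances (larger values indicate better discriminators)."""
    X = np.asarray(X, dtype=float)
    total_var = X.var(axis=0)            # sigma_i^2 over the whole training set
    within = np.zeros_like(total_var)
    for c in np.unique(y):
        within += X[y == c].var(axis=0)  # sum of per-class sigma_ij^2
    return total_var / within

# Toy data: feature 0 depends on the class, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.r_[rng.normal(0, 1, 30), rng.normal(4, 1, 30)],
    rng.normal(0, 1, 60),
])
y = np.array([0] * 30 + [1] * 30)
scores = variance_ratio(X, y)
ranked = np.argsort(scores)[::-1]  # feature indices, best first
print(scores, ranked)
```

Unlike the stepwise procedure, this is a univariate filter: it scores each feature independently and ignores correlations between features, which makes it much cheaper but blind to redundant selections.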