Next: Classification Up: Materials and Methods Previous: Haralick texture features

Feature Selection

To choose a subset of features from the combined Haralick and Zernike sets described above, two feature selection methods were applied to the training data. The first was the STEPDISC procedure in SAS (SAS Institute, Cary, NC, USA), an implementation of stepwise discriminant analysis (SDA) [30]; the default parameters of the procedure were used.

The goal of stepwise discriminant analysis is to sequentially identify those variables (features) that optimize a criterion describing their ability to separate the classes from one another while keeping each individual class as tightly clustered as possible. The criterion used is Wilks' $\Lambda$, which is defined as

\begin{displaymath}\Lambda(\mathbf{x}) = \frac{\left|\mathbf{W}\right|}{\left|\mathbf{T}\right|}
\end{displaymath} (2.7)

where $\mathbf{x}=[x_1,x_2,\dots,x_p]$ is a vector of the features that are currently included in the system,

\begin{displaymath}\mathbf{W}(i,j) = \sum_{g=1}^q\sum_{t=1}^{n_g}(x_{igt}-\bar{x}_{ig})(x_{jgt}-\bar{x}_{jg})
\end{displaymath} (2.8)

is the matrix of within-groups sums of squares and cross products for the features under consideration, and

\begin{displaymath}\mathbf{T}(i,j) = \sum_{g=1}^q\sum_{t=1}^{n_g}(x_{igt}-\bar{x}_{i})(x_{jgt}-\bar{x}_{j})
\end{displaymath} (2.9)

is the matrix of total sums of squares and cross products. $q$ is the number of classes, $n_g$ is the number of samples in class $g$, $x_{igt}$ is the value of feature $i$ for sample $t$ of class $g$, $\bar{x}_{ig}$ is the mean of feature $i$ over class $g$, and $\bar{x}_i$ is the mean of feature $i$ over all samples from all classes.
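For illustration, the quantities in Equations 2.7-2.9 can be computed directly for a small synthetic data set (the numbers below are invented; NumPy is assumed):

```python
import numpy as np

# Invented toy data: q = 2 classes, two features, n_g = 3 samples each.
# Rows are samples t, columns are features i.
X1 = np.array([[1.0, 2.0], [1.2, 2.1], [0.9, 1.8]])
X2 = np.array([[3.0, 4.0], [3.1, 4.2], [2.8, 3.9]])

# W (Eq. 2.8): sums of squares and cross products about each class mean.
W = sum((Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0)) for Xg in (X1, X2))

# T (Eq. 2.9): sums of squares and cross products about the grand mean.
X_all = np.vstack([X1, X2])
D = X_all - X_all.mean(axis=0)
T = D.T @ D

# Wilks' Lambda (Eq. 2.7): |W| / |T|; small values indicate classes that are
# well separated and tightly clustered.
wilks_lambda = np.linalg.det(W) / np.linalg.det(T)
print(wilks_lambda)
```

Because the two invented classes are far apart relative to their internal scatter, the resulting $\Lambda$ is close to zero.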

Low values of $\Lambda$ indicate features that better discriminate the classes. To accommodate the stepwise nature of the process, the partial $\Lambda$ statistic is used. This statistic describes the increase in the discrimination ability of the system after adding a new feature, $x_{p+1}$:

\begin{displaymath}\Lambda(x_{p+1}\cdot\mathbf{x}) = \frac{\Lambda(x_1,\dots,x_{p+1})}{\Lambda(x_1,\dots,x_p)}
\end{displaymath} (2.10)

To decide whether adding a new feature to the system will increase discrimination significantly, Wilks' partial $\Lambda$ is converted to an F-statistic to which a level of statistical significance can be assigned: given the null hypothesis that there is no separation between the groups, what is the probability of obtaining a value larger than

\begin{displaymath}F = \left(\frac{n-q-p}{q-1}\right)\frac{1-\Lambda(x_{p+1}\cdot\mathbf{x})}{\Lambda(x_{p+1}\cdot\mathbf{x})}
\end{displaymath} (2.11)

where $n$ is the number of data samples in all classes, $p$ is the number of features currently in the analysis, and $q$ is the number of classes. Large values of $F$ indicate better discrimination for a particular feature. This version of the F-statistic is called the F-to-enter criterion because it is used to decide whether feature $x_{p+1}$ should be entered into the system. A corresponding value, the F-to-remove criterion, is used to decide whether a feature should be taken out of the system.
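As a numerical sketch of Equations 2.10 and 2.11 (all values here are invented for illustration):

```python
# Invented values: n = 60 samples, q = 3 classes, p = 4 features already
# entered; adding x_{p+1} changes Wilks' Lambda from 0.30 to 0.24.
n, q, p = 60, 3, 4
lam_with = 0.24      # Lambda(x_1, ..., x_{p+1})
lam_without = 0.30   # Lambda(x_1, ..., x_p)

# Partial Lambda (Eq. 2.10): ratio of the two Wilks' statistics.
partial_lambda = lam_with / lam_without

# F-to-enter (Eq. 2.11); under the null hypothesis it follows an F
# distribution with (q - 1) and (n - q - p) degrees of freedom.
F = ((n - q - p) / (q - 1)) * (1 - partial_lambda) / partial_lambda
print(F)
```

Here the partial $\Lambda$ is 0.8 and $F \approx 6.6$, which for an F distribution with 2 and 53 degrees of freedom would be highly significant, so the candidate feature would be entered.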

The process of stepwise discriminant analysis involves the following steps [31,32]:

1. Calculate the within-groups ($\mathbf{W}$) and total ($\mathbf{T}$) sums of squares and cross-products matrices for all features. Let $w_{ii}=\mathbf{W}(i,i)$, $t_{ii}=\mathbf{T}(i,i)$, and $V_{i}=\frac{w_{ii}}{t_{ii}}$. $V_i$ is therefore analogous to Wilks' $\Lambda$.
2. Calculate the F-to-remove statistic for each feature $i$ already included in the system:

\begin{displaymath}F_{remove}(i) = \left(\frac{n-p-q+1}{q-1}\right)\left(V_i-1\right)
\end{displaymath} (2.12)

The value of $F_{remove}(i)$ is used to calculate a significance level for an F random variable with degrees of freedom $(n-p-q+1)$ and $(q-1)$. The feature with the lowest $F_{remove}$ value that also corresponds to a significance level ($p$) greater than an assigned threshold ($p=0.15$ in STEPDISC) is removed from the list of features. Note that this step is skipped when entering the first feature (i.e., when no features have yet been entered).

3. If a feature was removed in step 2, the $\mathbf{W}$ and $\mathbf{T}$ matrices must be updated to reflect that change. Jennrich [31] describes ``sweep'' operators that can be applied to these matrices such that a feature is changed from a dependent variable to a predictor variable (on entry), or from a predictor to a dependent variable (on removal). This alleviates the need to recalculate the $\mathbf{W}$ and $\mathbf{T}$ matrices at each step of the process. The modified matrices are then passed to the next step.

4. Calculate the F-to-enter statistic for each feature $j$ not already included in the system:

\begin{displaymath}F_{enter}(j) = \left(\frac{n-p-q}{q-1}\right)\frac{1-V_j}{V_j}
\end{displaymath} (2.13)

This time, a significance level for an F random variable with degrees of freedom $(n-p-q)$ and $(q-1)$ is calculated, and the feature with the largest $F_{enter}$ whose significance level is below the threshold ($p=0.15$ in STEPDISC) is entered.

5. Again, the $\mathbf{W}$ and $\mathbf{T}$ matrices are updated using the sweep operator to change any newly entered feature from dependent to predictor status.
6. Return to step 2 with the modified $\mathbf{W}$ and $\mathbf{T}$ matrices. When no features can be removed or entered (i.e., the significance tests all fail to meet their respective thresholds), the process stops.
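The loop above can be sketched in Python. This is a simplified illustration rather than the STEPDISC implementation: it recomputes the restricted determinants at each step instead of using Jennrich's sweep operators, and it thresholds the F statistics directly (fixed, invented F-to-enter/F-to-remove cutoffs) rather than converting them to significance levels; the demonstration data are synthetic.

```python
import numpy as np

def wilks(W, T, idx):
    """Wilks' Lambda |W|/|T| restricted to the feature subset idx (Eq. 2.7)."""
    if not idx:
        return 1.0
    ix = np.ix_(idx, idx)
    return np.linalg.det(W[ix]) / np.linalg.det(T[ix])

def stepwise_select(X, y, f_enter=4.0, f_remove=3.9):
    """Greedy stepwise feature selection driven by partial Wilks' Lambda."""
    classes = np.unique(y)
    n, nfeat = X.shape
    q = len(classes)

    # Within-groups (Eq. 2.8) and total (Eq. 2.9) SSCP matrices, all features.
    W = np.zeros((nfeat, nfeat))
    for g in classes:
        d = X[y == g] - X[y == g].mean(axis=0)
        W += d.T @ d
    d = X - X.mean(axis=0)
    T = d.T @ d

    included, seen = [], set()
    while True:
        state = tuple(sorted(included))
        if state in seen:              # cycle guard for this simplified sketch
            return sorted(included)
        seen.add(state)
        changed = False

        # Step 2: remove the weakest included feature if it is insignificant.
        p = len(included)
        if p >= 1:
            lam_full = wilks(W, T, included)
            scores = []
            for i in included:
                rest = [k for k in included if k != i]
                v = lam_full / wilks(W, T, rest)        # partial Lambda of i
                f = ((n - p - q + 1) / (q - 1)) * (1 - v) / v
                scores.append((f, i))
            f, i = min(scores)
            if f < f_remove:
                included.remove(i)
                changed = True

        # Step 4: enter the strongest excluded feature if it is significant.
        p = len(included)
        excluded = [j for j in range(nfeat) if j not in included]
        if excluded:
            lam_cur = wilks(W, T, included)
            scores = []
            for j in excluded:
                v = wilks(W, T, included + [j]) / lam_cur  # partial Lambda of j
                f = ((n - p - q) / (q - 1)) * (1 - v) / v
                scores.append((f, j))
            f, j = max(scores)
            if f > f_enter:
                included.append(j)
                changed = True

        if not changed:
            return sorted(included)

# Synthetic demonstration: feature 0 separates three classes strongly,
# feature 1 moderately, feature 2 is pure noise.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0, 0.0], [5.0, 1.0, 0.0], [10.0, -1.0, 0.0]])
X = np.vstack([rng.normal(m, 1.0, size=(30, 3)) for m in means])
y = np.repeat([0, 1, 2], 30)
sel = stepwise_select(X, y)
print(sel)  # the informative features should be selected
```

Recomputing determinants costs more than sweeping but keeps the sketch short; the sweep-based version visits the same sequence of feature subsets.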

The second method for identifying a subset of features used a modified version of the multiple discriminant analysis criterion described by Duda and Hart [10, p. 120]. The features selected were those with the largest ratio of the variance of the feature calculated over all samples in the training set to the sum of the variances of that feature calculated for each class (i.e., image type) in the training set:

\begin{displaymath}\frac{\mathrm{var}(\mathbf{f})}{\sum_{\forall c}\mathrm{var}(\mathbf{f}_c)}
\end{displaymath} (2.14)

where $\mathbf{f}_c$ contains only feature values from class $c$ and $\mathbf{f}$ contains feature values from all image classes. The goal of this criterion is to identify features that widely separate the classes from one another (large total variance) while keeping the classes themselves as tightly clustered as possible (small sum of within-class variances).
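A minimal sketch of this criterion on invented data (two synthetic classes; feature 0 is discriminative, feature 2 merely noisy):

```python
import numpy as np

# Invented training data: two classes, three features. Feature 0 separates
# the classes; feature 2 varies a lot but identically within every class.
rng = np.random.default_rng(0)
f_c1 = np.column_stack([rng.normal(0.0, 1.0, 50),
                        rng.normal(0.0, 1.0, 50),
                        rng.normal(0.0, 5.0, 50)])
f_c2 = np.column_stack([rng.normal(10.0, 1.0, 50),
                        rng.normal(0.5, 1.0, 50),
                        rng.normal(0.0, 5.0, 50)])
f_all = np.vstack([f_c1, f_c2])

# Eq. 2.14, evaluated per feature: total variance over the sum of the
# within-class variances. High ratios flag discriminative features.
ratio = f_all.var(axis=0) / (f_c1.var(axis=0) + f_c2.var(axis=0))
print(ratio)
```

Only feature 0 scores well: features 1 and 2 have total variance comparable to their within-class variance, so their ratios stay near 0.5 while feature 0's separation inflates the numerator.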

Copyright ©1999 Michael V. Boland