DISCRIMINATE procedure

Performs discriminant analysis (L.H. Schmitt & P.G.N. Digby).


Options

PRINT = strings
Printed output from the analysis (counts, lrv, tests, icorrelations, correlations, adjustments, means, gdistances, scores, distances, newgroups, table); default coun

NROOTS = scalar
The number of dimensions to be used for printed and saved output, and used in calculating the distances and the allocation of units; default is to use the full dimensionality

REALLOCATE = string
Whether units from the training set are to be reallocated to groups (no, yes); default no

PLOT = strings
Features for the plots (means, mlabels, scores, polygons, confidencecircle); default mean, scor, poly (Note: * suppresses plotting)

YROOT = scalars
Specifies roots for plotting on y-axes

XROOT = scalars
Specifies roots for plotting on x-axes

TITLE = strings
Titles for plots

WINDOW = scalars
Windows for plots

SCREEN = strings
Action before each plot (keep, clear); default clea


Parameters

DATA = pointers
Each pointer contains a set of variates to be analysed

GROUPS = factors
Defines the groupings of the units in each training set, with missing values for the units to be allocated

NEWGROUPS = factors
Saves allocations (and reallocations)

ALLOCATION = factors
Saves allocations to groups including those not present in the training set

MEANS = matrices or pointers
Saves scores for group means

SCORES = matrices or pointers
Saves scores for units

DISTANCES = matrices
Saves the squared distances between units and group means

LRV = LRVs
Saves the LRVs from the canonical variates analyses

ADJUSTMENTS = matrices
Saves adjustments to the canonical variates analyses

GDISTANCES = symmetric matrices
Saves the distances between groups


Description

DISCRIMINATE performs discriminant analysis (see, for example, Mardia, Kent & Bibby 1979).

   The input for the procedure is given by a pointer and a factor, specified by the DATA and GROUPS parameters, respectively. The pointer contains a set of variates defining the attributes of the units. Any unit with a missing value in any of the variates is excluded from the analysis. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent (the rules here are exactly as used by the FSSPM directive). The factor specifies the pre-defined groupings of the units from which the allocation is derived (the "training set"); the units to be allocated by the analysis have missing factor values.

   Printed output is controlled by the option PRINT with settings:

    counts
to print tables of the number of units in each group with a complete set of observations;

    lrv
to print the canonical variate loadings, the latent roots and the trace;

    tests
to print chi-square tests as CVA does;

    icorrelations
to print the within-group correlation matrix of the input variates;

    correlations
to print the within-group correlations between the input and canonical variates;

    adjustments
to print the adjustments required to the canonical variate scores;

    means
to print canonical variate scores for the group means;

    gdistances
to print the inter-group distances as CVA does;

    scores
to print canonical variate scores for the units;

    distances
to print Mahalanobis squared distances between the units and the group means;

    newgroups
to print the initial grouping and the allocation of units to groups;

    table
to print tables of counts of allocations.

   The NROOTS option may be used to specify how many dimensions are to be printed and retained for the latent roots and vectors and for the scores of the means and units. The distances of the units from the group means, and thus the allocation of units, are also formed from the scores in the number of dimensions specified by NROOTS. By default results will be for the full dimensionality, i.e. the smaller of the number of variates and one less than the number of groups.
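   The default dimensionality rule can be illustrated with a short Python sketch. It is not part of the procedure; the function names are hypothetical, and the sketch does not check whether a requested NROOTS exceeds the full dimensionality.

```python
def full_dimensionality(n_variates, n_groups):
    """Smaller of the number of variates and one less than the number of groups."""
    return min(n_variates, n_groups - 1)


def dimensions_used(n_variates, n_groups, nroots=None):
    """Dimensions used for output: NROOTS if given, else the full dimensionality.

    Hypothetical helper for illustration; it does not validate nroots.
    """
    if nroots is None:
        return full_dimensionality(n_variates, n_groups)
    return nroots


# Four variates and three groups give a full dimensionality of two.
default_dims = dimensions_used(4, 3)
# Setting nroots=1 restricts output and distances to the first root only.
reduced_dims = dimensions_used(4, 3, nroots=1)
```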

   The REALLOCATE option specifies whether the units in the training set are to be reallocated to groups by the procedure. If the default setting no is used then their group values, either printed or saved, will be missing.

   The PLOT option provides for group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means. The YROOT and XROOT options specify the roots for the axes. The TITLE, WINDOW and SCREEN options allow further control of the plots. More than one plot can be output by having a list of scalars for YROOT. In this case, the values of XROOT, TITLE, WINDOW and SCREEN are cycled in parallel. A rug-like plot is drawn if only one root is extracted or if YROOT is set to a missing value.

   Results from the analysis can be saved using the parameters NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS and GDISTANCES. The structures specified for these parameters need not be declared in advance. By default MEANS and SCORES are saved as matrices. However, if you declare either of them as a pointer, the results are stored instead as a data matrix (i.e. a pointer of variates corresponding to the columns of the matrix). The results correspond to p dimensions, where p is the smaller of the number of variates and the number of groups minus one.

 

Options: PRINT, NROOTS, REALLOCATE, PLOT, YROOT, XROOT, TITLE, WINDOW, SCREEN.

Parameters: DATA, GROUPS, NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS, GDISTANCES.


Method

A canonical variates analysis (CVA) is used to obtain the scores for the group means and the LRV containing the loadings (L), roots and trace; the analysis excludes units omitted by RESTRICT, or that have missing values in the data variates or the GROUPS factor. Scores are then calculated for all the units (i.e. ignoring any restrictions or missing values), using the formula

( X L ) + ( J A )

where X is a matrix containing the full set of units-by-variables data, J is a column vector of ones, and A is a row vector of adjustments required to place the scores for the units onto the same scale as those for the group means.
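   As a minimal sketch of the formula (X L) + (J A), the following Python function computes scores from plain nested lists. The function name and the numeric values are invented for illustration; they are not output from the procedure.

```python
def canonical_scores(X, L, A):
    """Return (X L) + (J A).

    X: units-by-variates data (list of rows),
    L: variates-by-roots loadings (list of rows),
    A: adjustments, one per root.
    Adding J A simply adds the same adjustment A[k] to every unit's k-th score.
    """
    n_roots = len(A)
    scores = []
    for row in X:
        scores.append([
            sum(x * L[j][k] for j, x in enumerate(row)) + A[k]
            for k in range(n_roots)
        ])
    return scores


# Illustrative values only: two units, two variates, two roots.
X = [[1.0, 2.0], [3.0, 4.0]]
L = [[1.0, 0.0], [0.5, 2.0]]
A = [-1.0, 0.5]
scores = canonical_scores(X, L, A)
```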

   Mahalanobis squared distances between the units and the group means are calculated from the canonical variate scores. Each unit is then allocated to the group for which it has the smallest Mahalanobis squared distance to the group mean.
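   The allocation rule can be sketched in Python as follows. The sketch assumes that the canonical variate scores are scaled so that the squared Euclidean distance between a unit's scores and a group mean's scores serves as the Mahalanobis squared distance; the function names, the 1-based group numbering and the tie-breaking rule (first group wins) are choices made for this illustration and are not taken from the procedure.

```python
def squared_distances(unit_scores, mean_scores, nroots=None):
    """Squared distances from each unit to each group mean.

    Only the first nroots dimensions are used; None means all of them.
    Assumes scores are scaled so that Euclidean distance in score space
    corresponds to Mahalanobis distance.
    """
    result = []
    for u in unit_scores:
        result.append([
            sum((a - b) ** 2 for a, b in zip(u[:nroots], m[:nroots]))
            for m in mean_scores
        ])
    return result


def allocate(distances):
    """Allocate each unit to the group with the smallest squared distance.

    Groups are numbered from 1; ties go to the first group (sketch behaviour).
    """
    return [row.index(min(row)) + 1 for row in distances]


# Illustrative scores only: three group means and three units in two dimensions.
means = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
units = [[1.0, 0.5], [3.5, -0.5], [0.5, 2.0]]
d = squared_distances(units, means)
groups = allocate(d)
```

   Passing nroots=1 to squared_distances would base the distances, and hence the allocation, on the first root only, in the spirit of the NROOTS option.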

   There are two internal procedures: _DISAXSCALE and _DISENCLOSE.


Action with RESTRICT

The input variates and factor may be restricted, but the restrictions must be identical. The canonical variates analysis is based only on the units that are not excluded by the restriction and that have non-missing values for all data variates. Scores are calculated for all the units with a complete set of non-missing values. However, these are based only on the non-excluded units: i.e. the adjustments for the canonical variate scores are calculated from the non-excluded units, and the loadings used to calculate the scores are those from the canonical variates analysis. If there is a restriction in place, the counts setting of the PRINT option will produce two parallel tables, one with the number of units in the training set and another with the number of units if the data were not restricted. The table setting of the PRINT option will produce two tables, one using only those units present in the training set and another for those units excluded by the restriction.

   If the restriction results in levels of the GROUPS factor being unrepresented in the training set, the group centroids for these levels are estimated from the scores of the units that were excluded, and the levels will be included in the GDISTANCES symmetric matrix. The DISTANCES parameter will include the distances to all the centroids, including those for levels not in the training set. The ALLOCATION parameter will allocate to the nearest centroid even if it was not in the training set (as distinct from the NEWGROUPS factor).
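   One way to picture the centroid estimation is the Python sketch below. It assumes that the centroid of an unrepresented level is the plain mean of the scores of the excluded units at that level; the text above does not state the estimator, so this is an assumption, and the function and argument names are hypothetical.

```python
def centroids_from_excluded(scores, groups, excluded, trained_levels):
    """Estimate centroids for levels that are absent from the training set.

    scores: canonical variate scores, one row per unit,
    groups: group level per unit (None for a missing value),
    excluded: True where the unit is excluded by the restriction,
    trained_levels: set of levels represented in the training set.
    Assumption: the centroid is the mean score of the excluded units.
    """
    sums, counts = {}, {}
    for s, g, ex in zip(scores, groups, excluded):
        if ex and g is not None and g not in trained_levels:
            acc = sums.setdefault(g, [0.0] * len(s))
            for k, v in enumerate(s):
                acc[k] += v
            counts[g] = counts.get(g, 0) + 1
    return {g: [t / counts[g] for t in sums[g]] for g in sums}


# Illustrative data only: level 3 occurs solely among the excluded units.
scores = [[0.0, 0.0], [1.0, 1.0], [4.0, 2.0], [6.0, 4.0]]
groups = [1, 1, 3, 3]
excluded = [False, False, True, True]
extra = centroids_from_excluded(scores, groups, excluded, trained_levels={1})
```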

   For levels and units in the training set, plotted means are marked with symbol 1 (×) and the units with symbol 3 (+). Means for levels and units excluded by the restriction are plotted with symbols 19 and 20 respectively. Units with a missing GROUPS value are plotted with symbol 18 if they are not in the excluded set; otherwise symbol 21 is used. Polygons are not drawn around groups excluded from the training set by a restriction.


Reference

Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.