DISCRIMINATE procedure
Performs discriminant analysis (L.H. Schmitt & P.G.N. Digby).
Options
Parameters
Description
DISCRIMINATE performs discriminant analysis (see, for example, Mardia, Kent & Bibby 1979).
The input for the procedure is given by a pointer and a factor, specified by the DATA and GROUPS parameters, respectively. The pointer contains a set of variates defining the attributes of the units. Any unit with a missing value in any of the variates is excluded from the analysis. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent (the rules here are exactly as used by the FSSPM directive). The factor specifies the pre-defined groupings of the units from which the allocation is derived (the "training set"); the units to be allocated by the analysis have missing factor values.
Printed output is controlled by the option PRINT with settings:
The NROOTS option may be used to specify how many dimensions are to be printed and retained for the latent roots and vectors and for the scores of the means and units. The distances of the units from the group means, and thus the allocation of units, are also formed from the scores in the number of dimensions specified by NROOTS. By default results will be for the full dimensionality, i.e. the smaller of the number of variates and one less than the number of groups.
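As a quick illustration of the default dimensionality described above, the following Python sketch computes the smaller of the number of variates and one less than the number of groups. The function name default_nroots is hypothetical and is not part of the procedure.

```python
def default_nroots(n_variates, n_groups):
    """Full dimensionality: the smaller of the number of variates
    and one less than the number of groups."""
    return min(n_variates, n_groups - 1)


print(default_nroots(4, 3))  # 4 variates, 3 groups -> 2
print(default_nroots(2, 6))  # 2 variates, 6 groups -> 2
```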
The REALLOCATE option specifies whether the units in the training set are to be reallocated to groups by the procedure. If the default setting no is used then their group values, either printed or saved, will be missing.
The PLOT option provides for group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means. The YROOT and XROOT options specify the roots for the axes. The TITLE, WINDOW and SCREEN options allow further control of the plots. More than one plot can be output by having a list of scalars for YROOT. In this case, the values of XROOT, TITLE, WINDOW and SCREEN are cycled in parallel. A rug-like plot is drawn if only one root is extracted or if YROOT is set to a missing value.
Results from the analysis can be saved using the parameters NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS and GDISTANCES. The structures specified for these parameters need not be declared in advance. The default is to save MEANS and SCORES in matrices. However, if you declare either as a pointer, it will instead store the results as a data matrix (i.e. a pointer of variates corresponding to the columns of the matrix). The results correspond to p dimensions, where p is the smaller of the number of variates and the number of groups minus one.
Options: PRINT, NROOTS, REALLOCATE, PLOT, YROOT, XROOT, TITLE, WINDOW, SCREEN.
Parameters: DATA, GROUPS, NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS, GDISTANCES.
Method
A canonical variates analysis (CVA) is used to obtain the scores for the group means and the LRV containing the loadings (L), roots and trace; the analysis excludes units omitted by RESTRICT, or that have missing values in the data variates or the GROUPS factor. Scores are then calculated for all the units (i.e. ignoring any restrictions or missing values), using the formula
( X L ) + ( J A )
where X is a matrix containing the full set of units-by-variables data, J is a column vector of ones, and A is a row vector of adjustments required to place the scores for the units onto the same scale as those for the group means.
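The score formula can be sketched in plain Python as follows. This only illustrates the arithmetic of ( X L ) + ( J A ); the function name unit_scores and all the numbers are made up for the example, and the procedure's own implementation may differ.

```python
def unit_scores(X, L, A):
    """Return X L + J A using row-major lists of lists.

    X: n units x v variates
    L: v variates x p dimensions (loadings)
    A: list of p adjustments (the row vector A)
    J A repeats A on every row, so A[j] is added to column j of each row.
    """
    scores = []
    for row in X:
        scores.append([
            sum(x * L[k][j] for k, x in enumerate(row)) + A[j]
            for j in range(len(A))
        ])
    return scores


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # made-up data: 3 units, 2 variates
L = [[0.5], [-1.0]]                       # made-up loadings: 1 dimension
A = [2.0]                                 # made-up adjustment

print(unit_scores(X, L, A))  # [[0.5], [-0.5], [-1.5]]
```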
Mahalanobis squared distances between the units and the group means are calculated from the canonical variate scores. Each unit is then allocated to the group for which it has the smallest Mahalanobis squared distance to the group mean.
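The allocation rule can be sketched in Python as below. The sketch assumes that the canonical variate scores are scaled so that the Mahalanobis squared distance reduces to the ordinary squared Euclidean distance in the score space; this is an assumption of the example rather than a statement about the procedure's internals. The function names and data are hypothetical.

```python
def squared_distance(score, mean):
    """Squared Euclidean distance between a unit's scores and a group mean."""
    return sum((s - m) ** 2 for s, m in zip(score, mean))


def allocate(scores, means):
    """For each unit, return (group, distances).

    group is the 1-based index of the group mean with the smallest
    squared distance; distances lists the squared distance to every mean.
    """
    result = []
    for score in scores:
        d = [squared_distance(score, m) for m in means]
        result.append((d.index(min(d)) + 1, d))
    return result


means = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]   # made-up group mean scores
scores = [[0.5, 0.5], [2.5, 1.0], [1.0, 3.0]]  # made-up unit scores

for group, distances in allocate(scores, means):
    print(group, distances)
```

With these made-up values the three units are allocated to groups 1, 2 and 3 respectively.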
There are two internal procedures: _DISAXSCALE and _DISENCLOSE.
Action with RESTRICT
The input variates and factor may be restricted. The restrictions must be identical. The canonical variates analysis is based only on the units not excluded by the restriction and having non-missing values for all data variates. Scores are calculated for all the units with a complete set of non-missing values; however, these are based only on the non-excluded units: that is, the adjustments for the canonical variate scores are calculated from the non-excluded units, and the loadings used to calculate the scores are those from the canonical variates analysis. If there is a restriction in place, the count setting of the PRINT option will produce two parallel tables, one with the number of units in the training set and another with the number of units if the data were not restricted. The table setting of the PRINT option will produce two tables, one using only those units present in the training set and another for those units excluded by the restriction.
If the restriction results in levels of the GROUPS factor being unrepresented in the training set, the group centroids for these levels are estimated from the scores of the units that were excluded, and the levels will be included in the GDISTANCES symmetric matrix. The DISTANCES parameter will include the distances to all the centroids, including those levels not in the training set. The ALLOCATION parameter will allocate to the nearest centroid even if it was not in the training set (as distinct from the NEWGROUPS factor).
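The distinction between the two kinds of allocation can be pictured with the small Python sketch below. It assumes that NEWGROUPS considers only the levels represented in the training set, which the text above implies but does not state outright; the helper nearest and the distances are made up for the example.

```python
def nearest(distances, candidates):
    """Return the level among candidates with the smallest distance."""
    return min(candidates, key=lambda level: distances[level])


# Made-up squared distances from one unit to three group centroids.
distances = {1: 4.0, 2: 9.0, 3: 1.5}
training_levels = [1, 2]  # level 3 is unrepresented in the training set
all_levels = [1, 2, 3]

print(nearest(distances, all_levels))       # ALLOCATION-style: 3
print(nearest(distances, training_levels))  # NEWGROUPS-style (assumed): 1
```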
For levels and units in the training set, plotted means are marked with symbol 1 (×) and the units with symbol 3 (+). Means for levels and units excluded by the restriction are plotted with symbols 19 and 20 respectively. Units with a missing GROUPS value are plotted with symbol 18 if they are not in the excluded set; otherwise symbol 21 is used. Polygons are not drawn around groups excluded from the training set by a restriction.
Reference
Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.