CLUSTER directive

Forms a non-hierarchical classification.


Options

PRINT = strings
Printed output required (criterion, optimum, units, typical, initial); default * i.e. no printing

DATA = matrix or pointer
Data from which the classification is formed, supplied as a units-by-variates matrix or as a pointer containing the variates of the data matrix

CRITERION = string
Criterion for clustering (sums, predictive, within, Mahalanobis); default sums

INTERCHANGE = string
Permitted moves between groups (transfer, swop); default transfer (which implies swop also)

START = factor
Initial classification; default * i.e. random starting configurations are formed (see NSTARTS and SEED) or, if SEED=-1, the units are split, in order, into NGROUPS classes of nearly equal size

NSTARTS = scalar
Number of starting configurations to be used; default 10

SEED = scalar
Seed for the random numbers used to form random starting configurations; default 0


Parameters

NGROUPS = scalars
Numbers of classes into which the units are to be classified: note, the values of the scalars must be in descending order

GROUPS = factors
Saves the classification formed for each number of classes

CRITERIONVALUE = scalars
Saves the criterion values (representing within-class homogeneity)

BCRITERIONVALUE = scalars
Saves the subsidiary criterion values (representing between-class heterogeneity)

MEANS = matrices
Saves the variate means for the groups of each classification

PREDICTORS = matrices
Saves the group predictors from maximal predictive classification


Description

Printed output from the CLUSTER directive is controlled by the PRINT option, which has the following possible settings.

    criterion
prints the optimal criterion value.

    optimum
prints the optimal classification.

    units
prints the data with the units ordered into the optimal classes.

    typical
prints a typical value for each class: for maximal predictive classification this is the class predictor; for the other methods it is the class mean.

    initial
if this is set, the requested sections of output are also printed for the initial classification.

The DATA option supplies the data to be classified: the single structure must be either a matrix, with rows corresponding to the units and columns to the variables, or a pointer whose values are the identifiers of the variates in the data matrix. Note that CLUSTER always operates on a matrix, and so will copy the variate values into a matrix if you supply a pointer as input; thus for large data sets it is better to supply a matrix.
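
   The copy from a pointer of variates into a units-by-variates matrix can be pictured with the short Python sketch below. It is an illustration only, not GenStat code: the variate names and values are hypothetical, and the function `variates_to_matrix` is an assumed helper rather than anything GenStat provides.

```python
# Hypothetical data: three variates (columns), each observed on four units.
height = [1.2, 1.5, 1.1, 1.8]
weight = [10.0, 12.5, 9.0, 15.0]
age = [3, 4, 2, 6]


def variates_to_matrix(variates):
    """Copy equal-length variates into a units-by-variates matrix (list of rows)."""
    lengths = {len(v) for v in variates}
    if len(lengths) != 1:
        raise ValueError("all variates must have the same number of units")
    # zip(*variates) pairs up the i-th value of every variate, giving row i.
    return [list(row) for row in zip(*variates)]


# Rows are units, columns are variates, as the DATA option requires.
data_matrix = variates_to_matrix([height, weight, age])
```

Every value is duplicated by the copy, which is why supplying a matrix directly is preferable for large data sets.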

   The CRITERION option specifies which criterion CLUSTER is to optimize, the default being sums. The four settings are:

    sums
maximize the between-group sum of squares;

    predictive
maximal predictive classification;

    within
minimize the determinant of the pooled within-class dispersion matrix;

    mahalanobis
maximize the total Mahalanobis squared distance between the groups.
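
As an illustration of the sums criterion, the Python sketch below computes the within-group and between-group sums of squares for a given classification. It is not GenStat code and makes no claim about how CLUSTER computes or reports its criterion values; the function names and the small data set are hypothetical. Because the two quantities always add up to the total sum of squares about the grand mean, maximizing the between-group sum of squares is the same as minimizing the within-group sum of squares.

```python
def column_means(rows):
    """Mean of each column of a list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sum_sq_about(rows, centre):
    """Sum of squared deviations of all values in rows from centre."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, centre))


def sums_of_squares(data, groups):
    """Return (within, between) sums of squares for a classification.

    data: list of unit rows; groups: group label for each unit.
    """
    grand = column_means(data)
    within = 0.0
    between = 0.0
    for g in sorted(set(groups)):
        members = [row for row, lab in zip(data, groups) if lab == g]
        centre = column_means(members)
        within += sum_sq_about(members, centre)
        between += len(members) * sum((c - m) ** 2 for c, m in zip(centre, grand))
    return within, between


# Hypothetical data: two tight pairs of units, far apart from each other.
data = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
good = [1, 1, 2, 2]   # keeps each pair together
poor = [1, 2, 1, 2]   # splits each pair

w_good, b_good = sums_of_squares(data, good)
w_poor, b_poor = sums_of_squares(data, poor)
```

In this example the classification that keeps the pairs together has the larger between-group value, and the within and between values sum to the same total for both classifications.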

The INTERCHANGE option specifies which types of interchange (transfers or swops) are to be used. The default is transfer, which is taken to imply that both transfers and swops are used, since a swop is simply two transfers. If you set INTERCHANGE=swop, only swops are used. If INTERCHANGE=*, the algorithm does not attempt to improve on the initial classification; this is useful, in conjunction with the PRINT=initial setting, for displaying the results for an existing classification that you do not wish to improve.
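
   The two kinds of move can be illustrated with the following Python sketch, which is not GenStat code and does not reproduce the search that CLUSTER performs. The functions `transfer`, `swop` and `group_sizes` are hypothetical helpers; the sketch only shows what each move does to a classification and why a swop can be expressed as two transfers.

```python
def transfer(groups, unit, new_group):
    """Return a copy of the classification with one unit moved to new_group."""
    moved = list(groups)
    moved[unit] = new_group
    return moved


def swop(groups, unit_a, unit_b):
    """Exchange the groups of two units, expressed as two transfers."""
    step = transfer(groups, unit_a, groups[unit_b])
    return transfer(step, unit_b, groups[unit_a])


def group_sizes(groups):
    """Count the units in each group."""
    sizes = {}
    for g in groups:
        sizes[g] = sizes.get(g, 0) + 1
    return sizes


start = [1, 1, 1, 2, 2]                  # group label of each of five units
after_transfer = transfer(start, 0, 2)   # unit 0 moves from group 1 to group 2
after_swop = swop(start, 0, 3)           # units 0 and 3 exchange groups
```

A swop leaves the group sizes unchanged, whereas a single transfer alters them, so a search restricted to swops cannot change the sizes set by the initial classification.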

   The START option can be used to supply a factor to define the initial classification. This might be constructed using the CLASSIFY procedure. If there are k classes, CLASSIFY finds the k units that are furthest apart in the multi-dimensional space defined by the data variates. These are then used as the nuclei for the classes, with each remaining unit being allocated to the class containing the nearest nucleus.
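
   The allocation step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CLASSIFY procedure itself: the nuclei are taken as already chosen, squared Euclidean distance is assumed as the measure of nearness, and the function names and data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two units (an assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def allocate_to_nuclei(data, nuclei):
    """Allocate every unit to the class of its nearest nucleus.

    data: list of unit rows; nuclei: indices of the units used as nuclei.
    Returns a class label (1..k) for every unit.
    """
    labels = []
    for row in data:
        distances = [squared_distance(row, data[i]) for i in nuclei]
        # Ties go to the first (lowest-numbered) nucleus.
        labels.append(distances.index(min(distances)) + 1)
    return labels


# Hypothetical data: units 0 and 3 are assumed to be the chosen nuclei.
data = [[0.0, 0.0], [1.0, 0.5], [9.0, 9.0], [10.0, 10.0], [0.5, 1.0]]
start = allocate_to_nuclei(data, nuclei=[0, 3])
```

The resulting labels play the role of the factor that would be supplied through the START option.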

   Alternatively, if START is not specified, CLUSTER will try several random starts, and save the best classification that it finds. By default 10 starts are tried, but you can specify a different number using the NSTARTS option. The SEED option supplies the seed for the random numbers that are used to form the initial random groupings. The default of zero continues the existing sequence of random numbers if CLUSTER has already been used in the current GenStat job. If CLUSTER has not yet been used, GenStat picks a seed at random. Alternatively, if you set SEED=-1, a systematic initial grouping is used (and only one start). For example, with 97 units to be classified into 10 groups, this systematic grouping would put the first 10 units into the first group, the 11th to 20th into the second group, and so on; the last three groups would contain only nine units each.
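
   The systematic grouping in this example can be reproduced with the short Python sketch below. It is an illustration rather than GenStat code: the function `systematic_start` is hypothetical, and placing the larger groups first is an assumption chosen to agree with the 97-unit example.

```python
def systematic_start(n_units, n_groups):
    """Split units, in order, into groups of nearly equal size.

    The first (n_units mod n_groups) groups receive one extra unit.
    Returns a group label (1..n_groups) for every unit.
    """
    base, extra = divmod(n_units, n_groups)
    labels = []
    for g in range(1, n_groups + 1):
        size = base + 1 if g <= extra else base
        labels.extend([g] * size)
    return labels


start = systematic_start(97, 10)
sizes = [start.count(g) for g in range(1, 11)]
```

Here the first seven groups contain ten units each and the last three contain nine, giving 70 + 27 = 97 units in total.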

   The first parameter, NGROUPS, specifies the number of classes to be formed. Any single-valued structure can be supplied here. Often you will want several classifications of a single data set, into different numbers of groups. In this case NGROUPS should be a list of the numbers of groups in descending order. To form the initial classification for the second number of groups, CLUSTER takes the optimal classification for the first number of groups and reallocates some units to make a smaller number of groups. This is repeated, as often as required, to provide initial classifications for all the later analyses; hence the need to specify the numbers in descending order.

   The GROUPS parameter supplies a list of identifiers of factors to save the optimal classifications. The CRITERIONVALUE parameter can specify a list of scalars to save the criterion values for each number of groups, and the subsidiary criterion values can be saved (also in scalars) using the BCRITERIONVALUE parameter. The MEANS parameter can save matrices containing the means of the variates within the groups of each classification, and the PREDICTORS parameter can save matrices containing the group predictors from maximal predictive classifications.

 

Options: PRINT, DATA, CRITERION, INTERCHANGE, START, NSTARTS, SEED.

Parameters: NGROUPS, GROUPS, CRITERIONVALUE, BCRITERIONVALUE, MEANS, PREDICTORS.


Action with RESTRICT

Any restrictions, for example on variates in a DATA pointer, are ignored.


See also

Directives: FSIMILARITY, HCLUSTER, REDUCE.

Procedures: CLASSIFY, BCLASSIFICATION, CINTERACTION, MASCLUSTER.