BASSESS directive

Assesses potential splits for regression and classification trees.


Options

Y = variate or factor
Response variate for a regression tree, or factor specifying the groupings for a classification tree

SELECTION = dummy
Returns the identifier of the X variate or factor used in the best split

TEST = expression
Logical expression representing the best split

MAXSPLITPOINT = scalar or variate
When SELECTION is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when SELECTION is a factor with unordered levels, it returns a variate containing the levels allocated to the first split

MAXCRITERION = scalar
Maximum value obtained for the selection criterion

NOSELECTION = scalar
Returns the value 1 if no split has been selected, otherwise 0

FMETHOD = string
Selection method to use when Y is a factor (Gini, MPI); default Gini

ANTIENDCUTFACTOR = string
Anti-end-cut factor to use when Y is a factor (classnumber, reciprocalentropy); default * i.e. none

WEIGHTS = variate
Weights; default * i.e. all weights 1

TOLERANCE = scalar
Tolerance multiplier used, for example, to check for equality of x-values; default * i.e. set automatically for the implementation concerned


Parameters

X = variates or factors
Variates or factors available for making the split; these need not be supplied if none of the other parameters are required

ORDERED = strings
Whether factor levels are ordered (yes, no); default no

SPLITPOINT = scalars or variates
Saves details of the best split found for each X variable; when X is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when X is a factor with unordered levels, it returns a variate containing the levels allocated to the first split

CRITERIONVALUE = scalars
Saves the value of the selection criterion for the best split found for each X variable


Description

BASSESS selects splits for use when constructing classification or regression trees. The Y option specifies the factor defining the groupings for a classification tree, or the response variate for a regression tree. The x-variables that are available to make the split are supplied by the X parameter. They can be variates, or factors with either ordered or unordered levels as indicated by the ORDERED parameter. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas a factor called Drug with levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' would be regarded as unordered.

   In a regression tree, the accuracy of each node is the sum of the squared distances of the values of the response variate from their mean for the observations at the node, divided by the total number of observations. The potential splits are assessed by their effect on the accuracy, that is, the difference between the initial accuracy and the sum of the accuracies of the two successor nodes resulting from the split.
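
   As a rough illustration of the regression criterion described above, the following Python sketch (not GenStat code) computes the accuracy of a node and the improvement given by one candidate split. The names node_accuracy and split_improvement are hypothetical. The sketch assumes that the divisor is the number of observations in the set being split, that all weights are 1, and that the first successor set takes the units with x-values at or below the boundary; BASSESS itself may differ in these details.

```python
from statistics import fmean


def node_accuracy(y, n_total):
    """Sum of squared deviations from the node mean, divided by n_total."""
    if not y:
        return 0.0
    mean = fmean(y)
    return sum((v - mean) ** 2 for v in y) / n_total


def split_improvement(y, x, boundary):
    """Initial accuracy minus the accuracies of the two successor nodes.

    Assumption: units with x <= boundary go to the first successor node.
    """
    n_total = len(y)
    first = [yi for yi, xi in zip(y, x) if xi <= boundary]
    second = [yi for yi, xi in zip(y, x) if xi > boundary]
    return (node_accuracy(y, n_total)
            - node_accuracy(first, n_total)
            - node_accuracy(second, n_total))


# Made-up data: the response jumps between x = 3 and x = 4.
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
improvement = split_improvement(y, x, 3.5)
print(improvement)
```

   A larger improvement indicates a split whose successor nodes are more homogeneous in the response than the original node.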

   For a classification tree, the FMETHOD option allows one of two selection criteria to be requested, either Gini information or the MPI (mean posterior improvement) criterion of Taylor & Silverman (1993). The default is to use Gini information. The ANTIENDCUTFACTOR option allows you to request use of adaptive anti-end-cut factors as devised by Taylor & Silverman (1993, Section 5). Further details are given in the Method section. By default no adaptive factors are used.

   The SPLITPOINT parameter can be used to save details of the best split found for each X variable. When X is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits. Alternatively, when X is a factor with unordered levels, it returns a variate containing the levels allocated to the first split. The CRITERIONVALUE parameter saves the value of the selection criterion for the best split found for each X variable.
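
   The two kinds of result described above (a boundary for a variate or ordered factor, a set of levels for an unordered factor) can be pictured with the following Python sketch. It is not how BASSESS is implemented: the function names are hypothetical, the criterion is the regression improvement with all weights 1, candidate boundaries are assumed to lie midway between adjacent distinct x-values, and the unordered case simply tries every subset of levels.

```python
from itertools import combinations


def sum_sq(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def improvement(y, in_first):
    """Reduction in sum of squares, per observation, for a two-way split."""
    first = [yi for yi, f in zip(y, in_first) if f]
    second = [yi for yi, f in zip(y, in_first) if not f]
    if not first or not second:
        return float("-inf")  # not a genuine split
    return (sum_sq(y) - sum_sq(first) - sum_sq(second)) / len(y)


def best_boundary(y, x):
    """Variate or ordered factor: returns (criterion value, boundary)."""
    values = sorted(set(x))
    # Assumption: candidate boundaries are midpoints of adjacent values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    scored = [(improvement(y, [xi < c for xi in x]), c) for c in candidates]
    return max(scored) if scored else (float("-inf"), None)


def best_level_subset(y, x):
    """Unordered factor: returns (criterion value, levels in the first set)."""
    levels = sorted(set(x))
    best = (float("-inf"), None)
    for size in range(1, len(levels)):
        for subset in combinations(levels, size):
            score = improvement(y, [xi in subset for xi in x])
            if score > best[0]:
                best = (score, subset)
    return best


# Made-up data using the Dose and Drug factors mentioned earlier.
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
dose = [1, 1, 1.5, 2, 2.5, 2.5]
drug = ["Morphine", "Amidone", "Morphine",
        "Pethidine", "Phenadoxone", "Pethidine"]
boundary_result = best_boundary(y, dose)
subset_result = best_level_subset(y, drug)
print(boundary_result, subset_result)
```

   The exhaustive search over subsets visits each split twice (once as a subset and once as its complement) and grows exponentially with the number of levels, so it is only meant to show what the saved variate of levels represents.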

   The SELECTION option can be set to a dummy to store the identifier of the X variate or factor used in the best split, and the MAXSPLITPOINT option can save details of the best split, in the same form as the SPLITPOINT parameter. The MAXCRITERION option saves the maximum value obtained for the selection criterion, and the NOSELECTION option saves a scalar containing the value 0 if a split could be selected or 1 if no further splitting was possible. You can save a logical expression representing the best split using the TEST option. So, for example, you can put

BASSESS [Y=Yvar; TEST=Test; ...]

RESTRICT Yvar; #Test == 1

PRINT Yvar

to print the y-values of the individuals in the first successor set. BASSESS takes account of restrictions on Y or on any of the X variates or factors, so you could then use BASSESS again to find the best split on that set.

   The WEIGHTS option can supply a variate of weights for the observations. This could be used to supply prior probabilities, or to emphasize units that are perceived as being especially important.

   Finally, the TOLERANCE option can be used to modify the tolerance multiplier used internally, for example, to check for equality of x-values. By default this is set automatically to a value appropriate for the GenStat implementation concerned.

 

Options: Y, SELECTION, TEST, MAXSPLITPOINT, MAXCRITERION, NOSELECTION, FMETHOD, ANTIENDCUTFACTOR, WEIGHTS, TOLERANCE.

Parameters: X, ORDERED, SPLITPOINT, CRITERIONVALUE.


Method

Further general information about classification and regression trees can be found in Breiman et al. (1984). The methods used by BASSESS for classification trees are based on Taylor & Silverman (1993). The Gini setting of the FMETHOD option uses the change in Gini information:

$$G = \left(1 - \sum_k \alpha_k^2\right) - \left(\sum_k \beta_{1k}\right) \times \left(1 - \sum_k \beta_{1k}^2\right) - \left(\sum_k \beta_{2k}\right) \times \left(1 - \sum_k \beta_{2k}^2\right)$$

where $\alpha_k$ is the proportion of individuals in the original set that are in group $k$, and $\beta_{ik}$ is the proportion of individuals in successor set $i$ ($i$ = 1 or 2) that are in group $k$. The aim here is to split the individuals into sets to maximize differences between the within-set group probabilities. An equivalent formula (Taylor & Silverman 1993, Section 4) is

$$G = (p_1 \times p_2) \times \left\{ \sum_k \beta_{1k}^2 + \sum_k \beta_{2k}^2 - \sum_k (\beta_{1k} \times \beta_{2k}) \right\}$$

where $p_i = \sum_k \beta_{ik}$. The alternative MPI (mean posterior improvement) criterion concentrates more on making the group probabilities differ between the successor sets:

$$\mathrm{MPI} = (p_1 \times p_2) \times \left\{ 1 - \sum_k \frac{\beta_{1k} \times \beta_{2k}}{\beta_{1k} + \beta_{2k}} \right\}$$
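
   To make the first Gini formula above concrete, the following Python sketch (not GenStat code) evaluates the change in Gini information from the numbers of individuals of each group in the two successor sets. The function names are hypothetical. The sketch assumes that the weighting terms are the proportions of individuals sent to each successor set, that the squared terms use within-set group proportions, and that all weights are 1. It does not attempt the MPI criterion.

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum_k q_k^2 of a set, from its per-group counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_change(counts1, counts2):
    """Change in Gini impurity when a set is split into two successor sets.

    counts1[k] and counts2[k] are the numbers of individuals of group k
    in successor sets 1 and 2 respectively.
    """
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    parent = [a + b for a, b in zip(counts1, counts2)]
    p1, p2 = n1 / n, n2 / n  # proportions sent to each successor set
    return (gini_impurity(parent)
            - p1 * gini_impurity(counts1)
            - p2 * gini_impurity(counts2))


# Made-up counts for two groups.
pure = gini_change([10, 0], [0, 10])   # the split separates the groups
useless = gini_change([5, 5], [5, 5])  # both sets have the same composition
print(pure, useless)
```

   In this sketch a split that separates two equally common groups completely recovers the whole of the initial impurity, whereas a split that leaves the group proportions unchanged gives no improvement.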

   Taylor & Silverman (1993) note that the term $(p_1 \times p_2)$ aims to generate successor sets of similar size, and refer to it as the anti-end-cut factor because it aims to avoid sets being produced with only a small number of individuals. They suggest that this should vary according to the complexity of the problem, and instead become

$$\min \{ p_1 \times p_2,\ p_{low} \times (1 - p_{low}) \}$$

where $p_{low}$ is the reciprocal of the number of groups in the initial set for the classnumber setting of the ANTIENDCUTFACTOR option, and

$$\min \{ 0.5,\ 1 / (\sum_k \alpha_k^2) \}$$

for the reciprocalentropy setting. The idea is to encourage splits that lead to terminal nodes, and to take account of the fact that these are more likely to be generated as the number of groups becomes small.
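
   The effect of the classnumber setting can be seen in a small Python sketch (not GenStat code). The function name anti_end_cut is hypothetical, and the sketch assumes that the two proportions sum to 1; the reciprocalentropy setting is not covered.

```python
def anti_end_cut(p1, n_groups=None):
    """Anti-end-cut factor for a split sending proportion p1 to the first set.

    With n_groups=None this is the plain factor p1 * p2. Otherwise it is
    capped at plow * (1 - plow), where plow = 1 / n_groups.
    """
    p2 = 1.0 - p1  # assumption: the two proportions sum to 1
    plain = p1 * p2
    if n_groups is None:
        return plain
    p_low = 1.0 / n_groups
    return min(plain, p_low * (1.0 - p_low))


# With four groups, plow = 0.25 and the cap is 0.25 * 0.75 = 0.1875.
even_plain = anti_end_cut(0.5)
even_capped = anti_end_cut(0.5, n_groups=4)
quarter_capped = anti_end_cut(0.25, n_groups=4)
small_capped = anti_end_cut(0.1, n_groups=4)
print(even_plain, even_capped, quarter_capped, small_capped)
```

   With four groups, an even split and a split that sends a quarter of the individuals to one set receive the same factor, so a split that isolates a single group is no longer penalized relative to an even one. Splits more extreme than that are still penalized.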


Action with RESTRICT

You can request that BASSESS operate on only a subset of the units by applying a restriction to the Y variate or factor, or to any of the X variates or factors, or to the WEIGHTS variate.


References

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.

Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. Statistics and Computing, 3, 147-161.