BPRUNE procedure
Prunes a tree using minimal cost complexity (R.W. Payne).
Option
Parameters
Description
The construction of a classification tree or a regression tree generally results in over fitting, that is it continues to extend the branches of the tree beyond the point that can be justified statistically. The solution is to prune the tree to remove the uninformative sub-branches.
The tree to be pruned is specified by the TREE parameter. BPRUNE assumes that there is an accuracy figure R(t) available for each node t of the tree. By default this is assumed to be stored with the tree itself, but you can specify other values using the ACCURACY parameter. This should be set to a pointer whose suffixes are the same as the numbers of the nodes in the tree, and whose elements are scalars storing the relevant accuracy values.
For a classification tree the accuracy measures the impurity of the subset of individuals at that node (how far it is from being homogeneous i.e. with individuals from a single group). For a regression tree it is the average squared distance of the values of the response variate from their mean for the subset of observations at that node. The accuracy R(T) of the whole tree T is the sum of the accuracies of its terminal nodes.
BPRUNE uses the principle of minimal cost complexity (Breiman et al. 1984, Chapter 3) to produce a sequence of pruned trees. At each stage it prunes at the node which is the weakest link. Define R(Tt) to be the accuracy of the subtree with root at node t, and nterm(t) to be its number of terminal nodes. The weakest link is then the node for which
(R(t) - R(Tt)) / (nterm(t) - 1)
is a minimum. The pruned trees can be saved, in a pointer, using the NEWTREES parameter. Their accuracies can be saved (in a variate) using the RTPRUNED parameter, and their numbers of terminal nodes can be saved (also in a variate) using the NTERMINAL parameter.
Printed output is controlled by the PRINT option, with settings:
The plot of RTPRUNED against NTERMINAL demonstates the trade-off between accuracy and complexity (number of terminal nodes). It should show an initial rapid decrease, followed by a long flat region, and then often a gradual increase. The aim is to select a tree that is accurate but not over-complex. One possibility is to take the tree at the point where the graph levels off. However, RTPRUNED contains only an estimate of the accuracy of the trees. So Breiman et al. (1984) recommend taking a tree a little above that (in fact at one standard error of RTPRUNED above the minimium point in the graph: see Chapters 3 and 11). In practice though a small amount of over-fitting should not be a problem, so the exact choice of pruned tree should not be crucial.
Option: PRINT.
Parameter: TREE, ACCURACY, NEWTREES, RTPRUNED, NTERMINAL.
Method
BPRUNE uses the BSCAN function to move around the tree, function BBELOW to obtain the numbers of the nodes below each node, and directive BCUT to perform each pruning operation.
Reference
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.