RSEARCH procedure

Helps search through models for a regression or generalized linear model (P.W. Goedhart).


Options

PRINT = string
Printed output required (model, results); default mode, resu

METHOD = strings
Model selection method to employ (allpossible, forward, backward, fstepwise, bstepwise, accumulated, pooled); default allp

FORCED = formula
Model formula to include in every model; default *

CONSTANT = string
How to treat the constant (estimate, omit); default esti

FACTORIAL = scalar
Limit for expansion of all model terms; default 3

DENOMINATOR = string
Whether to base ratios in accumulated summaries on rms from model with smallest residual ss or smallest residual ms (ss, ms); default ss

INRATIO = scalar
Criterion for inclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0

OUTRATIO = scalar
Criterion for exclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0

MAXCYCLE = scalar
Limit on number of times to repeat stepwise selection methods, unless no change is made; default 50

CRITERION = string
Criterion for selecting best models among all possible models (r2, adjusted, cp, ep, aic, sic, meandeviance, deviance); default adju

EXTRA = string
Criterion which is also printed for the selected best models (r2, adjusted, cp, ep, aic, sic, meandeviance, deviance); default cp when DISPERSION=*, and mean otherwise

AFACTORIAL = scalar
Limit for expansion of FREE model terms for the fitting of all possible models; default 3

PENALTY = scalar
Penalty for Mallows Cp and Akaike's information criterion Aic; default 2

NTERMS = scalar
Limit on the number of terms to be fitted when fitting all possible models; default 16

NBESTMODELS = scalar
Number of best models printed for each subset size; default 8

FINALMODELS = pointer
Pointer to save the final models for forward, backward, fstepwise and bstepwise regression methods

ALLMODELS = pointer
Pointer to save formulae for all possible regression models containing the fitted terms of all the models; every formula includes the FORCED formula if set

RESULTS = pointer
Pointer to save variates for all possible regression models containing the criteria (r2, adjusted, cp, ep, aic, sic, deviance, meandeviance), degrees of freedom for residual and the total number of fitted parameters p

STATISTICS = pointer
Pointer to save variates for all possible regression models containing the test statistics. These are F-to-delete statistics (i.e. deviance ratios) when the DISPERSION option of the MODEL directive is set to *, and Chi-square-to-delete statistics (i.e. deviance differences scaled by the dispersion parameter) for a fixed dispersion parameter

DF = pointer
Pointer to save variates for all possible regression models containing the degrees of freedom for the numerator of the test statistics

PROBABILITIES = pointer
Pointer to save variates for all possible regression models containing the probabilities of the test statistics


Parameter

FREE = formula
Model formula specifying the candidate model terms


Description

There are various methods for choosing a regression model when there are many candidate model terms; see e.g. Montgomery & Peck (1992) or Miller (1990). The STEP directive provides forward selection, backward elimination and stepwise regression. However, these methods result in only one model, and alternative models with an equivalent or even better fit are easily overlooked. Especially in observational studies with many non-orthogonal terms, there are frequently a number of alternative models, and selecting just one well-fitting model is then unsatisfactory and possibly misleading. A preferable method is to fit all possible regression models and to evaluate these according to some criterion. In this way a number of best regression models can be selected. However, fitting all possible regression models is computationally very intensive. It should also be used with caution, because models can be selected which appear to have a lot of explanatory power but contain only noise variables; see e.g. Flack & Chang (1987). This may occur particularly when the number of parameters is large in comparison with the number of units, as illustrated by the example for RSEARCH. Terms should therefore not be selected on the basis of a statistical analysis alone.

   RSEARCH can be used to perform these model selection methods. The call to RSEARCH must be preceded by a MODEL statement which defines the response variate and, if required, all other aspects of a (generalized) linear model. Only one response variate is allowed unless the DISTRIBUTION option of MODEL is set to multinomial. The FREE parameter specifies the candidate model terms. These may include variates, factors, interactions and regression functions like POL and SSPLINE. The METHOD option controls which model selection methods are employed:

    accumulated
prints an accumulated analysis of deviance in which all model terms are added one by one to the model in the given order;

    pooled
prints an accumulated analysis of deviance in which terms with the same number of identifiers, e.g. main effects or two-factor interactions, are pooled;

    forward
prints an accumulated analysis of deviance resulting from forward selection;

    backward
prints an accumulated analysis of deviance resulting from backward elimination;

    fstepwise
prints an accumulated analysis of deviance resulting from stepwise regression starting with no candidate terms in the model;

    bstepwise
prints an accumulated analysis of deviance resulting from stepwise regression starting with all candidate terms in the model;

    allpossible
prints summary statistics for a number of best models among all possible models.

For each model with METHOD=allpossible, the selection criterion and the degrees of freedom of the included terms are printed. The probability for the hypothesis that an included term can be deleted as the last term is also printed. These probabilities are based on F-to-delete statistics (i.e. deviance ratios) when the DISPERSION option of the MODEL directive is set to *, and Chi-square-to-delete statistics (i.e. deviance differences scaled by the dispersion parameter) for a fixed dispersion parameter.
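   The two kinds of to-delete statistic can be sketched in Python as follows. This is only an illustration: the function and argument names are hypothetical, and the choice of the mean deviance of the model containing the term as the denominator of the deviance ratio is an assumption, not something stated above.

```python
def f_to_delete(dev_without, dev_with, df_term, resid_df_with):
    """Deviance ratio for deleting a term (DISPERSION=* case).

    Assumption: the denominator is the mean deviance of the model
    that still contains the term.
    """
    return ((dev_without - dev_with) / df_term) / (dev_with / resid_df_with)


def chi2_to_delete(dev_without, dev_with, dispersion):
    """Deviance difference scaled by a fixed dispersion parameter."""
    return (dev_without - dev_with) / dispersion


# Hypothetical deviances: 50 without the term, 40 with it.
print(f_to_delete(50.0, 40.0, df_term=2, resid_df_with=20))
print(chi2_to_delete(50.0, 40.0, dispersion=2.0))
```

   The printed probabilities would then be upper-tail areas of the F or chi-square distribution with the matching degrees of freedom; these are omitted here because the Python standard library has no such functions.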

   It is sometimes desirable to include specific terms in every model. Such terms may be specified by means of the FORCED option. The FORCED model terms are always fitted first. The CONSTANT option controls whether the constant parameter is included in the model. The limit for expanding the FREE and FORCED model formulae can be set with the FACTORIAL option, which has default value 3. The PRINT option can be used to control the output from RSEARCH.

   The criteria for inclusion and exclusion of terms for forward selection, backward elimination and stepwise regression can be specified by the INRATIO and OUTRATIO options respectively. The MAXCYCLE option sets the maximum number of steps; selection stops earlier if a step makes no change. These options operate exactly as in the STEP directive. The DENOMINATOR option controls the way in which variance ratios are calculated in accumulated analysis of deviance summaries.

   All possible regression models are only fitted when the number of candidate FREE model terms does not exceed 16. If the FREE formula specifies a main effects model, i.e. a model without interactions, the main effects are the candidate terms. When the FREE formula contains interactions, first all terms marginal to an interaction are dropped from the FREE formula and are added to the FORCED formula. This ensures that the principle of marginality is never violated when the candidate terms are fitted in turn. The AFACTORIAL option can be used to limit the expansion of the FREE model terms for the fitting of all possible regression models. The expansion is limited in addition to the limitation imposed by the FACTORIAL option. As an example, the following calls to RSEARCH result in identical candidate model terms, namely a.b, a.c, b.c and d, for all possible regression models:

RSEARCH [METHOD=forward,backward,allpossible;\
        FACTORIAL=3; AFACTORIAL=2] a*b*c + d

RSEARCH [METHOD=forward,backward,allpossible;\
        FACTORIAL=2; AFACTORIAL=2; FORCED=a+b+c] a*b*c + d

However, forward selection starts with no terms in the first call and with the model a+b+c in the second call. Backward elimination starts with the full model including the three factor interaction a.b.c in the first call, while this term is not fitted in the second call.
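   The way the candidate terms for all possible models are derived in the two calls can be mimicked with a small Python sketch. It is not the procedure's own code: terms are represented as sets of identifiers, and the helper names (expand_star, candidate_terms, label) are hypothetical.

```python
from itertools import combinations


def expand_star(identifiers, limit):
    """Terms of id1*id2*... with at most `limit` identifiers per term."""
    return {frozenset(c)
            for size in range(1, limit + 1)
            for c in combinations(identifiers, size)}


def candidate_terms(free, forced, afactorial):
    """Return (candidates, forced) for the fitting of all possible models."""
    forced = set(forced)
    # Apply the AFACTORIAL limit and drop terms that are already forced.
    free = {t for t in free if len(t) <= afactorial} - forced
    # Terms marginal to another free term move to the forced set.
    marginal = {t for t in free if any(t < other for other in free)}
    return free - marginal, forced | marginal


def label(terms):
    return sorted(".".join(sorted(t)) for t in terms)


d = frozenset("d")
# First call: FACTORIAL=3, AFACTORIAL=2, no FORCED formula.
cand1, forced1 = candidate_terms(expand_star("abc", 3) | {d}, set(), 2)
# Second call: FACTORIAL=2, AFACTORIAL=2, FORCED=a+b+c.
cand2, forced2 = candidate_terms(expand_star("abc", 2) | {d},
                                 expand_star("abc", 1), 2)
print(label(cand1), label(cand2))
```

   In both cases the sketch ends with a, b and c forced and a.b, a.c, b.c and d as candidates.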

   The CRITERION option controls the selection of the best models among all possible regression models. The criteria employed in RSEARCH are defined as follows:

    r2
100 × [1 - Dev / Dev0]

    adjusted
100 × [1 - (Dev / (n-p)) / (Dev0 / (n-p0))]

    cp
Dev / f + 2 × p - n

    ep
Dev × (n+1) × (n-2) / [n × (n-p) × (n-p-1)]

    aic
Dev / f + 2 × p

    sic
Dev / f + Ln(n) × p

    deviance
Dev

    meandeviance
Dev / (n-p)

where

    Dev
is the deviance of the current model;

    Dev0
is the deviance of the null model;

    p
is the number of fitted parameters of the current model;

    p0
is the number of fitted parameters of the null model;

    n
is the number of units;

    f
is the dispersion parameter.

The null model is the model with only a constant term, which may include the fitting of a grouping factor for a within groups regression and/or the fitting of cut-points for an ordinal response model.
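   As an illustration, the criteria can be evaluated with a short Python function. The function name and the example numbers are hypothetical. The sketch assumes the usual form of Mallows Cp, Dev / f + 2 × p - n, and lets a penalty argument stand in for the constant 2 in Cp and Aic.

```python
import math


def rsearch_criteria(dev, dev0, n, p, p0, f, penalty=2.0):
    """Evaluate the selection criteria for one model.

    dev, dev0: deviance of the current and the null model
    p, p0:     number of fitted parameters of those models
    n:         number of units;  f: dispersion parameter
    """
    return {
        "r2": 100.0 * (1.0 - dev / dev0),
        "adjusted": 100.0 * (1.0 - (dev / (n - p)) / (dev0 / (n - p0))),
        "cp": dev / f + penalty * p - n,  # assumed form of Mallows Cp
        "ep": dev * (n + 1) * (n - 2) / (n * (n - p) * (n - p - 1)),
        "aic": dev / f + penalty * p,
        "sic": dev / f + math.log(n) * p,
        "deviance": dev,
        "meandeviance": dev / (n - p),
    }


# Hypothetical model: 25 units, 5 parameters, dispersion 2.
crit = rsearch_criteria(dev=40.0, dev0=100.0, n=25, p=5, p0=1, f=2.0)
for name, value in crit.items():
    print(f"{name:13s}{value:10.4f}")
```

   Under these definitions Aic and Cp differ only by the constant n, so they rank models identically.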

   The dispersion parameter f is specified by the DISPERSION option of the MODEL directive or, when DISPERSION is set to *, is estimated by the mean deviance of the model with all the candidate terms. In ordinary linear regression R², adjusted R² and Mallows Cp are widely used. When R² is used, there is no penalty for adding a term, i.e. R² always improves with the addition of a term. When adjusted R² or Cp is employed, there is a penalty for adding a term. Adjusted R² improves when the F-ratio due to the addition of the term is larger than 1, while Cp improves when the F-ratio is larger than 2. Clearly, Cp is the more conservative criterion and will tend to select models with fewer terms as compared to R² and adjusted R². Minimizing Cp minimizes the mean squared error of prediction in ordinary linear regression in the case where predictions will be made at the same values as are present in the current data set. Models with negligible bias have Cp ≈ p. For predictions at new random values, as is common in observational studies, Ep estimates the mean squared error of prediction; then Ep should be minimized. Thompson (1978) and Miller (1990) discuss Cp and Ep in detail.
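   The statement that adjusted R² improves exactly when the F-ratio exceeds 1 can be checked numerically. The Python sketch below treats a term with a single parameter and uses hypothetical function names. Because Dev0 / (n-p0) is the same for every model, adjusted R² improves precisely when the mean deviance decreases.

```python
def f_ratio(dev_old, dev_new, resid_df_new):
    """F-ratio for adding a one-parameter term to the old model."""
    return (dev_old - dev_new) / (dev_new / resid_df_new)


def mean_deviance_improves(dev_old, dev_new, resid_df_new):
    """True when the new model has the smaller mean deviance."""
    return dev_new / resid_df_new < dev_old / (resid_df_new + 1)


# New model: deviance 40 on 20 residual df; vary the old deviance.
for dev_old in (40.5, 41.0, 43.0, 50.0):
    print(dev_old,
          round(f_ratio(dev_old, 40.0, 20), 3),
          mean_deviance_improves(dev_old, 40.0, 20))
```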

   Criteria suggested for generalized linear models are the Akaike information criterion (Aic) and the Schwarz information criterion (Sic). The definition of both criteria used here is different from that in the literature. The deviance is used instead of the maximum value of the log-likelihood, which implies a constant shift for distributions without dispersion parameter. Moreover, in the spirit of generalized linear models, the deviance is scaled by the dispersion parameter. This makes Aic equivalent to Cp. Clearly, Sic is the more conservative criterion, especially when the number of units is large.

   Note that the best models have a small Cp, Ep, Aic, Sic, deviance and mean deviance, but a large R² and adjusted R². The default penalty of 2 in the definition of Cp and Aic can be altered by setting the PENALTY option, in which case Cp and Aic improve when the F-ratio is larger than PENALTY. The EXTRA option specifies an extra criterion which is printed alongside the selection criterion. The default for CRITERION is adjusted. The default for EXTRA is cp when DISPERSION is set to *, and meandeviance otherwise.
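   The effect of PENALTY can be made concrete with a small Python sketch. The function names are hypothetical, and the F-ratio is assumed to use the dispersion parameter f as its denominator. Adding a term with d parameters changes Cp (and Aic) by -d × F + PENALTY × d, which is negative exactly when F exceeds PENALTY.

```python
def f_ratio_vs_dispersion(dev_old, dev_new, df_term, f):
    """Deviance reduction per parameter, relative to the dispersion f."""
    return (dev_old - dev_new) / (df_term * f)


def cp_change(dev_old, dev_new, df_term, f, penalty=2.0):
    """Change in Cp (or Aic) when a term with df_term parameters is added."""
    return (dev_new - dev_old) / f + penalty * df_term


# Hypothetical term: deviance drops from 50 to 45 with dispersion 2 (F = 2.5).
print(f_ratio_vs_dispersion(50.0, 45.0, 1, 2.0))
print(cp_change(50.0, 45.0, 1, 2.0))               # negative: Cp improves
print(cp_change(50.0, 45.0, 1, 2.0, penalty=3.0))  # positive: Cp worsens
```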

   The NTERMS option specifies the maximum number of candidate terms in a model. This can be used when only models with few candidate terms are relevant or to reduce the computational burden. For example with 12 candidate terms there are 4096 different models, while there are only 299 models with maximally three terms. Specifying NTERMS=3 then saves a considerable amount of computing time. The NBESTMODELS option specifies the number of best models within each subset size for which summary statistics are printed.
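   The model counts quoted above follow from summing binomial coefficients, as this short Python check shows (the function name is hypothetical):

```python
from math import comb


def count_models(n_candidates, nterms=None):
    """Number of subsets of the candidate terms with at most nterms terms.

    The empty subset (no candidate terms) is counted as a model.
    """
    limit = n_candidates if nterms is None else min(nterms, n_candidates)
    return sum(comb(n_candidates, k) for k in range(limit + 1))


print(count_models(12))      # all subsets of 12 candidate terms
print(count_models(12, 3))   # subsets with at most three terms
```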

   The FINALMODELS option can be used to save the final models for forward selection, backward elimination, and fstepwise and bstepwise regression. Results of the fitting of all possible regression models can be saved by means of the options ALLMODELS, RESULTS, STATISTICS, DF and PROBABILITIES. These save results from all the fitted models, including the constant model, and not only from the models that are printed.

   All regression warnings are suppressed. This is to prevent the printing of long lists of similar warnings like "Iterative weights have become 0, or have been held at a limit". Note that the printed output of all possible regression models is adjusted to the width of the output file.

 

Options: PRINT, METHOD, FORCED, CONSTANT, FACTORIAL, DENOMINATOR, INRATIO, OUTRATIO, MAXCYCLE, CRITERION, EXTRA, AFACTORIAL, PENALTY, NTERMS, NBESTMODELS, FINALMODELS, ALLMODELS, RESULTS, STATISTICS, DF, PROBABILITIES.

Parameters: FREE.


Method

First the FREE and FORCED formulae are checked using the subsidiary procedure _RSEARCHCHECK, and terms that appear in both are dropped from the FREE formula. Then the full model is fitted and aliased predictors are dropped from both formulae. Forward selection, backward elimination and stepwise regression are implemented straightforwardly using the STEP directive.

   The fitting of all possible regression models uses a sequence of models in which, within each subset size, every model is fitted by dropping one term from the previous model and adding another term. Test statistics are calculated as though the tested term is the last term to enter the model. When the DISPERSION option of the MODEL directive is set to *, terms are tested by means of F-to-delete statistics, which are deviance ratios. For a fixed dispersion parameter Chi-square-to-delete statistics, i.e. deviance differences scaled by the dispersion parameter, are used to calculate probabilities.
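   One known ordering in which successive subsets of a fixed size differ by dropping one term and adding another is the "revolving door" ordering of combinations. The Python sketch below generates it. Whether RSEARCH uses this particular ordering is not stated here, so the sketch only illustrates the idea; the function name is hypothetical.

```python
def revolving_door(n, k):
    """All k-subsets of range(n), each differing from the previous
    one by a single swap (one element out, one element in)."""
    if k == 0:
        return [[]]
    if k == n:
        return [list(range(n))]
    # Subsets without element n-1, then those with it in reversed order.
    without_last = revolving_door(n - 1, k)
    with_last = [c + [n - 1] for c in reversed(revolving_door(n - 1, k - 1))]
    return without_last + with_last


# Subsets of size 2 out of 4 candidate terms numbered 0..3.
for subset in revolving_door(4, 2):
    print(subset)
```

   With such an ordering each model within a subset size can be reached from its predecessor by one drop and one add, rather than being refitted from scratch.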

   Smoothing splines are not allowed in the FREE formula for METHOD=allpossible due to a limitation of the FCLASSIFICATION directive.


Action with RESTRICT

Factors and variates in the FREE and FORCED formulae should not be restricted. Any restriction applied to vectors used in the MODEL statement applies also to the results from RSEARCH.


References

Flack, V.F. & Chang, P.C. (1987). Frequency of selecting noise variables in subset regression analysis: a simulation study. The American Statistician, 41, 84-86.

Miller, A.J. (1990). Subset Selection in Regression. Chapman & Hall, London.

Montgomery, D.C. & Peck, E.A. (1992). Introduction to Linear Regression Analysis (second edition). Wiley, New York.

Thompson, M.L. (1978). Selection of variables in multiple regression: Part I. A review and evaluation. International Statistical Review, 46, 1-19.