RCHECK procedure

Checks the fit of a linear, generalized linear or nonlinear regression (P.W. Lane, R. Cunningham & C. Donnelly).


Options

PRINT = strings
What to print (index, y, residuals, leverages, Cook); default *

RMETHOD = string
Type of residual to use (deviance, Pearson, simple, deletion); default * i.e. as set in MODEL

INDEX = variate or factor
Which variable to use as index; default !(1...n)

ENVELOPE = string
Type of envelope with Normal and half-Normal plots (none, rough, smooth, asymptotic); default none

PROBABILITY = scalar
Approximate probability level for envelope; default 0.95

NSIMULATIONS = scalar
How many simulations to generate for rough or smooth envelopes; default (1+PROB)/(1-PROB)

SHADE = string
Whether to show shaded envelope rather than boundaries (no, yes); default no

RESIDUALS = variate
To store chosen type of residuals; default *

LEVERAGES = variate
To store leverages; default *

COOK = variate
To store modified Cook's statistics; default *

GRAPHICS = string
Type of graphics to use (lineprinter, highresolution); default high

TITLE = text
Title for graph; default identifier of response

WINDOW = numbers
Window or series of windows in which to display graphs; default 4, or 5...8 for composite

SCREEN = string
Treatment of previous graphics screen (clear, keep); default clea

SAVE = regression save structure
Specifies which model to check; default *


Parameters

YSTATISTIC = strings
What to display in the graph (residuals, Cook, leverages); default resi

XMETHOD = strings
What type of graph (fittedvalues, index, normal, halfnormal, histogram, composite); default comp


Description

Procedure RCHECK provides "diagnostic" information for checking the fit of regression models. The regression directives themselves make some checks, such as for large residuals and influential points, and give access to simple and standardized residuals and leverages through directive RKEEP. The RCHECK procedure automatically accesses these quantities via RKEEP and in addition can calculate deletion residuals and modified Cook's statistics. A range of graphs can then be drawn to help check the fit of the regression model. The defaults are intended to provide a sensible display from the simple command

RCHECK

following the fit of a regression model.

   The procedure is controlled by the YSTATISTIC and XMETHOD parameters. These can be set to display various types of residuals, as specified by the RMETHOD option; the default is the setting of this option in the MODEL command in force when the model was fitted. In addition, the absolute residuals, the leverages, or the modified Cook's statistics can be displayed. Each of these sets of statistics can be plotted against the fitted values or against an index variable; by default, the index just orders the values in the order of the units. The statistics can also be shown as Normal or half-Normal plots, or as a histogram (the Normal plot for absolute residuals being the same as the half-Normal plot). A set of four such plots is displayed as a composite picture: histogram, plot against fitted values, Normal plot and half-Normal plot (with an index plot replacing the Normal plot for absolute residuals). Graphs can be displayed in line-printer style by setting the GRAPHICS option, though some features are not then available.

   The chosen type of residuals, the leverages and the modified Cook's statistics can be printed by setting the PRINT option, or stored in variates using the RESIDUALS, LEVERAGES and COOK options.

   Plots of the residuals against fitted values or an index variable are displayed with a smoothed line fitted through the points, to indicate any potential trend.

   Normal and half-Normal plots can be enhanced with an "envelope" by setting the ENVELOPE option. The rough setting produces an upper and a lower bound for the values, and a median line, all produced by simulation. The bounds correspond approximately to individual confidence intervals for each value, with probability as set by the PROBABILITY option (default 95%). The number of simulations by default is the minimum to allow estimation of the required limits: this is (1+PROBABILITY) / (1-PROBABILITY). A larger number of simulations can be requested with the NSIMULATIONS option, to give better estimates at the expense of more computing time. The smooth setting requests that the bounds are smoothed, using a cubic smoothing spline with 4 d.f. The asymptotic setting produces bounds calculated from the asymptotic distribution of Normal order statistics. The envelope for all these settings can be displayed as a shaded region rather than as a set of three lines by setting the SHADE option to yes.
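
   As a worked illustration of the default number of simulations, the following is a minimal sketch in Python (not Genstat code). The function names are hypothetical, and rounding to the nearest integer is an assumption, since the text does not say how a non-integer result is handled. The second function simply inverts the formula algebraically.

```python
def default_nsimulations(probability: float) -> int:
    """Default number of simulations, (1 + P) / (1 - P).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return round((1.0 + probability) / (1.0 - probability))


def implied_probability(nsimulations: int) -> float:
    """Algebraic inverse of the formula above: P = (ns - 1) / (ns + 1)."""
    return (nsimulations - 1) / (nsimulations + 1)


print(default_nsimulations(0.95))   # the default PROBABILITY setting
print(implied_probability(39))
```

   With the default PROBABILITY of 0.95 the formula gives (1.95 / 0.05) = 39 simulations.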

   Envelopes cannot be calculated for nonlinear models or curves, nor for generalized linear models with inverse Normal, negative binomial, geometric, multinomial or calculated distributions. Nor can they be produced for deletion residuals or Cook's statistics; they are not appropriate for leverages, which have no associated distributional assumption.

   The graphical displays can be controlled as usual using the TITLE and SCREEN options. The WINDOW option can be used to select a defined window, or series of windows, for high-resolution plots. Otherwise window 4 is used for a single plot or windows 5-8 for composite plots. These are redefined if necessary to fill the frame.

   The colours and symbols used in the displays can be controlled by setting the attributes of the following pens with the PEN directive before calling the procedure:

    pen 2
zero lines in fitted-value, Normal and index plots;

    pen 3
points and histogram bars;

    pen 4
shading of envelopes;

    pen 5
smooth line in fitted-value and index plots of residuals, and envelope bounds if unshaded.

   The procedure exits if there are fewer than four observations, or fewer than two non-missing standardized residuals.

 

Options: PRINT, RMETHOD, INDEX, ENVELOPE, NSIMULATIONS, PROBABILITY, SHADE, RESIDUALS, LEVERAGES, COOK, GRAPHICS, TITLE, WINDOW, SCREEN, SAVE.

Parameters: YSTATISTIC, XMETHOD.


Method

Standardized residuals and leverages are accessed using RKEEP from the latest fitted regression model, or from that specified by the SAVE option. Deletion residuals di are calculated for linear models as follows:

di = ri / √( (n-p-ri²) / (n-p-1) )

where ri are the standardized residuals, n is the number of observations, and p is the number of parameters in the model. For generalized linear models,

di = SIGN(rdi) × √( (1-li) × rdi² + li × rpi² )

where rdi and rpi are the standardized deviance and Pearson residuals respectively.
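
   The two deletion-residual formulas above can be sketched in Python (not Genstat code) as follows. The function and argument names are hypothetical, and the second formula is read here as having (1-li) × rdi² + li × rpi² under the square root, with li the leverage.

```python
import math


def deletion_residual_linear(r: float, n: int, p: int) -> float:
    """Linear model: d = r / sqrt((n - p - r^2) / (n - p - 1)).

    r is a standardized residual, n the number of observations and
    p the number of parameters.
    """
    return r / math.sqrt((n - p - r * r) / (n - p - 1))


def deletion_residual_glm(rd: float, rp: float, leverage: float) -> float:
    """Generalized linear model: d = sign(rd) * sqrt((1 - l) * rd^2 + l * rp^2).

    rd and rp are the standardized deviance and Pearson residuals.
    """
    magnitude = math.sqrt((1.0 - leverage) * rd * rd + leverage * rp * rp)
    return math.copysign(magnitude, rd)


print(deletion_residual_linear(2.0, n=20, p=3))
print(deletion_residual_glm(-1.5, -1.8, leverage=0.2))
```

   When the deviance and Pearson residuals coincide, the second formula returns that common value whatever the leverage.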

   Modified Cook's statistics ci are calculated as follows:

ci = ABS(di) × √{ (n-p) × li / (p × (1-li)) }

where li are the leverages. In Normal plots, the Normal quantiles are calculated as follows:

qi = NED( (i-0.375) / (n+0.25) )

while for a half-Normal plot they are given by

qi = NED( 0.5 + 0.5 × (i-0.375) / (n+0.25) )
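
   The modified Cook's statistic and the plotting positions above can be illustrated with a short Python sketch (not Genstat code). It assumes that NED is the Normal equivalent deviate, that is the inverse of the standard Normal distribution function; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def modified_cook(d: float, leverage: float, n: int, p: int) -> float:
    """c = |d| * sqrt((n - p) * l / (p * (1 - l)))."""
    return abs(d) * math.sqrt((n - p) * leverage / (p * (1.0 - leverage)))


def normal_quantiles(n: int) -> list:
    """q_i = NED((i - 0.375) / (n + 0.25)) for i = 1..n."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def half_normal_quantiles(n: int) -> list:
    """q_i = NED(0.5 + 0.5 * (i - 0.375) / (n + 0.25)) for i = 1..n."""
    nd = NormalDist()
    return [nd.inv_cdf(0.5 + 0.5 * (i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]


print(modified_cook(2.0, leverage=0.2, n=20, p=3))
print(normal_quantiles(5))
print(half_normal_quantiles(5))
```

   The Normal quantiles are symmetric about zero, whereas the half-Normal quantiles are all positive, which is what makes them suitable for plotting absolute values.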

   For generalized linear models, fitted values are transformed by an approximate variance-stabilizing transformation before use in graphs:

    Poisson, multinomial, negative binomial and geometric
2 × SQRT(fitted)

    binomial, Bernoulli
2 × ANG(100 × fitted / nbinomial)

    gamma, exponential
LOG(fitted)

    inverse Normal
1 / fitted
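
   The transformations listed above can be collected into one Python sketch (not Genstat code). Two points are assumptions rather than statements from this page: ANG is taken to be the angular transformation of a percentage x, arcsin(√(x/100)) expressed in degrees, and any distribution not listed (for example Normal) is left untransformed. The function name and the lower-case distribution labels are hypothetical.

```python
import math


def stabilized_fitted(fitted: float, distribution: str, nbinomial: float = 1.0) -> float:
    """Approximate variance-stabilizing transformation of a fitted value."""
    if distribution in ("poisson", "multinomial", "negative binomial", "geometric"):
        return 2.0 * math.sqrt(fitted)
    if distribution in ("binomial", "bernoulli"):
        # Assumption: ANG(x) = arcsin(sqrt(x / 100)) in degrees, x a percentage.
        percentage = 100.0 * fitted / nbinomial
        return 2.0 * math.degrees(math.asin(math.sqrt(percentage / 100.0)))
    if distribution in ("gamma", "exponential"):
        return math.log(fitted)
    if distribution == "inverse normal":
        return 1.0 / fitted
    # Assumption: distributions not listed are plotted untransformed.
    return fitted


print(stabilized_fitted(4.0, "poisson"))
print(stabilized_fitted(5.0, "binomial", nbinomial=10.0))
print(stabilized_fitted(4.0, "inverse normal"))
```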

   The smoothed line displayed for fitted-value or index plots is calculated as a straight line if the number n of distinct explanatory values is less than 10. Otherwise it is a cubic smoothing spline, with 2 d.f. for n>9, 3 d.f. for n>34 or 4 d.f. for n>59.
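
   The choice of smoother can be written as a small decision rule. This Python sketch (not Genstat code) reads the thresholds as bands, with the largest threshold that is exceeded deciding the degrees of freedom and a straight line used when n is 9 or fewer. That reading is an interpretation inferred from the spline thresholds, and the function name is hypothetical.

```python
from typing import Optional


def smoother_df(n_distinct: int) -> Optional[int]:
    """Return the spline degrees of freedom, or None for a straight line."""
    if n_distinct > 59:
        return 4
    if n_distinct > 34:
        return 3
    if n_distinct > 9:
        return 2
    return None  # too few distinct values: fit a straight line instead


for n in (8, 10, 35, 60):
    print(n, smoother_df(n))
```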

   For Normal linear models, envelopes are calculated by default from ns sets of Normal random numbers, where

ns = (1 + PROBABILITY) / (1 - PROBABILITY).

If the number of observations is less than 100, the values are transformed using the projection matrix to induce the observed correlation pattern of the data; for larger datasets, no transformation is done. The values are then ordered and the minimum and maximum values determine the envelope boundaries. If ns is set by the NSIMULATIONS option, the boundaries are calculated with the QUANTILES function from the ns values generated for each ordered residual. For generalized linear models, ns sets of values of the response variate are generated from the distribution, with parameters estimated from the current fit. The model is refitted to each set, and the residuals extracted and dealt with as for the transformed Normal values above.
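
   For the simplest case described above (a Normal linear model with no transformation of the simulated values, and the default minimum and maximum used as boundaries), the simulation can be sketched in Python (not Genstat code). The function name and the seed argument are hypothetical. The sketch omits the projection-matrix step used for fewer than 100 observations and the QUANTILES calculation used when NSIMULATIONS is set, so it illustrates the idea rather than reproducing the procedure's output.

```python
import random
from statistics import median


def rough_envelope(n: int, nsimulations: int, seed: int = 1):
    """Simulate a rough envelope for a Normal plot of n residuals.

    Returns (lower, middle, upper), each a list of n values, one for
    each ordered residual.
    """
    rng = random.Random(seed)
    # Each simulation is one ordered set of n standard Normal values.
    sims = [sorted(rng.gauss(0.0, 1.0) for _ in range(n))
            for _ in range(nsimulations)]
    lower = [min(s[i] for s in sims) for i in range(n)]
    middle = [median(s[i] for s in sims) for i in range(n)]
    upper = [max(s[i] for s in sims) for i in range(n)]
    return lower, middle, upper


lower, middle, upper = rough_envelope(n=20, nsimulations=39)
for lo, mid, hi in zip(lower, middle, upper):
    print(f"{lo:7.3f} {mid:7.3f} {hi:7.3f}")
```

   Each row gives the lower bound, median and upper bound for one ordered residual; plotting these against the Normal quantiles would give the three lines of the envelope.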


Action with RESTRICT

Restrictions applied to vectors used in the regression apply also to the RCHECK procedure. Values of diagnostic quantities are set to missing for all excluded units.