PREDICT directive
Forms predictions from a linear or generalized linear model.
Description
The PREDICT directive can be used after the FIT directive to summarize the results of the regression, by using the fitted relationship to predict the values of the response variate at particular values of the explanatory variables. CLASSIFY, the first parameter of PREDICT, specifies those variates or factors in the current regression model whose effects you want to summarize. Any variate or factor in the current model that you do not include will be standardized in some way, as described below.
The LEVELS parameter specifies values at which the summaries are to be calculated, for each of the structures in the CLASSIFY list. For factors, you can select some or all of the levels, while for variates you can specify any set of values. A single level or value is represented by a scalar; several levels or values must be combined into a variate (which may of course be unnamed). A missing value in the LEVELS parameter is taken by GenStat to stand for all the levels of a factor, or for the mean value of a variate. The PARALLEL parameter allows you to indicate a set of factors and/or variates whose values change in parallel. Each of these should have the same number of values specified for it by the LEVELS parameter of PREDICT. The predictions are then formed for each corresponding set of values rather than for every combination of these values.
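The difference between crossing the LEVELS values and changing them in parallel can be sketched in Python. This is an illustration of the idea only, not GenStat code; the variate names and values are hypothetical.

```python
from itertools import product

# Hypothetical LEVELS settings for two variates in a fitted model.
temperature = [10.0, 20.0, 30.0]
dose = [1.0, 2.0, 4.0]

# Without PARALLEL: one prediction for every combination of the values.
crossed = list(product(temperature, dose))

# With PARALLEL: the values change together, so both lists must have the
# same length and one prediction is formed per corresponding pair.
if len(temperature) != len(dose):
    raise ValueError("parallel structures need the same number of values")
parallel = list(zip(temperature, dose))

print(len(crossed))  # 9
print(parallel)      # [(10.0, 1.0), (20.0, 2.0), (30.0, 4.0)]
```

Three values for each of two structures give nine predictions when crossed, but only three when the structures are declared parallel.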
You can best understand how GenStat forms predictions by regarding its calculations as consisting of two steps. The first step, referred to below as Step A, is to calculate the full table of predictions, classified by every factor in the current model. For any variate in the model, the predictions are formed at its mean, unless you have specified some other values using the LEVELS parameter; if so, these are then taken as a further classification of the table of predictions. The second step, referred to as Step B, is to average the full table of predictions over the classifications that do not appear in the CLASSIFY parameter: you can control the type of averaging using the COMBINATIONS, ADJUSTMENT and WEIGHTS options. By default, the predictions are made at the mean of any offset variate, but option OFFSET can be used to specify another value at which the predictions should be made instead.
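The two steps can be sketched numerically in Python. This is a minimal illustration under stated assumptions, not GenStat's implementation: the model is a hypothetical additive fit with two factors, A and B, and all effects and counts are invented for the example. Only A is taken to be in the CLASSIFY list, so Step B averages over B using the default marginal weights.

```python
# Hypothetical additive fit: constant plus effects of factors A and B.
constant = 10.0
effect_a = {"a1": 0.0, "a2": 2.0}
effect_b = {"b1": 0.0, "b2": 1.0, "b3": 4.0}

# Hypothetical number of units at each level of B in the data.
count_b = {"b1": 5, "b2": 3, "b3": 2}

# Step A: full table of predictions, classified by every factor.
full_table = {
    (a, b): constant + effect_a[a] + effect_b[b]
    for a in effect_a
    for b in effect_b
}

# Step B: average over B (not in CLASSIFY) using marginal weights,
# i.e. the proportion of units at each level of B in the whole dataset.
total = sum(count_b.values())
weight_b = {b: n / total for b, n in count_b.items()}
predictions = {
    a: sum(weight_b[b] * full_table[(a, b)] for b in effect_b)
    for a in effect_a
}

print(predictions)  # approximately {'a1': 11.1, 'a2': 13.1}
```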
Printed output is controlled by the PRINT option. By default, descriptions, predictions and standard errors are printed. The standard errors (and sed's) are relevant for the predictions when considered as means of those data that have been analysed, with the means formed according to the averaging policy defined by the options of PREDICT. The word prediction is used because these are predictions of what the means would have been if the factor levels had been replicated differently in the data; see Lane & Nelder (1982) for more details. The LSDLEVEL option specifies the significance level (%) to use in the calculation of least significant differences (default 5%).
By default, the standard errors (and sed's) are not augmented by any component corresponding to the estimated variability of a new observation. However, you can set option SCOPE=new to request that the variance of predictions should be calculated on the basis of forecasting new observations rather than of summarizing the data to which the model has been fitted. This setting cannot be used if the predictions are to be standardized for the effects of any factors in the model; in other words, all factors in the current model must be listed in the CLASSIFY parameter of the PREDICT statement. In addition, it cannot be used when making predictions from generalized linear models with option BACKTRANSFORM=none, nor with weighted regression. The effect of SCOPE=new is to form variances for each predicted value by combining the variance of the estimated mean value of the prediction (as produced for SCOPE=data) together with the estimated variance of a new observation with the same values of explanatory variates and factors:
"new" variance = "data" variance + (dispersion × variance function)
The DISPERSION and DMETHOD options allow you to change the method by which the variance of the distribution of the response values is obtained for calculating the standard errors. These options operate like the corresponding options of MODEL (except that they apply only to the current statement). The default is to use the method as originally defined by the MODEL statement.
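The SCOPE=new variance formula can be worked through in Python. The sketch below is illustrative only: the function name and the numbers are hypothetical, and the variance function V(mu) = mu is the usual choice for a Poisson-type model rather than anything specified here.

```python
import math

def new_variance(data_variance, dispersion, variance_function):
    """Combine the variance of the estimated mean with that of a new observation."""
    return data_variance + dispersion * variance_function

# Hypothetical Poisson-type prediction, with variance function V(mu) = mu.
mu = 4.0               # predicted mean
data_variance = 0.25   # variance of the estimated mean (as for SCOPE=data)
dispersion = 1.0       # dispersion, e.g. as fixed or estimated for the model

var_new = new_variance(data_variance, dispersion, mu)
se_new = math.sqrt(var_new)

print(var_new)  # 4.25
```

Changing the dispersion value changes only the second term, which is why the standard error for a new observation can be much larger than the standard error of the estimated mean.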
The NBINOMIAL option can be used to supply the total number of trials to be used for prediction with a binomial distribution when option BACKTRANSFORM is set to link. If you provide a value n greater than one, GenStat will predict the number of "successes" out of n. The default, NBINOMIAL=1, causes GenStat to predict the proportion of successes.
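As a rough illustration of the two kinds of binomial prediction, the Python sketch below takes the expected number of successes out of n trials to be n times the predicted proportion. The function name and the linear-predictor value are hypothetical, and a logit link is assumed.

```python
import math

def predicted_successes(eta, nbinomial=1):
    """Back-transform a logit-scale linear predictor and scale by the trials."""
    proportion = 1.0 / (1.0 + math.exp(-eta))
    return nbinomial * proportion

eta = 0.0  # hypothetical linear predictor on the logit scale

proportion = predicted_successes(eta)           # one trial: a proportion
successes = predicted_successes(eta, 20)        # expected successes out of 20

print(proportion, successes)  # 0.5 10.0
```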
You can send the output to another channel, or to a text structure, by setting the CHANNEL option.
The COMBINATIONS option specifies which cells of the full table in Step A are to be filled for averaging in Step B. The default, COMBINATIONS=estimable, uses all the cells other than those that involve parameters that cannot be estimated, for example because of aliasing. Alternatively, you can set COMBINATIONS=present to exclude cells for factor combinations that do not occur in the data, or COMBINATIONS=full to use all the cells. When COMBINATIONS=estimable or COMBINATIONS=present, the LEVELS parameter is overruled. Any subsets of factor levels in the LEVELS parameter are ignored, and predictions are formed for all the factor levels that occur in the data or are estimable. Likewise, the full table cannot then be classified by any sets of values of variates; the LEVELS parameter must then supply only single values for variates.
The ADJUSTMENT and WEIGHTS options define how the averaging is done in Step B. Values in the full table produced in Step A are averaged with respect to all those factors that you have not included in the settings of the CLASSIFY parameter. By default, the levels of any such factor are combined with what we call marginal weights: that is, by the number of occurrences of each of its levels in the whole dataset. The ADJUSTMENT and WEIGHTS options allow you to change the weights. The setting ADJUSTMENT=equal specifies that the levels are to be weighted equally. The WEIGHTS option is more powerful than the ADJUSTMENT option, allowing you to specify an explicit table of weights. This table can be classified by any, or all, of the factors over whose levels the predictions are to be averaged; the levels of remaining factors will be weighted according to the ADJUSTMENT option. Moreover, you can classify the weights by the factors in the CLASSIFY parameter as well, to provide different weightings for different combinations of levels of these factors. If you supply explicit weights in the WEIGHTS option, any setting of the COMBINATIONS option is ignored. You will find explicit weights useful in particular when you have population estimates of the proportions of each level of a factor - proportions which may not be matched well in the available data.
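The three weighting schemes described above can be compared in a small Python sketch. It is an illustration of the arithmetic only; the cell predictions, counts and population proportions are hypothetical.

```python
def weighted_average(values, weights):
    """Average the values using the given weights, normalized to sum to one."""
    total = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total

# Hypothetical predictions at the three levels of a factor B that is
# not in the CLASSIFY list, and the number of units at each level.
cell = {"b1": 10.0, "b2": 11.0, "b3": 14.0}
counts = {"b1": 5, "b2": 3, "b3": 2}

# Default: marginal weights, proportional to occurrences in the data.
marginal = weighted_average(cell, counts)

# ADJUSTMENT=equal: every level gets the same weight.
equal = weighted_average(cell, {k: 1 for k in cell})

# WEIGHTS: explicit weights, e.g. hypothetical population proportions.
population = {"b1": 0.2, "b2": 0.3, "b3": 0.5}
explicit = weighted_average(cell, population)

print(round(marginal, 3), round(equal, 3), round(explicit, 3))
# 11.1 11.667 12.3
```

Here the data under-represent level b3 relative to the assumed population, so the explicitly weighted prediction is noticeably higher than the one based on marginal weights.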
If a model contains any aliased parameters, predicted values cannot be formed for some cells of the full table without assuming a value for the aliased parameters. With the default setting, COMBINATIONS=estimable, no predictions are formed for these cells. When COMBINATIONS=full, if the aliased parameters simply represent effects of variates that are correlated with other explanatory variables in the model, it may be sufficient just to ignore them. This can be done by setting the ALIASING option to ignore. The aliased parameters are then taken to be zero, and fitted values are calculated for all cells of the table from the remaining parameters in the model. Aliasing can also occur if there are some combinations of factors that do not occur in the data, and here it may be more sensible to set option COMBINATIONS=present so that these cells are all excluded from the calculation of predictions. The final way to overcome aliasing is to supply explicit weights using the WEIGHTS option.
Averaging is usually the appropriate way of combining predicted values over levels of a factor. But sometimes summation is needed, for example in the analysis of counts by log-linear models. You can achieve this by setting the METHOD option to total. The rules about weights and so on still apply. In a generalized linear model, averaging is done by default on the scale of the original response variable, not on the scale transformed by the link function. In other words, linear predictors are formed for all the combinations of factor levels and variate values specified by PREDICT, and then transformed by the inverse of the link function back to the natural scale. This back-transformation may be useful when you are reporting results, since the tables from PREDICT can then be interpreted as natural averages of means predicted by the fitted model. You can set option BACKTRANSFORM=none if you want the averaging to be done on the scale of the linear predictor; PREDICT will then form averages and report predictions on the transformed scale.
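The effect of the scale on which cells are combined can be seen in a short Python sketch for a log-linear model. The linear predictors are hypothetical and equal weights are used for simplicity; this illustrates the arithmetic rather than GenStat's own calculations.

```python
import math

# Hypothetical linear predictors (log scale) for three cells to be combined.
linear_predictors = [math.log(2.0), math.log(4.0), math.log(8.0)]
n_cells = len(linear_predictors)

# Default: back-transform each cell to the natural scale, then average.
means = [math.exp(eta) for eta in linear_predictors]
natural_average = sum(means) / n_cells

# METHOD=total: sum the back-transformed cells instead of averaging them.
natural_total = sum(means)

# BACKTRANSFORM=none: average the linear predictors and stay on that scale.
link_scale_average = sum(linear_predictors) / n_cells

print(round(natural_average, 3))     # 4.667
print(round(natural_total, 3))       # 14.0
print(round(link_scale_average, 3))  # 1.386
```

Back-transforming the link-scale average gives about 4, the geometric mean of the cell means, which differs from the natural-scale average of about 4.667. The two settings therefore summarize the table differently.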
PREDICT calculates the standard errors of predictions from iterative models by using first-order approximations that allow for the effect of the link function. Thus you should interpret them only as a rough guide to the variability of individual predictions.
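One common first-order approximation is the delta method, which scales the standard error of the linear predictor by the derivative of the inverse link. The Python sketch below applies it to a logit link; it shows the general idea with hypothetical values and is not necessarily the exact calculation that PREDICT performs.

```python
import math

def delta_method_se_logit(eta, se_eta):
    """Approximate SE on the proportion scale from an SE on the logit scale."""
    mu = 1.0 / (1.0 + math.exp(-eta))   # inverse logit link
    derivative = mu * (1.0 - mu)        # d(mu)/d(eta) for the logit link
    return derivative * se_eta

# Hypothetical linear predictor and its standard error.
se_natural = delta_method_se_logit(eta=0.0, se_eta=0.4)

print(round(se_natural, 6))  # 0.1
```

Because the approximation uses only the slope of the inverse link at the predicted value, it becomes less reliable where the link is strongly curved or the standard error is large.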
The PREDICTIONS, SE, SED, LSD and VCOVARIANCE options let you save the results of PREDICT as well as, or instead of, printing them.
The SAVE option allows you to specify the regression save structure of the analysis on which the predictions are based. If SAVE is not set, the most recent regression model is used.
The NOMESSAGE option controls printing of messages. The nonlinear setting suppresses messages about the approximate nature of standard errors of predictions in generalized linear models, and the dispersion setting prevents reminders appearing about the basis of the standard errors.
Options: PRINT, CHANNEL, COMBINATIONS, ADJUSTMENT, WEIGHTS, OFFSET, METHOD, ALIASING, BACKTRANSFORM, SCOPE, NOMESSAGE, DISPERSION, DMETHOD, NBINOMIAL, PREDICTIONS, SE, SED, LSD, LSDLEVEL, VCOVARIANCE, SAVE.
Parameters: CLASSIFY, LEVELS, PARALLEL.
Reference
Lane, P.W. & Nelder, J.A. (1982). Analysis of covariance and standardization as instances of prediction. Biometrics, 38, 613-621.
See also
Directives: MODEL, TERMS, FIT, RDISPLAY, RKEEP, RKESTIMATES, RFUNCTION, VPREDICT.
Procedures: RGRAPH, RDESTIMATES, RCOMPARISONS, RTCOMPARISONS, FIELLER, HGPREDICT.