MODEL directive
Defines the response variate(s) and the type of model to be fitted for linear, generalized linear, generalized additive and nonlinear models.
Options
Parameters
Description
The MODEL directive does not actually fit anything: it simply sets up some structures inside GenStat that are used when you give a FIT, FITCURVE or FITNONLINEAR statement later on. So, in a regression analysis, MODEL is always accompanied by at least one other statement, such as FIT, that actually fits a model.
The Y parameter allows a list of variates; if you supply more than one for linear regression, you will get an analysis for each. This is more efficient than separate pairs of MODEL and FIT statements when you want to do many linear regressions with the same explanatory variables. With additive models, generalized linear models and nonlinear models, only the first variate will be analysed (with the exception of multinomial response models); the others will be ignored.
The RESIDUALS and FITTEDVALUES parameters allow you to specify variates to contain the residuals and fitted values for each response variable. The residuals are the "unexplained" component of the response variable, standardized in some way according to the RMETHOD option. The fitted values are the "explained" component: that is, the combination of parameters and explanatory variables fitted in the model. You can get access to these sets of values in a different way through the RKEEP directive.
The DISTRIBUTION and LINK options are used to specify a generalized linear model (McCullagh & Nelder 1989). By default the data are assumed to follow a Normal distribution, as required for ordinary linear regression, but other distributions can be selected using the DISTRIBUTION option. The LINK option specifies the link function that relates the linear model to the expected values of the distribution; in the default ordinary linear regression, this is the identity function (indicating no transformation). So, for example, for a log-linear model we would specify DISTRIBUTION=Poisson and LINK=log, while for logistic regression we would have DISTRIBUTION=binomial and LINK=logit. The NBINOMIAL parameter must also be set when DISTRIBUTION=binomial, to give the number of binomial trials for each unit.
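As an illustration of what the link function does, the following Python sketch writes out the log and logit links and their inverses. It is not GenStat code; the function names are hypothetical, and the step from a fitted proportion to a fitted count simply multiplies by the number of binomial trials, as the NBINOMIAL parameter suggests.

```python
import math

def log_link(mean):
    """Log link: maps a positive mean onto the linear-predictor scale."""
    return math.log(mean)

def log_inverse(linear_predictor):
    """Inverse of the log link: recovers the mean."""
    return math.exp(linear_predictor)

def logit_link(proportion):
    """Logit link: maps a proportion in (0, 1) onto the linear-predictor scale."""
    return math.log(proportion / (1.0 - proportion))

def logit_inverse(linear_predictor):
    """Inverse of the logit link: recovers the proportion."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def binomial_fitted_count(linear_predictor, n_trials):
    """Expected number of successes out of n_trials (assumed: count = n * p)."""
    return n_trials * logit_inverse(linear_predictor)
```

A linear predictor of 0 on the logit scale corresponds to a proportion of 0.5, so with 20 trials the expected count under this sketch would be 10.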
The EXPONENT option specifies the exponent when LINK=power. Similarly, the AGGREGATION option specifies the aggregation parameter k when DISTRIBUTION=negativebinomial. This is a measure of the tendency for observations to cluster together, and it appears in the formula for the variance as a function of the mean:
variance = mean + mean²/k
The default value of k is set at 1, which corresponds to the geometric distribution. The parameter k must be positive, and as it increases to infinity the distribution approaches the Poisson distribution. The KLOGRATIO option sets the parameter k for the logratio link.
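The behaviour of the variance formula for different values of k can be checked numerically. The Python sketch below is only an illustration of the formula above; the function name is hypothetical.

```python
def negative_binomial_variance(mean, k=1.0):
    """Variance of the negative binomial as a function of the mean.

    k is the aggregation parameter and must be positive; the default of 1
    corresponds to the geometric distribution.
    """
    if k <= 0:
        raise ValueError("the aggregation parameter k must be positive")
    return mean + mean ** 2 / k

# With k = 1 the variance is mean + mean**2 (geometric distribution).
geometric = negative_binomial_variance(2.0)

# With a very large k the second term vanishes and the variance
# approaches the mean, as for the Poisson distribution.
near_poisson = negative_binomial_variance(2.0, k=1e9)
```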
You can also define your own distribution or link function for a generalized linear model. To specify your own distribution, you need to set DISTRIBUTION=calculated and then specify expression structures with the DCALCULATION option to calculate the deviance and the variance function for each unit of the response variate, using the current values of the fitted-values variate. You must also set the FITTEDVALUES, DEVIANCE and VFUNCTION parameters to indicate which identifiers are used to represent these in the expressions. To specify your own link, you need to set LINK=calculated and provide expressions with the LCALCULATION option to calculate the fitted values and the derivative of the link function for each unit of the response variate, using the current values of the linear predictor. You must also set the FITTEDVALUES, LINEARPREDICTOR and DERIVATIVE parameters to specify the identifiers used to represent these in the calculations. In addition, you must provide initial values for the linear predictor, so that the iterative process can get started: often this can be done just by applying the link function to the response variate itself, but it may be necessary to modify extreme values such as 0 that may be mapped to infinity by the link function.
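To make the two link calculations and the choice of initial values concrete, here is a Python sketch for a log link. It is not GenStat syntax: the function names and the adjustment value `floor` are hypothetical, and the derivative is taken as d(linear predictor)/d(fitted value), which is an assumption about the convention.

```python
import math

def initial_linear_predictor(response, floor=0.5):
    """Apply the log link to the response to get starting values.

    Values below `floor` are raised to `floor` first, so that a response
    of 0 is not mapped to minus infinity. The value 0.5 is a hypothetical
    choice, not a GenStat default.
    """
    return [math.log(max(value, floor)) for value in response]

def log_link_calculations(linear_predictor):
    """Return the fitted values and the link derivative for each unit.

    For the log link the fitted value is exp(eta), and the derivative of
    the link with respect to the fitted value is 1 / fitted value.
    """
    fitted = [math.exp(eta) for eta in linear_predictor]
    derivative = [1.0 / mu for mu in fitted]
    return fitted, derivative

counts = [0, 1, 4]
start = initial_linear_predictor(counts)
fitted, derivative = log_link_calculations(start)
```

Every starting value in this sketch is finite, even though the first response is 0.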
You can fit ordinal response models by setting option YRELATION=cumulative and option DISTRIBUTION=multinomial.
The DISPERSION option controls how the variance of the distribution of the response values is calculated. By default, the variance is estimated from the residual mean square, and standard errors and standardized residuals are calculated from the estimate. If you use DISPERSION to supply a value for the variance of the Normal distribution, or for the dispersion parameter of other distributions, then standard errors and residuals are based on this given value instead. In a generalized linear model, the dispersion of the chosen distribution can be fixed at a value provided by the DISPERSION option, or estimated from either the residual deviance or the Pearson chi-square statistic, as specified by the DMETHOD option.
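The two statistics that can be used to estimate the dispersion can be written out for a Poisson model as follows. This Python sketch uses the standard formulae and assumes that the estimate is the statistic divided by the residual degrees of freedom; the function names are hypothetical, and GenStat's internal calculation may differ in detail.

```python
import math

def poisson_deviance(response, fitted):
    """Residual deviance for a Poisson model.

    A response of 0 contributes nothing through the logarithmic term.
    """
    total = 0.0
    for y, mu in zip(response, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total

def pearson_chi_square(response, fitted):
    """Pearson chi-square statistic, using the Poisson variance function V(mu) = mu."""
    return sum((y - mu) ** 2 / mu for y, mu in zip(response, fitted))

def dispersion_estimate(statistic, n_units, n_parameters):
    """Dispersion estimated as statistic / residual degrees of freedom (assumed)."""
    return statistic / (n_units - n_parameters)
```

For a well-fitting Poisson model both estimates should be close to 1; values much larger than 1 suggest over-dispersion.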
The WEIGHTS option allows you to specify a variate holding weights for each unit. In simple linear regression, the estimate of dispersion is then the weighted residual mean square. Thus, if the variance of the response variable is not constant, and you know the relative size of the variance for each observation, you can set the weight to be proportional to the inverse of the variance of an observation. Alternatively, if the variance is related in a simple way to the mean, you may just need to specify a different distribution for the response. The WEIGHTS option can also be set to a symmetric matrix, supplying weights corresponding to some pattern of correlation or covariance between units as well as variance of each unit. The subsequent analysis is known as generalized least-squares if the response distribution is Normal.
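The effect of a weight variate in simple linear regression can be sketched in Python as below. This is an illustration of weighted least squares with one explanatory variate, not GenStat's algorithm; the function name is hypothetical, and the weights are assumed to be proportional to the inverse of each observation's variance.

```python
def weighted_line_fit(x, y, weights):
    """Fit y = intercept + slope * x by weighted least squares.

    Returns the intercept, the slope and the weighted residual mean
    square, which serves as the estimate of dispersion.
    """
    total_weight = sum(weights)
    x_mean = sum(w * xi for w, xi in zip(weights, x)) / total_weight
    y_mean = sum(w * yi for w, yi in zip(weights, y)) / total_weight
    sxy = sum(w * (xi - x_mean) * (yi - y_mean)
              for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_mean) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Weighted residual sum of squares on n - 2 degrees of freedom.
    rss = sum(w * (yi - intercept - slope * xi) ** 2
              for w, xi, yi in zip(weights, x, y))
    dispersion = rss / (len(x) - 2)
    return intercept, slope, dispersion

# Hypothetical data: the variances are assumed known up to a constant.
variances = [1.0, 0.5, 1.0, 0.25]
weights = [1.0 / v for v in variances]
intercept, slope, dispersion = weighted_line_fit(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], weights)
```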
The OFFSET option allows you to include in the regression a variable with no corresponding parameter. Linear regression analysis of Y with offset O is just the same as analysis of Y-O, but the offset has non-trivial applications in generalized linear models.
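The equivalence for linear regression can be checked with a small Python sketch: the offset is subtracted from the response before fitting and added back to the fitted values. The function names are hypothetical and the data are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def fit_line_with_offset(x, y, offset):
    """Fit a line to y with an offset: the offset has no parameter.

    The parameters are estimated from y - offset, and the offset is then
    added back to give fitted values on the scale of y.
    """
    adjusted = [yi - oi for yi, oi in zip(y, offset)]
    intercept, slope = fit_line(x, adjusted)
    fitted = [intercept + slope * xi + oi for xi, oi in zip(x, offset)]
    return intercept, slope, fitted

intercept, slope, fitted = fit_line_with_offset(
    [0.0, 1.0, 2.0], [1.0, 4.0, 7.0], [0.0, 1.0, 2.0])
```

In this example the response rises by 3 per unit of x, of which 1 is accounted for by the offset, so the estimated slope is 2.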
The GROUPS option specifies a factor whose effects you want to eliminate before any regression is fitted. The factor must already have been defined. This method of elimination is sometimes called absorption; you might want to use it when data from many different groups are to be modelled. Use of GROUPS gives less information than you would get if you included the factor explicitly in the model (leverages, predictions and some parameter correlations cannot be formed), but it saves space and time in fitting the model when the factor has many levels. You can use GROUPS only with linear and generalized linear regression.
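One standard way to picture absorption is to subtract the group mean from the response and from each explanatory variate before fitting, so that the group effects never appear as parameters. The Python sketch below illustrates that idea for a single explanatory variate; it is an assumption that this matches what GenStat does internally, and the function names are hypothetical.

```python
def absorb_groups(values, groups):
    """Subtract from each value the mean of its group."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [value - totals[group] / counts[group]
            for value, group in zip(values, groups)]

def slope_after_absorption(x, y, groups):
    """Slope of y on x after the group effects have been eliminated."""
    x_centred = absorb_groups(x, groups)
    y_centred = absorb_groups(y, groups)
    sxy = sum(a * b for a, b in zip(x_centred, y_centred))
    sxx = sum(a * a for a in x_centred)
    return sxy / sxx

# Two groups with very different levels but a common slope of 2.
slope = slope_after_absorption(
    [0.0, 1.0, 0.0, 1.0], [1.0, 3.0, 11.0, 13.0], ["a", "a", "b", "b"])
```

The group levels themselves are not estimated in this sketch, which is why quantities that depend on them, such as predictions, are unavailable.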
The RMETHOD option controls how residuals are formed. By default, residuals are deviance residuals standardized by their estimated variance. The alternative Pearson residuals are defined in exactly the same way if the distribution is Normal, but for regression models with distributions other than Normal the two kinds of residual are different. If you do not want residuals, you can set the option to a missing value (*) to save space within GenStat. However, you will then not be able to get residuals, fitted values or leverages, and the automatic checks on the fit of a model will not be done.
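The difference between the two kinds of residual can be seen from their basic definitions. The Python sketch below gives the unstandardized forms only, leaving out the adjustment for dispersion and leverage that is applied to the residuals described above; the function names are hypothetical.

```python
import math

def normal_deviance_residual(y, mu):
    """Signed square root of the unit deviance (y - mu)**2."""
    return math.copysign(math.sqrt((y - mu) ** 2), y - mu)

def normal_pearson_residual(y, mu):
    """(y - mu) / sqrt(V(mu)), with V(mu) = 1 for the Normal distribution."""
    return y - mu

def poisson_deviance_residual(y, mu):
    """Signed square root of the Poisson unit deviance."""
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    unit_deviance = max(0.0, 2.0 * (log_term - (y - mu)))
    return math.copysign(math.sqrt(unit_deviance), y - mu)

def poisson_pearson_residual(y, mu):
    """(y - mu) / sqrt(V(mu)), with V(mu) = mu for the Poisson distribution."""
    return (y - mu) / math.sqrt(mu)
```

For the Normal distribution the two functions return the same value for any observation, whereas for the Poisson distribution they differ except when the observation equals its fitted value.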
The FUNCTION option is relevant only when you want to optimize a general function (see FITNONLINEAR). It is used only if no response variates are specified by the Y parameter; otherwise it is ignored.
The SAVE option allows you to specify an identifier for the regression save structure. This structure stores the current state of the regression model, and can be used explicitly in the directives RDISPLAY, RKEEP, PREDICT and RFUNCTION. If the identifier supplied to SAVE belongs to a regression save structure that already has values, those values are deleted. You can reset the current regression save structure at any point in a program by using the SET directive; later regression statements then use the model stored in that save structure.
Options: DISTRIBUTION, LINK, EXPONENT, AGGREGATION, KLOGRATIO, DISPERSION, WEIGHTS, OFFSET, GROUPS, RMETHOD, DMETHOD, FUNCTION, YRELATION, DCALCULATION, LCALCULATION, SAVE.
Parameters: Y, NBINOMIAL, RESIDUALS, FITTEDVALUES, LINEARPREDICTOR, DERIVATIVE, DEVIANCE, VFUNCTION.
Action with RESTRICT
You can restrict the units that GenStat will use for the regression by putting a restriction on any of the vectors involved in the MODEL statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor in a subsequent TERMS statement. However, you are not allowed to have different restrictions on the different vectors. You should not alter the restriction applied to the vectors between the TERMS statement and subsequent fitting statements.
Reference
McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London.