GEE procedure

Fits models to longitudinal data by generalized estimating equations (D.M. Smith & M.G.Kenward).


Options

PRINT = string
What to display (estimates, correlations, scalefactor, monitoring); default esti, corr, scal

DISTRIBUTION = string
Distribution of response (normal, Poisson, binomial, gamma, inversenormal, negativebinomial); default *

LINK = string
Link function (identity, logarithm, logit, reciprocal, power, squareroot, probit, complementaryloglog, logratio); default *

EXPONENT = scalar
Exponent for power link; default -2

TERMS = formula
Explanatory variates, factors etc

CONSTANT = string
How to treat constant (estimate, omit); default esti

AGGREGATION = scalar
Fixed parameter for negative binomial distribution (parameter k as in variance function var = mean + mean2/k); default 1

KLOGRATIO = scalar
Parameter for logratio link, in form log(mean / (mean + k)); default as set in AGGREGATION option

QUADESTIMATION = string
Whether to use quadratic estimation (used, notused); default used

SCALEFACTOR = string
How to calculate the scale factor (fixed, constant, varytime); default varies with distribution, fixed for Poisson, binomial and negative binomial, constant for rest

SFVALUE = scalar
Value for scale factor when SCALEFACTOR=fixed; default 1.0 for Poisson and binomial, missing for rest

CRTYPE = string
Form of correlation matrix (independence, unstructured, exchangeable, autoregressive, dependence, antedependence); default *

ORDER = scalar
Order in dependence and ante-dependence form of correlation matrix; default 1

TIMEDEPENDENT = string
Whether correlation in dependence model changes with time (no, yes); default no


Parameters

Y = variates
Response variate for each analysis

NBINOMIAL = variates or scalars
Denominator in binomial

FITTEDVALUES = variates
To store fitted values

RESIDUALS = variates
To store residuals

SUBJECT = factors
Identifier of subjects

OUTCOME = factors
Identifier of outcomes

COUNT = variates
Variate of counts of no. outcomes

TIME = factors
Times of repeated measures variate

WEIGHT = variates
Weight variate

OFFSET = variates
Offset variate

SAVE = pointers
Structure to save output variables


Description

GEE implements the General Estimating Equation (GEE) methodology of Liang & Zeger (1986) with quadratic estimation for the covariance structure. In the terminology of Liang et al. (1992) the methodology implemented is a form of GEE1. Full details of the implementation are given in Kenward & Smith (1995a). GEE, as implemented here, is a comparatively simple non-likelihood method for fitting marginal models to repeated measurements that can be used when the response has a distribution in the exponential family. This includes the Gaussian distribution, for which the procedure implemented here reduces to a form of the EM algorithm, and then produces exact ML or REML estimates, or a close approximation to these depending on the particular correlation structure chosen. For other distributions the resulting estimates are not maximum likelihood but can be shown to have asymptotic properties familiar from quasi-likelihood, such as consistency and asymptotic normality.

   The standard range of generalized linear models (as in procedure GLM) can be fitted involving a variety of covariance/correlation structures over the times of the repeated measurements. The standard links and distributions can be chosen by setting the options DISTRIBUTION, LINK, EXPONENT, AGGREGATION and KLOGRATIO, as in the MODEL directive. Non-standard ones require the definition of auxiliary procedures to carry out the necessary calculations (see the Method section). The terms in the fitted model are specified by the TERMS option, which may be set to a formula or left unset to fit a null model. The CONSTANT option can be used to omit a constant term. Setting the QUADESTIMATION option to used requests the use of quadratic estimation for the data-based covariance/correlation matrix (see Kenward & Smith 1995a). The SCALEFACTOR option specifies the form of scalefactor to be used (fixed to a value specified by the SFVALUE option, constant over times of repeated measurements, or varying over times of repeated measurements). The CRTYPE option specifies the structure of the covariance/correlation matrix over the times of the repeated measurements. The ORDER option specifies the order of the covariance/correlation structures for the dependence and ante-dependence cases, with option TIMEDEPENDENT specifying whether the correlation in a dependence structure changes with the time of the repeated measurement.

   The Y parameter must be set to specify the response variate. For a binomial distribution the NBINOMIAL parameter must also be set; this may be either a variate or a scalar (if it is a scalar, GEE maps it out automatically to a variate with the same number of values as Y). The SUBJECT parameter specifies a factor to identify the subjects. Alternatively, where the data consist of outcomes and numbers with those outcomes, the parameter OUTCOME must be set to the identifier of the outcome and the parameter COUNT to the number with the outcome. The parameter TIME must be set to the times of the repeated measurements. The parameters WEIGHT and OFFSET specify weight and offset variates that may be involved.

   The output from the procedure is controlled by the PRINT option; by default estimates, their standard errors, covariances/correlations and scalefactors are given. Two sets of standard errors are provided for the estimates. One is the naive estimate which assumes the specified covariance/correlation structure holds. The other is the sandwich estimate which makes no such assumption. The fitted values and residuals can be obtained by setting the parameters FITTEDVALUES and RESIDUALS. The residuals are the Pearson residuals as defined in the Guide to GenStat, Part 2, Section 3.1.1. The SAVE parameter can be set to a pointer containing various details of the analysis: SAVE[1] contains the scalefactor(s), SAVE[2] the correlation or covariance matrix, SAVE[3] the estimates of the linear predictor parameters, SAVE[4] the naive variance/covariance matrix and SAVE[5] the sandwich variance/covariance matrix. These structures can then be used to perform Wald tests or other calculations.

   The algorithms in the procedure have been set up assuming that the data contain a complete set of observations for each subject. Where there are missing values these must be included explicitly (using the missing value symbol *) to create a complete set of observations. Missing values are allowed in both the Y variate and the explanatory variates in TERMS.

   In the case of the Gaussian distribution, a working covariance matrix, rather than correlation matrix, is used. This provides considerable simplification within the algorithm.

   This is a complicated algorithm and some examples may take a while to run. If necessary, however, you can set option PRINT=monitoring to see what is happening.

 

Options: PRINT, DISTRIBUTION, LINK, EXPONENT, TERMS, CONSTANT, AGGREGATION, KLOGRATIO, QUADESTIMATION, SCALEFACTOR, SFVALUE, CRTYPE, ORDER, TIMEDEPENDENT.

Parameters: Y, NBINOMIAL, FITTEDVALUES, RESIDUALS, SUBJECT, OUTCOME, COUNT, TIME, WEIGHT, OFFSET, SAVE.


Method

For full details of the method implemented in this procedure see Kenward & Smith (1995a). A generalized linear model is formulated for the marginal distribution of the observations at each time point using an appropriate link function and error distribution. If the repeated measurements could be assumed to be independent, the well-known iterative weighted least squares fitting procedure could be used to obtain ML estimates of the marginal model parameters. However this ignores the dependence among the repeated measurements. Full likelihood is in general very awkward in this setting so, to avoid a formal introduction of dependence into the model, a working correlation matrix is introduced into the iterative procedure, changing the least squares from a weighted to a generalized form. The correlation matrix can be introduced in various ways. It can be held constant throughout the iterative procedure. An example of this is the use of the identity matrix, leading to the so-called independence estimating equations for which the process reduces back to that of fitting a univariate generalized linear model. Alternatively an estimated correlation matrix can be introduced into the algorithm which is updated at each cycle using quadratic estimation: essentially the correlation structure is estimated from the residuals using the equations that would be appropriate were the residuals normally distributed. On convergence consistent estimates of the marginal linear model parameters are obtained and, if the correlation structure chosen is appropriate, then this will be consistently estimated as well. It is not necessary for the correlation structure to be correct for the consistency of the marginal parameter estimates, at least when the correlation structure is fixed; indeed the common choice of independence is almost certain not to be appropriate. However the estimates of precision of the marginal parameter estimates do need to be adjusted to allow for the true correlation structure. This correction is done in the so-called "sandwich" estimator provided by the procedure.

   The procedures have been written so that it is possible to fit models other than the standard ones. An important example of such a model is the application of the GEE methodology to ordinal categorical data. This application requires the data to be arranged in a particular form (as cummulative logits) and a particular correlation matrix (specified in _GEECORRELATION). The type of analyses are explained in Kenward et al. (1994) and the methodology described in that paper has been duplicated. Further details are given in Kenward & Smith (1995b).

   An option (SCALEFACTOR) has been included that allows the user to decide whether or not the scale factor is fixed at its independence distributional default, or is estimated from the scaled residuals as in Liang & Zeger (1986), or is treated as a vector varying over time.

   GEE calls three subsidiary procedures, _GEECODI, _GEECALLIN and _GEECALDIS to assist with the analysis. There is no need for the user to modify these procedures:

_GEECODI
prints out the results of the iterative processes;

_GEECALLIN
calculates the fitted values and derivatives for various links;

_GEECALDIS
calculates the variance function and deviance for various distributions.

There are also four other procedures, which can be re-written or replaced, to cater for further user-defined distributions, links and correlation structures:

_GEEINIT
calculates initial estimates of the linear predictor in the generalized linear model;

_GEELINK
calculates fitted values and derivatives;

_GEEDISTRIBUTION
calculates the variance function and deviance;

_GEECORRELATION
calculates the correlation matrix and the sandwich matrix involving the residuals. (For the normal distribution the variance/covariance matrices are used not the correlation matrices.)

If the LINK option is unset, the procedure will call _GEEINIT and _GEELINK instead of using those for the various standard link functions. For a logit link function _GEEINIT and _GEELINK should be defined as follows.


PROCEDURE '_GEEINIT'

          "Calculation of initial estimate of linear predictor,

           link unset"

PARAMETER NAME = \

          'Y', "I: variate; response variate"\

          'LINEARPREDICTOR',"O: variate; linear predictor"\

          'OFFSET', "I: variate; offset"\

          'NBINOMIAL'; "I: variate; denominator of binomial"\

          SET=3(yes),no;TYPE=4('variate'); \

          COMPATIBLE=*,3(!T(type,nvalues,restriction));\

          PRESENT=yes,no,2(yes)


CALC LINEARPREDICTOR = LOG((Y+0.5)/(NBINOMIAL-Y+0.5)) - OFFSET

ENDPROCEDURE


PROCEDURE '_GEELINK'

          "Calculation of fitted values and derivatives"

PARAMETER NAME = \

          'LINEARPREDICTOR', "I: variate; linear predictor"\

          'FITTEDVALUES', "O: variate; estimate of fitted values"\

          'DERIVATIVES', "O: variate; estimate of derivatives"\

          'OFFSET', "I: variate; offset"\

          'NBINOMIAL'; "I: variate; denominator of binomial"\

          SET=4(yes),no;TYPE=5('variate'); \

          COMPATIBLE=*,4(!T(type,nvalues,restriction));\

          PRESENT=yes,2(no),2(yes)


GETATTRIBUTE [ATTRIBUTE=NVALUES] LINEARPREDICTOR; SAVE=!P(nobs)


CALC FITTEDVALUES = NBINOMIAL/(1+EXP(-LINEARPREDICTOR - OFFSET))

& DERIVATIVES = 1/FITTEDVALUES+1/(NBINOMIAL-FITTEDVALUES)

ENDPROCEDURE


If the DISTRIBUTION option is unset, the procedure will call _GEEDISTRIBUTION instead of using one of the various standard distributions. For a binomial error distribution _GEEDISTRIBUTION should be defined as follows.


PROCEDURE '_GEEDISTRIBUTION'

          " Calculation of variance function and deviance"

PARAMETER NAME = \

          'Y', "I: variate; response variate"\

          'FITTEDVALUES',"I: variate; fitted values"\

          'VARIANCE', "O: variate; variance"\

          'DEVIANCE', "O: scalar; total deviance"\

          'NBINOMIAL'; "I: variate; denominator of binomial"\

          SET=4(yes),no;TYPE=3('variate'),'scalar','variate'; \

          COMPATIBLE=*,2(!T(type,nvalues,restriction)),*,\

                     !T(type,nvalues,restriction); \

          PRESENT=2(yes),2(no),yes


CALC VARIANCE = FITTEDVALUES*(NBINOMIAL-FITTEDVALUES)/NBINOMIAL

& DEVIANCE = -2*LLB(Y;NBINOMIAL;(FITTEDVALUES/NBINOMIAL))

ENDPROCEDURE


If the CRTYPE option is unset, the procedure will call _GEECORRELATION instead of using one of the various standard correlation models. For the independence model _GEECORRELATION should be defined as follows. Kenward & Smith (1995b) describe how _GEECORRELATION should be set up for analysing repeated ordinal categorical data.


PROCEDURE '_GEECORRELATION'

            "

              Calculation of correlation matrix


              For SANDWICH = NO

                  input is the R matrix as for UNSPECIFIED

                  output is the desired R matrix.

              For SANDWICH = YES

                  input is the (Y-MU)*T(Y-MU) matrix

                  output is the desired modified (Y-MU)*T(Y-MU) matrix.


              N.B. For the normal distribution both the input and

                   output R's should be variance/covariance matrices

                   not correlation matrices.

            "

OPTION NAME = \

            'CONSTANT', "I: text; how to treat constant (estimate,

                        omit); default e"\

            'SANDWICH'; "I; text; whether the sandwich central matrix

                        product or not) (no,yes); default no"\

            MODE=2(T);NVALUES=2(1); \

            VALUES=!T(ESTIMATE,OMIT),!T(NO,YES); \

            DEFAULT=!T(ESTIMATE),!T(NO);


PARAMETER NAME = \

          'CORRELATIONS',"I/O: matrix; the correlation matrix"\

          'ESTIMATES', "I: variate; estimates of parameters in

                         model"\

          'Y', "I: variate; response variate"\

          'RESIDUALS', "I: variate; residuals"\

          'FITTEDVALUES',"I: variate; fitted values"\

          'TIME', "I: variate; times of repeated measures"\

          'MARKER', "I: factor; identifier of subject or outcome"\

          'DISTRIBUTION',"I: text; identifier of distribution"\

          'SCALEFACTOR', "I: text; scalefactor option in use"\

          'SFVALUE'; "I: scalar; value of scalefactor if FIXED"\

          SET=10(yes);DECLARED=10(yes); \

          TYPE='symmetric',5('variate'),'factor',2('text'),'scalar'; \

          PRESENT=9(yes),no


GETATTRIBUTE [ATTRIBUTE=NVALUES] ESTIMATES; SAVE=!P(ncol)

 & [ATTRIBUTE=NROWS] CORRELATIONS; SAVE=!P(ntime)


DIAGONALMATRIX [ROWS=ntime;MODIFY=yes] done,wkdm; \

               VALUES=!(#ntime(1)),*


CALC const = 'ESTIMATE' .IN. CONSTANT

 & sandw = 'NO' .IN. SANDWICH


IF sandw

"

  SCALEFACTOR is as in GEE i.e. FIXED means fixed to SFVALUE

  CONSTANT means the scalefactor is estimated but constant

  across time, and VARYTIME means the scalefactor is estimated

  and varies across time.


  The variate TIME in this PROCEDURE represents the 1...ntime

  distinct times, it is not a FACTOR of length nobs as in GEE.

  It is the levels of the parameter TIME of GEE.

"

  IF DISTRIBUTION.EQS.'NORMAL'

    IF SCALEFACTOR.NES.'VARYTIME'

      IF SCALEFACTOR.EQS.'FIXED'

        CALC wkdm = SFVALUE

      ELSE

        CALC wkdm = TRACE(CORRELATIONS)/ntime

      ENDIF

    ELSE

      CALC wkdm = CORRELATIONS

    ENDIF

    CALC CORRELATIONS = 0 + wkdm

  ELSE

    CALC CORRELATIONS = done

  ENDIF

ENDIF

ENDPROCEDURE


If LINK, DISTRIBUTION or CRTYPE are unset, but no user routines are given for _GEEINIT, _GEELINK, _GEEDISTRIBUTION and _GEECORRELATION, then those given here (for logit link, binomial error distribution and independence) will be used.


Action with RESTRICT

Input structures must not be restricted, and any existing restrictions will be cancelled.


References

Kenward, M.G., Lesaffre, E. & Molenberghs, G. (1994). An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics, 50, 945-953.

Kenward, M.G. & Smith, D.M. (1995a). Computing the generalized estimating equations for repeated measurements. Genstat Newsletter, 32, 50-62.

Kenward, M.G. & Smith, D.M. (1995b). Computing the generalized estimating equations for repeated ordinal, categorical measurements. Genstat Newsletter, 32, 63-70.

Liang, K.-Y. & Zeger,S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.

Liang, K.-Y., Zeger, S.L. & Qaqish, B. (1992). Multivariate regression analyses for categorical data (with discussion). Journal of the Royal Statistical Society, Series B, 54, 3-40.