R0INFLATED procedure

Fits zero-inflated regression models to count data with excess zeros (D.A. Murray).


Options

PRINT = string
Controls printed output (model, summary, estimates, fittedvalues, monitoring); default mode, summ, esti

DISTRIBUTION = string
Distribution of response variable (poisson, negativebinomial); default pois

METHOD = string
Method used for model fitting (em, conditional); default em

CONSTANT = string
How to treat constant for count state (estimate, omit); default esti

ZCONSTANT = string
How to treat constant for zero-inflation state (estimate, omit); default esti

XTERMS = formula
List of explanatory variates and factors, or model formula for count state of model

ZTERMS = formula
List of explanatory variates and factors, or model formula for zero-inflation state of model

WEIGHTS = variate
Variate of weights for weighted zero-inflated regression (Lambert model only)

OFFSET = variate
Offset variate to be used in the model (Lambert model only)

MAXCYCLE = scalar
Maximum number of iterations for EM algorithm; default 100

TOLERANCE = scalar or variate
Convergence criteria for the EM algorithm, for the aggregation parameter k, and for the generalized linear models; default !(1.E-4, 1.E-4, 1.E-4)


Parameters

Y = variates
Response variate

RESIDUALS = variates
Saves the standardized residuals

FITTEDVALUES = variates
Saves the fitted values

ESTIMATES = variates
Saves the estimates of the parameters

SE = variates
Saves the standard errors of the estimates

RSAVE = identifiers
Saves the regression structure for the final generalized linear model fitted for the count model

ZSAVE = identifiers
Saves the regression structure for the final binomial regression fitted for the zero-inflation model


Description

R0INFLATED can be used to fit zero-inflated regression models to count data with excess zeros. The procedure allows the data to be modelled using two different approaches. The first possibility is to fit a zero-inflated Poisson regression model (ZIP) or a zero-inflated negative binomial regression model (ZINB) using an EM algorithm (Lambert 1992). In this analysis, the response variable of counts is assumed to be distributed as a mixture of a count distribution (such as the Poisson) and a degenerate distribution at zero. In these models, a generalized linear model (Poisson or negative binomial) with a log link is used for the count model, and a binomial model with a logit link for the zero-inflation model. The alternative is to fit the conditional model of Welsh et al. (1996), which assumes that the data are in one of two states: a state where zeros are observed, or a state where counts are recorded. A binomial model with a logit link is used for the zero state, and a truncated Poisson or truncated negative binomial model is used for the count state.

   The response variable is supplied, in a variate, using the Y parameter. The XTERMS and ZTERMS options each specify a formula, to describe the count model and the zero-inflation model respectively. The CONSTANT and ZCONSTANT options control whether a constant parameter is included in the count and zero-inflation models.

   The METHOD option specifies the type of model to fit: the em setting fits the ZIP and ZINB mixture models, and the conditional setting fits the conditional model. The DISTRIBUTION option specifies the distribution for the count model. Note that a log link is always used for the count model.

   The ESTIMATES and SE parameters save the parameter estimates and their standard errors. R0INFLATED puts them into variates, using the same order as in the display produced by the PRINT option. The standardized residuals and fitted values can be saved using the RESIDUALS and FITTEDVALUES parameters.

   The RSAVE and ZSAVE parameters allow you to specify identifiers for the regression save structures for the count and zero-inflation states of the model. These structures store the final state of the regression models fitted. Note that the standard errors for the parameter estimates in the regression save structures will not be correct and should instead be obtained using the SE parameter or by the R0KEEP procedure.

   For the Lambert models, the WEIGHTS option can specify a variate holding weights for each unit, and the OFFSET option allows you to include an offset (i.e. a variable in the regression model with a regression coefficient fixed at one).

   The PRINT option controls printed output, with settings:

    model
gives a description of the model, including response and explanatory variates for count and zero-inflation models;

    summary
displays minus twice the log-likelihood, the Akaike information criterion (AIC) and the Schwarz (Bayesian) information criterion (BIC or SIC);

    estimates
gives the estimates of the parameters in the model with standard errors based on the asymptotic variance-covariance matrix derived from the inverse of the observed Fisher information matrix;

    fittedvalues
displays a table of unit labels, values of response variate, fitted values and standardized residuals;

    monitoring
displays monitoring information of the iterative algorithm.
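   As an illustration of the summary statistics, the following Python sketch computes minus twice the log-likelihood, AIC and BIC from their usual textbook definitions. The function name, its arguments and the choice of the number of units as the sample size in BIC are assumptions made for this example; the exact conventions used inside R0INFLATED are not stated here.

```python
import math


def information_criteria(loglik, n_params, n_units):
    """Return -2 log-likelihood, AIC and BIC using the standard definitions.

    loglik   : maximized log-likelihood of the fitted model
    n_params : number of estimated parameters (assumed to count every
               parameter in both parts of the model)
    n_units  : number of observations (assumed sample size for BIC)
    """
    minus2loglik = -2.0 * loglik
    return {
        "minus2loglik": minus2loglik,
        "aic": minus2loglik + 2.0 * n_params,
        "bic": minus2loglik + n_params * math.log(n_units),
    }


# Hypothetical values, for illustration only
criteria = information_criteria(loglik=-150.0, n_params=2, n_units=100)
print(criteria)
```

   Smaller values of AIC or BIC indicate a better trade-off between fit and the number of parameters when comparing models fitted to the same data.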

   The iterative process for the EM algorithm is controlled by the MAXCYCLE option which defines the maximum number of cycles, and the TOLERANCE option which sets convergence criteria. The EM algorithm cycle stops when successive values of the log-likelihood are within a tolerance set by the first element of the TOLERANCE option. The second and third elements of TOLERANCE control the convergence criterion for the aggregation parameter (k) for the negative binomial model and for the generalized linear model, respectively.

 

Options: PRINT, DISTRIBUTION, METHOD, CONSTANT, ZCONSTANT, XTERMS, ZTERMS, WEIGHTS, OFFSET, MAXCYCLE, TOLERANCE.

Parameters: Y, RESIDUALS, FITTEDVALUES, ESTIMATES, SE, RSAVE, ZSAVE.


Method

The zero-inflated Poisson regression model has the distribution

          Pr(Y=y) = w + (1 - w) × exp(-lam)                 for y=0

          Pr(Y=y) = (1 - w) × exp(-lam) × lam^y / y!        for y>0

where lam and w depend on covariates.
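   The zero-inflated Poisson probabilities can be checked numerically with a short Python sketch. The function name and the example values of lam and w are assumptions chosen for illustration; this is not the code used by the procedure.

```python
import math


def zip_pmf(y, lam, w):
    """Zero-inflated Poisson probability Pr(Y = y).

    lam : Poisson mean of the count component (must be > 0)
    w   : probability of an extra (structural) zero, 0 <= w < 1
    """
    # Ordinary Poisson probability, computed on the log scale for stability
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return w + (1.0 - w) * poisson
    return (1.0 - w) * poisson


# The probabilities should sum to one (the tail beyond 200 is negligible here)
total = sum(zip_pmf(y, lam=2.5, w=0.3) for y in range(200))
print(total)
```

   The probability of a zero is inflated from exp(-lam) to w + (1 - w) × exp(-lam), while every positive count has its Poisson probability scaled by (1 - w).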

   Similarly, the zero-inflated negative binomial regression model has the distribution

          Pr(Y=y) = w + (1 - w) × (1 + lam/k)^(-k)          for y=0

          Pr(Y=y) = (1 - w) × Gamma(y + k) / (y! × Gamma(k))
                    × (1 + lam/k)^(-k) × (1 + k/lam)^(-y)   for y>0

where lam and w depend on covariates, and k>0 is a scalar.
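   A similar Python sketch evaluates the zero-inflated negative binomial probabilities, with the negative binomial component parameterized by its mean lam and aggregation parameter k. The function name and the example values are assumptions for illustration only.

```python
import math


def zinb_pmf(y, lam, w, k):
    """Zero-inflated negative binomial probability Pr(Y = y).

    lam : mean of the negative binomial component (must be > 0)
    w   : probability of an extra (structural) zero, 0 <= w < 1
    k   : aggregation parameter of the negative binomial (must be > 0)
    """
    # log of Gamma(y + k) / (y! Gamma(k)) * (1 + lam/k)^(-k) * (1 + k/lam)^(-y)
    log_nb = (
        math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
        - k * math.log1p(lam / k)
        - y * math.log1p(k / lam)
    )
    negbin = math.exp(log_nb)
    if y == 0:
        return w + (1.0 - w) * negbin
    return (1.0 - w) * negbin


# The probabilities should sum to one (the tail beyond 2000 is negligible here)
total = sum(zinb_pmf(y, lam=2.5, w=0.3, k=1.5) for y in range(2000))
print(total)
```

   As k becomes large, the negative binomial component approaches a Poisson distribution with mean lam, so the ZINB probabilities approach the ZIP probabilities.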

   For both the Poisson and negative binomial the following models are assumed:

log(lam) = X b

and    log(w/(1-w)) = G z

where X and G are covariate matrices and b and z are vectors of unknown parameters. The maximum likelihood estimates for b, z and k are then obtained using an EM algorithm (Lambert 1992).
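   The idea of the EM algorithm can be sketched in Python for the simplest case, a zero-inflated Poisson model with no covariates, so that lam and w are single constants and the M-step has a closed form. This is a deliberately reduced illustration under that assumption: with covariates, each M-step would instead fit the two regression models described above. The function names, starting values and example data are all hypothetical.

```python
import math


def zip_loglik(ys, lam, w):
    """Log-likelihood of an intercept-only zero-inflated Poisson model."""
    loglik = 0.0
    for y in ys:
        if y == 0:
            loglik += math.log(w + (1.0 - w) * math.exp(-lam))
        else:
            loglik += (math.log(1.0 - w) - lam + y * math.log(lam)
                       - math.lgamma(y + 1))
    return loglik


def fit_zip_em(ys, maxcycle=100, tolerance=1e-4):
    """Fit constant lam and w by EM; returns (lam, w, loglik, cycles used)."""
    n = len(ys)
    n_zero = sum(1 for y in ys if y == 0)
    if n_zero == 0 or n_zero == n:
        raise ValueError("this sketch needs both zero and positive counts")
    # Arbitrary starting values: half the zeros are treated as extra zeros
    w = 0.5 * n_zero / n
    lam = sum(ys) / (n - n_zero)
    old = zip_loglik(ys, lam, w)
    for cycle in range(1, maxcycle + 1):
        # E-step: probability that an observed zero is an extra zero
        p_extra = w / (w + (1.0 - w) * math.exp(-lam))
        z_total = n_zero * p_extra  # expected number of extra zeros
        # M-step: closed-form updates in the intercept-only case
        w = z_total / n
        lam = sum(ys) / (n - z_total)
        new = zip_loglik(ys, lam, w)
        # Stop when successive log-likelihoods agree to within the tolerance
        if abs(new - old) < tolerance:
            return lam, w, new, cycle
        old = new
    return lam, w, old, maxcycle


# Hypothetical data: 100 counts, 60 of which are zero
counts = [0] * 60 + [1] * 10 + [2] * 12 + [3] * 9 + [4] * 5 + [5] * 3 + [6] * 1
lam, w, loglik, cycles = fit_zip_em(counts)
print(lam, w, loglik, cycles)
```

   The maxcycle and tolerance arguments play the same role here as the MAXCYCLE option and the first element of the TOLERANCE option: they cap the number of cycles and define when successive log-likelihood values are close enough to stop.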

   The standard errors for the parameter estimates are derived using the incomplete data observed information matrix as proposed by Lambert (1992).

   In the Poisson case of the conditional model, a count y_i in the count state has a truncated Poisson distribution with parameter lam(z), for which the probability of a zero is p(0)=0. So the probability model is

          Pr(Y=0|x)   = 1 - p(x)

          Pr(Y=r|x,z) = p(x) × exp(-lam(z)) × lam(z)^r
                        / (r! × (1 - exp(-lam(z)))), for r=1, 2, ...
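   The Poisson case of the conditional model can be sketched in Python as follows, treating p and lam as given numbers rather than functions of covariates. The function name and example values are assumptions for illustration.

```python
import math


def conditional_poisson_pmf(r, p, lam):
    """Conditional (two-state) Poisson model probability Pr(Y = r).

    p   : probability of being in the count state, 0 < p <= 1
    lam : parameter of the zero-truncated Poisson distribution (> 0)
    """
    if r == 0:
        return 1.0 - p
    # log of exp(-lam) lam^r / (r! (1 - exp(-lam)));
    # -expm1(-lam) evaluates 1 - exp(-lam) accurately for small lam
    log_truncated = (-lam + r * math.log(lam) - math.lgamma(r + 1)
                     - math.log(-math.expm1(-lam)))
    return p * math.exp(log_truncated)


# Positive counts share total probability p, so everything sums to one
positive = sum(conditional_poisson_pmf(r, p=0.4, lam=2.3) for r in range(1, 200))
print(positive, conditional_poisson_pmf(0, p=0.4, lam=2.3))
```

   Unlike the mixture model, all zeros here come from the zero state, so the binomial part and the truncated count part can be fitted separately.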

   For the negative binomial case, a count y_i in the count state has a truncated negative binomial distribution with parameters lam(z) and k, for which the probability of a zero is p(0)=0. So the probability model is

          Pr(Y=0|x)   = 1 - p(x)

          Pr(Y=r|x,z) = p(x) × Gamma(r + 1/k) / (r! × Gamma(1/k))
                        × (k × lam(z))^r × (1 + k × lam(z))^(-(r + 1/k))
                        × (1 - (1 + k × lam(z))^(-1/k))^(-1), for r=1, 2, ...

where k is the extra-variation parameter in the untruncated negative binomial distribution.
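   The negative binomial case of the conditional model can be sketched in the same way, again treating p, lam and k as given numbers. The function name and example values are assumptions for illustration.

```python
import math


def conditional_negbin_pmf(r, p, lam, k):
    """Conditional (two-state) negative binomial model probability Pr(Y = r).

    p   : probability of being in the count state, 0 < p <= 1
    lam : mean of the untruncated negative binomial (> 0)
    k   : extra-variation parameter of the untruncated distribution (> 0)
    """
    if r == 0:
        return 1.0 - p
    k_lam = k * lam
    # log of the untruncated negative binomial probability of r
    log_negbin = (
        math.lgamma(r + 1.0 / k) - math.lgamma(1.0 / k) - math.lgamma(r + 1)
        + r * math.log(k_lam)
        - (r + 1.0 / k) * math.log1p(k_lam)
    )
    # Probability of zero under the untruncated distribution: (1 + k lam)^(-1/k)
    prob_zero = math.exp(-math.log1p(k_lam) / k)
    return p * math.exp(log_negbin) / (1.0 - prob_zero)


# Positive counts share total probability p, so everything sums to one
positive = sum(conditional_negbin_pmf(r, p=0.4, lam=2.0, k=0.5)
               for r in range(1, 2000))
print(positive, conditional_negbin_pmf(0, p=0.4, lam=2.0, k=0.5))
```

   Dividing by one minus the untruncated probability of zero rescales the negative binomial probabilities so that the positive counts alone carry total probability p.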

   For both conditional models the zero component is fitted using a logistic generalized linear model. The truncated Poisson model is fitted using an iteratively re-weighted least squares algorithm (see Welsh et al. 1996). The truncated negative binomial model is fitted using FITNONLINEAR.


Action with RESTRICT

If a parameter is restricted the statistics will be calculated using only those units included in the restriction.


References

Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34, 1-14.

Ridout, M., Demetrio, C.G.B. & Hinde, J. (1998). Models for count data with many zeros. International Biometrics Conference, Cape Town.

Welsh, A.H., Cunningham, R.B., Donnelly, C.F. & Lindenmayer, D.B. (1996). Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88, 297-308.