BOOTSTRAP procedure

Produces bootstrapped estimates, standard errors and distributions (P.W. Lane).


Options

PRINT = string
Controls printed output (estimates, graphs, vcovariance); default esti

DATA = variates, factors or texts
Data vectors from which the statistics are to be calculated; no default

AUXILIARY = pointers
Further sets of data vectors, each set to be resampled independently

ANCILLARY = any type
Other relevant information needed to calculate the statistics

NTIMES = scalar
Number of times to resample; default 100

SEED = scalar
Seed for random number generator; default continue from previous generation or use system clock

GRAPHICS = string
Type of graphics (lineprinter, highresolution); default high

PROBABILITY = scalar
Probability level for confidence interval; default 0.95

METHOD = string
What type of bootstrapping to use (random, balance, permute); default rand

BLOCKSTRUCTURE = formula
Block structure to use for random permutations

CIMETHOD = string
What type of confidence intervals to provide (bca, percentile); default perc

VCOVARIANCE = symmetric matrix
Saves the variance-covariance matrix of the statistics


Parameters

LABEL = texts
Texts, each containing a single line, to label the statistics; default 'Statistic'

ESTIMATE = scalars
Saves the bootstrap mean for each statistic

SE = scalars
Saves the bootstrap standard error for each statistic

LOWER = scalars
Saves the bootstrap lower confidence limit for each statistic

UPPER = scalars
Saves the bootstrap upper confidence limit for each statistic

STATISTIC = variates
Saves the series of bootstrap estimates of each statistic

WINDOW = scalars
Graphical window to use for displaying bootstrap distribution for each statistic; default 4

SCREEN = strings
Whether to clear graphical frame or draw on top (clear, keep); default clea


Description

The bootstrap is a method of providing distributional information, such as standard errors, about statistical estimates without making precise distributional assumptions about the data. It can also provide estimates with reduced bias. This is achieved by "resampling" from the data; that is, generating new data sets by sampling with replacement from the data set being investigated. A good introduction to the bootstrap is given by Efron & Tibshirani (1986); a fuller treatment can be found in Efron & Tibshirani (1993).

   The BOOTSTRAP procedure can be used for any statistic or set of statistics that can be calculated by GenStat from one or more data matrices. You need to provide a procedure called RESAMPLE which calculates the statistics from the data, as explained in the Method section. There are also several examples of RESAMPLE in the standard examples, which can be extracted by the commands:

LIBEXAMPLE 'BOOTSTRAP'; EXAMPLE=Ex
PRINT Ex; JUSTIF=left

The options and parameters of RESAMPLE must not be changed. The body of the procedure should store the required statistics in scalars called STATISTIC[1...s], calculating them from the variates, factors and texts called DATA[1...d], where each of s and d can be any positive integer. The EXIT parameter of RESAMPLE should be set to indicate when any of the calculations fail, as can sometimes happen if degenerate data sets are generated (see Example 3).

   The data for BOOTSTRAP are provided as a list of vectors (variates, factors or texts) using the DATA option. From this, the procedure will generate new data by resampling from the set of units: all the vectors must have the same length, and each new sample uses the same set of units for all vectors. The procedure RESAMPLE is then called to calculate the statistics.
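
   The way one set of resampled units is shared by all the data vectors can be illustrated outside GenStat. The following is a minimal Python sketch of the idea, not the procedure's own code; the function name resample_units is hypothetical, and the values are the first five units of the Y and Z variates used in the Method section.

```python
import random

def resample_units(vectors, rng):
    """Draw one bootstrap sample, using the same resampled units for every vector."""
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all data vectors must have the same length")
    # Sample unit numbers with replacement, then apply them to each vector.
    units = [rng.randrange(n) for _ in range(n)]
    return [[v[i] for i in units] for v in vectors]

rng = random.Random(77320)
y = [576, 635, 558, 578, 666]
z = [3.39, 3.30, 2.81, 3.03, 3.44]
y_star, z_star = resample_units([y, z], rng)
# Each resampled (y, z) pair is one of the original pairs: units stay intact.
```

   Because the unit numbers are drawn once and applied to every vector, relationships between the vectors (such as a correlation) are preserved within each resampled unit.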

   Extra information required in procedure RESAMPLE to calculate the statistics, which is not to be resampled along with the data matrix, can be passed as a list of data structures using the ANCILLARY option of BOOTSTRAP (see Examples 2 and 3).

   The procedure can also deal with statistics calculated from several independent data matrices. For example, the difference in means between two independent samples must be dealt with by resampling independently from each sample, which may have different numbers of observations. In this case, one data matrix is specified as a list of vectors using the DATA option as usual, and the second data matrix is specified as a pointer using the AUXILIARY option. This option may be set to any number of pointers, each storing a list of vectors; resampling is done independently for each set of vectors (see Example 4).
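
   Independent resampling from two samples of different sizes can be sketched as follows. This is an illustrative Python sketch rather than GenStat code; the function name bootstrap_mean_difference and the data values are made up for the illustration.

```python
import random
from statistics import mean

def bootstrap_mean_difference(sample_a, sample_b, ntimes, rng):
    """Bootstrap the difference in means, resampling each sample independently."""
    diffs = []
    for _ in range(ntimes):
        # Each sample keeps its own size and gets its own resampled units.
        a_star = rng.choices(sample_a, k=len(sample_a))
        b_star = rng.choices(sample_b, k=len(sample_b))
        diffs.append(mean(a_star) - mean(b_star))
    return diffs

sample_a = [5.1, 4.8, 6.0, 5.5]            # plays the role of the DATA vector
sample_b = [4.2, 4.9, 3.8, 4.4, 4.1, 4.6]  # plays the role of an AUXILIARY set
diffs = bootstrap_mean_difference(sample_a, sample_b, 100, random.Random(1))
```

   Resampling the two samples jointly would be wrong here, since the units of one sample have no correspondence with the units of the other.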

   The option NTIMES specifies how many times the resampling is carried out. The default value is 100, which has been found by many users of the bootstrap to be sufficient for producing standard errors and bias-reduced estimates. However, the number should be increased to get reliable distributional information: 1000 or more may be needed for reliable 95% confidence limits.

   Printed output is controlled by the PRINT option, with settings estimates for the estimates and their standard errors and confidence limits, and vcovariance for the variance-covariance matrix. The graphs setting draws a histogram of the bootstrap distribution of each statistic. The default setting is just estimates.

   A label should be provided for each statistic, using the LABEL parameter; by default, bootstrapping will be done for a single statistic which will be labelled simply as Statistic. The estimates and their standard errors can be saved by the ESTIMATE and SE parameters. Also, a variance-covariance matrix of the estimates can be saved using the VCOVARIANCE option. The number of labels, s say, must match the number of statistics, called STATISTIC[1...s], calculated in your version of the RESAMPLE procedure.

   The parameters LOWER and UPPER allow confidence limits for each statistic to be saved, with the probability level specified in the PROBABILITY option (default 0.95, i.e. 95% confidence intervals). By default the intervals are constructed as percentiles of the empirical distribution of the bootstrap estimates. However, provided there are no auxiliary data vectors, you can request bias-corrected and accelerated limits instead by setting option CIMETHOD=bca (see Efron & Tibshirani, 1993, Section 14.3). The full sets of bootstrap estimates can be saved by setting the STATISTIC parameter; each variate will contain n values, where n is the setting of the NTIMES option.
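
   The percentile construction can be sketched as below. This is a Python illustration only: it uses a simple nearest-rank rule for the quantiles, which is an assumption and may differ slightly from the quantile estimates that GenStat calculates; the function name percentile_interval is hypothetical.

```python
import math

def percentile_interval(estimates, probability=0.95):
    """Percentile confidence limits from a series of bootstrap estimates."""
    ordered = sorted(estimates)
    n = len(ordered)
    alpha = (1.0 - probability) / 2.0
    # Nearest-rank quantiles; rounding guards against floating-point noise
    # such as 25.000000000000004 being pushed up to rank 26.
    lower_rank = math.ceil(round(alpha * n, 9))
    upper_rank = math.ceil(round((1.0 - alpha) * n, 9))
    lower = ordered[max(lower_rank, 1) - 1]
    upper = ordered[min(upper_rank, n) - 1]
    return lower, upper

# With 1000 estimates, the 95% limits are taken near the 25th and 975th
# ordered values, which is why NTIMES=100 leaves only two or three values
# beyond each limit.
lower, upper = percentile_interval(range(1, 1001), probability=0.95)
```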

   Three methods of bootstrapping are provided. By default, resampling is completely pseudo-random, using GenStat's random-number generator. The generator can be initialized by setting option SEED, thereby producing reproducible results; otherwise, it continues from the previous generation or, if there has been none, is initialized from the system clock. The second method is balanced bootstrapping, requested by setting METHOD=balance. In this case, the resampling is constrained to ensure that each unit of the data matrix occurs the same number of times in the complete set of generated samples (see Examples 3 and 4). The third method, specified by METHOD=permute, is simply to permute the units of the data matrix. Note that this method gives no variation in results if the statistics are independent of the order of the data, like the sample mean. However, this method provides permutation tests, a type of randomization test that can be applied to grouped data (see Example 4). When METHOD=permute, you can set the BLOCKSTRUCTURE option to a model formula to define how the randomization is to be done (see the RANDOMIZE directive for details).
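
   The remark about order-independent statistics can be illustrated with a short Python sketch (not GenStat code). It assumes that the group labels are held fixed, in the way that ancillary information is, while the data values are permuted; the function names and data values are hypothetical.

```python
import random
from statistics import mean

def group_mean_difference(values, groups):
    """Difference between the mean of group 1 and the mean of group 2."""
    first = [v for v, g in zip(values, groups) if g == 1]
    second = [v for v, g in zip(values, groups) if g == 2]
    return mean(first) - mean(second)

def permutation_statistics(values, groups, ntimes, rng):
    """Permute the data values ntimes, keeping the group labels fixed."""
    overall_means, differences = [], []
    for _ in range(ntimes):
        permuted = list(values)
        rng.shuffle(permuted)
        overall_means.append(mean(permuted))                    # never varies
        differences.append(group_mean_difference(permuted, groups))  # varies
    return overall_means, differences

values = [5.1, 4.8, 6.0, 5.5, 4.2, 4.9, 3.8, 4.4]
groups = [1, 1, 1, 1, 2, 2, 2, 2]
overall_means, differences = permutation_statistics(values, groups, 200, random.Random(3))
```

   The overall mean is the same for every permutation, whereas the difference between group means changes, and it is the distribution of the latter that a permutation test uses.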

   If the graphs setting of the PRINT option is used, the procedure will display the distribution of each set of bootstrap estimates as a histogram. By default, this will be a high-resolution plot on the current device, but the GRAPHICS option can be set to line to produce a line-printer histogram. In a high-resolution plot, the histogram is enhanced with a smoothed line, giving a clearer indication of the distribution of the statistic. By default, the display for the statistics will appear in graphical window 4, one at a time (this window is set by default to fill the whole graphical frame). However, the WINDOW and SCREEN parameters can be set to arrange for concurrent displays of the statistics in differently sized windows.

Options: PRINT, DATA, AUXILIARY, ANCILLARY, NTIMES, SEED, GRAPHICS, PROBABILITY, METHOD, BLOCKSTRUCTURE, CIMETHOD, VCOVARIANCE.

Parameters: LABEL, ESTIMATE, SE, LOWER, UPPER, STATISTIC, WINDOW, SCREEN.


Method

Samples are generated by scaling uniform random numbers produced by the URAND function. For the balanced bootstrap, a list of repeated unit numbers is sorted into random order and used one block at a time. For the permutation test, the RANDOMIZE directive is used to re-order the data at random.
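
   The balanced scheme described above can be sketched in Python as follows. This is an illustration of the idea under the stated description, not the procedure's GenStat code; the function name balanced_bootstrap_indices is hypothetical.

```python
import random
from collections import Counter

def balanced_bootstrap_indices(n_units, ntimes, rng):
    """Unit numbers for ntimes samples, each unit used exactly ntimes overall."""
    # Repeat every unit number ntimes, put the whole list into random order,
    # then use it one block of n_units at a time.
    pool = [unit for unit in range(n_units) for _ in range(ntimes)]
    rng.shuffle(pool)
    return [pool[b * n_units:(b + 1) * n_units] for b in range(ntimes)]

samples = balanced_bootstrap_indices(n_units=15, ntimes=100, rng=random.Random(77320))
counts = Counter(unit for sample in samples for unit in sample)
# Every unit appears 100 times across the 100 samples, although any single
# sample may still contain a unit several times or not at all.
```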

   BOOTSTRAP needs a subsidiary procedure RESAMPLE to calculate the statistics of interest. RESAMPLE has an option, DATA, which is used to supply the data vectors (variates, factors or texts) from which the statistics are to be calculated. Other relevant information can be supplied through the AUXILIARY and ANCILLARY options, which correspond to the AUXILIARY and ANCILLARY options of BOOTSTRAP itself. There are two parameters: STATISTIC supplies a list of scalars to store the estimates of each statistic, and EXIT supplies a list of scalars, each of which should be set to zero if the corresponding statistic was estimated successfully from the supplied data vectors, or to one if the calculation failed. If the value of EXIT is not calculated in RESAMPLE, the BOOTSTRAP procedure assumes that the calculations succeeded.

   This example shows a version of RESAMPLE which calculates the correlation between two variates.

PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
OPTION 'DATA', " (I: variates, factors or texts) data
                         vectors from which to calculate
                         the statistics; no default"\
          'AUXILIARY', " (I: pointers) auxiliary sets of data
                         vectors, each of which is to be
                         resampled independently"\
          'ANCILLARY'; " (I: any type of structure) other
                         relevant information needed to
                         calculate the statistics "\
          MODE=p; TYPE=!t(variate,factor,text),'pointer',*;\
          SET=yes,no,no; LIST=yes; DECLARED=yes; PRESENT=yes
PARAMETER 'STATISTIC', " (O: scalars) to save the calculated
                         statistics "\
          'EXIT'; " (O: scalars) to save an exit code
                         to indicate failure (EXIT[i]=1) or
                         success (EXIT[i]=0) when calculating
                         each STATISTIC[i]"\
          MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = CORRELATION(DATA[1]; DATA[2])
  & EXIT[1] = STATISTIC[1]==C('missing')
ENDPROCEDURE

VARIATE [VALUES=576,635,558,578,666,580,555,661,651,605, \
  653,575,545,572,594] Y
& [VALUES=3.39,3.30,2.81,3.03,3.44,3.07,3.00,3.43,3.36,3.13,\
  3.12,2.74,2.76,2.88,2.96] Z

BOOTSTRAP [DATA=Y,Z; SEED=77320] 'Correlation'

   The RESAMPLE procedure is called within a loop, and the statistics that are returned are loaded into variates. If any statistics fail to be calculated, as recorded by the EXIT parameter of RESAMPLE, they are stored as missing values. BOOTSTRAP will then base its estimation on the successful generations, but reports how many failures occurred.
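
   The handling of failed calculations can be sketched in Python as follows. This is an illustration rather than the procedure's own code: the function names are hypothetical, None stands in for a GenStat missing value, and a deliberately tiny made-up data set is used so that degenerate samples (all units identical, correlation undefined) occur often.

```python
import math
import random
from statistics import mean, stdev

def correlation_or_missing(x, y):
    """Return (statistic, exit_code); exit_code 1 marks a failed calculation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None, 1  # degenerate sample: the correlation is undefined
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy), 0

def bootstrap_correlation(x, y, ntimes, rng):
    """Bootstrap mean and s.e. from the successful generations only."""
    n = len(x)
    stored = []
    for _ in range(ntimes):
        units = [rng.randrange(n) for _ in range(n)]
        stat, exit_code = correlation_or_missing(
            [x[i] for i in units], [y[i] for i in units])
        stored.append(None if exit_code else stat)  # failures kept as missing
    successful = [s for s in stored if s is not None]
    failures = len(stored) - len(successful)
    return mean(successful), stdev(successful), failures

estimate, se, failures = bootstrap_correlation(
    [1, 2, 3], [2, 1, 4], ntimes=200, rng=random.Random(5))
print(f"{failures} of 200 generations failed")
```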

   The bootstrap estimates are formed as simple means of the stored variates, and the standard errors are the square roots of their sample variances. The TABULATE directive is used to estimate quantiles from the stored variates, to define confidence limits. The variance-covariance matrix is formed from the statistics using the FSSPM directive.
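
   These summaries can be sketched in Python as follows. The sketch is illustrative only; it assumes the usual n-1 divisor for sample variances and covariances, and the function name summarise is hypothetical.

```python
from statistics import mean, variance

def summarise(series):
    """Summarise s stored series, each holding the bootstrap values of one statistic.

    Returns the bootstrap means, the standard errors and the
    variance-covariance matrix (as a list of rows).
    """
    n = len(series[0])
    estimates = [mean(s) for s in series]
    standard_errors = [math_sqrt(variance(s)) for s in series]
    vcov = [[sum((a - ma) * (b - mb) for a, b in zip(sa, sb)) / (n - 1)
             for sb, mb in zip(series, estimates)]
            for sa, ma in zip(series, estimates)]
    return estimates, standard_errors, vcov

def math_sqrt(value):
    """Square root, kept separate so the sketch needs only the statistics module."""
    return value ** 0.5

estimates, standard_errors, vcov = summarise([[1, 2, 3, 4], [2, 4, 6, 8]])
# The diagonal of the variance-covariance matrix holds the squared standard errors.
```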

   The graphical representation uses DHISTOGRAM or HISTOGRAM on the stored variates. The smoothed curves are calculated from the transformed percentages from the histogram: LOGIT(CUM(%)). A smoothing spline is fitted on this scale, by the FIT directive with the SSPLINE function, using 4 d.f. The resulting fitted values are then backtransformed and drawn on the plot with the monotonic setting of the PEN directive.


Action with RESTRICT

If any of the data vectors is restricted, BOOTSTRAP will use only the units that are not restricted for any of the vectors. The data vectors that are passed to the RESAMPLE procedure are all restricted to this identified set of units, but otherwise match the original data vectors. Each set of vectors supplied in a pointer in the AUXILIARY option is treated separately in this way.


References

Efron, B. & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77.

Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall, London.