PCP directive

Performs principal components analysis.


Options

PRINT = strings
Printed output required (loadings, roots, residuals, scores, tests); default * i.e. no printing

NROOTS = scalar
Number of latent roots for printed output; default * requests them all to be printed

SMALLEST = string
Whether to print the smallest roots instead of the largest (yes, no); default no

METHOD = string
Whether to use sums of squares, correlations or variances and covariances (ssp, correlation, variancecovariance); default ssp


Parameters

DATA = pointers or matrices or SSPMs
Pointer of variates forming the data matrix, or matrix storing the variate values by columns, or SSPM giving their sums of squares and products (or correlations) etc

LRV = LRVs
To store the principal component loadings, roots and trace from each analysis

SSPM = SSPMs
To store the computed sum-of-squares-and-products or correlation matrix

SCORES = matrices
To store the principal component scores

RESIDUALS = matrices or variates
To store residuals from the dimensions fitted in the analysis (i.e. number of columns of the SCORES matrix, or as defined by the NROOTS option)


Description

Principal components analysis finds linear combinations of a set of variates that maximize the variation contained within them, thereby displaying most of the original variability in a smaller number of dimensions. Principal components analysis operates on sums of squares and products, or a correlation matrix, or a matrix of variances and covariances, formed from the variates.

   You supply the input for PCP using the first parameter; this list may have more than one entry, in which case GenStat repeats the analysis for each of the input structures. Instead of supplying an SSPM, you can supply a pointer containing the set of variates, or a matrix storing the variate values by columns. GenStat will then calculate the sums of squares and products, or correlations, or variances and covariances for the analysis (see option METHOD below).

   For example, these two forms of input are equivalent:

SSPM [TERMS=Height,Length,Width,Weight] S

FSSPM S

PCP [PRINT=roots] S

and

PCP [PRINT=roots] !P(Height,Length,Width,Weight)

The first form, however, leaves the sums of squares and products available for later use in the SSPM S. In the second form the pointer is unnamed, but you may prefer to use a named pointer. For example:

POINTER [VALUES=Height,Length,Width,Weight] Dmat

PCP [PRINT=roots] Dmat

By default the PCP directive does not print any results: you use the PRINT option to specify what output you require. The printed output is in five sections (loadings, roots, residuals, scores and tests), each with a corresponding setting of PRINT, as illustrated in the examples below.

   The columns of the matrices of principal component loadings and scores correspond to the latent roots. Each latent root corresponds to a single dimension, and gives the variability of the scores in that dimension. The loadings give the linear coefficients of the variables that are used to construct the scores in each dimension.
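   The relationship between loadings, scores and latent roots can be sketched as follows. This assumes the usual conventions that each variate is centred on its mean before the scores are formed and that each column of loadings is scaled to unit length; neither is stated above. The symbols x_{iv} (value of variate v for unit i), \bar{x}_v (its mean), a_{vj} (loading of variate v in dimension j) and s_{ij} (score of unit i in dimension j) are introduced here for illustration only.

```latex
s_{ij} = \sum_{v=1}^{p} a_{vj}\,\bigl(x_{iv} - \bar{x}_v\bigr),
\qquad
l_j = \sum_{i=1}^{n} s_{ij}^{2} \quad \text{(METHOD=ssp)}
```

   Under these assumptions the j-th latent root is the sum of squares of the j-th column of scores when sums of squares and products are analysed.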

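   The significance tests are for equality of the k smallest roots: $l_i$ ($i = p-k+1, \ldots, p$), with the roots numbered in decreasing order. The test statistic is

$$\left\{ n - \frac{2p + 11}{6} \right\} \left\{ k \log\left( \frac{1}{k} \sum_{i=p-k+1}^{p} l_i \right) - \sum_{i=p-k+1}^{p} \log l_i \right\}$$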

where n is the number of units and p is the number of variables. Asymptotically, the statistics have a chi-square distribution with (k+2)(k-1)/2 degrees of freedom. If any latent roots are zero, GenStat excludes them from the calculation of the test statistic; the effective value of p is reduced accordingly.
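   As a worked illustration, take the statistic in the form {n - (2p + 11)/6} × {k log(mean of the k smallest roots) - sum of their logs}, with natural logarithms assumed. The values n = 50, p = 4, k = 2 and the two smallest roots 0.5 and 0.3 are hypothetical, chosen only to show the arithmetic.

```latex
\begin{aligned}
n - \frac{2p + 11}{6} &= 50 - \frac{19}{6} \approx 46.83 \\
k \log \bar{l} - \sum \log l_i
  &= 2 \log(0.4) - \{\log(0.5) + \log(0.3)\}
   = \log\frac{0.16}{0.15} \approx 0.0645 \\
\text{statistic} &\approx 46.83 \times 0.0645 \approx 3.02 \\
\text{degrees of freedom} &= \frac{(k+2)(k-1)}{2} = \frac{4 \times 1}{2} = 2
\end{aligned}
```

   A value of about 3.02 is below the upper 5% point of chi-square on 2 degrees of freedom (5.99), so these hypothetical roots would give no evidence against equality of the two smallest roots.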

   If you omit the NROOTS option, GenStat prints by default the results corresponding to all the latent roots. The number of latent roots is the number of variates involved in the input to PCP. The NROOTS option allows you to print only part of the results, corresponding to the first or last r latent roots. You may then want to print the residuals formed from the remaining columns of scores. The residuals are all positive: this is because residuals from multivariate analyses generally occupy several dimensions, so they represent distances in multidimensional space and signs cannot be attached to them.

   To print results corresponding to the r smallest latent roots, you must set option NROOTS to r and option SMALLEST to yes. Now if residuals are printed they will be formed from the scores corresponding to the largest roots. The NROOTS and SMALLEST options apply to the latent roots and vectors, the principal component scores and the residuals. So you cannot print directly, for example, the first two columns of scores and the last three columns of loadings. This is rarely required but, if necessary, it can be done by saving the relevant results and printing them separately.
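   A minimal sketch of these settings, reusing the named pointer Dmat from the earlier example. The identifier Sc is hypothetical.

```genstat
"Print the two smallest latent roots and the corresponding loadings."
PCP [PRINT=roots,loadings; NROOTS=2; SMALLEST=yes] Dmat

"Save the first two columns of scores so that they can be printed separately."
PCP [NROOTS=2] Dmat; SCORES=Sc
PRINT Sc
```

   The first statement prints results for the smallest roots only; the second saves scores for the two largest roots, which can then be printed independently of any other output.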

   By default, the PCP directive operates on the SSPM, but you can set the METHOD option to correlation to operate on a derived matrix of correlations, or to variancecovariance to use variances and covariances. Note that when correlations are analysed the significance-test statistics no longer have asymptotic chi-square distributions.
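   For illustration, the alternative settings of METHOD might be used as follows, again with the pointer Dmat from the earlier example:

```genstat
"Analyse the correlation matrix derived from the variates in Dmat."
PCP [PRINT=roots,loadings; METHOD=correlation] Dmat

"Analyse the matrix of variances and covariances instead."
PCP [PRINT=roots,loadings; METHOD=variancecovariance] Dmat
```

   The tests setting of PRINT is left out of the first statement because the chi-square approximation does not hold when correlations are analysed.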

   The LRV parameter allows you to save the principal component loadings, the latent roots, and their sum (the trace) in an LRV structure, while the SCORES parameter saves the principal component scores in a matrix. If you have declared the LRV already, its number of rows must be the same as the number of variates supplied in an input pointer or implied by an input SSPM. The number of rows of the SCORES matrix, if previously declared, must be equal to the number of units.

   The number of columns of the LRV and of the SCORES matrix corresponds to the number of dimensions to be saved from the analysis, and this must be the same for both of them. If the structures have been declared already, GenStat will take the larger of the numbers of columns declared for either, and declare (or redeclare) the other one to match. If neither has been declared and option SMALLEST retains the default setting no, GenStat takes the number of columns from the setting of the NROOTS option. Otherwise, GenStat saves results for the full set of dimensions. The trace saved as the third component of the LRV structure, however, will contain the sum of all the latent roots, whether or not they have all been saved. Procedure LRVSCREE can be used to produce a "scree" diagram, which can help in deciding how many dimensions to save.
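   The rule for undeclared structures can be sketched as below. The identifiers L and Sc are hypothetical, and the sketch assumes that the components of an LRV can be referred to as L[1], L[2] and L[3], in the order loadings, roots, trace.

```genstat
"Neither L nor Sc has been declared and SMALLEST keeps its default,"
"so both are set up with two columns, taken from NROOTS."
PCP [NROOTS=2] Dmat; LRV=L; SCORES=Sc

"Loadings and latent roots for the two saved dimensions."
PRINT L[1]
PRINT L[2]

"The trace is the sum of all the latent roots, not only the two saved."
PRINT L[3]
PRINT Sc
```

   With the four variates of the earlier example, L[1] would have four rows and two columns, while L[3] would still be the sum of all four roots.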

   The SSPM parameter can save the SSPM structure used for the analysis. A particularly convenient instance is when you have supplied an SSPM structure as input but, for example, have set METHOD=correlation: the SSPM that is saved will then contain correlations instead of sums of squares and products.
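   For instance, assuming S is the SSPM formed with SSPM and FSSPM in the earlier example, and using the hypothetical identifier Cmat for the saved structure:

```genstat
"S holds sums of squares and products; Cmat saves the matrix actually analysed."
PCP [METHOD=correlation] S; SSPM=Cmat
```

   Here Cmat would contain correlations rather than sums of squares and products.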

   The RESIDUALS parameter allows you to save the principal component residuals, in a matrix with number of rows equal to the number of units and one column. If the latent roots and vectors (loadings) are saved from the analysis, the residuals will correspond to the dimensions not saved; the same applies if you save scores. If neither the LRV nor scores are saved, the saved residuals will correspond to the smallest latent roots not printed.
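   A short sketch of saving residuals alongside scores, with hypothetical identifiers Sc and Res and the pointer Dmat from the earlier example:

```genstat
"Two dimensions of scores are saved, so Res corresponds to the dimensions not saved."
PCP [NROOTS=2] Dmat; SCORES=Sc; RESIDUALS=Res

"Res has one row per unit and a single column."
PRINT Res
```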

   If the variables used to form the SSPM structure are restricted, then the analysis will be subject to that restriction. Similarly, if a pointer to a set of variates is used as input to PCP, then any restriction on the variates will be taken into account by the analysis. If you want principal component scores or residuals to be printed or saved from the analysis, the original data must be available. The matrices to save such results must have been declared with as many rows as the variates have values, ignoring the restriction. You can calculate the analysis from one subset of units, but calculate the scores and residuals for all the units, by using as input to PCP an SSPM structure formed using a weight variate with zeros for the excluded sampling units and unity for those to be included. For example, to exclude a known set of outliers from an analysis, but to print scores for them, these statements could be used:

POINTER [NVALUES=5] V

FACTOR [LABELS=!T(No,Yes)] Outlier

READ [CHANNEL=2] Outlier,V[]

CALCULATE Wt = Outlier .IN. 'No'

SSPM [TERMS=V] S

FSSPM [WEIGHT=Wt] S

PCP [PRINT=scores] S

   Principal component regression is provided by procedure RIDGE.

 

Options: PRINT, NROOTS, SMALLEST, METHOD.

Parameters: DATA, LRV, SSPM, SCORES, RESIDUALS.


Action with RESTRICT

If any of the variates in a DATA pointer is restricted, only the defined subset of the units will be used in the analysis.
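
As an illustration, the variates of the earlier example might be restricted before the analysis. The condition on Weight is hypothetical and is there only to show the form of the statements.

```genstat
"Hypothetical condition: analyse only the units with Weight greater than 1."
RESTRICT Height,Length,Width,Weight; CONDITION=Weight .GT. 1
PCP [PRINT=roots] !P(Height,Length,Width,Weight)

"Remove the restriction once the analysis is complete."
RESTRICT Height,Length,Width,Weight
```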