The VSNi Team

3 months agoLinear mixed effects models provide a powerful tool for modelling temporally correlated data. The goal of correlation modelling is to describe how the dependence between measurements changes as the separation in time between them increases. For example, if we measure a patient’s blood pressure each month, we expect measurements on consecutive months to be more alike than measurements several months apart.

Four commonly used correlation models for repeated measures data are:

**Simple**correlation structure, also known as**uniform**correlation. This model assumes the correlation () between measurements is constant regardless of how far apart in time they are.

**Autoregressive**model of order 1. This model allows the correlations between measurements to decrease as the length of time between them increases. This is a more realistic correlation structure for most repeated measures data sets than assuming constant correlation between all pairs of measurements. However, the autoregressive model should only be used when the repeated measures data is at equally spaced time intervals.

**Power**model of order 1. This is an alternative to the autoregressive model of order 1 which accommodates unequally spaced repeated measures. As with the autoregressive model, this allows the correlations between measurements to decrease as the length of time between them increases.

Note: is the Euclidean distance between the and measurements (i.e., time points).

**General**correlation structure, also known as**unstructured**correlation. This is the most flexible correlation structure. It allows a separate correlation for every pair of measurements.

Note: is the Euclidean distance between the and measurements (i.e., time points).

In addition, another important correlation structure is the **identity** structure. This assumes that the observations are uncorrelated, or independent of one another.

Let’s look at an example.

Our example is from an orthodontic growth rate study of children (Potthoff and Roy, 1964). In this study, researchers at the University of North Carolina Dental School tracked the orthodontic growth of 27 children by measuring the distance between the pituitary and the pterygomaxillary fissure every 2 years from the ages of 8 to 14 years. The graph below plots the orthodontic growth profiles of the individual children, showing the distance data in terms of sex and age.

The data set contains 2 variates:

`distance`

, the response variable`age`

, age of the child in years

and 3 factors :

`Subject`

, the individual children from whom repeated measures have been taken`Sex`

, sex of the child`agef`

, also age of the child in years but stored as a factor

Our goal is to model the `distance`

data allowing male and female children to have different growth patterns as they age. However, as this is *repeated measures* data, we must take into account the correlated nature of the distance measurements taken on the same child (i.e., `Subject`

) as they age. For this study, distance measurements taken on the **same** child at consecutive ages should be * correlated* but measurements on

This correlation is imposed by specifying the appropriate correlation structures on the residual.

The orthodontic growth data set contains a total of 108 observations arising from 27 children (factor `Subject`

) each measured 4 times, at ages 8, 10, 12 and 14 (factor `agef`

). Thus, for this data set the residual corresponds to the combination of the factors `Subject`

and `agef`

.

Our residual model has an I ⊗ C covariance structure, where the identity matrix I corresponds to the independent children, and the covariance matrix C corresponds to the correlated measurements over age within a child. |

Since the distance measurements taken on different children are assumed to be *independent*, the correlation structure that should be associated with the `Subject`

term in the residual is the **identity** correlation structure.

However, as the distance measurements taken on the same child over consecutive ages should be *correlated*, we need to select a different correlation structure for the `agef`

term in the residual.

Recall that:

- the simple correlation structure assumes that the correlation between distance measurements taken on the same child is constant regardless of how far apart in age the measurements were taken
- the autoregressive correlation structure of order 1 and the power model of order 1 allow the correlation between distance measurements taken on the same child to decrease as the age difference between the measurements increase.
- the general correlation structure allows a separate correlation for every pair of ages

So, for example, if we were to impose a general correlation structure on the `agef`

term in the residual we would be allowing the correlation to be different between every pair ages, i.e.,

In summary, if we model the distance data using a repeated measures model with an identity correlation structure on the `Subject`

term and a general correlation structure on the `agef`

term in the residual, we are saying that the residuals between different children (`Subject`

) are independent, but the residuals originating from measurements taken on the same child but at different levels of `agef`

are correlated according to the general correlation structure.

Related Reads

Dr. John Rogers

10 months agoEarlier this year I had an enquiry from Carey Langley of VSNi as to why I had not renewed my Genstat licence. The truth was simple – I have decided to fully retire after 50 years as an agricultural entomologist / applied biologist / consultant. This prompted some reflections about the evolution of bioscience data analysis that I have experienced over that half century, a period during which most of my focus was the interaction between insects and their plant hosts; both how insect feeding impacts on plant growth and crop yield, and how plants impact on the development of the insects that feed on them and on their natural enemies.

My journey into bioscience data analysis started with undergraduate courses in biometry – yes, it was an agriculture faculty, so it was biometry not statistics. We started doing statistical analyses using full keyboard Monroe calculators (for those of you who don’t know what I am talking about, you can find them here). It was a simpler time and as undergraduates we thought it was hugely funny to divide 1 by 0 until the blue smoke came out…

After leaving university in the early 1970s, I started working for the Agriculture Department of an Australian state government, at a small country research station. Statistical analysis was rudimentary to say the least. If you were motivated, there was always the option of running analyses yourself by hand, given the appearance of the first scientific calculators in the early 1970s. If you wanted a formal statistical analysis of your data, you would mail off a paper copy of the raw data to Biometry Branch… and wait. Some months later, you would get back your ANOVA, regression, or whatever the biometrician thought appropriate to do, on paper with some indication of what treatments were different from what other treatments. Dose-mortality data was dealt with by manually plotting data onto probit paper.

In-house ANOVA programs running on central mainframes were a step forward some years later as it at least enabled us to run our own analyses, as long as you wanted to do an ANOVA…. However, it also required a 2 hours’ drive to the nearest card reader, with the actual computer a further 1000 kilometres away.… The first desktop computer I used for statistical analysis was in the early 1980s and was a CP/M machine with two 8-inch floppy discs with, I think, 256k of memory, and booting it required turning a key and pressing the blue button - yes, really! And about the same time, the local agricultural economist drove us crazy extolling the virtues of a program called Lotus 1-2-3!

Having been brought up on a solid diet of the classic texts such as Steele and Torrie, Cochran and Cox and Sokal and Rohlf, the primary frustration during this period was not having ready access to the statistical analyses you knew were appropriate for your data. Typical modes of operating for agricultural scientists in that era were randomised blocks of various degrees of complexity, thus the emphasis on ANOVA in the software that was available in-house. Those of us who also had less-structured ecological data were less well catered for.

My first access to a comprehensive statistics package was during the early to mid-1980s at one of the American Land Grant universities. It was a revelation to be able to run virtually whatever statistical test deemed necessary. Access to non-linear regression was a definite plus, given the non-linear nature of many biological responses. As well, being able to run a series of models to test specific hypotheses opened up new options for more elegant and insightful analyses. Looking back from 2021, such things look very trivial, but compared to where we came from in the 1970s, they were significant steps forward.

My first exposure to Genstat, VSNi’s stalwart statistical software package, was Genstat for Windows, Third Edition (1997). Simple things like the availability of residual plots made a difference for us entomologists, given that much of our data had non-normal errors; it took the guesswork out of whether and what transformations to use. The availability of regressions with grouped data also opened some previously closed doors.

After a deviation away from hands-on research, I came back to biological-data analysis in the mid-2000s and found myself working with repeated-measures and survival / mortality data, so ventured into repeated-measures restricted maximum likelihood analyses and generalised linear mixed models for the first time (with assistance from a couple of Roger Payne’s training courses in Hobart and Queenstown). Looking back, it is interesting how quickly I became blasé about such computationally intensive analyses that would run in seconds on my laptop or desktop, forgetting that I was doing ANOVAs by hand 40 years earlier when John Nelder was developing generalised linear models. How the world has changed!

Of importance to my Genstat experience was the level of support that was available to me as a Genstat licensee. Over the last 15 years or so, as I attempted some of these more complex analyses, my aspirations were somewhat ahead of my abilities, and it was always reassuring to know that Genstat Support was only ever an email away. A couple of examples will flesh this out.

Back in 2008, I was working on the relationship between insect-pest density and crop yield using R2LINES, but had extra linear X’s related to plant vigour in addition to the measure of pest infestation. A support-enquiry email produced an overnight response from Roger Payne that basically said, “Try this”. While I slept, Roger had written an extension to R2LINES to incorporate extra linear X’s. This was later incorporated into the regular releases of Genstat. This work led to the clearer specification of the pest densities that warranted chemical control in soybeans and dry beans (https://doi.org/10.1016/j.cropro.2009.08.016 and https://doi.org/10.1016/j.cropro.2009.08.015).

More recently, I was attempting to disentangle the effects on caterpillar mortality of the two Cry insecticidal proteins in transgenic cotton and, while I got close, I would not have got the analysis to run properly without Roger’s support. The data was scant in the bottom half of the overall dose-response curves for both Cry proteins, but it was possible to fit asymptotic exponentials that modelled the upper half of each curve. The final double-exponential response surface I fitted with Roger’s assistance showed clearly that the dose-mortality response was stronger for one of the Cry proteins than the other, and that there was no synergistic action between the two proteins (https://doi.org/10.1016/j.cropro.2015.10.013)

One thing that I especially appreciate about having access to a comprehensive statistics package such as Genstat is having the capacity to tease apart biological data to get at the underlying relationships. About 10 years ago, I was asked to look at some data on the impact of cold stress on the expression of the Cry2Ab insecticidal protein in transgenic cotton. The data set was seemingly simple - two years of pot-trial data where groups of pots were either left out overnight or protected from low overnight temperatures by being moved into a glasshouse, plus temperature data and Cry2Ab protein levels. A REML analysis, and some correlations and regressions enabled me to show that cold overnight temperatures did reduce Cry2Ab protein levels, that the effects occurred for up to 6 days after the cold period and that the threshold for these effects was approximately 14 Cº (https://doi.org/10.1603/EC09369). What I took from this piece of work is how powerful a comprehensive statistics package can be in teasing apart important biological insights from what was seemingly very simple data. Note that I did not use any statistics that were cutting edge, just a combination of REML, correlation and regression analyses, but used these techniques to guide the dissection of the relationships in the data to end up with an elegant and insightful outcome.

Looking back over 50 years of work, one thing stands out for me: the huge advances that have occurred in the statistical analysis of biological data has allowed much more insightful statistical analyses that has, in turn, allowed biological scientists to more elegantly pull apart the interactions between insects and their plant hosts.

For me, Genstat has played a pivotal role in that process. I shall miss it.

**Dr John Rogers**

Research Connections and Consulting

St Lucia, Queensland 4067, Australia

Phone/Fax: +61 (0)7 3720 9065

Mobile: 0409 200 701

Email: john.rogers@rcac.net.au

Alternate email: D.John.Rogers@gmail.com

The VSNi Team

8 months agoOutliers are sample observations that are either much larger or much smaller than the other observations in a dataset. Outliers can skew your dataset, so how should you deal with them?

Imagine Jane, the general manager of a chain of computer stores, has asked a statistician, Vanessa, to assist her with the analysis of data on the daily sales at the stores she manages. Vanessa takes a look at the data, and produces a boxplot for each of the stores as shown below.

Vanessa pointed out to Jane the presence of outliers in the data from Store 2 on days 10 and 22. Vanessa recommended that Jane checks the accuracy of the data. *Are the outliers due to recording or measurement error?* If the outliers can’t be attributed to errors in the data, Jane should investigate what might have caused the increased sales on these two particular days. Always investigate outliers - this will help you better understand the data, how it was generated and how to analyse it.

Vanessa explained to Jane that we should never drop a data value just because it is an outlier. The nature of the outlier should be investigated before deciding what to do.

Whenever there are outliers in the data, we should look for possible causes of error in the data. If you find an error but cannot recover the correct data value, then you should replace the incorrect data value with a missing value.

However, outliers can also be real observations, and sometimes these are the most interesting ones! If your outlier can’t be attributed to an error, you shouldn’t remove it from the dataset. Removing data values unnecessarily, just because they are outliers, introduces bias and may lead you to draw the wrong conclusions from your study.

- Transform the data: if the dataset is not normally distributed, we can try transforming the data to normalize it. For example, if the data set has some high-value outliers (i.e. is right skewed), the log transformation will “pull” the high values in. This often works well for count data.
- Try a different model/analysis: different analyses may make different distributional assumptions, and you should pick one that is appropriate for your data. For example, count data are generally assumed to follow a Poisson distribution. Alternatively, the outliers may be able to be modelled using an appropriate explanatory variable. For example, computer sales may increase as we approach the start of a new school year.

In our example, Vanessa suggested that since the mean for Store 2 is highly influenced by the outliers, the median, another measure of central tendency, seems more appropriate for summarizing the daily sales at each store. Using the statistical software Genstat, Vanessa can easily calculate both the mean and median number of sales per store for Jane.

Vanessa also analyses the data assuming the daily sales have Poisson distributions, by fitting a log-linear model.

Notice that Vanessa has included “Day” as a blocking factor in the model to allow for variability due to temporal effects.

From this analysis, Vanessa and Jane conclude that the means (of the Poisson distributions) differ between the stores (p-value < 0.001). Store 3, on average, has the most computer sales per day, whereas Stores 1 and 4, on average, have the least.

There are other statistical approaches Vanessa might have used to analyse Jane’s sales data, including a one-way ANOVA blocked by Day on the log-transformed sales data and Friedman’s non-parametric ANOVA. Both approaches are available in Genstat’s comprehensive menu system.

There are many ways to deal with outliers, but no single method will work in every situation. As we have learnt, we can remove an observation if we have evidence it is an error. But, if that is not the case, we can always use alternative summary statistics, or even different statistical approaches, that accommodate them.

Kanchana Punyawaew and Dr. Vanessa Cave

a year agoThe term "**repeated measures**" refers to experimental designs or observational studies in which each experimental unit (or subject) is measured repeatedly over time or space. "**Longitudinal data**" is a special case of repeated measures in which variables are measured over time (often for a comparatively long period of time) and duration itself is typically a variable of interest.

In terms of data analysis, it doesn’t really matter what type of data you have, as you can analyze both using mixed models. Remember, the key feature of both types of data is that the response variable is measured more than once on each experimental unit, and these repeated measurements are likely to be correlated.

To illustrate the use of mixed model approaches for analyzing repeated measures, we’ll examine a data set from Landau and Everitt’s 2004 book, “*A Handbook of Statistical Analyses using SPSS”. Here, a double-blind, placebo-controlled clinical trial was conducted to determine whether an estrogen treatment reduces post-natal depression. Sixty three subjects were randomly assigned to one of two treatment groups: placebo (27 subjects) and estrogen treatment (36 subjects). Depression scores were measured on each subject at baseline, i.e. before randomization (predep*) and at six two-monthly visits after randomization (*postdep* at visits 1-6). However, not all the women in the trial had their depression score recorded on all scheduled visits.

In this example, the data were measured at fixed, equally spaced, time points. (*Visit* is time as a factor and *nVisit* is time as a continuous variable.) There is one between-subject factor (*Group*, i.e. the treatment group, either placebo or estrogen treatment), one within-subject factor (*Visit* or *nVisit*) and a covariate (*predep*).

Using the following plots, we can explore the data. In the first plot below, the depression scores for each subject are plotted against time, including the baseline, separately for each treatment group.

In the second plot, the mean depression score for each treatment group is plotted over time. From these plots, we can see variation among subjects within each treatment group that depression scores for subjects generally decrease with time, and on average the depression score at each visit is lower with the estrogen treatment than the placebo.

The simplest approach for analyzing repeated measures data is to use a random effects model with * subject* fitted as random. It assumes a constant correlation between all observations on the same subject. The analysis objectives can either be to measure the average treatment effect over time or to assess treatment effects at each time point and to test whether treatment interacts with time.

In this example, the treatment (*Group*), time (*Visit*), treatment by time interaction (*Group:Visit*) and baseline (*predep*) effects can all be fitted as fixed. The subject effects are fitted as random, allowing for constant correlation between depression scores taken on the same subject over time.

The code and output from fitting this model in ASReml-R 4 follows;

The output from summary() shows that the estimate of subject and residual variance from the model are 15.10 and 11.53, respectively, giving a total variance of 15.10 + 11.53 = 26.63. The Wald test (from the wald.asreml() table) for *predep*, *Group* and *Visit* are significant (probability level (Pr) ≤ 0.01). There appears to be no relationship between treatment group and time (*Group:Visit*) i.e. the probability level is greater than 0.05 (Pr = 0.8636).

In practice, often the correlation between observations on the same subject is not constant. It is common to expect that the covariances of measurements made closer together in time are more similar than those at more distant times. Mixed models can accommodate many different covariance patterns. The ideal usage is to select the pattern that best reflects the true covariance structure of the data. A typical strategy is to start with a simple pattern, such as compound symmetry or first-order autoregressive, and test if a more complex pattern leads to a significant improvement in the likelihood.

Note: using a covariance model with a simple correlation structure (i.e. uniform) will provide the same results as fitting a random effects model with random subject.

In ASReml-R 4 we use the corv() function on time (i.e. *Visit*) to specify uniform correlation between depression scores taken on the same subject over time.

Here, the estimate of the correlation among times (*Visit*) is 0.57 and the estimate of the residual variance is 26.63 (identical to the total variance of the random effects model, asr1).

Specifying a heterogeneous first-order autoregressive covariance structure is easily done in ASReml-R 4 by changing the variance-covariance function in the residual term from corv() to ar1h().

When the relationship of a measurement with time is of interest, a random coefficients model is often appropriate. In a random coefficients model, time is considered a continuous variable, and the subject and subject by time interaction (*Subject:nVisit*) are fitted as random effects. This allows the slopes and intercepts to vary randomly between subjects, resulting in a separate regression line to be fitted for each subject. However, importantly, the slopes and intercepts are correlated.

The str() function of asreml() call is used for fitting a random coefficient model;

The summary table contains the variance parameter for *Subject* (the set of intercepts, 23.24) and *Subject:nVisit* (the set of slopes, 0.89), the estimate of correlation between the slopes and intercepts (-0.57) and the estimate of residual variance (8.38).

Brady T. West, Kathleen B. Welch and Andrzej T. Galecki (2007). *Linear Mixed Models: A Practical Guide Using Statistical Software*. Chapman & Hall/CRC, Taylor & Francis Group, LLC.

Brown, H. and R. Prescott (2015). *Applied Mixed Models in Medicine*. Third Edition. John Wiley & Sons Ltd, England.

Sabine Landau and Brian S. Everitt (2004). *A Handbook of Statistical Analyses using SPSS*. Chapman & Hall/CRC Press LLC.