Overlaying Design Matrices for a Parental Model in ASReml-R

Overlaying Design Matrices for a Parental Model in ASReml-R

Most breeding programs plan several controlled crosses between outstanding parents to detect favorable alleles in their offspring. The progeny are later evaluated in a field experiment, and this information is used to assess the genetic worth of the parents by fitting parental linear mixed models (LMMs) and obtaining best linear unbiased predictions (BLUPs). These BLUPs, which are the general combining ability (GCA), or 1/2 of the breeding value (BV, with BV = 2 \(\times\) GCA) of each parent, are then used to select the best parents for future crosses or operational deployment.

In many plant breeding programs, a parent is considered in several crosses. In most cases it is easy to assign the sex of a given individual. However, several commercial plant species are monoecious, which means that a given genotype will bear both male and female flowers. In contrast, dioecious species have distinctive male and female plants. Some examples of monoecious species are corn, squash, banana, and many conifers, particularly those of the genus Pinus.

In quantitative genetic analyses, monoecious species present a particular challenge, as a given parent can contribute to the estimation of its breeding value (or GCA) as both a male and a female, something that needs to be taken into consideration when a statistical model is fitted.

This is done by overlaying design matrices of the factors associated with male and female parents. We will illustrate this here using an example of loblolly pine (Pinus taeda) in which some individuals, depending on the availability of pollen or flowers, were used in several artificial crosses as both male and female in the breeding program.

The data used here originates from a loblolly pine clonal study published by Resende et al. (2012). Parents were crossed in a circular mating design, constituting several full-sib families. Individuals from these families were vegetatively propagated (cloned) and established in a series of field trials. A subset of the full dataset, corresponding to diameter at breast height (DBH, inches) measured at 6 years since planting at the Nassau (Florida, USA) site, is used here. We will be using the adjusted mean values for this trait. In this dataset we have a total of 71 families that originated from 43 parents; however, 20 of those parents were used as both males and females in different crosses. There were no reciprocal or self-pollinated crosses planned, but these can occur in other crops and they might need to be identified and modelled properly. In this dataset, each family is formed by between 1 and 16 individuals (with an average of 12.2). We also have pedigree information for the 43 parents.

The first few lines of the phenotypic dataset \(\texttt{DAT_NASSAU.TXT}\) are presented below:

tree mother father family DBH
280200 20012 44112 1017 9.410
282200 20022 44090 1020 11.953
284400 44126 142170 1049 10.785
285210 50152 58040 1056 9.950
285600 142024 44112 1006 10.325
286600 44012 17766 1036 11.122

 

And the first and last few lines of parental pedigree file \(\texttt{PEDPAR_NASSAU.TXT}\) are:

genotype mother father
1 14006 0 0
2 14046 0 0
3 14060 0 0
4 14070 0 0
5 14104 0 0
6 14106 0 0
48 142076 14114 14104
49 142170 14106 14046
50 142194 14216 14060
51 202060 20030 20052
52 202096 20034 20052
53 222056 22034 14006
54 222060 22020 22034

 

In the phenotypic data each row contains the id of the individual tree, together with information about their mother and father, followed by the adjusted phenotypic response DBH. In the pedigree file there is information about the mother and father of each of the parents used in the crosses. For example, note that tree \(\texttt{284400}\) has father \(\texttt{142170}\), and this genotype has as parents genotypes \(\texttt{14106}\) and \(\texttt{14046}\).

We are interested in fitting a parental model, where our objective is to estimate GCA values for each of the parents. (Note that this is different than an animal model, where we estimate a BV for each of the individuals (parents or offspring).) Our parental base model is:

\[y_{ijk} = \mu + f_i + m_j + fm_{ij} + \epsilon_{ijk}\]

where, \(y_{ijk}\) is the response for the kth individual originating from the cross between the ith female with the jth male; \(\mu\) is the overall mean; \(f_i\) is the random effect of the ith female, with \(f_i \sim N(0, \sigma^2_f)\); \(m_j\) is the random effect of the jth male, with \(m_j \sim N(0, \sigma^2_m)\); and \(fm_{ij}\) is the interaction or family effect of the cross between the ith female with the jth male, with \(fm_{ij} \sim N(0, \sigma^2_{fm})\); and \(\epsilon_{ijk}\) is the random residual with \(\epsilon_{ijk} \sim N(0, \sigma^2)\).

Note that for the parental effects \(f_i\) and \(m_j\) there is a pedigree available, which is used to obtain the numerator relationship matrix \(\mathbf{A}\), that is incorporated in our analyses. Hence, the vectors of parental effects have the assumptions \(\mathbf{f} \sim N(\mathbf{0},\sigma^2_f \mathbf{A})\) and \(\mathbf{m} \sim N(\mathbf{0},\sigma^2_m \mathbf{A})\).

In the above model, we are assuming that there are distinct variances for females and males, \(\sigma^2_f\) and \(\sigma^2_m\), and that these can take any positive value.

We can fit the above model with ASReml-R using the following code:

ainv <- ainverse(PED)
modP <- asreml(fixed=DBH~1,
             random=~vm(mother,ainv)+vm(father,ainv)+family,
             residual=~idv(units),
             data=PINE)

 

In this code we are generating the numerator relationship matrix \(\mathbf{A}\) using the command \(\texttt{ainverse()}\). The matrix \(\mathbf{A}\) is incorporated into the model using the function \(\texttt{vm()}\). The log-likelihood of the above model fitted to our data is \(\texttt{-744.488}\) and the estimated variance components are:

component std.error z.ratio bound %ch
vm(mother, ainv) 0.2774623 0.1121623 2.473759 P 0
vm(father, ainv) 0.3815927 0.1348447 2.829868 P 0
family 0.0000004 NA NA B 0
units!units 1.8702273 0.0930707 20.094696 P 0
units!R 1.0000000 NA NA F 0

 

Note the difference in variance components between males (fathers, \(\texttt{0.277}\)) and females (mothers, \(\texttt{0.382}\)). Finding a difference in variance components by sex is common for plant species and it might be correct depending on the origin of the parents, but also other biological reasons might explain these differences. Also note that the family variance is bounded at almost zero.

In our case, we can calculate both the narrow-sense heritability for females and males (\(h^2_f\) and \(h^2_m\)) and the dominance ratio (\(d^2\)), by using the following expressions:

\(h^2_f = 4 \times \sigma^2_f / (\sigma^2_f + \sigma^2_m + \sigma^2_{fm} + \sigma^2)\)
\(h^2_m = 4 \times \sigma^2_m / (\sigma^2_f + \sigma^2_m + \sigma^2_{fm} + \sigma^2)\)
\(d^2 = 4 \times \sigma^2_{fm} / (\sigma^2_f + \sigma^2_m + \sigma^2_{fm} + \sigma^2)\)

In some cases, particualrly when the estimations of heritability for both sexes might not be very different, these can be combined into a single estimate, as:

\(h^2 = 2 \times (\sigma^2_f +\sigma^2_m) / (\sigma^2_f + \sigma^2_m + \sigma^2_{fm} + \sigma^2)\)

All of the above expressions were obtained in ASReml-R using the command \(\texttt{vpredict()}\) as shown below:

vpredict(modP,h2m~4*V1/(V1+V2+V3+V4))
vpredict(modP,h2f~4*V2/(V1+V2+V3+V4))
vpredict(modP,d2~4*V3/(V1+V2+V3+V4))
vpredict(modP,h2~2*(V1+V2)/(V1+V2+V3+V4))

with the results:

Estimate SE
h2m 0.4388000 0.1616208
h2f 0.6034797 0.1858860
d2 0.0000007 0.0000001
h2 0.5211399 0.1043225

 

As expected, our dominance ratio \(d^2\) is very small in this particular trait; hence, it could be ignored. Also, the narrow-sense heritabilities are reasonable for this type of study, with a combined estimate of \(\texttt{0.521}\) but with a moderate approximated standard error (\(\texttt{0.104}\)).

Also, we can obtain the BLUPs for the model terms \(\texttt{mother}\) and \(\texttt{father}\), which in this case will correspond to the GCA values. If these are multiplied by two, we will have the BVs. The GCAs are obtained with the code:

summary(modP,coef=TRUE)$coef.random

 

In the output below, we can observe a few of the estimated GCA values for the \(\texttt{mother}\) (top) and \(\texttt{father}\) (bottom).

solution std.error z.ratio
vm(mother, ainv)_14006 0.3600002 0.2841548 1.2669160
vm(mother, ainv)_14046 0.0000000 0.5267472 0.0000000
vm(mother, ainv)_14060 -0.2286267 0.4864881 -0.4699533
vm(mother, ainv)_14070 0.0000000 0.5267472 0.0000000
vm(mother, ainv)_14104 -0.4620406 0.4208766 -1.0978056
vm(mother, ainv)_14106 0.0000000 0.5267472 0.0000000

 

solution std.error z.ratio
vm(father, ainv)_14006 0.0013612 0.2827143 0.0048148
vm(father, ainv)_14046 -0.5495441 0.5510751 -0.9972219
vm(father, ainv)_14060 0.0000000 0.6177321 0.0000000
vm(father, ainv)_14070 -0.3792472 0.3257064 -1.1643837
vm(father, ainv)_14104 -0.2087355 0.5024186 -0.4154613
vm(father, ainv)_14106 -0.5495441 0.5510751 -0.9972219
vm(father, ainv)_14114 0.1476378 0.3462045 0.4264466

 

Note that for parent \(\texttt{14006}\) the estimates are different as a mother and a father (\(\texttt{0.360}\) against \(\texttt{0.001}\)), and their precision will depend on the amount of information a parent has (i.e., number of offspring), but also in its relationship with other genotypes on the dataset.

So far, we have not combined the information from a given parent when it is used as a male and as a female. Now, it is of interest to overlay the factors \(\texttt{mother}\) and \(\texttt{father}\). To do this in ASReml-R we require two elements: the use of the statement \(\texttt{and()}\) in the model and the specification of the command \(\texttt{equate.levels}\). This is illustrated in the following code for the same data as above:

modO <- asreml(fixed=DBH~1,
               random=~vm(mother,ainv)+and(vm(father,ainv)),
               equate.levels=c('mother','father'),
               residual=~idv(units),
               data=PINE)

 

First, let’s focus on the \(\texttt{and()}\) function. This is used to request that the design matrices associated with \(\texttt{father}\) are overlaid on the design matrix associated with the immediately previous term (here, \(\texttt{mother}\)). Second, the expression \(\texttt{equate.levels}\) requests that the levels for the factor \(\texttt{mother}\) and \(\texttt{father}\) are treated as identical. That is, levels that have the same name in the factors \(\texttt{mother}\) and \(\texttt{father}\) will be treated as the same, and levels that are found in only one of the factors will be added to the new design matrix. Both of the above commands are required in ASReml-R to ensure that consistent BV (or GCA) estimates are obtained.

In order to illustrate the above overlay of matrices, we will use a simple example, in which we have eight individuals originating from crosses of four parents (P1, P2, P3 and P4). These parents were used as female or male, and in some cases we have reciprocal crosses and selfing. The crossing and design matrices for our base model (without overlay) are shown below:

\[
\begin{matrix}
indiv & female & male \\
A & P1 & P2 \\
B & P1 & P3 \\
C & P1 & P4 \\
D & P2 & P3 \\
E & P2 & P4 \\
F & P3 & P2 \\
G & P3 & P4 \\
H & P4 & P4 \\
\end{matrix}
\Rightarrow
\begin{matrix}
P1 \:\: P2 \:\: P3 \:\: P4 \\
\begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
\end{bmatrix}
\end{matrix}
\:\:
\begin{matrix}
P2 \:\: P3 \:\: P4 \\
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 0 & 1 \\
\end{bmatrix}
\end{matrix}
\]

Once the LMM model is fitted based on the above matrices (as we did with our previous object \(\texttt{modP}\)), we will have a total of four GCAs for females and three for males. If we count the number of ones for each effect, we find that they have between one and three pieces of information to be used in their estimation.

Now, if we proceed to overlay the female and male incidence matrices, we obtain the following results:

\[
\begin{matrix}
indiv & female & male \\
A & P1 & P2 \\
B & P1 & P3 \\
C & P1 & P4 \\
D & P2 & P3 \\
E & P2 & P4 \\
F & P3 & P2 \\
G & P3 & P4 \\
H & P4 & P4 \\
\end{matrix}
\Rightarrow
\begin{matrix}
P1 \:\: P2 \:\: P3 \:\: P4 \\
\begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 2 \\
\end{bmatrix}
\end{matrix}
\]

You can observe that there is a single design matrix but only with four columns: one for each parent. In this case, we have more pieces of information to estimate each effect, ranging from three to six. Of special interest is the effect of parent P3, it now uses four pieces of information to estimate its single GCA, whereas before it was being estimated as two effects, each with two pieces of information.

There are other interesting things in the above matrix, such as the value 2 associated with individual H, that originated from the cross P4 \(\times\) P4, which corresponds to a selfing of parent P4. Also, individuals D and F are full-sibs, originating from reciprocal crosses, but in this case there is no distinction between their parental effects. Both, selfing and reciprocal crosses can be considered in the fitted linear mixed model as extensions with the aim of evaluating their importance or significance.

Finally, we can observe our results from fitting the above overlay model to our loblolly pine data. The estimated variance components are:

component std.error z.ratio bound %ch
vm(mother, ainv) 0.4000453 0.1162417 3.441494 P 0
units!units 1.8599439 0.0917484 20.272218 P 0
units!R 1.0000000 NA NA F 0

 

In this particular case, we are missing the variance associated with the factor \(\texttt{father}\). However, this is still part of the model. As the matrices were overlaid, we have as an output a single parental effect now called \(\texttt{mother}\), which represents both male and female parents. Also, note that we dropped the term associated with \(\texttt{family}\), as in the previous analysis this was bound to zero. The log-likelihood of this model is \(\texttt{-737.328}\), a greater value indicating a better fit than the obtained from the original model (\(\texttt{-744.488}\)), and probably a significant improvement.

We can now proceed to obtain our estimate of heritability using the expression:

vpredict(modO,h2~4*V1/(2*V1+V2))

 

which is the traditional formula; however, note that in the denominator we are considering twice the \(\texttt{mother}\) variance as we have a single parental variance component. The results from the above code are presented below:

Estimate SE
h2 0.601564 0.1248922

 

As expected, these results are different to those from our previously fitted model because of the way this model is combining the available information. There is an increase in the heritability when we overlay the factors of mother and father compared to when we do not (\(h^2 = 0.521\)). This increase is expected as we now have more information to estimate each of the effects. Also, we expect a reduction in the standard errors of the estimated GCA values, which are presented below:

solution std.error z.ratio
vm(mother, ainv)_14006 0.5174180 0.2116542 2.4446388
vm(mother, ainv)_14046 -0.5055163 0.5609615 -0.9011606
vm(mother, ainv)_14060 -0.2843599 0.5668751 -0.5016271
vm(mother, ainv)_14070 -0.3430721 0.3366609 -1.0190434
vm(mother, ainv)_14104 -0.4849910 0.4781210 -1.0143688
vm(mother, ainv)_14106 -0.5055163 0.5609615 -0.9011606

 

In this case we have only \(\texttt{mother}\) effects that identify each of the levels of the parent. In contrast with our previous model, parent \(\texttt{14006}\) has a single GCA value (\(\texttt{0.517}\)), and its standard error is smaller than before (\(\texttt{0.21}\) compared to \(\texttt{0.28}\)). Some GCA estimates will increase or decrease, and their precision might get better or worse depending on how information is combined.

In this example, we have explored a parental model where the effects of male and female were overlaid. This resulted, according to the log-likelihood value, in a statistically superior model that combines the data better. But more importantly, this model is more biologically accurate given our knowledge of monoecy of the species under study.

Author

Salvador A. Gezan

References

Resende, MFR, Munoz, P, Resende, MDV, Garrick, DJ, Fernando, RL, Davis, JM, Jokela, EJ, Martin, TA, Peter, GF and M. Kirst. (2012). Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.) Genetics 190:1503-1510

File to download

EQ_LOBLOLLY

Notes: SAG August-2020