detection.tex 14.2 KB
 Sylvain Jasson committed Jun 02, 2017 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 \chapter{Detection model} During detection we will consider many linear models and perform many analysis of variance (aka ANOVA). \section{Complete model} The more general form of the models is : \label{eq:complete_model} \forall i \in \mathbf{N} \ \ \ y_{i}=\mu_{c_i} + \sum_{h \in \mathbf{H}} \sum_{g \in \mathbf{G}(h)} P(i,g,h)\xi_{g,h,c_i} +\sum_{v \in \mathbf{V}} v_i \kappa_{v} +\varepsilon_{i} Where : \begin{itemize} \item $y_i$ is the value of the trait for individual $i$; \item $\mathbf{N}$ is the set of individuals under study, its cardinality $\left\vert{\mathbf{N}}\right\vert$ is the total number of individual in the design under study; \item $c_i$ is the cross to which individual $i$ belongs (let be $\mathbf{C}$ the set of crosses in the design, $\forall i\in \mathbf{N}, \ \ c_i \in \mathbf{C}$, its cardinality $|\mathbf{C}|$ is the total number of crosses in the design); \item $\mu_{c_i}$ is the estimated mean value of the trait for this cross; \item $\mathbf{H}$ is the set of genetic explicative factors (mainly $\mathbf{H}$ is a set QTLs, or a set of sets of QTLs when looking for epistasy); \item $\mathbf{G}(h)$ is the set of possible genotypes for factor $h$ (if $h$ is a unit set $h=\set{l}$ that is we address a single QTL, then $\mathbf{G}(h)=\mathbf{G}(\set{l})$ is the set of possible genotypes at locus $l$); \item $P(i,g,h)$ is the probability that individual $i$ has genotype $g$ on $h$; \item $\xi_{g,h,c_i}$ is the phenotypic value of genotype $g$ on $h$ in genetic background $c_i$ (see section \ref{sec:phen_value} for more information about the hypothesis we can make about $\xi$) ; \item $v$ is a user-defined explicative variable; \item $\mathbf{V}$ is the set of user-defined explicative variables (this set can be empty); \item $v_i$ is the value of variable $v$ for individual $i$; \item $\kappa_{v}$ is the effect of variable $v$; \item $\varepsilon_i$ is the modeling error for individual $i$. \end{itemize} Lower case letter values $y_i$, $c_i$ and $v_i$ are user provided, upper case letter value $P(i,g,h)$ is computed from markers score and genetic map (see chapter \ref{ch:probas}), bold upper case letters $\mathbf{N}$, $\mathbf{C}$, $\mathbf{H}$, $\mathbf{G}$ and $\mathbf{V}$ are sets, Greek letters values $\mu_{c}$, $\xi_{g,h,c}$, $\kappa_{v}$ and $\varepsilon_i$ are computed by the linear model engine. The sets $\mathbf{N}$, $\mathbf{C}$ and $\mathbf{V}$ are user defined, our main goal is to find the best $\mathbf{H}$ ($\mathbf{G}$ is $\mathbf{H}$-dependent). \section{Fitting and testing models} The lineal model engine minimizes the residual sum of squares $RSS=\sum_{i \in \mathbf{N}} \varepsilon_i^2$ under constraints \ref{eq:complete_model} for any tested set $\mathbf{H}$. Note that adding more elements to the tested set $\mathbf{H}$ will automatically reduce $RSS$ since it allows minimization over a wider space. Therefore the reduction of $RSS$ cannot be our rule for model selection. We must switch the question from "Does the new model display a smaller $RSS$?" to "Is the $RSS$ reduction of the new model worth the degrees of freedom it costs ?". In order to compare two nested models we will then use F-test, that is a measurement of the this compromise between goodness of fit and degrees of freedom cost. (To be fully honest, when we compare to models for pleiotropic effects the residual errors $\varepsilon_i$ won't be a vector but a matrix, its squared sum over individuals won't be number but a vector, and therefore we must use some kind of log-likelihood ratio, and a $\chi^2$ test with appropriate degree of freedom will do the job.) Lets now suppose that we have current model, and we wander if some new extended model is better. Let $df_{current}$ (resp. $df_{new}$) be the number of degree of freedom for the current model (resp. the new model). This number of degree of freedom is the number of independent parameters in the fitted model, that is the number of independent values the linear model engine has to choose for $\mu_{c}$, $\xi_{g,h,c}$ and $\kappa_{v}$. The test value in then: \label{eq:F-test} F(\mathbf{H}_{new},\mathbf{H}_{current})=\frac{ \frac{ RSS_{current} - RSS_{new} }{ df_{new}-df_{current} } }{ \frac{RSS_{new}}{\left\vert{\mathbf{N}}\right\vert-df_{new}} } The higher the value $F(\mathbf{H}_{new},\mathbf{H}_{current})$ is, the better is the new model. The typical use of such a score is to address the question of adding a new locus in the model. Every locus along the genome may then be tested. ( If a tested locus is already in $\mathbf{H}$, $RSS_{current} = RSS_{new}$ its score is $0$). Moreover the amount of degrees of freedom involved in \ref{eq:F-test} can change along the genome, with lack of independence of loci or with similarity of parents. It is then wiser to use the probability that a Fischer distribution ($\mathcal{F}$) with these degrees of freedom is greater than the actual $F$ value. Our score is then score(\mathbf{H}_{new},\mathbf{H}_{current})=-log_{10}P[\mathcal{F}(df_{new}-df_{current},|\mathbf{N}|-df_{new})>F(\mathbf{H}_{new},\mathbf{H}_{current})] (We prefer positive and not so great values, so the $-log_{10}$ provides a monotonic and aesthetic transformation that makes this score prettier). Unfortunately we have no analytic expression of the distribution of its maximum along a genome under the null hypothesis, and can not establish a critical value for its significativity at a chosen risk level. So we'll use a resampling method by permutation to access an empiric value of its quantile. This permutations are performed on the fly by the software and their result is saved for potential reuse. Don't be surprised if the very same analysis takes longer time at first attempt. \section{Phenotypic values}\label{sec:phen_value} As you can see in formula \ref{eq:F-test} the difference of degrees of freedom between the two models has an important effect on the test value. Therefore it is a good idea not to waste them. Some hypothesis on the effects allow phenotypic values simplification and degrees of freedom reduction. We'll review some possible hypothesis. Lets suppose that in our study population a set $\mathbf{P}$ of alleles are possible. Typically $\mathbf{P}=\set{P1,P2}$, $|\mathbf{P}|=2$ for a simple cross between two parents, $\mathbf{P}=\set{A,B,C,D,E,F,G,H}$, $|\mathbf{P}|=8$ in MAGIC. \subsection{Single locus effects} Remember that $\mathbf{H}$ is a set of explicative factors $h$. Each explicative factor is a set of one or more loci. In the simplest models, each QTL has its own effect without any interaction. That is: every explicative factor contains exactly one locus. Every $h=\set{l}$ is a unit set, that is $|h|=1$. Possible genotypes are then allele pairs $g=(g_1,g_2) \in \mathbf{P}^2$. \subsubsection{Additive and connected model} The simplest possible model is the additive and connected one. Each allele has its own effect without any dominance, nor interaction with genetic background. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l} Most of the time, the degree of freedom used by adding a new QTL to such a model is the number of distinct possible allele minus one $df_{new}-df_{current}=p-1$. \subsubsection{Dominance and connected model} This model is much like the previous one. The only difference is that interaction between allele are possible. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l}+\delta_{g_1,g_2,l} Note that $\forall g_k\in\mathbf{P}, \ \delta_{g_k,g_k,l}=0$: dominance exists only when distinct alleles are involved. Most of the time, the degree of freedom used by adding a new QTL to such a model is the number of distinct possible allele minus one plus the number of possible interactions $df_{new}-df_{current}=p-1 + \frac{p(p-1)}{2}=\frac{(p-1)(p+2)}{2}$. Some design do not allow to detect such effect (e.g. Back-cross), \texttt{spell-qtl} is able to detect such cases, and will then discard any dominance effect, even if asked. \subsubsection{Additive and disconnected model} A different enrichment of additive and connected model is allowing interaction with genetic background. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l}+\beta_{g_1,l,c_i}+\beta_{g_2,l,c_i} This is equivalent to enrich the model by saying that additive affects are genetic background dependent. That is $\alpha_{g,l,c_i}^{dis}=\alpha_{g,l}+\beta_{g,l,c_i}$. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l,c_i}^{dis}+\alpha_{g_2,l,c_i}^{dis} Most of the time, the degree of freedom used by adding a new QTL to such a model is the sum of degrees of freedom that would have been needed by any cross in the design addressed alone. In the worst case $df_{new}-df_{current}=(p-1)\times |\mathbf{C}|$. Of course if there is only one cross $|\mathbf{C}|=1$ this is exactly the same as the additive and connected model. More surprisingly, some complex design such as star designs lead to identical model either connected or disconnected. \subsubsection{Dominance and disconnected model} This is the richest single locus model. Both additive and dominance effects are possible, and every effect has interactions with genetic background. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} =\alpha_{g_1,l}+\alpha_{g_2,l}+\delta_{g_1,g_2,l}+\beta_{g_1,l,c_i}+\beta_{g_2,l,c_i}+\gamma_{g_1,g_2,l,c_i} One more time this is equivalent to have a model with background dependent effects $\alpha_{g,l,c_i}^{dis}=\alpha_{g,l}+\beta_{g,l,c_i}$, and $\delta_{g_1,g_2,l,c_i}^{dis}=\delta_{g_1,g_2,l}+\gamma_{g_1,g_2,l,c_i}$ \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} =\alpha_{g_1,l,c_i}^{dis}+\alpha_{g_2,l,c_i}^{dis}+\delta_{g_1,g_2,l,c_i}^{dis} Both remarks about the previous model are true again. The number of degree of freedom for each added QTL in the worst case is : $df_{new}-df_{current}=\frac{(p-1)(p+2)}{2}\times |\mathbf{C}|$. \subsection{Epistasis between two loci} Addressing epistasis limited to two loci interactions means that the explicative factors in $\mathbf{H}$ are either unit sets or two elements sets. That is $\forall h \in \mathbf{H}, |h|\in\set{1,2}$. The phenotypic values for every single locus explicative factor $h$ are the same as in previous section. We'll focus in this section on phenotypic values when addressing exactly 2 QTLs: $h=\set{l^1,l^2}$. Possible genotypes are then 4-uple of allele : genotype at locus $l^1$ is $(g^1_1,g^1_2)$, genotype at locus $l^2$ is $(g^2_1,g^2_2)$, overall genotype is $g=(g^1_1,g^1_2,g^2_1,g^2_2)$. Note that the correct use of formula \ref{eq:complete_model} requires that $P(i,g,h)$ is the joint probablity of $g$ on $h$. If loci $l^1$ and $l^2$ are far enough from each other (or even better are on different linkage groups) this probability can be computed as a product of single locus probabilities. But in general case a true joint probability must be computed (especially in order to avoid ghost QTLs). Spell-qtl can compute such a probability (see chapter \ref{ch:probas}) The joint two locus phenotypic effect can be expressed as the sum of the direct terms (the phenotypic effect of each locus) and an interaction term. \subsubsection{Additive Additive model} In this first epistatsis model, the single QTL direct term follows the additive connected model, and there is interaction only between additive effects. \label{eq:phenotypic_value_epistasis_AA} \begin{split} \xi_{(g^1_1,g^1_2,g^2_1,g^2_2),\set{l^1,l^2},c_i} &= \xi_{(g^1_1,g^1_2),\set{l^1}} + \xi_{(g^2_1,g^2_2),\set{l^2}}\\ &+ \eta_{(g^1_1,g^2_1),\set{l^1,l^2}}+ \eta_{(g^1_1,g^2_2),\set{l^1,l^2}}+ \eta_{(g^1_2,g^2_1),\set{l^1,l^2}}+ \eta_{(g^1_2,g^2_2),\set{l^1,l^2}} \end{split} % \sum_{\mathclap{\substack{(l_1,l_2) \in h^2\\l_1 \neq l_2}}} \zeta_{g,l_1,l_2} \section{Simplest possible test} If we address only direct additive effects of a unique QTL in a unique biparental population, the phenotypic values in the equation \ref{eq:complete_model} can be simplified : \subsection*{no QTL} $\mathbf{H}=\mathbf{V}=\varnothing$ The linear model with only cross mean is: \forall i \in \mathbf{N} \ \ \ y_{i}=\mu +\varepsilon_{i}^{\mathbf{H}=\varnothing} RSS_{\mathbf{H}=\varnothing}=\sum_{i\in \mathbf{N}} \varepsilon_{i}^{\mathbf{H}=\varnothing} The only independant parameter is the mean $\mu$ therefore: p_{\mathbf{H}=\varnothing}=1 \subsection*{a unique QTL} $\mathbf{V}=\varnothing$; $\mathbf{H}=\set{ \set{l} }$ that is the set $\mathbf{H}$ contains only one explicative element and this element contains only one locus. \begin{split} \forall i \in \mathbf{N} \ \ \ y_{i}&=\mu + 2P(AA,l)\alpha_{A,l}+P(AB,l)(\alpha_{A,l}+\alpha_{B,l})+2P(BB,l)\alpha_{B,l} +\varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} \\ &=\mu + [2P(AA,l)+P(AB,l)]\alpha_{A,l}+[P(AB,l)+2P(BB,l)]\alpha_{B,l} +\varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} \end{split} \label{eq:simplest_model} RSS_{\mathbf{H}=\set{ \set{l} }}=\sum_{i\in \mathbf{N}} \varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} We have two independant parameters: the mean $\mu$ and the contrast between alleles $\alpha_{A,l}-\alpha_{B,l}$ p_{\mathbf{H}=\set{ \set{l} }}=2 \subsection*{F-test} The test in order to decide if we accept or not accept the QTL at locus $l$ we'll first compute $F$ value : F(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)=\frac{\frac{RSS_{\mathbf{H}=\varnothing}-RSS_{\mathbf{H}=\set{ \set{l} }} }{p_{\mathbf{H}=\set{ \set{l} }} - p_{\mathbf{H}=\varnothing}} }{ \frac{RSS_{\mathbf{H}=\set{ \set{l} }}}{\left\vert{\mathbf{N}}\right\vert-p_{\mathbf{H}=\set{ \set{l} }}} } = ({\left\vert{\mathbf{N}}\right\vert-2}) \left( \frac{ RSS_{\mathbf{H}=\varnothing} }{ {RSS_{\mathbf{H}=\set{ \set{l} }}} } -1 \right) And then the score score(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)=-log_{10}P[\mathcal{F}(1,\left\vert{\mathbf{N}}\right\vert-2)>F(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)]