Commit c582234a by Sylvain Jasson

 \documentclass[openright,twoside,10pt,DIV=11]{scrreprt} \usepackage[toc,page]{appendix} \usepackage{url} \usepackage{listings} \usepackage{amsmath} \usepackage{amssymb} \usepackage{mathtools} \usepackage{braket} \lstset{ breakatwhitespace=false, breaklines=true } \title{Spell-QTL User Manual} \subtitle{Species perscrutandis enixe locis locabuntur} \author{Damien Leroux \and Sylvain Jasson} \date{Version \input{version.tex} } \begin{document} \maketitle \chapter*{Foreword} Spell-QTL\footnote{ {\em Species perscrutandis enixe locis locabutur} means {\em The characters will be localized using an zealous loci inspection}, the very purpose of the software} is a software suite allowing to detect QTLs in any crossing design. It is developed at MIAT lab by Damien Leroux and Sylvain Jasson. It is available (including the present user manual) at \url{https://mulcyber.toulouse.inra.fr/frs/?group_id=204} under Gnu Public License. \chapter{Detection model} During detection we will consider many linear models and perform many analysis of variance (aka ANOVA). \section{Complete model} The more general form of the models is : \label{eq:complete_model} \forall i \in \mathbf{N} \ \ \ y_{i}=\mu_{c_i} + \sum_{h \in \mathbf{H}} \sum_{g \in \mathbf{G}(h)} P(i,g,h)\xi_{g,h,c_i} +\sum_{v \in \mathbf{V}} v_i \kappa_{v} +\varepsilon_{i} Where : \begin{itemize} \item $y_i$ is the value of the trait for individual $i$; \item $\mathbf{N}$ is the set of individuals under study, its cardinality $\left\vert{\mathbf{N}}\right\vert$ is the total number of individual in the design under study; \item $c_i$ is the cross to which individual $i$ belongs (let be $\mathbf{C}$ the set of crosses in the design, $\forall i\in \mathbf{N}, \ \ c_i \in \mathbf{C}$, its cardinality $|\mathbf{C}|$ is the total number of crosses in the design); \item $\mu_{c_i}$ is the estimated mean value of the trait for this cross; \item $\mathbf{H}$ is the set of genetic explicative factors (mainly $\mathbf{H}$ is a set QTLs, or a set of sets of QTLs when looking for epistasy); \item $\mathbf{G}(h)$ is the set of possible genotypes for factor $h$ (if $h$ is a unit set $h=\set{l}$ that is we address a single QTL, then $\mathbf{G}(h)=\mathbf{G}(\set{l})$ is the set of possible genotypes at locus $l$); \item $P(i,g,h)$ is the probability that individual $i$ has genotype $g$ on $h$; \item $\xi_{g,h,c_i}$ is the phenotypic value of genotype $g$ on $h$ in genetic background $c_i$ (see section \ref{sec:phen_value} for more information about the hypothesis we can make about $\xi$) ; \item $v$ is a user-defined explicative variable; \item $\mathbf{V}$ is the set of user-defined explicative variables (this set can be empty); \item $v_i$ is the value of variable $v$ for individual $i$; \item $\kappa_{v}$ is the effect of variable $v$; \item $\varepsilon_i$ is the modeling error for individual $i$. \end{itemize} Lower case letter values $y_i$, $c_i$ and $v_i$ are user provided, upper case letter value $P(i,g,h)$ is computed from markers score and genetic map (see chapter \ref{ch:probas}), bold upper case letters $\mathbf{N}$, $\mathbf{C}$, $\mathbf{H}$, $\mathbf{G}$ and $\mathbf{V}$ are sets, Greek letters values $\mu_{c}$, $\xi_{g,h,c}$, $\kappa_{v}$ and $\varepsilon_i$ are computed by the linear model engine. The sets $\mathbf{N}$, $\mathbf{C}$ and $\mathbf{V}$ are user defined, our main goal is to find the best $\mathbf{H}$ ($\mathbf{G}$ is $\mathbf{H}$-dependent). \section{Fitting and testing models} The lineal model engine minimizes the residual sum of squares $RSS=\sum_{i \in \mathbf{N}} \varepsilon_i^2$ under constraints \ref{eq:complete_model} for any tested set $\mathbf{H}$. Note that adding more elements to the tested set $\mathbf{H}$ will automatically reduce $RSS$ since it allows minimization over a wider space. Therefore the reduction of $RSS$ cannot be our rule for model selection. We must switch the question from "Does the new model display a smaller $RSS$?" to "Is the $RSS$ reduction of the new model worth the degrees of freedom it costs ?". In order to compare two nested models we will then use F-test, that is a measurement of the this compromise between goodness of fit and degrees of freedom cost. (To be fully honest, when we compare to models for pleiotropic effects the residual errors $\varepsilon_i$ won't be a vector but a matrix, its squared sum over individuals won't be number but a vector, and therefore we must use some kind of log-likelihood ratio, and a $\chi^2$ test with appropriate degree of freedom will do the job.) Lets now suppose that we have current model, and we wander if some new extended model is better. Let $df_{current}$ (resp. $df_{new}$) be the number of degree of freedom for the current model (resp. the new model). This number of degree of freedom is the number of independent parameters in the fitted model, that is the number of independent values the linear model engine has to choose for $\mu_{c}$, $\xi_{g,h,c}$ and $\kappa_{v}$. The test value in then: \label{eq:F-test} F(\mathbf{H}_{new},\mathbf{H}_{current})=\frac{ \frac{ RSS_{current} - RSS_{new} }{ df_{new}-df_{current} } }{ \frac{RSS_{new}}{\left\vert{\mathbf{N}}\right\vert-df_{new}} } The higher the value $F(\mathbf{H}_{new},\mathbf{H}_{current})$ is, the better is the new model. The typical use of such a score is to address the question of adding a new locus in the model. Every locus along the genome may then be tested. ( If a tested locus is already in $\mathbf{H}$, $RSS_{current} = RSS_{new}$ its score is $0$). Moreover the amount of degrees of freedom involved in \ref{eq:F-test} can change along the genome, with lack of independence of loci or with similarity of parents. It is then wiser to use the probability that a Fischer distribution ($\mathcal{F}$) with these degrees of freedom is greater than the actual $F$ value. Our score is then score(\mathbf{H}_{new},\mathbf{H}_{current})=-log_{10}P[\mathcal{F}(df_{new}-df_{current},|\mathbf{N}|-df_{new})>F(\mathbf{H}_{new},\mathbf{H}_{current})] (We prefer positive and not so great values, so the $-log_{10}$ provides a monotonic and aesthetic transformation that makes this score prettier). Unfortunately we have no analytic expression of the distribution of its maximum along a genome under the null hypothesis, and can not establish a critical value for its significativity at a chosen risk level. So we'll use a resampling method by permutation to access an empiric value of its quantile. This permutations are performed on the fly by the software and their result is saved for potential reuse. Don't be surprised if the very same analysis takes longer time at first attempt. \section{Phenotypic values}\label{sec:phen_value} As you can see in formula \ref{eq:F-test} the difference of degrees of freedom between the two models has an important effect on the test value. Therefore it is a good idea not to waste them. Some hypothesis on the effects allow phenotypic values simplification and degrees of freedom reduction. We'll review some possible hypothesis. Lets suppose that in our study population a set $\mathbf{P}$ of alleles are possible. Typically $\mathbf{P}=\set{P1,P2}$, $|\mathbf{P}|=2$ for a simple cross between two parents, $\mathbf{P}=\set{A,B,C,D,E,F,G,H}$, $|\mathbf{P}|=8$ in MAGIC. \subsection{Single locus effects} Remember that $\mathbf{H}$ is a set of explicative factors $h$. Each explicative factor is a set of one or more loci. In the simplest models, each QTL has its own effect without any interaction. That is: every explicative factor contains exactly one locus. Every $h=\set{l}$ is a unit set, that is $|h|=1$. Possible genotypes are then allele pairs $g=(g_1,g_2) \in \mathbf{P}^2$. \subsubsection{Additive and connected model} The simplest possible model is the additive and connected one. Each allele has its own effect without any dominance, nor interaction with genetic background. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l} Most of the time, the degree of freedom used by adding a new QTL to such a model is the number of distinct possible allele minus one $df_{new}-df_{current}=p-1$. \subsubsection{Dominance and connected model} This model is much like the previous one. The only difference is that interaction between allele are possible. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l}+\delta_{g_1,g_2,l} Note that $\forall g_k\in\mathbf{P}, \ \delta_{g_k,g_k,l}=0$: dominance exists only when distinct alleles are involved. Most of the time, the degree of freedom used by adding a new QTL to such a model is the number of distinct possible allele minus one plus the number of possible interactions $df_{new}-df_{current}=p-1 + \frac{p(p-1)}{2}=\frac{(p-1)(p+2)}{2}$. Some design do not allow to detect such effect (e.g. Back-cross), spell-qtl is able to detect such cases, and will then discard any dominance effect, even if asked. \subsubsection{Additive and disconnected model} A different enrichment of additive and connected model is allowing interaction with genetic background. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l}+\beta_{g_1,l,c_i}+\beta_{g_2,l,c_i} This is equivalent to enrich the model by saying that additive affects are genetic background dependent. That is $\alpha_{g,l,c_i}^{dis}=\alpha_{g,l}+\beta_{g,l,c_i}$. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l,c_i}^{dis}+\alpha_{g_2,l,c_i}^{dis} Most of the time, the degree of freedom used by adding a new QTL to such a model is the sum of degrees of freedom that would have been needed by any cross in the design addressed alone. In the worst case $df_{new}-df_{current}=(p-1)\times |\mathbf{C}|$. Of course if there is only one cross $|\mathbf{C}|=1$ this is exactly the same as the additive and connected model. More surprisingly, some complex design such as star designs lead to identical model either connected or disconnected. \subsubsection{Dominance and disconnected model} This is the richest single locus model. Both additive and dominance effects are possible, and every effect has interactions with genetic background. \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} =\alpha_{g_1,l}+\alpha_{g_2,l}+\delta_{g_1,g_2,l}+\beta_{g_1,l,c_i}+\beta_{g_2,l,c_i}+\gamma_{g_1,g_2,l,c_i} One more time this is equivalent to have a model with background dependent effects $\alpha_{g,l,c_i}^{dis}=\alpha_{g,l}+\beta_{g,l,c_i}$, and $\delta_{g_1,g_2,l,c_i}^{dis}=\delta_{g_1,g_2,l}+\gamma_{g_1,g_2,l,c_i}$ \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} =\alpha_{g_1,l,c_i}^{dis}+\alpha_{g_2,l,c_i}^{dis}+\delta_{g_1,g_2,l,c_i}^{dis} Both remarks about the previous model are true again. The number of degree of freedom for each added QTL in the worst case is : $df_{new}-df_{current}=\frac{(p-1)(p+2)}{2}\times |\mathbf{C}|$. \subsection{Epistasis between two loci} Addressing epistasis limited to two loci interactions means that the explicative factors in $\mathbf{H}$ are either unit sets or two elements sets. That is $\forall h \in \mathbf{H}, |h|\in\set{1,2}$. The phenotypic values for every single locus explicative factor $h$ are the same as in previous section. We'll focus in this section on phenotypic values when addressing exactly 2 QTLs: $h=\set{l^1,l^2}$. Possible genotypes are then 4-uple of allele : genotype at locus $l^1$ is $(g^1_1,g^1_2)$, genotype at locus $l^2$ is $(g^2_1,g^2_2)$, overall genotype is $g=(g^1_1,g^1_2,g^2_1,g^2_2)$. Note that the correct use of formula \ref{eq:complete_model} requires that $P(i,g,h)$ is the joint probablity of $g$ on $h$. If loci $l^1$ and $l^2$ are far enough from each other (or even better are on different linkage groups) this probability can be computed as a product of single locus probabilities. But in general case a true joint probability must be computed (especially in order to avoid ghost QTLs). Spell-qtl can compute such a probability (see chapter \ref{ch:probas}) The joint two locus phenotypic effect can be expressed as the sum of the direct terms (the phenotypic effect of each locus) and an interaction term. \subsubsection{Additive Additive model} In this first epistatsis model, the single QTL direct term follows the additive connected model, and there is interaction only between additive effects. \label{eq:phenotypic_value_epistasis_AA} \begin{split} \xi_{(g^1_1,g^1_2,g^2_1,g^2_2),\set{l^1,l^2},c_i} &= \xi_{(g^1_1,g^1_2),\set{l^1}} + \xi_{(g^2_1,g^2_2),\set{l^2}}\\ &+ \eta_{(g^1_1,g^2_1),\set{l^1,l^2}}+ \eta_{(g^1_1,g^2_2),\set{l^1,l^2}}+ \eta_{(g^1_2,g^2_1),\set{l^1,l^2}}+ \eta_{(g^1_2,g^2_2),\set{l^1,l^2}} \end{split} % \sum_{\mathclap{\substack{(l_1,l_2) \in h^2\\l_1 \neq l_2}}} \zeta_{g,l_1,l_2} \section{Simplest model} If we address only direct additive effects of a unique QTL in a unique biparental population, the phenotypic the equation \ref{eq:complete_model} can be simplified in : \subsection*{no QTL} $\mathbf{H}=\mathbf{V}=\varnothing$ The linear model with only cross mean is: \forall i \in \mathbf{N} \ \ \ y_{i}=\mu +\varepsilon_{i}^{\mathbf{H}=\varnothing} RSS_{\mathbf{H}=\varnothing}=\sum_{i\in \mathbf{N}} \varepsilon_{i}^{\mathbf{H}=\varnothing} The only independant parameter is the mean $\mu$ therefore: p_{\mathbf{H}=\varnothing}=1 \subsection*{a unique QTL} $\mathbf{V}=\varnothing$; $\mathbf{H}=\set{ \set{l} }$ that is the set $\mathbf{H}$ contains only one explicative element and this element contains only one locus. \begin{split} \forall i \in \mathbf{N} \ \ \ y_{i}&=\mu + 2P(AA,l)\alpha_{A,l}+P(AB,l)(\alpha_{A,l}+\alpha_{B,l})+2P(BB,l)\alpha_{B,l} +\varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} \\ &=\mu + [2P(AA,l)+P(AB,l)]\alpha_{A,l}+[P(AB,l)+2P(BB,l)]\alpha_{B,l} +\varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} \end{split} \label{eq:simplest_model} RSS_{\mathbf{H}=\set{ \set{l} }}=\sum_{i\in \mathbf{N}} \varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} We have two independant parameters: the mean $\mu$ and the contrast between alleles $\alpha_{A,l}-\alpha_{B,l}$ p_{\mathbf{H}=\set{ \set{l} }}=2 \subsection*{F-test} The test in order to decide if we accept or not accept the QTL at locus $l$ we'll first compute $F$ value : F(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)=\frac{\frac{RSS_{\mathbf{H}=\varnothing}-RSS_{\mathbf{H}=\set{ \set{l} }} }{p_{\mathbf{H}=\set{ \set{l} }} - p_{\mathbf{H}=\varnothing}} }{ \frac{RSS_{\mathbf{H}=\set{ \set{l} }}}{\left\vert{\mathbf{N}}\right\vert-p_{\mathbf{H}=\set{ \set{l} }}} } = ({\left\vert{\mathbf{N}}\right\vert-2}) \left( \frac{ RSS_{\mathbf{H}=\varnothing} }{ {RSS_{\mathbf{H}=\set{ \set{l} }}} } -1 \right) And then the score score(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)=-log_{10}P[\mathcal{F}(1,\left\vert{\mathbf{N}}\right\vert-2)>F(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)] \chapter{Underlying parental origin probability model}\label{ch:probas} The probabilities in equation \ref{eq:complete_model} are computed on the fly by spell-qtl when necessary. A hidden Markov chain representing the origin of the genetic material is derived for any population in the pedigree by spell-pedigree. Depending on the number of different parents in the design and the number of non-independent meiosis the minimal number of states that allow to enjoy the Markov property can be much more than the number of possible genotypes. Spell-pedigree ensures that this number is the smallest possible using lumping operations as soon as it is possible. At any observed marker, the probabilities of these states are inferred by spell-marker. Note that as soon as non clone population are involved, some information is gathered from siblings in order to infer states probabilities. Suppose the locus of interest $q$ being preceded by $n_l$ observed loci and followed by $n_r$ observed loci, the loci sequence being $[l_{n_l},...,l_2,l_1,q,r_1,r_2,...,r_{n_r}]$. The parental origin probability vector at locus $q$ is then : \begin{multline} \label{eq:pop} p(q|\mathcal{M}_{l_1},...,\mathcal{M}_{l_{n_l}},\mathcal{M}_{r_1},...,\mathcal{M}_{r_{n_r}}) =\\ \frac{ \left(T_{q-l_1}\displaystyle\left(\prod_{i=1}^{n_l-1} M_{l_i}T_{l_{i}-l_{i+1}} \right) M_{l_{n_l}} \varphi \right) \odot \left(T_{r_1-q}\displaystyle\left(\prod_{i=1}^{n_r-1}M_{r_i}T_{r_{i+1}-r_i} \right) M_{r_{n_r}} \varphi \right) \odot \varphi^{-1} } {\left| \left(T_{q-l_1}\displaystyle\left(\prod_{i=1}^{n_l-1} M_{l_i}T_{l_{i}-l_{i+1}} \right) M_{l_{n_l}} \varphi \right) \odot \left(T_{r_1-q}\displaystyle\left(\prod_{i=1}^{n_r-1}M_{r_i}T_{r_{i+1}-r_i} \right) M_{r_{n_r}} \varphi \right) \odot \varphi^{-1} \right| } \end{multline} Where : \begin{itemize} \item $T_d$ is the transition matrix of the Markov chain (as a function of genetic distance $d$). It is computed by spell-pedigree; \item $\varphi$ is the steady state vector associated with transition matrix $T_d$, $\varphi^{-1}$ its component wise inverse ; \item $M_l$ is the observation matrix at locus $l$. It is inferred from observations at locus $l$ by spell-marker; \end{itemize} Equation \ref{eq:pop} can be applied recursively in order to compute multi-loci joint probabilities. The code in spell-qtl allows these joint probabilities computations when needed. \chapter{Generated outputs} \chapter{The main Spell-QTL pipeline} \begin{appendices} \chapter{spell-Pedigree man page} \sloppy \input{spell-pedigree.tex} \fussy \chapter{spell-marker man page} \input{spell-marker.tex} \chapter{spell-qtl man page} \input{spell-qtl.tex} \chapter{spell-qtl-examples man page} \input{spell-qtl-examples.tex} \end{appendices} \tableofcontents \end{document}