### split main file, change typewriter police, round phen values

parent 12aa17f3
 \chapter{Detection model} During detection we will consider many linear models and perform many analysis of variance (aka ANOVA). \section{Complete model} The more general form of the models is : \begin{equation} \label{eq:complete_model} \forall i \in \mathbf{N} \ \ \ y_{i}=\mu_{c_i} + \sum_{h \in \mathbf{H}} \sum_{g \in \mathbf{G}(h)} P(i,g,h)\xi_{g,h,c_i} +\sum_{v \in \mathbf{V}} v_i \kappa_{v} +\varepsilon_{i} \end{equation} Where : \begin{itemize} \item $y_i$ is the value of the trait for individual $i$; \item $\mathbf{N}$ is the set of individuals under study, its cardinality $\left\vert{\mathbf{N}}\right\vert$ is the total number of individual in the design under study; \item $c_i$ is the cross to which individual $i$ belongs (let be $\mathbf{C}$ the set of crosses in the design, $\forall i\in \mathbf{N}, \ \ c_i \in \mathbf{C}$, its cardinality $|\mathbf{C}|$ is the total number of crosses in the design); \item $\mu_{c_i}$ is the estimated mean value of the trait for this cross; \item $\mathbf{H}$ is the set of genetic explicative factors (mainly $\mathbf{H}$ is a set QTLs, or a set of sets of QTLs when looking for epistasy); \item $\mathbf{G}(h)$ is the set of possible genotypes for factor $h$ (if $h$ is a unit set $h=\set{l}$ that is we address a single QTL, then $\mathbf{G}(h)=\mathbf{G}(\set{l})$ is the set of possible genotypes at locus $l$); \item $P(i,g,h)$ is the probability that individual $i$ has genotype $g$ on $h$; \item $\xi_{g,h,c_i}$ is the phenotypic value of genotype $g$ on $h$ in genetic background $c_i$ (see section \ref{sec:phen_value} for more information about the hypothesis we can make about $\xi$) ; \item $v$ is a user-defined explicative variable; \item $\mathbf{V}$ is the set of user-defined explicative variables (this set can be empty); \item $v_i$ is the value of variable $v$ for individual $i$; \item $\kappa_{v}$ is the effect of variable $v$; \item $\varepsilon_i$ is the modeling error for individual $i$. \end{itemize} Lower case letter values $y_i$, $c_i$ and $v_i$ are user provided, upper case letter value $P(i,g,h)$ is computed from markers score and genetic map (see chapter \ref{ch:probas}), bold upper case letters $\mathbf{N}$, $\mathbf{C}$, $\mathbf{H}$, $\mathbf{G}$ and $\mathbf{V}$ are sets, Greek letters values $\mu_{c}$, $\xi_{g,h,c}$, $\kappa_{v}$ and $\varepsilon_i$ are computed by the linear model engine. The sets $\mathbf{N}$, $\mathbf{C}$ and $\mathbf{V}$ are user defined, our main goal is to find the best $\mathbf{H}$ ($\mathbf{G}$ is $\mathbf{H}$-dependent). \section{Fitting and testing models} The lineal model engine minimizes the residual sum of squares $RSS=\sum_{i \in \mathbf{N}} \varepsilon_i^2$ under constraints \ref{eq:complete_model} for any tested set $\mathbf{H}$. Note that adding more elements to the tested set $\mathbf{H}$ will automatically reduce $RSS$ since it allows minimization over a wider space. Therefore the reduction of $RSS$ cannot be our rule for model selection. We must switch the question from "Does the new model display a smaller $RSS$?" to "Is the $RSS$ reduction of the new model worth the degrees of freedom it costs ?". In order to compare two nested models we will then use F-test, that is a measurement of the this compromise between goodness of fit and degrees of freedom cost. (To be fully honest, when we compare to models for pleiotropic effects the residual errors $\varepsilon_i$ won't be a vector but a matrix, its squared sum over individuals won't be number but a vector, and therefore we must use some kind of log-likelihood ratio, and a $\chi^2$ test with appropriate degree of freedom will do the job.) Lets now suppose that we have current model, and we wander if some new extended model is better. Let $df_{current}$ (resp. $df_{new}$) be the number of degree of freedom for the current model (resp. the new model). This number of degree of freedom is the number of independent parameters in the fitted model, that is the number of independent values the linear model engine has to choose for $\mu_{c}$, $\xi_{g,h,c}$ and $\kappa_{v}$. The test value in then: \begin{equation} \label{eq:F-test} F(\mathbf{H}_{new},\mathbf{H}_{current})=\frac{ \frac{ RSS_{current} - RSS_{new} }{ df_{new}-df_{current} } }{ \frac{RSS_{new}}{\left\vert{\mathbf{N}}\right\vert-df_{new}} } \end{equation} The higher the value $F(\mathbf{H}_{new},\mathbf{H}_{current})$ is, the better is the new model. The typical use of such a score is to address the question of adding a new locus in the model. Every locus along the genome may then be tested. ( If a tested locus is already in $\mathbf{H}$, $RSS_{current} = RSS_{new}$ its score is $0$). Moreover the amount of degrees of freedom involved in \ref{eq:F-test} can change along the genome, with lack of independence of loci or with similarity of parents. It is then wiser to use the probability that a Fischer distribution ($\mathcal{F}$) with these degrees of freedom is greater than the actual $F$ value. Our score is then \begin{equation} score(\mathbf{H}_{new},\mathbf{H}_{current})=-log_{10}P[\mathcal{F}(df_{new}-df_{current},|\mathbf{N}|-df_{new})>F(\mathbf{H}_{new},\mathbf{H}_{current})] \end{equation} (We prefer positive and not so great values, so the $-log_{10}$ provides a monotonic and aesthetic transformation that makes this score prettier). Unfortunately we have no analytic expression of the distribution of its maximum along a genome under the null hypothesis, and can not establish a critical value for its significativity at a chosen risk level. So we'll use a resampling method by permutation to access an empiric value of its quantile. This permutations are performed on the fly by the software and their result is saved for potential reuse. Don't be surprised if the very same analysis takes longer time at first attempt. \section{Phenotypic values}\label{sec:phen_value} As you can see in formula \ref{eq:F-test} the difference of degrees of freedom between the two models has an important effect on the test value. Therefore it is a good idea not to waste them. Some hypothesis on the effects allow phenotypic values simplification and degrees of freedom reduction. We'll review some possible hypothesis. Lets suppose that in our study population a set $\mathbf{P}$ of alleles are possible. Typically $\mathbf{P}=\set{P1,P2}$, $|\mathbf{P}|=2$ for a simple cross between two parents, $\mathbf{P}=\set{A,B,C,D,E,F,G,H}$, $|\mathbf{P}|=8$ in MAGIC. \subsection{Single locus effects} Remember that $\mathbf{H}$ is a set of explicative factors $h$. Each explicative factor is a set of one or more loci. In the simplest models, each QTL has its own effect without any interaction. That is: every explicative factor contains exactly one locus. Every $h=\set{l}$ is a unit set, that is $|h|=1$. Possible genotypes are then allele pairs $g=(g_1,g_2) \in \mathbf{P}^2$. \subsubsection{Additive and connected model} The simplest possible model is the additive and connected one. Each allele has its own effect without any dominance, nor interaction with genetic background. \begin{equation} \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l} \end{equation} Most of the time, the degree of freedom used by adding a new QTL to such a model is the number of distinct possible allele minus one $df_{new}-df_{current}=p-1$. \subsubsection{Dominance and connected model} This model is much like the previous one. The only difference is that interaction between allele are possible. \begin{equation} \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l}+\delta_{g_1,g_2,l} \end{equation} Note that $\forall g_k\in\mathbf{P}, \ \delta_{g_k,g_k,l}=0$: dominance exists only when distinct alleles are involved. Most of the time, the degree of freedom used by adding a new QTL to such a model is the number of distinct possible allele minus one plus the number of possible interactions $df_{new}-df_{current}=p-1 + \frac{p(p-1)}{2}=\frac{(p-1)(p+2)}{2}$. Some design do not allow to detect such effect (e.g. Back-cross), \texttt{spell-qtl} is able to detect such cases, and will then discard any dominance effect, even if asked. \subsubsection{Additive and disconnected model} A different enrichment of additive and connected model is allowing interaction with genetic background. \begin{equation} \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l}+\beta_{g_1,l,c_i}+\beta_{g_2,l,c_i} \end{equation} This is equivalent to enrich the model by saying that additive affects are genetic background dependent. That is $\alpha_{g,l,c_i}^{dis}=\alpha_{g,l}+\beta_{g,l,c_i}$. \begin{equation} \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l,c_i}^{dis}+\alpha_{g_2,l,c_i}^{dis} \end{equation} Most of the time, the degree of freedom used by adding a new QTL to such a model is the sum of degrees of freedom that would have been needed by any cross in the design addressed alone. In the worst case $df_{new}-df_{current}=(p-1)\times |\mathbf{C}|$. Of course if there is only one cross $|\mathbf{C}|=1$ this is exactly the same as the additive and connected model. More surprisingly, some complex design such as star designs lead to identical model either connected or disconnected. \subsubsection{Dominance and disconnected model} This is the richest single locus model. Both additive and dominance effects are possible, and every effect has interactions with genetic background. \begin{equation} \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} =\alpha_{g_1,l}+\alpha_{g_2,l}+\delta_{g_1,g_2,l}+\beta_{g_1,l,c_i}+\beta_{g_2,l,c_i}+\gamma_{g_1,g_2,l,c_i} \end{equation} One more time this is equivalent to have a model with background dependent effects $\alpha_{g,l,c_i}^{dis}=\alpha_{g,l}+\beta_{g,l,c_i}$, and $\delta_{g_1,g_2,l,c_i}^{dis}=\delta_{g_1,g_2,l}+\gamma_{g_1,g_2,l,c_i}$ \begin{equation} \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} =\alpha_{g_1,l,c_i}^{dis}+\alpha_{g_2,l,c_i}^{dis}+\delta_{g_1,g_2,l,c_i}^{dis} \end{equation} Both remarks about the previous model are true again. The number of degree of freedom for each added QTL in the worst case is : $df_{new}-df_{current}=\frac{(p-1)(p+2)}{2}\times |\mathbf{C}|$. \subsection{Epistasis between two loci} Addressing epistasis limited to two loci interactions means that the explicative factors in $\mathbf{H}$ are either unit sets or two elements sets. That is $\forall h \in \mathbf{H}, |h|\in\set{1,2}$. The phenotypic values for every single locus explicative factor $h$ are the same as in previous section. We'll focus in this section on phenotypic values when addressing exactly 2 QTLs: $h=\set{l^1,l^2}$. Possible genotypes are then 4-uple of allele : genotype at locus $l^1$ is $(g^1_1,g^1_2)$, genotype at locus $l^2$ is $(g^2_1,g^2_2)$, overall genotype is $g=(g^1_1,g^1_2,g^2_1,g^2_2)$. Note that the correct use of formula \ref{eq:complete_model} requires that $P(i,g,h)$ is the joint probablity of $g$ on $h$. If loci $l^1$ and $l^2$ are far enough from each other (or even better are on different linkage groups) this probability can be computed as a product of single locus probabilities. But in general case a true joint probability must be computed (especially in order to avoid ghost QTLs). Spell-qtl can compute such a probability (see chapter \ref{ch:probas}) The joint two locus phenotypic effect can be expressed as the sum of the direct terms (the phenotypic effect of each locus) and an interaction term. \subsubsection{Additive Additive model} In this first epistatsis model, the single QTL direct term follows the additive connected model, and there is interaction only between additive effects. \begin{equation}\label{eq:phenotypic_value_epistasis_AA} \begin{split} \xi_{(g^1_1,g^1_2,g^2_1,g^2_2),\set{l^1,l^2},c_i} &= \xi_{(g^1_1,g^1_2),\set{l^1}} + \xi_{(g^2_1,g^2_2),\set{l^2}}\\ &+ \eta_{(g^1_1,g^2_1),\set{l^1,l^2}}+ \eta_{(g^1_1,g^2_2),\set{l^1,l^2}}+ \eta_{(g^1_2,g^2_1),\set{l^1,l^2}}+ \eta_{(g^1_2,g^2_2),\set{l^1,l^2}} \end{split} \end{equation} % \sum_{\mathclap{\substack{(l_1,l_2) \in h^2\\l_1 \neq l_2}}} \zeta_{g,l_1,l_2} \section{Simplest possible test} If we address only direct additive effects of a unique QTL in a unique biparental population, the phenotypic values in the equation \ref{eq:complete_model} can be simplified : \subsection*{no QTL} $\mathbf{H}=\mathbf{V}=\varnothing$ The linear model with only cross mean is: \begin{equation} \forall i \in \mathbf{N} \ \ \ y_{i}=\mu +\varepsilon_{i}^{\mathbf{H}=\varnothing} \end{equation} \begin{equation} RSS_{\mathbf{H}=\varnothing}=\sum_{i\in \mathbf{N}} \varepsilon_{i}^{\mathbf{H}=\varnothing} \end{equation} The only independant parameter is the mean $\mu$ therefore: \begin{equation} p_{\mathbf{H}=\varnothing}=1 \end{equation} \subsection*{a unique QTL} $\mathbf{V}=\varnothing$; $\mathbf{H}=\set{ \set{l} }$ that is the set $\mathbf{H}$ contains only one explicative element and this element contains only one locus. \begin{equation} \begin{split} \forall i \in \mathbf{N} \ \ \ y_{i}&=\mu + 2P(AA,l)\alpha_{A,l}+P(AB,l)(\alpha_{A,l}+\alpha_{B,l})+2P(BB,l)\alpha_{B,l} +\varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} \\ &=\mu + [2P(AA,l)+P(AB,l)]\alpha_{A,l}+[P(AB,l)+2P(BB,l)]\alpha_{B,l} +\varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} \end{split} \label{eq:simplest_model} \end{equation} \begin{equation} RSS_{\mathbf{H}=\set{ \set{l} }}=\sum_{i\in \mathbf{N}} \varepsilon_{i}^{\mathbf{H}=\set{ \set{l} }} \end{equation} We have two independant parameters: the mean $\mu$ and the contrast between alleles $\alpha_{A,l}-\alpha_{B,l}$ \begin{equation} p_{\mathbf{H}=\set{ \set{l} }}=2 \end{equation} \subsection*{F-test} The test in order to decide if we accept or not accept the QTL at locus $l$ we'll first compute $F$ value : \begin{equation} F(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)=\frac{\frac{RSS_{\mathbf{H}=\varnothing}-RSS_{\mathbf{H}=\set{ \set{l} }} }{p_{\mathbf{H}=\set{ \set{l} }} - p_{\mathbf{H}=\varnothing}} }{ \frac{RSS_{\mathbf{H}=\set{ \set{l} }}}{\left\vert{\mathbf{N}}\right\vert-p_{\mathbf{H}=\set{ \set{l} }}} } = ({\left\vert{\mathbf{N}}\right\vert-2}) \left( \frac{ RSS_{\mathbf{H}=\varnothing} }{ {RSS_{\mathbf{H}=\set{ \set{l} }}} } -1 \right) \end{equation} And then the score \begin{equation} score(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)=-log_{10}P[\mathcal{F}(1,\left\vert{\mathbf{N}}\right\vert-2)>F(\mathbf{H}=\set{ \set{l} },\mathbf{H}=\varnothing)] \end{equation}
 ... ... @@ -7,7 +7,7 @@ EXAMPLE_DIR = $(SRCDIR)/examples/three_parents_F2 all: user_manual.pdf user_manual.pdf: spell-pedigree.tex spell-marker.tex spell-qtl.tex spell-qtl-examples.tex version.tex user_manual.tex images/Spell-pipeline2.svg .inputs .outputs pdflatex --shell-escape user_manual.tex && pdflatex --shell-escape user_manual.tex > /dev/null pdflatex --shell-escape user_manual.tex && pdflatex --shell-escape user_manual.tex version.tex: git tag | tail -1 > version.tex ... ... @@ -32,9 +32,14 @@ spell-qtl-examples.tex:$(SRCDIR)/doc/man/spell-qtl-examples.1.md sed -i s/\\label\{/\label\{spell-qtl-examples:/g spell-qtl-examples.tex sed -i s/--/-\{\}-/g spell-qtl-examples.tex input_files/example1.map input_files/example1_F2.phen input_files/example1_F2.gen input_files/example1.ped:input_files/%: $(EXAMPLE_DIR)/% input_files/example1.map input_files/example1_F2.gen input_files/example1.ped:input_files/%:$(EXAMPLE_DIR)/% mkdir -p input_files awk -v len=69 '{ if (length($$0) > len) print substr($$0, 1, len-3) "..."; else print; }' $< >$@ awk -v len=73 '{ if (length($$0) > len) print substr($$0, 1, len-3) "..."; else print; }' $< >$@ input_files/example1_F2.phen: $(EXAMPLE_DIR)/example1_F2.phen mkdir -p input_files awk '{printf $$1;for(i=2;i<=NF;i++){printf " %.3f",$$i}; printf "\n"}'$(EXAMPLE_DIR)/example1_F2.phen | awk -v len=73 '{ if (length($$0) > len) print substr($$0, 1, len-3) "..."; else print; }' > input_files/example1_F2.phen .inputs: input_files/example1.map input_files/example1_F2.phen input_files/example1_F2.gen input_files/example1.ped touch @ ... ... @@ -59,9 +64,8 @@ clean: rm -f version.tex rm -f spell-pedigree.tex spell-marker.tex spell-qtl.tex spell-qtl-examples.tex rm -f images/*.pdf images/*.pdf_tex rm -rf input_files veryclean: clean very_clean: clean rm -f *~ rm -rf input_files rm -rf my_directory  \chapter{Generated outputs} \section{General organization} General output directory organization is : \begin{forest} for tree={ font=\ttfamily, %minimum height=0.75cm, %rounded corners=2pt, grow'=0, %inner ysep=8pt, child anchor=west, parent anchor=south, anchor=west, calign=first, %edge={rounded corners}, edge path={ \noexpand\path [draw, \forestoption{edge}] (!u.south west) +(12.5pt,0) |- (.child anchor)\forestoption{edge label}; }, before typesetting nodes={ if n=1 {insert before={[,phantom,minimum height=10pt]}} {} }, fit=band, %s sep=12pt, before computing xy={l=20pt}, } [my\_directory \textsl{(directory)} [my\_name.1-point \textsl{(directory)} [my\_name.pedigree-and-probabilities.M\_1\_10.csv \textsl{(text file)} ] [...] ] [my\_name.cache \textsl{(directory)} [my\_name.spell-marker.data \textsl{(binary file)}] [my\_name.spell-pedigree.data \textsl{(binary file)}] ] [my\_name.n-point \textsl{(directory)} [ch1 \textsl{(directory)} [F2 \textsl{(directory)} [my\_name.ch1.F2.0.csv \textsl{(text file)}] [...] ] [F2C \textsl{(directory)} [my\_name.ch1.F2.0.csv \textsl{(text file)}] [...] ] ] [ch2 \textsl{(directory)} [...] ] ] [my\_name.report \textsl{(directory)} [...] ] [my\_name.spell-qtl.log \textsl{(text file)} ] ] \end{forest} \begin{itemize} \item A common working directory must be set for all 3 executables using command line option \texttt{-wd my\_directory} or \texttt{--work-directory my\_directory}. Every result produced during analysis will be directed to this directory. \item A configuration name using command line option \texttt{-n my\_name} or \texttt{--name my\_name} is used to name subdirectories. \item Parental Origin Probabilities are output as CSV files \begin{description} \item [1-point] one file per marker (with pedigree like structure), use option \texttt{-O1} in command \texttt{spell-marker} to output these files \item [n-point] one file per linkage group per generation per individual, use model option \texttt{output-nppop} in command \texttt{spell-qtl} to output these files. \end{description} \item a report containing \begin{itemize} \item A text-mode file rendering of the genetic map with the detected QTLs and their confidence interval \item The model matrix ans the variance-covariance matrix for each selection the detection algorithm used \item The detailed final model including variance-covariance matrix, coefficients, contrasts and contrasts significance \end{itemize} \item some hidden files with cached computation results. \end{itemize} \section{Output files samples} \subsection{1-point POP} \lstinputlisting[numbers=left, frame=single, caption={[1-point output file] 1-point output file (selected lines from file \texttt{my\_name.pedigree-and-probabilities.M\_1\_1.csv} in directory \texttt{my\_directory/my\_name.1-point/} ) },%(selected lines from my\_name.pedigree-and-probabilities.M_1_1.csv from my\_directory/my\_name.1-point/)}, linerange={1-12,107-112} ]{my_directory/my_name.1-point/my_name.pedigree-and-probabilities.M_1_1.csv} Note that you must use \texttt{-01} command line option in \texttt{spell-marker} in order to generate these files. \subsection{n-point POP}  \chapter{The main Spell-QTL pipeline} \section{General view} \begin{figure}[h] \centering \includesvg[width=\columnwidth]{images/Spell-pipeline2} \caption{The main Spell-QTL pipeline}\label{fig:pipeline} \end{figure} Global software organization is displayed in figure \ref{fig:pipeline}. More detailed informations about the main purpose of each part and then about the required input files will be provided in this chapter. \section{Software suite details} \subsection{\texttt{spell-pedigree}} \begin{itemize} \item Computes the transition matrices for the Continuous Time Hidden Markov Models (CTHMM). They are theT_d$matrices in formula \ref{eq:pop}. \item These computations are inherently dependent, so it can only run sequentially. \item Outputs a data file that can be fed to \texttt{spell-marker}. \end{itemize} \subsection{\texttt{spell-marker}} \begin{itemize} \item Computes the 1-point Parental Origins Probabilities by Bayesian inference for all markers. \item Each marker is independent, so it can run in various ways: \begin{itemize} \item Sequentially, \item Multithreaded, \item Scheduling jobs on {\em Sun Grid Engine}, \item Sending jobs to remote machines via \texttt{ ssh} \end{itemize} \item Outputs a data file that can be fed to \texttt{spell-qtl}. \item Can also output the raw Parental Origin Probabilities. \end{itemize} \subsection{\texttt{spell-qtl}} \begin{itemize} \item Performs the QTL analysis {\em per se}. \item Can also output the n-point Parental Origin Probabilities along the linkage groups. \item Can run most computations concurrently on a multicore computer. \item Computation results are cached on disk (and/or in RAM). \end{itemize} \section{Input files} \subsection{Pedigree} \subsubsection{File format} See \texttt{spell-pedigree} man page (at appendix \ref{ch:spell:predigree}) \subsubsection{File sample} \lstinputlisting[numbers=left, frame=single, breaklines=false, caption={[Pedigree (.ped input file)]Pedigree (selected lines from example1.ped from three\_parents\_F2 example)}, linerange={1-12,107-112} ] {input_files/example1.ped} \subsection{Marker observations} \subsubsection{File format} \texttt{spell-marker} understand a few common formats, based on MapMaker RAW format (without traits) : \begin{itemize} \item A line beginning with \texttt{ data type} followed by ignored text \item A line containing four integer values : number of markers, number of individuals, two ignored values \item A line per marker beginning with starred(*) marker name followed by a space and by allele observed or inferred for each individual (a character per individual). \end{itemize} Build in allele code are : \begin{description} \item SNP observations, where 0 and 2 are homozygous and 1 is heterozygous. These observations type are relevant for any individual in the pedigree, including parents. \texttt{spell-marker} will then perform inference of possible genotypes and inference of possible states in the CTHMM. \item[ABHCD] MapMaker like Parental Origin inferred observations. These are relevant for inbred lines crosses products. Let's consider the cross$A|A \times B|B$: \begin{itemize} \item The child is typed A and the allele A is not dominant. The only possible genotype is$A|A$. This is encoded by the character \texttt{ A} in MapMaker. \item The child is typed A and the allele A is dominant. The possible genotype are$A|A$,$A|B$and$B|A$. This is encoded by the character \texttt{ D} in MapMaker. \item The child is typed B and the allele B is not dominant. The only possible genotype is$B|B$. This is encoded by the character \texttt{ B} in MapMaker. \item The child is typed B and the allele B is dominant. The possible genotype are$A|B$,$B|A$and$B|B$. This is encoded by the character \texttt{ C} in MapMaker. \item The child is typed AB (the allele A and B are codominant). The possible genotype are$A|B$and$B|A$. This is encoded by the character \texttt{ H} in MapMaker. \item The child in not typed. The possible genotypes are$A|A$,$A|B$,$B|A$and$B|B$. This is encoded by the character \texttt{ -} in MapMaker. \end{itemize} The parental origin letters can be overridden in the command line. \item[CP] Outbred observations as defined in Cathagene. These observations are relevant for all known phases situations, including cases where one parent is homozygous, when 3 or 4 different alleles are present. Lets consider the cross$A|B \times C|D$: The possibles child genotypes are$A|C$,$A|D$,$B|C$and$B|D$. Carthagene format actually enables the user to express any subset of the 4 different possibilities using a single hexadecimal digit (0-f). \begin{center} \begin{tabular}{cc} Code & Possible genotypes \\ \hline 1 &$A|C$\\ 2 &$A|D$\\ 3 &$A|C$,$A|D$\\ 4 &$B|C$\\ 5 &$A|C$,$B|C$\\ 6 &$A|D$,$B|C$\\ 7 &$A|C$,$A|D$,$B|C$\\ 8 &$B|D$\\ 9 &$A|C$,$B|D$\\ a &$A|D$,$B|D$\\ b &$A|C$,$A|D$,$B|D$\\ c &$B|C$,$B|D$\\ d &$A|C$,$B|C$,$B|D$\\ e &$A|D$,$B|C$,$B|D$\\ 0 or f or - &$A|C$,$A|D$,$B|C$,$B|D$\\ \end{tabular} \end{center} \end{description} Note that \textbf{CP} and \textbf{ABHCD} formats imply user-made genotype inference. Depending on generation, \texttt{spell-marker} will perform further genotype inference and HMM state inference using pedigree. Other allele code can be defined via a JSON file. (see in appendix \ref{spell-marker:marker-observation-format-specification} for format and \ref{spell-marker:example-the-02-abhcd-and-cp-formats} for sample files) \subsubsection{File sample} \lstinputlisting[numbers=left, frame=single, breaklines=false, caption={[Marker alleles (.gen input file)]Marker alleles (example1\_F2.gen from three\_parents\_F2 example)}, %linerange={1-8,35-39} ] {input_files/example1_F2.gen} Note that line 1 \texttt{F2} after \texttt{ data type} is irrelevant \subsection{Genetic map} \subsubsection{File format} One line per linkage group (space separated) : \begin{itemize} \item Starred(*) name for this linkage group \item Number of markers in the linkage group \item Name of first marker \item Series of distance in cM and name of next marker \end{itemize} \subsubsection{File sample} \lstinputlisting[numbers=left,frame=single,breaklines=false,caption={[Genetic Map (.map input file)] Genetic map (example1.map from three\_parents\_F2 example)}]{input_files/example1.map} \subsection{Trait observations} \subsubsection{File format} As in MapMaker RAW format, without header : one line per trait beginning with starred (*) trait name followed by space separated observations (one numerical observation per individual, \texttt{ -} means unobserved). \subsubsection{File sample} \lstinputlisting[numbers=left,frame=single,breaklines=false,caption={[Trait observations (.phen input file)]Trait observations (example1\_F2.phen from three\_parents\_F2 example)}]{input_files/example1_F2.phen}  \chapter{Underlying parental origin probability model}\label{ch:probas} The probabilities in equation \ref{eq:complete_model} are computed on the fly by \texttt{spell-qtl} when necessary. A hidden Markov chain representing the origin of the genetic material is derived for any population in the pedigree by \texttt{spell-pedigree}. Depending on the number of different parents in the design and the number of non-independent meiosis the minimal number of states that allow to enjoy the Markov property can be much more than the number of possible genotypes. Spell-pedigree ensures that this number is the smallest possible using lumping operations as soon as it is possible. At any observed marker, the probabilities of these states are inferred by \texttt{spell-marker}. Note that as soon as non clone population are involved, some information is gathered from siblings in order to infer states probabilities. (See figure \ref{fig:pipeline} for more information about information exchange between parts of the software). Suppose the locus of interest$q$being preceded by$n_l$observed loci and followed by$n_r$observed loci, the loci sequence being$[l_{n_l},...,l_2,l_1,q,r_1,r_2,...,r_{n_r}]$. The parental origin probability vector at locus$q$is then : \begin{multline} \label{eq:pop} p(q|\mathcal{M}_{l_1},...,\mathcal{M}_{l_{n_l}},\mathcal{M}_{r_1},...,\mathcal{M}_{r_{n_r}}) =\\ \frac{ \left(T_{q-l_1}\displaystyle\left(\prod_{i=1}^{n_l-1} M_{l_i}T_{l_{i}-l_{i+1}} \right) M_{l_{n_l}} \varphi \right) \odot \left(T_{r_1-q}\displaystyle\left(\prod_{i=1}^{n_r-1}M_{r_i}T_{r_{i+1}-r_i} \right) M_{r_{n_r}} \varphi \right) \odot \varphi^{-1} } {\left| \left(T_{q-l_1}\displaystyle\left(\prod_{i=1}^{n_l-1} M_{l_i}T_{l_{i}-l_{i+1}} \right) M_{l_{n_l}} \varphi \right) \odot \left(T_{r_1-q}\displaystyle\left(\prod_{i=1}^{n_r-1}M_{r_i}T_{r_{i+1}-r_i} \right) M_{r_{n_r}} \varphi \right) \odot \varphi^{-1} \right| } \end{multline} Where : \begin{itemize} \item$T_d$is the transition matrix of the Markov chain (as a function of genetic distance$d$). It is computed by \texttt{spell-pedigree}; \item$\varphi$is the steady state vector associated with transition matrix$T_d$,$\varphi^{-1}$its component wise inverse ; \item$M_l$is the observation matrix at locus$l$. It is inferred from observations at locus$l$by spell-marker; \end{itemize} Equation \ref{eq:pop} can be applied recursively in order to compute multi-loci joint probabilities. The code in \texttt{spell-qtl} allows these joint probabilities computations when needed.  ... ... @@ -10,7 +10,7 @@ \usepackage{svg} \usepackage{import} %required for svg inkscape import \usepackage{forest} \usepackage{inconsolata} % alternative typewriter font \graphicspath{{images/}} ... ... @@ -64,463 +64,16 @@ is a software suite allowing to detect QTLs in any crossing design. It is developed at MIAT lab by Damien Leroux and Sylvain Jasson. It is available (including the present user manual) at \url{https://mulcyber.toulouse.inra.fr/frs/?group_id=204} under Gnu Public License. \chapter{Detection model} During detection we will consider many linear models and perform many analysis of variance (aka ANOVA). \section{Complete model} The more general form of the models is : \begin{equation} \label{eq:complete_model} \forall i \in \mathbf{N} \ \ \ y_{i}=\mu_{c_i} + \sum_{h \in \mathbf{H}} \sum_{g \in \mathbf{G}(h)} P(i,g,h)\xi_{g,h,c_i} +\sum_{v \in \mathbf{V}} v_i \kappa_{v} +\varepsilon_{i} \end{equation} Where : \begin{itemize} \item$y_i$is the value of the trait for individual$i$; \item$\mathbf{N}$is the set of individuals under study, its cardinality$\left\vert{\mathbf{N}}\right\vert$is the total number of individual in the design under study; \item$c_i$is the cross to which individual$i$belongs (let be$\mathbf{C}$the set of crosses in the design,$\forall i\in \mathbf{N}, \ \ c_i \in \mathbf{C}$, its cardinality$|\mathbf{C}|$is the total number of crosses in the design); \item$\mu_{c_i}$is the estimated mean value of the trait for this cross; \item$\mathbf{H}$is the set of genetic explicative factors (mainly$\mathbf{H}$is a set QTLs, or a set of sets of QTLs when looking for epistasy); \item$\mathbf{G}(h)$is the set of possible genotypes for factor$h$(if$h$is a unit set$h=\set{l}$that is we address a single QTL, then$\mathbf{G}(h)=\mathbf{G}(\set{l})$is the set of possible genotypes at locus$l$); \item$P(i,g,h)$is the probability that individual$i$has genotype$g$on$h$; \item$\xi_{g,h,c_i}$is the phenotypic value of genotype$g$on$h$in genetic background$c_i$(see section \ref{sec:phen_value} for more information about the hypothesis we can make about$\xi$) ; \item$v$is a user-defined explicative variable; \item$\mathbf{V}$is the set of user-defined explicative variables (this set can be empty); \item$v_i$is the value of variable$v$for individual$i$; \item$\kappa_{v}$is the effect of variable$v$; \item$\varepsilon_i $is the modeling error for individual$i$. \end{itemize} Lower case letter values$y_i$,$c_i$and$v_i$are user provided, upper case letter value$P(i,g,h)$is computed from markers score and genetic map (see chapter \ref{ch:probas}), bold upper case letters$\mathbf{N}$,$\mathbf{C}$,$\mathbf{H}$,$\mathbf{G}$and$\mathbf{V}$are sets, Greek letters values$\mu_{c}$,$\xi_{g,h,c}$,$\kappa_{v}$and$\varepsilon_i$are computed by the linear model engine. The sets$\mathbf{N}$,$\mathbf{C}$and$\mathbf{V}$are user defined, our main goal is to find the best$\mathbf{H}$($\mathbf{G}$is$\mathbf{H}$-dependent). \section{Fitting and testing models} The lineal model engine minimizes the residual sum of squares$RSS=\sum_{i \in \mathbf{N}} \varepsilon_i^2$under constraints \ref{eq:complete_model} for any tested set$\mathbf{H}$. Note that adding more elements to the tested set$\mathbf{H}$will automatically reduce$RSS$since it allows minimization over a wider space. Therefore the reduction of$RSS$cannot be our rule for model selection. We must switch the question from "Does the new model display a smaller$RSS$?" to "Is the$RSS$reduction of the new model worth the degrees of freedom it costs ?". In order to compare two nested models we will then use F-test, that is a measurement of the this compromise between goodness of fit and degrees of freedom cost. (To be fully honest, when we compare to models for pleiotropic effects the residual errors$\varepsilon_i$won't be a vector but a matrix, its squared sum over individuals won't be number but a vector, and therefore we must use some kind of log-likelihood ratio, and a$\chi^2$test with appropriate degree of freedom will do the job.) Lets now suppose that we have current model, and we wander if some new extended model is better. Let$df_{current}$(resp.$df_{new}$) be the number of degree of freedom for the current model (resp. the new model). This number of degree of freedom is the number of independent parameters in the fitted model, that is the number of independent values the linear model engine has to choose for$\mu_{c}$,$\xi_{g,h,c}$and$\kappa_{v}$. The test value in then: \begin{equation} \label{eq:F-test} F(\mathbf{H}_{new},\mathbf{H}_{current})=\frac{ \frac{ RSS_{current} - RSS_{new} }{ df_{new}-df_{current} } }{ \frac{RSS_{new}}{\left\vert{\mathbf{N}}\right\vert-df_{new}} } \end{equation} The higher the value$F(\mathbf{H}_{new},\mathbf{H}_{current})$is, the better is the new model. The typical use of such a score is to address the question of adding a new locus in the model. Every locus along the genome may then be tested. ( If a tested locus is already in$\mathbf{H}$,$RSS_{current} = RSS_{new}$its score is$0$). Moreover the amount of degrees of freedom involved in \ref{eq:F-test} can change along the genome, with lack of independence of loci or with similarity of parents. It is then wiser to use the probability that a Fischer distribution ($\mathcal{F}$) with these degrees of freedom is greater than the actual$F$value. Our score is then \begin{equation} score(\mathbf{H}_{new},\mathbf{H}_{current})=-log_{10}P[\mathcal{F}(df_{new}-df_{current},|\mathbf{N}|-df_{new})>F(\mathbf{H}_{new},\mathbf{H}_{current})] \end{equation} (We prefer positive and not so great values, so the$-log_{10}$provides a monotonic and aesthetic transformation that makes this score prettier). Unfortunately we have no analytic expression of the distribution of its maximum along a genome under the null hypothesis, and can not establish a critical value for its significativity at a chosen risk level. So we'll use a resampling method by permutation to access an empiric value of its quantile. This permutations are performed on the fly by the software and their result is saved for potential reuse. Don't be surprised if the very same analysis takes longer time at first attempt. \section{Phenotypic values}\label{sec:phen_value} As you can see in formula \ref{eq:F-test} the difference of degrees of freedom between the two models has an important effect on the test value. Therefore it is a good idea not to waste them. Some hypothesis on the effects allow phenotypic values simplification and degrees of freedom reduction. We'll review some possible hypothesis. Lets suppose that in our study population a set$\mathbf{P}$of alleles are possible. Typically$\mathbf{P}=\set{P1,P2}$,$|\mathbf{P}|=2$for a simple cross between two parents,$\mathbf{P}=\set{A,B,C,D,E,F,G,H}$,$|\mathbf{P}|=8$in MAGIC. \subsection{Single locus effects} Remember that$\mathbf{H}$is a set of explicative factors$h$. Each explicative factor is a set of one or more loci. In the simplest models, each QTL has its own effect without any interaction. That is: every explicative factor contains exactly one locus. Every$h=\set{l}$is a unit set, that is$|h|=1$. Possible genotypes are then allele pairs$g=(g_1,g_2) \in \mathbf{P}^2$. \subsubsection{Additive and connected model} The simplest possible model is the additive and connected one. Each allele has its own effect without any dominance, nor interaction with genetic background. \begin{equation} \xi_{g,h,c_i} = \xi_{(g_1,g_2),\set{l},c_i} = \alpha_{g_1,l}+\alpha_{g_2,l} \end{equation} Most of the time, the degree of freedom used by adding a new QTL to such a model is the number of distinct possible allele minus one$df_{new}-df_{current}=p-1\$. \subsubsection{Dominance and connected model} This model is much like the previous one. The only difference is that interaction between allele are possible. \begin{equation}