\item$\mathbf{H}$ is the set of genetic explicative factors (mainly $\mathbf{H}$ is a set QTLs, or a set of sets of QTLs when looking for epistasy);

\item$\mathbf{G}(h)$ is the set of possible genotypes for factor $h$ (if $h$ is a unit set $h=\set{l}$ that is we address a single QTL, then $\mathbf{G}(h)=\mathbf{G}(\set{l})$ is the set of possible genotypes at locus $l$);

\item$P(i,g,h)$ is the probability that individual $i$ has genotype $g$ on $h$;

\item$\xi_{g,h,c_i}$ is the phenotypic value of genotype $g$ on $h$ in genetic background $c_i$ (see section \ref{sec:phen_value} for more information about the hypothesis we can make about $\xi$) ;

\item$\xi_{g,h,c_i}$ is the phenotypic value of genotype $g$ on $h$ in genetic background $c_i$ (see section \vref{sec:phen_value} for more information about the hypothesis we can make about $\xi$) ;

\item$v$ is a user-defined explicative variable;

\item$\mathbf{V}$ is the set of user-defined explicative variables (this set can be empty);

\item$v_i$ is the value of variable $v$ for individual $i$;

...

...

@@ -27,12 +27,12 @@ Where :

\item$\varepsilon_i $ is the modeling error for individual $i$.

\end{itemize}

Lower case letter values $y_i$, $c_i$ and $v_i$ are user provided, upper case letter value $P(i,g,h)$ is computed from markers score and genetic map (see chapter \ref{ch:probas}), bold upper case letters $\mathbf{N}$, $\mathbf{C}$, $\mathbf{H}$, $\mathbf{G}$ and $\mathbf{V}$ are sets, Greek letters values $\mu_{c}$, $\xi_{g,h,c}$, $\kappa_{v}$ and $\varepsilon_i$ are computed by the linear model engine.

Lower case letter values $y_i$, $c_i$ and $v_i$ are user provided, upper case letter value $P(i,g,h)$ is computed from markers score and genetic map (see chapter \vref{ch:probas}), bold upper case letters $\mathbf{N}$, $\mathbf{C}$, $\mathbf{H}$, $\mathbf{G}$ and $\mathbf{V}$ are sets, Greek letters values $\mu_{c}$, $\xi_{g,h,c}$, $\kappa_{v}$ and $\varepsilon_i$ are computed by the linear model engine.

The sets $\mathbf{N}$, $\mathbf{C}$ and $\mathbf{V}$ are user defined, our main goal is to find the best $\mathbf{H}$ ($\mathbf{G}$ is $\mathbf{H}$-dependent).

\section{Fitting and testing models}

The lineal model engine minimizes the residual sum of squares $RSS=\sum_{i \in\mathbf{N}}\varepsilon_i^2$ under constraints \ref{eq:complete_model} for any tested set $\mathbf{H}$. Note that adding more elements to the tested set $\mathbf{H}$ will automatically reduce $RSS$ since it allows minimization over a wider space.

The lineal model engine minimizes the residual sum of squares $RSS=\sum_{i \in\mathbf{N}}\varepsilon_i^2$ under constraints \vref{eq:complete_model} for any tested set $\mathbf{H}$. Note that adding more elements to the tested set $\mathbf{H}$ will automatically reduce $RSS$ since it allows minimization over a wider space.

Therefore the reduction of $RSS$ cannot be our rule for model selection. We must switch the question from "Does the new model display a smaller $RSS$?" to "Is the $RSS$ reduction of the new model worth the degrees of freedom it costs ?".

The higher the value $F(\mathbf{H}_{new},\mathbf{H}_{current})$ is, the better is the new model.

The typical use of such a score is to address the question of adding a new locus in the model. Every locus along the genome may then be tested. ( If a tested locus is already in $\mathbf{H}$, $RSS_{current}= RSS_{new}$ its score is $0$).

Moreover the amount of degrees of freedom involved in \ref{eq:F-test} can change along the genome, with lack of independence of loci or with similarity of parents. It is then wiser to use the probability that a Fischer distribution ($\mathcal{F}$) with these degrees of freedom is greater than the actual $F$ value. Our score is then

Moreover the amount of degrees of freedom involved in \vref{eq:F-test} can change along the genome, with lack of independence of loci or with similarity of parents. It is then wiser to use the probability that a Fischer distribution ($\mathcal{F}$) with these degrees of freedom is greater than the actual $F$ value. Our score is then

\end{equation} (We prefer positive and not so great values, so the $-log_{10}$ provides a monotonic and aesthetic transformation that makes this score prettier).

\end{equation} (We pvrefer positive and not so great values, so the $-log_{10}$ provides a monotonic and aesthetic transformation that makes this score prettier).

Unfortunately we have no analytic expression of the distribution of its maximum along a genome under the null hypothesis, and can not establish a critical value for its significativity at a chosen risk level.

So we'll use a resampling method by permutation to access an empiric value of its quantile. This permutations are performed on the fly by the software and their result is saved for potential reuse. Don't be surprised if the very same analysis takes longer time at first attempt.

\section{Phenotypic values}\label{sec:phen_value}

As you can see in formula \ref{eq:F-test} the difference of degrees of freedom between the two models has an important effect on the test value. Therefore it is a good idea not to waste them.

As you can see in formula \vref{eq:F-test} the difference of degrees of freedom between the two models has an important effect on the test value. Therefore it is a good idea not to waste them.

Some hypothesis on the effects allow phenotypic values simplification and degrees of freedom reduction. We'll review some possible hypothesis. Lets suppose that in our study population a set $\mathbf{P}$ of alleles are possible. Typically $\mathbf{P}=\set{P1,P2}$, $|\mathbf{P}|=2$ for a simple cross between two parents, $\mathbf{P}=\set{A,B,C,D,E,F,G,H}$, $|\mathbf{P}|=8$ in MAGIC.

...

...

@@ -124,7 +124,7 @@ The number of degree of freedom for each added QTL in the worst case is : $df_{n

Addressing epistasis limited to two loci interactions means that the explicative factors in $\mathbf{H}$ are either unit sets or two elements sets. That is $\forall h \in\mathbf{H}, |h|\in\set{1,2}$. The phenotypic values for every single locus explicative factor $h$ are the same as in previous section.

We'll focus in this section on phenotypic values when addressing exactly 2 QTLs: $h=\set{l^1,l^2}$. Possible genotypes are then 4-uple of allele : genotype at locus $l^1$ is $(g^1_1,g^1_2)$, genotype at locus $l^2$ is $(g^2_1,g^2_2)$, overall genotype is $g=(g^1_1,g^1_2,g^2_1,g^2_2)$.

Note that the correct use of formula \ref{eq:complete_model} requires that $P(i,g,h)$ is the joint probablity of $g$ on $h$. If loci $l^1$ and $l^2$ are far enough from each other (or even better are on different linkage groups) this probability can be computed as a product of single locus probabilities. But in general case a true joint probability must be computed (especially in order to avoid ghost QTLs). Spell-qtl can compute such a probability (see chapter \ref{ch:probas})

Note that the correct use of formula \vref{eq:complete_model} requires that $P(i,g,h)$ is the joint probablity of $g$ on $h$. If loci $l^1$ and $l^2$ are far enough from each other (or even better are on different linkage groups) this probability can be computed as a product of single locus probabilities. But in general case a true joint probability must be computed (especially in order to avoid ghost QTLs). Spell-qtl can compute such a probability (see chapter \vref{ch:probas})

The joint two locus phenotypic effect can be expressed as the sum of the direct terms (the phenotypic effect of each locus) and an interaction term.

...

...

@@ -148,7 +148,7 @@ In this first epistatsis model, the single QTL direct term follows the additive

If we address only direct additive effects of a unique QTL in a unique biparental population, the phenotypic values in

the equation \ref{eq:complete_model} can be simplified :

the equation \vref{eq:complete_model} can be simplified :

Note that you must use \texttt{-01} command line option in \texttt{spell-marker} in order to generate these files.

Note that you must use \texttt{-01} command line option in \texttt{spell-marker} in order to generate these files (see \vref{spell-marker:output-modes}.)

Note that you must use \texttt{output-nppop} processing option in \texttt{spell-qtl} in order to generate these files (see \vref{spell-qtl:processing-options}.)

\subsection{full map}

The special text file named \texttt{full\_map.txt} is produced at the root of \texttt{my name.report} directory. The full genetic map is drawn using text characters. Any detected QTL is inserted in this map with its confidence interval (figure \label{fig:full-map}).

\caption{screen capture of \texttt{full\_map.txt} in a terminal. The chromosome and markers are white. The found QTL and its confidence interval are displayed in green.}

\label{fig:full-map}

\end{figure}

Note that you must uses \texttt{cat} ou \texttt{less -RS} command in ordrer to see it properly, \texttt{less} ou your favorite text editor may fail to read special characters.

\subsection{Trait by trait reports}

For every trait on analysis, a report directory is generated (this directory name is the trait name). Whithin this directoty the report file itself is named after the trait name followed by \texttt{\_report.txt}

This file is divided in several parts. For the sample file \vref{sample:report} they are :

\begin{description}

\item[General information (lines 1-7)] Trait name and what was detected (QTLs positions and confidence intervals)

\item[R2 (lines 12-20)] Part of the variance explained by each QTL

\item[Coefficients (lines 22-54)] Cross and QTLs allele effects are displayed with their estimated vairiance-covariance matrix

\item[Contrasts (lines 57-79 and 81-103)] Any tractable contrast is displayed and its significance tested. \texttt{spell-qtl} displays a contrasts section for every comparable effects group.

\item[Final Model (lines 105-351)] The final linear model for this detection is then displayed in human readable form. (It is available in computer form in files \texttt{trait\_values.txt} and \texttt{Model\_*X.txt})

\caption{The main Spell-QTL pipeline}\label{fig:pipeline}

\end{figure}

Global software organization is displayed in figure \ref{fig:pipeline}. More detailed informations about the main purpose of each part and then about the required input files will be provided in this chapter.

Global software organization is displayed in figure \vref{fig:pipeline}. More detailed informations about the main purpose of each part and then about the required input files will be provided in this chapter.

\section{Software suite details}

\subsection{\texttt{spell-pedigree}}

\begin{itemize}

\item Computes the transition matrices for the Continuous Time Hidden Markov Models (CTHMM). They are the $T_d$ matrices in formula \ref{eq:pop}.

\item Computes the transition matrices for the Continuous Time Hidden Markov Models (CTHMM). They are the $T_d$ matrices in formula \vref{eq:pop}. The number of hidden states is of course the order od the matrix.

\item These computations are inherently dependent, so it can only run sequentially.

\item Outputs a data file that can be fed to \texttt{spell-marker}.

\end{itemize}

...

...

@@ -42,25 +42,31 @@ Global software organization is displayed in figure \ref{fig:pipeline}. More det

\section{Input files}

\subsection{Pedigree}

\subsubsection{File format}

See \texttt{spell-pedigree} man page (at appendix \ref{ch:spell:predigree})

See \texttt{spell-pedigree} man page (at \vref{spell-pedigree:description}.)

\subsubsection{File sample}

\lstinputlisting[numbers=left,

frame=single,

breaklines=false,

caption={[Pedigree (.ped input file)]Pedigree (selected lines from example1.ped from three\_parents\_F2 example)},

%label=file:pedigree,

linerange={1-12,107-112}

]

{input_files/example1.ped}

Note that \begin{itemize}

\item the first line is expected to be header only and will be ignored by \texttt{spell-pedigree}.

\item Only four columns are used, any additional column will be silently ignored by \texttt{spell-pedigree}

\end{itemize}

\subsection{Marker observations}

\subsubsection{File format}

\texttt{spell-marker} understand a few common formats, based on MapMaker RAW format (without traits) :

\begin{itemize}

\item A line beginning with \texttt{data type} followed by ignored text

\item A line containing four integer values : number of markers, number of individuals, two ignored values

\item A line per marker beginning with starred(*) marker name followed by a space and by allele observed or inferred for each individual (a character per individual).

\item A line beginning with \texttt{data type} followed by ignored text (\textit{e.g.} line 1 in sample \vref{file:gen})

\item A line containing four integer values : number of individuals, number of markers, two ignored values (\textit{e.g.} line 2 in sample \vref{file:gen})

\item A line per marker beginning with starred(*) marker name followed by a space and by allele observed or inferred for each individual (a character per individual). (\textit{e.g.} line 3-39 in sample \vref{file:gen})

\end{itemize}

Build in allele code are :

...

...

@@ -103,18 +109,23 @@ e & $A|D$,$B|C$,$B|D$ \\

Note that \textbf{CP} and \textbf{ABHCD} formats imply user-made genotype inference. Depending on generation, \texttt{spell-marker} will perform further genotype inference and HMM state inference using pedigree.

Other allele code can be defined via a JSON file. (see in appendix \ref{spell-marker:marker-observation-format-specification} for format and \ref{spell-marker:example-the-02-abhcd-and-cp-formats} for sample files)

Other allele code can be defined via a JSON file. (see in appendix \vref{spell-marker:marker-observation-format-specification} for format and \vref{spell-marker:example-the-02-abhcd-and-cp-formats} for sample files)

\subsubsection{File sample}

\lstinputlisting[numbers=left,

frame=single,

breaklines=false,

caption={[Marker alleles (.gen input file)]Marker alleles (example1\_F2.gen from three\_parents\_F2 example)},

label=file:gen

%linerange={1-8,35-39}

]

{input_files/example1_F2.gen}

Note that line 1 \texttt{F2} after \texttt{ data type} is irrelevant

Note that \begin{itemize}

\item in line 1 \texttt{F2} after \texttt{ data type} is irrelevant for \texttt{spell-marker}.

\item in line 2 \texttt{0 0} after \texttt{100 37} is irrelevant \texttt{spell-marker}.

\end{itemize}

\subsection{Genetic map}

\subsubsection{File format}

...

...

@@ -127,7 +138,12 @@ One line per linkage group (space separated) :

\chapter{Underlying parental origin probability model}\label{ch:probas}

The probabilities in equation \ref{eq:complete_model} are computed on the fly by \texttt{spell-qtl} when necessary. A hidden Markov chain representing the origin of the genetic material is derived for any population in the pedigree by \texttt{spell-pedigree}. Depending on the number of different parents in the design and the number of non-independent meiosis the minimal number of states that allow to enjoy the Markov property can be much more than the number of possible genotypes. Spell-pedigree ensures that this number is the smallest possible using lumping operations as soon as it is possible.

The probabilities in equation \vref{eq:complete_model} are computed on the fly by \texttt{spell-qtl} when necessary. A hidden Markov chain representing the origin of the genetic material is derived for any population in the pedigree by \texttt{spell-pedigree}. Depending on the number of different parents in the design and the number of non-independent meiosis the minimal number of states that allow to enjoy the Markov property can be much more than the number of possible genotypes. Spell-pedigree ensures that this number is the smallest possible using lumping operations as soon as it is possible.

At any observed marker, the probabilities of these states are inferred by \texttt{spell-marker}. Note that as soon as non clone population are involved, some information is gathered from siblings in order to infer states probabilities. (See figure \ref{fig:pipeline} for more information about information exchange between parts of the software).

At any observed marker, the probabilities of these states are inferred by \texttt{spell-marker}. Note that as soon as non clone population are involved, some information is gathered from siblings in order to infer states probabilities. (See figure \vref{fig:pipeline} for more information about information exchange between parts of the software).

Suppose the locus of interest $q$ being preceded by $n_l$ observed loci and followed by $n_r$ observed loci, the loci sequence being

$[l_{n_l},...,l_2,l_1,q,r_1,r_2,...,r_{n_r}]$. The parental origin probability vector at locus $q$ is then :

...

...

@@ -20,6 +20,6 @@ Where :

\item$M_l$ is the observation matrix at locus $l$. It is inferred from observations at locus $l$ by spell-marker;

\end{itemize}

Equation \ref{eq:pop} can be applied recursively in order to compute multi-loci joint probabilities. The code in \texttt{spell-qtl} allows these joint probabilities computations when needed.

Equation \vref{eq:pop} can be applied recursively in order to compute multi-loci joint probabilities. The code in \texttt{spell-qtl} allows these joint probabilities computations when needed.

\usepackage{import}%required for svg inkscape import

\usepackage{forest}

\usepackage{inconsolata}% alternative typewriter font

\usepackage{hyperref}

\graphicspath{{images/}}

...

...

@@ -19,7 +23,7 @@

\lstset{

breakatwhitespace=false,

breaklines=true,

breaklines=false,

basicstyle=\ttfamily

}

...

...

@@ -65,6 +69,8 @@ is a software suite allowing to detect QTLs in any crossing design.

It is developed at MIAT lab by Damien Leroux and Sylvain Jasson. It is available (including the present user manual) at \url{https://mulcyber.toulouse.inra.fr/frs/?group_id=204} under Gnu Public License.