\chapter{The main Spell-QTL pipeline}

\section{General view}

\begin{figure}[h] 
  \centering
  \includesvg[width=\columnwidth]{images/Spell-pipeline2}
  \caption{The main Spell-QTL pipeline}\label{fig:pipeline}
\end{figure}

The overall organization of the software is shown in figure \vref{fig:pipeline}. This chapter first describes the main purpose of each part, then the required input files.


\section{Software suite details}
\subsection{\texttt{spell-pedigree}}
\begin{itemize}
	\item Computes the transition matrices for the Continuous Time Hidden Markov Models (CTHMM). They are the $T_d$ matrices in formula \vref{eq:pop}. The number of hidden states is the order of the matrix.
        \item These computations depend on one another, so \texttt{spell-pedigree} can only run sequentially.
        \item Outputs a data file that can be fed to \texttt{spell-marker}.
\end{itemize}
\subsection{\texttt{spell-marker}}
\begin{itemize}
	\item Computes the 1-point Parental Origin Probabilities by Bayesian inference for all markers.
        \item Markers are independent of each other, so the computation can run in several ways:
                \begin{itemize}
                  \item sequentially,
                  \item multithreaded,
                  \item by scheduling jobs on {\em Sun Grid Engine},
                  \item by sending jobs to remote machines via \texttt{ssh}.
                \end{itemize}
        \item Outputs a data file that can be fed to \texttt{spell-qtl}.
        \item Can also output the raw Parental Origin Probabilities.
\end{itemize}
\subsection{\texttt{spell-qtl}}
\begin{itemize}
	\item Performs the QTL analysis {\em per se}.
        \item Can also output the n-point Parental Origin Probabilities along the linkage groups.
        \item Can run most computations concurrently on a multicore computer.
        \item Computation results are cached on disk (and/or in RAM).
\end{itemize}

\section{Input files}
\subsection{Pedigree}
\subsubsection{File format}
See the \texttt{spell-pedigree} man page (\vref{spell-pedigree:description}).
\subsubsection{File sample}
\lstinputlisting[numbers=left,
		frame=single,
		breaklines=false,
		caption={[Pedigree (.ped input file)]Pedigree (selected lines from example1.ped from three\_parents\_F2 example)},
		%label=file:pedigree,
		linerange={1-12,107-112}
		]
		{input_files/example1.ped}

Note that:
\begin{itemize}
\item The first line is expected to be a header only and is ignored by \texttt{spell-pedigree}.
\item Only four columns are used; any additional column is silently ignored by \texttt{spell-pedigree}.
\end{itemize}


\subsection{Marker observations}

\subsubsection{File format}
\texttt{spell-marker} understands a few common formats based on the MapMaker RAW format (without traits):
\begin{itemize}
\item A line beginning with \texttt{data type}, followed by text that is ignored (\textit{e.g.} line 1 in sample \vref{file:gen}).
\item A line containing four integer values: the number of individuals, the number of markers, and two ignored values (\textit{e.g.} line 2 in sample \vref{file:gen}).
\item One line per marker, beginning with the marker name prefixed with a star (\texttt{*}), followed by a space and the allele observed or inferred for each individual, one character per individual (\textit{e.g.} lines 3--39 in sample \vref{file:gen}).
\end{itemize}
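
As an illustration of this layout, a made-up file for 3 individuals and 2 markers could look as follows. The marker names \texttt{M1} and \texttt{M2}, the text after \texttt{data type} and the observations themselves are hypothetical; they are not taken from the example data sets.

\begin{lstlisting}[frame=single]
data type f2 intercross
3 2 0 0
*M1 AHB
*M2 A-H
\end{lstlisting}

In this sketch the second line announces 3 individuals and 2 markers, and each marker line carries exactly one character per individual.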

The built-in allele codes are:
\begin{description}
\item[02] SNP observations, where 0 and 2 are homozygous and 1 is heterozygous. This observation type is relevant for any individual in the pedigree, including parents. \texttt{spell-marker} will then infer the possible genotypes and the possible states in the CTHMM.
\item[ABHCD] MapMaker-like inferred Parental Origin observations. These are relevant for the products of crosses between inbred lines. Consider the cross $A|A \times B|B$:
\begin{itemize}
\item The child is typed A and the allele A is not dominant. The only possible genotype is $A|A$. This is encoded by the character \texttt{A} in MapMaker.
\item The child is typed A and the allele A is dominant. The possible genotypes are $A|A$, $A|B$ and $B|A$. This is encoded by the character \texttt{D} in MapMaker.
\item The child is typed B and the allele B is not dominant. The only possible genotype is $B|B$. This is encoded by the character \texttt{B} in MapMaker.
\item The child is typed B and the allele B is dominant. The possible genotypes are $A|B$, $B|A$ and $B|B$. This is encoded by the character \texttt{C} in MapMaker.
\item The child is typed AB (the alleles A and B are codominant). The possible genotypes are $A|B$ and $B|A$. This is encoded by the character \texttt{H} in MapMaker.
\item The child is not typed. The possible genotypes are $A|A$, $A|B$, $B|A$ and $B|B$. This is encoded by the character \texttt{-} in MapMaker.
\end{itemize}
The parental origin letters can be overridden on the command line.
\item[CP] Outbred observations as defined in Carthagene. These observations are relevant for all known-phase situations, including cases where one parent is homozygous and cases where 3 or 4 different alleles are present. Consider the cross $A|B \times C|D$: the possible child genotypes are $A|C$, $A|D$, $B|C$ and $B|D$. The Carthagene format lets the user express any subset of these four possibilities with a single hexadecimal digit (0-f).

\begin{center}
\begin{tabular}{cc}
Code & Possible genotypes \\
\hline
1    & $A|C$ \\
2    & $A|D$     \\
3    & $A|C$,$A|D$      \\
4    & $B|C$      \\
5    & $A|C$,$B|C$      \\
6    & $A|D$,$B|C$      \\
7    & $A|C$,$A|D$,$B|C$      \\
8    & $B|D$      \\
9    & $A|C$,$B|D$      \\
a    & $A|D$,$B|D$      \\
b    & $A|C$,$A|D$,$B|D$      \\
c    & $B|C$,$B|D$     \\
d    & $A|C$,$B|C$,$B|D$       \\
e    & $A|D$,$B|C$,$B|D$  \\
0 or f or -    & $A|C$,$A|D$,$B|C$,$B|D$      \\
\end{tabular}
\end{center}
\end{description}
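
The \textbf{CP} table can be read as a 4-bit mask, one bit per possible genotype. This reading is inferred from the table itself and is offered only as a memory aid; it is not a statement about how Carthagene or \texttt{spell-marker} decode the digit internally. Writing $\delta_{g}=1$ when genotype $g$ is possible and $\delta_{g}=0$ otherwise (a notation introduced here for illustration):
\[
\mathrm{code} = 1\,\delta_{A|C} + 2\,\delta_{A|D} + 4\,\delta_{B|C} + 8\,\delta_{B|D}
\]
For instance, the subset $\{A|D, B|C, B|D\}$ gives $2 + 4 + 8 = 14$, which is the hexadecimal digit \texttt{e}, as in the table. The full set gives $15$, that is \texttt{f}; the table lists \texttt{0} and \texttt{-} as synonyms for it.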

Note that the \textbf{CP} and \textbf{ABHCD} formats imply that the user has already performed some genotype inference. Depending on the generation, \texttt{spell-marker} will perform further genotype inference and HMM state inference using the pedigree.

Other allele codes can be defined via a JSON file (see appendix \vref{spell-marker:marker-observation-format-specification} for the format and \vref{spell-marker:example-the-02-abhcd-and-cp-formats} for sample files).

\subsubsection{File sample}
\lstinputlisting[numbers=left,
		frame=single,
		breaklines=false,
		caption={[Marker alleles (.gen input file)]Marker alleles  (example1\_F2.gen from three\_parents\_F2 example)},
		label=file:gen
		%linerange={1-8,35-39}
		]
		{input_files/example1_F2.gen}

Note that:
\begin{itemize}
\item In line 1, \texttt{F2} after \texttt{data type} is irrelevant for \texttt{spell-marker}.
\item In line 2, \texttt{0 0} after \texttt{100 37} is irrelevant for \texttt{spell-marker}.
\end{itemize}


\subsection{Genetic map}
\subsubsection{File format}
One line per linkage group, with space-separated fields:
\begin{itemize}
\item The name of the linkage group, prefixed with a star (\texttt{*})
\item The number of markers in the linkage group
\item The name of the first marker
\item A series of pairs, each made of a distance in cM and the name of the next marker
\end{itemize}
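
As an illustration, a made-up linkage group with three markers could be written as below. The names \texttt{LG1} and \texttt{M1} to \texttt{M3} and the distances are hypothetical. The reading given after the listing assumes that each distance separates the two markers written on either side of it.

\begin{lstlisting}[frame=single]
*LG1 3 M1 12.5 M2 7.3 M3
\end{lstlisting}

Under that assumption, \texttt{M2} lies 12.5~cM after \texttt{M1}, and \texttt{M3} lies 7.3~cM after \texttt{M2}.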
               
\subsubsection{File sample}
\lstinputlisting[numbers=left,
		frame=single,
		breaklines=false,
		caption={[Genetic Map (.map input file)] Genetic map (example1.map from three\_parents\_F2 example)},
		label=file:map]
		{input_files/example1.map}

\subsection{Trait observations}
\subsubsection{File format}
As in the MapMaker RAW format, without header: one line per trait, beginning with the trait name prefixed with a star (\texttt{*}), followed by space-separated observations (one numerical observation per individual; \texttt{-} means unobserved).
 
\subsubsection{File sample}
\lstinputlisting[numbers=left,frame=single,breaklines=false,caption={[Trait observations (.phen input file)]Trait observations (example1\_F2.phen from three\_parents\_F2 example)}]{input_files/example1_F2.phen}