\chapter{The main Spell-QTL pipeline}

\section{General view}

\begin{figure}[h] 
  \centering
  \includesvg[width=\columnwidth]{images/Spell-pipeline2}
  \caption{The main Spell-QTL pipeline}\label{fig:pipeline}
\end{figure}

The overall organization of the software is shown in figure \ref{fig:pipeline}. This chapter first describes the purpose of each component, then the required input files.


\section{Software suite details}
\subsection{\texttt{spell-pedigree}}
\begin{itemize}
	\item Computes the transition matrices for the Continuous Time Hidden Markov Models (CTHMM). They are the $T_d$ matrices in formula \ref{eq:pop}. 
        \item These computations depend on one another, so \texttt{spell-pedigree} can only run sequentially.
        \item Outputs a data file that can be fed to \texttt{spell-marker}.
\end{itemize}
\subsection{\texttt{spell-marker}}
\begin{itemize}
	\item Computes the 1-point Parental Origin Probabilities by Bayesian inference for all markers.
        \item Markers are independent of one another, so it can run in several ways:
                \begin{itemize}
                  \item Sequentially,
                  \item Multithreaded,
                  \item Scheduling jobs on {\em Sun Grid Engine},
                  \item Sending jobs to remote machines via \texttt{ssh}.
                \end{itemize}
        \item Outputs a data file that can be fed to \texttt{spell-qtl}.
        \item Can also output the raw Parental Origin Probabilities.
\end{itemize}
\subsection{\texttt{spell-qtl}}
\begin{itemize}
	\item Performs the QTL analysis {\em per se}.
        \item Can also output the n-point Parental Origin Probabilities along the linkage groups.
        \item Can run most computations concurrently on a multicore computer.
        \item Computation results are cached on disk (and/or in RAM).
\end{itemize}

\section{Input files}
\subsection{Pedigree}
\subsubsection{File format}
See the \texttt{spell-pedigree} man page (appendix \ref{ch:spell:predigree}).
\subsubsection{File sample}
\lstinputlisting[numbers=left,
		frame=single,
		breaklines=false,
		caption={[Pedigree (.ped input file)]Pedigree (selected lines from example1.ped from three\_parents\_F2 example)},
		linerange={1-12,107-112}
		]
		{input_files/example1.ped}


\subsection{Marker observations}

\subsubsection{File format}
\texttt{spell-marker} understands a few common formats, all based on the MapMaker RAW format (without traits):
\begin{itemize}
\item A line beginning with \texttt{data type}, followed by text that is ignored.
\item A line containing four integer values: the number of markers, the number of individuals, and two ignored values.
\item One line per marker, beginning with the marker name prefixed with a star (\texttt{*}), followed by a space and by the allele observed or inferred for each individual (one character per individual).
\end{itemize}
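
As a minimal sketch of this layout, the fragment below describes three markers observed on five individuals, following the field order stated above and using the ABHCD allele codes described below. The marker names, the text after \texttt{data type} and all observations are hypothetical; they only illustrate the structure and are not taken from the example files.

\begin{lstlisting}[frame=single,breaklines=false]
data type f2 intercross
3 5 0 0
*M1 ABHHA
*M2 AHH-B
*M3 DBHCA
\end{lstlisting}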

The built-in allele codes are:
\begin{description}
\item[02] SNP observations, where 0 and 2 are homozygous and 1 is heterozygous. This observation type is relevant for any individual in the pedigree, including parents. \texttt{spell-marker} will then infer the possible genotypes and the possible states in the CTHMM.
\item[ABHCD] MapMaker-like observations with inferred parental origin. These are relevant for the progeny of crosses between inbred lines. Consider the cross $A|A \times B|B$:
\begin{itemize}
\item The child is typed A and the allele A is not dominant. The only possible genotype is $A|A$. This is encoded by the character \texttt{A} in MapMaker.
\item The child is typed A and the allele A is dominant. The possible genotypes are $A|A$, $A|B$ and $B|A$. This is encoded by the character \texttt{D} in MapMaker.
\item The child is typed B and the allele B is not dominant. The only possible genotype is $B|B$. This is encoded by the character \texttt{B} in MapMaker.
\item The child is typed B and the allele B is dominant. The possible genotypes are $A|B$, $B|A$ and $B|B$. This is encoded by the character \texttt{C} in MapMaker.
\item The child is typed AB (the alleles A and B are codominant). The possible genotypes are $A|B$ and $B|A$. This is encoded by the character \texttt{H} in MapMaker.
\item The child is not typed. The possible genotypes are $A|A$, $A|B$, $B|A$ and $B|B$. This is encoded by the character \texttt{-} in MapMaker.
\end{itemize}
The parental origin letters can be overridden on the command line.
\item[CP] Outbred observations as defined in Carthagene. These observations are relevant for all situations in which the phases are known, including cases where one parent is homozygous and cases where 3 or 4 different alleles are present. Consider the cross $A|B \times C|D$: the possible child genotypes are $A|C$, $A|D$, $B|C$ and $B|D$. The Carthagene format enables the user to express any subset of these 4 possibilities using a single hexadecimal digit (0-f).

\begin{center}
\begin{tabular}{cc}
Code & Possible genotypes \\
\hline
1    & $A|C$ \\
2    & $A|D$     \\
3    & $A|C$,$A|D$      \\
4    & $B|C$      \\
5    & $A|C$,$B|C$      \\
6    & $A|D$,$B|C$      \\
7    & $A|C$,$A|D$,$B|C$      \\
8    & $B|D$      \\
9    & $A|C$,$B|D$      \\
a    & $A|D$,$B|D$      \\
b    & $A|C$,$A|D$,$B|D$      \\
c    & $B|C$,$B|D$     \\
d    & $A|C$,$B|C$,$B|D$       \\
e    & $A|D$,$B|C$,$B|D$  \\
0 or f or -    & $A|C$,$A|D$,$B|C$,$B|D$      \\
\end{tabular}
\end{center}
\end{description}
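
The CP table above is consistent with reading each code as a 4-bit mask, with one bit per possible genotype. This reading is inferred from the table itself rather than quoted from the Carthagene documentation, and the weights $w$ are notation introduced here for illustration only:
\[
w(A|C) = 1, \quad w(A|D) = 2, \quad w(B|C) = 4, \quad w(B|D) = 8, \qquad \mathrm{code}(G) = \sum_{g \in G} w(g).
\]
For instance, $G = \{A|C, A|D, B|D\}$ gives $1 + 2 + 8 = 11$, that is the hexadecimal digit \texttt{b}, and $G = \{B|C, B|D\}$ gives $4 + 8 = 12$, that is \texttt{c}. The one exception is the code \texttt{0}, which the table treats like \texttt{f} and \texttt{-} (no information) rather than as the empty set.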

Note that the \textbf{CP} and \textbf{ABHCD} formats presuppose that the user has already performed some genotype inference. Depending on the generation, \texttt{spell-marker} will perform further genotype inference and HMM state inference using the pedigree.

Other allele codes can be defined via a JSON file (see appendix \ref{spell-marker:marker-observation-format-specification} for the format and appendix \ref{spell-marker:example-the-02-abhcd-and-cp-formats} for sample files).

\subsubsection{File sample}
\lstinputlisting[numbers=left,
		frame=single,
		breaklines=false,
		caption={[Marker alleles (.gen input file)]Marker alleles  (example1\_F2.gen from three\_parents\_F2 example)},
		%linerange={1-8,35-39}
		]
		{input_files/example1_F2.gen}

Note that on line 1, the text \texttt{F2} following \texttt{data type} is ignored.

\subsection{Genetic map}
\subsubsection{File format}
One line per linkage group, with space-separated fields:
\begin{itemize}
\item The name of the linkage group, prefixed with a star (\texttt{*})
\item The number of markers in the linkage group
\item The name of the first marker
\item For each subsequent marker, the distance in cM from the previous marker followed by the name of that marker
\end{itemize}
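
As an illustration of this layout, a linkage group named \texttt{LG1} with three markers could be written as follows. The group name, marker names and distances are hypothetical and are not taken from the example files.

\begin{lstlisting}[frame=single,breaklines=false]
*LG1 3 M1 12.5 M2 8.0 M3
\end{lstlisting}

Under this reading, a linkage group with $n$ markers yields a line with $2n + 1$ space-separated fields: the group name, the marker count, $n$ marker names and $n - 1$ distances.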
               
\subsubsection{File sample}
\lstinputlisting[numbers=left,frame=single,breaklines=false,caption={[Genetic Map (.map input file)] Genetic map (example1.map from three\_parents\_F2 example)}]{input_files/example1.map}

\subsection{Trait observations}
\subsubsection{File format}
As in the MapMaker RAW format, but without a header: one line per trait, beginning with the trait name prefixed with a star (\texttt{*}) and followed by space-separated observations (one numerical observation per individual; \texttt{-} means unobserved).
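
A minimal sketch of such a file for five individuals is given below; the trait names and values are hypothetical and only illustrate the structure.

\begin{lstlisting}[frame=single,breaklines=false]
*height 102.5 98.1 - 110.3 95.0
*yield 3.2 2.9 3.5 - 3.1
\end{lstlisting}

Here the third individual has no observation for \texttt{height} and the fourth has none for \texttt{yield}.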
 
\subsubsection{File sample}
\lstinputlisting[numbers=left,frame=single,breaklines=false,caption={[Trait observations (.phen input file)]Trait observations (example1\_F2.phen from three\_parents\_F2 example)}]{input_files/example1_F2.phen}