\documentclass[a4paper,titlepage]{report}

\usepackage{t1enc}
\usepackage[pdftex]{graphicx}
\usepackage{times}
\usepackage{a4wide}
\usepackage{hyperref}
\pdfcompresslevel=9
\usepackage{fancyvrb}
\usepackage{multind}

\makeindex{CmdFlags}

\parskip 5pt plus 3pt minus 2pt 

\newcommand{\EuGene}{\textsc{EuG\`ene}}
% comment out one of the two lines below to hide or show the developers' documentation
%\newcommand{\shrink}[1]{}   % hide text
\newcommand{\shrink}[1]{#1} % no effect
\author{Erika Sallet \and J\'er\^ome Gouzy \and Philippe Bardou \and Marie-Jos\'ee Cros \and Sylvain Foissac \and  Annick Moisan \and C\'eline Noirot \and Damien Leroux \and Thomas Schiex \\ Applied Mathematics and Computer Science Dept.\\ INRA Toulouse, France}

\def\abstractname{Overview}
\setcounter{tocdepth}{3}
\setcounter{secnumdepth}{3}
\title{\EuGene: an open gene finder for eukaryotes and prokaryotes}

\begin{document}
\maketitle
\tableofcontents

\begin{abstract}
  \EuGene\ is a sophisticated open gene finder for eukaryotic
  organisms and, since version 4.0, also for prokaryotic organisms.
  It has been developed thanks to funding by INRA
  (permanent scientists and engineers), G\'enoplante and the French
  ministry of research (with one PhD student). It generates text, HTML
  and graphical outputs.
  
  \EuGene\ uses a graph based model to predict genes that covers
  HMM based as well as more complex Bayesian net based or Conditional Markov Field based probabilistic
  predictions. The model is fixed but complex (with 47 different
  states in eukaryote mode, and 64 in prokaryote mode) 
  and covers exons, introns, UTRs, UTR introns\ldots\ each with a
  possible explicit distribution on length. The prediction itself
  relies on an optimal algorithm that is linear in time and space.
  
  Even if the gene model is fixed, the sources of information taken
  into account by \EuGene\ for prediction are extremely varied and can
  be easily extended by creating so-called plugins. Currently,
  \EuGene\ can use around 30 different plugins integrating statistical
  information (Markov models at DNA or amino acid level, WAM, Support
  Vector Machine based signal prediction\ldots), similarity information
  (EST, cDNA, proteins) and homology (exon conservation). It can also
  integrate predictions from other gene predictors if needed.
  
  In order to integrate all this information, \EuGene\ does not use
  maximum likelihood estimation for all parameters. Instead, parameters
  are optimized by maximizing prediction quality on expert-annotated data sets (minimizing
  empirical risk on a given dataset).

  The software, called \texttt{eugene}, is written in C++ and is distributed 
  under the Artistic License.
 \end{abstract}
\iffalse

\chapter{Quick Start}

\section{Annotating a sequence}
Here is a small example based on the \texttt{SYNO\_ARATH.fasta} sequence.
For reference information on the software, see chapter 2.

In order to first collect information on the sequence (splice sites,
translation start predictions\ldots) we will have to use the
\texttt{getsites4eugene.pl} script. This script directly queries the
NetGene2, SplicePredictor and NetStart web servers. Alternatively, if
you have installed these programs locally, you can use the
\texttt{lgetsites4eugene.pl} script (you must modify it and indicate
the paths to the executables). 

\begin{Verbatim}[fontsize=\scriptsize]
> ./getsites4eugene.pl Sequences/SYNO_ARATH.fasta 
started on sam dec 7 13:44:35 CET 2002

processing Sequences/SYNO_ARATH.fasta

NetStart [2*1 request(s)]: F1..R1..done
NetGene2 [1   request(s)]: 1..FR..done
SplicePredictor: done
finished on sam dec 7 13:47:14 CET 2002
\end{Verbatim}

The script creates files that contain information about the
sequence in the same directory as the FASTA file itself. The
extensions used are \texttt{.splices} for NetGene2, \texttt{.spliceP}
for SplicePredictor, \texttt{.starts} for NetStart (in each case, an
\texttt{R} is added for the reverse strand). 

We are now ready to use \EuGene\ on this sequence. Because the
sequence lacks context around the CDS of the gene, we inform \EuGene\ 
that the prediction should start and end in intergenic mode using the
\texttt{-s} flag. This behavior can also be controlled by all the
\texttt{Prior} parameters in the program parameter file (see
section~\ref{param}).

\EuGene\ produces two kinds of output: textual and graphical. Several
options control these outputs. Two of them are presented here (see chapter 2 for more details):
\begin{itemize}
\item \texttt{-p a|d|g|h|l|s|o}: selects the textual output format. Several formats can be requested at once, for example an HTML output
and a GFF output using the \texttt{-phg} flag. Two files will then be created: '\texttt{SYNO\_ARATH.html}'
and '\texttt{SYNO\_ARATH.gff}'. \texttt{-po} prints the prediction on stdout.
\item \texttt{-g}: activates the graphical output (\texttt{-g} is implied by the \texttt{-ph} flag).
\end{itemize}

\begin{Verbatim}[fontsize=\scriptsize]
EXECUTION_TRACE1
\end{Verbatim}

\section{Using transcribed sequences}

If you want to exploit similarities with cDNA/EST sequences, you have to inform \EuGene\ of existing similarities. These similarities
should be available in a file with the \texttt{.est} extension. The
format of this file is described in the \texttt{Est} plugin
section (section~\ref{plugest}). It can easily be created from an existing FASTA
databank of EST and cDNA using a patched version of \texttt{sim4}. The
patch is provided with \EuGene.

\begin{Verbatim}[fontsize=\scriptsize]
> sim4 Sequences/SYNO_ARATH.fasta cDNA A=6 > Sequences/SYNO_ARATH.fasta.est
\end{Verbatim}

With an old dbEST databank completed with the cDNA databank PlantGene,
we get the following file:

\begin{Verbatim}[fontsize=\scriptsize]
> cat Sequences/SYNO_ARATH.fasta.est
    32    421 1844 0 0 ATAJ644    1  390
   514    582 1844 0 0 ATAJ644  391  459
   699    809 1844 0 0 ATAJ644  460  570
   914   1018 1844 0 0 ATAJ644  571  675
  1271   1408 1844 0 0 ATAJ644  676  813
  1522   1602 1844 0 0 ATAJ644  814  894
  1694   1771 1844 0 0 ATAJ644  895  972
  1853   1921 1844 0 0 ATAJ644  973 1041
  2014   2088 1844 0 0 ATAJ644 1042 1116
  2181   2264 1844 0 0 ATAJ644 1117 1200
  2360   2446 1844 0 0 ATAJ644 1201 1287
  2712   2882 1844 0 0 ATAJ644 1288 1458
  2966   3092 1844 0 0 ATAJ644 1459 1585
  3189   3447 1844 0 0 ATAJ644 1586 1844
    32    375 347 0 0 N97006    1  347
  3099   3379 297 0 0 AV525988   51  347
  3071   3092 256 0 1 AI994358    1   22
  3189   3421 256 0 1 AI994358   23  256
   658    672 61 0 1 AV521563    1   14
   765    813 61 0 1 AV521563   15   61
\end{Verbatim}

We can now ask \EuGene\ for a new prediction, including this new
evidence using the \texttt{-d} flag (equivalently, the \texttt{Est}
plugin can be activated by modifying \EuGene's parameter file).  When
evidence from transcribed sequences is available, \EuGene\ will
automatically report in the last column of its output the percentage
of bases of the element (exon, UTR\ldots) which is consistent with the
available evidence. Here, the gene is almost completely covered by the
available transcribed sequences. The \texttt{Est} plugin also mentions
if transcribed sequences are rejected and why. The information from
two transcribed sequences is rejected: the first one because no splice
site has been found near one of the intron borders detected by the EST,
the other one because it was inconsistent with a sequence considered as
more reliable.

\begin{Verbatim}[fontsize=\scriptsize]
EXECUTION_TRACE2
\end{Verbatim}

Additional postprocessing can be requested from the plugin using the
\texttt{-E} flag. For each gene predicted, the plugin will analyze
each transcribed sequence matching the gene and report its consistency
with the prediction in the '\texttt{SYNO\_ARATH.misc\_info}' file.

\begin{Verbatim}[fontsize=\scriptsize]
EXECUTION_TRACE3
\end{Verbatim}

% TO UPDATE: THOMAS
%One can see that 2 of the 5 transcribed sequences have been filtered.
%\texttt{ATAJ644} is an almost full-length cDNA. It covers almost all
%the gene, from the position after the ATG to the 3'UTR. At the end,
%for each gene predicted, a summary reports that the CDS predicted is
%supported by 3273 over 3276 bases (the ATG is missing in all
%trannscripts). The transcribed sequence predicted (including the UTR)
%is supported for 3416 of its 3444 bases.

\section{Using protein similarities}

To also exploit similarities with homologous proteins, a
similar file format can be used (see the corresponding plugin). The
plugin can analyze similarities from several databases, each being
associated with a specific ``level''. For each level, a confidence is
defined in \EuGene's parameter file. Usually, 3 databases are used:
SwissProt, PIR and TrEMBL (from the highest confidence to the
lowest). Each collection of similarities is stored in a file with an
extension \texttt{.blast} followed by the level of the database 
(\texttt{.blast0}, \texttt{.blast1}, \ldots, \texttt{.blast9}).  The
script used to create these files from the output of NCBI-BLASTX is
copyrighted and is therefore not distributed with \EuGene. It is not
difficult to design another one. Here is an extract from
\texttt{SYNO\_ARATH.fasta.blast0}:

\begin{Verbatim}[fontsize=\scriptsize]
2820 2861 36 3e-08 +3  sp_O07683_SYD_HALSA; 335 348
2972 3088 41 3e-08 +2  sp_O07683_SYD_HALSA; 359 397
3185 3298 113 3e-08 +2  sp_O07683_SYD_HALSA; 398 435
353 418 45 2e-13 +2  sp_O24822_SYD_HALVO; 13 34
1850 1915 67 2e-13 +2  sp_O24822_SYD_HALVO; 202 223
2775 2858 72 2e-13 +3  sp_O24822_SYD_HALVO; 318 345
3191 3280 104 2e-13 +2  sp_O24822_SYD_HALVO; 397 426
353 418 51 7e-12 +2  sp_O26328_SYD_METTH; 21 42
1271 1414 70 7e-12 +2  sp_O26328_SYD_METTH; 141 188
1850 1954 62 7e-12 +2  sp_O26328_SYD_METTH; 210 244
3191 3280 93 7e-12 +2  sp_O26328_SYD_METTH; 401 430
\end{Verbatim}

To exploit this information, the \texttt{-b} flag must be used,
optionally followed by the set of levels to be exploited
(``\texttt{012}'' means levels 0, 1 and 2).  We start \EuGene\ and ask
for both EST and protein similarity analysis. We again enforce the
use of an intergenic mode at the beginning and end of the sequence.
As with ESTs, additional postprocessing can be requested
from the plugin using the \texttt{-B} flag.

\begin{Verbatim}[fontsize=\scriptsize]
EXECUTION_TRACE4
\end{Verbatim}

Other plugins are described in the reference section of this document.

\section{Annotating a prokaryotic sequence}

Since version 4.0, \EuGene\ is able to annotate prokaryotic sequences. In particular, \EuGene\ can predict overlapping protein genes, (possibly antisense) RNA genes and operon structures.
Here is a simple example based on a \textit{Sinorhizobium meliloti} sequence.
To annotate prokaryotes, the \texttt{-P} flag can be used. Here we want to exploit similarities with 2 protein databases, 
so we use the \texttt{-b} flag, followed by the set of levels to be exploited (``01'' for levels 0 and 1). For more details about the use of protein similarities, 
see section~\ref{blastxlabel}.

\EuGene\ results show that the last two genes overlap.


\begin{Verbatim}[fontsize=\scriptsize]
EXECUTION_TRACE5
\end{Verbatim}

\newpage
\fi

\chapter{Reference documentation}

To be executed, \EuGene\ needs at least one file: this is the
so-called parameter file. \EuGene's behavior is entirely controlled
by a set of parameters whose default values are available in this
file. These default values can be altered by editing this file or, for
some values, through flags in the command line (such as \texttt{-d}). 
Command line flags override any value in the parameter
file. The name of the parameter file that \EuGene\ seeks is obtained
by adding the suffix ``\texttt{.par}'' to the name of the \EuGene\ 
command itself.  As it is distributed, the \EuGene\ command's name is
\texttt{eugene} and accordingly the parameter file is
\texttt{eugene.par}. If at some point you want to use several
different parameter files, you can simply use symbolic links to the
\texttt{eugene} binary executable. Using a symbolic link with a
specific name to call \EuGene\ will enable you to load a different
parameter file whose name is derived from the symbolic link name by
adding ``\texttt{.par}''. The parameter file is first sought in the
local directory.  If this fails, the value of the environment variable
\texttt{EUGENEDIR} is used as a second possible path.
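
The lookup order described above can be sketched as follows. This is only an
illustration in Python, not \EuGene's actual C++ code: the function names
\texttt{parameter\_file\_name} and \texttt{find\_parameter\_file} and the link
name \texttt{eugene\_rice} are hypothetical, and the sketch assumes that the
name used to call the command (possibly a symbolic link name) is known.

```python
import os


def parameter_file_name(command_path):
    """Derive the parameter file name from the name used to call the command."""
    # "eugene" -> "eugene.par"; a symbolic link named "eugene_rice"
    # (hypothetical) -> "eugene_rice.par".
    return os.path.basename(command_path) + ".par"


def find_parameter_file(command_path, cwd=".", environ=None):
    """Return the path of the first parameter file found, or None."""
    environ = os.environ if environ is None else environ
    name = parameter_file_name(command_path)
    # 1. the local directory, 2. the directory given by EUGENEDIR (if set).
    candidates = [os.path.join(cwd, name)]
    if environ.get("EUGENEDIR"):
        candidates.append(os.path.join(environ["EUGENEDIR"], name))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

For instance, \texttt{parameter\_file\_name("./eugene\_rice")} yields
\texttt{eugene\_rice.par}, which is then sought first in the local directory
and then in \texttt{EUGENEDIR}.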

\EuGene\ gathers all information on the FASTA sequences through
so-called ``plugins'', also called ``sensors''. A plugin is a small
software component that can be dynamically loaded and that can inform
\EuGene\ about likely exonic, intronic, UTR and intergenic regions and
about signals in the sequence (splice sites, translation starts
and stops, transcription starts and stops and possible frameshifts).
Plugins can typically embody Markov models (that characterize exonic,
intronic\ldots\ regions) or splice site detectors or others.  Available
sensors are stored in the \texttt{PLUGINS} directory and are
dynamically loaded by \EuGene\ according to the parameters.

The typical call to \EuGene\ is:

\begin{Verbatim}[fontsize=\small]
eugene <fasta files>
\end{Verbatim}

where each FASTA file contains one single DNA sequence. In this case,
the first action of \EuGene\ is to seek and load the parameter file.
All the parameters in this file are either used by \EuGene\ or by the
plugins. Each plugin may have its own parameters. The following
section describes all the parameters used by \EuGene. Information
about the parameters used by plugins is provided in each plugin
section (see section~\ref{plug}).

\section{\EuGene's general parameters}
\label{param}

Here is a list of all the parameters not related to a plugin which control \EuGene's
behavior. When a command line flag exists that can modify the
corresponding parameter, it is indicated.  All the parameters that
control \EuGene's behavior are available in the parameter file. This
file has a relatively strict formatting. Each line can either be a
comment line (the first character in the line must be a \texttt{\#}) or a
parameter definition. Empty lines are not allowed. A parameter
definition is composed of two character strings. The first one is
the name of the parameter, the second is its value. Everything is case
sensitive. The definition order is not important.
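
As an illustration of these formatting rules, here is a minimal Python sketch
of a reader for such a file. It is not \EuGene's own parser: the function name
\texttt{parse\_parameter\_file} is hypothetical, the sketch assumes that name
and value are separated by whitespace, and it keeps all values as strings.

```python
def parse_parameter_file(text):
    """Parse parameter definitions into a dict (names are case sensitive)."""
    params = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        # A comment line has '#' as its very first character.
        if line.startswith("#"):
            continue
        fields = line.split()
        # Empty lines are not allowed, and a definition is exactly two
        # strings: the parameter name followed by its value.
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'name value', got {line!r}")
        name, value = fields
        params[name] = value  # definition order does not matter
    return params
```

With the two definitions \texttt{EuGene.mode Prokaryote} and
\texttt{Output.format l}, the result maps \texttt{EuGene.mode} to
\texttt{Prokaryote} and \texttt{Output.format} to \texttt{l}; an empty line
raises an error in this sketch.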

\begin{itemize}
\item \texttt{EuGene.version}: specifies the \EuGene\ version. 
  After loading the parameter file, \EuGene\ checks
  that the parameter file version is consistent with the executable
  version.

  \item \texttt{EuGene.organism}: name of the considered organism.

  \item \texttt{EuGene.mode}: 'Eukaryote' or 'Prokaryote'. The ``\texttt{-P}'' command 
  line flag activates the Prokaryote mode.  The value 'Prokaryote2' is also allowed (see section~\ref{prok}).

  \item \texttt{EuGene.sloppy}: in the default (non-sloppy) mode, 
  \EuGene\ will stop and abort if a required parameter is missing from the parameter
  file. If this parameter is set to \texttt{1}, a simple warning
  is emitted instead. Not advised unless you know what you are doing.

  \item \texttt{EuGene.VerboseGC}: if set to \texttt{1}, 
  garbage collector information is displayed.

  \item \texttt{EuGene.GCLatency}: latency of the garbage collector.

  \item \texttt{EuGene.ExonPrior}, \texttt{EuGene.IntronPrior},
  \texttt{EuGene.InterPrior}, \texttt{EuGene.FivePri\-mePrior},
  \texttt{EuGene.ThreePrimePrior}, \texttt{EuGene.RnaPrior}, 
  \texttt{EuGene.BiCodingPrior}, \texttt{EuGene.UIRPrior}: 
  priors on the initial/final state of the
  prediction. The ``\texttt{-s}'' command line flag can override these
  priors by setting all the non-intergenic priors to $0.0$. This
  forces \EuGene\ to start and end its prediction in intergenic mode.
  
\item \texttt{EuGene.InitExDist}, \texttt{EuGene.IntrExDist},
  \texttt{EuGene.TermExDist}, \texttt{EuGene.SnglExDist},
  \texttt{EuGene.IntronDist}, \texttt{EuGene.InterDist},
  \texttt{EuGene.5PrimeDist}, \texttt{EuGene.3PrimeDist}, 
  \texttt{EuGene.RnaDist}, \texttt{EuGene.OverlapDist}, \texttt{EuGene.UIRDist}: 
  names of the files giving explicit penalty distributions on the length. 
  The path is relative to \texttt{EUGENEDIR/models}. \EuGene\ 
  can use explicit penalty distributions on the length of the elements
  predicted. This can be an initial exon, an intermediate exon, a
  terminal exon, a single exon gene, an intron, an intergenic region, a
  5' UTR region, a 3' UTR region or a non-protein-coding RNA. 
  In prokaryote mode, this can also be a region of overlapping exons
  or an untranslated internal region (UIR). 
  Each parameter specifies the
  filename of an explicit penalty distribution file (see the specification of explicit penalty distributions after this list). %TBD

\item \texttt{EuGene.SplicedStopPen}: indicates the penalty for
  predicting genes containing in-frame spliced STOPs. This is
  basically set to an infinite value in order to avoid prediction
  containing spliced STOPs but setting this to 0.0 can be useful for
  pseudo-gene prediction\ldots
  
\item \texttt{EuGene.CodonTable}: name of the file which contains 
the DNA codon table. The path is relative to \texttt{EUGENEDIR/models}. 
This file is composed of three columns:
the first contains the codon; the second contains the amino acid (one letter) or the character '*' for a stop codon;
the third contains the character '+' only for a start codon.

Excerpt of the default eukaryote codon table:
\begin{Verbatim}
AAA K	
AAC N
ACA T
...
ATG M +
...
TGA *
\end{Verbatim}

\item \texttt{EuGene.NonCanDon}: comma-separated list of allowed non-canonical splice donor sites.

\item \texttt{EuGene.NonCanAcc}: comma-separated list of allowed non-canonical splice acceptor sites.

\item \texttt{Output.RemoveFrags}: in the text output, remove any 
fragmentary gene prediction (missing ATG or STOP or both). The prediction
process is unchanged,  the prediction is just filtered. \index{CmdFlags}{[Remove fragmentary proteins] F}

\item \texttt{Output.truncate}: in the text output, each gene element
  predicted is prefixed by the FASTA sequence id (or the filename if
  no FASTA id is available). This is truncated to the number of
  characters indicated. If set to 0 (or FALSE), the full id is used.

\item \texttt{Output.MinCDSLen}: any predicted gene whose CDS length
  in number of nucleotides is lower than this is filtered out from the
  output.

\item \texttt{Output.UTRtrim}: EuGene is natively capable of
  predicting UTR. If desired however, the UTR prediction of EuGene can
  be trimmed to be exactly consistent with the transcript evidence
  available as provided by the Est plugin. If no EST evidence is
  available, this means that all UTR predictions will be removed from
  the output.

\item \texttt{Output.initid}: in the text output, initial value for numbering genes.

\item \texttt{Output.stepid}: in the text output, step for numbering genes.

\item \texttt{Output.graph}: if set, requests graphical PNG output.
  This can also be set using the \texttt{-g} command line flag. \index{CmdFlags}{[eugene PNG graph required] g}
  The PNG filename is composed of the sequence name (without the \texttt{.fasta} suffix) followed by
  the number of the figure and the \texttt{.png} extension (start/end positions
  are also inserted if \texttt{-u}/\texttt{-v} is used).
  
\item \texttt{Output.resx}, \texttt{Output.resy}: control the
  horizontal and vertical resolution of the PNG images generated by
  \EuGene. 
  
\item \texttt{Output.gfrom}, \texttt{Output.gto}: control
  which part of the sequence is to be plotted (e.g.\ for
  zooming). The default value for both is $-1$, which corresponds to
  the whole sequence. These parameters can also be set using the
  \texttt{-u} and \texttt{-v} flags, respectively. \index{CmdFlags}{[eugene PNG graph lower bound] u}
  \index{CmdFlags}{[eugene PNG graph higher bound] v}

\item \texttt{Output.glen}: controls the number of nucleotides that
  will appear on a single image. The value $-1$ corresponds to a
  default adaptive mechanism which plots min(6000, length to
  visualize) nucleotides. The ``length to visualize'' is computed from the values
  given to \texttt{Output.gfrom} and \texttt{Output.gto}.
  
\item \texttt{Output.golap}: controls how successive PNG images
  overlap. It must be set to the number of overlapping nucleotides
  between 2 successive PNG images. Default is $-1$, which
  determines this heuristically based on the resolution and the number of nucleotides
  per image. This parameter can also be set using the \texttt{-c}
  command line flag. \index{CmdFlags}{[eugene PNG graph overlapping] c}
    
\item \texttt{Output.normopt}: indicates the way the scores are
  normalized across the possible states (phase 1, 2, 3, -1, -2, -3,
  introns and intergenic states).
  \begin{itemize}
  \item 0: no normalization
  \item 1: normalize across all states
  \item 2: normalize each coding phase with respect to the non coding
    score only.
  \end{itemize}
  Default is 1. Does not affect prediction, only graphical output.

\item \texttt{Output.window}: sets the half-size of the smoothing
  window used to plot the scores.  Default is 48. This does not affect
  prediction, only graphical output. It can be set using the
  \texttt{-w} command line flag. \index{CmdFlags}{[eugene PNG graph smoothing window] w}

\item \texttt{Output.intron}: enables printing of introns in the textual output.
  Default is 0 (no introns).
 
\item \texttt{Output.format}: controls the format of the textual
  output. May be \texttt{o} (stdout), \texttt{d} (detailed),
  \texttt{l} (long), \texttt{s} (short), \texttt{h} (html), \texttt{g}
  (gff) or \texttt{a} (araset format). Default is \texttt{l}. This can
  be overridden using the \texttt{-p} command line
  flag.\index{CmdFlags}{[eugene textual output format] p} \texttt{o}
  prints the prediction on stdout using the same format as
  \texttt{l}.  All the others print the prediction in files whose names
  are composed of the name of the sequence file (without the extension
  .fasta, .tfa, .fsa or .txt) followed by \texttt{.egn.debug (d)},
  \texttt{.egn (l)}, \texttt{.egn.short (s)}, \texttt{.html (h)},
  \texttt{.gff (g)}, \texttt{.gff3 (g)} or \texttt{.egn.ara (a)}.
  Multiple formats can be selected (\texttt{ohg} for example). When GFF
  is requested, both GFF1 and GFF3 are produced.
  
\item \texttt{Output.offset}: offsets the nucleotide positions
  of the prediction.  That is, the prediction for the nucleotide at
  position $i$ of the given sequence is printed as nucleotide $i+$ the
  offset. Useful to perform prediction on an extracted sequence
  without losing the original position. Can also be set using the
  \texttt{-o} command line flag. \index{CmdFlags}{[eugene nucleotide position offset] o}
  
\item \texttt{Output.Prefix}: indicates the directory where all non
  stderr/stdout output (e.g.\ PNG images, HTML and GFF files\ldots) should go.
  Default is the current directory.
  \index{CmdFlags}{[eugene output directory] O}

\item \texttt{Output.webdir}: location of the directory needed to generate HTML output. 
This location has to contain the 'Image', 'Style' and 'Javascripts' directories.
If the parameter is set to \texttt{LOCAL} then the web directory is \texttt{EUGENEDIR/web}. 
Otherwise this parameter value has to be a URL.

\item \texttt{Gff3.SoTerm}: indicates the path where the SOFA (Sequence Ontology Feature Annotation) terms are stored.
  The path is relative to \texttt{EUGENEDIR}.

\item \texttt{Eval.offset}: during the evaluation of a prediction, the prediction is compared with a 
reference (the real gene structure). The region in which the prediction 
and the reference are compared is defined as the reference positions +/- the offset
(used in optimization mode).

\item \texttt{Eval.ignoreNpcRNA}: set to 1 to ignore npcRNA 
in the fitness computation (used in optimization mode).

\item \texttt{Fitness.wsng}, \texttt{Fitness.wsne}, 
\texttt{Fitness.wsnn}, \texttt{Fitness.wspg}, 
\texttt{Fitness.wspe}, \texttt{Fitness.wsspn}: indicate respectively the weight 
of the gene sensitivity, of the exon sensitivity, 
of the nucleotide sensitivity, of the gene specificity, of the exon specificity 
and of the nucleotide specificity in the fitness computation (used in optimization mode).
\end{itemize}
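
To make the three-column layout of the \texttt{EuGene.CodonTable} file more
concrete, here is a small Python sketch of a reader for it. It is only an
illustration built from the description above: the function name
\texttt{parse\_codon\_table} is hypothetical, and the sketch assumes that
columns are separated by whitespace.

```python
def parse_codon_table(text):
    """Return (translation, starts, stops) read from a codon table."""
    translation = {}  # codon -> one-letter amino acid, or '*' for a stop
    starts = set()    # codons flagged with '+' in the third column
    stops = set()     # codons whose second column is '*'
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines in this sketch
        codon, amino = fields[0], fields[1]
        translation[codon] = amino
        if amino == "*":
            stops.add(codon)
        # The third column is present only for start codons.
        if len(fields) > 2 and fields[2] == "+":
            starts.add(codon)
    return translation, starts, stops
```

Applied to the lines of the excerpt shown in the \texttt{EuGene.CodonTable}
item, this reports \texttt{ATG} as the only start codon and \texttt{TGA} as a
stop codon.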

{\bf Specification of explicit penalty distributions on length}

As in semi-Markov models, \EuGene\ uses explicit distributions of
penalties on the length of all predicted elements. The dynamic
programming inside \EuGene\ guarantees that \EuGene\ will run in linear
time and space in the length of the sequence in all cases.

The distributions handled by \EuGene\ are made of 3 components. First,
there is a region of forbidden length (minimum length), then a region
with an arbitrary penalty distribution, then a region with a linear
variation of the penalty. From a probabilistic point of view, this means
an exponential tail.

Although \EuGene\ is linear in time in the sequence length, it is also
typically linear in time in the sum of the sizes of the first two
regions. For the moment, all existing \EuGene\ instances use explicit
distributions with an empty arbitrary region (the distribution is just
a minimum length followed by an exponential tail).

Explicit distributions must be specified in distribution files. Each
line in a distribution file contains a length and a penalty. The first
length used specifies the minimum allowed length. Then each line
specifies a point of the explicit distribution. Linear interpolation
is used between points. The last length used specifies the start
of the linear tail, and the last slope used becomes the slope of the linear
tail.

A typical distribution file is given below:
\begin{Verbatim}
3 0.0
4 2.0
6 4.0
\end{Verbatim}

It specifies a minimum length of 3. We then have an explicit
distribution region with penalty 0.0 for 3, 2.0 for 4, 3.0 for 5
(linear interpolation), then 4.0 at 6. As this is the last point
and the last slope is 1, the rest of the distribution will be linear
with slope 1.
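
The way such a file translates into a penalty for any length can be sketched
in a few lines of Python. This is an illustration of the rules stated above,
not \EuGene's implementation: the function names
\texttt{parse\_distribution} and \texttt{length\_penalty} are hypothetical,
forbidden lengths are represented by an infinite penalty, and the sketch
assumes that the file contains at least two points with increasing lengths.

```python
import math


def parse_distribution(text):
    """Read (length, penalty) points from the text of a distribution file."""
    points = []
    for line in text.splitlines():
        if line.strip():
            length, penalty = line.split()
            points.append((int(length), float(penalty)))
    if len(points) < 2:
        raise ValueError("this sketch needs at least two points")
    return points


def length_penalty(points, length):
    """Penalty of an element of the given length."""
    # Region 1: lengths below the first point are forbidden.
    if length < points[0][0]:
        return math.inf
    # Region 2: linear interpolation between consecutive points.
    for (l0, p0), (l1, p1) in zip(points, points[1:]):
        if l0 <= length <= l1:
            return p0 + (p1 - p0) * (length - l0) / (l1 - l0)
    # Region 3: linear tail continuing with the last slope used.
    (l0, p0), (l1, p1) = points[-2], points[-1]
    slope = (p1 - p0) / (l1 - l0)
    return p1 + slope * (length - l1)
```

With the three points of the example file, \texttt{length\_penalty} gives 3.0
for length 5 (interpolation), 6.0 for length 8 (tail of slope 1) and an
infinite penalty for length 2.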

\section{\EuGene's prokaryote mode}
\label{prok}

Since version 4.0, \EuGene\ is able to annotate prokaryotic sequences. This mode is activated using 
the ``\texttt{-P}'' flag or equivalently by setting the parameter \texttt{EuGene.mode} to 'Prokaryote'.
The value 'Prokaryote2' is also allowed: \EuGene\ will then make two independent predictions on the two strands. 
This is especially useful to predict antisense ncRNA.

In this mode, \EuGene\ can predict overlapping genes on the same strand or different strands, and operon structure.
Operon predictions are only available in the GFF3 output, activated by setting \texttt{Output.format} parameter 
to \texttt{g}.

Length constraints can be applied to overlap regions and to transcribed regions between genes by setting the 
\texttt{EuGene.OverlapDist} and \texttt{EuGene.UIRDist} parameters, respectively.

Two genes predicted on the same strand whose distance is less than \texttt{Operon.maxDistance} 
are automatically considered as members of the same operon. 
\texttt{Operon.initid} indicates the initial value for operon numbering.
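
The grouping rule can be sketched as follows in Python. This is only one
possible reading of the rule, not \EuGene's code: the function name
\texttt{group\_operons} and the gene tuples are hypothetical, the distance
between two genes is assumed to be the gap between the end of one gene and the
start of the next one on the same strand, and a gene with no close neighbour
is assumed to form an operon on its own.

```python
def group_operons(genes, max_distance, init_id=1):
    """Group genes into operons.

    genes: list of (name, start, end, strand) tuples, strand being '+' or '-'.
    Returns a dict mapping operon number to the list of gene names.
    """
    operons = {}
    next_id = init_id
    for strand in ("+", "-"):
        on_strand = sorted((g for g in genes if g[3] == strand),
                           key=lambda g: g[1])
        previous_end = None
        for name, start, end, _ in on_strand:
            # Open a new operon for the first gene of the strand, or when
            # the gap to the previous gene is not below the threshold.
            if previous_end is None or start - previous_end >= max_distance:
                current = next_id
                next_id += 1
                operons[current] = []
            operons[current].append(name)
            previous_end = end if previous_end is None else max(previous_end, end)
    return operons
```

In this sketch, overlapping genes on the same strand have a negative gap and
therefore always end up in the same operon.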
   


\section{Splice variant prediction}

Since version 3.4, \EuGene\ can predict splice variants based purely on
experimental data (alternative transcripts observed through EST, RNAseq or IsoSeq data). 
The feature is activated using the \texttt{-a} flag or 
equivalently by setting the parameter \texttt{AltEst.use} to 1 or 
TRUE.\index{CmdFlags}{[splice isoform prediction] a}

In this case, \EuGene\ will look for a file with the same name as the sequence
file and with a suffix '\texttt{.alt.est}'. This file has the same format as
the '\texttt{.est}' file used by the Est plugin (see later) and contains information
about genomic regions with high quality similarity with ESTs. GFF3 format is
allowed, by setting the \texttt{AltEst.format} value to GFF3. The spliced
alignment algorithm used to create this file should be of high quality,
with clear exon-intron boundaries associated with splice sites.

\EuGene\ only analyzes the EST alignments showing an inconsistency with a gene from the original prediction,
that is to say the alignments where one of the exons shows at one of its borders a difference of
at least \texttt{AltEst.IncompatibilityExonBorderMatchThreshold} nucleotides with an original gene.

\EuGene\ will analyze the retained ESTs and try to produce a prediction that follows the EST structure.
This prediction is performed in the region around the gene overlapping the EST (+/- \texttt{AltEst.RepredictMargin} nucleotides).
If the prediction is different from the optimal prediction (that is, if one of its exons shows 
at one of its borders a difference of at least \texttt{AltEst.ExonBorderMatchThreshold} nucleotides
with an original gene), the gene variant structure will also be output. 
Two files are created: one for the initial prediction (.gff3) and one also including
the variants (.variants.gff3).

This feature is controlled by a number of other parameters with the '\texttt{AltEst}' prefix in the parameter file. 
The parameters that you may want to change are the
length thresholds used for filtering (\texttt{AltEst.maxEstLength},
\texttt{AltEst.minEstLength}, \texttt{AltEst.maxIn}, \texttt{AltEst.minIn},
\texttt{AltEst.maxEx} and \texttt{AltEst.minEx}), which speak for themselves. 
These filters are applied if \texttt{AltEst.extremeLengthFilter} is activated.

If the ESTs are oriented, you can activate the parameter \texttt{AltEst.strandSpecific} to take into account the strand.

If \texttt{AltEst.includedEstFilter} is activated, \EuGene\ will remove the EST alignments
that are included in another one. Using this filter is recommended.

If \texttt{AltEst.compatibleEstFilter} is activated, \EuGene\ will look for pairs of ESTs
that are inconsistent with each other (at least one nucleotide is mapped to an exon by one and
to an intron/gap by the other). Only the ESTs belonging to such a pair will be analyzed.

If \texttt{AltEst.unsplicedEstFilter} is activated, \EuGene\ will remove the unspliced EST alignments.


Every alignment is also ``trimmed'' by \texttt{AltEst.exonucleasicLength} nucleotides
on its first and last hits to account for possible spurious short matches.
If these hits are shorter than this amount, they are removed from the available data.
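
A minimal Python sketch of this trimming step is given below. It is only an
illustration under stated assumptions: hits are inclusive \texttt{(start,
end)} pairs sorted by position, trimming is applied to the outer end of the
first and last hits, a hit that would be left empty is removed as well, and a
hit that becomes terminal after a removal is not trimmed again. The function
name is hypothetical.

\begin{Verbatim}[fontsize=\small]
def trim_alignment(hits, trim):
    """Trim `trim` nucleotides from both extremities of an alignment."""
    hits = list(hits)
    # First hit: remove it if too short, else shorten its left end.
    if hits:
        start, end = hits[0]
        if end - start + 1 <= trim:
            hits.pop(0)
        else:
            hits[0] = (start + trim, end)
    # Last hit: same treatment on its right end.
    if hits:
        start, end = hits[-1]
        if end - start + 1 <= trim:
            hits.pop()
        else:
            hits[-1] = (start, end - trim)
    return hits
\end{Verbatim}

For example, with a trimming amount of 10, the alignment \texttt{[(100, 105),
(200, 300), (400, 500)]} loses its 6-nucleotide first hit and becomes
\texttt{[(200, 300), (400, 490)]}.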

\texttt{AltEst.Penalty} is the penalty applied to each region incompatible with the EST alignment.
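
As an illustration, a parameter file fragment activating the feature could
look as follows. The parameter names are those described in this section; the
values are illustrative assumptions rather than recommended defaults, and it
is assumed that every switch accepts \texttt{1}/\texttt{0} in the same way as
\texttt{AltEst.use}. The length thresholds and \texttt{AltEst.Penalty} are
left out because suitable values depend on the data.

\begin{Verbatim}[fontsize=\small]
AltEst.use                   1      # same effect as the -a flag
AltEst.format                GFF3   # alignments provided in GFF3
AltEst.strandSpecific        1      # the ESTs are oriented
AltEst.includedEstFilter     1      # drop alignments included in another
AltEst.compatibleEstFilter   0
AltEst.unsplicedEstFilter    1      # drop unspliced alignments
AltEst.extremeLengthFilter   1      # apply the min/max length thresholds
\end{Verbatim}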


\section{Splice variant prediction from a reference annotation}

Since version 4.3, \EuGene\ can predict variants from a reference annotation. Only splice variants of the
reference genes are predicted.

The expected format is GFF3, similar to the output of
the egnep annotation pipeline. Note that this only works if the egnep annotation was performed with \texttt{independent\_strand\_annotation}=0.

To load the reference annotation, pass the GFF3 file with the \texttt{-k} parameter or, equivalently, set the parameter
\texttt{AltEst.reference}. \EuGene\ then works as described in the previous section, and only the '\texttt{.variants.gff3}' file is created.


\section{Plugins}
\label{plug}

Plugins are small software components that can be dynamically loaded
by \EuGene. Although this is completely transparent to the end user,
every plugin loaded by \EuGene\ must be written in C++ and be a
subclass of the \texttt{Sensor} class. This class essentially provides five
methods:
\begin{itemize}
\item constructor: when instantiated, a plugin receives an instance number
  (specified in the parameter file) and a DNA sequence (an instance of
  the \texttt{DNASeq} class). The instance number makes it possible to load
  several identical plugins using different parameters. A plugin with
  a parameter \texttt{X} and instance number \texttt{n} will fetch
  parameter \texttt{X[n]} in the parameter file. On instantiation, the
  plugin should load all the data needed to handle the sequence. If the
  plugin depends on optimizable parameters (parameters whose name is
  followed by a \texttt{*}), then the final configuration that may
  depend on these parameters must be postponed to the \texttt{Init}
  method.
  
\item \texttt{Init}: receives as argument the sequence to process (an
  instance of the \texttt{DNASeq} class) and performs the extra
  initializations that depend on the values of optimizable parameters
  (parameters whose name is followed by a \texttt{*}).
  
\item \texttt{GiveInfo}: receives as argument the sequence to process
  (an instance of the \texttt{DNASeq} class), a position on the
  sequence and a \texttt{Data} instance. The \texttt{Data}
  data structure can receive predictions on all signal and content
  scores known to \EuGene.

\item \texttt{Plot}: receives as argument the sequence to process (an
  instance of the \texttt{DNASeq} class) and plots all the predictions
  made by the sensor.

\item \texttt{PostAnalyse}: receives as argument the prediction of
  \EuGene\ and may check it against its own prediction and report
  support or inconsistencies.
\end{itemize}

The \texttt{Plot} and \texttt{PostAnalyse} methods are often empty.
The \texttt{Init} method is usually limited to the reloading of optimizable
parameters (see the source of the \texttt{Est} or \texttt{BlastX}
plugins for exceptions).

\subsection{Loading plugins}

When \EuGene\ starts, plugins are loaded and instantiated according to the
parameters in the parameter file. The \texttt{Sensor.*.use} parameters
activate or deactivate the corresponding sensor (which must be
available in the PLUGINS directory). If the parameter value is set to
\texttt{0} or \texttt{FALSE}, the plugin is not used. If the parameter
value is set to \texttt{1}, then a single instance of the plugin is
loaded. If the parameter is set to a larger integer value, then that number
of instances of the plugin is created.
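
The rule can be summarized by the following Python sketch, which maps the
value of a \texttt{Sensor.*.use} parameter to a number of instances. The
function name is hypothetical, and treating \texttt{TRUE} like \texttt{1} is
an assumption made by analogy with other switches.

\begin{Verbatim}[fontsize=\small]
def instance_count(value):
    """Number of plugin instances implied by a Sensor.*.use value."""
    text = value.strip().upper()
    if text == "FALSE":
        return 0
    if text == "TRUE":      # assumption: TRUE behaves like 1
        return 1
    return int(text)        # 0: not used, 1: one instance, n: n instances
\end{Verbatim}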

Below is the minimal list of plugins that are activated by default by
the \emph{Arabidopsis thaliana} version of \EuGene.

\begin{Verbatim}
Sensor.Transcript.use   1
Sensor.EuStop.use       1
Sensor.NStart.use       1
Sensor.IfElse.use       1 (with 2 splice site prediction plugins)
Sensor.MarkovIMM.use    1
Sensor.MarkovConst.use  1
\end{Verbatim}  

Sensors are loaded and instantiated in increasing order of
priority. The priority of a given type of plugin is defined by the
value of the corresponding \texttt{Sensor.*} parameter. Here is an
example of actual priorities:

\begin{Verbatim}
Sensor.Transcript       1
Sensor.FrameShift       1
Sensor.IfElse           1
Sensor.EuStop           1       
Sensor.NStart           1       
Sensor.MarkovIMM        1 
Sensor.Est              30         
\end{Verbatim}

The \texttt{Sensor.Est} is loaded last because it has the highest
priority.  This is important since the sensor actually uses the
information provided by other sensors (splice site prediction sensors)
that then have to be loaded before.
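
As a minimal Python sketch, the loading order for the priorities listed above
can be obtained by sorting the sensors by priority. Keeping the file order for
sensors sharing the same priority is an assumption of this sketch, and the
function name is hypothetical.

\begin{Verbatim}[fontsize=\small]
priorities = {
    "Transcript": 1, "FrameShift": 1, "IfElse": 1, "EuStop": 1,
    "NStart": 1, "MarkovIMM": 1, "Est": 30,
}


def load_order(priorities):
    """Sensor names sorted by increasing priority (stable sort)."""
    return sorted(priorities, key=priorities.get)
\end{Verbatim}

Here \texttt{load\_order(priorities)} ends with \texttt{Est}, so the splice
site predictions it relies on are already available when it is loaded.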

Several instances of the same sensor can be loaded. For example, if you are
dealing with an organism that has a wide GC\% range, you may use
several \texttt{Sensor.MarkovIMM} instances. Suppose you want to use one model
for sequences with a GC\% below 50 and another for higher GC\%.
This can be achieved by instantiating two such sensors.

\begin{Verbatim}
Sensor.MarkovIMM        2
\end{Verbatim}

When these sensors are instantiated, they look for specific
parameters. The first instance uses the usual parameters, or
parameters followed by \texttt{[0]}, for this plugin class; the second
instance uses parameters followed by \texttt{[1]}.

\begin{Verbatim}
MarkovIMM.matname[0]    lowGC.mat 
MarkovIMM.minGC[0]      0
MarkovIMM.maxGC[0]      50
MarkovIMM.matname[1]    highGC.mat
MarkovIMM.minGC[1]      50
MarkovIMM.maxGC[1]      100
\end{Verbatim}

As the example shows, defining the parameter
\texttt{MarkovIMM.matname[0]} (or any parameter followed by
\texttt{[0]}) is equivalent to defining the parameter \texttt{MarkovIMM.matname}.
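
This lookup rule can be sketched in Python as follows. The function name and
the dictionary representation of the parameter file are assumptions made for
the illustration; only the equivalence between \texttt{X} and \texttt{X[0]}
comes from the description above.

\begin{Verbatim}[fontsize=\small]
def get_param(params, name, instance=0):
    """Value of parameter `name` for plugin instance `instance`."""
    indexed = f"{name}[{instance}]"
    if indexed in params:
        return params[indexed]
    # Instance 0 may also be written without the [0] suffix.
    if instance == 0 and name in params:
        return params[name]
    raise KeyError(indexed)


params = {
    "MarkovIMM.matname": "lowGC.mat",      # same as MarkovIMM.matname[0]
    "MarkovIMM.matname[1]": "highGC.mat",
}
\end{Verbatim}

With these values, instance 0 reads \texttt{lowGC.mat} and instance 1 reads
\texttt{highGC.mat}.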

\subsection{\texttt{GFF3 input documentation}} 

Since version 3.4b, \EuGene\ allows for GFF3-compliant input and output.

\paragraph{GFF3 format}

See \texttt{http://www.sequenceontology.org/gff3.shtml} for details.
In a GFF3 file, everything is line-based. The format of a line is:

\verb!<seqid> <source> <type> <start> <end> <score> <strand> <phase> <attributes>!

The attributes column is composed of tags; some tags have predefined
meanings according to the GFF3 specification. These are the tags
\texttt{ID, Target, Ontology\_term}.  For \EuGene, we define additional
specific attributes: \texttt{is\_full\_length, target\_length,
  target\_sequence, database, frame\_hit, score\_hit}.
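
To make the line format concrete, here is a minimal Python sketch that splits
one GFF3 line into its nine columns and decodes the attributes column. It
follows the general GFF3 conventions (tab-separated columns,
\texttt{tag=value} pairs separated by semicolons, percent-encoded special
characters) and is not \EuGene's own parser; the function name is
hypothetical.

\begin{Verbatim}[fontsize=\small]
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase")


def parse_gff3_line(line):
    """Return one GFF3 feature line as a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in fields[8].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)
    record["attributes"] = attributes
    return record
\end{Verbatim}

Score, strand and phase are kept as strings because GFF3 allows the
placeholder \texttt{.} in these columns.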

A new parameter is needed in \texttt{eugene.par}: 
\texttt{Gff3.SoTerms		cfg/sofa.obo}

It specifies the path, relative to EUGENEDIR, of the file that contains
all SOFA codes. Currently we use version 1.2 of 25:07:2007. To
create a valid GFF3 file, you have to use SOFA terms or codes.

The third column (type) of the GFF3 format must contain a SOFA term;
the program accepts the id, the name and the synonyms.

Example of a SOFA term definition:
\begin{Verbatim}
[Term]
id: SO:0000164
name: three_prime_splice_site
def: "The junction between the 3 prime end of an intron and 
the following exon." [http://www.ucl.ac.uk/~ucbhjow/b241/glossary.html]
subset: SOFA
synonym: "3' splice site" RELATED []
synonym: "acceptor" RELATED []
synonym: "acceptor splice site" EXACT []
synonym: "splice acceptor site" EXACT []
is_a: SO:0000162 ! splice_site
\end{Verbatim}

For this term, the accepted types in the third column are:

\begin{itemize}
\item\texttt{ SO:0000164}
\item\texttt{ three\_prime\_splice\_site}
\item\texttt{ acceptor }
\item\texttt{ acceptor splice site or  acceptor\_splice\_site}
\item\texttt{ splice acceptor site or  splice\_acceptor\_site}
\end{itemize}

Each plugin has its own file extension; in GFF3 mode you just have to add '\texttt{.gff3}' after the native file name.
For example, if the plugin \texttt{SPred} is active with the GFF3 input format, it will expect
a file named \texttt{file.SPred.gff3} (instead of \texttt{file.SPred} in native mode).

We now describe each plugin, its behavior and parameters.

\subsection{\texttt{Signal plugins}}
\input{Doc_EuStop.tex}
\input{Doc_FrameShift.tex}
\input{Doc_GSplicer.tex}
\input{Doc_NG2.tex}
\input{Doc_NStart.tex}
\input{Doc_PatConst.tex}
\input{Doc_PepSignal.tex}
\input{Doc_RibosomalFrameShift.tex}
\input{Doc_SMachine.tex}
\input{Doc_SpliceWAM.tex}
\input{Doc_SPred.tex}
\input{Doc_StartWAM.tex}
\input{Doc_Transcript.tex}
\input{Doc_ProStart.tex}


\subsection{\texttt{Content plugins}}
\input{Doc_BlastX.tex}
\input{Doc_Est.tex}
\input{Doc_Homology.tex}
\input{Doc_MarkovConst.tex}
\input{Doc_MarkovIMM.tex}
\input{Doc_MarkovProt.tex}
\input{Doc_Repeat.tex}
\input{Doc_NStretch.tex}

\subsection{\texttt{Mixed signal/content plugins}}
\input{Doc_AnnotaStruct.tex}
\input{Doc_IfElse.tex}
\input{Doc_Riken.tex}
\input{Doc_NcRNA.tex}

\subsection{\texttt{Other plugins}}
\input{Doc_GCPlot.tex}
\input{Doc_GFF.tex}
\input{Doc_Plotter.tex}
\input{Doc_Tester.tex}

\section{\EuGene\ as a combiner}

\EuGene\ is able to integrate predictions from many sources and to combine them in one prediction.
For that, you only need to use the AnnotaStruct plugin (see section~\ref{annotastruct}): create as many AnnotaStruct instances as files to combine.


The parameter file \texttt{EUGENEDIR/cfg/eugene.combine.par} is configured to combine two files.
For the first file, the information about start and stop codons is taken into account, whereas for the second one, the information about splice sites and CDS is used.
\begin{Verbatim}[fontsize=\small]
 ##### Sensors AnnotaStruct #####
AnnotaStruct.FileExtension[0]      genefinder1
AnnotaStruct.TranscriptFeature[0]  transcript
AnnotaStruct.Start*[0]            2     # i: inline score (GFF3 format only) 
AnnotaStruct.StartType[0]         s      # p: probability  s: score
AnnotaStruct.Stop*[0] 1.5
AnnotaStruct.StopType[0] s
AnnotaStruct.Acc*[0] 0
AnnotaStruct.AccType[0] s
AnnotaStruct.Don*[0] 0
AnnotaStruct.DonType[0] s
AnnotaStruct.TrStart*[0] 0
AnnotaStruct.TrStartType[0] s
AnnotaStruct.TrStop*[0] 0
AnnotaStruct.TrStopType[0] s
AnnotaStruct.TrStartNpc*[0] 0
AnnotaStruct.TrStartNpcType[0] s
AnnotaStruct.TrStopNpc*[0] 0
AnnotaStruct.TrStopNpcType[0] s
AnnotaStruct.Exon*[0] 0
AnnotaStruct.Intron*[0] 0
AnnotaStruct.CDS*[0] 0
AnnotaStruct.npcRNA*[0]  0
AnnotaStruct.Intergenic*[0]  0
AnnotaStruct.format[0]             GFF3
\end{Verbatim}

\begin{Verbatim}[fontsize=\small]
AnnotaStruct.FileExtension[1]      genefinder2
AnnotaStruct.TranscriptFeature[1]  transcript
AnnotaStruct.Start*[1]            0     # i: inline score (GFF3 format only)
AnnotaStruct.StartType[1]          s      # p: probability  s: score
AnnotaStruct.Stop*[1] 0
AnnotaStruct.StopType[1] s
AnnotaStruct.Acc*[1] 3
AnnotaStruct.AccType[1] s
AnnotaStruct.Don*[1] 2.5
AnnotaStruct.DonType[1] s
AnnotaStruct.TrStart*[1] 0
AnnotaStruct.TrStartType[1] s
AnnotaStruct.TrStop*[1] 0
AnnotaStruct.TrStopType[1] s
AnnotaStruct.TrStartNpc*[1] 0
AnnotaStruct.TrStartNpcType[1] s
AnnotaStruct.TrStopNpc*[1] 0
AnnotaStruct.TrStopNpcType[1] s
AnnotaStruct.Exon*[1] 0
AnnotaStruct.Intron*[1] 0
AnnotaStruct.CDS*[1] 4
AnnotaStruct.npcRNA*[1]  0
AnnotaStruct.Intergenic*[1]  0
AnnotaStruct.format[1]             GFF3
#
# SIGNAL/CONTENT SENSORS
Sensor.AnnotaStruct.use 2
#
\end{Verbatim}
More details about the AnnotaStruct parameters are given in section~\ref{annotastruct}.


\section{Optimization of Plugins parameters}

The value of some numerical plugin parameters (specified in the
parameter file with a name ending with a '*') can be optimized on
a reference set of sequences (with their related information) for
which gene positions are known. The idea is to adapt the values of the
parameters to increase as much as possible the quality of the prediction
of genes and exons. Figure~\ref{fig:ParaOptimization} details the
general operation of the software with its input and output files.

\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=7cm]{ParaOptimization}
  \end{center}
  \caption{Input and output files for parameters optimization} 
  \label{fig:ParaOptimization}
\end{figure}

The optimization can be launched with the \texttt{-Z} argument
\index{CmdFlags}{[parameters optimization] Z} on the command line or
with the \texttt{ParaOptimi\-zation.Use} parameter set to \texttt{1}.

After updating the parameter file \texttt{eugene.par} (which sensors
to use, \ldots), the software is launched with the usual command line,
specifying as arguments the reference sequences to consider. At the
end, the software creates a new parameter file called
\texttt{eugene.<date>.OPTI.par} (for example,
\texttt{eugene.30Sep\-2003.OPTI.par}) with the new values of the
optimized parameters.

For parameter optimization, the inputs to be specified in the
parameter file are:
\begin{itemize}
\item the parameters to optimize with their value domain,
\item the optimization algorithm to use: genetic algorithm, Line
  Search, or genetic algorithm and Line Search,
\item the parameters of the optimization algorithm, and for the Line
  Search algorithm complementary information on the parameters to
  optimize (initial value, discretization step, \ldots),
\item a file with the coordinates of the genes for the sequence set (see its description below),
\item optionally, a regularizer term included in the optimized
  criterion, using the \texttt{ParaOptimization.Regularizer} parameter.
  The sum of the absolute values of all the parameters (L1-norm)
  multiplied by the value of this parameter is used to penalize the
  original fitness to define the final fitness that is optimized,
\item \texttt{Eval.offset}: during the evaluation of a prediction, the prediction is compared with a
reference (the real gene coordinates). The region in which the prediction
and the reference are compared is defined as the reference positions $\pm$ the offset,
\item \texttt{Eval.ignoreNpcRNA}: set to 1 to ignore npcRNA when computing the fitness,
\item \texttt{Fitness.wsng}, \texttt{Fitness.wsne},
\texttt{Fitness.wsnn}, \texttt{Fitness.wspg},
\texttt{Fitness.wspe}, \texttt{Fitness.wsspn}: respectively the weights of the gene sensitivity, of the exon sensitivity,
of the nucleotide sensitivity, of the gene specificity, of the exon specificity
and of the nucleotide specificity in the fitness computation.
\end{itemize}
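
Written as a formula, and assuming that the fitness is maximized so that
penalizing means subtracting, the regularized criterion reads as follows,
where $F$ is the original fitness, $\theta_1,\ldots,\theta_k$ are the
optimized parameters and $\lambda$ is the value of
\texttt{ParaOptimization.Regularizer} (these symbols are introduced here for
illustration only):
\[
  F_{\mathrm{reg}}(\theta) \;=\; F(\theta) \;-\; \lambda \sum_{i=1}^{k} |\theta_i|
\]
With $\lambda = 0$ the original fitness is recovered; larger values of
$\lambda$ favour parameter values closer to zero.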
983

\paragraph{Description of the file with the coordinates of genes}
Each line of the file describes one gene. The first field of the line is the sequence name. The following fields are the start and stop positions of each exon of the gene, in order.
The fields are separated by spaces.
An empty line is required to separate two different sequences.
Note that the order of the sequences is important: it has to match the order produced by the '\texttt{ls}' command.

Example:
{\scriptsize \begin{verbatim}
SEQ1 429 545 665 750
SEQ1 -2001 -2342 -2424 -2522

SEQ2 1000 1230 1521 1690 2510 2600

\end{verbatim}}
This example describes two sequences named SEQ1 and SEQ2. SEQ1 contains two genes: the first gene is composed of two exons on the forward strand [429-545] [665-750],
the second of two exons on the reverse strand (negative coordinates) [2001-2342] [2424-2522].
SEQ2 has a single gene composed of three exons [1000-1230] [1521-1690] [2510-2600].
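
The format can be read with a few lines of Python, as in the minimal sketch
below. It is an illustration and not part of \EuGene: the function name and
the returned structure are assumptions, and the strand is inferred from the
sign of the first coordinate as in the example above.

\begin{Verbatim}[fontsize=\small]
def parse_gene_coordinates(text):
    """Map each sequence name to a list of (strand, exons) genes."""
    genes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # empty line: separator between sequences
        name = fields[0]
        numbers = [int(x) for x in fields[1:]]
        if not numbers or len(numbers) % 2:
            raise ValueError(f"bad exon list for {name}: {line!r}")
        strand = "-" if numbers[0] < 0 else "+"
        exons = [(abs(numbers[i]), abs(numbers[i + 1]))
                 for i in range(0, len(numbers), 2)]
        genes.setdefault(name, []).append((strand, exons))
    return genes


example = """SEQ1 429 545 665 750
SEQ1 -2001 -2342 -2424 -2522

SEQ2 1000 1230 1521 1690 2510 2600
"""
\end{Verbatim}

Calling \texttt{parse\_gene\_coordinates(example)} yields two genes for SEQ1
(one on each strand) and one three-exon gene for SEQ2.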