Purpose
The 'Process' pipeline lets you analyse and load a transcriptome when you only have the FASTA file of your contigs and the FASTQ files of your libraries.
It can be launched in two ways :
- with a config file
- with parameters
Create an instance
An instance corresponds to a BioMart instance that can host several projects (species or applications).
usage: ngspipelines_cli.py addinstance [-h] --instance-name STR [--port INT]
[--mem INT] [--url STR]
[--metadata STR]
optional arguments:
-h, --help show this help message and exit
--instance-name STR Which is the name of the instance
--port INT HTTP deployment port [9000]
--mem INT Instance allocated memory in megabytes [1024]
--url STR HTTP public url
--metadata STR Which metadata should be linked to this workflow
Example :
python ./bin/ngspipelines_cli.py addinstance --instance-name myinstance --port 9090 --url http://myserver.fr
Your instance will be available at http://myserver.fr:9090
Usage
ngspipelines_cli.py rnaseqdenovo -h
Process a new transcriptome project using a config file
All command line options (described in the sections below) can also be provided in a configuration file.
Edit rnaseqdenovo.cfg and change the path in the line 'db1.file = /workflows/rnaseqdenovo/data/process/ebi_swissprot_filter' so that it points to your own database.
Example :
python ./bin/ngspipelines_cli.py rnaseqdenovo @workflows/rnaseqdenovo/data/process/rnaseqdenovo.cfg
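Updating the db1.file line can also be scripted. The sketch below assumes the configuration file holds one `key = value` pair per line, as in the db1.file line quoted above; the helper name set_cfg_value and the replacement path /path/to/ebi_swissprot are hypothetical and not part of ngspipelines.

```python
import re


def set_cfg_value(cfg_text, key, value):
    """Return cfg_text with the value of the `key = value` line replaced.

    Assumes one `key = value` pair per line. Raises KeyError when the
    key is not found, so a typo in the key name does not pass silently.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    # A function is used as replacement so that backslashes in `value`
    # are inserted literally.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, cfg_text)
    if count == 0:
        raise KeyError(key)
    return new_text


example = "db1.file = /workflows/rnaseqdenovo/data/process/ebi_swissprot_filter\n"
# /path/to/ebi_swissprot is a placeholder, not a real location.
print(set_cfg_value(example, "db1.file", "/path/to/ebi_swissprot"), end="")
```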
Launch a new project with command line
Minimal command line
Here is a minimal command line (be aware that the resulting web interface will be quite sparse).
Example :
python ./bin/ngspipelines_cli.py rnaseqdenovo --instance-name myinstance --project-name MyProject --species "Latin Name" --species-common-name "common" --description "Project description" \
--assembly file=workflows/rnaseqdenovo/data/process/contigs.fa software-name=oases software-parameters="" software-version="0.2.06" comments="Transcript assembly" \
--library library-name=brain_400 sample-name=Brain replicat=1 tissue=Brain type=se insert-size=400 remark="100bp to 400bp insert" sequencer=HiSeq2000 files=workflows/rnaseqdenovo/data/process/lib_152.1.fastq.gz \
--assembly-annot-db file=/path/to/ebi_swissprot type=protein evalue=1e-5 max-hit=10 software=blastx name=swissprot
At least one assembly, one library and one database are required.
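When the command is generated from a script, it can be easier to assemble it as an argument list than as one long shell string. The following sketch rebuilds the minimal command above; the helper attr_option is hypothetical, and the only thing it relies on is that each key=value pair reaches the program as a single argument, which is what the shell quoting in the example produces.

```python
import shlex


def attr_option(flag, attrs):
    """Build [flag, 'key=value', ...] for an attribute-style option.

    Each key=value pair is one argv token, so values containing spaces
    (e.g. remark) need no extra quoting when no shell is involved.
    """
    return [flag] + ["{}={}".format(key, value) for key, value in attrs.items()]


command = [
    "python", "./bin/ngspipelines_cli.py", "rnaseqdenovo",
    "--instance-name", "myinstance",
    "--project-name", "MyProject",
    "--species", "Latin Name",
    "--species-common-name", "common",
    "--description", "Project description",
]
command += attr_option("--assembly", {
    "file": "workflows/rnaseqdenovo/data/process/contigs.fa",
    "software-name": "oases",
    "software-parameters": "",
    "software-version": "0.2.06",
    "comments": "Transcript assembly",
})
command += attr_option("--library", {
    "library-name": "brain_400", "sample-name": "Brain", "replicat": 1,
    "tissue": "Brain", "type": "se", "insert-size": 400,
    "remark": "100bp to 400bp insert", "sequencer": "HiSeq2000",
    "files": "workflows/rnaseqdenovo/data/process/lib_152.1.fastq.gz",
})
command += attr_option("--assembly-annot-db", {
    "file": "/path/to/ebi_swissprot", "type": "protein", "evalue": "1e-5",
    "max-hit": 10, "software": "blastx", "name": "swissprot",
})

# Print a copy-pastable shell command; the list itself can be handed to
# subprocess.run(command) to launch the pipeline.
print(" ".join(shlex.quote(token) for token in command))
```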
General options
--library [mandatory]
You can provide FASTQ files, which will be copied to the data directory and made available on the download page. If no FASTQ file is provided, you must set the nb-sequence attribute so that the database can be populated and the histograms shown in the user interface can be computed.
List of available attributes (* marks a mandatory attribute) :
- library-name* : [string] the internal library name, must be unique
- sample-name* : [string] sample name
- replicat* : [int] replicate number
- tissue : [string] tissue
- dev_stage : [string] development stage
- type* : [string] library type, available options :
  - se : single end
  - pe : paired end
  - ose : oriented single end
  - ope : oriented paired end
  - mp : mate pair
- insert-size : [int] insert size, for paired end libraries
- remark : [string] any comment
- sequencer : [string] sequencer type
- public : [int] 0 if library is private, 1 if public
- accession : [string] accession number if the library has been published in SRA or ENA
- database : [string] database where the library is stored, e.g. SRA or ENA (if available)
- nb-sequence : [int] number of sequences in the library; provide it to save the CPU time needed to count them
- files* : [string] FASTQ file path (for a paired library, separate the file names with a space)
If you have several libraries, use the --library option once per library.
Example :
--library library-name=brain_400 sample-name=Brain replicat=1 tissue=Brain type=pe insert-size=400 remark="100bp to 400bp insert" \
sequencer=HiSeq2000 files=workflows/rnaseqdenovo/data/brain_400.1.fastq.gz,workflows/rnaseqdenovo/data/brain_400.2.fastq.gz
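Before launching a long run, the attributes of each --library option can be checked against the list above. This is a hypothetical pre-flight check, not the validation performed by the pipeline itself: the function names are made up, and the sketch assumes that files may be left out when nb-sequence is given, as the --library description states.

```python
MANDATORY = ("library-name", "sample-name", "replicat", "type")
LIBRARY_TYPES = {"se", "pe", "ose", "ope", "mp"}


def parse_attributes(tokens):
    """Turn ['key=value', ...] tokens into a dict, splitting on the first '='."""
    attrs = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError("expected key=value, got {!r}".format(token))
        attrs[key] = value
    return attrs


def check_library(attrs):
    """Return a list of problems found in one --library attribute set."""
    problems = ["missing mandatory attribute: " + name
                for name in MANDATORY if not attrs.get(name)]
    # Assumption: nb-sequence replaces files when no FASTQ file is given.
    if not attrs.get("files") and not attrs.get("nb-sequence"):
        problems.append("provide files, or nb-sequence when no FASTQ file is given")
    lib_type = attrs.get("type")
    if lib_type and lib_type not in LIBRARY_TYPES:
        problems.append("unknown library type: " + lib_type)
    return problems


def duplicate_names(libraries):
    """Return the library-name values used by more than one library."""
    seen, duplicates = set(), []
    for attrs in libraries:
        name = attrs.get("library-name")
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


tokens = ["library-name=brain_400", "sample-name=Brain", "replicat=1",
          "type=pe", "insert-size=400", "files=brain_400.1.fastq.gz"]
print(check_library(parse_attributes(tokens)))
```

An empty list means that no problem was found for that library.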
ASSEMBLY section
--assembly [mandatory]
The assembly option is mandatory. The possible attributes are :
- file* : FASTA file, may be gzip-compressed (.gz).
- software-name* : [string] assembly software name
- software-parameters* : [string] assembly software parameters
- software-version* : [string] assembly software version
- comments : [string] any comments on this analysis
Example :
--assembly file=workflows/rnaseqdenovo/data/contigs.fasta software-name=oases software-parameters="" software-version="0.2.06" comments="Transcript assembly"
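Since the assembly file may be plain or gzip-compressed, a quick sanity check before loading is to count the contigs it contains. The sketch below is an illustration only; the function names are hypothetical, and it decides on compression from the .gz file extension.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_contigs(path):
    """Count FASTA records, i.e. lines starting with '>'."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Usage, with the file from the example above:
# count_contigs("workflows/rnaseqdenovo/data/contigs.fasta")
```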
--rename
Set this flag to rename your contigs with the gene name of the best annotation.
--prefix
Prefix added to every contig name when renaming.
Example :
--rename --prefix "GG_"
ANNOTATION section
--assembly-annot-db [mandatory]
You can provide as many databases as you want for annotation with NCBI BLAST+. Each database must have been indexed with makeblastdb.
- file* : Database file (with its index in the same directory).
- type* : [string] kind of data : [genome|nucleic|protein|transcript|unknown]
- name* : [string] The name of the databank (used to trace the source of annotations)
- weight : [float] This weight modifies the HSP score (score = old_score + weight * old_score) used when selecting the best annotation.
- species : [string] The species the sequences come from. Fill in this parameter if the databank contains a single species and the sequence headers do not give its name.
- evalue : [float] The maximum e-value for the alignments.
- software : [string] The NCBI BLAST+ program used for the alignment [blastx|blastp...]. The path to the executable is retrieved from PATH or from application.properties.
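The effect of the weight attribute on the choice of the best annotation can be illustrated with the formula given above (score = old_score + weight * old_score). In this sketch the dictionary keys, the databank name other_db and the default weight of 0 are assumptions made for the example; only the formula comes from the attribute description.

```python
def weighted_score(score, weight=0.0):
    """Apply the databank weight: score + weight * score."""
    return score + weight * score


def best_hit(hits):
    """Return the hit with the highest weighted score.

    `hits` is a list of dicts with the hypothetical keys 'name',
    'score' and 'weight'.
    """
    return max(hits, key=lambda hit: weighted_score(hit["score"], hit["weight"]))


hits = [
    {"name": "swissprot", "score": 200.0, "weight": 0.5},  # weighted: 300.0
    {"name": "other_db", "score": 250.0, "weight": 0.0},   # weighted: 250.0
]
print(best_hit(hits)["name"])  # swissprot
```

With a weight of 0.5 the swissprot hit overtakes a hit whose raw score is higher, which is how a trusted databank can be favoured.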
--min-identity
option to filter on minimum fraction of identity [0.00-1.00]
--min-coverage
option to filter on minimum fraction of query coverage [0.00-1.00]
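Both thresholds are fractions between 0.00 and 1.00. As an illustration, the sketch below shows one way such a filter could be applied to an alignment. Whether the pipeline treats the thresholds as inclusive, and how exactly it computes query coverage, are assumptions here; the function names are hypothetical.

```python
def query_coverage(query_start, query_end, query_length):
    """Fraction of the query covered by an alignment (1-based, inclusive)."""
    return (abs(query_end - query_start) + 1) / query_length


def passes_filters(identity, coverage, min_identity=0.0, min_coverage=0.0):
    """Return True when both fractions reach their thresholds (inclusive)."""
    for name, value in (("min_identity", min_identity),
                        ("min_coverage", min_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError("{} must be between 0.00 and 1.00".format(name))
    return identity >= min_identity and coverage >= min_coverage


# An alignment covering positions 1-50 of a 100 bp contig at 90% identity.
coverage = query_coverage(1, 50, 100)
print(passes_filters(0.90, coverage, min_identity=0.80, min_coverage=0.50))
```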
--go
A GO (Gene Ontology) file associates GO names, evidence, etc. with each contig.
Example :
--go go.txt
--skip-rm
This option skips the RepeatMasker annotation.
--skip-rnammer
Skip RNAmmer step (RNAmmer is used to add rRNA predictions).
--skip-trna
Skip the tRNA prediction step.
IPRscan section
--skip-iprscan
This option skips the iprscan annotation. Iprscan takes a long time to run but provides good annotation of protein domains, ORFs and GO terms.
--max-orf-nb
[int] The maximum number of ORFs per contig to report in the annotation.
VARIANT section
--variant
If you perform your own variant detection you can provide the resulting file; otherwise the pipeline will detect variants with GATK3.
This file contains the variations found in each contig : SNPs, insertions or deletions. The expected file format is VCF (Variant Call Format). If the VCF file has been produced using GATK [http://www.broadinstitute.org/gatk/], the allelic count per library will be extracted from the VCF file.
Here is the list of attributes for this option :
- file : variant file in VCF format [http://vcftools.sourceforge.net/specs.html]
- software-name : [string] detection software name
- software-parameters : [string] detection software parameters
- software-version : [string] detection software version
- comments : [string] comments on analysis
Example :
--variant file=variant.vcf software-name=GATK software-parameters="realignement/recalibration/glm BOTH" software-version="v2.4-9-g532efad"
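To check what a VCF file contains before loading it, each record can be classified as a SNP, an insertion or a deletion from its REF and ALT columns. This is a minimal sketch based on the standard VCF column order (CHROM, POS, ID, REF, ALT, ...); it is not the parser used by the pipeline, and the function names are hypothetical.

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT pair as 'snp', 'insertion', 'deletion' or 'other'."""
    # Symbolic, missing or spanning-deletion alleles are not classified.
    if alt.startswith("<") or alt in (".", "*"):
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"  # same length > 1, e.g. a multi-nucleotide substitution


def variants_from_vcf_line(line):
    """Yield (contig, position, type) for each ALT allele of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
    for alt in alts.split(","):
        yield chrom, pos, classify_variant(ref, alt)


# A made-up record with two alternative alleles on a contig named contig1.
record = "contig1\t42\t.\tA\tG,AT\t50\tPASS\t.\n"
print(list(variants_from_vcf_line(record)))
```

Header lines (those starting with '#') should be skipped before calling variants_from_vcf_line.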
--two-steps-calling
The SNP calling is performed in two steps. The first step (recalibration, calling, filter) uses hard filters. The second step (recalibration, calling, filter) uses standard filters, with the variants detected in the first step serving as the database of known polymorphic sites.
--variant-annot-db
You can use a well-known species to annotate your SNPs by similarity. The species must come from Ensembl.
- species : the name of the species used as reference.
- fasta : the protein sequences of the species used as reference.
- gtf : the gene and CDS annotations of the species used as reference.
- vcf [optional] : known variants of the species used as reference.
Example :
--variant-annot-db species="Danio rerio" fasta=Danio_rerio.Zv9.pep.all.fa gtf=Danio_rerio.Zv9.77.gtf vcf=Danio_rerio.vcf
Monitoring workflow
To get information about all workflows :
python ./bin/ngspipelines_cli.py status
To get information about a running workflow :
python ./bin/ngspipelines_cli.py status --workflow-id XX
To get information about workflow errors :
python ./bin/ngspipelines_cli.py status --workflow-id XX --errors
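For scripted monitoring, the three status commands above can be produced by one small helper. The function status_command is hypothetical; it only combines the flags shown above, and it refuses --errors without a workflow id because that combination does not appear in this guide.

```python
def status_command(workflow_id=None, errors=False):
    """Build the argument list for the status subcommand."""
    if errors and workflow_id is None:
        raise ValueError("errors=True requires a workflow_id")
    command = ["python", "./bin/ngspipelines_cli.py", "status"]
    if workflow_id is not None:
        command += ["--workflow-id", str(workflow_id)]
    if errors:
        command.append("--errors")
    return command


# The workflow id 12 is a made-up value for the example.
print(" ".join(status_command(12, errors=True)))
```

The returned list can be passed to subprocess.run to execute the command.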
Delete a project
The deleteproject subcommand removes a project from an instance.
Example :
python ./bin/ngspipelines_cli.py deleteproject --project-name MyProject
Delete an instance
This subcommand deletes the instance repository but NOT the projects inside the instance.
Example :
python ./bin/ngspipelines_cli.py deleteinstance --project-name myinstance
Launch web server
Once you have loaded the data into your project, you can give access to the user interface by launching the instance with the runinstance subcommand. This starts the corresponding web server.
Example :
python ./bin/ngspipelines_cli.py runinstance --instance-name myinstance
To stop the web server, pass --command stop.
Example :
python ./bin/ngspipelines_cli.py runinstance --instance-name myinstance --command stop
Web-server connection
Once the web server is started you can access it at its URL. The URL must include the port, separated from the host by ':'.
Example :
http://ngspipelines.toulouse.inra.fr:9000/
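The address of an instance can be derived from the --url and --port values given to addinstance. The helper below is a hypothetical sketch; it uses 9000 as the default port because that is the default shown in the addinstance help, and it replaces any port already present in the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def instance_address(url, port=9000):
    """Return `url` with an explicit port, e.g. http://host:9000/."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError("URL must include a scheme and a host: {!r}".format(url))
    netloc = "{}:{}".format(parts.hostname, port)
    return urlunsplit((parts.scheme, netloc, parts.path or "/", "", ""))


print(instance_address("http://ngspipelines.toulouse.inra.fr"))
# http://ngspipelines.toulouse.inra.fr:9000/
```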