Purpose
The 'Load' pipeline enables users to load their own data (alignment, annotation, variant). If you only have the FASTA file of your contigs and the FASTQ files of your libraries, you should run the 'Process' pipeline instead.
Each pipeline can be launched in two ways :
- with a config file
- with parameters
Create an instance
An instance corresponds to a BioMart instance hosting different projects (species or applications).
usage: ngspipelines_cli.py addinstance [-h] --instance-name STR [--port INT]
[--mem INT] [--url STR]
[--metadata STR]
optional arguments:
-h, --help show this help message and exit
--instance-name STR Which is the name of the instance
--port INT HTTP deployment port [9000]
--mem INT Instance allocated memory in megabytes [1024]
--url STR HTTP public url
--metadata STR Which metadata should be linked to this workflow
Example :
python ./bin/ngspipelines_cli.py addinstance --instance-name myinstance --port 9090 --url http://myserver.fr
Your instance will be available at http://myserver.fr:9090
Usage
ngspipelines_cli.py load-rnaseqdenovo -h
Load a new project using a config file
All command-line options (listed by the -h command above) can be provided in a configuration file. Example :
python ./bin/ngspipelines_cli.py load-rnaseqdenovo @workflows/rnaseqdenovo/data/rnaseqdenovo.cfg
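The `@file` syntax resembles the standard `fromfile_prefix_chars` mechanism of Python's argparse module. Whether ngspipelines_cli.py relies on it is an assumption, and so is the one-argument-per-line layout used below; the bundled rnaseqdenovo.cfg remains the reference for the real format. The toy parser here is hypothetical and only sketches how argparse expands such a file:

```python
import argparse
import os
import tempfile


def build_demo_parser():
    """Toy parser for illustration; NOT the real ngspipelines_cli.py parser."""
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("--instance-name")
    parser.add_argument("--project-name")
    # Every --library occurrence collects its own list of key=value tokens.
    parser.add_argument("--library", nargs="+", action="append")
    return parser


def parse_config_lines(lines):
    """Write one argument per line to a temporary file and parse '@file'."""
    with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as handle:
        handle.write("\n".join(lines) + "\n")
        path = handle.name
    try:
        return build_demo_parser().parse_args(["@" + path])
    finally:
        os.remove(path)


demo_args = parse_config_lines([
    "--instance-name",
    "myinstance",
    "--project-name",
    "MyProject",
    "--library",
    "library-name=brain_400",
    "type=pe",
])
print(demo_args.instance_name, demo_args.library)
```

With argparse's default behaviour each line of the file becomes exactly one argument, so values containing spaces need no quoting inside the file.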
Load a new project with command line
Minimal command line
Here is a minimal command line (be aware that the resulting web interface will contain little information) :
Example :
python ./bin/ngspipelines_cli.py load-rnaseqdenovo --instance-name myinstance --project-name MyProject --species "Latin Name" --species-common-name "common" --project-description "Project description" \
--assembly file=workflows/rnaseqdenovo/data/contigs.fasta software-name=oases software-parameters="" software-version="0.2.06" comments="Transcript assembly" \
--library library-name=brain_400 sample-name=Brain replicat=1 tissue=Brain type=pe insert-size=400 remark="100bp to 400bp insert" sequencer=HiSeq2000 files=workflows/rnaseqdenovo/data/brain_400.fastq.gz \
--alignment file=workflows/rnaseqdenovo/data/brain_400.bam software-name=bwa software-parameters="sampe" software-version="0.9" \
--assembly-annot file=workflows/rnaseqdenovo/data/best_annotation_file.gff3 software-name=blastall software-parameters="-e 10e-10" software-version="2.2.26" comments="Best annotations against swissprot" is-best
At least one BAM file is required, and each BAM file name must match the corresponding library name. If you have computed the expression counts yourself, you can provide a count matrix indexed by contig name; see the --count-matrix option for more details.
## General options
### --library [mandatory]
You can provide FASTQ files, which will be copied to the data directory and made available on the download page. If no FASTQ file is provided, you must set the nb-sequence attribute so that the database can be populated and the histograms shown in the user interface can be computed.
List of available attributes (* marks a mandatory attribute) :
- library-name* : [string] the internal library name, must be unique
- sample-name* : [string] sample name
- replicat* : [int] replicate number
- tissue : [string] tissue
- dev_stage : [string] development stage
- type* : [string] library type, available options :
  - se : single end
  - pe : paired end
  - ose : oriented single end
  - ope : oriented paired end
  - mp : mate pair
- insert-size : [int] for paired end library you can provide the insert size
- remark : [string] any comment
- sequencer : [string] sequencer type
- public : [int] 0 if library is private, 1 if public
- accession : [string] accession number if the library has been published in SRA or ENA
- database : [string] database where the library is stored, e.g. SRA, ENA (if available)
- nb-sequence : [int] number of sequences in the library, can be provided to save the CPU time needed to count them
- files* : [string] fastq file path (if paired, space-separated file names)
If you have several libraries, use the --library option once per library. Example :
--library library-name=brain_400 sample-name=Brain replicat=1 tissue=Brain type=pe insert-size=400 remark="100bp to 400bp insert" \
sequencer=HiSeq2000 files=workflows/rnaseqdenovo/data/brain_400.1.fastq.gz,workflows/rnaseqdenovo/data/brain_400.2.fastq.gz
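Before launching a long load, the attribute list above can be checked with a few lines of Python. This is a minimal sketch, not part of ngspipelines: the function name `library_tokens` is hypothetical, and the mandatory attributes and library types are simply copied from the list above.

```python
# Mandatory attributes and library types as listed in this section.
MANDATORY = ("library-name", "sample-name", "replicat", "type", "files")
LIBRARY_TYPES = ("se", "pe", "ose", "ope", "mp")


def library_tokens(attributes):
    """Check a dict of --library attributes and return key=value tokens."""
    missing = [key for key in MANDATORY if key not in attributes]
    if missing:
        raise ValueError("missing mandatory attribute(s): " + ", ".join(missing))
    if attributes["type"] not in LIBRARY_TYPES:
        raise ValueError("unknown library type: %s" % attributes["type"])
    # One token per attribute, in insertion order; no shell quoting is needed
    # because each token is meant to be passed as a single argv element.
    return ["%s=%s" % (key, value) for key, value in attributes.items()]


demo_tokens = library_tokens({
    "library-name": "brain_400",
    "sample-name": "Brain",
    "replicat": 1,
    "type": "pe",
    "insert-size": 400,
    "remark": "100bp to 400bp insert",
    "files": "workflows/rnaseqdenovo/data/brain_400.fastq.gz",
})
print(["--library"] + demo_tokens)
```

The resulting list can be appended to an argument list handed to `subprocess.run` without a shell, which avoids quoting values such as the remark.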
### --assembly [mandatory]
The assembly option is mandatory. The possible attributes are :
- file* : FASTA file, may be gzip-compressed.
- software-name* : [string] assembly software name
- software-parameters* : [string] assembly software parameters
- software-version* : [string] assembly software version
- comments : [string] any comments on this analysis
Example :
--assembly file=workflows/rnaseqdenovo/data/contigs.fasta software-name=oases software-parameters="" software-version="0.2.06" comments="Transcript assembly"
### --assembly-annot [mandatory]
The annotation option attributes are :
- file* : contig annotation file in GFF3 format, with some special attributes.
- software-name* : [string] name of the software with which the annotation has been produced
- software-parameters* : [string] annotation software parameters
- software-version* : [string] annotation software version
- comments : [string] annotation software comments
- is-best : [bool] to define if the file corresponds to the best annotation file [true|false]
If you have computed several annotations, use the --assembly-annot option once per annotation file. Example :
--assembly-annot file=workflows/rnaseqdenovo/data/best_annotation_file.gff3 software-name=blastall software-parameters="-e 10e-10" \
software-version="2.2.26" comments="Annotations against swissprot"
If you do not provide a best annotation file (one contig per line in this file, see annotation file format), you can specify which annotation source (source column of the GFF3 file) the pipeline must use to compute the best annotation of each contig.
Example :
--assembly-annot file=workflows/rnaseqdenovo/data/annotation_swissprot.gff3 software-name=blastall software-parameters="-e 10e-10" \
software-version="2.2.26" comments="Annotations against swissprot" --best-annotation-source swissprot
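To choose a value for --best-annotation-source, it can help to list the sources actually present in the annotation file. The sketch below relies only on the GFF3 layout (nine tab-separated columns, source in column 2); the function name `gff3_sources` and the sample lines are hypothetical.

```python
def gff3_sources(lines):
    """Count the values of the source column (column 2) of GFF3 lines."""
    counts = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column feature line
        counts[fields[1]] = counts.get(fields[1], 0) + 1
    return counts


# Hypothetical sample content, for illustration only.
demo_gff3 = [
    "##gff-version 3\n",
    "contig_1\tswissprot\tmatch\t1\t100\t.\t+\t.\tID=a1\n",
    "contig_2\tswissprot\tmatch\t5\t80\t.\t-\t.\tID=a2\n",
    "contig_1\trefseq\tmatch\t1\t90\t.\t+\t.\tID=a3\n",
]
demo_sources = gff3_sources(demo_gff3)
print(demo_sources)
```

For a real file, pass an open file handle instead of the list, for example `gff3_sources(open(path))`.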
### --alignment [mandatory]
To provide alignment files (BAM) and the associated analysis information, use the --alignment option. The pipeline will sort and index the BAM files. If the --count-matrix option is not provided, the pipeline computes the expression measurement itself.
Here is the list of attributes of this option :
- file : [string] BAM file; this attribute can be provided several times. The BAM file name must match the library name : library_name.bam
- software-name : [string] name of alignment software
- software-parameters : [string] parameters of alignment software
- software-version : [string] version of alignment software
- comments : [string] any comments on this analysis
Example :
--alignment file=/path/to/lib1.bam file=/path/to/lib2.bam software-name=bwa software-parameters=aln/samse software-version="0.7.2-r351" comments="Library alignment against contigs"
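The naming rule (library_name.bam) is easy to verify before loading. A minimal sketch follows; `unmatched_bams` is a hypothetical helper, not an ngspipelines function, and the second path in the demonstration is made up.

```python
import os


def unmatched_bams(bam_paths, library_names):
    """Return the BAM paths whose base name is not '<library_name>.bam'."""
    known = set(library_names)
    problems = []
    for path in bam_paths:
        stem, extension = os.path.splitext(os.path.basename(path))
        if extension.lower() != ".bam" or stem not in known:
            problems.append(path)
    return problems


demo_problems = unmatched_bams(
    ["workflows/rnaseqdenovo/data/brain_400.bam", "/path/to/liver.bam"],
    ["brain_400"],
)
print(demo_problems)
```

An empty result means every BAM file can be paired with a declared library.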
### --go
A GO (Gene Ontology) file associates GO names, evidence, etc. with each contig.
Example :
--go go.txt
### --keyword
This file contains one line per contig, listing its keywords (tab-separated).
Example:
--keyword keywords.txt
### --count-matrix
The pipeline performs the expression measurement (see above for more information). You can skip this step if you have built your own matrix, by providing it with the --count-matrix option. Contigs are in rows and library counts in columns. The first line must contain the library names. Example :
--count-matrix matrix.txt
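The layout described above (contigs in rows, libraries in columns, library names on the first line) could be produced as sketched below. Two points are assumptions that this page does not state: the tab delimiter and the label placed in the top-left cell. Both are exposed as parameters, and the function name and sample counts are hypothetical.

```python
import csv
import io


def write_count_matrix(counts, libraries, handle, corner="contig", delimiter="\t"):
    """Write counts[contig][library] with contigs in rows, libraries in columns."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    # First line: library names (the corner label is an assumption).
    writer.writerow([corner] + list(libraries))
    for contig in sorted(counts):
        row = [counts[contig].get(library, 0) for library in libraries]
        writer.writerow([contig] + row)


# Hypothetical counts, for illustration only.
demo_counts = {
    "contig_2": {"brain_400": 3},
    "contig_1": {"brain_400": 12, "liver_400": 7},
}
demo_buffer = io.StringIO()
write_count_matrix(demo_counts, ["brain_400", "liver_400"], demo_buffer)
matrix_text = demo_buffer.getvalue()
print(matrix_text)
```

Library names used in the header should be the same as the library-name values declared with --library.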
### --variant
This file contains the variations found on the contigs : SNPs, insertions or deletions. The expected file format is VCF (Variant Call Format). If the VCF file has been produced using GATK, the allelic count per library will be extracted from the VCF file.
Here is the list of attributes of this option :
- file : variant file in VCF format
- software-name : [string] detection software name
- software-parameters : [string] detection software parameters
- software-version : [string] detection software version
- comments : [string] comments on analysis
Example :
--variant file=variant.vcf software-name=GATK software-parameters="realignement/recalibration/glm BOTH" software-version="v2.4-9-g532efad"
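A quick sanity check of the VCF file can be sketched as follows. It relies only on the VCF layout (a `##fileformat=VCF...` first line, then a `#CHROM` header whose columns after FORMAT are sample names). That these sample names correspond to the library names is an assumption, and `vcf_samples` is a hypothetical helper.

```python
def vcf_samples(lines):
    """Return the sample names found on the #CHROM header line of a VCF file."""
    first = True
    for line in lines:
        line = line.rstrip("\r\n")
        if first:
            if not line.startswith("##fileformat=VCF"):
                raise ValueError("first line is not a ##fileformat=VCF header")
            first = False
        if line.startswith("#CHROM"):
            # Columns 1-8 are fixed, column 9 is FORMAT, samples follow.
            return line.split("\t")[9:]
    raise ValueError("no #CHROM header line found")


# Hypothetical header, for illustration only.
demo_vcf = [
    "##fileformat=VCFv4.1\n",
    "##source=demo\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tbrain_400\tliver_400\n",
]
demo_samples = vcf_samples(demo_vcf)
print(demo_samples)
```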
### --variant-annot
Variant annotations must be provided in GFF3 format and include some specific attributes. This option was designed to store SNP annotations, which are usually produced by aligning the contigs against the genome of a closely related species. The alignment position on the genome makes it possible to extract :
- distance to the exon limits
- SNP position in the codon
- consequence (synonymous, stop gained, ...)
- amino acid modification
- related gene
Here is the list of attributes of this option :
- file : annotation file in GFF3 format
- software-name : [string] detection software name
- software-parameters : [string] detection software parameters
- software-version : [string] detection software version
- comments : [string] comments on analysis
- is-best : [bool] defines whether the file corresponds to the best annotation file [true|false]
If you have several annotation files, use this option once per file. Example :
--variant-annot file=workflows/rnaseqdenovo/data/variant_best_annotation.gff software-name=tSNPannot software-parameters="-p blastall -e 10-e10 --species Danio rerio" \
software-version="1" comments="Best annotation of snp" is-best=true
As for contig annotations, if you don't have a best annotation file you can use the option --variant-best-annotation-source. Example :
--variant-annot file=workflows/rnaseqdenovo/data/variant_annotation.gff software-name=tSNPannot software-parameters="-p blastall -e 10-e10 --species Danio rerio" \
software-version="1" comments="Annotation of snp" --variant-best-annotation-source "tSNPannot"
Delete a project
The deleteproject command removes a project from an instance. Example :
python ./bin/ngspipelines_cli.py deleteproject --project-name MyProject
Launch web server
Once you have loaded the data into your project, you can give access to the user interface by launching the instance with the runinstance command. This will start the corresponding web server. Example :
python ./bin/ngspipelines_cli.py runinstance --instance-name myinstance
To stop the web server, pass --command stop to runinstance. Example :
python ./bin/ngspipelines_cli.py runinstance --instance-name myinstance --command stop
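When the start and stop commands are scripted, the argument list can be assembled in Python. This sketch uses only the runinstance flags shown above; the helper name `runinstance_command` and its defaults for the interpreter and script path are assumptions taken from the examples on this page.

```python
def runinstance_command(instance_name, stop=False, python="python",
                        script="./bin/ngspipelines_cli.py"):
    """Build the argument list that starts or stops an instance."""
    argv = [python, script, "runinstance", "--instance-name", instance_name]
    if stop:
        argv += ["--command", "stop"]
    return argv


start_argv = runinstance_command("myinstance")
stop_argv = runinstance_command("myinstance", stop=True)
print(" ".join(start_argv))
print(" ".join(stop_argv))
```

Either list can be handed to `subprocess.run(argv, check=True)` from the ngspipelines installation directory.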
Web-server connection
Once the web server is started, you can access it using its URL. The URL must include the port, separated from the host name by ':'. Example :
http://ngspipelines.toulouse.inra.fr:9000/