Cnvpipelines workflow
---------------------
Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It is built as a snakemake workflow with a wrapper that eases job submission.
## In development...
This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with genomestrip still fails and is disabled for now.
The cnvpipelines workflow uses a number of open source projects to work properly:
* [snakemake](https://snakemake.readthedocs.io) - workflow management software
* [delly](https://github.com/dellytools/delly) - SV detection tool
* [lumpy](https://github.com/arq5x/lumpy-sv) - SV detection tool
* [pindel](https://github.com/genome/pindel) - SV detection tool
* [svtyper](https://github.com/hall-lab/svtyper) - SV genotyping tool
## Installation
### 1. Clone the repository
```sh
git clone --recursive https://forgemia.inra.fr/genotoul-bioinfo/cnvpipelines.git
```
### 2. Create the configuration file
Copy `application.properties.example` to `application.properties`. The configuration itself will be edited in a later step (see the Configuration section below).
### 3. Install third party software
Third party software is best installed using conda.
```sh
$ cd cnvpipelines
$ conda env create --name cnv --file requirements.yaml
```
### 4. Load new conda environment
```sh
source activate cnv
export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
```
### 6. Additional software to install
You also need to install RepBase (http://www.girinst.org/server/archive/RepBase21.12/ - choose this version, as more recent ones are not compatible with version 4.0.6 of RepeatMasker). Download the RepBase-derived RepeatMasker libraries (repeatmaskerlibraries-20160829.tar.gz) and uncompress the archive in your save folder; it will create a `Library` folder. Then set the path to this `Library` folder in the `application.properties` file (see below).
If you run simulations, you need additional python modules: matplotlib and seaborn. Once your conda environment is loaded, just install them like this:
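```sh
# with the cnv conda environment activated
conda install matplotlib seaborn
```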
Special case of the genologin cluster (genotoul):
* Lumpy is already available through the module bioinfo/lumpy-v0.2.13. Just add it to the `application.properties` file.
* For genomestrip, you can use this folder: `/usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774` (see the configuration part, *sv_dir* entry).
### 7. Future logins
For future logins, you must reactivate the conda environment. This means launching these commands:
```sh
export PATH=$CONDA_HOME/bin:$PATH
source activate cnv
export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
```
Where `$CONDA_HOME` is the folder in which you installed miniconda in a previous step.
## Install for simulations
To run simulations, you need to compile pirs, which is included as a submodule of your cnvpipelines installation. Go into `cnvpipelines/popsim/pirs` and just run:
```sh
make
```
## Configuration
Configuration should be edited in `application.properties` file. Sections and parameters are described below.
### Global section
* *batch_system_type*: `local` to run all jobs locally, or `slurm` or `sge` to submit them to a cluster, depending on your scheduler (default: local).
* *modules*: list of modules to load before launching the workflow, space separated.
* *paths*: list of paths to add to the global PATH environment variable.
* *jobs*: maximum number of jobs to submit concurrently (default: 999).
* *sv_dir*: absolute path to the `svtoolkit` folder (for genomestrip).
### Cluster section
This section must be filled only if you don't use `local` as the batch system type (see above).
* *submission_mode*: `drmaa` to submit jobs through DRMAA API, `cluster` to submit jobs through bash commands.
* *submission_command*: if you choose `cluster` for `submission_mode`, you must specify the command used to submit jobs (e.g.: srun, qsub).
* *drmaa*: if you choose `drmaa` for `submission_mode`, you must specify the absolute path to the DRMAA library on the cluster.
* *native_submission_options*: options passed to the submission command. Should be kept as-is in most cases.
* *config*: absolute path to the config file defining, for each rule, the amount of memory and cluster threads to request (you should use the cluster.yaml file as a model). Can be kept as-is.
### Reference bundle section
* *repeatmasker_lib_path*: path to the RepBase Library folder for RepeatMasker, if needed by your RepeatMasker installation.
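For illustration only, a filled-in `application.properties` might look like the sketch below. The section and key names follow the descriptions above, and all paths and values are placeholders; check `application.properties.example` for the exact layout:
```ini
[global]
batch_system_type = slurm
jobs = 999
sv_dir = /usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774

[cluster]
submission_mode = drmaa
drmaa = /path/to/libdrmaa.so
config = /path/to/cluster.yaml

[refbundle]
repeatmasker_lib_path = /path/to/Library
```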
## Run the workflow
#### Build the reference bundle
##### Command
```sh
./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
```
With:
`fasta`: the path of the reference fasta file
`species`: the species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)
`working_dir`: the folder in which to store data
##### Optional arguments
`-l`: read length (default: 100)
`-m`: maximum n-stretches length (default: 100)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only final files (reference.* files).
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
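For example, a hypothetical run (file and folder names are placeholders):
```sh
./cnvpipelines.py run refbundle -r genome.fa -s "Bos taurus" -w refbundle_wd
```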
#### Align Fastq files on the reference
##### Command
```sh
./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
```
With:
`fasta`: the path of the reference fasta file
`samples`: a YAML file describing, for each sample, its name and fastq files (reads1, and optionally reads2). Example:
```yaml
Sample_1:
    reads1: /path/to/reads_1.fq.gz
    reads2: /path/to/reads_2.fq.gz
Sample_2:
    reads: /path/to/reads.fq.gz
```
Where `Sample_1` and `Sample_2` are sample names.
`working_dir`: the folder in which to store data
##### Optional arguments
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only final files (reference.* files).
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
#### Detect variants
##### Command
```sh
./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
```
With:
`fasta`: the path to the fasta file (with all files of the reference bundle in the same folder).
`samples`: a file listing, one path per line, the bam files to analyse (see the example after this list).
`working_dir`: the folder in which to store data.
`tools`: list of tools, space separated. Choose among genomestrip, delly, lumpy and pindel.
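A minimal samples file might look like this (paths are placeholders):
```
/path/to/sample_1.bam
/path/to/sample_2.bam
```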
##### Optional arguments
`-b INT`: size of batches (default: -1, i.e. always make a single batch)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
`-v VARIANTS`: list of variant types to detect, space separated among: DEL (deletions), INV (inversions), DUP (duplications) and mCNV (copy number variations). Default: all types.
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only filtered results.
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
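For example, a hypothetical run detecting deletions and inversions with two tools (file and folder names are placeholders):
```sh
./cnvpipelines.py run detection -r reference.fasta -s bams.list -w detection_wd -t delly lumpy -v DEL INV
```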
#### Merge batches
##### Command
```sh
./cnvpipelines.py run mergebatches -w {working_dir}
```
With:
`working_dir`: the detection run output folder. For now, it must contain at least 2 batches to run correctly.
##### Optional arguments
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config`: override the default cluster config file (see above) with a new one.
#### Simulate a population
Popsim, our tool to simulate a population with variants (DEL and/or INV), is integrated into cnvpipelines. With the simulation workflow, you generate such a population, launch the detection, and then compare the detected results to the true ones, with a summary HTML or Jupyter file.
##### Command
```sh
./cnvpipelines.py run simulation -nb {nb_inds} -r {reference} -sp {species} -t {tools} -w {working_dir}
```
With:
`nb_inds`: number of individuals to generate.
`reference`: fasta file to use as the reference for the simulated individuals.
`species`: species of the reference (required only if you use genomestrip for detection).
`tools`: tools to use for detection, space separated. Choose among genomestrip, delly, lumpy and pindel.
`working_dir`: the folder in which to store data.
##### Description of variants
`-s {svlist}`: a file describing the size distributions of variants. If not given, a default distribution is used.
Structure of the file (tab separated columns):
```
DEL minLength maxLength proba    -> create DELetion(s)
DUP minLength maxLength proba    -> create tandem DUPlication(s)
INV minLength maxLength proba    -> create in-place INVersion(s)
```
`minLength` and `maxLength` are integers. `proba` is the probability that the variant has a size in the given range, between 0 and 1 (e.g. 0.7).
For each SV type, the sum of the probabilities must be equal to 1.
Use one line per SV type, as above; you can add several lines for the same SV type.
Example:
| Variant | Min | Max | Proba | Cumul. proba * |
|:-------:|:---:|:---:|:-----:|:--------------:|
| DEL | 100 | 200 | 0.7 | 0.7 |
| DEL | 200 | 500 | 0.2 | 0.9 |
| DEL | 500 | 1000| 0.07 | 0.97 |
| DEL | 1000| 2000| 0.02 | 0.99 |
| DEL | 2000|10000| 0.01 | 1.0 |
\* This column must not be set in the file (it is computed automatically).
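The distribution in the table above corresponds to a `svlist` file like this (columns are tab separated):
```
DEL	100	200	0.7
DEL	200	500	0.2
DEL	500	1000	0.07
DEL	1000	2000	0.02
DEL	2000	10000	0.01
```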
##### Recommended options
`-ns {nstretches}`: positions of N stretches. Variants will be generated away from them.
`-fp`: force polymorphism. For each variant, the genotype will be different for at least one individual.
##### Optional arguments
`-cv {coverage}`: coverage to use for generated reads of simulated individuals (default: 15).
`-a`: to generate haploid individuals (default: diploid).
`-pd {proba_del}`: probability to have a deletion (default: 1e-06).
`-pi {proba_inv}`: probability to have an inversion (default: 1e-06).
`-l {read_len}`: generate reads of the specified length (default: 100).
`-m {insert_len_mean}`: generate inserts (fragments) with the specified average length (default: 300).
`-v {insert_len_sd}`: standard deviation of the insert (fragment) length, as a percentage (default: 30).
`-md {min-del}`: Minimum number of deletions to generate (default: 1).
`-mi {min-inv}`: Minimum number of inversions to generate (default: 1).
`--max-try {nb}`: number of tries to reach the minimum values above. If it still fails, the program exits with an error (default: 10).
`-g {file}`: provide a genotypes VCF file with variant positions and genotypes per individual. If given, only the genome and read generation is done.
`-mn {int}`: maximum size of N stretches to treat them as such (for refbundle, only if genomestrip is in tools) (default: 100).
`--overlap-cutoff {float}`: cutoff for the reciprocal overlap between detected variants and true variants. Above this value, they are considered the same variant (default: 0.5).
`--left-precision {int}`: left breakpoint precision. -1 to ignore (default: -1)
`--right-precision {int}`: right breakpoint precision. -1 to ignore (default: -1)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only filtered results.
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config`: override the default cluster config file (see above) with a new one.
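For example, a hypothetical simulation of 10 individuals detected with two tools (file and folder names are placeholders):
```sh
./cnvpipelines.py run simulation -nb 10 -r reference.fasta -sp "Bos taurus" -t delly lumpy -w sim_wd
```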
### Rerun a workflow
##### Command
```sh
./cnvpipelines.py rerun -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
##### Optional arguments
`--force-unlock`: unlock the workflow (it may be locked if snakemake crashed or was killed)
`--rerun-incomplete`: rerun incomplete rules
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`--cluster-config FILE`: override the default cluster config file (see above) with a new one.
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
### Clean a workflow
#### Full clean
```sh
./cnvpipelines.py clean -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
##### Optional arguments
`-f`: fake mode: only show the files and folders to delete, without making any changes.
#### Soft clean
```sh
./cnvpipelines.py soft-clean -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
##### Optional arguments
`-f`: fake mode: only show the files and folders to delete, without making any changes.
### Unlock a workflow
If snakemake crashes or is killed, the workflow may remain locked. You can unlock it with:
```sh
./cnvpipelines.py unlock -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
Note: you can also use the `--force-unlock` option of the rerun mode.
## Authors
Thomas Faraut <thomas.faraut@inra.fr>
Floréal Cabanettes <floreal.cabanettes@inra.fr>