Cnvpipelines workflow
---------------------
Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It is built as a Snakemake workflow with a wrapper that eases running jobs.
## In development...
This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with genomestrip still fails and is disabled for now.
## Requirements
All tools used by the workflow must be available in your PATH (they can be loaded as modules or added to the PATH in the config file, see below); a quick sanity check is sketched after the list:
- delly >= 0.7.7
- lumpy >= 0.2.13
- pindel >= 0.2.5b9
- svtyper >= 0.1.4
- samtools >= 1.4
- bedtools >= 2.26.0
- bcftools >= 1.6
- vcftools >= 0.1.11
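A minimal sketch of such a check (the executable names below are assumptions and may differ in your install, e.g. `lumpyexpress` instead of `lumpy`):

```bash
# Report which of the main tools are currently visible in PATH.
for tool in delly lumpy pindel svtyper samtools bedtools bcftools vcftools; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $(command -v "$tool")"
    else
        echo "$tool: NOT FOUND"
    fi
done
```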
Required for reference bundle:
- RepeatMasker >= 4.0.6
- exonerate >= 2.4.0
- picard tools >= 2.17.11
For genomestrip, you must define the `sv_dir` parameter in the configuration (see below).
Other dependencies:
- python3 >= 3.4
- python 2.7
Python 3 modules required:
- pysam >= 0.14
- pysamstats == master (from repository)
Python 3 modules for reference bundle:
- pyfaidx
- biopython
Python 3 modules if DRMAA used for submissions:
- drmaa
## Installation
Clone this repository:
git clone git@forgemia.inra.fr:genotoul-bioinfo/cnvpipelines.git
Then, copy `application.properties.example` to `application.properties`. The configuration will be edited in the next step.
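For example (assuming the repository was cloned into the default `cnvpipelines` folder):

```bash
cd cnvpipelines
cp application.properties.example application.properties
```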
We use a special version of svtyper available here (awaiting pull request): https://github.com/florealcab/svtyper
## Quick install with conda
All tools except Genomestrip, svtyper and lumpy can be installed via anaconda or miniconda.
We tested the installation with the Python 3 version of Miniconda.
### 1. Install miniconda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
Then, follow the installer's instructions.
### 2. Load base conda environment
export PATH=$CONDA_HOME/bin:$PATH
Where `$CONDA_HOME` is the folder in which you installed miniconda in the previous step.
### 3. Create the conda environment
conda create --name cnv --file requirements.txt
### 4. Load new conda environment
source activate cnv
export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
Where `$CONDA_HOME` is the folder in which you installed miniconda in the previous step.
### 5. Additional softwares to install
You must install pysamstats from the master branch of its repository to be compatible with pysam 0.14 (required by other components of the pipeline):
pip install git+https://github.com/alimanfoo/pysamstats.git
For svtyper, you need the parallel, Python 3 version. For now (while the pull request is pending), install it like this:
pip install git+https://github.com/florealcab/svtyper.git
You must install [genomestrip](http://software.broadinstitute.org/software/genomestrip/) and [lumpy](https://github.com/arq5x/lumpy-sv) using their install procedure.
You also need to install RepBase (http://www.girinst.org/server/archive/RepBase21.12/ - choose this version, as more recent ones are not compatible with RepeatMasker 4.0.6), then define the path to the Library folder inside the application.properties file (see below).
Special case of the genologin cluster (genotoul):
* Lumpy is already available through the `bioinfo/lumpy-v0.2.13` module. Just add it to the application.properties file.
* For genomestrip, you can use this folder: `/usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774` (see the configuration section, *sv_dir* parameter)
### 6. Future logins
For future logins, you must reactivate the conda environment. This means running these commands:
export PATH=$CONDA_HOME/bin:$PATH
source activate cnv
export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
Where `$CONDA_HOME` is the folder in which you installed miniconda in the previous step.
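If you want this to happen automatically at login, one option (a sketch assuming a bash shell; replace `/path/to/miniconda3` with your actual install location) is to append the commands to your `~/.bashrc`:

```bash
# Append the environment setup to ~/.bashrc (adjust the miniconda path first).
cat >> ~/.bashrc <<'EOF'
export PATH=/path/to/miniconda3/bin:$PATH
source activate cnv
export PERL5LIB="/path/to/miniconda3/envs/cnv/lib/perl5"
EOF
```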
## Configuration
Configuration should be edited in `application.properties` file. Sections and parameters are described below.
### Global section
* *batch_system_type*: `local` to run all jobs locally, or `slurm` or `sge` to submit them to a cluster (depending on your scheduler) (default: local).
* *modules*: list of modules to load before launching the workflow, space separated.
* *paths*: list of paths to add to the global PATH environment variable.
* *jobs*: maximum number of jobs to submit concurrently (default: 999).
* *sv_dir*: absolute path to the `svtoolkit` folder (for genomestrip).
### Cluster section
This section only needs to be filled in if you don't use `local` as the batch system type (see above).
* *submission_mode*: `drmaa` to submit jobs through DRMAA API, `cluster` to submit jobs through bash commands.
* *submission_command*: if you choose `cluster` for `submission_mode`, you must specify the command used to submit jobs (e.g.: srun, qsub).
* *drmaa*: if you choose `drmaa` for `submission_mode`, you must specify the absolute path to the DRMAA library on the cluster.
* *native_submission_options*: options passed to the submission command. Can usually be kept as is.
* *config*: absolute path to the config file defining, for each rule, the amount of memory and cluster threads to request (you should use the cluster.yaml file as a model). Can be kept as is.
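As a purely illustrative sketch, a slurm setup could combine these parameters roughly as follows. The section names and exact syntax below are assumptions; `application.properties.example` is the authoritative reference for the real layout:

```ini
# Hypothetical sketch only; follow application.properties.example for the real
# section names, keys and default values.
[global]
batch_system_type = slurm
modules = bioinfo/samtools-1.4 bioinfo/bcftools-1.6
paths = /path/to/lumpy/bin
jobs = 100
sv_dir = /path/to/svtoolkit

[cluster]
submission_mode = drmaa
drmaa = /usr/lib/slurm-drmaa/lib/libdrmaa.so
# native_submission_options and config can usually keep the values shipped in the example file
config = /path/to/cnvpipelines/cluster.yaml
```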
### Reference bundle section
* *repeatmasker_lib_path*: path to the RepBase Library folder for RepeatMasker, if required by your RepeatMasker installation
## Run the pipeline
### Run a workflow
#### Build the reference bundle
##### Command
./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
With:
`fasta`: the path to the reference fasta file
`species`: the species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)
`working_dir`: the folder in which to store data
##### Optional arguments
`-l`: read length (default: 100)
`-m`: maximum length of N-stretches (default: 100)
`-p`: for each rule, show the shell command being run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch
`-c`: clean after launch: keep only the final files (reference.* files).
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify the output rule file to run the workflow only until the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
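For example (hypothetical paths; the species name follows the NCBI Taxonomy spelling), a dry run followed by the real run:

```bash
# Preview the rules that would run, then launch the reference bundle build.
./cnvpipelines.py run refbundle -r genome.fasta -s "Bos taurus" -w work/refbundle -n
./cnvpipelines.py run refbundle -r genome.fasta -s "Bos taurus" -w work/refbundle
```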
#### Align Fastq files on the reference
##### Command
./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
With:
`fasta`: the path to the reference fasta file
`samples`: a YAML file describing, for each sample, its name and fastq files (reads1, and optionally reads2). Example:
Sample_1:
  reads1: /path/to/reads_1.fq.gz
  reads2: /path/to/reads_2.fq.gz
Sample_2:
  reads: /path/to/reads.fq.gz
Where `Sample_1` and `Sample_2` are the sample names.
`working_dir`: the folder in which to store data
##### Optional arguments
`-p`: for each rule, show the shell command being run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch
`-c`: clean after launch: keep only the final files (reference.* files).
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically
`--out-step STEP`: specify the output rule file to run the workflow only until the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
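For example (hypothetical paths; `samples.yml` follows the layout shown above):

```bash
# Dry run first, then align all samples described in samples.yml.
./cnvpipelines.py run align -r reference.fasta -s samples.yml -w work/align -n
./cnvpipelines.py run align -r reference.fasta -s samples.yml -w work/align
```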
#### Detect variants
##### Command
./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
With:
`fasta`: the path to the reference fasta file (with all files of the reference bundle in the same folder).
`samples`: a file listing, one per line, the path to each BAM file to analyse.
`tools`: the SV detection tools to use (e.g. delly, lumpy, pindel).
`working_dir`: the folder in which to store data
##### Optional arguments
`-b INT`: size of batches (default: -1, to always make only one batch)
`--chromosomes CHRS`: list of chromosomes to study, comma separated. Default: all valid chromosomes of the reference
`-v VARIANTS`: list of variant types to detect, space separated, among: DEL (deletions), INV (inversions), DUP (duplications) and CNV (copy number variations). Default: all types.
`-p`: for each rule, show the shell command being run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch
`-c`: clean after launch: keep only the filtered results.
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically
`--out-step STEP`: specify the output rule file to run the workflow only until the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
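For example (hypothetical paths; the exact formatting of the `-t` tool list is an assumption, following the space-separated style of `-v`):

```bash
# Build a sample list with one BAM path per line, then detect deletions and inversions.
printf '%s\n' work/align/Sample_1.bam work/align/Sample_2.bam > bams.list
./cnvpipelines.py run detection -r reference.fasta -s bams.list -w work/detection -t delly lumpy pindel -v DEL INV
```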
#### Merge batches
##### Command
./cnvpipelines.py run mergebatches -w {working_dir}
With:
`working_dir`: the output folder of the detection run. For now, it must contain at least 2 batches to run correctly.
##### Optional arguments
`-p`: for each rule, show the shell command being run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify the output rule file to run the workflow only until the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: overrides the default cluster config file (see above) with a new one.
### Rerun a workflow
./cnvpipelines.py rerun -w {working_dir}
With:
`working_dir`: the folder in which data is stored
##### Optional arguments
`--force-unlock`: unlock the workflow (a workflow may be locked if snakemake crashed or was killed)
`--rerun-incomplete`: rerun incomplete rules
`-p`: for each rule, show the shell command being run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch
`--cluster-config FILE`: overrides the default cluster config file (see above) with a new one.
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically
`--out-step STEP`: specify the output rule file to run the workflow only until the associated rule (the whole workflow is run if not specified)
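For example, to resume a crashed detection run (hypothetical working dir), rerunning any incomplete rules and printing the shell commands:

```bash
./cnvpipelines.py rerun -w work/detection --rerun-incomplete -p
```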
### Clean a workflow
#### Clean
./cnvpipelines.py clean -w {working_dir}
With:
`working_dir`: the folder in which data is stored
##### Optional arguments
`-f`: fake mode: only show the files and folders that would be deleted, without making any changes.
#### Soft clean
./cnvpipelines.py soft-clean -w {working_dir}
With:
`working_dir`: the folder in which data is stored
##### Optional arguments
`-f`: fake mode: only show the files and folders that would be deleted, without making any changes.
### Unlock a workflow
If snakemake crashed or was killed, the workflow may be locked. You can unlock it with:
./cnvpipelines.py unlock -w {working_dir}
With:
`working_dir`: the folder in which data is stored
Note: you can also use the `--force-unlock` option of the rerun mode.
## Authors
Thomas Faraut <thomas.faraut@inra.fr>
Floréal Cabanettes <floreal.cabanettes@inra.fr>