Cnvpipelines workflow
---------------------

Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It is built as a Snakemake workflow with a wrapper that eases running jobs.


## In development...

This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with GenomeSTRiP still fails and is disabled for now.


## Requirements

All tools used by the workflow must be available in your PATH (they can be loaded as modules or added to the PATH in the config file; see below):
- delly >= 0.7.7
- lumpy >= 0.2.13
- pindel >= 0.2.5b9
- svtyper >= 0.1.4
- samtools >= 1.4
- bedtools >= 2.26.0
- bcftools >= 1.6
- vcftools >= 0.1.11
- parallel

Required for reference bundle:
- RepeatMasker >= 4.0.6
- exonerate >= 2.4.0
- picard tools >= 2.17.11

For GenomeSTRiP, you must define the `sv_dir` parameter in the configuration (see below).

Requirements for GenomeSTRiP:
- R

Other dependencies:
- python3 >= 3.4
- python 2.7

Python 3 modules required:
- pysam >= 0.14
- pysamstats == master (from repository)
- pybedtools
- numpy
- joblib

Python 3 modules for reference bundle:
- pyfaidx
- biopython

Python 3 modules if DRMAA is used for submissions:
- drmaa


## Installation

Clone this repository:

    git clone git@forgemia.inra.fr:genotoul-bioinfo/cnvpipelines.git
    
Then, copy `application.properties.example` to `application.properties`. The configuration will be edited in a later step (see Configuration below).
We use a special version of svtyper, available here (awaiting a pull request): https://github.com/florealcab/svtyper

## Quick install with conda
All tools except GenomeSTRiP, svtyper and lumpy can be installed via Anaconda or Miniconda.
We tested the installation with the Python 3 version of Miniconda.

### 1. Install miniconda

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    
    sh Miniconda3-latest-Linux-x86_64.sh
    
Then, follow the installer's steps.

### 2. Load base conda environment

    export PATH=$CONDA_HOME/bin:$PATH

Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.

### 3. Create the conda environment

    conda create --name cnv --file requirements.txt

### 4. Load the new conda environment

    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"

Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.
### 5. Additional softwares to install
You must install pysamstats from the master branch of its repository, for compatibility with pysam 0.14 (required by other components of the pipeline):

    pip install git+https://github.com/alimanfoo/pysamstats.git
    
For svtyper, you need the parallel, Python 3 version. For now (awaiting a pull request), install it like this:

    pip install git+https://github.com/florealcab/svtyper.git

You must install [genomestrip](http://software.broadinstitute.org/software/genomestrip/) and [lumpy](https://github.com/arq5x/lumpy-sv) using their own install procedures.
You also need to install RepBase (http://www.girinst.org/server/archive/RepBase21.12/ - choose this version, as more recent ones are not compatible with RepeatMasker 4.0.6), then define the path to its Library folder in the application.properties file (see below).
Special case of the genologin cluster (genotoul):
* Lumpy is already available through the module bioinfo/lumpy-v0.2.13. Just add it to the application.properties file.
* For GenomeSTRiP, you can use this folder: `/usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774` (see the configuration part, `sv_dir` parameter).

### 6. Future logins

For future logins, you must reactivate the conda environment by launching these commands:

    export PATH=$CONDA_HOME/bin:$PATH
    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
    
Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.
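
If you use bash, you can append these lines to your `~/.bashrc` so the environment is loaded at every login (the Miniconda path is illustrative):

    # illustrative Miniconda location; adapt to your install
    export CONDA_HOME=/home/user/miniconda3
    export PATH=$CONDA_HOME/bin:$PATH
    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"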

## Configuration

Configuration should be edited in the `application.properties` file. Sections and parameters are described below.

### Global section

* *batch_system_type*: `local` to run all jobs locally, or `slurm` or `sge` to submit to a cluster, depending on which scheduler you use (default: local).
* *modules*: list of modules to load before launching the workflow, space-separated.
* *paths*: list of paths to add to the global PATH environment variable.
* *jobs*: maximum number of jobs to submit concurrently (default: 999).
* *sv_dir*: absolute path to the `svtoolkit` folder (for GenomeSTRiP).
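
A hypothetical excerpt (key/value syntax and values are illustrative; check `application.properties.example` for the exact format):

    batch_system_type = slurm
    modules = bioinfo/lumpy-v0.2.13
    paths = /home/user/tools/bin
    jobs = 100
    sv_dir = /usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774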

### Cluster section

This section only needs to be filled if you don't use `local` as the batch system type (see above).

* *submission_mode*: `drmaa` to submit jobs through the DRMAA API, `cluster` to submit jobs through shell commands.
* *submission_command*: if you choose `cluster` for `submission_mode`, you must specify the command used to submit jobs (e.g. srun, qsub).
* *drmaa*: if you choose `drmaa` for `submission_mode`, you must specify the absolute path to the DRMAA library on the cluster.
* *native_submission_options*: options passed to the submission command. Should be kept as-is in most cases.
* *config*: absolute path to the config file defining, for each rule, the amount of memory and number of cluster threads to request (you should use the cluster.yaml file as a model). Can be kept as-is.
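
A hypothetical excerpt for a cluster using shell submission (values are illustrative; see `application.properties.example` for the exact format):

    submission_mode = cluster
    submission_command = srun
    config = /path/to/cnvpipelines/cluster.yaml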

### Reference bundle section

* *repeatmasker_lib_path*: path to the RepBase Library folder for RepeatMasker, if required by your RepeatMasker installation
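
A hypothetical example (path is illustrative):

    repeatmasker_lib_path = /path/to/RepBase21.12/Library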


## Run

### Run a new workflow

#### Reference bundle

    ./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
    
With:  
`fasta`: the path to the reference fasta file  
`species`: the species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)  
`working_dir`: the folder in which to store data

##### Optional arguments

`-l`: read length (default: 100)  
`-m`: maximum N-stretch length (default: 100)  
`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the final files (reference.* files).  
`-sc`: soft clean after launch: remove all log files.  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
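
For example, a dry run of the reference bundle (paths and species are illustrative):

    # show the planned rules without running anything
    ./cnvpipelines.py run refbundle -r /path/to/reference.fasta -s "Bos taurus" -w /path/to/wd_refbundle -n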

#### Align Fastq files on the reference

##### Command

    ./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
    
With:  
`fasta`: the path to the reference fasta file  
`samples`: a YAML file describing, for each sample, its name and fastq files (`reads1`, and optionally `reads2`; or `reads` for a single file). Example:
    
    Sample_1:
      reads1: /path/to/reads_1.fq.gz
      reads2: /path/to/reads_2.fq.gz
      
    Sample_2:
      reads: /path/to/reads.fq.gz
Where `Sample_1` and `Sample_2` are the sample names.

`working_dir`: the folder in which to store data

##### Optional arguments

`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the final files (reference.* files).  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
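
For example (illustrative paths), using the YAML samples file described above:

    ./cnvpipelines.py run align -r /path/to/reference.fasta -s samples.yml -w /path/to/wd_align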

#### Detect variants

##### Command

    ./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
    
With:  
`fasta`: the path to the fasta file (with all files of the reference bundle in the same folder).  
`samples`: a file listing, one per line, the paths of the bam files to analyse (see the example below).  
`working_dir`: the folder in which to store data  
`tools`: the list of tools, space-separated
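
For example, the samples file could contain (paths are illustrative):

    /path/to/sample_1.bam
    /path/to/sample_2.bam
    /path/to/sample_3.bam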

##### Optional arguments

`-b INT`: size of batches (default: -1, i.e. a single batch)  
`--chromosomes CHRS`: list of chromosomes to study, comma-separated. Default: all valid chromosomes of the reference  
`-v VARIANTS`: list of variant types to detect, space-separated, among: DEL (deletions), INV (inversions), DUP (duplications) and CNV (copy number variations). Default: all types.  
`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the filtered results.  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
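
For example, to detect deletions and inversions with delly and lumpy on two chromosomes (all values are illustrative):

    ./cnvpipelines.py run detection -r /path/to/reference.fasta -s bams.list -w /path/to/wd_detection -t delly lumpy -v DEL INV --chromosomes chr1,chr2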
                    
#### Merge batches

##### Command

    ./cnvpipelines.py run mergebatches -w {working_dir}
    
With:  
`working_dir`: the output folder of a detection run. For now, it must contain at least 2 batches to run correctly.

##### Optional arguments

`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the filtered results.  
`-sc`: soft clean after launch: remove all log files.  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: replaces the default cluster config file (see above) with a new one.

### Rerun a workflow

    ./cnvpipelines.py rerun -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

##### Optional arguments

`--force-unlock`: unlock a workflow (it may be locked if snakemake crashed or was killed)  
`--rerun-incomplete`: rerun incomplete rules  
`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`--cluster-config FILE`: replaces the default cluster config file (see above) with a new one.  
`-c`: clean after launch  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)
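
For example, to resume an interrupted run (path is illustrative):

    ./cnvpipelines.py rerun -w /path/to/wd_detection --rerun-incomplete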

### Clean a workflow

    ./cnvpipelines.py clean -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

##### Optional arguments

`-f`: fake mode: only show the files and folders to delete, without making any changes.

#### Soft clean

    ./cnvpipelines.py soft-clean -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

##### Optional arguments

`-f`: fake mode: only show the files and folders to delete, without making any changes.

### Unlock a workflow

If snakemake crashes or is killed, the workflow may remain locked. You can unlock it with:

    ./cnvpipelines.py unlock -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

Note: you can also use the `--force-unlock` option of the rerun mode.

## Authors

Thomas Faraut <thomas.faraut@inra.fr>  
Floréal Cabanettes <floreal.cabanettes@inra.fr>