Cnvpipelines workflow
---------------------

Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It is built as a Snakemake workflow with a wrapper that eases running jobs.


## In development...

This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with GenomeSTRiP still fails and is disabled for now.


## Requirements

All tools used by the workflow must be available in your PATH (they can be loaded as modules or added to the PATH in the config file; see below):
- delly >= 0.7.7
- lumpy >= 0.2.13
- pindel >= 0.2.5b9
- svtyper >= 0.1.4
- samtools >= 1.4
- bedtools >= 2.26.0
- bcftools >= 1.6
- vcftools >= 0.1.11
- parallel

Required for reference bundle:
- RepeatMasker >= 4.0.6
- exonerate >= 2.4.0
- picard tools >= 2.17.11

For GenomeSTRiP, you must define the `sv_dir` parameter in the configuration (see below).

Requirements for GenomeSTRiP:
- R

Other dependencies:
- python3 >= 3.4
- python 2.7

Python 3 modules required:
- pysam >= 0.14
- pysamstats == master (from repository)
- pybedtools
- numpy
- joblib

Python 3 modules for reference bundle:
- pyfaidx
- biopython

Python 3 modules if DRMAA is used for submissions:
- drmaa


## Installation

Clone this repository:

    git clone git@forgemia.inra.fr:genotoul-bioinfo/cnvpipelines.git
    
Then, copy `application.properties.example` to `application.properties`. The configuration will be edited in a later step (see Configuration below).
We use a special version of svtyper, available here (awaiting a pull request): https://github.com/florealcab/svtyper

## Quick install with conda
All tools except GenomeSTRiP, svtyper and lumpy can be installed via Anaconda or Miniconda.
We tested the installation with the Python 3 version of Miniconda.

### 1. Install miniconda

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    
    sh Miniconda3-latest-Linux-x86_64.sh
    
Then, follow the installer's steps.

### 2. Load base conda environment

    export PATH=$CONDA_HOME/bin:$PATH

Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.

### 3. Create the conda environment

    conda create --name cnv --file requirements.txt

### 4. Load the new conda environment

    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"

Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.
### 5. Additional softwares to install
You must install pysamstats from the master branch of its repository, for compatibility with pysam 0.14 (required by other components of the pipeline):

    pip install git+https://github.com/alimanfoo/pysamstats.git
    
For svtyper, you need the parallel, Python 3 version. For now (awaiting a pull request), install it like this:

    pip install git+https://github.com/florealcab/svtyper.git

You must install [genomestrip](http://software.broadinstitute.org/software/genomestrip/) and [lumpy](https://github.com/arq5x/lumpy-sv) using their own install procedures.
You also need to install RepBase (http://www.girinst.org/server/archive/RepBase21.12/ - choose this version, as more recent ones are not compatible with RepeatMasker 4.0.6), then define the path to its Library folder in the application.properties file (see below).
Special case of the genologin cluster (genotoul):
* Lumpy is already available through the module bioinfo/lumpy-v0.2.13. Just add it to the application.properties file.
* For GenomeSTRiP, you can use this folder: `/usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774` (see the configuration part, `sv_dir` parameter).

### 6. Future logins

For future logins, you must reactivate the conda environment by launching these commands:

    export PATH=$CONDA_HOME/bin:$PATH
    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
    
Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.
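
If you use bash, you can append these lines to your `~/.bashrc` so the environment is loaded at every login (the Miniconda path is illustrative):

    # illustrative Miniconda location; adapt to your install
    export CONDA_HOME=/home/user/miniconda3
    export PATH=$CONDA_HOME/bin:$PATH
    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"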

## Configuration

Configuration should be edited in the `application.properties` file. Sections and parameters are described below.

### Global section

* *batch_system_type*: `local` to run all jobs locally, or `slurm` or `sge` to submit to a cluster, depending on which scheduler you use (default: local).
* *modules*: list of modules to load before launching the workflow, space-separated.
* *paths*: list of paths to add to the global PATH environment variable.
* *jobs*: maximum number of jobs to submit concurrently (default: 999).
* *sv_dir*: absolute path to the `svtoolkit` folder (for GenomeSTRiP).
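
A hypothetical excerpt (key/value syntax and values are illustrative; check `application.properties.example` for the exact format):

    batch_system_type = slurm
    modules = bioinfo/lumpy-v0.2.13
    paths = /home/user/tools/bin
    jobs = 100
    sv_dir = /usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774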

### Cluster section

This section only needs to be filled if you don't use `local` as the batch system type (see above).

* *submission_mode*: `drmaa` to submit jobs through the DRMAA API, `cluster` to submit jobs through shell commands.
* *submission_command*: if you choose `cluster` for `submission_mode`, you must specify the command used to submit jobs (e.g. srun, qsub).
* *drmaa*: if you choose `drmaa` for `submission_mode`, you must specify the absolute path to the DRMAA library on the cluster.
* *native_submission_options*: options passed to the submission command. Should be kept as-is in most cases.
* *config*: absolute path to the config file defining, for each rule, the amount of memory and number of cluster threads to request (you should use the cluster.yaml file as a model). Can be kept as-is.
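
A hypothetical excerpt for a cluster using shell submission (values are illustrative; see `application.properties.example` for the exact format):

    submission_mode = cluster
    submission_command = srun
    config = /path/to/cnvpipelines/cluster.yaml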

### Reference bundle section

* *repeatmasker_lib_path*: path to the RepBase Library folder for RepeatMasker, if required by your RepeatMasker installation
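
A hypothetical example (path is illustrative):

    repeatmasker_lib_path = /path/to/RepBase21.12/Library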


## Run

### Run a new workflow

#### Reference bundle

    ./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
    
With:  
`fasta`: the path to the reference fasta file  
`species`: the species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)  
`working_dir`: the folder in which to store data

##### Optional arguments

`-l`: read length (default: 100)  
`-m`: maximum N-stretch length (default: 100)  
`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the final files (reference.* files).  
`-sc`: soft clean after launch: remove all log files.  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
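
For example, a dry run of the reference bundle (paths and species are illustrative):

    # show the planned rules without running anything
    ./cnvpipelines.py run refbundle -r /path/to/reference.fasta -s "Bos taurus" -w /path/to/wd_refbundle -n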

#### Align Fastq files on the reference

##### Command

    ./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
    
With:  
`fasta`: the path to the reference fasta file  
`samples`: a YAML file describing, for each sample, its name and fastq files (`reads1`, and optionally `reads2`; or `reads` for a single file). Example:
    
    Sample_1:
      reads1: /path/to/reads_1.fq.gz
      reads2: /path/to/reads_2.fq.gz
      
    Sample_2:
      reads: /path/to/reads.fq.gz
Where `Sample_1` and `Sample_2` are the sample names.

`working_dir`: the folder in which to store data

##### Optional arguments

`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the final files (reference.* files).  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
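
For example (illustrative paths), using the YAML samples file described above:

    ./cnvpipelines.py run align -r /path/to/reference.fasta -s samples.yml -w /path/to/wd_align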

#### Detect variants

##### Command

    ./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
    
With:  
`fasta`: the path to the fasta file (with all files of the reference bundle in the same folder).  
`samples`: a file listing, one per line, the paths of the bam files to analyse (see the example below).  
`working_dir`: the folder in which to store data  
`tools`: the list of tools, space-separated
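
For example, the samples file could contain (paths are illustrative):

    /path/to/sample_1.bam
    /path/to/sample_2.bam
    /path/to/sample_3.bam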

##### Optional arguments

`-b INT`: size of batches (default: -1, i.e. a single batch)  
`--chromosomes CHRS`: list of chromosomes to study, comma-separated. Default: all valid chromosomes of the reference  
`-v VARIANTS`: list of variant types to detect, space-separated, among: DEL (deletions), INV (inversions), DUP (duplications) and CNV (copy number variations). Default: all types.  
`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the filtered results.  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
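
For example, to detect deletions and inversions with delly and lumpy on two chromosomes (all values are illustrative):

    ./cnvpipelines.py run detection -r /path/to/reference.fasta -s bams.list -w /path/to/wd_detection -t delly lumpy -v DEL INV --chromosomes chr1,chr2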
                    
#### Merge batches

##### Command

    ./cnvpipelines.py run mergebatches -w {working_dir}
    
With:  
`working_dir`: the output folder of a detection run. For now, it must contain at least 2 batches to run correctly.

##### Optional arguments

`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`-c`: clean after launch: keep only the filtered results.  
`-sc`: soft clean after launch: remove all log files.  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)  
`--cluster-config FILE`: replaces the default cluster config file (see above) with a new one.

### Rerun a workflow

    ./cnvpipelines.py rerun -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

##### Optional arguments

`--force-unlock`: unlock a workflow (it may be locked if snakemake crashed or was killed)  
`--rerun-incomplete`: rerun incomplete rules  
`-p`: for each rule, show the shell command that is run.  
`-n`: dry run: show which rules will be launched, without running anything.  
`--keep-wdir`: in dry-run mode, don't remove the working dir after launch  
`--cluster-config FILE`: replaces the default cluster config file (see above) with a new one.  
`-c`: clean after launch  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: only run the workflow up to the rule producing the given output file (the whole workflow is run if not specified)
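
For example, to resume an interrupted run (path is illustrative):

    ./cnvpipelines.py rerun -w /path/to/wd_detection --rerun-incomplete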

### Clean a workflow

    ./cnvpipelines.py clean -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

##### Optional arguments

`-f`: fake mode: only show the files and folders to delete, without making any changes.

#### Soft clean

    ./cnvpipelines.py soft-clean -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

##### Optional arguments

`-f`: fake mode: only show the files and folders to delete, without making any changes.

### Unlock a workflow

If snakemake crashes or is killed, the workflow may remain locked. You can unlock it with:

    ./cnvpipelines.py unlock -w {working_dir}
    
With:  
`working_dir`: the folder where data is stored

Note: you can also use the `--force-unlock` option of the rerun mode.

## Authors

Thomas Faraut <thomas.faraut@inra.fr>  
Floréal Cabanettes <floreal.cabanettes@inra.fr>