Cnvpipelines workflow
---------------------

Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It is built as a Snakemake workflow with a wrapper that eases job execution.


## In development...

This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with genomestrip still fails and is disabled for now.


## Requirements

All tools used by the workflow must be available in your PATH (they can be loaded as modules or added to the PATH in the configuration file, see below); a quick availability check is sketched after this list:
- delly >= 0.7.7
- lumpy >= 0.2.13
- pindel >= 0.2.5b9
- svtyper >= 0.1.4
- samtools >= 1.4
- bedtools >= 2.26.0
- bcftools >= 1.6
- vcftools >= 0.1.11
- parallel
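
A quick way to check that these tools are visible is sketched below (this assumes the binary names match the tool names, which may differ for some installations, e.g. lumpy):

    # report any required tool not found on PATH
    for tool in delly lumpy pindel svtyper samtools bedtools bcftools vcftools parallel; do
        command -v "$tool" > /dev/null || echo "missing from PATH: $tool"
    done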

Required for reference bundle:
- RepeatMasker >= 4.0.6
- exonerate >= 2.4.0
- picard tools >= 2.17.11

For genomestrip, you must define the `sv_dir` parameter in the configuration (see below).

Requirements for genomestrip:
- R 

Other dependencies:
- python3 >= 3.4
- python 2.7

Python 3 modules required:
- pysam >= 0.14
- pysamstats == master (from repository)
- pybedtools
- numpy
- joblib

Python 3 modules for reference bundle:
- pyfaidx
- biopython

Python 3 modules required for simulation:
- matplotlib==2.2.*
- seaborn==0.8.*

Python 3 modules required if DRMAA is used for job submission:
- drmaa
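
If you do not use the conda environment described in the installation section below, the Python 3 modules listed above can typically be installed with pip, for example (pysamstats has to come from its git master branch, as explained below):

    pip install "pysam>=0.14" pybedtools numpy joblib pyfaidx biopython drmaa "matplotlib==2.2.*" "seaborn==0.8.*"
    pip install git+https://github.com/alimanfoo/pysamstats.git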


## Installation

Clone this repository:

    git clone --recursive git@forgemia.inra.fr:genotoul-bioinfo/cnvpipelines.git
    
Then, copy `application.properties.example` to `application.properties`; the configuration will be edited in a later step (see Configuration below).
We use a patched version of svtyper, available here while the pull request is pending: https://github.com/florealcab/svtyper
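
For example, from the root of the cloned repository:

    cd cnvpipelines
    cp application.properties.example application.properties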

## Quick install with conda
All tools except genomestrip, svtyper and lumpy can be installed via Anaconda or Miniconda.
We tested the installation with the Python 3 version of Miniconda.

### 1. Install miniconda

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    
    sh Miniconda3-latest-Linux-x86_64.sh
    
Then, follow the installer steps.

### 2. Load base conda environment

    export PATH=$CONDA_HOME/bin:$PATH

Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.

### 3. Create the conda environment

    conda create --name cnv --file requirements.txt

### 4. Load the new conda environment

    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"

Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.
### 5. Additional softwares to install
You must install pysamstats from the master branch of its repository to be compatible with pysam 0.14 (required by other components of the pipeline):

    pip install git+https://github.com/alimanfoo/pysamstats.git
    
For svtyper, you need the parallel, Python 3 version. For now (while the pull request is pending), install it like this:

    pip install git+https://github.com/florealcab/svtyper.git

You must install [genomestrip](http://software.broadinstitute.org/software/genomestrip/) and [lumpy](https://github.com/arq5x/lumpy-sv) following their own installation procedures.
You also need to install RepBase (http://www.girinst.org/server/archive/RepBase21.12/ - choose this version, as more recent ones are not compatible with RepeatMasker 4.0.6), then define the path to the Library folder in the application.properties file (see below).
If you run simulations, you need additional Python modules: matplotlib and seaborn. Once your conda environment is loaded, install them as follows:

    pip install matplotlib seaborn

Special case of the genologin cluster (genotoul):
* Lumpy is already available through the module bioinfo/lumpy-v0.2.13; just add it to the modules list in the application.properties file (see the sketch below).
* For genomestrip, you can use this folder: `/usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774` (see the `sv_dir` parameter in the Configuration section)
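
A minimal sketch of the corresponding `application.properties` entries on genologin (the key syntax follows the parameters described in the Configuration section; adapt it to your copy of `application.properties.example`):

    modules = bioinfo/lumpy-v0.2.13
    sv_dir = /usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774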

### 6. Future logins

For future logins, you must reactivate the conda environment, which means launching these commands:

    export PATH=$CONDA_HOME/bin:$PATH
    source activate cnv
    export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
    
Where `$CONDA_HOME` is the folder in which you installed Miniconda in the previous step.


## Install for simulations

To run simulations, you need to compile pirs, which is included as a submodule of your cnvpipelines installation. Go into `cnvpipelines/popsim/pirs` and run:

    make

   
## Configuration

Configuration is edited in the `application.properties` file. Sections and parameters are described below; an example sketch follows the last section.

### Global section

* *batch_system_type*: `local` to run all jobs locally, or `slurm` or `sge` to submit them on a cluster, depending on your scheduler (default: local).
* *modules*: list of modules to load before launching the workflow, space separated.
* *paths*: list of paths to add to the global PATH environment variable.
* *jobs*: maximum number of jobs to submit concurrently (default: 999).
* *sv_dir*: absolute path to the `svtoolkit` folder (for genomestrip).

### Cluster section

This section only needs to be filled in if you do not use `local` as the batch system type (see above).

* *submission_mode*: `drmaa` to submit jobs through DRMAA API, `cluster` to submit jobs through bash commands.
* *submission_command*: if you choose `cluster` for `submission_mode`, you must specify the command used to submit jobs (e.g.: srun, qsub).
* *drmaa*: if you choose `drmaa` for `submission_mode`, you must specify the absolute path to the DRMAA library on the cluster.
* *native_submission_options*: options passed to the submission command. Can usually be kept as is.
* *config*: absolute path to the config file defining, for each rule, the amount of memory and number of cluster threads to request (use the cluster.yaml file as a model). Can usually be kept as is.

### Reference bundle section

* *repeatmasker_lib_path*: path to the RepBase Libraries folder for RepeatMasker, if required by your RepeatMasker installation
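
As a rough illustration of the parameters above, here is a minimal sketch of what an `application.properties` could look like for a slurm cluster. The section headers and exact syntax are assumptions; use the `application.properties.example` shipped with the repository as the authoritative template:

    [global]
    batch_system_type = slurm
    modules = bioinfo/lumpy-v0.2.13
    paths = /path/to/extra/tools
    jobs = 999
    sv_dir = /path/to/svtoolkit

    [cluster]
    submission_mode = drmaa
    drmaa = /path/to/libdrmaa.so
    config = /path/to/cnvpipelines/cluster.yaml

    [refbundle]
    repeatmasker_lib_path = /path/to/RepBase/Libraries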


## Run

### Run a new workflow

#### Reference bundle

    ./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
    
With:  
`fasta`: the path of the reference fasta file  
`species`: species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)  
`working_dir`: the folder in which data will be stored

##### Optional arguments
`-l`: read length (default: 100)  
`-m`: maximum n-stretches length (default: 100)  
`-p`: for each rule, show the shell command being run.  
`-n`: dry run: show which rules would be launched, without running anything.  
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.  
`-c`: clean after launch: keep only final files (reference.* files).  
`-sc`: soft clean after launch: remove all log files.  
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config defined in the configuration)
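
For example, a dry run on a hypothetical bovine reference (file and folder names are illustrative):

    ./cnvpipelines.py run refbundle -r genome.fasta -s "Bos taurus" -w work/refbundle -n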

#### Align Fastq files on the reference

##### Command

    ./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
    
With:  
`fasta`: the path of the reference fasta file  
`samples`: a YAML file describing, for each sample, its name and fastq files (reads1 and, optionally, reads2). Example:
    
    Sample_1:
      reads1: /path/to/reads_1.fq.gz
      reads2: /path/to/reads_2.fq.gz
      
    Sample_2:
      reads: /path/to/reads.fq.gz
Where `Sample_1` and `Sample_2` are sample names.

`working_dir`: the folder in which data will be stored

##### Optional arguments
    
`-p`: for each rule, show the shell command being run.  
`-n`: dry run: show which rules would be launched, without running anything.  
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.  
`-c`: clean after launch: keep only final files (reference.* files).  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config defined in the configuration)

#### Detect variants

##### Command

    ./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
    
With:  
`fasta`: the path to the fasta file (with all files of the reference bundle in the same folder).  
`samples`: a text file listing, one per line, the paths to the BAM files to analyse (see the sketch below).  
`working_dir`: the folder in which data will be stored.  
`tools`: list of tools, space separated. Choose among genomestrip, delly, lumpy and pindel.
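
A minimal sketch of such a samples file (paths are illustrative):

    /path/to/sample_1.bam
    /path/to/sample_2.bam
    /path/to/sample_3.bam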

##### Optional arguments
  
`-b INT`: size of batches (default: -1, to always make only 1 batch)  
`--chromosomes CHRS`: list of chromosomes to study, comma separated. Default: all valid chromosomes of the reference  
`-v VARIANTS`: list of variant types to detect, space separated among: DEL (deletions), INV (inversions), DUP (duplications) and mCNV (copy number variations). Default: all types.  
`-p`: for each rule, show the shell command being run.  
`-n`: dry run: show which rules would be launched, without running anything.  
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.  
`-c`: clean after launch: keep only filtered results.  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (the whole workflow is run if not specified)  
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config defined in the configuration)
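
For example, detecting deletions and inversions with delly and lumpy (file and folder names are illustrative):

    ./cnvpipelines.py run detection -r reference.fasta -s bams.list -w work/detection -t delly lumpy -v DEL INV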
                    
#### Merge batches

##### Command

    ./cnvpipelines.py run mergebatches -w {working_dir}
    
With:  
`working_dir`: the detection run output folder. For now, it must contain at least 2 batches to run correctly.

##### Optional arguments

`-p`: for each rule, show the shell command being run.  
`-n`: dry run: show which rules would be launched, without running anything.  
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.  
`-c`: clean after launch: keep only filtered results.  
`-sc`: soft clean after launch: remove all log files.  
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (the whole workflow is run if not specified)  
`--cluster-config`: override the default cluster config file (see above) with a new one.


#### Simulate a population

Cnvpipelines integrates popsim, our tool to simulate a population with variants (DEL and/or INV). The simulation workflow generates such a population, launches the detection and then compares the detected variants to the true ones, producing a summary HTML or Jupyter file.

##### Command

    ./cnvpipelines.py run simulation -nb {nb_inds} -r {reference} -sp {species} -t {tools} -w {working_dir}
    
With:
`nb_inds`: number of individuals to generate.  
`reference`: fasta file to use as reference for individuals simulated.  
`species`: species of the reference  (required only if you use genomestrip for detection).
`tools`: tools to use for detection, space separated. Choose among genomestrip, delly, lumpy and pindel.  
`working_dir`: the folder in which data will be stored.

##### Description of variants
 
`-s {svlist}`: a file describing the size distributions of variants. If not given, a default distribution is used.  
Structure of the file (tab-separated columns):  
> DEL minLength maxLength proba -> Create DELetion(s).  
> DUP minLength maxLength proba -> Create tandem DUPlication(s).  
> INV minLength maxLength proba -> Create in-place INVersion(s).  

minLength and maxLength are integers. proba is the probability, between 0 and 1 (e.g. 0.7), that the variant has a size in the given range.

For each SV type, the sum of probabilities must be equal to 1.

One line per SV type, as above. You can add several lines for the same SV type.

Example:

| Min | Max | Proba | Cumul. proba |
|:---:|:---:|:-----:|:------------:|
| 100 | 200 |  0.7  |     0.7      |  
| 200 | 500 |  0.2  |     0.9      |
| 500 | 1000|  0.07 |     0.97     |  
| 1000| 2000|  0.02 |     0.99     |
| 2000|10000|  0.01 |     1.0      |
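
Following the distribution in the table above, the corresponding deletion lines of an svlist file could look like this (columns are tab-separated in the actual file; values are illustrative):

    DEL    100     200     0.7
    DEL    200     500     0.2
    DEL    500     1000    0.07
    DEL    1000    2000    0.02
    DEL    2000    10000   0.01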

##### Recommended options

`-ns {nstretches}`: positions of N stretches. Variants will be generated away from them.  
`-fp`: force polymorphism. For each variant, the genotype will differ for at least one individual.

##### Optional arguments

`-cv {coverage}`: coverage to use for generated reads of simulated individuals (default: 15).  
`-a`: to generate haploid individuals (default: diploid).  
`-pd {proba_del}`: probability to have a deletion (default: 1e-06).  
`-pi {proba_inv}`: probability to have an inversion (default: 1e-06).  
`-l {read_len}`: generate reads of the specified length (default: 100).  
`-m {insert_len_mean}`: generate inserts (fragments) with the specified average length (default: 300).  
`-v {insert_len_sd}`: set the standard deviation of the insert (fragment) length, in % (default: 30).  
`-md {min-del}`: minimum number of deletions to generate (default: 1).  
`-mi {min-inv}`: minimum number of inversions to generate (default: 1).  
`--max-try {nb}`: number of tries to reach the minimum values above; if it still fails, the program exits with an error (default: 10).  
`-g {file}`: a genotypes VCF file with variant positions and genotypes per individual. If given, only the genome and read generation is done.  
`-mn {int}`: maximum size of N stretches to consider them as such (for refbundle, only if genomestrip is among the tools) (default: 100).  
`--overlap-cutoff {float}`: cutoff for the reciprocal overlap between detected variants and true variants. Above this value, they are considered to be the same variant (default: 0.5).  
`--left-precision {int}`: left breakpoint precision. -1 to ignore (default: -1)  
`--right-precision {int}`: right breakpoint precision. -1 to ignore (default: -1)  
`-p`: for each rule, show the shell command being run.  
`-n`: dry run: show which rules would be launched, without running anything.  
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.  
`-c`: clean after launch: keep only filtered results.  
`-sc`: soft clean after launch: remove all log files.  
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (the whole workflow is run if not specified)  
`--cluster-config`: override the default cluster config file (see above) with a new one.
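
For example, simulating 10 diploid individuals and running detection with delly and pindel (file and folder names are illustrative):

    ./cnvpipelines.py run simulation -nb 10 -r reference.fasta -sp "Bos taurus" -t delly pindel -w work/simulation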



### Rerun a workflow

    ./cnvpipelines.py rerun -w {working_dir}
    
With:
`working_dir`: the folder in which data is stored

##### Optional arguments

`--force-unlock`: unlock the workflow (it may be locked if snakemake crashed or was killed)  
`--rerun-incomplete`: rerun incomplete rules  
`-p`: for each rule, show the shell command being run.  
`-n`: dry run: show which rules would be launched, without running anything.  
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.  
`--cluster-config FILE`: override the default cluster config file (see above) with a new one.  
`-c`: clean after launch  
`-sc`: soft clean after launch: remove all log files.  
`-f`: force run: erase the working dir config and files automatically  
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (the whole workflow is run if not specified)

### Clean a workflow

    ./cnvpipelines.py clean -w {working_dir}
    
`working_dir`: the folder in which data is stored

##### Optional arguments

`-f`: fake mode: only show the files and folders to delete, without making any changes.

#### Soft clean

    ./cnvpipelines.py soft-clean -w {working_dir}
    
With:  
`working_dir`: the folder in which data is stored

##### Optional arguments

`-f`: fake mode: only show the files and folders to delete, without making any changes.

### Unlock a workflow

If snakemake crashed or was killed, the workflow may be locked. You can unlock it with:

    ./cnvpipelines.py unlock -w {working_dir}
    
With:  
`working_dir`: the folder in which data is stored

Note: you can also use the `--force-unlock` option of the rerun mode.

## Authors

Thomas Faraut <thomas.faraut@inra.fr>  
Floréal Cabanettes <floreal.cabanettes@inra.fr>