Cnvpipelines workflow
---------------------
Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It is built as a snakemake workflow with a wrapper that eases job submission.
## In development...
This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with genomestrip still fails and is disabled for now.
The cnvpipelines workflow uses a number of open source projects to work properly:
* [snakemake](https://snakemake.readthedocs.io) - workflow management software
* [delly](https://github.com/dellytools/delly) - SV detection tool
* [lumpy](https://github.com/arq5x/lumpy-sv) - SV detection tool
* [pindel](https://github.com/genome/pindel) - SV detection tool
* [svtyper](https://github.com/hall-lab/svtyper) - SV genotyping tool
## Installation
### 1. Clone the repository
```sh
git clone --recursive https://forgemia.inra.fr/genotoul-bioinfo/cnvpipelines.git
```
### 2. Create the configuration file
Copy `application.properties.example` to `application.properties`. The configuration itself will be edited in a later step (see the Configuration section below).
### 3. Install third party software
Third party software is best installed using conda.
```sh
$ cd cnvpipelines
$ conda env create --name cnv --file requirements.yaml
```
### 4. Load new conda environment
```sh
source activate cnv
export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
```
### 6. Additional software to install
You also need to install RepBase (http://www.girinst.org/server/archive/RepBase21.12/ - choose this version, as more recent ones are not compatible with version 4.0.6 of RepeatMasker). Download the RepBase-derived RepeatMasker libraries (repeatmaskerlibraries-20160829.tar.gz) and uncompress the archive in your save folder; it will create a `Library` folder. Then set the path to this `Library` folder in the `application.properties` file (see below).
If you run simulations, you need additional python modules: matplotlib and seaborn. Once your conda environment is loaded, just install them like this:
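```sh
# with the cnv conda environment activated
conda install matplotlib seaborn
```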
Special case of the genologin cluster (genotoul):
* Lumpy is already available through the module bioinfo/lumpy-v0.2.13. Just add it to the `application.properties` file.
* For genomestrip, you can use this folder: `/usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774` (see the configuration part, *sv_dir* entry).
### 7. Future logins
For future logins, you must reactivate the conda environment. This means launching these commands:
```sh
export PATH=$CONDA_HOME/bin:$PATH
source activate cnv
export PERL5LIB="$CONDA_HOME/envs/cnv/lib/perl5"
```
Where `$CONDA_HOME` is the folder in which you installed miniconda in a previous step.
## Install for simulations
To run simulations, you need to compile pirs, which is included as a submodule of your cnvpipelines installation. Go into `cnvpipelines/popsim/pirs` and just run:
```sh
make
```
## Configuration
Configuration should be edited in `application.properties` file. Sections and parameters are described below.
### Global section
* *batch_system_type*: `local` to run all jobs locally, or `slurm` or `sge` to submit them to a cluster, depending on your scheduler (default: local).
* *modules*: list of modules to load before launching the workflow, space separated.
* *paths*: list of paths to add to the global PATH environment variable.
* *jobs*: maximum number of jobs to submit concurrently (default: 999).
* *sv_dir*: absolute path to the `svtoolkit` folder (for genomestrip).
### Cluster section
This section must be filled only if you don't use `local` as the batch system type (see above).
* *submission_mode*: `drmaa` to submit jobs through DRMAA API, `cluster` to submit jobs through bash commands.
* *submission_command*: if you choose `cluster` for `submission_mode`, you must specify the command used to submit jobs (e.g.: srun, qsub).
* *drmaa*: if you choose `drmaa` for `submission_mode`, you must specify the absolute path to the DRMAA library on the cluster.
* *native_submission_options*: options passed to the submission command. Should be kept as-is in most cases.
* *config*: absolute path to the config file defining, for each rule, the amount of memory and cluster threads to request (you should use the cluster.yaml file as a model). Can be kept as-is.
### Reference bundle section
* *repeatmasker_lib_path*: path to the RepBase Library folder for RepeatMasker, if needed by your RepeatMasker installation.
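For illustration only, a filled-in `application.properties` might look like the sketch below. The section and key names follow the descriptions above, and all paths and values are placeholders; check `application.properties.example` for the exact layout:
```ini
[global]
batch_system_type = slurm
jobs = 999
sv_dir = /usr/local/bioinfo/src/GenomeSTRiP/svtoolkit_2.00.1774

[cluster]
submission_mode = drmaa
drmaa = /path/to/libdrmaa.so
config = /path/to/cluster.yaml

[refbundle]
repeatmasker_lib_path = /path/to/Library
```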
## Run the workflow
#### Build the reference bundle
##### Command
```sh
./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
```
With:
`fasta`: the path of the reference fasta file
`species`: the species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)
`working_dir`: the folder in which to store data
##### Optional arguments
`-l`: read length (default: 100)
`-m`: maximum n-stretches length (default: 100)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only final files (reference.* files).
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
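For example, a hypothetical run (file and folder names are placeholders):
```sh
./cnvpipelines.py run refbundle -r genome.fa -s "Bos taurus" -w refbundle_wd
```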
#### Align Fastq files on the reference
##### Command
```sh
./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
```
With:
`fasta`: the path of the reference fasta file
`samples`: a YAML file describing, for each sample, its name and fastq files (reads1, and optionally reads2). Example:
```yaml
Sample_1:
    reads1: /path/to/reads_1.fq.gz
    reads2: /path/to/reads_2.fq.gz
Sample_2:
    reads: /path/to/reads.fq.gz
```
Where `Sample_1` and `Sample_2` are sample names.
`working_dir`: the folder in which to store data
##### Optional arguments
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only final files (reference.* files).
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
#### Detect variants
##### Command
```sh
./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
```
With:
`fasta`: the path to the fasta file (with all files of the reference bundle in the same folder).
`samples`: a file listing, one path per line, the bam files to analyse (see the example after this list).
`working_dir`: the folder in which to store data.
`tools`: list of tools, space separated. Choose among genomestrip, delly, lumpy and pindel.
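A minimal samples file might look like this (paths are placeholders):
```
/path/to/sample_1.bam
/path/to/sample_2.bam
```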
##### Optional arguments
`-b INT`: size of batches (default: -1, i.e. always make a single batch)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
`-v VARIANTS`: list of variant types to detect, space separated among: DEL (deletions), INV (inversions), DUP (duplications) and mCNV (copy number variations). Default: all types.
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only filtered results.
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
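For example, a hypothetical run detecting deletions and inversions with two tools (file and folder names are placeholders):
```sh
./cnvpipelines.py run detection -r reference.fasta -s bams.list -w detection_wd -t delly lumpy -v DEL INV
```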
#### Merge batches
##### Command
```sh
./cnvpipelines.py run mergebatches -w {working_dir}
```
With:
`working_dir`: the detection run output folder. For now, it must contain at least 2 batches to run correctly.
##### Optional arguments
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config`: override the default cluster config file (see above) with a new one.
#### Simulate a population
Popsim, our tool to simulate a population with variants (DEL and/or INV), is integrated into cnvpipelines. With the simulation workflow, you generate such a population, launch the detection, and then compare the detected results to the true ones, with a summary HTML or Jupyter file.
##### Command
```sh
./cnvpipelines.py run simulation -nb {nb_inds} -r {reference} -sp {species} -t {tools} -w {working_dir}
```
With:
`nb_inds`: number of individuals to generate.
`reference`: fasta file to use as the reference for the simulated individuals.
`species`: species of the reference (required only if you use genomestrip for detection).
`tools`: tools to use for detection, space separated. Choose among genomestrip, delly, lumpy and pindel.
`working_dir`: the folder in which to store data.
##### Description of variants
`-s {svlist}`: a file describing the size distributions of variants. If not given, a default distribution is used.
Structure of the file (tab separated columns):
```
DEL minLength maxLength proba    -> create DELetion(s)
DUP minLength maxLength proba    -> create tandem DUPlication(s)
INV minLength maxLength proba    -> create in-place INVersion(s)
```
`minLength` and `maxLength` are integers. `proba` is the probability that the variant has a size in the given range, between 0 and 1 (e.g. 0.7).
For each SV type, the sum of the probabilities must be equal to 1.
Use one line per SV type, as above; you can add several lines for the same SV type.
Example:
| Variant | Min | Max | Proba | Cumul. proba * |
|:-------:|:---:|:---:|:-----:|:--------------:|
| DEL | 100 | 200 | 0.7 | 0.7 |
| DEL | 200 | 500 | 0.2 | 0.9 |
| DEL | 500 | 1000| 0.07 | 0.97 |
| DEL | 1000| 2000| 0.02 | 0.99 |
| DEL | 2000|10000| 0.01 | 1.0 |
\* This column must not be set in the file (it is computed automatically).
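The distribution in the table above corresponds to a `svlist` file like this (columns are tab separated):
```
DEL	100	200	0.7
DEL	200	500	0.2
DEL	500	1000	0.07
DEL	1000	2000	0.02
DEL	2000	10000	0.01
```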
##### Recommended options
`-ns {nstretches}`: positions of N stretches. Variants will be generated away from them.
`-fp`: force polymorphism. For each variant, the genotype will be different for at least one individual.
##### Optional arguments
`-cv {coverage}`: coverage to use for generated reads of simulated individuals (default: 15).
`-a`: to generate haploid individuals (default: diploid).
`-pd {proba_del}`: probability to have a deletion (default: 1e-06).
`-pi {proba_inv}`: probability to have an inversion (default: 1e-06).
`-l {read_len}`: generate reads of the specified length (default: 100).
`-m {insert_len_mean}`: generate inserts (fragments) with the specified average length (default: 300).
`-v {insert_len_sd}`: standard deviation of the insert (fragment) length, as a percentage (default: 30).
`-md {min-del}`: Minimum number of deletions to generate (default: 1).
`-mi {min-inv}`: Minimum number of inversions to generate (default: 1).
`--max-try {nb}`: number of tries to reach the minimum values above. If it still fails, the program exits with an error (default: 10).
`-g {file}`: provide a genotypes VCF file with variant positions and genotypes per individual. If given, only the genome and read generation is done.
`-mn {int}`: maximum size of N stretches to treat them as such (for refbundle, only if genomestrip is in tools) (default: 100).
`--overlap-cutoff {float}`: cutoff for the reciprocal overlap between detected variants and true variants. Above this value, they are considered the same variant (default: 0.5).
`--left-precision {int}`: left breakpoint precision. -1 to ignore (default: -1)
`--right-precision {int}`: right breakpoint precision. -1 to ignore (default: -1)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`-c`: clean after launch: keep only filtered results.
`-sc`: soft clean after launch: remove all log files.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
`--cluster-config`: override the default cluster config file (see above) with a new one.
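For example, a hypothetical simulation of 10 individuals detected with two tools (file and folder names are placeholders):
```sh
./cnvpipelines.py run simulation -nb 10 -r reference.fasta -sp "Bos taurus" -t delly lumpy -w sim_wd
```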
### Rerun a workflow
##### Command
```sh
./cnvpipelines.py rerun -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
##### Optional arguments
`--force-unlock`: unlock the workflow (it may be locked if snakemake crashed or was killed)
`--rerun-incomplete`: rerun incomplete rules
`-p`: for each rule, print the shell command that is run.
`-n`: dry run: show which rules would be launched, without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after launch.
`--cluster-config FILE`: override the default cluster config file (see above) with a new one.
`-sc`: soft clean after launch: remove all log files.
`-f`: force run: erase the working dir config and files automatically.
`--out-step STEP`: specify an output rule file to run the workflow only up to the associated rule (the whole workflow is run if not specified)
### Clean a workflow
#### Full clean
```sh
./cnvpipelines.py clean -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
##### Optional arguments
`-f`: fake mode: only show the files and folders to delete, without making any changes.
#### Soft clean
```sh
./cnvpipelines.py soft-clean -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
##### Optional arguments
`-f`: fake mode: only show the files and folders to delete, without making any changes.
### Unlock a workflow
If snakemake crashes or is killed, the workflow may remain locked. You can unlock it with:
```sh
./cnvpipelines.py unlock -w {working_dir}
```
With:
`working_dir`: the folder where data is stored
Note: you can also use the `--force-unlock` option of the rerun mode.
## Authors
Thomas Faraut <thomas.faraut@inra.fr>
Floréal Cabanettes <floreal.cabanettes@inra.fr>