
Cnvpipelines workflow

Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It is built as a Snakemake workflow, with a wrapper that eases job submission.

In development...

This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with genomestrip still fails and is disabled for now.

Tech

The cnvpipeline relies on a number of open-source projects and additional third-party software. All of these are installed automatically using conda.

Installation

Clone this repository:

git clone --recursive https://forgemia.inra.fr/genotoul-bioinfo/cnvpipelines.git

Use conda to install the third-party software:

$ cd cnvpipelines
$ conda env create --file environment.yaml

Load the new conda environment:

source activate cnvpipeline

Install for simulations

To run simulations, you need to compile pirs, which is included as a submodule of your cnvpipeline installation. You must use a recent gcc compiler (for example, on genologin: module load compiler/gcc-7.2.0). Then cd into cnvpipelines/popsim/pirs and run make:

make
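
Putting it together, on the genologin cluster the full sequence would look like this (adapt the module name to your own system):

module load compiler/gcc-7.2.0
cd cnvpipelines/popsim/pirs
make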

Configuration

Copy application.properties.example into application.properties and make appropriate changes. Sections and parameters are described below.

Global section

  • batch_system_type: local to run all jobs locally, or slurm or sge to submit jobs to the corresponding cluster scheduler (default: local).
  • modules: list of modules to load before launching the workflow, space separated.
  • paths: list of paths to add to the global PATH environment variable.
  • jobs: maximum number of jobs to submit concurrently (default: 999).
  • sv_dir: absolute path to the svtoolkit folder (for genomestrip).

Cluster section

This section only needs to be filled in if you do not use local as the batch system type (see above).

  • submission_mode: drmaa to submit jobs through DRMAA API, cluster to submit jobs through bash commands.
  • submission_command: if you choose cluster for submission_mode, you must specify the command used to submit jobs (e.g.: srun, qsub).
  • drmaa: if you choose drmaa for submission_mode, you must specify the absolute path to the DRMAA library on the cluster.
  • native_submission_options: options passed to the submission command. Can usually be left as is.
  • config: absolute path to the config file defining, for each rule, the amount of memory and number of cluster threads to request (use the cluster.yaml file as a model). Can usually be left as is.

Reference bundle section

  • repeatmasker_lib_path: path to the RepBase libraries folder for RepeatMasker, if required by your RepeatMasker installation.
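
For illustration only, a filled-in application.properties for a SLURM cluster using DRMAA could look roughly like this (the bracketed section names and all values are assumptions; the shipped application.properties.example remains the reference):

[global]
batch_system_type = slurm
jobs = 200
sv_dir = /path/to/svtoolkit

[cluster]
submission_mode = drmaa
drmaa = /path/to/libdrmaa.so
config = /path/to/cnvpipelines/cluster.yaml

[refbundle]
repeatmasker_lib_path = /path/to/RepBase/Libraries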

Genotoul specific configuration

  • Use the delly module specified
  • Use the drmaa library specified

Run

Run a new workflow

Reference bundle

Command
./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}

With:
fasta: the path to the reference fasta file
species: the species name, according to the NCBI Taxonomy database
working_dir: the folder in which data will be stored
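
For example, a refbundle run could be launched as follows (the fasta path, species name and working directory are illustrative; quoting the species name is assumed to work when it contains spaces):

./cnvpipelines.py run refbundle -r /path/to/genome.fasta -s "Bos taurus" -w /path/to/refbundle_wdir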

Optional arguments

-l: read length (default: 100)
-m: maximum N-stretch length (default: 100)
--chromosomes CHRS: list of chromosomes to study, space separated. Regular expressions are accepted (Python syntax). Default: all valid chromosomes of the reference
--force-all-chromosomes: ignore filtering if --chromosomes is not set
-p: for each rule, show the shell command that is run.
-n: dry run: show which rules would be launched, without running anything.
--keep-wdir: in dry run mode, don't remove the working dir after launch
-c: clean after launch: keep only final files (reference.* files).
-sc: soft clean after launch: remove all log files.
--out-step STEP: run the workflow only up to the rule associated with the specified output file (the whole workflow is run if not specified)
--cluster-config FILE: path to a cluster config file (overrides the cluster config set in the configuration)

Align Fastq files on the reference

Command
./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}

With:
fasta: the path to the reference fasta file
samples: a YAML file describing, for each sample, its name and fastq files (reads1, and optionally reads2). Example:

Sample_1:
  reads1: /path/to/reads_1.fq.gz
  reads2: /path/to/reads_2.fq.gz
  
Sample_2:
  reads: /path/to/reads.fq.gz

Where Sample_1 and Sample_2 are the sample names.

working_dir: the folder in which data will be stored
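
For example, assuming the YAML above is saved as samples.yml and the reference bundle was built in /path/to/refbundle_wdir (the reference.fasta file name is an assumption; point -r to the fasta actually produced by the refbundle step):

./cnvpipelines.py run align -r /path/to/refbundle_wdir/reference.fasta -s samples.yml -w /path/to/align_wdir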

Optional arguments

-p: for each rule, show the shell command that is run.
-n: dry run: show which rules would be launched, without running anything.
--keep-wdir: in dry run mode, don't remove the working dir after launch
-c: clean after launch: keep only final files (reference.* files).
-sc: soft clean after launch: remove all log files.
-f: force run: erase the working dir config and files automatically
--out-step STEP: run the workflow only up to the rule associated with the specified output file (the whole workflow is run if not specified)
--cluster-config FILE: path to a cluster config file (overrides the cluster config set in the configuration)

Detection

Command
./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}

With:
fasta: the path to the fasta file (with all files of the reference bundle in the same folder).
samples: a file with, on each line, the path to a bam file to analyse.
working_dir: the folder in which data will be stored.
tools: list of tools, space separated. Choose among genomestrip, delly, lumpy and pindel.
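
For example, if bams.list is a plain-text file containing one BAM path per line (all paths illustrative):

/path/to/align_wdir/Sample_1.bam
/path/to/align_wdir/Sample_2.bam

then detection with delly and lumpy could be launched as:

./cnvpipelines.py run detection -r /path/to/refbundle_wdir/reference.fasta -s bams.list -w /path/to/detection_wdir -t delly lumpy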

Optional arguments

-b INT: size of batches (default: -1, i.e. always make a single batch)
--chromosomes CHRS: list of chromosomes to study, space separated. Regular expressions are accepted (Python syntax). Default: all valid chromosomes of the reference
--force-all-chromosomes: ignore filtering if --chromosomes is not set
-v VARIANTS: list of variant types to detect, space separated, among DEL (deletions), INV (inversions), DUP (duplications) and mCNV (copy number variations). Default: all types.
-p: for each rule, show the shell command that is run.
-n: dry run: show which rules would be launched, without running anything.
--keep-wdir: in dry run mode, don't remove the working dir after launch
-c: clean after launch: keep only filtered results.
-sc: soft clean after launch: remove all log files.
-f: force run: erase the working dir config and files automatically
--out-step STEP: run the workflow only up to the rule associated with the specified output file (the whole workflow is run if not specified)
--cluster-config FILE: path to a cluster config file (overrides the cluster config set in the configuration)

Merge batches

Command
./cnvpipelines.py run mergebatches -w {working_dir}

With:
working_dir: the output folder of a detection run. For now, it must contain at least 2 batches to run correctly.

Optional arguments

-p: for each rule, show the shell command that is run.
-n: dry run: show which rules would be launched, without running anything.
--keep-wdir: in dry run mode, don't remove the working dir after launch
-c: clean after launch: keep only filtered results.
-sc: soft clean after launch: remove all log files.
--out-step STEP: run the workflow only up to the rule associated with the specified output file (the whole workflow is run if not specified)
--cluster-config FILE: override the default cluster config file (see above) with a new one.

Simulate a population

Popsim, our tool to simulate a population with variants (DEL and/or INV), is integrated into cnvpipelines. With the simulation workflow, you generate such a population, launch the detection, and then compare the detected results to the true ones, with a summary HTML or Jupyter file.

Command
./cnvpipelines.py run simulation -nb {nb_inds} -r {reference} -sp {species} -t {tools} -w {working_dir}

With:
nb_inds: the number of individuals to generate.
reference: the fasta file to use as the reference for the simulated individuals.
species: the species of the reference (required only if you use genomestrip for detection).
tools: the tools to use for detection, space separated. Choose among genomestrip, delly, lumpy and pindel.
working_dir: the folder in which data will be stored.
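
For example, to simulate 20 individuals and detect variants with delly and lumpy (reference path, species name and working directory are illustrative):

./cnvpipelines.py run simulation -nb 20 -r /path/to/genome.fasta -sp "Bos taurus" -t delly lumpy -w /path/to/simulation_wdir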

Description of variants

-s {svlist}: a file describing the size distributions of variants. If not given, a default distribution is used.
Structure of the file (tab separated columns):

DEL minLength maxLength proba -> Create DELetion(s).
DUP minLength maxLength proba -> Create tandem DUPlication(s).
INV minLength maxLength proba -> Create in-place INVersion(s).

minLength and maxLength are integers. proba is the probability that the variant has a size in this range, between 0 and 1 (e.g. 0.7).

For each SV type, the sum of the probabilities must equal 1.

Use one line per SV type, as above. You can add several lines for the same SV type.

Example:

Variant  Min    Max    Proba  Cumul. proba *
DEL      100    200    0.7    0.7
DEL      200    500    0.2    0.9
DEL      500    1000   0.07   0.97
DEL      1000   2000   0.02   0.99
DEL      2000   10000  0.01   1.0

* This column does not have to be set in the file (it is computed automatically).
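
The svlist file corresponding to the table above (tab-separated, without the cumulative column) would therefore contain:

DEL	100	200	0.7
DEL	200	500	0.2
DEL	500	1000	0.07
DEL	1000	2000	0.02
DEL	2000	10000	0.01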

Recommended options

-ns {nstretches}: positions of N stretches. Variants will be generated away from them.
-fp: force polymorphism. For each variant, the genotype will differ for at least one individual.

Optional arguments

-cv {coverage}: coverage to use for the generated reads of simulated individuals (default: 15).
-a: generate haploid individuals (default: diploid).
-pd {proba_del}: probability to have a deletion (default: 1e-06).
-pi {proba_inv}: probability to have an inversion (default: 1e-06).
-l {read_len}: generate reads of the specified length (default: 100).
-m {insert_len_mean}: generate inserts (fragments) with the specified average length (default: 300).
-v {insert_len_sd}: standard deviation of the insert (fragment) length, in % (default: 30).
-md {min-del}: minimum number of deletions to generate (default: 1).
-mi {min-inv}: minimum number of inversions to generate (default: 1).
--max-try {nb}: number of tries to reach the minimum values above. If it still fails, the program exits with an error (default: 10).
-g {file}: genotypes VCF file with variant positions and genotypes per individual. If given, only genome and read generation is performed.
-mn {int}: maximum size of N stretches to consider them as such (for refbundle, only if genomestrip is among the tools) (default: 100).
--overlap-cutoff {float}: cutoff for the reciprocal overlap between detected variants and true variants. Above this value, they are considered to be the same variant (default: 0.5).
--left-precision {int}: left breakpoint precision. -1 to ignore (default: -1)
--right-precision {int}: right breakpoint precision. -1 to ignore (default: -1)
--chromosomes CHRS: list of chromosomes to study, space separated. Regular expressions are accepted (Python syntax). Default: all valid chromosomes of the reference
--force-all-chromosomes: ignore filtering if --chromosomes is not set
-p: for each rule, show the shell command that is run.
-n: dry run: show which rules would be launched, without running anything.
--keep-wdir: in dry run mode, don't remove the working dir after launch
-c: clean after launch: keep only filtered results.
-sc: soft clean after launch: remove all log files.
--out-step STEP: run the workflow only up to the rule associated with the specified output file (the whole workflow is run if not specified)
--cluster-config: override the default cluster config file (see above) with a new one.

Rerun a workflow

Command
./cnvpipelines.py rerun -w {working_dir}

With:
working_dir: the folder where data is stored

Optional arguments

--force-unlock: unlock the workflow (it may be locked if snakemake crashed or was killed)
--rerun-incomplete: rerun incomplete rules
-p: for each rule, show the shell command that is run.
-n: dry run: show which rules would be launched, without running anything.
--keep-wdir: in dry run mode, don't remove the working dir after launch
--cluster-config FILE: override the default cluster config file (see above) with a new one.
-c: clean after launch
-sc: soft clean after launch: remove all log files.
-f: force run: erase the working dir config and files automatically
--out-step STEP: run the workflow only up to the rule associated with the specified output file (the whole workflow is run if not specified)

Clean a workflow

Full clean

./cnvpipelines.py clean -w {working_dir}

With:
working_dir: the folder where data is stored

Optional arguments

-f: fake mode: only show the files and folders to delete, without making any changes.

Soft clean

./cnvpipelines.py soft-clean -w {working_dir}

With:
working_dir: the folder where data is stored

Optional arguments

-f: fake mode: only show the files and folders to delete, without making any changes.

Unlock a workflow

If snakemake crashes or is killed, the workflow may remain locked. You can unlock it with:

./cnvpipelines.py unlock -w {working_dir}

With:
working_dir: the folder where data is stored

Note: you can also use the --force-unlock option of the rerun mode.

Authors

Thomas Faraut thomas.faraut@inra.fr
Floréal Cabanettes floreal.cabanettes@inra.fr