Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (139)
Showing 2936 additions and 2315 deletions
......@@ -5,3 +5,4 @@ application.properties
**/*.pyo
**/.snakemake
**/old/*
**/.ipynb_checkpoints
[submodule "snakecnv/lib"]
path = snakecnv/lib
url = git@forgemia.inra.fr:svdetection/svlib.git
branch=master
url = https://forgemia.inra.fr/svdetection/svlib.git
branch = master
[submodule "popsim"]
path = popsim
url = git@forgemia.inra.fr:svdetection/popsim.git
url = https://forgemia.inra.fr/svdetection/popsim.git
branch = master
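The submodule URLs above switch from SSH to HTTPS. For an existing clone, the standard way to pick up such a change is to resync the submodules (these are ordinary git commands, not part of this diff):

```sh
git submodule sync --recursive   # rewrite .git/config with the new URLs
git submodule update --init --recursive
```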
......@@ -3,6 +3,7 @@ Cnvpipelines workflow
Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It's built as a Snakemake workflow with a wrapper to ease running jobs.
## In development...
This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with genomestrip still fails and is disabled for now.
......@@ -23,7 +24,7 @@ and additionnal third party softwares. All those sofware will be installed autom
Clone this repository:
git clone --recursive https://forgemia.inra.fr/genotoul-bioinfo/cnvpipelines.git
Use conda to install the third party software.
```sh
......@@ -34,20 +35,22 @@ $ conda env create --file environment.yaml
#### Load new conda environment
```sh
source activate cnvpipeline
conda activate cnvpipeline
```
### Install for simulations
To do simulations, you need to compile pirs, which is included as a submodule of your cnvpipelines installation. You must use a recent gcc compiler (for example, on genologin: module load compiler/gcc-7.2.0).
Then cd into `cnvpipelines/popsim/pirs` and run make:
module load compiler/gcc-7.2.0 # for genologin users
make
## Configuration
Copy `application.properties.example` into `application.properties` and make appropriate changes.
Sections and parameters are described below.
Sections and parameters are described below (on the genologin server you can simply copy `application.properties.genologin` into `application.properties`).
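A minimal sketch of the two options (both file names ship with the repository):

```sh
# generic setup: start from the example file, then edit it
cp application.properties.example application.properties

# on the genologin server: use the ready-made configuration
cp application.properties.genologin application.properties
```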
### Global section
......@@ -69,12 +72,11 @@ This section must be filled only if you don't use local as batch system type (se
### Reference bundle section
* *repeatmasker_lib_path*: PATH to the RepBase Libraries folder for RepeatMasker, if needed by RepeatMasker installation
* *repeatmasker_lib_path*: PATH to the RepBase Libraries folder for RepeatMasker. This is not mandatory (see Install RepeatMasker Libraries in http://www.repeatmasker.org/RMDownload.html).
### Genotoul specific configuration
### Genotoul (genologin) specific configuration
* Use the delly module specified
* Ue the drmaa library specified
* Use the drmaa library specified
## Run
......@@ -86,7 +88,7 @@ This section must be filled only if you don't use local as batch system type (se
##### Command
./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
With:
`fasta`: the path to the reference fasta file
`species`: species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)
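For example, a hypothetical invocation (the fasta path and working directory are placeholders):

```sh
./cnvpipelines.py run refbundle -r genome.fasta -s "bos taurus" -w work/refbundle
```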
......@@ -111,15 +113,15 @@ With:
##### Command
./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
With:
`fasta`: the path to the reference fasta file
`samples`: a YAML file describing, for each sample, its name and fastq files (reads1 and, optionally, reads2). Example:
Sample_1:
reads1: /path/to/reads_1.fq.gz
reads2: /path/to/reads_2.fq.gz
Sample_2:
reads: /path/to/reads.fq.gz
Where `Sample_1` and `Sample_2` are the sample names.
......@@ -127,7 +129,7 @@ Where `Sample_1` and `Sample_2` are samples name.
`working_dir`: the folder in which to store data
##### Optional arguments
`-p`: for each rule, show the shell command run.
`-n`: dry run: show which rules would be launched without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after the launch
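Assuming the sample YAML shown above is saved as `samples.yaml`, a hypothetical invocation (reference path and working directory are placeholders):

```sh
./cnvpipelines.py run align -r genome.fasta -s samples.yaml -w work/align
```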
......@@ -143,7 +145,7 @@ Where `Sample_1` and `Sample_2` are samples name.
##### Command
./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
With:
`fasta`: the path to the fasta file (with all files of the reference bundle in the same folder).
`samples`: a file with, on each line, the path to a bam file to analyse.
......@@ -151,7 +153,7 @@ With:
`tools`: list of tools, space separated. Choose among genomestrip, delly, lumpy and pindel.
##### Optional arguments
`-r rb`: the refbundle directory, required when using genomestrip
`-b INT`: size of batches (default: -1, to always make only 1 batch)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
......@@ -165,13 +167,13 @@ With:
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (runs the whole workflow if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
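Putting the pieces together, a hypothetical detection run (paths, sample names and tool choice are placeholders; `--chromosomes` takes a Python regex as described above):

```sh
# bams.list contains one bam path per line, e.g.:
#   /path/to/Sample_1.bam
#   /path/to/Sample_2.bam
./cnvpipelines.py run detection -r genome.fasta -s bams.list -w work/detection \
    -t delly lumpy --chromosomes "chr[0-9]+"
```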
#### Merge batches
##### Command
./cnvpipelines.py run mergebatches -w {working_dir}
With:
`working_dir`: the detection run output folder. Must have at least 2 batches to run correctly for now.
......@@ -193,7 +195,7 @@ We integrate to cnvpipelines popsim, our tool to simulate a population with vari
##### Command
./cnvpipelines.py run simulation -nb {nb_inds} -r {reference} -sp {species} -t {tools} -w {working_dir}
With:
`nb_inds`: number of individuals to generate.
`reference`: fasta file to use as reference for the simulated individuals.
......@@ -202,7 +204,7 @@ With:
`working_dir`: the folder in which to store data.
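For example, a hypothetical invocation (all values are placeholders):

```sh
./cnvpipelines.py run simulation -nb 10 -r genome.fasta -sp "bos taurus" -t delly lumpy -w work/simulation
```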
##### Description of variants
`-s {svlist}`: a file describing the size distributions of variants. If not given, a default distribution is used.
Structure of the file (tab separated columns):
>DEL minLength maxLength proba -> Create DELetion(s).
......@@ -266,7 +268,7 @@ Example:
##### Command
./cnvpipelines.py rerun -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......@@ -288,7 +290,7 @@ With:
#### Full clean
./cnvpipelines.py clean -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......@@ -299,7 +301,7 @@ With:
#### Soft clean
./cnvpipelines.py soft-clean -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......@@ -312,7 +314,7 @@ With:
If snakemake crashes or is killed, the workflow may remain locked. You can unlock it with:
./cnvpipelines.py unlock -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......
[global]
# batch system type: local, slurm or sge
# for genologin use slurm
batch_system_type = slurm
# list of modules to load (space separated)
modules =
# path environment to prepend (":" separated)
paths =
# number of concurrent jobs to launch
jobs = 100
# svtoolkit home: genomestrip is now installed via conda
# sv_dir =
[cluster]
# Ignore these options for local batch system type
# submission mode: drmaa or cluster
submission_mode = drmaa
# cluster submission command (ignore for DRMAA submission mode)
submission_command =
# DRMAA lib (ignore for cluster submission mode)
# For genologin use /tools/libraries/slurm-drmaa/slurm-drmaa-1.0.7/lib/libdrmaa.so
drmaa = /tools/libraries/slurm-drmaa/slurm-drmaa-1.0.7/lib/libdrmaa.so
# native submission options: keep the default in most cases
native_submission_options = ###DEFAULT###
# cluster config file
config = ###PROGRAM###/cluster.yaml
# enable parallelisation through DRMAA for GenomeSTRIP (may exceed the configured number of concurrent jobs)
# True: enabled, False: disabled
sv_parallel_drmaa = True
sv_parallel_native_specs = ###DEFAULT###
[refbundle]
# Path to the Libraries folder extracted from RepBase archive file
# Optional (will use default libraries instead) but highly recommended :-)
repeatmasker_lib_path = /save/faraut/Libraries
......@@ -8,11 +8,12 @@ delly:
time: "36:00:00"
lumpy:
mem: 12
mem: 24
time: "36:00:00"
pindel:
mem: 16
time: "24:00:00"
time: "36:00:00"
genomestrip:
mem: 8
......@@ -46,6 +47,11 @@ repeatmask:
mem: 8
h_vmem: 8
buildpop:
time: "48:00:00"
mem: 8
h_vmem: 8
bed2fasta:
mem: 16
h_vmem: 16
......
......@@ -13,14 +13,13 @@ import traceback
import math
import time
import sys
#from termcolor import colored, cprint
from datetime import datetime
from pathlib import Path
from subprocess import run
TOOLS = ["lumpy", "delly", "pindel", "genomestrip"]
VTYPES = ["DEL", "INV", "DUP", "mCNV"]
MIN_CHR_SIZE = 50000
MIN_CHR_SIZE = 500000
MD5_TEST = {
"ERR470101_f_1.fq.gz": "2c30600e303c9bfb7b9538ec76055203",
......@@ -87,8 +86,20 @@ class CnvPipeline:
return hash_md5.hexdigest()
def _write_samples(self, samples, sample_file):
with open(sample_file, "w") as final_samples:
final_samples.write("\n".join(samples) + "\n")
with open(sample_file, "w") as samples_fh:
samples_fh.write("\n".join(samples) + "\n")
# self._write_samples_bam_tsv(samples, sample_file)
def _write_samples_bam_tsv(self, samples, sample_file):
# write in addition a tsv file with sample names and bams
# we have therefore two files: sample list and sample tsv
# (with bam file path)
samples_tsv = os.path.splitext(sample_file)[0] + ".tsv"
with open(samples_tsv, "w") as samples_bam:
samples_bam.write("sample\tbam\n")
for sample in samples:
bam = "data/bams/" + sample + ".bam"
samples_bam.write("%s\t%s\n" % (sample, bam))
def _link_bams_wdir(self, samples_file, bams_dir, smple_file, nb_samples):
"""
......@@ -107,6 +118,7 @@ class CnvPipeline:
with open(samples_file, "r") as samples:
samples_name = []
all_samples_name = []
all_bam_files = []
# Ignore previous lines:
added_samples = 0
n_block = 1
......@@ -165,18 +177,20 @@ class CnvPipeline:
samples_name.append(id_sample)
all_samples_name.append(id_sample)
if added_samples == nb_samples:
# write the block samples name
self._write_samples(samples_name, smple_file + "." + str(n_block))
samples_name = []
added_samples = 0
n_block += 1
if len(samples_name) > 0: ## Sample names for the batch
if len(samples_name) > 0: # Sample names for the batch
self._write_samples(samples_name, smple_file + "." + str(n_block))
if len(all_samples_name) > 0: ## All samples names
if len(all_samples_name) > 0: # All samples names
self._write_samples(all_samples_name, smple_file)
@staticmethod
def _fail(message):
"""
......@@ -262,15 +276,15 @@ class CnvPipeline:
def _get_snakefile(self):
try:
if self.wf_type == "detection":
snakefile = "detection.snk"
snakefile = "detection.smk"
elif self.wf_type == "refbundle":
snakefile = "referencebundle.snk"
snakefile = "referencebundle.smk"
elif self.wf_type == "align":
snakefile = "align.snk"
snakefile = "align.smk"
elif self.wf_type == "mergebatches":
snakefile = "mergebatches.snk"
snakefile = "mergebatches.smk"
elif self.wf_type == "simulation":
snakefile = "popsim.snk"
snakefile = "simulation.smk"
elif self.wf_type == "test":
snakefile = "test.snk"
elif self.wf_type == "unknown":
......@@ -423,6 +437,7 @@ class CnvPipeline:
:param reason: if True, show reason of each executed rule
:type reason: bool
"""
if self.isrerun:
run(["touch", self.config_file, "-d", self.config_timestamp])
if mode == "rerun" and self.wf_type == "detection" and len(snake_args) == 0:
......@@ -431,27 +446,31 @@ class CnvPipeline:
del local_vars["snake_args"]
return self.run_or_rerun_detection(**local_vars)
# Fix snakemake bug: set reference file timestamps to the config.yaml timestamp value
if (mode == "rerun" or self.isrerun) and (self.wf_type == "simulation" or self.wf_type == "refbundle"):
reference = os.path.join(self.wdir, "reference.fasta")
fixed_date = datetime.fromtimestamp(os.stat(self.config_file).st_ctime).strftime("%Y-%m-%d %H:%M:%S")
if os.path.exists(reference):
run(["touch", reference, "-d", fixed_date])
fai = reference + ".fai"
if os.path.exists(fai):
run(["touch", fai, "-d", fixed_date])
if self.wf_type == "simulation":
reference = os.path.join(self.wdir, "refbundle", "reference.fasta")
if os.path.exists(reference):
run(["touch", reference, "-d", fixed_date])
fai = reference + ".fai"
if os.path.exists(fai):
run(["touch", fai, "-d", fixed_date])
# if (mode == "rerun" or self.isrerun) and (self.wf_type == "simulation" or self.wf_type == "refbundle"):
# #reference = os.path.join(self.wdir, "reference.fasta")
# reference = os.path.join(self.wdir, "reference_raw.fasta")
# fixed_date = datetime.fromtimestamp(os.stat(self.config_file).st_ctime).strftime("%Y-%m-%d %H:%M:%S")
# if os.path.exists(reference):
# run(["touch", "-h", reference, "-d", fixed_date])
# fai = reference + ".fai"
# if os.path.exists(fai):
# run(["touch", "-h", fai, "-d", fixed_date])
#
# if self.wf_type == "simulation":
# reference = os.path.join(self.wdir, "refbundle", "reference.fasta")
# if os.path.exists(reference):
# run(["touch", reference, "-d", fixed_date])
# fai = reference + ".fai"
# if os.path.exists(fai):
# run(["touch", fai, "-d", fixed_date])
if merge_batches:
if self.wf_type != "detection":
self._fail("Merge batches can only be done on a detection workflow")
self._wf_type = "mergebatches"
try:
if not os.path.exists(self.config_file):
raise FileNotFoundError("File not found: %s" % self.config_file)
......@@ -547,14 +566,12 @@ class CnvPipeline:
do_delete_after_dryrun = self.delete_wdir_after_dryrun
for n_block in blocks:
sample_file = os.path.join(self.wdir, "samples.list.%d" % n_block)
with open(sample_file, "r") as smple_file:
samples_name = [x.rstrip() for x in smple_file if x.rstrip() != ""]
block_done = os.path.join(self.wdir, ".done%d" % n_block)
if not os.path.exists(block_done):
print("Launching block of samples number %d..." % n_block, flush=True)
print("Samples:", flush=True)
......@@ -563,11 +580,11 @@ class CnvPipeline:
# if size_pbatches > 0 and n_block > 1:
# start_batch_nb = ((n_block - 1) * batches_parallel) + 1
# Run:
self.delete_wdir_after_dryrun = False
exit_status = self.run_or_rerun(snake_args={"start_batch_nb": start_batch_nb,
"sample_file": sample_file}, **kwargs)
"sample_file": sample_file},
**kwargs)
if exit_status == 0 and ("dryrun" not in kwargs or not kwargs["dryrun"]):
Path(block_done).touch()
......@@ -749,6 +766,7 @@ class CnvPipeline:
size_pbatches = -1 if batches == -1 else batches * batches_parallel
config = {
"wf_type": "detection",
"wdir": self.wdir,
"reference": "reference.fasta",
"tools": tools,
......@@ -836,6 +854,8 @@ class CnvPipeline:
chromosomes = self._chromosomes_filtered(final_fai, chromosomes)
config = {
"wf_type": "refbundle",
"reference": "reference_raw.fasta",
"wdir": self.wdir,
"species": species,
"read_len": read_length,
......@@ -856,9 +876,9 @@ class CnvPipeline:
traceback.print_exc()
self._fail(str(e))
def parse_align_sample_file(self, samples, final_samples):
with open(samples, "r") as old_s, open(final_samples, "w") as new_s:
final_samples = {}
def parse_align_sample_file(self, input_sample_file, valid_sample_file):
with open(input_sample_file, "r") as old_s, open(valid_sample_file, "w") as new_s:
valid_samples = {}
try:
y_samples = yaml.load(old_s, Loader=yaml.FullLoader)
except yaml.YAMLError:
......@@ -869,20 +889,38 @@ class CnvPipeline:
if not isinstance(y_samples, dict):
raise ValueError("Error: sample file is not valid")
for sample, files in y_samples.items():
final_samples[sample] = {}
valid_samples[sample] = {}
if not isinstance(files, dict):
raise ValueError("Error: sample file is not valid. File line must be: `reads1: /path/to/file`. "
"May be you forget a space?"
"Maybe you forgot a space?"
for name, file in files.items():
if name not in ["reads1", "reads2"]:
raise ValueError("Invalid sample \"%s\" in sample file: %s" % (sample, input_sample_file))
elif name in final_samples[sample]:
elif name in valid_samples[sample]:
raise ValueError("Invalid sample \"%s\" in sample file: %s" % (sample, input_sample_file))
if not os.path.exists(file) or not os.path.isfile(file):
raise ValueError("Sample \"%s\": file \"%s\" does not exist or is not a file" %
(sample, file))
final_samples[sample][name] = os.path.abspath(file)
yaml.dump(final_samples, new_s, default_flow_style=False)
valid_samples[sample][name] = os.path.abspath(file)
yaml.dump(valid_samples, new_s, default_flow_style=False)
# We dump here also a tsv file
# FIXME keep a single sample file in work dir (the tsv file)
# FIXME remove the so called valid_sample_file yaml file
#self._dump_tsv_sample_file(valid_samples, valid_sample_file)
def _dump_tsv_sample_file(self, valid_samples, valid_sample_file):
samples_tsv = os.path.splitext(valid_sample_file)[0] + ".tsv"
with open(samples_tsv, "w") as samples_fh:
samples_fh.write("sample\treads\n")
for sample in valid_samples:
sample_fastqs = self._get_fastq_path(valid_samples[sample])
samples_fh.write("%s\t%s\n" % (sample, ",".join(sample_fastqs)))
def _get_fastq_path(self, sample_fastq):
if len(sample_fastq) == 1:
return [sample_fastq["reads"]]
else:
return [sample_fastq["reads1"], sample_fastq["reads2"]]
def run_align(self, reference, samples, force_wdir=False, **kwargs):
"""
......@@ -922,9 +960,10 @@ class CnvPipeline:
self.parse_align_sample_file(samples, final_sample_file)
config = {
"wf_type": "align",
"wdir": self.wdir,
"reference": os.path.join(ref_dir, os.path.basename(reference)),
"sample_file_align": final_sample_file
"sample_file": final_sample_file
}
with open(self.config_file, "w") as config_file:
......@@ -960,6 +999,7 @@ class CnvPipeline:
with open(self.config_file, "r") as cfg_file:
conf = yaml.load(cfg_file, Loader=yaml.FullLoader)
conf["wf_type"] = "mergebatches"
conf["sample_file"] = sample_file_final
with open(self.config_file, "w") as cfg_file:
yaml.dump(conf, cfg_file, default_flow_style=False)
......@@ -1106,7 +1146,7 @@ class CnvPipeline:
self._fail("Indexing reference fasta file failed")
return False
elif not os.path.exists(final_fai):
self._fail("Working directory is corrupted: reference.fasta.fai not found in the existing working "
self._fail("Working directory is corrupted: reference_raw.fasta.fai not found in the existing working "
"directory")
return False
......@@ -1143,9 +1183,10 @@ class CnvPipeline:
if proba_dup > 0:
variant_types.append("DUP")
config = {
"wf_type": "simulation",
"wdir": self.wdir,
"nb_inds": nb_inds,
"reference": "reference.fasta",
"reference": "reference_raw.fasta",
"coverage": coverage,
"force_polymorphism": force_polymorphism,
"haploid": haploid,
......@@ -1162,9 +1203,9 @@ class CnvPipeline:
"freq_inv": freq_inv,
"freq_dup": freq_dup,
"max_try": max_try,
"sample_file_align": samples_file_fq,
"sample_file_align": samples_file_fq, # NOT used
"tools": tools,
"sample_file": samples_file_d,
"sample_file": samples_file_d, # NOT used
"batches": -1,
"parallel_batches": -1,
"start_batch_nb": 1,
......@@ -1276,20 +1317,21 @@ class CnvPipeline:
raise ValueError("Type of workflow not set")
else:
raise ValueError("Type of workflow not supported: %s" % self.wf_type)
print("Cleaned!")
return 0
except ValueError as e:
if self.debug:
traceback.print_exc()
self._fail(str(e))
print("Cleaned!")
def soft_clean(self, fake=False):
"""
Clean only lof files
Clean only log files
:param fake: if True, only show deleted files without removing them
"""
print("Cleaning...")
logs = glob(os.path.join(self.wdir, "**", "logs"), recursive=True)
for log in logs:
if os.path.isdir(log):
......@@ -1330,6 +1372,7 @@ if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Run cnv pipelines")
subparsers = parser.add_subparsers(metavar="command")
subparsers.required = True
def check_min_size_0(value):
ivalue = int(value)
......@@ -1473,9 +1516,10 @@ if __name__ == "__main__":
simul_parser.add_argument("--max-try", help="Maximum of tries", default=10, type=check_min_size_0)
simul_parser.add_argument("-g", "--genotypes", help="Position of SV with genotypes of individuals")
simul_parser.add_argument('-t', '--tools', type=str, required=True, help="Tools to launch, space separated",
nargs="+", choices=TOOLS)
nargs="+", choices=TOOLS)
simul_parser.add_argument('-sp', '--species', type=str, required=False,
help="[refbundle] Species name, according to the NCBI Taxonomy database")
help="[refbundle] Species name, according to the NCBI Taxonomy database",
default="human")
simul_parser.add_argument('-mn', '--max-n-stretches', type=int, required=False, default=100,
help="[refbundle] Max size of N stretches to consider them as such")
simul_parser.add_argument('--overlap-cutoff', type=float, default=0.5,
......@@ -1537,7 +1581,10 @@ if __name__ == "__main__":
args = parser.parse_args()
try:
func = args.func
except AttributeError:
parser.error("too few arguments, use -h for help")
if args.func == "run_detection" and args.refbundle is None and "genomestrip" in args.tools:
print("Error: --refbundle option is required for genomestrip")
......
......@@ -135,7 +135,7 @@ class AppConfigReader:
batch_type = self._get_batch_system_type()
if batch_type == "slurm":
return (" --mem-per-cpu={cluster.mem}000 --mincpus={threads} --time={cluster.time} -J {cluster.name}"
" -N 1=1")
" -N 1")
elif batch_type == "sge":
return " -l mem={cluster.mem}G -l h_vmem={cluster.mem}G -pe parallel_smp {threads} -N {cluster.name}"
return None
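As an illustration, with hypothetical values from cluster.yaml (mem: 8, time: "36:00:00", name: delly) and 4 threads, the corrected slurm template above would expand to:

```sh
# hypothetical expansion of the slurm native submission options
--mem-per-cpu=8000 --mincpus=4 --time=36:00:00 -J delly -N 1
```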
......
......@@ -14,12 +14,13 @@ dependencies:
- bwa>=0.7.15
- configargparse>=0.12.0
- delly>=0.8.1
- docopt>=0.6.2
- docutils>=0.14
- drmaa>=0.7.9
- exonerate>=2.4.0
- genomestrip>=2.0
- joblib>=0.13.2
- jupyter>=1.0.0
- numpy>=1.13.3
- openjdk>=8.0.121
- pandas>=0.22.0
......@@ -31,10 +32,12 @@ dependencies:
- python>=3.6.2
- pyvcf>=0.6.8
- pyyaml>=3.12
- r-base>=3.5.1
- scipy>=1.0.1
- snakemake>=3.13.3
- termcolor>=1.1.0
- trf>=4.09
- upsetplot>=0.3.0
- vcftools>=0.1.16
- pip>=19.1.1
- pip:
......
chmod u+x snakecnv/bin/*.py
chmod u+x popsim/build_variant_pop.py
Subproject commit 3e4317eaa8e300b6cb8332dfad36d9f3519d2c01
Subproject commit d608a847a09da7362057e8fd101cb66cff72a34d
%% Cell type:raw id: tags:
<script>
function code_toggle() {
if (code_shown){
$('div.input').hide('500');
$('#toggleButton').val('Show Code')
} else {
$('div.input').show('500');
$('#toggleButton').val('Hide Code')
}
code_shown = !code_shown
}
$( document ).ready(function(){
code_shown=false;
$('div.input').hide()
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
%% Cell type:code id: tags:
``` python
import sys
import glob
import os
import json
from collections import defaultdict,Counter
import math
import pandas as pd
import plotly as py
import numpy as np
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# vcf parser and associated utilities
# VariantFile is the pysam vcf file handler returning a pysam vcf record
from pysam import VariantFile
from pybedtools import BedTool, create_interval_from_list
# For the http request to jvenn
from IPython.display import display, HTML
py.offline.init_notebook_mode(connected=True)
from os import listdir
from os.path import isfile, join
```
%% Output
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
# The rule file specifying the simulation protocol (size of variants and associated proportion)
rules = pd.read_csv("rules.sim", sep='\s+',header=None, names=['type','min_size','max_size','proportion','cum'])
rules[['type','min_size','max_size','proportion']]
```
%% Output
type min_size max_size proportion
0 DEL 100 500 0.2
1 DEL 500 1000 0.2
2 DEL 1000 2000 0.3
3 DEL 2000 10000 0.2
4 DEL 10000 50000 0.1
%% Cell type:code id: tags:
``` python
```
......@@ -2,7 +2,6 @@
include: "common.smk"
cmd = CommandFactory.factory(config)
workdir: config['wdir']
......
class SnakemakeUtils:
def __init__(self, samples):
self.samples = samples
def get_inputs_fastq(self, wildcards):
fastq = []
files = self.samples[wildcards.sample]
fastq.append(files["reads1"])
if "reads2" in files:
fastq.append(files["reads2"])
return {"fastq": fastq}
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""This script annotates a vcf file"""
from svrunner_utils import eprint
from svreader.annotation import AnnotateReader, AnnotateWriter
from svreader.annotation import redundancy_annotator
from svreader.annotation import get_connected_duplicates
from svreader.annotation import variant_filtration, set_supporting_tools
from svreader.annotation import rename_variants
def filtering_by_support(variant_set, minsupp_reads=3):
"""
Add the LOWSUPPORT filter if the number of supporting reads is not at
least minsupp_reads for at least one sample
"""
for variant in variant_set:
info = variant.record.info
if info["MAX_SUPP_READS"] < minsupp_reads:
variant.filter.add("LOWSUPPORT")
def annotate(inputfile, outputfile, genotyper):
""" Filtering the candidate CNVs according to the following criteria
- non duplicate sites
- variant sites
- call rate > 0.8
- at least one variant (homozygous or heterozygous) has a
genotype quality > 20
- the variant is not everywhere heterozygous or homozygous
(use NONVARIANTSCORE in both cases)
- sv with no overlap with CNVR filtered out by genomeSTRIP
"""
# Reading the vcf file
variant_set = []
eprint(" Reading file %s" % (inputfile))
reader = AnnotateReader(inputfile)
for record in reader:
variant_set.append(record)
eprint("Working with " + str(len(variant_set)) + " records")
# Annotations
num = 1
for variant in variant_set:
# PASS, and '.' become both PASS
variant.unify_pass_filtertag()
variant.add_supporting_infos()
variant.set_qual()
# Redundancy annotation
redundancy_annotator(variant_set, reader, overlap_cutoff=0.5,
genotyper=genotyper)
# Now genomeSTRIP inspired variant filtration
if reader.numSamples() > 0:
variant_filtration(variant_set, reader)
# Now supporting info filtering
filtering_by_support(variant_set)
# Constructing connected component of duplicates
get_connected_duplicates(variant_set)
set_supporting_tools(variant_set)
# Rename sv
rename_variants(variant_set)
vcf_fout = AnnotateWriter(outputfile, reader)
num = 1
for record in sorted(variant_set, key=lambda k: k.start):
vcf_fout.write(record)
num += 1
vcf_fout.close()
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(
prog="filter.py",
description="Filter by applying flags to variants")
parser.add_argument("-i", "--input-vcf", required=True,
help="Genotypes vcf input file")
parser.add_argument("-o", "--output-vcf", required=True,
help="Filtered vcf output file")
parser.add_argument("-g", "--genotyper", required=True,
help="Genotyper tool")
args = parser.parse_args()
annotate(inputfile=args.input_vcf,
outputfile=args.output_vcf,
genotyper=args.genotyper)
../../popsim/build_results.py
\ No newline at end of file