Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (139)
Showing 2936 additions and 2315 deletions
......@@ -5,3 +5,4 @@ application.properties
**/*.pyo
**/.snakemake
**/old/*
**/.ipynb_checkpoints
[submodule "snakecnv/lib"]
path = snakecnv/lib
url = git@forgemia.inra.fr:svdetection/svlib.git
branch=master
url = https://forgemia.inra.fr/svdetection/svlib.git
branch = master
[submodule "popsim"]
path = popsim
url = git@forgemia.inra.fr:svdetection/popsim.git
url = https://forgemia.inra.fr/svdetection/popsim.git
branch = master
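The submodule URLs above switch from SSH to HTTPS. For an existing clone, the standard way to pick up such a change is to resync the submodules (these are ordinary git commands, not part of this diff):

```sh
git submodule sync --recursive   # rewrite .git/config with the new URLs
git submodule update --init --recursive
```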
......@@ -3,6 +3,7 @@ Cnvpipelines workflow
Cnvpipelines is a workflow to detect Copy Number Variation (CNV) variants (DEL, INV, MULTICOPY). It's built as a Snakemake workflow with a wrapper to ease running jobs.
## In development...
This workflow is still in development. For now, only DEL and INV variants are available. Also, genotyping with genomestrip still fails and is disabled for now.
......@@ -23,7 +24,7 @@ and additionnal third party softwares. All those sofware will be installed autom
Clone this repository:
git clone --recursive https://forgemia.inra.fr/genotoul-bioinfo/cnvpipelines.git
Use conda to install the third party software.
```sh
......@@ -34,20 +35,22 @@ $ conda env create --file environment.yaml
#### Load new conda environment
```sh
source activate cnvpipeline
conda activate cnvpipeline
```
### Install for simulations
To do simulations, you need to compile pirs, which is included as a submodule of your cnvpipelines installation. You must use a recent gcc compiler (for example, on genologin: module load compiler/gcc-7.2.0).
Then cd into `cnvpipelines/popsim/pirs` and run make:
module load compiler/gcc-7.2.0 # for genologin users
make
## Configuration
Copy `application.properties.example` into `application.properties` and make appropriate changes.
Sections and parameters are described below.
Sections and parameters are described below (on the genologin server you can simply copy `application.properties.genologin` into `application.properties`).
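A minimal sketch of the two options (both file names ship with the repository):

```sh
# generic setup: start from the example file, then edit it
cp application.properties.example application.properties

# on the genologin server: use the ready-made configuration
cp application.properties.genologin application.properties
```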
### Global section
......@@ -69,12 +72,11 @@ This section must be filled only if you don't use local as batch system type (se
### Reference bundle section
* *repeatmasker_lib_path*: PATH to the RepBase Libraries folder for RepeatMasker, if needed by RepeatMasker installation
* *repeatmasker_lib_path*: PATH to the RepBase Libraries folder for RepeatMasker. This is not mandatory (see Install RepeatMasker Libraries in http://www.repeatmasker.org/RMDownload.html).
### Genotoul specific configuration
### Genotoul (genologin) specific configuration
* Use the delly module specified
* Ue the drmaa library specified
* Use the drmaa library specified
## Run
......@@ -86,7 +88,7 @@ This section must be filled only if you don't use local as batch system type (se
##### Command
./cnvpipelines.py run refbundle -r {fasta} -s {species} -w {working_dir}
With:
`fasta`: the path to the reference fasta file
`species`: species name, according to the [NCBI Taxonomy database](http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html)
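For example, a hypothetical invocation (the fasta path and working directory are placeholders):

```sh
./cnvpipelines.py run refbundle -r genome.fasta -s "bos taurus" -w work/refbundle
```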
......@@ -111,15 +113,15 @@ With:
##### Command
./cnvpipelines.py run align -r {fasta} -s {samples} -w {working_dir}
With:
`fasta`: the path to the reference fasta file
`samples`: a YAML file describing, for each sample, its name and fastq files (reads1 and, optionally, reads2). Example:
Sample_1:
reads1: /path/to/reads_1.fq.gz
reads2: /path/to/reads_2.fq.gz
Sample_2:
reads: /path/to/reads.fq.gz
Where `Sample_1` and `Sample_2` are the sample names.
......@@ -127,7 +129,7 @@ Where `Sample_1` and `Sample_2` are samples name.
`working_dir`: the folder in which to store data
##### Optional arguments
`-p`: for each rule, show the shell command run.
`-n`: dry run: show which rules would be launched without running anything.
`--keep-wdir`: in dry run mode, don't remove the working dir after the launch
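Assuming the sample YAML shown above is saved as `samples.yaml`, a hypothetical invocation (reference path and working directory are placeholders):

```sh
./cnvpipelines.py run align -r genome.fasta -s samples.yaml -w work/align
```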
......@@ -143,7 +145,7 @@ Where `Sample_1` and `Sample_2` are samples name.
##### Command
./cnvpipelines.py run detection -r {fasta} -s {samples} -w {working_dir} -t {tools}
With:
`fasta`: the path to the fasta file (with all files of the reference bundle in the same folder).
`samples`: a file with, on each line, the path to a bam file to analyse.
......@@ -151,7 +153,7 @@ With:
`tools`: list of tools, space separated. Choose among genomestrip, delly, lumpy and pindel.
##### Optional arguments
`-r rb`: the refbundle directory, required when using genomestrip
`-b INT`: size of batches (default: -1, to always make only 1 batch)
`--chromosomes CHRS`: list of chromosomes to study, space separated. Regex accepted (using the [python syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)). Default: all valid chromosomes of the reference
`--force-all-chromosomes`: ignore filtering if `--chromosomes` is not set
......@@ -165,13 +167,13 @@ With:
`--out-step STEP`: specify the output rule file to only run the workflow until the associated rule (runs the whole workflow if not specified)
`--cluster-config FILE`: path to a cluster config file (overrides the cluster config present in the configuration)
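Putting the pieces together, a hypothetical detection run (paths, sample names and tool choice are placeholders; `--chromosomes` takes a Python regex as described above):

```sh
# bams.list contains one bam path per line, e.g.:
#   /path/to/Sample_1.bam
#   /path/to/Sample_2.bam
./cnvpipelines.py run detection -r genome.fasta -s bams.list -w work/detection \
    -t delly lumpy --chromosomes "chr[0-9]+"
```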
#### Merge batches
##### Command
./cnvpipelines.py run mergebatches -w {working_dir}
With:
`working_dir`: the detection run output folder. Must have at least 2 batches to run correctly for now.
......@@ -193,7 +195,7 @@ We integrate to cnvpipelines popsim, our tool to simulate a population with vari
##### Command
./cnvpipelines.py run simulation -nb {nb_inds} -r {reference} -sp {species} -t {tools} -w {working_dir}
With:
`nb_inds`: number of individuals to generate.
`reference`: fasta file to use as reference for the simulated individuals.
......@@ -202,7 +204,7 @@ With:
`working_dir`: the folder in which to store data.
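For example, a hypothetical invocation (all values are placeholders):

```sh
./cnvpipelines.py run simulation -nb 10 -r genome.fasta -sp "bos taurus" -t delly lumpy -w work/simulation
```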
##### Description of variants
`-s {svlist}`: a file describing the size distributions of variants. If not given, a default distribution is used.
Structure of the file (tab separated columns):
>DEL minLength maxLength proba -> Create DELetion(s).
......@@ -266,7 +268,7 @@ Example:
##### Command
./cnvpipelines.py rerun -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......@@ -288,7 +290,7 @@ With:
#### Full clean
./cnvpipelines.py clean -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......@@ -299,7 +301,7 @@ With:
#### Soft clean
./cnvpipelines.py soft-clean -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......@@ -312,7 +314,7 @@ With:
If snakemake crashes or is killed, the workflow may remain locked. You can unlock it with:
./cnvpipelines.py unlock -w {working_dir}
With:
`working_dir`: the folder in which data is stored
......
[global]
# batch system type: local, slurm or sge
# for genologin use slurm
batch_system_type = slurm
# list of modules to load (space separated)
modules =
# path environment to prepend (":" separated)
paths =
# number of concurrent jobs to launch
jobs = 100
# svtoolkit home: genomestrip is now installed via conda
# sv_dir =
[cluster]
# Ignore these options for local batch system type
# submission mode: drmaa or cluster
submission_mode = drmaa
# cluster submission command (ignore for DRMAA submission mode)
submission_command =
# DRMAA lib (ignore for cluster submission mode)
# For genologin use /tools/libraries/slurm-drmaa/slurm-drmaa-1.0.7/lib/libdrmaa.so
drmaa = /tools/libraries/slurm-drmaa/slurm-drmaa-1.0.7/lib/libdrmaa.so
# native submission options: keep the default in most cases
native_submission_options = ###DEFAULT###
# cluster config file
config = ###PROGRAM###/cluster.yaml
# enable parallelisation through DRMAA for GenomeSTRIP (may exceed the configured number of concurrent jobs)
# True: enabled, False: disabled
sv_parallel_drmaa = True
sv_parallel_native_specs = ###DEFAULT###
[refbundle]
# Path to the Libraries folder extracted from RepBase archive file
# Optional (will use default libraries instead) but highly recommended :-)
repeatmasker_lib_path = /save/faraut/Libraries
......@@ -8,11 +8,12 @@ delly:
time: "36:00:00"
lumpy:
mem: 12
mem: 24
time: "36:00:00"
pindel:
mem: 16
time: "24:00:00"
time: "36:00:00"
genomestrip:
mem: 8
......@@ -46,6 +47,11 @@ repeatmask:
mem: 8
h_vmem: 8
buildpop:
time: "48:00:00"
mem: 8
h_vmem: 8
bed2fasta:
mem: 16
h_vmem: 16
......
......@@ -13,14 +13,13 @@ import traceback
import math
import time
import sys
#from termcolor import colored, cprint
from datetime import datetime
from pathlib import Path
from subprocess import run
TOOLS = ["lumpy", "delly", "pindel", "genomestrip"]
VTYPES = ["DEL", "INV", "DUP", "mCNV"]
MIN_CHR_SIZE = 50000
MIN_CHR_SIZE = 500000
MD5_TEST = {
"ERR470101_f_1.fq.gz": "2c30600e303c9bfb7b9538ec76055203",
......@@ -87,8 +86,20 @@ class CnvPipeline:
return hash_md5.hexdigest()
def _write_samples(self, samples, sample_file):
with open(sample_file, "w") as final_samples:
final_samples.write("\n".join(samples) + "\n")
with open(sample_file, "w") as samples_fh:
samples_fh.write("\n".join(samples) + "\n")
# self._write_samples_bam_tsv(samples, sample_file)
def _write_samples_bam_tsv(self, samples, sample_file):
# write in addition a tsv file with sample names and bams
# we have therefore two files: sample list and sample tsv
# (with bam file path)
samples_tsv = os.path.splitext(sample_file)[0] + ".tsv"
with open(samples_tsv, "w") as samples_bam:
samples_bam.write("sample\tbam\n")
for sample in samples:
bam = "data/bams/" + sample + ".bam"
samples_bam.write("%s\t%s\n" % (sample, bam))
def _link_bams_wdir(self, samples_file, bams_dir, smple_file, nb_samples):
"""
......@@ -107,6 +118,7 @@ class CnvPipeline:
with open(samples_file, "r") as samples:
samples_name = []
all_samples_name = []
all_bam_files = []
# Ignore previous lines:
added_samples = 0
n_block = 1
......@@ -165,18 +177,20 @@ class CnvPipeline:
samples_name.append(id_sample)
all_samples_name.append(id_sample)
if added_samples == nb_samples:
# write the block samples name
self._write_samples(samples_name, smple_file + "." + str(n_block))
samples_name = []
added_samples = 0
n_block += 1
if len(samples_name) > 0: ## Sample names for the batch
if len(samples_name) > 0: # Sample names for the batch
self._write_samples(samples_name, smple_file + "." + str(n_block))
if len(all_samples_name) > 0: ## All samples names
if len(all_samples_name) > 0: # All samples names
self._write_samples(all_samples_name, smple_file)
@staticmethod
def _fail(message):
"""
......@@ -262,15 +276,15 @@ class CnvPipeline:
def _get_snakefile(self):
try:
if self.wf_type == "detection":
snakefile = "detection.snk"
snakefile = "detection.smk"
elif self.wf_type == "refbundle":
snakefile = "referencebundle.snk"
snakefile = "referencebundle.smk"
elif self.wf_type == "align":
snakefile = "align.snk"
snakefile = "align.smk"
elif self.wf_type == "mergebatches":
snakefile = "mergebatches.snk"
snakefile = "mergebatches.smk"
elif self.wf_type == "simulation":
snakefile = "popsim.snk"
snakefile = "simulation.smk"
elif self.wf_type == "test":
snakefile = "test.snk"
elif self.wf_type == "unknown":
......@@ -423,6 +437,7 @@ class CnvPipeline:
:param reason: if True, show reason of each executed rule
:type reason: bool
"""
if self.isrerun:
run(["touch", self.config_file, "-d", self.config_timestamp])
if mode == "rerun" and self.wf_type == "detection" and len(snake_args) == 0:
......@@ -431,27 +446,31 @@ class CnvPipeline:
del local_vars["snake_args"]
return self.run_or_rerun_detection(**local_vars)
# Fix snakemake bug: set reference file timestamps to the config.yaml timestamp value
if (mode == "rerun" or self.isrerun) and (self.wf_type == "simulation" or self.wf_type == "refbundle"):
reference = os.path.join(self.wdir, "reference.fasta")
fixed_date = datetime.fromtimestamp(os.stat(self.config_file).st_ctime).strftime("%Y-%m-%d %H:%M:%S")
if os.path.exists(reference):
run(["touch", reference, "-d", fixed_date])
fai = reference + ".fai"
if os.path.exists(fai):
run(["touch", fai, "-d", fixed_date])
if self.wf_type == "simulation":
reference = os.path.join(self.wdir, "refbundle", "reference.fasta")
if os.path.exists(reference):
run(["touch", reference, "-d", fixed_date])
fai = reference + ".fai"
if os.path.exists(fai):
run(["touch", fai, "-d", fixed_date])
# if (mode == "rerun" or self.isrerun) and (self.wf_type == "simulation" or self.wf_type == "refbundle"):
# #reference = os.path.join(self.wdir, "reference.fasta")
# reference = os.path.join(self.wdir, "reference_raw.fasta")
# fixed_date = datetime.fromtimestamp(os.stat(self.config_file).st_ctime).strftime("%Y-%m-%d %H:%M:%S")
# if os.path.exists(reference):
# run(["touch", "-h", reference, "-d", fixed_date])
# fai = reference + ".fai"
# if os.path.exists(fai):
# run(["touch", "-h", fai, "-d", fixed_date])
#
# if self.wf_type == "simulation":
# reference = os.path.join(self.wdir, "refbundle", "reference.fasta")
# if os.path.exists(reference):
# run(["touch", reference, "-d", fixed_date])
# fai = reference + ".fai"
# if os.path.exists(fai):
# run(["touch", fai, "-d", fixed_date])
if merge_batches:
if self.wf_type != "detection":
self._fail("Merge batches can only be done on a detection workflow")
self._wf_type = "mergebatches"
try:
if not os.path.exists(self.config_file):
raise FileNotFoundError("File not found: %s" % self.config_file)
......@@ -547,14 +566,12 @@ class CnvPipeline:
do_delete_after_dryrun = self.delete_wdir_after_dryrun
for n_block in blocks:
sample_file = os.path.join(self.wdir, "samples.list.%d" % n_block)
with open(sample_file, "r") as smple_file:
samples_name = [x.rstrip() for x in smple_file if x.rstrip() != ""]
block_done = os.path.join(self.wdir, ".done%d" % n_block)
if not os.path.exists(block_done):
print("Launching block of samples number %d..." % n_block, flush=True)
print("Samples:", flush=True)
......@@ -563,11 +580,11 @@ class CnvPipeline:
# if size_pbatches > 0 and n_block > 1:
# start_batch_nb = ((n_block - 1) * batches_parallel) + 1
# Run:
self.delete_wdir_after_dryrun = False
exit_status = self.run_or_rerun(snake_args={"start_batch_nb": start_batch_nb,
"sample_file": sample_file}, **kwargs)
"sample_file": sample_file},
**kwargs)
if exit_status == 0 and ("dryrun" not in kwargs or not kwargs["dryrun"]):
Path(block_done).touch()
......@@ -749,6 +766,7 @@ class CnvPipeline:
size_pbatches = -1 if batches == -1 else batches * batches_parallel
config = {
"wf_type": "detection",
"wdir": self.wdir,
"reference": "reference.fasta",
"tools": tools,
......@@ -836,6 +854,8 @@ class CnvPipeline:
chromosomes = self._chromosomes_filtered(final_fai, chromosomes)
config = {
"wf_type": "refbundle",
"reference": "reference_raw.fasta",
"wdir": self.wdir,
"species": species,
"read_len": read_length,
......@@ -856,9 +876,9 @@ class CnvPipeline:
traceback.print_exc()
self._fail(str(e))
def parse_align_sample_file(self, samples, final_samples):
with open(samples, "r") as old_s, open(final_samples, "w") as new_s:
final_samples = {}
def parse_align_sample_file(self, input_sample_file, valid_sample_file):
with open(input_sample_file, "r") as old_s, open(valid_sample_file, "w") as new_s:
valid_samples = {}
try:
y_samples = yaml.load(old_s, Loader=yaml.FullLoader)
except yaml.YAMLError:
......@@ -869,20 +889,38 @@ class CnvPipeline:
if not isinstance(y_samples, dict):
raise ValueError("Error: sample file is not valid")
for sample, files in y_samples.items():
final_samples[sample] = {}
valid_samples[sample] = {}
if not isinstance(files, dict):
raise ValueError("Error: sample file is not valid. File line must be: `reads1: /path/to/file`. "
"May be you forget a space?"
"Maybe you forgot a space?"
for name, file in files.items():
if name not in ["reads1", "reads2"]:
raise ValueError("Invalid sample \"%s\" in sample file: %s" % (sample, input_sample_file))
elif name in final_samples[sample]:
elif name in valid_samples[sample]:
raise ValueError("Invalid sample \"%s\" in sample file: %s" % (sample, input_sample_file))
if not os.path.exists(file) or not os.path.isfile(file):
raise ValueError("Sample \"%s\": file \"%s\" does not exist or is not a file" %
(sample, file))
final_samples[sample][name] = os.path.abspath(file)
yaml.dump(final_samples, new_s, default_flow_style=False)
valid_samples[sample][name] = os.path.abspath(file)
yaml.dump(valid_samples, new_s, default_flow_style=False)
# We dump here also a tsv file
# FIXME keep a single sample file in work dir (the tsv file)
# FIXME remove the so called valid_sample_file yaml file
#self._dump_tsv_sample_file(valid_samples, valid_sample_file)
def _dump_tsv_sample_file(self, valid_samples, valid_sample_file):
samples_tsv = os.path.splitext(valid_sample_file)[0] + ".tsv"
with open(samples_tsv, "w") as samples_fh:
samples_fh.write("sample\treads\n")
for sample in valid_samples:
sample_fastqs = self._get_fastq_path(valid_samples[sample])
samples_fh.write("%s\t%s\n" % (sample, ",".join(sample_fastqs)))
def _get_fastq_path(self, sample_fastq):
if len(sample_fastq) == 1:
return [sample_fastq["reads"]]
else:
return [sample_fastq["reads1"], sample_fastq["reads2"]]
def run_align(self, reference, samples, force_wdir=False, **kwargs):
"""
......@@ -922,9 +960,10 @@ class CnvPipeline:
self.parse_align_sample_file(samples, final_sample_file)
config = {
"wf_type": "align",
"wdir": self.wdir,
"reference": os.path.join(ref_dir, os.path.basename(reference)),
"sample_file_align": final_sample_file
"sample_file": final_sample_file
}
with open(self.config_file, "w") as config_file:
......@@ -960,6 +999,7 @@ class CnvPipeline:
with open(self.config_file, "r") as cfg_file:
conf = yaml.load(cfg_file, Loader=yaml.FullLoader)
conf["wf_type"] = "mergebatches"
conf["sample_file"] = sample_file_final
with open(self.config_file, "w") as cfg_file:
yaml.dump(conf, cfg_file, default_flow_style=False)
......@@ -1106,7 +1146,7 @@ class CnvPipeline:
self._fail("Indexing reference fasta file failed")
return False
elif not os.path.exists(final_fai):
self._fail("Working directory is corrupted: reference.fasta.fai not found in the existing working "
self._fail("Working directory is corrupted: reference_raw.fasta.fai not found in the existing working "
"directory")
return False
......@@ -1143,9 +1183,10 @@ class CnvPipeline:
if proba_dup > 0:
variant_types.append("DUP")
config = {
"wf_type": "simulation",
"wdir": self.wdir,
"nb_inds": nb_inds,
"reference": "reference.fasta",
"reference": "reference_raw.fasta",
"coverage": coverage,
"force_polymorphism": force_polymorphism,
"haploid": haploid,
......@@ -1162,9 +1203,9 @@ class CnvPipeline:
"freq_inv": freq_inv,
"freq_dup": freq_dup,
"max_try": max_try,
"sample_file_align": samples_file_fq,
"sample_file_align": samples_file_fq, # NOT used
"tools": tools,
"sample_file": samples_file_d,
"sample_file": samples_file_d, # NOT used
"batches": -1,
"parallel_batches": -1,
"start_batch_nb": 1,
......@@ -1276,20 +1317,21 @@ class CnvPipeline:
raise ValueError("Type of workflow not set")
else:
raise ValueError("Type of workflow not supported: %s" % self.wf_type)
print("Cleaned!")
return 0
except ValueError as e:
if self.debug:
traceback.print_exc()
self._fail(str(e))
print("Cleaned!")
def soft_clean(self, fake=False):
"""
Clean only lof files
Clean only log files
:param fake: if True, only show deleted files without removing them
"""
print("Cleaning...")
logs = glob(os.path.join(self.wdir, "**", "logs"), recursive=True)
for log in logs:
if os.path.isdir(log):
......@@ -1330,6 +1372,7 @@ if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Run cnv pipelines")
subparsers = parser.add_subparsers(metavar="command")
subparsers.required = True
def check_min_size_0(value):
ivalue = int(value)
......@@ -1473,9 +1516,10 @@ if __name__ == "__main__":
simul_parser.add_argument("--max-try", help="Maximum of tries", default=10, type=check_min_size_0)
simul_parser.add_argument("-g", "--genotypes", help="Position of SV with genotypes of individuals")
simul_parser.add_argument('-t', '--tools', type=str, required=True, help="Tools to launch, space separated",
nargs="+", choices=TOOLS)
nargs="+", choices=TOOLS)
simul_parser.add_argument('-sp', '--species', type=str, required=False,
help="[refbundle] Species name, according to the NCBI Taxonomy database")
help="[refbundle] Species name, according to the NCBI Taxonomy database",
default="human")
simul_parser.add_argument('-mn', '--max-n-stretches', type=int, required=False, default=100,
help="[refbundle] Max size of N stretches to consider them as such")
simul_parser.add_argument('--overlap-cutoff', type=float, default=0.5,
......@@ -1537,7 +1581,10 @@ if __name__ == "__main__":
args = parser.parse_args()
try:
func = args.func
except AttributeError:
parser.error("too few arguments, use -h for help")
if args.func == "run_detection" and args.refbundle is None and "genomestrip" in args.tools:
print("Error: --refbundle option is required for genomestrip")
......
......@@ -135,7 +135,7 @@ class AppConfigReader:
batch_type = self._get_batch_system_type()
if batch_type == "slurm":
return (" --mem-per-cpu={cluster.mem}000 --mincpus={threads} --time={cluster.time} -J {cluster.name}"
" -N 1=1")
" -N 1")
elif batch_type == "sge":
return " -l mem={cluster.mem}G -l h_vmem={cluster.mem}G -pe parallel_smp {threads} -N {cluster.name}"
return None
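As an illustration, with hypothetical values from cluster.yaml (mem: 8, time: "36:00:00", name: delly) and 4 threads, the corrected slurm template above would expand to:

```sh
# hypothetical expansion of the slurm native submission options
--mem-per-cpu=8000 --mincpus=4 --time=36:00:00 -J delly -N 1
```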
......
......@@ -14,12 +14,13 @@ dependencies:
- bwa>=0.7.15
- configargparse>=0.12.0
- delly>=0.8.1
- docopt>=0.6.2
- docutils>=0.14
- drmaa>=0.7.9
- exonerate>=2.4.0
- genomestrip>=2.0
- joblib>=0.13.2
- jupyter>=1.0.0
- numpy>=1.13.3
- openjdk>=8.0.121
- pandas>=0.22.0
......@@ -31,10 +32,12 @@ dependencies:
- python>=3.6.2
- pyvcf>=0.6.8
- pyyaml>=3.12
- r-base>=3.5.1
- scipy>=1.0.1
- snakemake>=3.13.3
- termcolor>=1.1.0
- trf>=4.09
- upsetplot>=0.3.0
- vcftools>=0.1.16
- pip>=19.1.1
- pip:
......
chmod u+x snakecnv/bin/*.py
chmod u+x popsim/build_variant_pop.py
Subproject commit 3e4317eaa8e300b6cb8332dfad36d9f3519d2c01
Subproject commit d608a847a09da7362057e8fd101cb66cff72a34d
%% Cell type:raw id: tags:
<script>
function code_toggle() {
if (code_shown){
$('div.input').hide('500');
$('#toggleButton').val('Show Code')
} else {
$('div.input').show('500');
$('#toggleButton').val('Hide Code')
}
code_shown = !code_shown
}
$( document ).ready(function(){
code_shown=false;
$('div.input').hide()
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
%% Cell type:code id: tags:
``` python
import sys
import glob
import os
import json
from collections import defaultdict,Counter
import math
import pandas as pd
import plotly as py
import numpy as np
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# vcf parser and associated utilities
# VariantFile is the pysam vcf file handler returning a pysam vcf record
from pysam import VariantFile
from pybedtools import BedTool, create_interval_from_list
# For the http request to jvenn
from IPython.display import display, HTML
py.offline.init_notebook_mode(connected=True)
from os import listdir
from os.path import isfile, join
```
%% Output
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
# The rule file specifying the simulation protocol (size of variants and associated proportion)
rules = pd.read_csv("rules.sim", sep='\s+',header=None, names=['type','min_size','max_size','proportion','cum'])
rules[['type','min_size','max_size','proportion']]
```
%% Output
type min_size max_size proportion
0 DEL 100 500 0.2
1 DEL 500 1000 0.2
2 DEL 1000 2000 0.3
3 DEL 2000 10000 0.2
4 DEL 10000 50000 0.1
%% Cell type:code id: tags:
``` python
```
......@@ -2,7 +2,6 @@
include: "common.smk"
cmd = CommandFactory.factory(config)
workdir: config['wdir']
......
class SnakemakeUtils:
def __init__(self, samples):
self.samples = samples
def get_inputs_fastq(self, wildcards):
fastq = []
files = self.samples[wildcards.sample]
fastq.append(files["reads1"])
if "reads2" in files:
fastq.append(files["reads2"])
return {"fastq": fastq}
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""This script annotates a vcf file"""
from svrunner_utils import eprint
from svreader.annotation import AnnotateReader, AnnotateWriter
from svreader.annotation import redundancy_annotator
from svreader.annotation import get_connected_duplicates
from svreader.annotation import variant_filtration, set_supporting_tools
from svreader.annotation import rename_variants
def filtering_by_support(variant_set, minsupp_reads=3):
"""
Add the LOWSUPPORT filter if the number of supporting reads is not at
least minsupp_reads for at least one sample
"""
for variant in variant_set:
info = variant.record.info
if info["MAX_SUPP_READS"] < minsupp_reads:
variant.filter.add("LOWSUPPORT")
def annotate(inputfile, outputfile, genotyper):
""" Filtering the candidate CNVs according to the following criteria
- non duplicate sites
- variant sites
- call rate > 0.8
- at least one variant (homozygous or heterozygous) has a
genotype quality > 20
- the variant is not everywhere heterozygous or homozygous
(use NONVARIANTSCORE in both cases)
- sv with no overlap with CNVR filtered out by genomeSTRIP
"""
# Reading the vcf file
variant_set = []
eprint(" Reading file %s" % (inputfile))
reader = AnnotateReader(inputfile)
for record in reader:
variant_set.append(record)
eprint("Working with " + str(len(variant_set)) + " records")
# Annotations
num = 1
for variant in variant_set:
# PASS, and '.' become both PASS
variant.unify_pass_filtertag()
variant.add_supporting_infos()
variant.set_qual()
# Redundancy annotation
redundancy_annotator(variant_set, reader, overlap_cutoff=0.5,
genotyper=genotyper)
# Now genomeSTRIP inspired variant filtration
if reader.numSamples() > 0:
variant_filtration(variant_set, reader)
# Now supporting info filtering
filtering_by_support(variant_set)
# Constructing connected component of duplicates
get_connected_duplicates(variant_set)
set_supporting_tools(variant_set)
# Rename sv
rename_variants(variant_set)
vcf_fout = AnnotateWriter(outputfile, reader)
num = 1
for record in sorted(variant_set, key=lambda k: k.start):
vcf_fout.write(record)
num += 1
vcf_fout.close()
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(
prog="filter.py",
description="Filter by applying flags to variants")
parser.add_argument("-i", "--input-vcf", required=True,
help="Genotypes vcf input file")
parser.add_argument("-o", "--output-vcf", required=True,
help="Filtered vcf output file")
parser.add_argument("-g", "--genotyper", required=True,
help="Genotyper tool")
args = parser.parse_args()
annotate(inputfile=args.input_vcf,
outputfile=args.output_vcf,
genotyper=args.genotyper)
../../popsim/build_results.py
\ No newline at end of file