# Template workflow readme
### Introduction
What is the purpose of the pipeline?
e.g.: It pre-processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details on the results.
```mermaid
graph TD
A(Raw reads <br><i>Fastq</i>) -- raw data report --> B((fastQC))
B --> C(report <br><i>HTML</i>)
A -- trimming --> D((cutadapt))
D --> E(trimmed reads <br><i>Fastq</i>)
E -- genome alignment --> F((STAR))
F --> G(aligned reads <br><i>BAM</i>)
```
The pipeline is built using Nextflow, a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.
### Which tools?
List of software:
- fastQC version XXX (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- cutadapt version YYY (https://github.com/marcelm/cutadapt)
- star version ZZZ (https://github.com/alexdobin/STAR)
### Where to use it?
This pipeline has been installed on the [Genotoul cluster](http://bioinfo.genotoul.fr/), so you can log on to genologin.toulouse.inra.fr and access it at /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf
### How to use it?
Command line to launch it on Genotoul ...
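For example, a hypothetical invocation (module names, paths and parameter values are illustrative and should be checked on the cluster; the parameter names come from main.nf below):

```bash
# load Nextflow, then launch the workflow with its main parameters
module load bioinfo/Nextflow
nextflow run /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf \
    --readfiles 'data/reads/*{1,2}.fastq.gz' \
    --genome data/reference/reference.fa \
    --annotation data/reference/reference.gtf \
    --outdir results/
```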
### Who is maintaining the workflow?
The Bios4Biol CATI members involved ...
Be aware that the people maintaining this workflow can be contacted in case of bugs, but not for bioinformatics explanations of your results.
### Credits
These scripts were copied from ....
Many thanks to others who have helped out along the way too, including (but not limited to):
[@loginTwitter](https://github.com/loginTwitter),...
---
params.outdir = "$baseDir/results/"
params.readfiles = "$baseDir/../../data/reads/*{1,2}.fastq.gz"
params.genome = "$baseDir/../../data/reference/reference.fa"
params.annotation = "$baseDir/../../data/reference/reference.gtf"
params.index = "$baseDir/../../index/"
outdir = params.outdir
readfiles = params.readfiles
genome = file(params.genome)
annotation = file(params.annotation)
index = params.index
Channel.fromPath(readfiles).set { reads }
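// QC: run FastQC on each raw fastq file and publish the HTML report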
process fastqc {
    cpus 1
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    file read from reads

    output:
    file '*html' into fastqc

    """
    fastqc --noextract ${read}
    """
}
Channel.fromFilePairs(readfiles).set { pair_reads }
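// Trimming: remove adapter sequences from each read pair with cutadapt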
process cutadapt {
    cpus 1

    input:
    set pair_id, file(read) from pair_reads

    output:
    set pair_id, 'clean_*' into clean_files1, clean_files2

    """
    cutadapt -a ACACTCTTTCCCTACACGACGCTCTTCCGATC -A GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -o clean_${pair_id}1.fastq.gz -p clean_${pair_id}2.fastq.gz ${read}
    """
}
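// Indexing: build the STAR genome index once and cache it via storeDir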
process indexGenome {
    storeDir index

    input:
    file ref from genome
    file annot from annotation

    output:
    file "${dbname}.star/*" into startDb1, startDb2

    script:
    dbname = ref.baseName
    """
    mkdir ${dbname}.star
    STAR --runMode genomeGenerate --genomeDir ${dbname}.star --genomeFastaFiles ${ref} --sjdbGTFfile ${annot} --sjdbOverhang 99 --runThreadN ${task.cpus}
    """
}
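// Mapping: align the trimmed read pairs with STAR, emitting unsorted BAM files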
process mapping {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files1
    val starDB from startDb1.first()

    output:
    file "*bam" into aligned_sorted_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outSAMtype BAM Unsorted --outFilterType BySJout
    """
}
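// Mapping (SAM variant): same alignment, but with STAR's default SAM output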
process mapping2 {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files2
    val starDB from startDb2.first()

    output:
    file "*sam" into aligned_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outFilterType BySJout
    """
}
#!/usr/bin/perl -w
# http://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ
use strict;
use File::Basename;
use List::MoreUtils qw( minmax );
# fastq_detect.pl fastq.file sample-size
# detect fastQ format from quality scores in fastQ input file
# Version 3
#
# Stephane Plaisance - VIB-BITS - July-04-2012
# Joachim Jacob - Aug-02-2012 - joachim.jacob@gmail.com
# - changed the maximum value of Sanger to 73
# - changed reading the file with a file handle
# (was a file handle !! supporting several archive formats. SP)
# - changed the diagnosing algorithm
# Stephane Plaisance - VIB-BITS - April-08-2013
# - merged both versions and corrected flaw in min/max
# thanks to Sergey Mitrfanov for perl reformatting
#####################################################################
# diagnose
# SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
# ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
# ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
# .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
# LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
# !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
# | | | | | |
# 33 59 64 73 104 126
# S 0........................26...31.......40
# X -5....0........9.............................40
# I 0........9.............................40
# J 3.....9.............................40
# L 0.2......................26...31........41
#
# S - Sanger Phred+33, raw reads typically (0, 40)
# X - Solexa Solexa+64, raw reads typically (-5, 40)
# I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
# J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold)
# L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
# - Illumina 1.9+ Phred+33, raw reads typically (3, 41)
#####################################################################
my $script = basename($0);
@ARGV > 0 or die "usage: $script <fastq file> <opt:sample-size (100)>\n";
my ($inputfile, $limit) = @ARGV;
if (! defined $limit) { $limit = 100}; # check first 100 records
my $cnt=0;
my ($min, $max); # global min and max values
my $z = ReadFile ($inputfile) || die "Error: cannot read from input file $inputfile: $!\n";
## parse
while (my $id = <$z>) {
    $id =~ m/^@/ || die "expected @ not found in line 1!\n";
    my $seq = <$z>;
    my $sep = <$z>;
    $sep =~ m/^\+/ || die "expected + not found in line 3!\n";
    my $qual = <$z>;
    chomp($qual);
    $cnt++;
    # char to ascii
    my @chars = split("", $qual);
    my @nums = sort { $a <=> $b } (map { unpack("C*", $_) } @chars);
    if ($cnt == 1) {
        ($min, $max) = minmax @nums;
    } else {
        my ($lmin, $lmax) = minmax @nums; # local values for this read
        $min = $lmin if $lmin < $min;
        $max = $lmax if $lmax > $max;
    }
    $cnt >= $limit && last;
}
undef $z;
## diagnose
my %diag=(
'Sanger' => '.',
'Solexa' => '.',
'Illumina 1.3+' => '.',
'Illumina 1.5+' => '.',
'Illumina 1.8+' => '.',
'Illumina 1.9+' => '.'
);
my %comment=(
'Sanger' => 'Phred+33',
'Solexa' => 'Solexa+64',
'Illumina 1.3+' => 'Phred+64',
'Illumina 1.5+' => 'Phred+64',
'Illumina 1.8+' => 'Phred+33',
'Illumina 1.9+' => 'Phred+33'
);
if ($min<33 || $max>105) { die "Quality values corrupt. found [$min; $max] where [33; 105] was expected\n"; }
if ($min>=33 && $max<=73) {$diag{'Sanger'}='x';}
if ($min>=59 && $max<=104) {$diag{'Solexa'}='x';}
if ($min>=64 && $max<=104) {$diag{'Illumina 1.3+'}='x';}
if ($min>=66 && $max<=105) {$diag{'Illumina 1.5+'}='x';}
if ($min>=33 && $max<=74) {$diag{'Illumina 1.8+'}='x';}
if ($min>=33 && $max<=75) {$diag{'Illumina 1.9+'}='x';}
## report
foreach my $format (sort keys %diag) {
    if ($diag{$format} eq 'x') {
        print STDOUT "# sampled raw quality values are in the range of [".$min."; ".$max."]\n";
        print STDOUT sprintf("%-30s\t$inputfile\n", $comment{$format});
    }
}
##############
#### Subs ####
# reads from uncompressed, gzipped and bgzip fastQ files
sub ReadFile {
    my $infile = shift;
    my $FH;
    if ($infile =~ /\.bz2$/) {
        open ($FH, "bzcat $infile |") or die ("$!: can't open file $infile");
    } elsif ($infile =~ /\.gz$/) {
        open ($FH, "zcat $infile |") or die ("$!: can't open file $infile");
    } elsif ($infile =~ /\.(fq|fastq|txt)$/) {
        open ($FH, "cat $infile |") or die ("$!: can't open file $infile");
    } else {
        die ("$!: do not recognise file type $infile");
    }
    return $FH;
}
# Allele Specific Expression Analysis workflow
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
**<u>!!! STATUS: DEVELOPMENT !!!</u>**
## What does it do?
This pipeline performs variant filtering, read alignment on a masked genome and allele-specific expression quantification.
The pipeline is represented in the graph below:
![](img/full_dryrun.png)
## 1. Input
The user must use the [config_calling.yaml](config_calling.yaml.example) file to provide all necessary inputs for the pipeline:
- A masked reference fasta file (fasta_ref option), indexed with samtools faidx and GATK4 CreateSequenceDictionary. We recommend masking the genome with variants filtered to be bi-allelic and on the GATK metrics FS < 30, QD > 2 and AF < 1. In our use case we masked the genome with variants called at the population level, and then launched this pipeline at the tissue level.
- A genome annotation reference GTF file (gtf_ref option)
- A set of variants to include in the ASE analysis (vcf option)
- A sample reads description TSV file (sample_config option), with the following columns (an example is sketched below):
  - idx : a unique index number to identify each fastq file (or pair of files)
  - name : the sample name
  - forward_read and reverse_read are fastq file names. These files need to be in the directory specified with the data_dir option
    - If single end, leave the reverse_read column empty
    - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
  - sequencer : the sequencer used to generate the reads: "ILLUMINA", "SLX", "SOLEXA", "SOLID", "454", "LS454", "COMPLETE", "PACBIO", "IONTORRENT", "CAPILLARY", "HELICOS", "UNKNOWN"
  - oriented : indicates whether the sequences were generated with a strand-specific protocol. It corresponds to the RSEM forward_prob parameter. Indicate:
    - 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
    - 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
    - 0.5 for a non-strand-specific protocol.
  - phred scale : indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 to 1.8 excluded)
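A minimal sample_config sketch (tab-separated; file names hypothetical, column set as documented in config_calling.yaml.example):

```
idx	name	forward_read	reverse_read	sequencer	read_length	phred_scale	group
1	samplePE	samplePE_R1.fastq.gz	samplePE_R2.fastq.gz	ILLUMINA	100	33	pop1
2	sampleSE	sampleSE.fastq.gz		ILLUMINA	100	33	pop1
```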
## 2. Data Processing
### Raw reads trimming
The first step filters and trims raw reads before alignment. It converts phred 64 scale fastq files into phred 33 scale fastq files, and then trims reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Adapters are trimmed, and trimmed reads shorter than a third of the read length are filtered out.
By default the trimming quality threshold is set to 15.
You will find in the Results/Summary directory a Log_TrimGalore_summary.tsv file that details the trim_galore statistics.
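For reference, the underlying call is roughly the sketch below (file names hypothetical; --length is a third of the read length from the sample table, --quality is the trimming_quality option):

```bash
trim_galore --paired --no_report_file --length 33 --quality 15 \
    -o Results/TrimGalore samplePE_R1.fastq.gz samplePE_R2.fastq.gz
```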
### Sequence alignment
Trimmed reads are aligned against the reference genome with [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537), following the multi-sample 2-pass mapping procedure and using the GTF reference file.
The second alignment pass takes into account the new junctions detected in the first pass. Across all samples, new canonical junctions covered by at least 2 reads in at least one sample are kept.
The STAR 2nd pass alignment generates bam files in genome coordinates.
You will find in the Results/Summary directory 2 files reporting alignment statistics: Log_STAR_Aln_1_summary.tsv and Log_STAR_Aln_2_summary.tsv
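Schematically, the two passes mirror the STAR_Aln_1 and STAR_Aln_2 rules (paths, thread count and file names hypothetical):

```bash
# pass 1: align and collect splice junctions (SJ.out.tab) per sample
STAR --runThreadN 8 --genomeDir Results/STAR_Index_1 \
    --readFilesIn samplePE_trim_R1.fastq.gz samplePE_trim_R2.fastq.gz \
    --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFilterType BySJout

# pass 2: re-align, injecting the junctions filtered across all samples
STAR --runThreadN 8 --genomeDir Results/STAR_Index_1 \
    --sjdbFileChrStartEnd Results/STAR_Index_1/SJ.out.filtered.tab \
    --readFilesIn samplePE_trim_R1.fastq.gz samplePE_trim_R2.fastq.gz \
    --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate
```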
### Genomic Alignment post processing
Alignments are filtered to keep only:
- properly paired alignments, in the case of paired libraries
- non-duplicated alignments (PCR or optical duplicates are removed)
- uniquely mapped reads (corresponding to a STAR mapping quality of 255)
Finally, reads that contain Ns in their CIGAR string are split, and mapping quality scores of 255 are converted to 60.
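A condensed sketch of this filtering chain, mirroring the commands in the alignment post-processing rules (file names hypothetical):

```bash
samtools view -h -f 2 sample.bam -bo sample_pp.bam             # keep properly paired alignments (PE only)
gatk MarkDuplicates -I sample_pp.bam -O sample_rmdup.bam -M sample.metrics.txt \
    --READ_NAME_REGEX 'null' --REMOVE_DUPLICATES true          # remove PCR/optical duplicates
samtools view -h -q 255 sample_rmdup.bam -bo sample_uniq.bam   # keep uniquely mapped reads (MAPQ 255)
gatk SplitNCigarReads -R reference_masked.fa -I sample_uniq.bam \
    -O sample_split.bam                                        # split reads with Ns in the CIGAR, 255 -> 60
```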
### Variant Filtration
Variants are filtered before being submitted to ASE.
Only bi-allelic SNPs with FS lower than or equal to 30.0 and QD greater than or equal to 2.0 are used in ASE.
Filtering on SNP type is an option, see SNP_filter in the configuration file. By default it is set to False, as the calling pipeline generates an output VCF file already filtered on SNP type.
The pipeline also adds annotations to the INFO field of the final VCF file (see: Results/vcfFiltration/*_FsQdBiall.vcf.gz):
- LOCALISATION will be set to:
  - *jnct5nt* if the variant is close (+/- 5 bp) to an exon/intron junction
  - *homodimer5* if the variant is included in a 5 bp homopolymer
  - *cl3snp35nt* if the variant is in a 35 bp window that includes at least 3 other variants
- GTpopCR indicates the percentage of known genotypes
- DPgt05rPopCR indicates the percentage of known genotypes covered by at least 5 reads
Additionally, the pipeline generates 3 tables that extract the per-sample GT, DP and AD_REF/AD_ALT genotype annotations. These tables include only SNPs with a genotype call rate greater than 50% and, counting only genotypes covered by at least 5 reads, a call rate greater than 20%.
### Compute Allele Specific Expression
Allele-specific expression is obtained using phASER (https://github.com/secastel/phaser) and ASEReadCounter (https://gatk.broadinstitute.org/hc/en-us/articles/360036714351-ASEReadCounter), on bi-allelic variants filtered as above (FS <= 30 and QD >= 2).
ASEReadCounter and phASER return, per sample, the allele expression at each SNP.
phASER also returns gene-level allele-specific expression by constructing gene haplotypes per sample, and finally an expression matrix that aggregates all the samples.
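Schematically, per sample (file names hypothetical; the quality thresholds shown are illustrative and actually come from the mappingQuality and baseQuality options):

```bash
# per-SNP allelic counts with GATK4 ASEReadCounter
gatk ASEReadCounter -R reference_masked.fa -I sample_split.bam \
    --variant variants_FsQdBiall.vcf.gz \
    --min-mapping-quality 60 --min-base-quality 20 -O sample_ASEReadCounter.csv

# haplotype-aware allelic counts with phASER, then gene-level ASE
python2.7 phaser.py --vcf variants_FsQdBiall.vcf.gz --bam sample_split.bam \
    --sample sample --paired_end 1 --mapq 60 --baseq 20 --o sample_phASER
python2.7 phaser_gene_ae.py --haplotypic_counts sample_phASER.haplotypic_counts.txt \
    --features reference.bed --o sample_phASER.gene_ae.txt
```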
## 3. Dependencies
The workflow depends on:
* cutadapt (version 2.1)
* trimgalore (version 0.6.5)
* STAR (version 2.6.0c)
* samtools (version 1.9 )
* bcftools (version 1.9)
* bedtools (version 2.29.0)
* tabix (version 0.2.5)
* GenomeAnalysisTK.jar (version 3.7), which includes Picard tools and ASEReadCounter
* java (version 8)
* phASER (cloned the 29th of May 2020)
These tools need to be available in your PATH or in a bin directory specified in the config.yaml file with the bin_dir option.
## 4. Configure and run the workflow
- Copy the config_calling.yaml.example file into your working directory and update it according to the instructions inside the file.
- Launch the snakemake command as presented in SLURM.sh. This script explains how to launch a dry run, a full run, expression quantification only, or variant calling only (a hypothetical dry run is sketched below).
- If necessary, update resources_SLURM.yaml to define the CPU, memory and cluster partition resources for each rule. The default section gives the minimal resources per job. The other sections are named after the rules and define only the resources that differ from the default section.
resources_SLURM.yaml and SLURM.sh are specific to SLURM clusters. If you use another scheduler, you need to create equivalent files; these can serve as a starting point.
This is the official documentation of Snakemake if you need more information: <https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html>
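For instance, a hypothetical dry run (the exact flags live in SLURM.sh and may differ):

```bash
# dry run: print the jobs that would be executed, without running them
snakemake -n --snakefile Snakefile \
    --configfile config_calling.yaml \
    --config resources=resources_SLURM.yaml --jobs 50
```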
@@ -7,7 +7,6 @@ import pandas as pd
##########################################################
### system options
# Recover software memory and cpus resources
with open(config["resources"]) as yml:
@@ -18,6 +17,8 @@ if not "__default__" in config or not "cpu" in config["__default__"] or not "mem
if int(config["__default__"]["mem"].replace("G","")) < 1 :
raise Exception("you need at least 5G of memory per job to run this pipeline")
#~ print(config)
def check_param(rule,resource):
if not rule in config:
config[rule]={resource:config["__default__"][resource]}
@@ -26,9 +27,9 @@ def check_param(rule,resource):
rule_resources = {
'indexFasta' : ['mem'],
'VarForMasking' : ['mem'],
'VariantFiltration' : ['mem'],
'SelectVariantsForASE' : ['mem'],
'FS_QD_clust' : ['mem'],
'vcf_filtration' : ['cpu'],
'SelectVariants' : ['mem'],
'STAR_Index_1' : ['cpu'],
'STAR_Aln_1' : ['cpu'],
'STAR_Aln_2' : ['cpu'],
@@ -47,6 +48,9 @@ for rule in rule_resources:
# add bin directory in PATH
if "bin_dir" in config:
os.environ['PATH'] = config["bin_dir"] + os.pathsep + os.environ['PATH']
# add script dir to PATH
SCRIPT_DIR= os.path.abspath(os.path.join(workflow.basedir, "script"))
os.environ['PATH'] = SCRIPT_DIR + os.pathsep + os.environ['PATH']
##########################################################
### workflows options
@@ -54,24 +58,32 @@ if "bin_dir" in config:
# check the reference genome files
if not "fasta_ref" in config or not os.path.exists(config["fasta_ref"]):
raise Exception("You need to provide an existing fasta reference file, using fasta_ref in your config.yaml file")
elif not os.path.exists(config["fasta_ref"] + ".fai") or not os.path.exists(os.path.splitext(config["fasta_ref"])[0] + ".dict"):
raise Exception("Your fasta reference file need to be indexed with samtools faidx and GATK CreateSequenceDictionnary ")
if not "gtf_ref" in config or not os.path.exists(config["gtf_ref"]):
raise Exception("You need to provide an existing gtf reference file, using gtf_ref in your config.yaml file")
# check about VCF files
if not 'vcf' in config or not os.path.exists(config['vcf']):
raise Exception('You need to provide an existing VCF file, using the vcf parameter in the config.yaml file')
if not os.path.exists(config['vcf'] + '.tbi'):
raise Exception('Your input VCF file : ' + config['vcf'] + ', is not indexed by tabix.')
if not config['vcf'].endswith('.gz'):
raise Exception('Your input VCF file : ' + config['vcf'] + ', is not compressed with bgzip')
# check sample config TSV file
if not "sample_config" in config or not os.path.exists(config["sample_config"]):
raise Exception("You need to provide an existing sample config tabular file, using sample_config in your config.yaml file")
if not 'SNP_filter' in config:
config['SNP_filter'] = False
elif config['SNP_filter'] != False and config['SNP_filter'] != True:
raise Exception('SNP_filter must be set to True or False')
# load sample_config
table = pd.read_csv(config["sample_config"], sep='\t', dtype=str)
# check sequencer type
@@ -87,7 +99,7 @@ if table.shape[0] != len(table["forward_read"].unique()):
#################################################################################################
## FUNCTIONS
def find_jar(jar):
path=""
for bin_dir in os.environ['PATH'].split(os.pathsep):
@@ -109,7 +121,6 @@ def gatk_multi_arg(flag, files):
# genome processing
include : "rules/basics.smk"
include : "rules/variantFiltration.smk"
include : "rules/genomeMasking.smk"
# reads preprocessing and alignment
include : "rules/TrimGalore.smk"
include : "rules/alignmentProcess.smk"
@@ -119,42 +130,56 @@ include : "rules/ASE_counter.smk"
localrules: all
final_outputs=list()
# reference masked
final_outputs.append('Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa')
final_outputs.append('Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa.fai')
final_outputs.append('Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.dict')
# STAR alignment
for sample in table["sample_name"].unique():
sample_table = table[table["sample_name"] == sample]
# paired end, filter on properly paired
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
final_outputs.append("Results/STAR_Aln_2/" + sample + "_rg_genomic_pp_rmdup_uniq_split.bam")
final_outputs.append("Results_ASE/STAR_Aln_2/" + sample + "_rg_genomic_pp_rmdup_uniq_split.bam")
# single end, no filter on properly paired
elif len(sample_table["reverse_read"].dropna() ) == 0:
final_outputs.append("Results/STAR_Aln_2/" + sample + "_rg_genomic_rmdup_uniq_split.bam")
final_outputs.append("Results_ASE/STAR_Aln_2/" + sample + "_rg_genomic_rmdup_uniq_split.bam")
# Summary
final_outputs.append("Results/Summary/Log_TrimGalore_summary.tsv")
final_outputs.append("Results/Summary/Log_STAR_Aln_1_summary.tsv")
final_outputs.append("Results/Summary/Log_STAR_Aln_2_summary.tsv")
final_outputs.append("Results/Summary/VariantFiltration_summary.txt")
# # Summary
final_outputs.append("Results_ASE/Summary/Log_TrimGalore_summary.tsv")
final_outputs.append("Results_ASE/Summary/Log_STAR_Aln_1_summary.tsv")
final_outputs.append("Results_ASE/Summary/Log_STAR_Aln_2_summary.tsv")
# variant Filtration
final_outputs.append('Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv'))
final_outputs.append('Results_ASE/Summary/VariantFiltration_summary.txt')
# counter
for sample in table["sample_name"].unique():
final_outputs.append('Results/Counter/' + sample + '_ASEReadCounter.csv')
final_outputs.append('Results/Counter/' + sample + '_phASER.allelic_counts.txt')
final_outputs.append('Results_ASE/ASEReadCounter/' + sample + '_ASEReadCounter.csv')
final_outputs.append('Results_ASE/phASER/' + sample + '_phASER.allelic_counts.txt')
final_outputs.append('Results_ASE/phASER_gene_ae/' + sample + '_phASER.gene_ae.txt')
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'))
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz'))
# print(final_outputs)
# ['Results/genomeMasked/reference_masked.fa',
# 'Results/genomeMasked/reference_masked.fa.fai',
# 'Results/genomeMasked/reference_masked.dict',
# 'Results/STAR_Aln_2/samplePE_rg_genomic_pp_rmdup_uniq_split.bam',
# 'Results/Counter/samplePE_ASEReadCounter.csv',
# 'Results/Counter/samplePE_phASER.allelic_counts.txt']
# ['Results_ASE/STAR_Aln_2/samplePE_rg_genomic_pp_rmdup_uniq_split.bam',
# 'Results_ASE/STAR_Aln_2/sampleSE_rg_genomic_rmdup_uniq_split.bam',
# 'Results_ASE/Summary/Log_TrimGalore_summary.tsv',
# 'Results_ASE/Summary/Log_STAR_Aln_1_summary.tsv',
# 'Results_ASE/Summary/Log_STAR_Aln_2_summary.tsv',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall.vcf.gz',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv',
# 'Results_ASE/Summary/VariantFiltration_summary.txt',
# 'Results_ASE/ASEReadCounter/samplePE_ASEReadCounter.csv',
# 'Results_ASE/phASER/samplePE_phASER.allelic_counts.txt',
# 'Results_ASE/phASER_gene_ae/samplePE_phASER.gene_ae.txt',
# 'Results_ASE/ASEReadCounter/sampleSE_ASEReadCounter.csv',
# 'Results_ASE/phASER/sampleSE_phASER.allelic_counts.txt',
# 'Results_ASE/phASER_gene_ae/sampleSE_phASER.gene_ae.txt']
rule all:
input:
......
# bin_dir is a directory containing software binary used in this pipeline:
# trimgalore (version 0.4.5 with cutadapt (version 1.14))
# trimgalore (version 0.6.5 with cutadapt (version 2.1))
# tabix and bgzip (version 0.2.5)
# STAR (version 2.6.0c )
# GenomeAnalysisTK.jar (version 3.8.1)
# bedtools (version 2.27.1)
# picard.jar (version 2.1.1)
# samtools (version 1.8)
# GenomeAnalysisTK.jar (version 4.1.2.0)
# samtools (version 1.9)
# bedtools (version 2.29.0)
# md5sum
# phASER (version downloaded 07-06-2019 ) with :
# phASER (version downloaded 23-03-2020 ) with :
# python2.7 and associated Scipy and Numpy library
# bgzip (version 0.2.5, cf tabix)
# bcftools (version 1.8)
# bcftools (version 1.9)
# if not bin_dir, the binaries need to be available in the PATH
bin_dir : bin
# fasta_ref file is the genome reference Fasta file (need to be indexed with samtools faidx and picard CreateSequenceDictionary )
# fasta_ref file is the masked genome reference Fasta file (need to be indexed with samtools faidx and GATK4 CreateSequenceDictionary )
fasta_ref : data/reference.fa
# gtf_ref file is the genome reference GTF file
gtf_ref : data/reference.gtf
# VCF input files to analyse. The VCF files need to be indexed by tabix
# VCF input files to analyse (including SNP only). The VCF files need to be indexed by tabix
vcf : data/variants.vcf.gz
# sample_config file is a tabular file describing each input fastq file(s)
# sample_config file is a tabular file describing each input fastq file(s) corresponding to samples included in the previous VCF file.
# header columns are :
# idx name forward_read reverse_read sequencer read_length phred_scale group
#
@@ -32,16 +33,15 @@ vcf : data/variants.vcf.gz
# - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
# sequencer needs to be chosen from the list : ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
# phred scale indicates the phred score scale used to encode base quality: either 33 (Sanger and illumina 1.8+) or 64 (Solexa, illumina 1.3 to 1.8 excluded)
# group is used to define sub-populations among all samples (in the vcf input file). It will be used to filter out variants that are not present in at least one group with a call rate > call_rate_th
sample_config : data/population.tsv.example
# data_dir contains fastq files described in sample_config file
data_dir : data
# minimum calling depth
depth : 10
# minimum percentage of samples with no missing genotype per group
call_rate_th : 0.95
# Filter on variant type if not already done (True or False)
SNP_filter : True
# quality trimming threshold used in trimgalore to remove low quality bases.
trimming_quality : 15
# minimum base quality
baseQuality : 20
# minimum mapping quality
......
@HD VN:1.6
@SQ SN:1 LN:196202544 M5:971cb1c7a7f62a402dab61cfe84a93b1 UR:file:/home/maria/workspace/workflows_develop/Snakemake/1000RNASeq_chicken/ASE2/data/reference.fa
1 196202544 3 60 61
Snakemake/1000RNASeq_chicken/ASE/img/full_dryrun.png (binary image, 231 KiB)
__default__:
mem : "8G"
cpu : 1
time : "4:00:00"
partition : workq
mem : "8G"
cpu : 1
time : "4:00:00"
partition : workq
VarForMasking:
mem : "10G"
trim_pe :
time : "24:00:00"
trim_se :
time : "24:00:00"
maskFasta:
FS_QD_clust:
mem : "20G"
SelectVariants:
mem : "20G"
STAR_Index_1 :
mem : "18G"
vcf_filtration:
cpu : 10
mem : "16G"
time : "96:00:00"
partition : workq
STAR_Index_1:
mem : "50G"
cpu : 8
time : "24:00:00"
STAR_Aln_1 :
mem : "12G"
STAR_Aln_1:
mem : "30G"
cpu : 8
time : "24:00:00"
STAR_Aln_2 :
mem : "12G"
STAR_Aln_2:
mem : "30G"
cpu : 8
time : "24:00:00"
AddOrReplaceReadGroups:
mem: "12G"
RmDuplicates:
mem: "20G"
SplitNCigarReads:
mem : "30G"
VariantFiltration:
mem : "20G"
SelectVariantsForASE:
mem : "20G"
time : "8:00:00"
phASER:
mem : "20G"
cpu : 8
mem : "180G"
time : "8:00:00"
phASER_expr_matrix:
cpu : 8
ASEReadCounter:
mem : "20G"
@@ -13,10 +13,10 @@ def get_splitNCigar_bam(wildcards):
# paired end, filter on properly paired and rmdup
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam"
# single end, no filter
elif len(sample_table["reverse_read"].dropna() ) == 0:
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam"
else :
raise Exception(wildcards.sample + " is sequenced in both paired-end and single-end mode\nThis workflow does not accept this case!\n")
@@ -26,40 +26,43 @@ def get_splitNCigar_bai(wildcards):
# paired end, filter on properly paired and rmdup
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam.bai"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam.bai"
# single end, no filter
elif len(sample_table["reverse_read"].dropna() ) == 0:
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam.bai"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam.bai"
else :
raise Exception(wildcards.sample + " is sequenced in both paired-end and single-end mode\nThis workflow does not accept this case!\n")
# GATK3 :
# java -jar GenomeAnalysisTK.jar -T ASEReadCounter -R {input.ref} -o {output} -I {input.bam} -sites {input.vcf} -U ALLOW_N_CIGAR_READS --minMappingQuality {params.mappingQuality} --minBaseQuality {params.baseQuality}
# GATK4 : ALLOW_N_CIGAR_READS seems to be useless now
rule ASEReadCounter:
input:
ref = config["fasta_ref"],
vcf = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz'),
idx = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz.tbi'),
vcf = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz'),
idx = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz.tbi'),
bam = get_splitNCigar_bam,
bai = get_splitNCigar_bai
output:
cvs = 'Results/Counter/{sample}_ASEReadCounter.csv'
cvs = 'Results_ASE/ASEReadCounter/{sample}_ASEReadCounter.csv'
params:
mem = config["ASEReadCounter"]["mem"],
jar = find_jar("GenomeAnalysisTK.jar"),
mappingQuality = config['mappingQuality'],
baseQuality = config['baseQuality']
shell:
"""
java -Xmx{params.mem} -jar {params.jar} -T ASEReadCounter -R {input.ref} -o {output} -I {input.bam} -sites {input.vcf} -U ALLOW_N_CIGAR_READS --minMappingQuality {params.mappingQuality} --minBaseQuality {params.baseQuality}
gatk --java-options "-Xmx{params.mem}" ASEReadCounter -R {input.ref} -O {output} -I {input.bam} --tmp-dir Results_ASE/ASEReadCounter --variant {input.vcf} --min-mapping-quality {params.mappingQuality} --min-base-quality {params.baseQuality}
"""
rule phASER:
input:
vcf = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz') ,
idx = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz.tbi') ,
vcf = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz'),
idx = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz.tbi'),
bam = get_splitNCigar_bam,
bai = get_splitNCigar_bai
output:
csv = 'Results/Counter/{sample}_phASER.allelic_counts.txt'
csv = 'Results_ASE/phASER/{sample}_phASER.allelic_counts.txt',
hap_count = 'Results_ASE/phASER/{sample}_phASER.haplotypic_counts.txt'
threads: config["phASER"]["cpu"]
params:
mappingQuality = config['mappingQuality'],
@@ -68,5 +71,45 @@ rule phASER:
idSeparator = '_' if config['id_separator'] == '' else config['id_separator']
shell:
"""
python2.7 {config[bin_dir]}/phaser.py --vcf {input.vcf} --bam {input.bam} --sample {wildcards.sample} --threads {threads} --o Results/Counter/{wildcards.sample}_phASER --paired_end {params.paired_option} --baseq {params.baseQuality} --mapq {params.mappingQuality} --id_separator {params.idSeparator}
python2.7 {config[bin_dir]}/phaser.py --vcf {input.vcf} --bam {input.bam} --sample {wildcards.sample} --threads {threads} --o Results_ASE/phASER/{wildcards.sample}_phASER --paired_end {params.paired_option} --unphased_vars 1 --gw_phase_vcf 1 --baseq {params.baseQuality} --mapq {params.mappingQuality} --id_separator {params.idSeparator}
"""
rule gtf2bed:
input:
gtf = config['gtf_ref']
output:
bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
script:
"../script/gtf2bed.py"
rule phASER_gene_ae:
input:
hap_count = "Results_ASE/phASER/{sample}_phASER.haplotypic_counts.txt",
bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
output:
gene_ae = "Results_ASE/phASER_gene_ae/{sample}_phASER.gene_ae.txt"
params:
idSeparator = '_' if config['id_separator'] == '' else config['id_separator']
shell:
"""
python2.7 {config[bin_dir]}/phaser_gene_ae.py --haplotypic_counts {input.hap_count} --features {input.bed} --o {output} --id_separator {params.idSeparator}
"""
rule phASER_expr_matrix:
input:
expand("Results_ASE/phASER_gene_ae/{sample}_phASER.gene_ae.txt", sample=table["sample_name"].unique()),
bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
output:
bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'),
phased_bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz')
params:
gene_ae_dir = "Results_ASE/phASER_gene_ae/",
out_prefix = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER')
threads: config["phASER_expr_matrix"]["cpu"]
shell:
"""
python2.7 {config[bin_dir]}/phaser_expr_matrix.py --t {threads} --gene_ae_dir {params.gene_ae_dir} --feature {input.bed} --o {params.out_prefix}
"""
# coding: utf-8
__author__ = 'Kenza BAZI KABBAJ / Maria BERNARD- Sigenae'
__author__ = 'Maria BERNARD- Sigenae'
__version__ = '1.0.0'
'''
@@ -16,7 +16,7 @@ rule convertPhred:
input:
fastq_in = get_phred64
output:
fastq_out = temp("Results/Phred_convert/{prefix}.fastq.gz")
fastq_out = temp("Results_ASE/Phred_convert/{prefix}.fastq.gz")
script:
"../script/convertPhred64_to_Phred33.py"
@@ -27,13 +27,13 @@ def get_fastq(wildcards):
if re.search('^' + wildcards.prefix + '(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$',f):
is_phred64 = table[table["forward_read"] == f]["phred_scale"] == '64'
if is_phred64.all():
inputs.append("Results/Phred_convert/" + f)
inputs.append("Results_ASE/Phred_convert/" + f)
else:
inputs.append(config["data_dir"] + "/" + f)
r = table[table["forward_read"] == f]["reverse_read"]
if not r.isnull().values.any():
if is_phred64.all():
inputs.append("Results/Phred_convert/" + r.squeeze())
inputs.append("Results_ASE/Phred_convert/" + r.squeeze())
else:
inputs.append(config["data_dir"] + "/" + r.squeeze())
return inputs
@@ -43,14 +43,15 @@ rule trim_se:
input:
get_fastq
output:
out=temp("Results/TrimGalore/{prefix}_trim.fastq.gz")
out=temp("Results_ASE/TrimGalore/{prefix}_trim.fastq.gz")
params:
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 ))
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 )),
qual=config["trimming_quality"]
log:
"Results/TrimGalore/{prefix}_trimSe.log"
"Results_ASE/TrimGalore/{prefix}_trimSe.log"
shell:
"""
trim_galore --no_report_file --length {params.min_trim} --quality 0 -o `dirname {output.out}` {input} 2> {log}
trim_galore --no_report_file --length {params.min_trim} --quality {params.qual} -o `dirname {output.out}` {input} 2> {log}
mv `dirname {output.out}`/{wildcards.prefix}*trimmed.fq.gz {output.out}
"""
@@ -59,15 +60,16 @@ rule trim_pe:
input:
get_fastq
output:
out1=temp("Results/TrimGalore/{prefix}_trim_R1.fastq.gz"),
out2=temp("Results/TrimGalore/{prefix}_trim_R2.fastq.gz")
out1=temp("Results_ASE/TrimGalore/{prefix}_trim_R1.fastq.gz"),
out2=temp("Results_ASE/TrimGalore/{prefix}_trim_R2.fastq.gz")
log:
"Results/TrimGalore/{prefix}_trimPe.log"
"Results_ASE/TrimGalore/{prefix}_trimPe.log"
params:
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 ))
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 )),
qual=config["trimming_quality"]
shell:
"""
trim_galore --paired --no_report_file --length {params.min_trim} --quality 0 -o `dirname {output.out1}` {input} 2> {log}
trim_galore --paired --no_report_file --length {params.min_trim} --quality {params.qual} -o `dirname {output.out1}` {input} 2> {log}
mv `dirname {output.out1}`/{wildcards.prefix}*_val_1.fq.gz {output.out1}
mv `dirname {output.out1}`/{wildcards.prefix}*_val_2.fq.gz {output.out2}
"""
@@ -79,9 +81,9 @@ def get_trim_Log(wildcards):
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
r = table[table["forward_read"] == f]["reverse_read"]
if not r.isnull().values.any():
inputs.append("Results/TrimGalore/" + prefix + "_trimPe.log" )
inputs.append("Results_ASE/TrimGalore/" + prefix + "_trimPe.log" )
else:
inputs.append("Results/TrimGalore/" + prefix + "_trimSe.log" )
inputs.append("Results_ASE/TrimGalore/" + prefix + "_trimSe.log" )
return inputs
@@ -89,7 +91,7 @@ rule log_merge_trim:
input:
logs=get_trim_Log
output:
out="Results/Summary/Log_TrimGalore_summary.tsv"
out="Results_ASE/Summary/Log_TrimGalore_summary.tsv"
script:
"../script/trimgalore_summary.py"
@@ -14,75 +14,78 @@ STAR_INDEX_FILE=['genomeParameters.txt', 'chrName.txt', 'chrLength.txt', 'chrSta
rule STAR_Index_1:
input:
fasta = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa',
fasta = os.path.abspath(config['fasta_ref']),
gtf = os.path.abspath(config["gtf_ref"])
output:
expand('Results/STAR_Index_1/{file}', file=STAR_INDEX_FILE)
expand('Results_ASE/STAR_Index_1/{file}', file=STAR_INDEX_FILE)
threads: config["STAR_Index_1"]["cpu"]
params:
sjdbOverhang=max([int(x) for x in table.read_length]) - 1
shell:
"STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir Results/STAR_Index_1 --genomeFastaFiles {input.fasta} --sjdbOverhang {params.sjdbOverhang} --sjdbGTFfile {input.gtf} "
"STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir Results_ASE/STAR_Index_1 --genomeFastaFiles {input.fasta} --sjdbOverhang {params.sjdbOverhang} --sjdbGTFfile {input.gtf} "
def get_fastq_trim(wildcards):
inputs = list()
for f in table["forward_read"] :
if re.search('^' + wildcards.prefix + '(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$',f):
inputs.append("Results/TrimGalore/" + f.replace(wildcards.prefix,wildcards.prefix + "_trim"))
inputs.append("Results_ASE/TrimGalore/" + f.replace(wildcards.prefix,wildcards.prefix + "_trim"))
r = table[table["forward_read"] == f]["reverse_read"]
if not r.isnull().values.any():
inputs.append("Results/TrimGalore/" + r.squeeze().replace(wildcards.prefix,wildcards.prefix + "_trim"))
inputs.append("Results_ASE/TrimGalore/" + r.squeeze().replace(wildcards.prefix,wildcards.prefix + "_trim"))
return inputs
rule STAR_Aln_1:
input:
expand('Results/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
expand('Results_ASE/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
fastq = get_fastq_trim
output:
temp("Results/STAR_Aln_1/{prefix}_Log.out"),
temp("Results/STAR_Aln_1/{prefix}_Log.progress.out"),
temp("Results/STAR_Aln_1/{prefix}_Log.final.out"),
bam=temp("Results/STAR_Aln_1/{prefix}_Aligned.sortedByCoord.out.bam"),
sj_out_tab="Results/STAR_Aln_1/{prefix}_SJ.out.tab"
temp("Results_ASE/STAR_Aln_1/{prefix}_Log.out"),
temp("Results_ASE/STAR_Aln_1/{prefix}_Log.progress.out"),
temp("Results_ASE/STAR_Aln_1/{prefix}_Log.final.out"),
bam=temp("Results_ASE/STAR_Aln_1/{prefix}_Aligned.sortedByCoord.out.bam"),
sj_out_tab="Results_ASE/STAR_Aln_1/{prefix}_SJ.out.tab"
threads: config["STAR_Aln_1"]["cpu"]
shell:
"STAR --runThreadN {threads} --genomeDir Results/STAR_Index_1 --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate --outFilterType BySJout"
"STAR --runThreadN {threads} --genomeDir Results_ASE/STAR_Index_1 --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate --outFilterType BySJout"
# see https://github.com/alexdobin/STAR/issues/524 and https://groups.google.com/forum/#!searchin/rna-star/2-pass|sort:date/rna-star/Cpsf-_rLK9I/gq-DaeyvBAAJ
def get_SJout_STAR1(wildcards):
inputs=list()
forward_list = table["forward_read"]
for f in forward_list:
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
inputs.append("Results/STAR_Aln_1/" + prefix + "_SJ.out.tab" )
inputs.append("Results_ASE/STAR_Aln_1/" + prefix + "_SJ.out.tab" )
return inputs
rule STAR_SJout_filter:
input:
sj_out_tab=get_SJout_STAR1
output:
'Results/STAR_Index_1/SJ.out.filtered.tab'
'Results_ASE/STAR_Index_1/SJ.out.filtered.tab'
shell:
"cat {input.sj_out_tab} | awk '$6==0 && $5>0 && $7>=2' | cut -f1-6 | sort | uniq -c |awk '{{if($1>1){{print $2,$3,$4,$5,$6,$7}}}}' |sed 's/ /\t/g' > {output}"
"""
# filter junctions on: mitochondria , non canonical, already known, covered by less than 3 uniquely mapped reads in at least 1 sample
cat {input.sj_out_tab} | awk '$6==0 && $5>0 && $7>=2' | cut -f1-6 | sort | uniq -c |awk '{{if($1>1){{print $2,$3,$4,$5,$6,$7}}}}' |sed 's/ /\t/g' > {output}
"""
rule STAR_Aln_2:
input:
expand('Results/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
expand('Results_ASE/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
fastq = get_fastq_trim,
sj_out_tab = 'Results/STAR_Index_1/SJ.out.filtered.tab'
sj_out_tab = 'Results_ASE/STAR_Index_1/SJ.out.filtered.tab'
output:
temp("Results/STAR_Aln_2/{prefix}_Log.out"),
temp("Results/STAR_Aln_2/{prefix}_Log.progress.out"),
temp("Results/STAR_Aln_2/{prefix}_SJ.out.tab"),
g_bam=temp("Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam"),
log=temp("Results/STAR_Aln_2/{prefix}_Log.final.out"),
g_stat="Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.flagstat"
temp("Results_ASE/STAR_Aln_2/{prefix}_Log.out"),
temp("Results_ASE/STAR_Aln_2/{prefix}_Log.progress.out"),
temp("Results_ASE/STAR_Aln_2/{prefix}_SJ.out.tab"),
g_bam=temp("Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam"),
log=temp("Results_ASE/STAR_Aln_2/{prefix}_Log.final.out"),
g_stat="Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.flagstat"
threads: config["STAR_Aln_2"]["cpu"]
priority : 50
shell:
"""
STAR --runThreadN {threads} --genomeDir Results/STAR_Index_1/ --sjdbFileChrStartEnd {input.sj_out_tab} --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.g_bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate
STAR --runThreadN {threads} --genomeDir Results_ASE/STAR_Index_1/ --sjdbFileChrStartEnd {input.sj_out_tab} --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.g_bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate
samtools flagstat {output.g_bam} > {output.g_stat}
"""
@@ -90,47 +93,49 @@ def get_STAR_log(wildcards):
inputs = list()
for f in table["forward_read"]:
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
inputs.append("Results/" + wildcards.dir + "/" + prefix + "_Log.final.out" )
inputs.append("Results_ASE/" + wildcards.dir + "/" + prefix + "_Log.final.out" )
return inputs
rule log_merge_STAR:
input:
logs=get_STAR_log
output:
out="Results/Summary/Log_{dir}_summary.tsv"
out="Results_ASE/Summary/Log_{dir}_summary.tsv"
wildcard_constraints:
dir='STAR_Aln_[12]'
script:
"../script/star_summary.py"
# PICARD
# java -jar picard.jar AddOrReplaceReadGroups I={input.bam} O={output.bam} RGID={params.idx} RGLB={params.name} RGPL={params.sequencer} RGPU="-" RGSM={params.name}
# GATK4:
rule AddOrReplaceReadGroups:
input :
bam="Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam",
bam="Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam",
output :
bam=temp("Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord_rg.bam")
bam=temp("Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord_rg.bam")
params :
idx = lambda wildcards : [ table[table["forward_read"] == f]["idx"].tolist()[0] for f in table["forward_read"] if f.startswith(wildcards.prefix)][0],
name = lambda wildcards : [ table[table["forward_read"] == f]["sample_name"].tolist()[0] for f in table["forward_read"] if f.startswith(wildcards.prefix)][0],
sequencer = lambda wildcards : [ table[table["forward_read"] == f]["sequencer"].tolist()[0] for f in table["forward_read"] if f.startswith(wildcards.prefix)][0],
mem=str(int(config["AddOrReplaceReadGroups"]["mem"].replace("G","")) -4 )+"G" if int(config["AddOrReplaceReadGroups"]["mem"].replace("G","")) -4 > 1 else config["AddOrReplaceReadGroups"]["mem"] ,
jar=find_jar("picard.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} AddOrReplaceReadGroups I={input.bam} O={output.bam} RGID={params.idx} RGLB={params.name} RGPL={params.sequencer} RGPU="-" RGSM={params.name}
gatk --java-options "-Xmx{params.mem}" AddOrReplaceReadGroups -I {input.bam} -O {output.bam} --RGID {params.idx} --RGLB {params.name} --RGPL {params.sequencer} --RGPU "-" --RGSM {params.name}
"""
def get_genomic_bam(wildcards):
inputs = list()
for f in table[table["sample_name"] == wildcards.sample]["forward_read"]:
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
inputs.append("Results/STAR_Aln_2/" + prefix + "_Aligned.sortedByCoord_rg.bam" )
inputs.append("Results_ASE/STAR_Aln_2/" + prefix + "_Aligned.sortedByCoord_rg.bam" )
return inputs
rule merge_bam :
input:
bam = get_genomic_bam
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic.bam")
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic.bam")
threads: config["merge_bam"]["cpu"]
shell:
"""
@@ -146,9 +151,9 @@ rule merge_bam :
rule properly_paired:
input :
bam = "Results/STAR_Aln_2/{sample}_rg_genomic.bam"
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic.bam"
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic_pp.bam")
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic_pp.bam")
shell:
"""
samtools view -h -f 2 {input.bam} -bo {output.bam}
@@ -159,52 +164,54 @@ def get_properly_paired_bam(wildcards):
sample_table = table[table["sample_name"] == wildcards.sample]
# paired end, filter on properly paired
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp.bam"
# single end, no filter
elif len(sample_table["reverse_read"].dropna() ) == 0:
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic.bam"
# note : This tool uses the READ_NAME_REGEX and the OPTICAL_DUPLICATE_PIXEL_DISTANCE options as the primary methods to identify and differentiate duplicate types. Set READ_NAME_REGEX to null to skip optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim
# PICARD
# java -jar picard.jar MarkDuplicates I={input.bam} O={output.bam} M={output.metrics} CREATE_INDEX=false READ_NAME_REGEX=null REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT
# GATK4:
rule RmDuplicates:
input:
bam = get_properly_paired_bam
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"),
metrics = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}.metrics.txt",
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"),
metrics = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}.metrics.txt",
params:
mem=str(int(config["rmdup"]["mem"].replace("G","")) -4 )+"G" if int(config["rmdup"]["mem"].replace("G","")) -4 > 1 else config["rmdup"]["mem"] ,
jar=find_jar("picard.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} MarkDuplicates CREATE_INDEX=false READ_NAME_REGEX=null I={input.bam} O={output.bam} REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT M={output.metrics}
gatk --java-options "-Xmx{params.mem}" MarkDuplicates -I {input.bam} -O {output.bam} -M {output.metrics} --CREATE_INDEX false --READ_NAME_REGEX 'null' --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY SILENT
"""
rule remove_multimap:
input:
bam = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam")
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam")
shell:
"""
samtools view -h -q 255 {input.bam} -bo {output.bam}
"""
# GATK3 :
# java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R {input.fasta} -I {input.bam} -o {output.bam} --disable_bam_indexing -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
# GATK4 : default behavior, see --skip-mapping-quality-transform option
rule SplitNCigarReads:
input:
fasta = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa',
idx = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa.fai',
dict = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.dict',
bam = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam",
bai = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam.bai"
fasta = config['fasta_ref'],
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam",
bai = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam.bai"
output:
bam = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq_split.bam"
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq_split.bam"
params:
mem=config["SplitNCigarReads"]["mem"],
jar=find_jar("GenomeAnalysisTK.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} -T SplitNCigarReads -R {input.fasta} -I {input.bam} -o {output.bam} --disable_bam_indexing -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
gatk --java-options "-Xmx{params.mem}" SplitNCigarReads -R {input.fasta} -I {input.bam} -O {output.bam} --create-output-bam-index false
"""
@@ -23,8 +23,6 @@ rule bgzip:
'{file}.vcf'
output:
'{file}.vcf.gz'
wildcard_constraints:
file=".+_SNP_callRateFiltered|.+_SNP_exclude_nonVar"
shell:
"bgzip {input}"
@@ -48,19 +46,3 @@ rule bam_index:
shell:
"samtools index {input} "
rule indexFasta:
input:
ref = "{file}.fa"
output:
idx = "{file}.fa.fai",
dict = "{file}.dict"
params:
mem=str(int(config["indexFasta"]["mem"].replace("G","")) -4 )+"G" if int(config["indexFasta"]["mem"].replace("G","")) -4 > 1 else config["indexFasta"]["mem"] ,
jar=find_jar("picard.jar")
shell:
"""
samtools faidx {input.ref}
java -Xmx{params.mem} -jar {params.jar} CreateSequenceDictionary R={input.ref} O={output.dict}
"""
# coding: utf-8
__author__ = 'Maria BERNARD- Sigenae'
__version__ = '1.0.0'
'''
Mask variant sites in the reference genome
'''
rule maskFasta:
input:
ref = config['fasta_ref'],
vcf = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','') + '_SNP_exclude_nonVar.vcf.gz'
output:
ref = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa'
shell:
"""
bedtools maskfasta -fi {input.ref} -bed {input.vcf} -fo {output.ref}
"""
# coding: utf-8
__author__ = 'Kenza BAZI KABBAJ / Maria BERNARD- Sigenae'
__author__ = 'Maria BERNARD- Sigenae'
__version__ = '1.0.0'
'''
Rules to 'hard' filter SNPs and INDELs following the GATK recommendations and remove non-variant sites
'''
import sys
from pyfaidx import Fasta
genome = Fasta(config['fasta_ref'])
chrom_list = [ seq.name for seq in genome if seq.name.isdigit() ]
rule VarForMasking:
PREFIX = os.path.basename(config['vcf']).replace('.vcf.gz','')
# GATK4
rule SelectVariants :
input :
vcf = config['vcf']
output :
vcf = temp('Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.vcf'),
stat = temp('Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.stat')
params:
mem = config['VarForMasking']['mem'],
jar = find_jar("GenomeAnalysisTK.jar")
message:
"\nrule VarForMasking:\n" \
+ "\tinput: {input}\n" \
+ "\toutput: {output}\n" \
+ "Executing script/exclude_non_var.py to keep only SNP and exclude non variants site (all samples with the same genotype call).\n"
script:
'../script/exclude_non_var.py'
vcf = temp("Results_ASE/vcfFiltration/" + PREFIX + "_SNP.vcf")
params :
mem = config["SelectVariants"]["mem"]
shell:
'''
gatk --java-options "-Xmx{params.mem}" SelectVariants -V {input.vcf} -O {output.vcf} --select-type-to-include SNP
'''
def get_SNP_vcf(wildcards):
if config['SNP_filter']:
return "Results_ASE/vcfFiltration/" + PREFIX + "_SNP.vcf"
else:
return config['vcf']
rule VariantFiltration:
input :
vcf = 'Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.vcf.gz',
idx = 'Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.vcf.gz.tbi',
# GATK3:
# java -jar GenomeAnalysisTK.jar -T VariantFiltration -R {input.ref} -V {input.vcf} {params.filter_exp} -o {output.vcf} -filterName "FS" -filter "FS > 30.0" -filterName "QD" -filter "QD < 2.0" --cluster-window-size 35 --cluster-size 3
# GATK4
rule FS_QD_clust :
input :
vcf = get_SNP_vcf,
ref = config['fasta_ref']
output :
vcf = temp('Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz'),
params:
mem= config['VariantFiltration']['mem'],
jar=find_jar("GenomeAnalysisTK.jar"),
filter_exp = '-filterName FS -filter \"FS > 30.0\" -filterName QD -filter \"QD < 2.0\" -filterName Depth -filter \"DP < ' + str(config["depth"]) + '\"'
vcf = temp("Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf"),
idx = temp("Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf.idx")
params :
mem = config["FS_QD_clust"]["mem"]
shell:
'''
java -Xmx{params.mem} -jar {params.jar} -T VariantFiltration -R {input.ref} -V {input.vcf} {params.filter_exp} -o {output.vcf}
gatk --java-options "-Xmx{params.mem}" VariantFiltration -R {input.ref} -V {input.vcf} -O {output.vcf} --filter-name FS --filter-expression "FS > 30.0" --filter-name QD --filter-expression "QD < 2.0" -window 35 -cluster 3
'''
rule VariantFiltrationStat:
    input:
        vcf = 'Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz'
    output:
        stat = temp('Results/vcfFiltration/{prefix}_SNP_tagged.stat')
    shell:
        '''
        echo "# Number of SNP filtered by criteria:" >> {output.stat}
        zcat {input.vcf} | grep -v "#" | cut -f 7 | sort | uniq -c >> {output.stat}
        echo "# Number of non biallelic SNP filtered: " >> {output.stat}
        zcat {input.vcf} | grep -v "#" | cut -f 5 | grep -c "," >> {output.stat}
        '''

rule vcf_filtration :
    input :
        vcf = "Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf",
        idx = "Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf.idx",
        gtf = config['gtf_ref'],
        fasta = config['fasta_ref']
    output :
        vcf = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall.vcf"),
        gt = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv"),
        dp = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv"),
        ad = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv"),
        summary = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_VariantFiltration_summary.txt")
    params :
        chrom_selection = lambda wildcards : "--accepted-chrom " + wildcards.chrom if wildcards.chrom != "other" else "--excluded-chrom " + " ".join(chrom_list),
        cpu = config['vcf_filtration']['cpu']
    shell:
        '''
        variantFiltration.py --nb-cpus {params.cpu} --in-vcf {input.vcf} --in-gtf {input.gtf} --in-fasta {input.fasta} {params.chrom_selection} --out-vcf {output.vcf} --out-gt {output.gt} --out-dp {output.dp} --out-ad {output.ad} --summary {output.summary}
        '''
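# Worked example of the chrom_selection switch (hypothetical chromosome names,
# illustration only): each per-chromosome job restricts variantFiltration.py to
# a single chromosome, while the "other" job excludes every listed chromosome
# so that no site is processed twice.
def _example_chrom_selection(chrom, chroms=("1", "2", "X")):
    # mirrors the params lambda above with a fixed, hypothetical chrom list
    return "--accepted-chrom " + chrom if chrom != "other" else "--excluded-chrom " + " ".join(chroms)
# _example_chrom_selection("1")     -> '--accepted-chrom 1'
# _example_chrom_selection("other") -> '--excluded-chrom 1 2 X'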
rule SelectVariantsForASE:
    input :
        vcf = 'Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz',
        idx = 'Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz.tbi',
        ref = config['fasta_ref']
    output :
        vcf = temp('Results/vcfFiltration/{prefix}_SNP_qualFiltered.vcf.gz')
    params:
        mem = config['SelectVariantsForASE']['mem'],
        jar = find_jar("GenomeAnalysisTK.jar")
    shell:
        '''
        java -Xmx{params.mem} -jar {params.jar} -T SelectVariants -R {input.ref} -V {input.vcf} --restrictAllelesTo BIALLELIC --excludeFiltered -o {output.vcf}
        '''

rule merge_filteredVariant :
    input :
        chr_vcf = expand("Results_ASE/vcfFiltration/by_chrom/{CHR}_" + PREFIX + "_FsQdBiall.vcf", CHR=chrom_list),
        other_vcf = "Results_ASE/vcfFiltration/by_chrom/other_" + PREFIX + "_FsQdBiall.vcf"
    output :
        vcf = "Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall.vcf"
    shell:
        '''
        bcftools concat -o {output.vcf} -O v {input.chr_vcf} {input.other_vcf}
        '''
rule callRateFilter:
input:
vcf = 'Results/vcfFiltration/{prefix}_SNP_qualFiltered.vcf.gz',
sample_config = config['sample_config']
output:
vcf = temp('Results/vcfFiltration/{prefix}_SNP_callRateFiltered.vcf'),
stat = temp('Results/vcfFiltration/{prefix}_SNP_callRateFiltered.stat')
params:
call_rate_th = config['call_rate_th']
message:
"\nrule callRateFilter:\n" \
+ "\tinput: {input}\n" \
+ "\toutput: {output}\n" \
+ "\tcall rate threshold param: {params}\n" \
+ "Executing script/call_rate_filter.py to keep sites where genotype call rate is >= to a the call rate threshold parameter per sample group. Warning all samples in the VCF must be defined in the sample config tsv file\n"
script:
"../script/call_rate_filter.py"
rule vcfFilterSummary:
    input:
        VarForMasking = 'Results/vcfFiltration/' + PREFIX + '_SNP_exclude_nonVar.stat',
        VariantFiltration = 'Results/vcfFiltration/' + PREFIX + '_SNP_tagged.stat',
        callRateFilter = 'Results/vcfFiltration/' + PREFIX + '_SNP_callRateFiltered.stat'
    output:
        'Results/Summary/VariantFiltration_summary.txt'
    shell:
        '''
        echo "# Number of variants for the masking genome step" > {output}
        cat {input.VarForMasking} >> {output}
        echo -e "\n# Number of variants for the ASE count steps" >> {output}
        cat {input.VariantFiltration} >> {output}
        cat {input.callRateFilter} >> {output}
        '''

def get_table(wildcards):
    table_list = expand("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_" + wildcards.suffix + ".tsv", chrom=chrom_list)
    return table_list

rule merge_filtered_table:
    input :
        chr_tsv = get_table,
        other_tsv = "Results_ASE/vcfFiltration/by_chrom/other_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_{suffix}.tsv"
    output :
        tsv = "Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_{suffix}.tsv"
    wildcard_constraints:
        suffix = "GT|AD|DP"
    shell:
        '''
        head -n 1 {input.other_tsv} > {output.tsv}
        cat {input.chr_tsv} {input.other_tsv} | grep -v "#" >> {output.tsv}
        '''
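# For suffix="GT" and a hypothetical chrom_list of ["1", "2"], get_table()
# expands to the per-chromosome tables
#   Results_ASE/vcfFiltration/by_chrom/1_<PREFIX>_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv
#   Results_ASE/vcfFiltration/by_chrom/2_<PREFIX>_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv
# merge_filtered_table then keeps a single header line (taken from the "other"
# table) and strips the repeated "#"-prefixed headers while concatenating the
# table bodies.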
rule merge_filtered_summary:
input :
chr_tsv = expand("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_VariantFiltration_summary.txt", chrom=chrom_list),
other_tsv = "Results_ASE/vcfFiltration/by_chrom/other_" + PREFIX + "_VariantFiltration_summary.txt"
output :
summary = "Results_ASE/Summary/VariantFiltration_summary.txt"
run :
dic_results = dict()
        # build a plain list so that snakemake's input object is not mutated
        in_list = list(input.chr_tsv) + [input.other_tsv]
# in_list = ['Results_ASE/vcfFiltration/by_chrom/1_test_VariantFiltration_summary.txt', 'Results_ASE/vcfFiltration/by_chrom/25_test_VariantFiltration_summary.txt', 'Results_ASE/vcfFiltration/by_chrom/26_test_VariantFiltration_summary.txt', 'Results_ASE/vcfFiltration/by_chrom/other_test_VariantFiltration_summary.txt']
first=in_list.pop(0)
FH_first = open(first,'rt')
summary = ''
first_key = ''
percent_list = list()
for line in FH_first:
if ':' in line:
category = line.split(':')[0]
key = category.upper().strip().replace(' ','_')
                # store the first key, considered as the total number of variants
if first_key == '':
first_key = key
                # store the keys for which a percentage must be computed
if '%)' in line:
percent_list.append(key)
value = int(line.split(':')[1].strip().split()[0])
summary += category + ': ###' + key + '###\n'
dic_results[key] = value
elif "# Input vcf file is" in line:
line = "# Input vcf file is " + "Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf\n"
summary += line
elif "those SNP are described in" in line:
line = "\t\t\tthose SNP are described in " + "Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv, Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv, Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv\n"
summary += line
else:
summary += line
FH_first.close()
for f in in_list:
FH_in = open(f,'rt')
for line in FH_in:
if ':' in line:
category = line.strip().split(':')[0]
key = category.upper().strip().replace(' ','_')
value = int(line.strip().split(':')[1].strip().split()[0])
dic_results[key] += value
FH_in.close()
for k in dic_results:
if k in percent_list:
percent = round(dic_results[k]*100/dic_results[first_key],2)
summary = summary.replace('###' + k + '###' , str(dic_results[k]) + ' (' + str(percent) + '%)')
else:
summary = summary.replace('###' + k + '###' , str(dic_results[k]))
FH_out = open(output.summary,'wt')
FH_out.write(summary)
FH_out.close()
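# Minimal sketch of the "###KEY###" placeholder mechanism used above, with
# hypothetical category names and counts: the first per-chromosome summary
# provides the layout, every summary contributes its counts, and the
# placeholders are replaced once the totals are known.
def _example_fill_placeholders():
    totals = {'TOTAL_SNP': 120, 'KEPT_SNP': 90}
    percent_list = ['KEPT_SNP']  # keys reported with a percentage
    summary = 'Total SNP: ###TOTAL_SNP###\nKept SNP: ###KEPT_SNP###\n'
    for k in totals:
        if k in percent_list:
            percent = round(totals[k] * 100 / totals['TOTAL_SNP'], 2)
            summary = summary.replace('###' + k + '###', str(totals[k]) + ' (' + str(percent) + '%)')
        else:
            summary = summary.replace('###' + k + '###', str(totals[k]))
    return summary  # 'Total SNP: 120\nKept SNP: 90 (75.0%)\n'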
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import gzip
def is_snp(ref, alt):
for a in alt.split(','):
if len(a) != len(ref):
return False
return True
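# Examples (illustration only): is_snp("A", "T") and is_snp("A", "T,C") return
# True, is_snp("A", "AT") returns False. Note that same-length substitutions
# such as is_snp("AT", "GC") also return True, so this is a length comparison
# rather than a strict single-base test.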
def is_gzip( file ):
    """
    @return: [bool] True if the file is gzipped.
    @param file : [str] Path to the processed file.
    """
    is_gzip = None
    FH_input = gzip.open( file )
    try:
        FH_input.readline()
        is_gzip = True
    except (OSError, EOFError):
        is_gzip = False
    finally:
        FH_input.close()
    return is_gzip
def store_group(sample_config):
    """
    @summary : store the group of each sample
    @param sample_config : [str] Path to the TAB-delimited sample config file. Header: idx name forward_read reverse_read sequencer read_length phred_scale group
    @return : dictionary with key = sample name, value = group name
    """
    FH = open(sample_config, 'rt')
    group_dict = dict()
    for line in FH:
        if line.startswith("id"):
            continue
        sample_name = line.split("\t")[1]
        group = line.strip().split("\t")[-1]
        if sample_name not in group_dict:
            group_dict[sample_name] = group
        else:
            if group != group_dict[sample_name]:
                raise Exception("Your sample " + sample_name + " is associated with different groups")
    FH.close()
    return group_dict
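# Example of the expected sample config layout (hypothetical values, TAB
# delimited; only the "name" and "group" columns are used here):
#
#   idx name    forward_read reverse_read sequencer read_length phred_scale group
#   1   sampleA A_1.fq.gz    A_2.fq.gz    HiSeq     100         33          liver
#   2   sampleB B_1.fq.gz    B_2.fq.gz    HiSeq     100         33          muscle
#
# store_group() would return {'sampleA': 'liver', 'sampleB': 'muscle'}.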
def process(vcf_file, sample_config, call_rate, output_vcf, log):
# store sample's group and group counts
samples_group = store_group(sample_config)
group_count = dict()
for s in samples_group:
if not samples_group[s] in group_count:
group_count[samples_group[s]] = 0
group_count[samples_group[s]] += 1
# parse VCF
FH = None
if not is_gzip(vcf_file):
FH = open( vcf_file, 'rt' )
else:
FH = gzip.open( vcf_file, 'rt' )
FH_out = open(output_vcf,"w")
sample_names = list()
tot_snp = 0
valid_snp = 0
for line in FH:
if line.startswith("##") and not line.startswith("#CHROM"):
FH_out.write(line)
continue
# parse header and add call rate filter info
if line.startswith("#CHROM"):
            FH_out.write('##INFO=<ID=Valid_CR_Group,Number=.,Type=String,Description="List of groups that pass the call-rate threshold of ' + str(call_rate) + '">\n')
FH_out.write(line)
sample_names=line.strip().split("\t")[9:]
continue
record = line.strip().split()
tot_snp += 1
info = record[7]
group_nbCall = dict()
for idx, call in enumerate(record[9:]):
sample_name = sample_names[idx]
group = samples_group[sample_name]
gt = call.split(":")[0]
if gt != "./." :
if group not in group_nbCall:
group_nbCall[group] = 0
group_nbCall[group] += 1
if len(group_nbCall) == 0:
continue
ValidGroup = list()
for group in group_count:
if group in group_nbCall and group_nbCall[group] >= float(call_rate) * group_count[group]:
ValidGroup.append(group)
if len(ValidGroup) > 0:
valid_snp += 1
info += ";Valid_CR_Group=" + "-".join(ValidGroup)
record[7]=info
FH_out.write("\t".join(record)+"\n")
FH.close()
FH_out.close()
FH_log = open(log,"w")
FH_log.write("# Number of variant passing the call rate filter (threshold per group = " + call_rate + ") :\n")
FH_log.write(str(valid_snp) + " variants are kept among " + str(tot_snp) + "\n")
FH_log.close()
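# Minimal sketch of the per-record decision made in process(), with
# hypothetical group sizes: a site genotyped in 3 of 4 "liver" samples and
# 1 of 4 "muscle" samples, at call_rate = 0.5, is kept and annotated with
# Valid_CR_Group=liver (3 >= 0.5 * 4 but 1 < 0.5 * 4).
def _example_valid_groups(group_nbCall, group_count, call_rate):
    return [g for g in group_count
            if g in group_nbCall and group_nbCall[g] >= float(call_rate) * group_count[g]]
# _example_valid_groups({'liver': 3, 'muscle': 1}, {'liver': 4, 'muscle': 4}, 0.5)
#   -> ['liver']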
vcf_file = snakemake.input.vcf
sample_config = snakemake.input.sample_config
call_rate = snakemake.params.call_rate_th
output_vcf = snakemake.output.vcf
log = snakemake.output.stat
#~ vcf_file = "../Results/vcfFiltration/RpRm_SNP_qualFiltered.vcf.gz"
#~ sample_config = "../RpRm.tsv"
#~ call_rate = "0.95"
#~ output_vcf = "RpRm_SNP_callRateFiltered.vcf"
#~ log = "log"
# keep sites where the genotype call rate is >= a threshold per population
# warning: all samples in the VCF must be defined in the sample config TSV file
process(vcf_file, sample_config, call_rate, output_vcf, log)
#!/usr/bin/env python3
from Bio import SeqIO
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import gzip
import os,sys
def is_gzip(file_in):
    """
    @return: [bool] True if the file is gzipped.
    @param file_in : [str] Path to the processed file.
    """
    is_gzip = None
    handle_input = gzip.open( file_in )
    try:
        handle_input.readline()
        is_gzip = True
    except (OSError, EOFError):
        is_gzip = False
    finally:
        handle_input.close()
    return is_gzip
def checkQualityEncode ( file_in ):
    """
    @summary : guess the Phred quality scale. Inspired by https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py
    @param file_in : [str] Path of the fastq file
    @return : either "Q33" or "Q64" ("" if the encoding cannot be resolved)
    """
RANGES = {
'Q33': (33, 74),
'Q64': (64, 105)
}
handle_in=None
if is_gzip(file_in):
handle_in = gzip.open(file_in,"rt")
else:
handle_in = open(file_in,"rt")
scale = ""
gmin, gmax = 99, 0
for title_line, seq_string, quality_string in FastqGeneralIterator(handle_in):
qual = [ ord(c) for c in quality_string ]
qmin = min(qual)
qmax = max(qual)
if qmin < gmin or qmax > gmax:
gmin, gmax = min(qmin, gmin), max(qmax, gmax)
valid_encodings = []
for encoding, (emin, emax) in RANGES.items():
if gmin >= emin and gmax <= emax:
valid_encodings.append(encoding)
if len(valid_encodings) == 1 :
scale = valid_encodings[0]
break
return scale
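# Note on the early exit above: the two ranges overlap in [64, 74], so a read
# whose quality characters all fall in that window is compatible with both
# scales and the scan must continue. For example, '#' (ASCII 35) can only be
# Q33 and 'h' (ASCII 104) can only be Q64, while a file containing only
# characters between '@' (64) and 'J' (74) never resolves and
# checkQualityEncode() returns "".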
f_in=snakemake.input.fastq_in
f_out=snakemake.output.fastq_out
encoding = checkQualityEncode(f_in)
if encoding == "Q33":
print(f_in+" is "+encoding+ " and DO NOT need to be converted\n", file=sys.stderr)
os.symlink(f_in, f_out)
else:
try:
print(f_in+" is "+encoding+ " and need to be converted\n", file=sys.stderr)
handle_in = gzip.open(f_in,"rt") if is_gzip(f_in) else open(f_in, "rt")
handle_out = gzip.open(f_out, "wt")
SeqIO.convert(handle_in, 'fastq-illumina' , handle_out, 'fastq-sanger')
except:
raise Exception("there is no valid encoding found")