# Template workflow readme
### Introduction
What is the purpose of the pipeline?
e.g.: It pre-processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details on the results.
```mermaid
graph TD
A(Raw reads <br><i>Fastq</i>) -- raw data report --> B((fastQC))
B --> C(report <br><i>HTML</i>)
A -- trimming --> D((cutadapt))
D --> E(trimmed reads <br><i>Fastq</i>)
E -- genome alignment --> F((STAR))
F --> G(aligned reads <br><i>BAM</i>)
```
The pipeline is built using Nextflow, a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.
### Which tools?
List of software:
- fastQC version XXX (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- cutadapt version YYY (https://github.com/marcelm/cutadapt)
- star version ZZZ (https://github.com/alexdobin/STAR)
### Where to use it?
This pipeline has been installed on the [Genotoul cluster](http://bioinfo.genotoul.fr/), so you can log on to genologin.toulouse.inra.fr and access it at /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf
### How to use it?
Command line to launch it on Genotoul ...
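For example, a hypothetical invocation (module names, paths and parameter values are illustrative and should be checked on the cluster; the parameter names come from main.nf below):

```bash
# load Nextflow, then launch the workflow with its main parameters
module load bioinfo/Nextflow
nextflow run /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf \
    --readfiles 'data/reads/*{1,2}.fastq.gz' \
    --genome data/reference/reference.fa \
    --annotation data/reference/reference.gtf \
    --outdir results/
```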
### Who is maintaining the workflow?
The Bios4Biol CATI members involved ...
Be aware that the people maintaining this workflow can be contacted in case of bugs, but not for bioinformatics explanations of your results.
### Credits
These scripts were copied from ....
Many thanks to others who have helped out along the way too, including (but not limited to):
[@loginTwitter](https://github.com/loginTwitter),...
---
params.outdir = "$baseDir/results/"
params.readfiles = "$baseDir/../../data/reads/*{1,2}.fastq.gz"
params.genome = "$baseDir/../../data/reference/reference.fa"
params.annotation = "$baseDir/../../data/reference/reference.gtf"
params.index = "$baseDir/../../index/"
outdir = params.outdir
readfiles = params.readfiles
genome = file(params.genome)
annotation = file(params.annotation)
index = params.index
Channel.fromPath(readfiles).set { reads }
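// QC: run FastQC on each raw fastq file and publish the HTML report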
process fastqc {
    cpus 1
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    file read from reads

    output:
    file '*html' into fastqc

    """
    fastqc --noextract ${read}
    """
}
Channel.fromFilePairs(readfiles).set { pair_reads }
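// Trimming: remove adapter sequences from each read pair with cutadapt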
process cutadapt {
    cpus 1

    input:
    set pair_id, file(read) from pair_reads

    output:
    set pair_id, 'clean_*' into clean_files1, clean_files2

    """
    cutadapt -a ACACTCTTTCCCTACACGACGCTCTTCCGATC -A GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -o clean_${pair_id}1.fastq.gz -p clean_${pair_id}2.fastq.gz ${read}
    """
}
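// Indexing: build the STAR genome index once and cache it via storeDir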
process indexGenome {
    storeDir index

    input:
    file ref from genome
    file annot from annotation

    output:
    file "${dbname}.star/*" into startDb1, startDb2

    script:
    dbname = ref.baseName
    """
    mkdir ${dbname}.star
    STAR --runMode genomeGenerate --genomeDir ${dbname}.star --genomeFastaFiles ${ref} --sjdbGTFfile ${annot} --sjdbOverhang 99 --runThreadN ${task.cpus}
    """
}
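// Mapping: align the trimmed read pairs with STAR, emitting unsorted BAM files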
process mapping {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files1
    val starDB from startDb1.first()

    output:
    file "*bam" into aligned_sorted_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outSAMtype BAM Unsorted --outFilterType BySJout
    """
}
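// Mapping (SAM variant): same alignment, but with STAR's default SAM output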
process mapping2 {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files2
    val starDB from startDb2.first()

    output:
    file "*sam" into aligned_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outFilterType BySJout
    """
}
#!/usr/bin/perl -w
# http://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ
use strict;
use File::Basename;
use List::MoreUtils qw( minmax );
# fastq_detect.pl fastq.file sample-size
# detect fastQ format from quality scores in fastQ input file
# Version 3
#
# Stephane Plaisance - VIB-BITS - July-04-2012
# Joachim Jacob - Aug-02-2012 - joachim.jacob@gmail.com
# - changed the maximum value of Sanger to 73
# - changed reading the file with a file handle
# (was a file handle !! supporting several archive formats. SP)
# - changed the diagnosing algorithm
# Stephane Plaisance - VIB-BITS - April-08-2013
# - merged both versions and corrected flaw in min/max
# thanks to Sergey Mitrfanov for perl reformatting
#####################################################################
# diagnose
# SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
# ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
# ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
# .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
# LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
# !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
# | | | | | |
# 33 59 64 73 104 126
# S 0........................26...31.......40
# X -5....0........9.............................40
# I 0........9.............................40
# J 3.....9.............................40
# L 0.2......................26...31........41
#
# S - Sanger Phred+33, raw reads typically (0, 40)
# X - Solexa Solexa+64, raw reads typically (-5, 40)
# I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
# J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold)
# L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
# - Illumina 1.9+ Phred+33, raw reads typically (3, 41)
#####################################################################
my $script = basename($0);
@ARGV > 0 or die "usage: $script <fastq file> <opt:sample-size (100)>\n";
my ($inputfile, $limit) = @ARGV;
if (! defined $limit) { $limit = 100}; # check first 100 records
my $cnt=0;
my ($min, $max); # global min and max values
my $z = ReadFile ($inputfile) || die "Error: cannot read from input file $inputfile: $!\n";
## parse
while (my $id = <$z>) {
    $id =~ m/^@/ || die "expected @ not found in line 1!\n";
    my $seq = <$z>;
    my $sep = <$z>;
    $sep =~ m/^\+/ || die "expected + not found in line 3!\n";
    my $qual = <$z>;
    chomp($qual);
    $cnt++;
    # char to ascii
    my @chars = split("", $qual);
    my @nums = sort { $a <=> $b } (map { unpack("C*", $_) } @chars);
    if ($cnt == 1) {
        ($min, $max) = minmax @nums;
    } else {
        my ($lmin, $lmax) = minmax @nums; # local values for this read
        $min = $lmin if $lmin < $min;
        $max = $lmax if $lmax > $max;
    }
    $cnt >= $limit && last;
}
undef $z;
## diagnose
my %diag=(
'Sanger' => '.',
'Solexa' => '.',
'Illumina 1.3+' => '.',
'Illumina 1.5+' => '.',
'Illumina 1.8+' => '.',
'Illumina 1.9+' => '.'
);
my %comment=(
'Sanger' => 'Phred+33',
'Solexa' => 'Solexa+64',
'Illumina 1.3+' => 'Phred+64',
'Illumina 1.5+' => 'Phred+64',
'Illumina 1.8+' => 'Phred+33',
'Illumina 1.9+' => 'Phred+33'
);
if ($min<33 || $max>105) { die "Quality values corrupt. found [$min; $max] where [33; 105] was expected\n"; }
if ($min>=33 && $max<=73) {$diag{'Sanger'}='x';}
if ($min>=59 && $max<=104) {$diag{'Solexa'}='x';}
if ($min>=64 && $max<=104) {$diag{'Illumina 1.3+'}='x';}
if ($min>=66 && $max<=105) {$diag{'Illumina 1.5+'}='x';}
if ($min>=33 && $max<=74) {$diag{'Illumina 1.8+'}='x';}
if ($min>=33 && $max<=75) {$diag{'Illumina 1.9+'}='x';}
## report
foreach my $format (sort keys %diag) {
    if ($diag{$format} eq 'x') {
        print STDOUT "# sampled raw quality values are in the range of [".$min."; ".$max."]\n";
        print STDOUT sprintf("%-30s\t$inputfile\n", $comment{$format});
    }
}
##############
#### Subs ####
# reads from uncompressed, gzipped and bgzip fastQ files
sub ReadFile {
    my $infile = shift;
    my $FH;
    if ($infile =~ /\.bz2$/) {
        open ($FH, "bzcat $infile |") or die ("$!: can't open file $infile");
    } elsif ($infile =~ /\.gz$/) {
        open ($FH, "zcat $infile |") or die ("$!: can't open file $infile");
    } elsif ($infile =~ /\.(fq|fastq|txt)$/) {
        open ($FH, "cat $infile |") or die ("$!: can't open file $infile");
    } else {
        die ("$!: do not recognise file type $infile");
    }
    return $FH;
}
# Allele Specific Expression Analysis workflow
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
**<u>!!! STATUS: DEVELOPMENT !!!</u>**
## What does it do?
This pipeline performs variant filtering, read alignment on a masked genome and allele-specific expression quantification.
The pipeline is represented in the graph below:
![](img/full_dryrun.png)
## 1. Input
The user must use the [config_calling.yaml](config_calling.yaml.example) file to provide all necessary inputs for the pipeline:
- A masked reference fasta file (fasta_ref option), indexed with samtools faidx and GATK4 CreateSequenceDictionary. We recommend masking the genome with variants filtered to be bi-allelic and on the GATK metrics FS < 30, QD > 2 and AF < 1. In our use case we masked the genome with variants called at the population level, and then launched this pipeline at the tissue level.
- A genome annotation reference GTF file (gtf_ref option)
- A set of variants to include in the ASE analysis (vcf option)
- A sample reads description TSV file (sample_config option), with the following columns (an example is sketched below):
  - idx : a unique index number to identify each fastq file (or pair of files)
  - name : the sample name
  - forward_read and reverse_read are fastq file names. These files need to be in the directory specified with the data_dir option
    - If single end, leave the reverse_read column empty
    - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
  - sequencer : the sequencer used to generate the reads: "ILLUMINA", "SLX", "SOLEXA", "SOLID", "454", "LS454", "COMPLETE", "PACBIO", "IONTORRENT", "CAPILLARY", "HELICOS", "UNKNOWN"
  - oriented : indicates whether the sequences were generated with a strand-specific protocol. It corresponds to the RSEM forward_prob parameter. Indicate:
    - 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
    - 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
    - 0.5 for a non-strand-specific protocol.
  - phred scale : indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 to 1.8 excluded)
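A minimal sample_config sketch (tab-separated; file names hypothetical, column set as documented in config_calling.yaml.example):

```
idx	name	forward_read	reverse_read	sequencer	read_length	phred_scale	group
1	samplePE	samplePE_R1.fastq.gz	samplePE_R2.fastq.gz	ILLUMINA	100	33	pop1
2	sampleSE	sampleSE.fastq.gz		ILLUMINA	100	33	pop1
```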
## 2. Data Processing
### Raw reads trimming
The first step filters and trims raw reads before alignment. It converts phred 64 scale fastq files into phred 33 scale fastq files, and then trims reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Adapters are trimmed, and trimmed reads shorter than a third of the read length are filtered out.
By default the trimming quality threshold is set to 15.
You will find in the Results/Summary directory a Log_TrimGalore_summary.tsv file that details the trim_galore statistics.
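For reference, the underlying call is roughly the sketch below (file names hypothetical; --length is a third of the read length from the sample table, --quality is the trimming_quality option):

```bash
trim_galore --paired --no_report_file --length 33 --quality 15 \
    -o Results/TrimGalore samplePE_R1.fastq.gz samplePE_R2.fastq.gz
```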
### Sequence alignment
Trimmed reads are aligned against the reference genome with [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537), following the multi-sample 2-pass mapping procedure and using the GTF reference file.
The second alignment pass takes into account the new junctions detected in the first pass. Across all samples, new canonical junctions covered by at least 2 reads in at least one sample are kept.
The STAR 2nd pass alignment generates bam files in genome coordinates.
You will find in the Results/Summary directory 2 files reporting alignment statistics: Log_STAR_Aln_1_summary.tsv and Log_STAR_Aln_2_summary.tsv
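Schematically, the two passes mirror the STAR_Aln_1 and STAR_Aln_2 rules (paths, thread count and file names hypothetical):

```bash
# pass 1: align and collect splice junctions (SJ.out.tab) per sample
STAR --runThreadN 8 --genomeDir Results/STAR_Index_1 \
    --readFilesIn samplePE_trim_R1.fastq.gz samplePE_trim_R2.fastq.gz \
    --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFilterType BySJout

# pass 2: re-align, injecting the junctions filtered across all samples
STAR --runThreadN 8 --genomeDir Results/STAR_Index_1 \
    --sjdbFileChrStartEnd Results/STAR_Index_1/SJ.out.filtered.tab \
    --readFilesIn samplePE_trim_R1.fastq.gz samplePE_trim_R2.fastq.gz \
    --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate
```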
### Genomic Alignment post processing
Alignments are filtered to keep only:
- properly paired alignments, in the case of paired libraries
- non-duplicated alignments (PCR or optical duplicates are removed)
- uniquely mapped reads (corresponding to a STAR mapping quality of 255)
Finally, reads that contain Ns in their CIGAR string are split, and mapping quality scores of 255 are converted to 60.
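A condensed sketch of this filtering chain, mirroring the commands in the alignment post-processing rules (file names hypothetical):

```bash
samtools view -h -f 2 sample.bam -bo sample_pp.bam             # keep properly paired alignments (PE only)
gatk MarkDuplicates -I sample_pp.bam -O sample_rmdup.bam -M sample.metrics.txt \
    --READ_NAME_REGEX 'null' --REMOVE_DUPLICATES true          # remove PCR/optical duplicates
samtools view -h -q 255 sample_rmdup.bam -bo sample_uniq.bam   # keep uniquely mapped reads (MAPQ 255)
gatk SplitNCigarReads -R reference_masked.fa -I sample_uniq.bam \
    -O sample_split.bam                                        # split reads with Ns in the CIGAR, 255 -> 60
```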
### Variant Filtration
Variants are filtered before being submitted to ASE.
Only bi-allelic SNPs with FS lower than or equal to 30.0 and QD greater than or equal to 2.0 are used in ASE.
Filtering on SNP type is an option, see SNP_filter in the configuration file. By default it is set to False, as the calling pipeline generates an output VCF file already filtered on SNP type.
The pipeline also adds annotations to the INFO field of the final VCF file (see: Results/vcfFiltration/*_FsQdBiall.vcf.gz):
- LOCALISATION will be set to:
  - *jnct5nt* if the variant is close (+/- 5 bp) to an exon/intron junction
  - *homodimer5* if the variant is included in a 5 bp homopolymer
  - *cl3snp35nt* if the variant is in a 35 bp window that includes at least 3 other variants
- GTpopCR indicates the percentage of known genotypes
- DPgt05rPopCR indicates the percentage of known genotypes covered by at least 5 reads
Additionally, the pipeline generates 3 tables that extract the per-sample GT, DP and AD_REF/AD_ALT genotype annotations. These tables include only SNPs with a genotype call rate greater than 50% and, counting only genotypes covered by at least 5 reads, a call rate greater than 20%.
### Compute Allele Specific Expression
Allele-specific expression is obtained using phASER (https://github.com/secastel/phaser) and ASEReadCounter (https://gatk.broadinstitute.org/hc/en-us/articles/360036714351-ASEReadCounter), on bi-allelic variants filtered as above (FS <= 30 and QD >= 2).
ASEReadCounter and phASER return, per sample, the allele expression at each SNP.
phASER also returns gene-level allele-specific expression by constructing gene haplotypes per sample, and finally an expression matrix that aggregates all the samples.
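Schematically, per sample (file names hypothetical; the quality thresholds shown are illustrative and actually come from the mappingQuality and baseQuality options):

```bash
# per-SNP allelic counts with GATK4 ASEReadCounter
gatk ASEReadCounter -R reference_masked.fa -I sample_split.bam \
    --variant variants_FsQdBiall.vcf.gz \
    --min-mapping-quality 60 --min-base-quality 20 -O sample_ASEReadCounter.csv

# haplotype-aware allelic counts with phASER, then gene-level ASE
python2.7 phaser.py --vcf variants_FsQdBiall.vcf.gz --bam sample_split.bam \
    --sample sample --paired_end 1 --mapq 60 --baseq 20 --o sample_phASER
python2.7 phaser_gene_ae.py --haplotypic_counts sample_phASER.haplotypic_counts.txt \
    --features reference.bed --o sample_phASER.gene_ae.txt
```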
## 3. Dependencies
The workflow depends on:
* cutadapt (version 2.1)
* trimgalore (version 0.6.5)
* STAR (version 2.6.0c)
* samtools (version 1.9 )
* bcftools (version 1.9)
* bedtools (version 2.29.0)
* tabix (version 0.2.5)
* GenomeAnalysisTK.jar (version 3.7), which includes Picard tools and ASEReadCounter
* java (version 8)
* phASER (cloned the 29th of May 2020)
These tools need to be available in your PATH or in a bin directory specified in the config.yaml file with the bin_dir option.
## 4. Configure and run the workflow
- Copy the config_calling.yaml.example file into your working directory and update it according to the instructions inside the file.
- Launch the snakemake command as presented in SLURM.sh. This script explains how to launch a dry run, a full run, expression quantification only, or variant calling only (a hypothetical dry run is sketched below).
- If necessary, update resources_SLURM.yaml to define the CPU, memory and cluster partition resources for each rule. The default section gives the minimal resources per job. The other sections are named after the rules and define only the resources that differ from the default section.
resources_SLURM.yaml and SLURM.sh are specific to SLURM clusters. If you use another scheduler, you need to create equivalent files; these can serve as a starting point.
This is the official documentation of Snakemake if you need more information: <https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html>
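For instance, a hypothetical dry run (the exact flags live in SLURM.sh and may differ):

```bash
# dry run: print the jobs that would be executed, without running them
snakemake -n --snakefile Snakefile \
    --configfile config_calling.yaml \
    --config resources=resources_SLURM.yaml --jobs 50
```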
@@ -7,7 +7,6 @@ import pandas as pd
##########################################################
### system options
# Recover software memory and cpus resources
with open(config["resources"]) as yml:
@@ -18,6 +17,8 @@ if not "__default__" in config or not "cpu" in config["__default__"] or not "mem
if int(config["__default__"]["mem"].replace("G","")) < 1 :
raise Exception("you need at least 5G of memory per job to run this pipeline")
#~ print(config)
def check_param(rule,resource):
if not rule in config:
config[rule]={resource:config["__default__"][resource]}
@@ -26,9 +27,9 @@ def check_param(rule,resource):
rule_resources = {
'indexFasta' : ['mem'],
'VarForMasking' : ['mem'],
'VariantFiltration' : ['mem'],
'SelectVariantsForASE' : ['mem'],
'FS_QD_clust' : ['mem'],
'vcf_filtration' : ['cpu'],
'SelectVariants' : ['mem'],
'STAR_Index_1' : ['cpu'],
'STAR_Aln_1' : ['cpu'],
'STAR_Aln_2' : ['cpu'],
@@ -47,6 +48,9 @@ for rule in rule_resources:
# add bin directory in PATH
if "bin_dir" in config:
os.environ['PATH'] = config["bin_dir"] + os.pathsep + os.environ['PATH']
# add script dir to PATH
SCRIPT_DIR= os.path.abspath(os.path.join(workflow.basedir, "script"))
os.environ['PATH'] = SCRIPT_DIR + os.pathsep + os.environ['PATH']
##########################################################
### workflows options
@@ -54,24 +58,32 @@ if "bin_dir" in config:
# check the reference genome files
if not "fasta_ref" in config or not os.path.exists(config["fasta_ref"]):
raise Exception("You need to provide an existing fasta reference file, using fasta_ref in your config.yaml file")
elif not os.path.exists(config["fasta_ref"] + ".fai") or not os.path.exists(os.path.splitext(config["fasta_ref"])[0] + ".dict"):
raise Exception("Your fasta reference file need to be indexed with samtools faidx and GATK CreateSequenceDictionnary ")
if not "gtf_ref" in config or not os.path.exists(config["gtf_ref"]):
raise Exception("You need to provide an existing gtf reference file, using gtf_ref in your config.yaml file")
# check about VCF files
if not 'vcf' in config or not os.path.exists(config['vcf']):
raise Exception('You need to provide an existing VCF file, using the vcf parameter in the config.yaml file')
if not os.path.exists(config['vcf'] + '.tbi'):
raise Exception('Your input VCF file : ' + config['vcf'] + ', is not indexed by tabix.')
if not config['vcf'].endswith('.gz'):
raise Exception('Your input VCF file : ' + config['vcf'] + ', is not compressed with bgzip')
# check sample config TSV file
if not "sample_config" in config or not os.path.exists(config["sample_config"]):
raise Exception("You need to provide an existing sample config tabular file, using sample_config in your config.yaml file")
if not 'SNP_filter' in config:
config['SNP_filter'] = False
elif config['SNP_filter'] != False and config['SNP_filter'] != True:
raise Exception('SNP_filter must be set to True or False')
# load sample_config
table = pd.read_csv(config["sample_config"], sep='\t', dtype=str)
# check sequencer type
@@ -87,7 +99,7 @@ if table.shape[0] != len(table["forward_read"].unique()):
#################################################################################################
## FUNCTIONS
def find_jar(jar):
path=""
for bin_dir in os.environ['PATH'].split(os.pathsep):
@@ -109,7 +121,6 @@ def gatk_multi_arg(flag, files):
# genome processing
include : "rules/basics.smk"
include : "rules/variantFiltration.smk"
include : "rules/genomeMasking.smk"
# reads preprocessing and alignment
include : "rules/TrimGalore.smk"
include : "rules/alignmentProcess.smk"
@@ -119,42 +130,56 @@ include : "rules/ASE_counter.smk"
localrules: all
final_outputs=list()
# reference masked
final_outputs.append('Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa')
final_outputs.append('Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa.fai')
final_outputs.append('Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.dict')
# STAR alignment
for sample in table["sample_name"].unique():
sample_table = table[table["sample_name"] == sample]
# paired end, filter on properly paired
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
final_outputs.append("Results/STAR_Aln_2/" + sample + "_rg_genomic_pp_rmdup_uniq_split.bam")
final_outputs.append("Results_ASE/STAR_Aln_2/" + sample + "_rg_genomic_pp_rmdup_uniq_split.bam")
# single end, no filter on properly paired
elif len(sample_table["reverse_read"].dropna() ) == 0:
final_outputs.append("Results/STAR_Aln_2/" + sample + "_rg_genomic_rmdup_uniq_split.bam")
final_outputs.append("Results_ASE/STAR_Aln_2/" + sample + "_rg_genomic_rmdup_uniq_split.bam")
# Summary
final_outputs.append("Results/Summary/Log_TrimGalore_summary.tsv")
final_outputs.append("Results/Summary/Log_STAR_Aln_1_summary.tsv")
final_outputs.append("Results/Summary/Log_STAR_Aln_2_summary.tsv")
final_outputs.append("Results/Summary/VariantFiltration_summary.txt")
# # Summary
final_outputs.append("Results_ASE/Summary/Log_TrimGalore_summary.tsv")
final_outputs.append("Results_ASE/Summary/Log_STAR_Aln_1_summary.tsv")
final_outputs.append("Results_ASE/Summary/Log_STAR_Aln_2_summary.tsv")
# variant Filtration
final_outputs.append('Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv'))
final_outputs.append('Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv'))
final_outputs.append('Results_ASE/Summary/VariantFiltration_summary.txt')
# counter
for sample in table["sample_name"].unique():
final_outputs.append('Results/Counter/' + sample + '_ASEReadCounter.csv')
final_outputs.append('Results/Counter/' + sample + '_phASER.allelic_counts.txt')
final_outputs.append('Results_ASE/ASEReadCounter/' + sample + '_ASEReadCounter.csv')
final_outputs.append('Results_ASE/phASER/' + sample + '_phASER.allelic_counts.txt')
final_outputs.append('Results_ASE/phASER_gene_ae/' + sample + '_phASER.gene_ae.txt')
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'))
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz'))
# print(final_outputs)
# ['Results/genomeMasked/reference_masked.fa',
# 'Results/genomeMasked/reference_masked.fa.fai',
# 'Results/genomeMasked/reference_masked.dict',
# 'Results/STAR_Aln_2/samplePE_rg_genomic_pp_rmdup_uniq_split.bam',
# 'Results/Counter/samplePE_ASEReadCounter.csv',
# 'Results/Counter/samplePE_phASER.allelic_counts.txt']
# ['Results_ASE/STAR_Aln_2/samplePE_rg_genomic_pp_rmdup_uniq_split.bam',
# 'Results_ASE/STAR_Aln_2/sampleSE_rg_genomic_rmdup_uniq_split.bam',
# 'Results_ASE/Summary/Log_TrimGalore_summary.tsv',
# 'Results_ASE/Summary/Log_STAR_Aln_1_summary.tsv',
# 'Results_ASE/Summary/Log_STAR_Aln_2_summary.tsv',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall.vcf.gz',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv',
# 'Results_ASE/vcfFiltration/variants_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv',
# 'Results_ASE/Summary/VariantFiltration_summary.txt',
# 'Results_ASE/ASEReadCounter/samplePE_ASEReadCounter.csv',
# 'Results_ASE/phASER/samplePE_phASER.allelic_counts.txt',
# 'Results_ASE/phASER_gene_ae/samplePE_phASER.gene_ae.txt',
# 'Results_ASE/ASEReadCounter/sampleSE_ASEReadCounter.csv',
# 'Results_ASE/phASER/sampleSE_phASER.allelic_counts.txt',
# 'Results_ASE/phASER_gene_ae/sampleSE_phASER.gene_ae.txt']
rule all:
input:
......
# bin_dir is a directory containing software binary used in this pipeline:
# trimgalore (version 0.4.5 with cutadapt (version 1.14))
# trimgalore (version 0.6.5 with cutadapt (version 2.1))
# tabix and bgzip (version 0.2.5)
# STAR (version 2.6.0c )
# GenomeAnalysisTK.jar (version 3.8.1)
# bedtools (version 2.27.1)
# picard.jar (version 2.1.1)
# samtools (version 1.8)
# GenomeAnalysisTK.jar (version 4.1.2.0)
# samtools (version 1.9)
# bedtools (version 2.29.0)
# md5sum
# phASER (version downloaded 07-06-2019 ) with :
# phASER (version downloaded 23-03-2020 ) with :
# python2.7 and associated Scipy and Numpy library
# bgzip (version 0.2.5, cf tabix)
# bcftools (version 1.8)
# bcftools (version 1.9)
# if not bin_dir, the binaries need to be available in the PATH
bin_dir : bin
# fasta_ref file is the genome reference Fasta file (need to be indexed with samtools faidx and picard CreateSequenceDictionary )
# fasta_ref file is the masked genome reference Fasta file (need to be indexed with samtools faidx and GATK4 CreateSequenceDictionary )
fasta_ref : data/reference.fa
# gtf_ref file is the genome reference GTF file
gtf_ref : data/reference.gtf
# VCF input files to analyse. The VCF files need to be indexed by tabix
# VCF input files to analyse (including SNP only). The VCF files need to be indexed by tabix
vcf : data/variants.vcf.gz
# sample_config file is a tabular file describing each input fastq file(s)
# sample_config file is a tabular file describing each input fastq file(s) corresponding to samples included in the previous VCF file.
# header columns are :
# idx name forward_read reverse_read sequencer read_length phred_scale group
#
@@ -32,16 +33,15 @@ vcf : data/variants.vcf.gz
# - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
# sequencer needs to be chosen from the list : ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
# phred scale indicates the phred score scale used to encode base quality: either 33 (Sanger and illumina 1.8+) or 64 (Solexa, illumina 1.3 to 1.8 excluded)
# group is used to define sub-populations among all samples (in the vcf input file). It will be used to filter out variants that are not present in at least one group with a call rate > call_rate_th
sample_config : data/population.tsv.example
# data_dir contains fastq files described in sample_config file
data_dir : data
# minimum calling depth
depth : 10
# minimum percentage of samples with no missing genotype per group
call_rate_th : 0.95
# Filter on variant type if not already done (True or False)
SNP_filter : True
# quality trimming threshold used in trimgalore to remove low quality bases.
trimming_quality : 15
# minimum base quality
baseQuality : 20
# minimum mapping quality
......
@HD VN:1.6
@SQ SN:1 LN:196202544 M5:971cb1c7a7f62a402dab61cfe84a93b1 UR:file:/home/maria/workspace/workflows_develop/Snakemake/1000RNASeq_chicken/ASE2/data/reference.fa
1 196202544 3 60 61
Snakemake/1000RNASeq_chicken/ASE/img/full_dryrun.png (binary image, 231 KiB)
__default__:
mem : "8G"
cpu : 1
time : "4:00:00"
partition : workq
mem : "8G"
cpu : 1
time : "4:00:00"
partition : workq
VarForMasking:
mem : "10G"
trim_pe :
time : "24:00:00"
trim_se :
time : "24:00:00"
maskFasta:
FS_QD_clust:
mem : "20G"
SelectVariants:
mem : "20G"
STAR_Index_1 :
mem : "18G"
vcf_filtration:
cpu : 10
mem : "16G"
time : "96:00:00"
partition : workq
STAR_Index_1:
mem : "50G"
cpu : 8
time : "24:00:00"
STAR_Aln_1 :
mem : "12G"
STAR_Aln_1:
mem : "30G"
cpu : 8
time : "24:00:00"
STAR_Aln_2 :
mem : "12G"
STAR_Aln_2:
mem : "30G"
cpu : 8
time : "24:00:00"
AddOrReplaceReadGroups:
mem: "12G"
RmDuplicates:
mem: "20G"
SplitNCigarReads:
mem : "30G"
VariantFiltration:
mem : "20G"
SelectVariantsForASE:
mem : "20G"
time : "8:00:00"
phASER:
mem : "20G"
cpu : 8
mem : "180G"
time : "8:00:00"
phASER_expr_matrix:
cpu : 8
ASEReadCounter:
mem : "20G"
@@ -13,10 +13,10 @@ def get_splitNCigar_bam(wildcards):
# paired end, filter on properly paired and rmdup
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam"
# single end, no filter
elif len(sample_table["reverse_read"].dropna() ) == 0:
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam"
else :
raise Exception(wildcards.sample + " is sequenced in both paired-end and single-end mode\nThis workflow does not accept this case!\n")
@@ -26,40 +26,43 @@ def get_splitNCigar_bai(wildcards):
# paired end, filter on properly paired and rmdup
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam.bai"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp_rmdup_uniq_split.bam.bai"
# single end, no filter
elif len(sample_table["reverse_read"].dropna() ) == 0:
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam.bai"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_rmdup_uniq_split.bam.bai"
else :
raise Exception(wildcards.sample + " is sequenced in both paired-end and single-end mode\nThis workflow does not accept this case!\n")
# GATK3 :
# java -jar GenomeAnalysisTK.jar -T ASEReadCounter -R {input.ref} -o {output} -I {input.bam} -sites {input.vcf} -U ALLOW_N_CIGAR_READS --minMappingQuality {params.mappingQuality} --minBaseQuality {params.baseQuality}
# GATK4 : ALLOW_N_CIGAR_READS seems to be useless now
rule ASEReadCounter:
input:
ref = config["fasta_ref"],
vcf = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz'),
idx = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz.tbi'),
vcf = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz'),
idx = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz.tbi'),
bam = get_splitNCigar_bam,
bai = get_splitNCigar_bai
output:
cvs = 'Results/Counter/{sample}_ASEReadCounter.csv'
cvs = 'Results_ASE/ASEReadCounter/{sample}_ASEReadCounter.csv'
params:
mem = config["ASEReadCounter"]["mem"],
jar = find_jar("GenomeAnalysisTK.jar"),
mappingQuality = config['mappingQuality'],
baseQuality = config['baseQuality']
shell:
"""
java -Xmx{params.mem} -jar {params.jar} -T ASEReadCounter -R {input.ref} -o {output} -I {input.bam} -sites {input.vcf} -U ALLOW_N_CIGAR_READS --minMappingQuality {params.mappingQuality} --minBaseQuality {params.baseQuality}
gatk --java-options "-Xmx{params.mem}" ASEReadCounter -R {input.ref} -O {output} -I {input.bam} --tmp-dir Results_ASE/ASEReadCounter --variant {input.vcf} --min-mapping-quality {params.mappingQuality} --min-base-quality {params.baseQuality}
"""
rule phASER:
input:
vcf = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz') ,
idx = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_SNP_callRateFiltered.vcf.gz.tbi') ,
vcf = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz'),
idx = 'Results_ASE/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','_FsQdBiall.vcf.gz.tbi'),
bam = get_splitNCigar_bam,
bai = get_splitNCigar_bai
output:
csv = 'Results/Counter/{sample}_phASER.allelic_counts.txt'
csv = 'Results_ASE/phASER/{sample}_phASER.allelic_counts.txt',
hap_count = 'Results_ASE/phASER/{sample}_phASER.haplotypic_counts.txt'
threads: config["phASER"]["cpu"]
params:
mappingQuality = config['mappingQuality'],
@@ -68,5 +71,45 @@ rule phASER:
idSeparator = '_' if config['id_separator'] == '' else config['id_separator']
shell:
"""
python2.7 {config[bin_dir]}/phaser.py --vcf {input.vcf} --bam {input.bam} --sample {wildcards.sample} --threads {threads} --o Results/Counter/{wildcards.sample}_phASER --paired_end {params.paired_option} --baseq {params.baseQuality} --mapq {params.mappingQuality} --id_separator {params.idSeparator}
python2.7 {config[bin_dir]}/phaser.py --vcf {input.vcf} --bam {input.bam} --sample {wildcards.sample} --threads {threads} --o Results_ASE/phASER/{wildcards.sample}_phASER --paired_end {params.paired_option} --unphased_vars 1 --gw_phase_vcf 1 --baseq {params.baseQuality} --mapq {params.mappingQuality} --id_separator {params.idSeparator}
"""
rule gtf2bed:
input:
gtf = config['gtf_ref']
output:
bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
script:
"../script/gtf2bed.py"
rule phASER_gene_ae:
input:
hap_count = "Results_ASE/phASER/{sample}_phASER.haplotypic_counts.txt",
bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
output:
gene_ae = "Results_ASE/phASER_gene_ae/{sample}_phASER.gene_ae.txt"
params:
idSeparator = '_' if config['id_separator'] == '' else config['id_separator']
shell:
"""
python2.7 {config[bin_dir]}/phaser_gene_ae.py --haplotypic_counts {input.hap_count} --features {input.bed} --o {output} --id_separator {params.idSeparator}
"""
rule phASER_expr_matrix:
input:
expand("Results_ASE/phASER_gene_ae/{sample}_phASER.gene_ae.txt", sample=table["sample_name"].unique()),
bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
output:
bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'),
phased_bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz')
params:
gene_ae_dir = "Results_ASE/phASER_gene_ae/",
out_prefix = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER')
threads: config["phASER_expr_matrix"]["cpu"]
shell:
"""
python2.7 {config[bin_dir]}/phaser_expr_matrix.py --t {threads} --gene_ae_dir {params.gene_ae_dir} --feature {input.bed} --o {params.out_prefix}
"""
# coding: utf-8
__author__ = 'Kenza BAZI KABBAJ / Maria BERNARD- Sigenae'
__author__ = 'Maria BERNARD- Sigenae'
__version__ = '1.0.0'
'''
@@ -16,7 +16,7 @@ rule convertPhred:
input:
fastq_in = get_phred64
output:
fastq_out = temp("Results/Phred_convert/{prefix}.fastq.gz")
fastq_out = temp("Results_ASE/Phred_convert/{prefix}.fastq.gz")
script:
"../script/convertPhred64_to_Phred33.py"
@@ -27,13 +27,13 @@ def get_fastq(wildcards):
if re.search('^' + wildcards.prefix + '(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$',f):
is_phred64 = table[table["forward_read"] == f]["phred_scale"] == '64'
if is_phred64.all():
inputs.append("Results/Phred_convert/" + f)
inputs.append("Results_ASE/Phred_convert/" + f)
else:
inputs.append(config["data_dir"] + "/" + f)
r = table[table["forward_read"] == f]["reverse_read"]
if not r.isnull().values.any():
if is_phred64.all():
inputs.append("Results/Phred_convert/" + r.squeeze())
inputs.append("Results_ASE/Phred_convert/" + r.squeeze())
else:
inputs.append(config["data_dir"] + "/" + r.squeeze())
return inputs
@@ -43,14 +43,15 @@ rule trim_se:
input:
get_fastq
output:
out=temp("Results/TrimGalore/{prefix}_trim.fastq.gz")
out=temp("Results_ASE/TrimGalore/{prefix}_trim.fastq.gz")
params:
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 ))
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 )),
qual=config["trimming_quality"]
log:
"Results/TrimGalore/{prefix}_trimSe.log"
"Results_ASE/TrimGalore/{prefix}_trimSe.log"
shell:
"""
trim_galore --no_report_file --length {params.min_trim} --quality 0 -o `dirname {output.out}` {input} 2> {log}
trim_galore --no_report_file --length {params.min_trim} --quality {params.qual} -o `dirname {output.out}` {input} 2> {log}
mv `dirname {output.out}`/{wildcards.prefix}*trimmed.fq.gz {output.out}
"""
@@ -59,15 +60,16 @@ rule trim_pe:
input:
get_fastq
output:
out1=temp("Results/TrimGalore/{prefix}_trim_R1.fastq.gz"),
out2=temp("Results/TrimGalore/{prefix}_trim_R2.fastq.gz")
out1=temp("Results_ASE/TrimGalore/{prefix}_trim_R1.fastq.gz"),
out2=temp("Results_ASE/TrimGalore/{prefix}_trim_R2.fastq.gz")
log:
"Results/TrimGalore/{prefix}_trimPe.log"
"Results_ASE/TrimGalore/{prefix}_trimPe.log"
params:
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 ))
min_trim=lambda wildcards : str(int([ int(table[table["forward_read"] == f]["read_length"].tolist()[0]) for f in table["forward_read"] if f.startswith(wildcards.prefix)][0] / 3 )),
qual=config["trimming_quality"]
shell:
"""
trim_galore --paired --no_report_file --length {params.min_trim} --quality 0 -o `dirname {output.out1}` {input} 2> {log}
trim_galore --paired --no_report_file --length {params.min_trim} --quality {params.qual} -o `dirname {output.out1}` {input} 2> {log}
mv `dirname {output.out1}`/{wildcards.prefix}*_val_1.fq.gz {output.out1}
mv `dirname {output.out1}`/{wildcards.prefix}*_val_2.fq.gz {output.out2}
"""
@@ -79,9 +81,9 @@ def get_trim_Log(wildcards):
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
r = table[table["forward_read"] == f]["reverse_read"]
if not r.isnull().values.any():
inputs.append("Results/TrimGalore/" + prefix + "_trimPe.log" )
inputs.append("Results_ASE/TrimGalore/" + prefix + "_trimPe.log" )
else:
inputs.append("Results/TrimGalore/" + prefix + "_trimSe.log" )
inputs.append("Results_ASE/TrimGalore/" + prefix + "_trimSe.log" )
return inputs
@@ -89,7 +91,7 @@ rule log_merge_trim:
input:
logs=get_trim_Log
output:
out="Results/Summary/Log_TrimGalore_summary.tsv"
out="Results_ASE/Summary/Log_TrimGalore_summary.tsv"
script:
"../script/trimgalore_summary.py"
@@ -14,75 +14,78 @@ STAR_INDEX_FILE=['genomeParameters.txt', 'chrName.txt', 'chrLength.txt', 'chrSta
rule STAR_Index_1:
input:
fasta = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa',
fasta = os.path.abspath(config['fasta_ref']),
gtf = os.path.abspath(config["gtf_ref"])
output:
expand('Results/STAR_Index_1/{file}', file=STAR_INDEX_FILE)
expand('Results_ASE/STAR_Index_1/{file}', file=STAR_INDEX_FILE)
threads: config["STAR_Index_1"]["cpu"]
params:
sjdbOverhang=max([int(x) for x in table.read_length]) - 1
shell:
"STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir Results/STAR_Index_1 --genomeFastaFiles {input.fasta} --sjdbOverhang {params.sjdbOverhang} --sjdbGTFfile {input.gtf} "
"STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir Results_ASE/STAR_Index_1 --genomeFastaFiles {input.fasta} --sjdbOverhang {params.sjdbOverhang} --sjdbGTFfile {input.gtf} "
def get_fastq_trim(wildcards):
inputs = list()
for f in table["forward_read"] :
if re.search('^' + wildcards.prefix + '(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$',f):
inputs.append("Results/TrimGalore/" + f.replace(wildcards.prefix,wildcards.prefix + "_trim"))
inputs.append("Results_ASE/TrimGalore/" + f.replace(wildcards.prefix,wildcards.prefix + "_trim"))
r = table[table["forward_read"] == f]["reverse_read"]
if not r.isnull().values.any():
inputs.append("Results/TrimGalore/" + r.squeeze().replace(wildcards.prefix,wildcards.prefix + "_trim"))
inputs.append("Results_ASE/TrimGalore/" + r.squeeze().replace(wildcards.prefix,wildcards.prefix + "_trim"))
return inputs
rule STAR_Aln_1:
input:
expand('Results/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
expand('Results_ASE/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
fastq = get_fastq_trim
output:
temp("Results/STAR_Aln_1/{prefix}_Log.out"),
temp("Results/STAR_Aln_1/{prefix}_Log.progress.out"),
temp("Results/STAR_Aln_1/{prefix}_Log.final.out"),
bam=temp("Results/STAR_Aln_1/{prefix}_Aligned.sortedByCoord.out.bam"),
sj_out_tab="Results/STAR_Aln_1/{prefix}_SJ.out.tab"
temp("Results_ASE/STAR_Aln_1/{prefix}_Log.out"),
temp("Results_ASE/STAR_Aln_1/{prefix}_Log.progress.out"),
temp("Results_ASE/STAR_Aln_1/{prefix}_Log.final.out"),
bam=temp("Results_ASE/STAR_Aln_1/{prefix}_Aligned.sortedByCoord.out.bam"),
sj_out_tab="Results_ASE/STAR_Aln_1/{prefix}_SJ.out.tab"
threads: config["STAR_Aln_1"]["cpu"]
shell:
"STAR --runThreadN {threads} --genomeDir Results/STAR_Index_1 --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate --outFilterType BySJout"
"STAR --runThreadN {threads} --genomeDir Results_ASE/STAR_Index_1 --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate --outFilterType BySJout"
# see https://github.com/alexdobin/STAR/issues/524 and https://groups.google.com/forum/#!searchin/rna-star/2-pass|sort:date/rna-star/Cpsf-_rLK9I/gq-DaeyvBAAJ
def get_SJout_STAR1(wildcards):
inputs=list()
forward_list = table["forward_read"]
for f in forward_list:
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
inputs.append("Results/STAR_Aln_1/" + prefix + "_SJ.out.tab" )
inputs.append("Results_ASE/STAR_Aln_1/" + prefix + "_SJ.out.tab" )
return inputs
rule STAR_SJout_filter:
input:
sj_out_tab=get_SJout_STAR1
output:
'Results/STAR_Index_1/SJ.out.filtered.tab'
'Results_ASE/STAR_Index_1/SJ.out.filtered.tab'
shell:
"cat {input.sj_out_tab} | awk '$6==0 && $5>0 && $7>=2' | cut -f1-6 | sort | uniq -c |awk '{{if($1>1){{print $2,$3,$4,$5,$6,$7}}}}' |sed 's/ /\t/g' > {output}"
"""
# filter junctions on: mitochondria , non canonical, already known, covered by less than 3 uniquely mapped reads in at least 1 sample
cat {input.sj_out_tab} | awk '$6==0 && $5>0 && $7>=2' | cut -f1-6 | sort | uniq -c |awk '{{if($1>1){{print $2,$3,$4,$5,$6,$7}}}}' |sed 's/ /\t/g' > {output}
"""
rule STAR_Aln_2:
input:
expand('Results/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
expand('Results_ASE/STAR_Index_1/{file}', file=STAR_INDEX_FILE[1:]),
fastq = get_fastq_trim,
sj_out_tab = 'Results/STAR_Index_1/SJ.out.filtered.tab'
sj_out_tab = 'Results_ASE/STAR_Index_1/SJ.out.filtered.tab'
output:
temp("Results/STAR_Aln_2/{prefix}_Log.out"),
temp("Results/STAR_Aln_2/{prefix}_Log.progress.out"),
temp("Results/STAR_Aln_2/{prefix}_SJ.out.tab"),
g_bam=temp("Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam"),
log=temp("Results/STAR_Aln_2/{prefix}_Log.final.out"),
g_stat="Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.flagstat"
temp("Results_ASE/STAR_Aln_2/{prefix}_Log.out"),
temp("Results_ASE/STAR_Aln_2/{prefix}_Log.progress.out"),
temp("Results_ASE/STAR_Aln_2/{prefix}_SJ.out.tab"),
g_bam=temp("Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam"),
log=temp("Results_ASE/STAR_Aln_2/{prefix}_Log.final.out"),
g_stat="Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.flagstat"
threads: config["STAR_Aln_2"]["cpu"]
priority : 50
shell:
"""
STAR --runThreadN {threads} --genomeDir Results/STAR_Index_1/ --sjdbFileChrStartEnd {input.sj_out_tab} --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.g_bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate
STAR --runThreadN {threads} --genomeDir Results_ASE/STAR_Index_1/ --sjdbFileChrStartEnd {input.sj_out_tab} --readFilesIn {input.fastq} --readFilesCommand zcat --outFileNamePrefix `dirname {output.g_bam}`/{wildcards.prefix}_ --outSAMtype BAM SortedByCoordinate
samtools flagstat {output.g_bam} > {output.g_stat}
"""
@@ -90,47 +93,49 @@ def get_STAR_log(wildcards):
inputs = list()
for f in table["forward_read"]:
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
inputs.append("Results/" + wildcards.dir + "/" + prefix + "_Log.final.out" )
inputs.append("Results_ASE/" + wildcards.dir + "/" + prefix + "_Log.final.out" )
return inputs
rule log_merge_STAR:
input:
logs=get_STAR_log
output:
out="Results/Summary/Log_{dir}_summary.tsv"
out="Results_ASE/Summary/Log_{dir}_summary.tsv"
wildcard_constraints:
dir='STAR_Aln_[12]'
script:
"../script/star_summary.py"
# PICARD
# java -jar picard.jar AddOrReplaceReadGroups I={input.bam} O={output.bam} RGID={params.idx} RGLB={params.name} RGPL={params.sequencer} RGPU="-" RGSM={params.name}
# GATK4:
rule AddOrReplaceReadGroups:
input :
bam="Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam",
bam="Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord.out.bam",
output :
bam=temp("Results/STAR_Aln_2/{prefix}_Aligned.sortedByCoord_rg.bam")
bam=temp("Results_ASE/STAR_Aln_2/{prefix}_Aligned.sortedByCoord_rg.bam")
params :
idx = lambda wildcards : [ table[table["forward_read"] == f]["idx"].tolist()[0] for f in table["forward_read"] if f.startswith(wildcards.prefix)][0],
name = lambda wildcards : [ table[table["forward_read"] == f]["sample_name"].tolist()[0] for f in table["forward_read"] if f.startswith(wildcards.prefix)][0],
sequencer = lambda wildcards : [ table[table["forward_read"] == f]["sequencer"].tolist()[0] for f in table["forward_read"] if f.startswith(wildcards.prefix)][0],
mem=str(int(config["AddOrReplaceReadGroups"]["mem"].replace("G","")) -4 )+"G" if int(config["AddOrReplaceReadGroups"]["mem"].replace("G","")) -4 > 1 else config["AddOrReplaceReadGroups"]["mem"] ,
jar=find_jar("picard.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} AddOrReplaceReadGroups I={input.bam} O={output.bam} RGID={params.idx} RGLB={params.name} RGPL={params.sequencer} RGPU="-" RGSM={params.name}
gatk --java-options "-Xmx{params.mem}" AddOrReplaceReadGroups -I {input.bam} -O {output.bam} --RGID {params.idx} --RGLB {params.name} --RGPL {params.sequencer} --RGPU "-" --RGSM {params.name}
"""
def get_genomic_bam(wildcards):
inputs = list()
for f in table[table["sample_name"] == wildcards.sample]["forward_read"]:
prefix=re.split('(_R1){0,1}\.f.{0,3}q(\.gz){0,1}$', f)[0]
inputs.append("Results/STAR_Aln_2/" + prefix + "_Aligned.sortedByCoord_rg.bam" )
inputs.append("Results_ASE/STAR_Aln_2/" + prefix + "_Aligned.sortedByCoord_rg.bam" )
return inputs
rule merge_bam :
input:
bam = get_genomic_bam
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic.bam")
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic.bam")
threads: config["merge_bam"]["cpu"]
shell:
"""
@@ -146,9 +151,9 @@ rule merge_bam :
rule properly_paired:
input :
bam = "Results/STAR_Aln_2/{sample}_rg_genomic.bam"
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic.bam"
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic_pp.bam")
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic_pp.bam")
shell:
"""
samtools view -h -f 2 {input.bam} -bo {output.bam}
@@ -159,52 +164,54 @@ def get_properly_paired_bam(wildcards):
sample_table = table[table["sample_name"] == wildcards.sample]
# paired end, filter on properly paired
if len(sample_table["forward_read"]) == len(sample_table["reverse_read"].dropna()) :
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic_pp.bam"
# single end, no filter
elif len(sample_table["reverse_read"].dropna() ) == 0:
return "Results/STAR_Aln_2/" + wildcards.sample + "_rg_genomic.bam"
return "Results_ASE/STAR_Aln_2/" + wildcards.sample + "_rg_genomic.bam"
# note : This tool uses the READ_NAME_REGEX and the OPTICAL_DUPLICATE_PIXEL_DISTANCE options as the primary methods to identify and differentiate duplicate types. Set READ_NAME_REGEX to null to skip optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim
# PICARD
# java -jar picard.jar MarkDuplicates I={input.bam} O={output.bam} M={output.metrics} CREATE_INDEX=false READ_NAME_REGEX=null REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT
# GATK4:
rule RmDuplicates:
input:
bam = get_properly_paired_bam
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"),
metrics = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}.metrics.txt",
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"),
metrics = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}.metrics.txt",
params:
mem=str(int(config["rmdup"]["mem"].replace("G","")) -4 )+"G" if int(config["rmdup"]["mem"].replace("G","")) -4 > 1 else config["rmdup"]["mem"] ,
jar=find_jar("picard.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} MarkDuplicates CREATE_INDEX=false READ_NAME_REGEX=null I={input.bam} O={output.bam} REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT M={output.metrics}
gatk --java-options "-Xmx{params.mem}" MarkDuplicates -I {input.bam} -O {output.bam} -M {output.metrics} --CREATE_INDEX false --READ_NAME_REGEX 'null' --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY SILENT
"""
rule remove_multimap:
input:
bam = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup.bam"
output:
bam = temp("Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam")
bam = temp("Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam")
shell:
"""
samtools view -h -q 255 {input.bam} -bo {output.bam}
"""
# GATK3 :
# java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R {input.fasta} -I {input.bam} -o {output.bam} --disable_bam_indexing -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
# GATK4 : default behavior, see --skip-mapping-quality-transform option
rule SplitNCigarReads:
input:
fasta = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa',
idx = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa.fai',
dict = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.dict',
bam = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam",
bai = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam.bai"
fasta = config['fasta_ref'],
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam",
bai = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq.bam.bai"
output:
bam = "Results/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq_split.bam"
bam = "Results_ASE/STAR_Aln_2/{sample}_rg_genomic{properlyPaired}rmdup_uniq_split.bam"
params:
mem=config["SplitNCigarReads"]["mem"],
jar=find_jar("GenomeAnalysisTK.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} -T SplitNCigarReads -R {input.fasta} -I {input.bam} -o {output.bam} --disable_bam_indexing -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
gatk --java-options "-Xmx{params.mem}" SplitNCigarReads -R {input.fasta} -I {input.bam} -O {output.bam} --create-output-bam-index false
"""
@@ -23,8 +23,6 @@ rule bgzip:
'{file}.vcf'
output:
'{file}.vcf.gz'
wildcard_constraints:
file=".+_SNP_callRateFiltered|.+_SNP_exclude_nonVar"
shell:
"bgzip {input}"
@@ -48,19 +46,3 @@ rule bam_index:
shell:
"samtools index {input} "
rule indexFasta:
input:
ref = "{file}.fa"
output:
idx = "{file}.fa.fai",
dict = "{file}.dict"
params:
mem=str(int(config["indexFasta"]["mem"].replace("G","")) -4 )+"G" if int(config["indexFasta"]["mem"].replace("G","")) -4 > 1 else config["indexFasta"]["mem"] ,
jar=find_jar("picard.jar")
shell:
"""
samtools faidx {input.ref}
java -Xmx{params.mem} -jar {params.jar} CreateSequenceDictionary R={input.ref} O={output.dict}
"""
# coding: utf-8
__author__ = 'Maria BERNARD- Sigenae'
__version__ = '1.0.0'
'''
Mask variant sites in the reference genome
'''
rule maskFasta:
input:
ref = config['fasta_ref'],
vcf = 'Results/vcfFiltration/' + os.path.basename(config['vcf']).replace('.vcf.gz','') + '_SNP_exclude_nonVar.vcf.gz'
output:
ref = 'Results/genomeMasked/' + os.path.splitext(os.path.basename(config['fasta_ref']))[0] + '_masked.fa'
shell:
"""
bedtools maskfasta -fi {input.ref} -bed {input.vcf} -fo {output.ref}
"""
# coding: utf-8
__author__ = 'Kenza BAZI KABBAJ / Maria BERNARD- Sigenae'
__author__ = 'Maria BERNARD- Sigenae'
__version__ = '1.0.0'
'''
Rules to 'hard' filter SNPs and INDELs following the GATK recommendations and remove non-variant sites
'''
import sys
from pyfaidx import Fasta
genome = Fasta(config['fasta_ref'])
chrom_list = [ seq.name for seq in genome if seq.name.isdigit() ]
rule VarForMasking:
PREFIX = os.path.basename(config['vcf']).replace('.vcf.gz','')
# GATK4
rule SelectVariants :
input :
vcf = config['vcf']
output :
vcf = temp('Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.vcf'),
stat = temp('Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.stat')
params:
mem = config['VarForMasking']['mem'],
jar = find_jar("GenomeAnalysisTK.jar")
message:
"\nrule VarForMasking:\n" \
+ "\tinput: {input}\n" \
+ "\toutput: {output}\n" \
+ "Executing script/exclude_non_var.py to keep only SNP and exclude non variants site (all samples with the same genotype call).\n"
script:
'../script/exclude_non_var.py'
vcf = temp("Results_ASE/vcfFiltration/" + PREFIX + "_SNP.vcf")
params :
mem = config["SelectVariants"]["mem"]
shell:
'''
gatk --java-options "-Xmx{params.mem}" SelectVariants -V {input.vcf} -O {output.vcf} --select-type-to-include SNP
'''
def get_SNP_vcf(wildcards):
if config['SNP_filter']:
return "Results_ASE/vcfFiltration/" + PREFIX + "_SNP.vcf"
else:
return config['vcf']
rule VariantFiltration:
input :
vcf = 'Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.vcf.gz',
idx = 'Results/vcfFiltration/{prefix}_SNP_exclude_nonVar.vcf.gz.tbi',
# GATK3:
# java -jar GenomeAnalysisTK.jar -T VariantFiltration -R {input.ref} -V {input.vcf} {params.filter_exp} -o {output.vcf} -filterName "FS" -filter "FS > 30.0" -filterName "QD" -filter "QD < 2.0" --cluster-window-size 35 --cluster-size 3
# GATK4
rule FS_QD_clust :
input :
vcf = get_SNP_vcf,
ref = config['fasta_ref']
output :
vcf = temp('Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz'),
params:
mem= config['VariantFiltration']['mem'],
jar=find_jar("GenomeAnalysisTK.jar"),
filter_exp = '-filterName FS -filter \"FS > 30.0\" -filterName QD -filter \"QD < 2.0\" -filterName Depth -filter \"DP < ' + str(config["depth"]) + '\"'
vcf = temp("Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf"),
idx = temp("Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf.idx")
params :
mem = config["FS_QD_clust"]["mem"]
shell:
'''
java -Xmx{params.mem} -jar {params.jar} -T VariantFiltration -R {input.ref} -V {input.vcf} {params.filter_exp} -o {output.vcf}
gatk --java-options "-Xmx{params.mem}" VariantFiltration -R {input.ref} -V {input.vcf} -O {output.vcf} --filter-name FS --filter-expression "FS > 30.0" --filter-name QD --filter-expression "QD < 2.0" -window 35 -cluster 3
'''
rule VariantFiltrationStat:
    input:
        vcf = 'Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz'
    output:
        stat = temp('Results/vcfFiltration/{prefix}_SNP_tagged.stat')
    shell:
        '''
        echo "# Number of SNP filtered by criteria:" >> {output.stat}
        zcat {input.vcf} | grep -v "#" | cut -f 7 | sort | uniq -c >> {output.stat}
        echo "# Number of non biallelic SNP filtered: " >> {output.stat}
        zcat {input.vcf} | grep -v "#" | cut -f 5 | grep -c "," >> {output.stat}
        '''

rule vcf_filtration :
    input :
        vcf = "Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf",
        idx = "Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf.idx",
        gtf = config['gtf_ref'],
        fasta = config['fasta_ref']
    output :
        vcf = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall.vcf"),
        gt = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv"),
        dp = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv"),
        ad = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv"),
        summary = temp("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_VariantFiltration_summary.txt")
    params :
        chrom_selection = lambda wildcards : "--accepted-chrom " + wildcards.chrom if wildcards.chrom != "other" else "--excluded-chrom " + " ".join(chrom_list),
        cpu = config['vcf_filtration']['cpu']
    shell:
        '''
        variantFiltration.py --nb-cpus {params.cpu} --in-vcf {input.vcf} --in-gtf {input.gtf} --in-fasta {input.fasta} {params.chrom_selection} --out-vcf {output.vcf} --out-gt {output.gt} --out-dp {output.dp} --out-ad {output.ad} --summary {output.summary}
        '''
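# Worked example of the chrom_selection switch (hypothetical chromosome names,
# illustration only): each per-chromosome job restricts variantFiltration.py to
# a single chromosome, while the "other" job excludes every listed chromosome
# so that no site is processed twice.
def _example_chrom_selection(chrom, chroms=("1", "2", "X")):
    # mirrors the params lambda above with a fixed, hypothetical chrom list
    return "--accepted-chrom " + chrom if chrom != "other" else "--excluded-chrom " + " ".join(chroms)
# _example_chrom_selection("1")     -> '--accepted-chrom 1'
# _example_chrom_selection("other") -> '--excluded-chrom 1 2 X'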
rule SelectVariantsForASE:
    input :
        vcf = 'Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz',
        idx = 'Results/vcfFiltration/{prefix}_SNP_tagged.vcf.gz.tbi',
        ref = config['fasta_ref']
    output :
        vcf = temp('Results/vcfFiltration/{prefix}_SNP_qualFiltered.vcf.gz')
    params:
        mem = config['SelectVariantsForASE']['mem'],
        jar = find_jar("GenomeAnalysisTK.jar")
    shell:
        '''
        java -Xmx{params.mem} -jar {params.jar} -T SelectVariants -R {input.ref} -V {input.vcf} --restrictAllelesTo BIALLELIC --excludeFiltered -o {output.vcf}
        '''

rule merge_filteredVariant :
    input :
        chr_vcf = expand("Results_ASE/vcfFiltration/by_chrom/{CHR}_" + PREFIX + "_FsQdBiall.vcf", CHR=chrom_list),
        other_vcf = "Results_ASE/vcfFiltration/by_chrom/other_" + PREFIX + "_FsQdBiall.vcf"
    output :
        vcf = "Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall.vcf"
    shell:
        '''
        bcftools concat -o {output.vcf} -O v {input.chr_vcf} {input.other_vcf}
        '''
rule callRateFilter:
input:
vcf = 'Results/vcfFiltration/{prefix}_SNP_qualFiltered.vcf.gz',
sample_config = config['sample_config']
output:
vcf = temp('Results/vcfFiltration/{prefix}_SNP_callRateFiltered.vcf'),
stat = temp('Results/vcfFiltration/{prefix}_SNP_callRateFiltered.stat')
params:
call_rate_th = config['call_rate_th']
message:
"\nrule callRateFilter:\n" \
+ "\tinput: {input}\n" \
+ "\toutput: {output}\n" \
+ "\tcall rate threshold param: {params}\n" \
+ "Executing script/call_rate_filter.py to keep sites where genotype call rate is >= to a the call rate threshold parameter per sample group. Warning all samples in the VCF must be defined in the sample config tsv file\n"
script:
"../script/call_rate_filter.py"
rule vcfFilterSummary:
    input:
        VarForMasking = 'Results/vcfFiltration/' + PREFIX + '_SNP_exclude_nonVar.stat',
        VariantFiltration = 'Results/vcfFiltration/' + PREFIX + '_SNP_tagged.stat',
        callRateFilter = 'Results/vcfFiltration/' + PREFIX + '_SNP_callRateFiltered.stat'
    output:
        'Results/Summary/VariantFiltration_summary.txt'
    shell:
        '''
        echo "# Number of variants for the masking genome step" > {output}
        cat {input.VarForMasking} >> {output}
        echo -e "\n# Number of variants for the ASE count steps" >> {output}
        cat {input.VariantFiltration} >> {output}
        cat {input.callRateFilter} >> {output}
        '''

def get_table(wildcards):
    table_list = expand("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_" + wildcards.suffix + ".tsv", chrom=chrom_list)
    return table_list

rule merge_filtered_table:
    input :
        chr_tsv = get_table,
        other_tsv = "Results_ASE/vcfFiltration/by_chrom/other_" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_{suffix}.tsv"
    output :
        tsv = "Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_{suffix}.tsv"
    wildcard_constraints:
        suffix = "GT|AD|DP"
    shell:
        '''
        head -n 1 {input.other_tsv} > {output.tsv}
        cat {input.chr_tsv} {input.other_tsv} | grep -v "#" >> {output.tsv}
        '''
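# For suffix="GT" and a hypothetical chrom_list of ["1", "2"], get_table()
# expands to the per-chromosome tables
#   Results_ASE/vcfFiltration/by_chrom/1_<PREFIX>_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv
#   Results_ASE/vcfFiltration/by_chrom/2_<PREFIX>_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv
# merge_filtered_table then keeps a single header line (taken from the "other"
# table) and strips the repeated "#"-prefixed headers while concatenating the
# table bodies.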
rule merge_filtered_summary:
input :
chr_tsv = expand("Results_ASE/vcfFiltration/by_chrom/{chrom}_" + PREFIX + "_VariantFiltration_summary.txt", chrom=chrom_list),
other_tsv = "Results_ASE/vcfFiltration/by_chrom/other_" + PREFIX + "_VariantFiltration_summary.txt"
output :
summary = "Results_ASE/Summary/VariantFiltration_summary.txt"
run :
dic_results = dict()
        # build a plain list so that snakemake's input object is not mutated
        in_list = list(input.chr_tsv) + [input.other_tsv]
# in_list = ['Results_ASE/vcfFiltration/by_chrom/1_test_VariantFiltration_summary.txt', 'Results_ASE/vcfFiltration/by_chrom/25_test_VariantFiltration_summary.txt', 'Results_ASE/vcfFiltration/by_chrom/26_test_VariantFiltration_summary.txt', 'Results_ASE/vcfFiltration/by_chrom/other_test_VariantFiltration_summary.txt']
first=in_list.pop(0)
FH_first = open(first,'rt')
summary = ''
first_key = ''
percent_list = list()
for line in FH_first:
if ':' in line:
category = line.split(':')[0]
key = category.upper().strip().replace(' ','_')
                # store the first key, considered as the total number of variants
if first_key == '':
first_key = key
                # store the keys for which a percentage must be computed
if '%)' in line:
percent_list.append(key)
value = int(line.split(':')[1].strip().split()[0])
summary += category + ': ###' + key + '###\n'
dic_results[key] = value
elif "# Input vcf file is" in line:
line = "# Input vcf file is " + "Results_ASE/vcfFiltration/" + PREFIX + "_FS_QD_clust_tagged.vcf\n"
summary += line
elif "those SNP are described in" in line:
line = "\t\t\tthose SNP are described in " + "Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_GT.tsv, Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_DP.tsv, Results_ASE/vcfFiltration/" + PREFIX + "_FsQdBiall_withGTpopCR50_DPgt05rPopCR20_AD.tsv\n"
summary += line
else:
summary += line
FH_first.close()
for f in in_list:
FH_in = open(f,'rt')
for line in FH_in:
if ':' in line:
category = line.strip().split(':')[0]
key = category.upper().strip().replace(' ','_')
value = int(line.strip().split(':')[1].strip().split()[0])
dic_results[key] += value
FH_in.close()
for k in dic_results:
if k in percent_list:
percent = round(dic_results[k]*100/dic_results[first_key],2)
summary = summary.replace('###' + k + '###' , str(dic_results[k]) + ' (' + str(percent) + '%)')
else:
summary = summary.replace('###' + k + '###' , str(dic_results[k]))
FH_out = open(output.summary,'wt')
FH_out.write(summary)
FH_out.close()
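# Minimal sketch of the "###KEY###" placeholder mechanism used above, with
# hypothetical category names and counts: the first per-chromosome summary
# provides the layout, every summary contributes its counts, and the
# placeholders are replaced once the totals are known.
def _example_fill_placeholders():
    totals = {'TOTAL_SNP': 120, 'KEPT_SNP': 90}
    percent_list = ['KEPT_SNP']  # keys reported with a percentage
    summary = 'Total SNP: ###TOTAL_SNP###\nKept SNP: ###KEPT_SNP###\n'
    for k in totals:
        if k in percent_list:
            percent = round(totals[k] * 100 / totals['TOTAL_SNP'], 2)
            summary = summary.replace('###' + k + '###', str(totals[k]) + ' (' + str(percent) + '%)')
        else:
            summary = summary.replace('###' + k + '###', str(totals[k]))
    return summary  # 'Total SNP: 120\nKept SNP: 90 (75.0%)\n'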
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import gzip
def is_snp(ref, alt):
for a in alt.split(','):
if len(a) != len(ref):
return False
return True
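# Examples (illustration only): is_snp("A", "T") and is_snp("A", "T,C") return
# True, is_snp("A", "AT") returns False. Note that same-length substitutions
# such as is_snp("AT", "GC") also return True, so this is a length comparison
# rather than a strict single-base test.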
def is_gzip( file ):
    """
    @return: [bool] True if the file is gzipped.
    @param file : [str] Path to the processed file.
    """
    is_gzip = None
    FH_input = gzip.open( file )
    try:
        FH_input.readline()
        is_gzip = True
    except (OSError, EOFError):
        is_gzip = False
    finally:
        FH_input.close()
    return is_gzip
def store_group(sample_config):
    """
    @summary : store the group of each sample
    @param sample_config : [str] Path to the TAB-delimited sample config file. Header: idx name forward_read reverse_read sequencer read_length phred_scale group
    @return : dictionary with key = sample name, value = group name
    """
    FH = open(sample_config, 'rt')
    group_dict = dict()
    for line in FH:
        if line.startswith("id"):
            continue
        sample_name = line.split("\t")[1]
        group = line.strip().split("\t")[-1]
        if sample_name not in group_dict:
            group_dict[sample_name] = group
        else:
            if group != group_dict[sample_name]:
                raise Exception("Your sample " + sample_name + " is associated with different groups")
    FH.close()
    return group_dict
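# Example of the expected sample config layout (hypothetical values, TAB
# delimited; only the "name" and "group" columns are used here):
#
#   idx name    forward_read reverse_read sequencer read_length phred_scale group
#   1   sampleA A_1.fq.gz    A_2.fq.gz    HiSeq     100         33          liver
#   2   sampleB B_1.fq.gz    B_2.fq.gz    HiSeq     100         33          muscle
#
# store_group() would return {'sampleA': 'liver', 'sampleB': 'muscle'}.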
def process(vcf_file, sample_config, call_rate, output_vcf, log):
# store sample's group and group counts
samples_group = store_group(sample_config)
group_count = dict()
for s in samples_group:
if not samples_group[s] in group_count:
group_count[samples_group[s]] = 0
group_count[samples_group[s]] += 1
# parse VCF
FH = None
if not is_gzip(vcf_file):
FH = open( vcf_file, 'rt' )
else:
FH = gzip.open( vcf_file, 'rt' )
FH_out = open(output_vcf,"w")
sample_names = list()
tot_snp = 0
valid_snp = 0
for line in FH:
if line.startswith("##") and not line.startswith("#CHROM"):
FH_out.write(line)
continue
# parse header and add call rate filter info
if line.startswith("#CHROM"):
            FH_out.write('##INFO=<ID=Valid_CR_Group,Number=.,Type=String,Description="List of groups that pass the call-rate threshold of ' + str(call_rate) + '">\n')
FH_out.write(line)
sample_names=line.strip().split("\t")[9:]
continue
record = line.strip().split()
tot_snp += 1
info = record[7]
group_nbCall = dict()
for idx, call in enumerate(record[9:]):
sample_name = sample_names[idx]
group = samples_group[sample_name]
gt = call.split(":")[0]
if gt != "./." :
if group not in group_nbCall:
group_nbCall[group] = 0
group_nbCall[group] += 1
if len(group_nbCall) == 0:
continue
ValidGroup = list()
for group in group_count:
if group in group_nbCall and group_nbCall[group] >= float(call_rate) * group_count[group]:
ValidGroup.append(group)
if len(ValidGroup) > 0:
valid_snp += 1
info += ";Valid_CR_Group=" + "-".join(ValidGroup)
record[7]=info
FH_out.write("\t".join(record)+"\n")
FH.close()
FH_out.close()
FH_log = open(log,"w")
FH_log.write("# Number of variant passing the call rate filter (threshold per group = " + call_rate + ") :\n")
FH_log.write(str(valid_snp) + " variants are kept among " + str(tot_snp) + "\n")
FH_log.close()
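# Minimal sketch of the per-record decision made in process(), with
# hypothetical group sizes: a site genotyped in 3 of 4 "liver" samples and
# 1 of 4 "muscle" samples, at call_rate = 0.5, is kept and annotated with
# Valid_CR_Group=liver (3 >= 0.5 * 4 but 1 < 0.5 * 4).
def _example_valid_groups(group_nbCall, group_count, call_rate):
    return [g for g in group_count
            if g in group_nbCall and group_nbCall[g] >= float(call_rate) * group_count[g]]
# _example_valid_groups({'liver': 3, 'muscle': 1}, {'liver': 4, 'muscle': 4}, 0.5)
#   -> ['liver']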
vcf_file = snakemake.input.vcf
sample_config = snakemake.input.sample_config
call_rate = snakemake.params.call_rate_th
output_vcf = snakemake.output.vcf
log = snakemake.output.stat
#~ vcf_file = "../Results/vcfFiltration/RpRm_SNP_qualFiltered.vcf.gz"
#~ sample_config = "../RpRm.tsv"
#~ call_rate = "0.95"
#~ output_vcf = "RpRm_SNP_callRateFiltered.vcf"
#~ log = "log"
# keep sites where the genotype call rate is >= a threshold per population
# warning: all samples in the VCF must be defined in the sample config TSV file
process(vcf_file, sample_config, call_rate, output_vcf, log)
#!/usr/bin/env python3
from Bio import SeqIO
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import gzip
import os,sys
def is_gzip(file_in):
    """
    @return: [bool] True if the file is gzipped.
    @param file_in : [str] Path to the processed file.
    """
    is_gzip = None
    handle_input = gzip.open( file_in )
    try:
        handle_input.readline()
        is_gzip = True
    except (OSError, EOFError):
        is_gzip = False
    finally:
        handle_input.close()
    return is_gzip
def checkQualityEncode ( file_in ):
    """
    @summary : guess the Phred quality scale. Inspired by https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py
    @param file_in : [str] Path of the fastq file
    @return : either "Q33" or "Q64" ("" if the encoding cannot be resolved)
    """
RANGES = {
'Q33': (33, 74),
'Q64': (64, 105)
}
handle_in=None
if is_gzip(file_in):
handle_in = gzip.open(file_in,"rt")
else:
handle_in = open(file_in,"rt")
scale = ""
gmin, gmax = 99, 0
for title_line, seq_string, quality_string in FastqGeneralIterator(handle_in):
qual = [ ord(c) for c in quality_string ]
qmin = min(qual)
qmax = max(qual)
if qmin < gmin or qmax > gmax:
gmin, gmax = min(qmin, gmin), max(qmax, gmax)
valid_encodings = []
for encoding, (emin, emax) in RANGES.items():
if gmin >= emin and gmax <= emax:
valid_encodings.append(encoding)
if len(valid_encodings) == 1 :
scale = valid_encodings[0]
break
return scale
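# Note on the early exit above: the two ranges overlap in [64, 74], so a read
# whose quality characters all fall in that window is compatible with both
# scales and the scan must continue. For example, '#' (ASCII 35) can only be
# Q33 and 'h' (ASCII 104) can only be Q64, while a file containing only
# characters between '@' (64) and 'J' (74) never resolves and
# checkQualityEncode() returns "".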
f_in=snakemake.input.fastq_in
f_out=snakemake.output.fastq_out
encoding = checkQualityEncode(f_in)
if encoding == "Q33":
print(f_in+" is "+encoding+ " and DO NOT need to be converted\n", file=sys.stderr)
os.symlink(f_in, f_out)
else:
try:
print(f_in+" is "+encoding+ " and need to be converted\n", file=sys.stderr)
handle_in = gzip.open(f_in,"rt") if is_gzip(f_in) else open(f_in, "rt")
handle_out = gzip.open(f_out, "wt")
SeqIO.convert(handle_in, 'fastq-illumina' , handle_out, 'fastq-sanger')
except:
raise Exception("there is no valid encoding found")