# Template workflow readme
### Introduction
What is the purpose of the pipeline?
e.g.: It pre-processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality-control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details of the results.
```mermaid
graph TD
A(Raw reads <br><i>Fastq</i>) -- raw data report --> B((fastQC))
B --> C(report <br><i>HTML</i>)
A -- trimming --> D((cutadapt))
D --> E(trimmed reads <br><i>Fastq</i>)
E -- genome alignment --> F((Star))
F --> G(aligned reads <br><i>BAM</i>)
```
The pipeline is built using Nextflow, a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.
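For instance, a minimal launch might look like the sketch below; the `-profile singularity` flag assumes a matching profile is defined in the pipeline's nextflow.config, and the parameter names mirror the `params.*` defaults shown in the script further down:
```bash
nextflow run main.nf \
    -profile singularity \
    --readfiles 'data/reads/*{1,2}.fastq.gz' \
    --genome data/reference/reference.fa \
    --annotation data/reference/reference.gtf \
    --outdir results/
```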
### Which tools?
List of software
- fastQC version XXX (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- cutadapt version YYY (https://github.com/marcelm/cutadapt)
- star version ZZZ (https://github.com/alexdobin/STAR)
### Where to use it?
This pipeline has been installed on the [Genotoul Cluster](http://bioinfo.genotoul.fr/), so you can log on to genologin.toulouse.inra.fr and access it at /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf
### How to use it?
Example command line to launch it on Genotoul (see the sketch below) ...
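A sketch of what that launch could look like on Genotoul; the module name is an assumption and should be checked with `module avail`:
```bash
# Hypothetical Genotoul session (module name to be verified on the cluster)
module load bioinfo/Nextflow
nextflow run /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf -profile singularity
```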
### Who is maintaining the workflow?
Bios4Biol CATI persons involved ...
Be aware that the persons maintaining this workflow can be contacted in case of bugs, but not for bioinformatics explanations of your results.
### Credits
These scripts were copied from ....
Many thanks to others who have helped out along the way too, including (but not limited to):
[@loginTwitter](https://github.com/loginTwitter),...
---
// Default parameters; override on the command line, e.g. --outdir my_results
params.outdir = "$baseDir/results/"
params.readfiles = "$baseDir/../../data/reads/*{1,2}.fastq.gz"
params.genome = "$baseDir/../../data/reference/reference.fa"
params.annotation = "$baseDir/../../data/reference/reference.gtf"
params.index = "$baseDir/../../index/"

outdir = params.outdir
readfiles = params.readfiles
genome = file(params.genome)
annotation = file(params.annotation)
index = params.index

Channel.fromPath(readfiles).set { reads }

// Quality control of each raw fastq file
process fastqc {
    cpus 1
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    file read from reads

    output:
    file '*html' into fastqc

    """
    fastqc --noextract ${read}
    """
}

Channel.fromFilePairs(readfiles).set { pair_reads }

// Adapter trimming of each read pair
process cutadapt {
    cpus 1

    input:
    set pair_id, file(read) from pair_reads

    output:
    set pair_id, 'clean_*' into clean_files1, clean_files2

    """
    cutadapt -a ACACTCTTTCCCTACACGACGCTCTTCCGATC -A GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -o clean_${pair_id}1.fastq.gz -p clean_${pair_id}2.fastq.gz ${read}
    """
}

// Build (and cache via storeDir) the STAR genome index.
// A DSL1 channel can feed only one process, so the index is emitted
// into two channels, one per mapping process below.
process indexGenome {
    storeDir index

    input:
    file ref from genome
    file annot from annotation

    output:
    file "${dbname}.star/*" into starDb1, starDb2

    script:
    dbname = ref.baseName
    """
    mkdir ${dbname}.star
    STAR --runMode genomeGenerate --genomeDir ${dbname}.star --genomeFastaFiles ${ref} --sjdbGTFfile ${annot} --sjdbOverhang 99 --runThreadN ${task.cpus}
    """
}

// Genome alignment producing unsorted BAM output.
// first() turns the single index emission into a value channel
// that can be read once per read pair.
process mapping {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files1
    val starDB from starDb1.first()

    output:
    file "*bam" into aligned_sorted_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outSAMtype BAM Unsorted --outFilterType BySJout
    """
}

// Same alignment with default SAM output
process mapping2 {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files2
    val starDB from starDb2.first()

    output:
    file "*sam" into aligned_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outFilterType BySJout
    """
}
#!/usr/bin/perl -w
# http://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ
use strict;
use File::Basename;
use List::MoreUtils qw( minmax );
# fastq_detect.pl fastq.file sample-size
# detect fastQ format from quality scores in fastQ input file
# Version 3
#
# Stephane Plaisance - VIB-BITS - July-04-2012
# Joachim Jacob - Aug-02-2012 - joachim.jacob@gmail.com
# - changed the maximum value of Sanger to 73
# - changed reading the file with a file handle
# (was a file handle !! supporting several archive formats. SP)
# - changed the diagnosing algorithm
# Stephane Plaisance - VIB-BITS - April-08-2013
# - merged both versions and corrected flaw in min/max
# thanks to Sergey Mitrfanov for perl reformatting
#####################################################################
# diagnose
# SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
# ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
# ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
# .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
# LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
# !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
# | | | | | |
# 33 59 64 73 104 126
# S 0........................26...31.......40
# X -5....0........9.............................40
# I 0........9.............................40
# J 3.....9.............................40
# L 0.2......................26...31........41
#
# S - Sanger Phred+33, raw reads typically (0, 40)
# X - Solexa Solexa+64, raw reads typically (-5, 40)
# I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
# J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold)
# L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
# - Illumina 1.9+ Phred+33, raw reads typically (3, 41)
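# Worked example (added note): the character 'I' has ASCII code 73, so it
# encodes Q40 on the Phred+33 scale (73-33) but only Q9 on Phred+64 (73-64).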
#####################################################################
my $script = basename($0);
@ARGV > 0 or die "usage: $script <fastq file> <opt:sample-size (100)>\n";
my ($inputfile, $limit) = @ARGV;
if (! defined $limit) { $limit = 100}; # check first 100 records
my $cnt=0;
my ($min, $max); # global min and max values
my $z = ReadFile ($inputfile) || die "Error: cannot read from variant file $inputfile: $!\n";
## parse
while (my $id = <$z>) {
	$id =~ m/^@/ || die "expected @ not found in line 1!\n";
	my $seq = <$z>;
	my $sep = <$z>;
	$sep =~ m/^\+/ || die "expected + not found in line 3!\n";
	my $qual = <$z>;
	chomp($qual);
	$cnt++;
	$cnt>=$limit && last;
	# convert each quality character to its ASCII code
	my @chars = split("", $qual);
	my @nums = sort { $a <=> $b } (map { unpack("C*", $_ )} @chars);
	if ($cnt==1) {
		($min, $max) = minmax @nums;
	} else {
		my ($lmin, $lmax) = minmax @nums; # local values for this read
		$min = $lmin if $lmin < $min;
		$max = $lmax if $lmax > $max;
	}
}
undef $z;
## diagnose
my %diag=(
'Sanger' => '.',
'Solexa' => '.',
'Illumina 1.3+' => '.',
'Illumina 1.5+' => '.',
'Illumina 1.8+' => '.',
'Illumina 1.9+' => '.'
);
my %comment=(
'Sanger' => 'Phred+33',
'Solexa' => 'Solexa+64',
'Illumina 1.3+' => 'Phred+64',
'Illumina 1.5+' => 'Phred+64',
'Illumina 1.8+' => 'Phred+33',
'Illumina 1.9+' => 'Phred+33'
);
if ($min<33 || $max>105) { die "Quality values corrupt. found [$min; $max] where [33; 105] was expected\n"; }
if ($min>=33 && $max<=73) {$diag{'Sanger'}='x';}
if ($min>=59 && $max<=104) {$diag{'Solexa'}='x';}
if ($min>=64 && $max<=104) {$diag{'Illumina 1.3+'}='x';}
if ($min>=66 && $max<=105) {$diag{'Illumina 1.5+'}='x';}
if ($min>=33 && $max<=74) {$diag{'Illumina 1.8+'}='x';}
if ($min>=33 && $max<=75) {$diag{'Illumina 1.9+'}='x';}
## report
foreach my $format (sort keys %diag) {
	if ($diag{$format} eq 'x') {
		print STDOUT "# sampled raw quality values are in the range of [".$min."; ".$max."]\n";
		print STDOUT sprintf("%-30s\t$inputfile\n", $comment{$format});
	}
}
##############
#### Subs ####
# reads from uncompressed, gzipped and bgzip fastQ files
sub ReadFile {
	my $infile = shift;
	my $FH;
	if ($infile =~ /\.bz2$/) {
		open ($FH, "bzcat $infile |") or die ("$!: can't open file $infile");
	} elsif ($infile =~ /\.gz$/) {
		open ($FH, "zcat $infile |") or die ("$!: can't open file $infile");
	} elsif ($infile =~ /\.(fq|fastq|txt)$/) {
		open ($FH, "cat $infile |") or die ("$!: can't open file $infile");
	} else {
		die ("$!: do not recognise file type $infile");
	}
	return $FH;
}
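A usage sketch for this script (file name is hypothetical; the second argument defaults to 100 records):
```bash
./fastq_detect.pl sample_R1.fastq.gz 200
# Illustrative output when sampled qualities span [34; 75] (the range line is
# printed once for every matching format; here only Illumina 1.9+ matches):
# # sampled raw quality values are in the range of [34; 75]
# Phred+33                        sample_R1.fastq.gz
```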
# Allele Specific Expression Analysis workflow
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
**<u>!!! STATUS: DEVELOPMENT !!!</u>**
@@ -18,9 +18,9 @@ This pipeline will perform variant filtering, reads alignment on masked genome a
The user must use the [config_calling.yaml](config_calling.yaml.example) file to provide all necessary inputs for the pipeline:
- A masked reference fasta file (fasta_ref option), indexed with samtools faidx and GATK4 CreateSequenceDictionary. We recommend masking the genome with bi-allelic variants filtered on the GATK metrics FS < 30, QD > 2 and AF < 1. In our use case we masked the genome with variants called at the population level, and then launched this pipeline at the tissue level.
- A genome annotation reference GTF file (gtf_ref option)
- A set of variants to include in the ASE analysis
@@ -47,6 +47,8 @@ The user must use the [config_calling.yaml](config_calling.yaml.example) file to
The first step filters and trims raw reads before alignment. It converts Phred+64 scale fastq files into Phred+33 scale fastq files, and then trims reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Adapters are trimmed, and reads that end up shorter than a third of the read length are filtered out.
By default the trimming quality threshold is set to 15.
In the Results/Summary directory you will find a Log_TrimGalore_summary.tsv file that details the trim_galore statistics.
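An invocation consistent with this description might look like the sketch below (file names and the --length value are illustrative; the workflow builds the exact command from the config):
```bash
# --quality comes from the trimming_quality option (default 15);
# --length 50 would drop reads shorter than a third of 150 bp reads
trim_galore --quality 15 --length 50 --paired --gzip \
    sample_R1.fastq.gz sample_R2.fastq.gz
```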
@@ -77,7 +79,7 @@ Finally reads that contain Ns in their cigar string will be split, and alignme
Variants will be filtered before being submitted to ASE.
Only bi-allelic SNPs with FS lower than or equal to 30.0 and QD lower than or equal to 2.0 will be used in ASE.
Filtering on SNP type is an option, see SNP_filter in the configuration file. By default it is set to False, as the calling pipeline generates an output VCF file already filtered on SNP type.
@@ -92,19 +94,28 @@ The pipeline will also add annotation in the INFO field in the final VCF file (s
Additionally, the pipeline will generate 3 tables that extract the samples' GT, DP, AD_REF and AD_ALT genotype annotations. These tables include only SNPs with a genotype call rate greater than 50% and for which more than 20% of genotypes are covered by at least 5 reads.
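As a rough illustration (not necessarily how the pipeline implements it), such a filter can be expressed with bcftools; note that F_PASS() requires bcftools 1.10 or newer, while the dependency list below pins 1.9:
```bash
# Keep SNPs with <50% missing genotypes and >20% of genotypes covered by >=5 reads
bcftools view -i 'F_MISSING < 0.5 && F_PASS(FMT/DP >= 5) > 0.2' \
    annotated.vcf.gz -Oz -o filtered.vcf.gz
```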
### Compute Allele Specific Expression
Allele Specific Expression will be obtained using phASER (https://github.com/secastel/phaser) and ASEReadCounter (https://gatk.broadinstitute.org/hc/en-us/articles/360036714351-ASEReadCounter), on bi-allelic variants filtered on FS < 30 and QD > 2.
ASEReadCounter and phASER return, per sample, the allele expression at each SNP.
phASER will also return gene-level allele-specific expression by constructing gene haplotypes per sample, and finally an expression matrix that aggregates all the samples.
## 3. Dependencies
The workflow depends on:
* cutadapt (version 2.1)
* trimgalore (version 0.6.5)
* STAR (version 2.6.0c)
* samtools (version 1.9)
* bcftools (version 1.9)
* bedtools (version 2.29.0)
* tabix (version 0.2.5)
* GenomeAnalysisTK.jar (version 3.7), which includes Picard tools and ASEReadCounter
* java (version 8)
* phASER (cloned on the 29th of May 2020)
These programs need to be available in your PATH or in a bin directory specified in the config.yaml file with the bin_dir option.
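A quick way to verify that the binaries are visible (tool names from the list above; adapt the list to your installation):
```bash
for tool in cutadapt trim_galore STAR samtools bcftools bedtools tabix java; do
    command -v "$tool" >/dev/null || echo "missing from PATH: $tool"
done
```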
@@ -159,9 +159,8 @@ for sample in table["sample_name"].unique():
final_outputs.append('Results_ASE/phASER/' + sample + '_phASER.allelic_counts.txt')
final_outputs.append('Results_ASE/phASER_gene_ae/' + sample + '_phASER.gene_ae.txt')
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'))
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz'))
# print(final_outputs)
# ['Results_ASE/STAR_Aln_2/samplePE_rg_genomic_pp_rmdup_uniq_split.bam',
Snakemake/1000RNASeq_chicken/ASE/img/full_dryrun.png (231 KiB)
@@ -49,7 +49,8 @@ SplitNCigarReads:
phASER:
cpu : 8
mem : "180G"
time : "8:00:00"
phASER_expr_matrix:
cpu : 8
@@ -71,7 +71,7 @@ rule phASER:
idSeparator = '_' if config['id_separator'] == '' else config['id_separator']
shell:
"""
python2.7 {config[bin_dir]}/phaser.py --vcf {input.vcf} --bam {input.bam} --sample {wildcards.sample} --threads {threads} --o Results_ASE/phASER/{wildcards.sample}_phASER --paired_end {params.paired_option} --unphased_vars 1 --gw_phase_vcf 1 --baseq {params.baseQuality} --mapq {params.mappingQuality} --id_separator {params.idSeparator}
"""
@@ -97,20 +97,19 @@ rule phASER_gene_ae:
python2.7 {config[bin_dir]}/phaser_gene_ae.py --haplotypic_counts {input.hap_count} --features {input.bed} --o {output} --id_separator {params.idSeparator}
"""
rule phASER_expr_matrix:
    input:
        expand("Results_ASE/phASER_gene_ae/{sample}_phASER.gene_ae.txt", sample=table["sample_name"].unique()),
        bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
    output:
        bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'),
        phased_bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz')
    params:
        gene_ae_dir = "Results_ASE/phASER_gene_ae/",
        out_prefix = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER')
    threads: config["phASER_expr_matrix"]["cpu"]
    shell:
        """
        python2.7 {config[bin_dir]}/phaser_expr_matrix.py --t {threads} --gene_ae_dir {params.gene_ae_dir} --feature {input.bed} --o {params.out_prefix}
        """
@@ -11,7 +11,9 @@ import multiprocessing
# FUNCTION
def select_on_chrom(in_vcf, accepted_chrom, excluded_chrom, selected_vcf):
    #~ FH_in = open(in_vcf, encoding='utf8', mode='rt')
    # for FLLL I do not understand why, but utf8 does not work ...
    FH_in = open(in_vcf, encoding='iso-8859-1', mode='rt')
    FH_out = open(selected_vcf, 'wt')
    for line in FH_in:
# RNASeq Analysis workflow : quantification and SNP calling
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
- [RNASeq Analysis workflow : quantification and SNP calling](#rnaseq-analysis-workflow---quantification-and-snp-calling)
* [What it does ?](#what-it-does--)
@@ -17,7 +17,7 @@ snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
# {cluster.cpu}, {cluster.mem} and {cluster.partition} will be replaced by values defined in the resources_SLURM.yaml file
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30
# If you want to launch the workflow partially, indicate the final output files you want (see dryrun log)
@@ -26,7 +26,7 @@ snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_expected_count.tsv \
@@ -46,7 +46,7 @@ snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/GATK/${PREFIX}_SNP.vcf.gz \
@@ -620,7 +620,7 @@ rule individual_gVCF_HaplotypeCaller:
jar=find_jar("GenomeAnalysisTK.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} -T HaplotypeCaller -I {input.bam} -R {input.fasta} --min_base_quality_score 10 --min_mapping_quality_score 20 -ERC GVCF -dontUseSoftClippedBases -o {output.gvcf} 2> {output.log}
"""
@@ -643,7 +643,7 @@ rule genotypeGVCF:
run:
bams = gatk_multi_arg("--variant", input.gvcf)
shell( "java -Xmx{params.mem} -jar {params.jar} -T GenotypeGVCFs {bams} -R {input.fasta} -stand_call_conf 20.0 -o {output.vcf} 2> {output.log}")
# /GATK
#######
@@ -663,22 +663,5 @@ rule selectVariant:
java -Xmx{params.mem} -jar {params.jar} -T SelectVariants -R {input.ref} -V {input.vcf} -selectType INDEL -o {output.indel}
"""
# add the --excludeIntervals option to exclude certain intervals
rule filter_variant:
    input:
        ref = "Results/Fai_Dict_Index/" + os.path.basename(config["fasta_ref"]),
        snp = "Results/GATK/"+POP+"_SNP.vcf.gz"
    output:
        snp = "Results/GATK/"+POP+"_SNP_flag.vcf.gz"
    params:
        mem=config["selectVariant"]["mem"],
        jar=find_jar("GenomeAnalysisTK.jar")
    shell:
        """
        java -Xmx{params.mem} -jar {params.jar} -T VariantFiltration -R {input.ref} -V {input.snp} -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o {output.snp}
        """
# ADD rule for merging flagstat
# ADD rule for merging individual HC stat log
# ADD rule split SNP/INDEL
# ADD rule Hard filtering
# ADD rule filtering on junction
@@ -2,7 +2,7 @@ __default__ :
mem : "8G"
cpu : 8
time : "24:00:00"
partition : workq
convertPhred:
cpu : 1
@@ -21,13 +21,13 @@ star_index_pass1 :
mem : "18G"
star_aln_pass1 :
mem : "5G"
star_index_pass2 :
mem : "18G"
star_aln_pass2 :
mem : "5G"
log_merge_STAR:
cpu : 1
# RNASeq Analysis workflow : quantification and SNP calling
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
- [RNASeq Analysis workflow : quantification and SNP calling](#rnaseq-analysis-workflow---quantification-and-snp-calling)
* [What it does ?](#what-it-does--)
* [1. Input](#1-input)
* [2. Data Processing](#2-data-processing)
+ [Raw reads trimming](#raw-reads-trimming)
+ [Sequence alignment](#sequence-alignment)
+ [Transcriptomic Alignment post processing](#transcriptomic-alignment-post-processing)
+ [Gene/Isoform expression](#gene-isoform-expression)
+ [Genomic Alignment post processing](#genomic-alignment-post-processing)
+ [Variant Calling](#variant-calling)
* [3. Dependencies](#3-dependencies)
* [4. Configure and running workflow](#4-configure-and-running-workflow)
## What it does ?
This pipeline performs a classical RNASeq analysis workflow (trimming, STAR mapping, and RSEM expression quantification) but also calls variants following the [GATK RNASeq recommandations](https://software.broadinstitute.org/gatk/documentation/article.php?id=3891).
The pipeline is represented in the graph below:
![](img/full_dryrun.png)
## 1. Input
The user must use the [config_calling.yaml](config_calling.yaml.example) file to provide all necessary inputs for the pipeline:
- A reference fasta file (fasta_ref option)
- A genome reference GTF file (gtf_ref option)
- A set of known variants used to recalibrate bases quality in GATK preprocessing steps RealignerTargetCreator and BaseRecalibrator (known_vcf option)
- A sample reads description TSV file (sample_config option), with the following columns (an illustrative example follows the list):
  - idx : a unique index number to identify each fastq (pair) file
  - name : the sample name
  - forward_read and reverse_read are fastq file names. These files need to be in a directory specified with the data_dir option
    - If single end, leave the reverse_read column empty
    - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
  - sequencer : the sequencer used to generate the sequences: "ILLUMINA", "SLX", "SOLEXA", "SOLID", "454", "LS454", "COMPLETE", "PACBIO", "IONTORRENT", "CAPILLARY", "HELICOS", "UNKNOWN"
  - oriented : indicates whether sequences were generated with a strand-specific protocol. It corresponds to the forward_prob RSEM parameter. Indicate:
    - 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
    - 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
    - 0.5 for a non-strand-specific protocol.
  - phred scale : indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 up to but excluding 1.8)
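A sample_config file matching these columns might look like this sketch (sample and file names are hypothetical; the columns are tab-separated):
```bash
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    idx name forward_read reverse_read sequencer read_length oriented phred_scale \
    1 sampleA sampleA_R1.fastq.gz sampleA_R2.fastq.gz ILLUMINA 150 0.5 33 \
    2 sampleB sampleB_R1.fastq.gz '' ILLUMINA 100 0.5 33 \
    > data/samples.tsv
```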
## 2. Data Processing
### Raw reads trimming
The first step filters and trims raw reads before alignment. It converts Phred+64 scale fastq files into Phred+33 scale fastq files, and then trims reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Adapters are trimmed, and reads that end up shorter than a third of the read length are filtered out.
In the Results/Summary directory you will find a Log_TrimGalore_summary.tsv file that details the trim_galore statistics.
### Sequence alignment
Trimmed reads will be aligned against the reference genome with [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537), following the multi-sample 2-pass mapping procedure with the GTF reference file.
The STAR 2nd pass alignment will generate bam files in genome and in transcriptome coordinates.
You will find in the Results/Summary directory 2 files reporting alignment statistics: Log_STAR_Aln_1_summary.tsv and Log_STAR_Aln_2_summary.tsv
### Transcriptomic Alignment post processing
Reads with multiple alignments will be filtered out to generate a second transcriptomic bam:
```samtools view -q 255```
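Spelled out, the filtering might look like this (file names are illustrative; MAPQ 255 is the value STAR assigns to uniquely mapped reads):
```bash
samtools view -b -q 255 sample_transcriptome.bam > sample_transcriptome_uniq.bam
```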
### Gene/Isoform expression
Expression will be computed with [RSEM](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323), either on all reads aligned to the transcriptome or on uniquely mapped reads only. For each sample, it will generate gene and isoform results.
You will find in the Results/Summary directory the expression quantification for all samples:
* for genes and isoforms
* for the expected count, TPM and FPKM quantification methods
* based on uniquely mapped reads or on all alignments.
The files are named:
    RSEM\__[multimap/uniqmap]_\_summary_PREFIX\__[genes/isoforms]_\__[TPM/expected\_count/FPKM]_.tsv
PREFIX is the name of your sample_config file defined in the config.yaml file; for example, the gene-level TPM table based on uniquely mapped reads is RSEM_uniqmap_summary_PREFIX_genes_TPM.tsv.
### Genomic Alignment post processing
Genomic alignments will be used to call variants. Each sample's bam will be processed as follows (a sketch of one step follows the list):
- cleaned with CleanSam from GATK
- annotated with AddOrReplaceReadGroups from [Picard tools](https://broadinstitute.github.io/picard/) based on information you provide in the sample config file
- multimapped reads filtered out (samtools view -q 255)
- merged with other bam files belonging to the same sample name (optional)
- PCR and optical duplicates marked with Picard tools
- split alignments converted into separate alignments, with mapping qualities of 250 reassigned to 60, by SplitNCigarReads from GATK
- indels realigned with RealignerTargetCreator and IndelRealigner, and base quality scores recalibrated with BaseRecalibrator and PrintReads from GATK, using the known_vcf file provided in the config.yaml.
The final bam alignment files will be found in the Results/GATK/ directory and will be named: _[sample]_\_uniq_cs_rg_merge_md_split_real_recal.bam
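As an illustration of one of these steps, duplicate marking with Picard (the `I=/O=` syntax matching the picard.jar 2.1.1 dependency) could look like the sketch below; file names are hypothetical and the exact arguments are set by the workflow:
```bash
java -jar picard.jar MarkDuplicates \
    I=sample_uniq_cs_rg_merge.bam \
    O=sample_uniq_cs_rg_merge_md.bam \
    M=sample_md_metrics.txt
```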
### Variant Calling
Variants will be called with HaplotypeCaller from GATK, which will generate one gVCF per sample (Results/GATK/_[sample]_\_.g.vcf.gz files) with the following options:
- -stand_call_conf 20.0
- --min_base_quality_score 10
- --min_mapping_quality_score 20
GenotypeGVCFs from GATK will then take all the gVCF files to generate one VCF file called Results/GATK/PREFIX_full.vcf.gz
Finally, this VCF file will be split into two VCF files, one for SNPs and one for INDELs, using SelectVariants from GATK: Results/GATK/PREFIX_SNP.vcf.gz and Results/GATK/PREFIX_INDEL.vcf.gz
## 3. Dependencies
### 3.1. Links to binaries
The workflow depends on: LIST TO CHECK !
* cutadapt
* trimgalore (version 0.4.5)
* STAR (version 2.5.2b)
* picard.jar (version 2.1.1)
* samtools (version 1.3.1)
* rsem-prepare-reference (RSEM version 1.3.0)
* rsem-calculate-expression (RSEM version 1.3.0)
* GenomeAnalysisTK.jar (version 3.7)
* java (version 8)
* snakemake (version 4.8.0)
* maskFastaFromBed (bedtools version 2.26.0)
* vcf-merge (VCFtools version 0.1.15)
* bgzip (Tabix version 0.2.5)
* tabix (Tabix version 0.2.5)
* python3.6.3 (version 3.6.3)
* fastqc (version 0.11.7)
These programs need to be available in your PATH or in a bin directory specified in the config.yaml file with the bin_dir option.
### 3.2. Using conda mamba
```
conda activate mamba
mamba create -n env_name -c bioconda -c conda-forge cutadapt trim-galore star samtools gatk4 rsem
conda activate /path/to/miniconda/envs/mamba/envs/env_name
module load system/Python-3.6.3
export PERL5LIB=""
```
## 4. Configure and running workflow
- Copy the config_calling.yaml.example file into your working directory and update it according to the instructions inside the file.
- Launch the snakemake command as presented in SLURM.sh. This script explains how to launch a dry run, the full workflow, expression quantification only, or variant calling only.
- If necessary, update resources_SLURM.yaml to define the CPU, memory and cluster partition resources for each rule. The __default__ section gives the minimal resources per job. The other sections are named after rules and define only the resources that differ from the default (see the sketch below).
Resources_SLURM.yaml and SLURM.sh are specific to SLURM clusters. If you use another scheduler you will need to create equivalent files, but you can use these as a starting point.
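For instance, a per-rule override could be added like this (the rule name comes from the resources file; the values are illustrative):
```bash
# Sketch: extend a copy of the resources file with a per-rule override
# (if the rule already has a section, edit it in place instead)
cp resources_SLURM.yaml my_resources.yaml
cat >> my_resources.yaml <<'EOF'
star_aln_pass1 :
  cpu : 16
  mem : "10G"
EOF
```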
This is the official documentation of Snakemake if you need more information: <https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html>
#!/bin/bash
##############################################################################
## launch workflow
# if used on Genologin cluster (INRA Toulouse)
# module load system/Python-3.6.3
WORKFLOW_DIR=.
mkdir -p logs
# This is an example of the snakemake command line to launch a dryrun
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example --dryrun
# This is an example of the snakemake command line to launch the full workflow on a SLURM cluster
# {cluster.cpu}, {cluster.mem} and {cluster.partition} will be replaced by values defined in the resources_SLURM.yaml file
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30
# If you want to launch the workflow partially, indicate the final output files you want (see dryrun log)
#
# For RSEM quantification only
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_TPM.tsv
# For GATK calling only
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/GATK/${PREFIX}_SNP.vcf.gz \
Results/GATK/${PREFIX}_INDEL.vcf.gz
# bin_dir is a directory containing software binary used in this pipeline:
# cutadapt
# trimgalore (version 0.4.5)
# STAR (version 2.5.2b)
# picard.jar (version 2.1.1)
# samtools (version 1.3.1 )
# rsem-prepare-reference ( RSEM version 1.3.0)
# rsem-calculate-expression ( RSEM version 1.3.0)
# GenomeAnalysisTK.jar (version 3.7)
# java (version 8)
# if bin_dir is not set, the binaries need to be available in the PATH
bin_dir : .
# data_dir contains fastq files described in sample_config file
data_dir : /work/project/1000rnaseqchick/Mathieu/reads/
# sample_config file is a tabular file describing each input fastq file
# sample_config file name will be used as prefix of the output files.
# header columns are : idx name forward_read reverse_read sequencer read_length oriented phred_scale
# forward_read and reverse_read are fastq file names. These files need to be in the data_dir directory
# sample may be paired end or single end.
# - If single end, leave the reverse_read column empty
# - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
# sequencer must be chosen from the list : ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
# oriented is the forward_prob RSEM parameter to indicate :
# 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
# 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
# 0.5 for a non-strand-specific protocol.
# phred scale indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 up to but excluding 1.8)
sample_config : data/1000RNAseq.tsv
# fasta_ref file is the genome reference Fasta file
fasta_ref : data/reference.fa
# gtf_ref file is the genome reference GTF file
gtf_ref : data/reference.gtf
# known_vcf file is set of known variants used to recalibrate bases quality in GATK preprocessing steps RealignerTargetCreator and BaseRecalibrator
known_vcf : data/reference_known_var.vcf.gz
# quality trimming threshold used in trimgalore to remove low quality bases.
trimming_quality : 15
# computing resources, also given to the --cluster-config snakemake option if executed on a cluster
# this yaml file defines default resources (at least mem and cpu) in a __default__ section, plus per-rule sections that override only the resources differing from the default
resources: resources_calling_SLURM.yaml
# bin_dir is a directory containing software binary used in this pipeline:
# cutadapt
# trimgalore (version 0.4.5)
# STAR (version 2.5.2b)
# picard.jar (version 2.1.1)
# samtools (version 1.3.1 )
# rsem-prepare-reference ( RSEM version 1.3.0)
# rsem-calculate-expression ( RSEM version 1.3.0)
# GenomeAnalysisTK.jar (version 3.7)
# java (version 8)
# if bin_dir is not set, the binaries need to be available in the PATH
bin_dir : ./
# data_dir contains fastq files described in sample_config file
data_dir : data
# sample_config file is a tabular file describing each input fastq file
# sample_config file name will be used as prefix of the output files.
# header columns are : idx name forward_read reverse_read sequencer read_length oriented phred_scale
# forward_read and reverse_read are fastq file names. These files need to be in the data_dir directory
# sample may be paired end or single end.
# - If single end, leave the reverse_read column empty
# - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
# sequencer must be chosen from the list : ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
# oriented is the forward_prob RSEM parameter to indicate :
# 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
# 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
# 0.5 for a non-strand-specific protocol.
# phred scale indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 up to but excluding 1.8)
sample_config : data/population.tsv.example
# fasta_ref file is the genome reference Fasta file
fasta_ref : data/reference.fa
# gtf_ref file is the genome reference GTF file
gtf_ref : data/reference.gtf
# known_vcf file is set of known variants used to recalibrate bases quality in GATK preprocessing steps RealignerTargetCreator and BaseRecalibrator
known_vcf : data/reference_known_var.vcf.gz
# quality trimming threshold used in trimgalore to remove low quality bases.
trimming_quality : 15
# computing resources, also given to the --cluster-config snakemake option if executed on a cluster
# this yaml file defines default resources (at least mem and cpu) in a __default__ section, plus per-rule sections that override only the resources differing from the default
resources: resources_calling_SLURM.yaml