# Template workflow readme
### Introduction
What is the purpose of the pipeline?
e.g.: It pre-processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality-control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details of the results.
```mermaid
graph TD
A(Raw reads <br><i>Fastq</i>) -- raw data report --> B((fastQC))
B --> C(report <br><i>HTML</i>)
A -- trimming --> D((cutadapt))
D --> E(trimmed reads <br><i>Fastq</i>)
E -- genome alignment --> F((Star))
F --> G(aligned reads <br><i>BAM</i>)
```
The pipeline is built using Nextflow, a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.
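For instance, a minimal launch might look like the sketch below; the `-profile singularity` flag assumes a matching profile is defined in the pipeline's nextflow.config, and the parameter names mirror the `params.*` defaults shown in the script further down:
```bash
nextflow run main.nf \
    -profile singularity \
    --readfiles 'data/reads/*{1,2}.fastq.gz' \
    --genome data/reference/reference.fa \
    --annotation data/reference/reference.gtf \
    --outdir results/
```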
### Which tools?
List of software
- fastQC version XXX (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- cutadapt version YYY (https://github.com/marcelm/cutadapt)
- star version ZZZ (https://github.com/alexdobin/STAR)
### Where to use it?
This pipeline has been installed on the [Genotoul Cluster](http://bioinfo.genotoul.fr/), so you can log on to genologin.toulouse.inra.fr and access it at /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf
### How to use it?
Example command line to launch it on Genotoul (see the sketch below) ...
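A sketch of what that launch could look like on Genotoul; the module name is an assumption and should be checked with `module avail`:
```bash
# Hypothetical Genotoul session (module name to be verified on the cluster)
module load bioinfo/Nextflow
nextflow run /usr/local/bioinfo/workflows/Nextflow/template_dev/main.nf -profile singularity
```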
### Who is maintaining the workflow?
Bios4Biol CATI persons involved ...
Be aware that the persons maintaining this workflow can be contacted in case of bugs, but not for bioinformatics explanations of your results.
### Credits
These scripts were copied from ....
Many thanks to others who have helped out along the way too, including (but not limited to):
[@loginTwitter](https://github.com/loginTwitter),...
---
// Default parameters; override on the command line, e.g. --outdir my_results
params.outdir = "$baseDir/results/"
params.readfiles = "$baseDir/../../data/reads/*{1,2}.fastq.gz"
params.genome = "$baseDir/../../data/reference/reference.fa"
params.annotation = "$baseDir/../../data/reference/reference.gtf"
params.index = "$baseDir/../../index/"

outdir = params.outdir
readfiles = params.readfiles
genome = file(params.genome)
annotation = file(params.annotation)
index = params.index

Channel.fromPath(readfiles).set { reads }

// Quality control of each raw fastq file
process fastqc {
    cpus 1
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    file read from reads

    output:
    file '*html' into fastqc

    """
    fastqc --noextract ${read}
    """
}

Channel.fromFilePairs(readfiles).set { pair_reads }

// Adapter trimming of each read pair
process cutadapt {
    cpus 1

    input:
    set pair_id, file(read) from pair_reads

    output:
    set pair_id, 'clean_*' into clean_files1, clean_files2

    """
    cutadapt -a ACACTCTTTCCCTACACGACGCTCTTCCGATC -A GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -o clean_${pair_id}1.fastq.gz -p clean_${pair_id}2.fastq.gz ${read}
    """
}

// Build (and cache via storeDir) the STAR genome index.
// A DSL1 channel can feed only one process, so the index is emitted
// into two channels, one per mapping process below.
process indexGenome {
    storeDir index

    input:
    file ref from genome
    file annot from annotation

    output:
    file "${dbname}.star/*" into starDb1, starDb2

    script:
    dbname = ref.baseName
    """
    mkdir ${dbname}.star
    STAR --runMode genomeGenerate --genomeDir ${dbname}.star --genomeFastaFiles ${ref} --sjdbGTFfile ${annot} --sjdbOverhang 99 --runThreadN ${task.cpus}
    """
}

// Genome alignment producing unsorted BAM output.
// first() turns the single index emission into a value channel
// that can be read once per read pair.
process mapping {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files1
    val starDB from starDb1.first()

    output:
    file "*bam" into aligned_sorted_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outSAMtype BAM Unsorted --outFilterType BySJout
    """
}

// Same alignment with default SAM output
process mapping2 {
    publishDir outdir, mode: 'copy', overwrite: true

    input:
    set pair_id, file(read) from clean_files2
    val starDB from starDb2.first()

    output:
    file "*sam" into aligned_files

    script:
    starDBpath = starDB[0].parent
    """
    STAR --genomeDir $starDBpath --runThreadN ${task.cpus} --readFilesIn ${read} --outFileNamePrefix ${pair_id} --readFilesCommand zcat --outFilterType BySJout
    """
}
#!/usr/bin/perl -w
# http://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ
use strict;
use File::Basename;
use List::MoreUtils qw( minmax );
# fastq_detect.pl fastq.file sample-size
# detect fastQ format from quality scores in fastQ input file
# Version 3
#
# Stephane Plaisance - VIB-BITS - July-04-2012
# Joachim Jacob - Aug-02-2012 - joachim.jacob@gmail.com
# - changed the maximum value of Sanger to 73
# - changed reading the file with a file handle
# (was a file handle !! supporting several archive formats. SP)
# - changed the diagnosing algorithm
# Stephane Plaisance - VIB-BITS - April-08-2013
# - merged both versions and corrected flaw in min/max
# thanks to Sergey Mitrfanov for perl reformatting
#####################################################################
# diagnose
# SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
# ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
# ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
# .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
# LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
# !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
# | | | | | |
# 33 59 64 73 104 126
# S 0........................26...31.......40
# X -5....0........9.............................40
# I 0........9.............................40
# J 3.....9.............................40
# L 0.2......................26...31........41
#
# S - Sanger Phred+33, raw reads typically (0, 40)
# X - Solexa Solexa+64, raw reads typically (-5, 40)
# I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
# J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold)
# L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
# - Illumina 1.9+ Phred+33, raw reads typically (3, 41)
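# Worked example (added note): the character 'I' has ASCII code 73, so it
# encodes Q40 on the Phred+33 scale (73-33) but only Q9 on Phred+64 (73-64).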
#####################################################################
my $script = basename($0);
@ARGV > 0 or die "usage: $script <fastq file> <opt:sample-size (100)>\n";
my ($inputfile, $limit) = @ARGV;
if (! defined $limit) { $limit = 100}; # check first 100 records
my $cnt=0;
my ($min, $max); # global min and max values
my $z = ReadFile ($inputfile) || die "Error: cannot read from variant file $inputfile: $!\n";
## parse
while (my $id = <$z>) {
	$id =~ m/^@/ || die "expected @ not found in line 1!\n";
	my $seq = <$z>;
	my $sep = <$z>;
	$sep =~ m/^\+/ || die "expected + not found in line 3!\n";
	my $qual = <$z>;
	chomp($qual);
	$cnt++;
	$cnt>=$limit && last;
	# convert each quality character to its ASCII code
	my @chars = split("", $qual);
	my @nums = sort { $a <=> $b } (map { unpack("C*", $_ )} @chars);
	if ($cnt==1) {
		($min, $max) = minmax @nums;
	} else {
		my ($lmin, $lmax) = minmax @nums; # local values for this read
		$min = $lmin if $lmin < $min;
		$max = $lmax if $lmax > $max;
	}
}
undef $z;
## diagnose
my %diag=(
'Sanger' => '.',
'Solexa' => '.',
'Illumina 1.3+' => '.',
'Illumina 1.5+' => '.',
'Illumina 1.8+' => '.',
'Illumina 1.9+' => '.'
);
my %comment=(
'Sanger' => 'Phred+33',
'Solexa' => 'Solexa+64',
'Illumina 1.3+' => 'Phred+64',
'Illumina 1.5+' => 'Phred+64',
'Illumina 1.8+' => 'Phred+33',
'Illumina 1.9+' => 'Phred+33'
);
if ($min<33 || $max>105) { die "Quality values corrupt. found [$min; $max] where [33; 105] was expected\n"; }
if ($min>=33 && $max<=73) {$diag{'Sanger'}='x';}
if ($min>=59 && $max<=104) {$diag{'Solexa'}='x';}
if ($min>=64 && $max<=104) {$diag{'Illumina 1.3+'}='x';}
if ($min>=66 && $max<=105) {$diag{'Illumina 1.5+'}='x';}
if ($min>=33 && $max<=74) {$diag{'Illumina 1.8+'}='x';}
if ($min>=33 && $max<=75) {$diag{'Illumina 1.9+'}='x';}
## report
foreach my $format (sort keys %diag) {
	if ($diag{$format} eq 'x') {
		print STDOUT "# sampled raw quality values are in the range of [".$min."; ".$max."]\n";
		print STDOUT sprintf("%-30s\t$inputfile\n", $comment{$format});
	}
}
##############
#### Subs ####
# reads from uncompressed, gzipped and bgzip fastQ files
sub ReadFile {
	my $infile = shift;
	my $FH;
	if ($infile =~ /\.bz2$/) {
		open ($FH, "bzcat $infile |") or die ("$!: can't open file $infile");
	} elsif ($infile =~ /\.gz$/) {
		open ($FH, "zcat $infile |") or die ("$!: can't open file $infile");
	} elsif ($infile =~ /\.(fq|fastq|txt)$/) {
		open ($FH, "cat $infile |") or die ("$!: can't open file $infile");
	} else {
		die ("$!: do not recognise file type $infile");
	}
	return $FH;
}
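A usage sketch for this script (file name is hypothetical; the second argument defaults to 100 records):
```bash
./fastq_detect.pl sample_R1.fastq.gz 200
# Illustrative output when sampled qualities span [34; 75] (the range line is
# printed once for every matching format; here only Illumina 1.9+ matches):
# # sampled raw quality values are in the range of [34; 75]
# Phred+33                        sample_R1.fastq.gz
```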
# Allele Specific Expression Analysis workflow
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
**<u>!!! STATUS: DEVELOPMENT !!!</u>**
@@ -18,9 +18,9 @@ This pipeline will perform variant filtering, reads alignment on masked genome a
The user must use the [config_calling.yaml](config_calling.yaml.example) file to provide all necessary inputs for the pipeline:
- A masked reference fasta file (fasta_ref option), indexed with samtools faidx and GATK4 CreateSequenceDictionary. We recommend masking the genome with bi-allelic variants filtered on the GATK metrics FS < 30, QD > 2 and AF < 1. In our use case we masked the genome with variants called at the population level, and then launched this pipeline at the tissue level.
- A genome annotation reference GTF file (gtf_ref option)
- A set of variants to include in the ASE analysis
@@ -47,6 +47,8 @@ The user must use the [config_calling.yaml](config_calling.yaml.example) file to
The first step filters and trims raw reads before alignment. It converts Phred+64 scale fastq files into Phred+33 scale fastq files, and then trims reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Adapters are trimmed, and reads that end up shorter than a third of the read length are filtered out.
By default the trimming quality threshold is set to 15.
In the Results/Summary directory you will find a Log_TrimGalore_summary.tsv file that details the trim_galore statistics.
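An invocation consistent with this description might look like the sketch below (file names and the --length value are illustrative; the workflow builds the exact command from the config):
```bash
# --quality comes from the trimming_quality option (default 15);
# --length 50 would drop reads shorter than a third of 150 bp reads
trim_galore --quality 15 --length 50 --paired --gzip \
    sample_R1.fastq.gz sample_R2.fastq.gz
```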
@@ -77,7 +79,7 @@ Finally reads that contain Ns in their cigar string will be split, and alignme
Variants will be filtered before being submitted to ASE.
Only bi-allelic SNPs with FS lower than or equal to 30.0 and QD lower than or equal to 2.0 will be used in ASE.
Filtering on SNP type is an option, see SNP_filter in the configuration file. By default it is set to False, as the calling pipeline generates an output VCF file already filtered on SNP type.
@@ -92,19 +94,28 @@ The pipeline will also add annotation in the INFO field in the final VCF file (s
Additionally, the pipeline will generate 3 tables that extract the samples' GT, DP, AD_REF and AD_ALT genotype annotations. These tables include only SNPs with a genotype call rate greater than 50% and for which more than 20% of genotypes are covered by at least 5 reads.
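As a rough illustration (not necessarily how the pipeline implements it), such a filter can be expressed with bcftools; note that F_PASS() requires bcftools 1.10 or newer, while the dependency list below pins 1.9:
```bash
# Keep SNPs with <50% missing genotypes and >20% of genotypes covered by >=5 reads
bcftools view -i 'F_MISSING < 0.5 && F_PASS(FMT/DP >= 5) > 0.2' \
    annotated.vcf.gz -Oz -o filtered.vcf.gz
```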
### Compute Allele Specific Expression
Allele Specific Expression will be obtained using phASER (https://github.com/secastel/phaser) and ASEReadCounter (https://gatk.broadinstitute.org/hc/en-us/articles/360036714351-ASEReadCounter), on bi-allelic variants filtered on FS < 30 and QD > 2.
ASEReadCounter and phASER return, per sample, the allele expression at each SNP.
phASER will also return gene-level allele-specific expression by constructing gene haplotypes per sample, and finally an expression matrix that aggregates all the samples.
## 3. Dependencies
The workflow depends on:
* cutadapt (version 2.1)
* trimgalore (version 0.6.5)
* STAR (version 2.6.0c)
* samtools (version 1.9)
* bcftools (version 1.9)
* bedtools (version 2.29.0)
* tabix (version 0.2.5)
* GenomeAnalysisTK.jar (version 3.7), which includes Picard tools and ASEReadCounter
* java (version 8)
* phASER (cloned on the 29th of May 2020)
These programs need to be available in your PATH or in a bin directory specified in the config.yaml file with the bin_dir option.
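A quick way to verify that the binaries are visible (tool names from the list above; adapt the list to your installation):
```bash
for tool in cutadapt trim_galore STAR samtools bcftools bedtools tabix java; do
    command -v "$tool" >/dev/null || echo "missing from PATH: $tool"
done
```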
@@ -159,9 +159,8 @@ for sample in table["sample_name"].unique():
final_outputs.append('Results_ASE/phASER/' + sample + '_phASER.allelic_counts.txt')
final_outputs.append('Results_ASE/phASER_gene_ae/' + sample + '_phASER.gene_ae.txt')
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'))
final_outputs.append('Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz'))
# print(final_outputs)
# ['Results_ASE/STAR_Aln_2/samplePE_rg_genomic_pp_rmdup_uniq_split.bam',
Snakemake/1000RNASeq_chicken/ASE/img/full_dryrun.png (231 KiB)
@@ -49,7 +49,8 @@ SplitNCigarReads:
phASER:
cpu : 8
mem : "180G"
time : "8:00:00"
phASER_expr_matrix:
cpu : 8
@@ -71,7 +71,7 @@ rule phASER:
idSeparator = '_' if config['id_separator'] == '' else config['id_separator']
shell:
"""
python2.7 {config[bin_dir]}/phaser.py --vcf {input.vcf} --bam {input.bam} --sample {wildcards.sample} --threads {threads} --o Results_ASE/phASER/{wildcards.sample}_phASER --paired_end {params.paired_option} --unphased_vars 1 --gw_phase_vcf 1 --baseq {params.baseQuality} --mapq {params.mappingQuality} --id_separator {params.idSeparator}
"""
@@ -97,20 +97,19 @@ rule phASER_gene_ae:
python2.7 {config[bin_dir]}/phaser_gene_ae.py --haplotypic_counts {input.hap_count} --features {input.bed} --o {output} --id_separator {params.idSeparator}
"""
rule phASER_expr_matrix:
    input:
        expand("Results_ASE/phASER_gene_ae/{sample}_phASER.gene_ae.txt", sample=table["sample_name"].unique()),
        bed = 'Results_ASE/ref_bed/' + os.path.basename(config['gtf_ref']).replace('.gtf', '.bed')
    output:
        bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.bed.gz'),
        phased_bed = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER.gw_phased.bed.gz')
    params:
        gene_ae_dir = "Results_ASE/phASER_gene_ae/",
        out_prefix = 'Results_ASE/phASER_pop/' + os.path.basename(config['vcf']).replace('.vcf.gz','_phASER')
    threads: config["phASER_expr_matrix"]["cpu"]
    shell:
        """
        python2.7 {config[bin_dir]}/phaser_expr_matrix.py --t {threads} --gene_ae_dir {params.gene_ae_dir} --feature {input.bed} --o {params.out_prefix}
        """
@@ -11,7 +11,9 @@ import multiprocessing
# FUNCTION
def select_on_chrom(in_vcf, accepted_chrom, excluded_chrom, selected_vcf):
    #~ FH_in = open(in_vcf, encoding='utf8', mode='rt')
    # for FLLL I do not understand why, but utf8 does not work ...
    FH_in = open(in_vcf, encoding='iso-8859-1', mode='rt')
    FH_out = open(selected_vcf, 'wt')
    for line in FH_in:
# RNASeq Analysis workflow : quantification and SNP calling
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
- [RNASeq Analysis workflow : quantification and SNP calling](#rnaseq-analysis-workflow---quantification-and-snp-calling)
* [What it does ?](#what-it-does--)
@@ -17,7 +17,7 @@ snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
# {cluster.cpu}, {cluster.mem} and {cluster.partition} will be replaced by values defined in the resources_SLURM.yaml file
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30
# If you want to launch the workflow partially, indicate the final output files you want (see dryrun log)
@@ -26,7 +26,7 @@ snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_expected_count.tsv \
@@ -46,7 +46,7 @@ snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/GATK/${PREFIX}_SNP.vcf.gz \
@@ -620,7 +620,7 @@ rule individual_gVCF_HaplotypeCaller:
jar=find_jar("GenomeAnalysisTK.jar")
shell:
"""
java -Xmx{params.mem} -jar {params.jar} -T HaplotypeCaller -I {input.bam} -R {input.fasta} --min_base_quality_score 10 --min_mapping_quality_score 20 -ERC GVCF -dontUseSoftClippedBases -o {output.gvcf} 2> {output.log}
"""
@@ -643,7 +643,7 @@ rule genotypeGVCF:
run:
bams = gatk_multi_arg("--variant", input.gvcf)
shell( "java -Xmx{params.mem} -jar {params.jar} -T GenotypeGVCFs {bams} -R {input.fasta} -stand_call_conf 20.0 -o {output.vcf} 2> {output.log}")
# /GATK
#######
@@ -663,22 +663,5 @@ rule selectVariant:
java -Xmx{params.mem} -jar {params.jar} -T SelectVariants -R {input.ref} -V {input.vcf} -selectType INDEL -o {output.indel}
"""
# add the --excludeIntervals option to exclude certain intervals
rule filter_variant:
    input:
        ref = "Results/Fai_Dict_Index/" + os.path.basename(config["fasta_ref"]),
        snp = "Results/GATK/"+POP+"_SNP.vcf.gz"
    output:
        snp = "Results/GATK/"+POP+"_SNP_flag.vcf.gz"
    params:
        mem=config["selectVariant"]["mem"],
        jar=find_jar("GenomeAnalysisTK.jar")
    shell:
        """
        java -Xmx{params.mem} -jar {params.jar} -T VariantFiltration -R {input.ref} -V {input.snp} -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o {output.snp}
        """
# ADD rule for merging flagstat
# ADD rule for merging individual HC stat log
# ADD rule split SNP/INDEL
# ADD rule Hard filtering
# ADD rule filtering on junction
@@ -2,7 +2,7 @@ __default__ :
mem : "8G"
cpu : 8
time : "24:00:00"
partition : workq
convertPhred:
cpu : 1
@@ -21,13 +21,13 @@ star_index_pass1 :
mem : "18G"
star_aln_pass1 :
mem : "5G"
star_index_pass2 :
mem : "18G"
star_aln_pass2 :
mem : "5G"
log_merge_STAR:
cpu : 1
# RNASeq Analysis workflow : quantification and SNP calling
__authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (AGROCAMPUS OUEST / INRAE - PEGASE)
- [RNASeq Analysis workflow : quantification and SNP calling](#rnaseq-analysis-workflow---quantification-and-snp-calling)
* [What it does ?](#what-it-does--)
* [1. Input](#1-input)
* [2. Data Processing](#2-data-processing)
+ [Raw reads trimming](#raw-reads-trimming)
+ [Sequence alignment](#sequence-alignment)
+ [Transcriptomic Alignment post processing](#transcriptomic-alignment-post-processing)
+ [Gene/Isoform expression](#gene-isoform-expression)
+ [Genomic Alignment post processing](#genomic-alignment-post-processing)
+ [Variant Calling](#variant-calling)
* [3. Dependencies](#3-dependencies)
* [4. Configure and running workflow](#4-configure-and-running-workflow)
## What it does ?
This pipeline performs a classical RNASeq analysis workflow (trimming, STAR mapping, and RSEM expression quantification) but also calls variants following the [GATK RNASeq recommandations](https://software.broadinstitute.org/gatk/documentation/article.php?id=3891).
The pipeline is represented in the graph below:
![](img/full_dryrun.png)
## 1. Input
The user must use the [config_calling.yaml](config_calling.yaml.example) file to provide all necessary inputs for the pipeline:
- A reference fasta file (fasta_ref option)
- A genome reference GTF file (gtf_ref option)
- A set of known variants used to recalibrate bases quality in GATK preprocessing steps RealignerTargetCreator and BaseRecalibrator (known_vcf option)
- A sample reads description TSV file (sample_config option), with the following columns (an illustrative example follows the list):
  - idx : a unique index number to identify each fastq (pair) file
  - name : the sample name
  - forward_read and reverse_read are fastq file names. These files need to be in a directory specified with the data_dir option
    - If single end, leave the reverse_read column empty
    - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
  - sequencer : the sequencer used to generate the sequences: "ILLUMINA", "SLX", "SOLEXA", "SOLID", "454", "LS454", "COMPLETE", "PACBIO", "IONTORRENT", "CAPILLARY", "HELICOS", "UNKNOWN"
  - oriented : indicates whether sequences were generated with a strand-specific protocol. It corresponds to the forward_prob RSEM parameter. Indicate:
    - 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
    - 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
    - 0.5 for a non-strand-specific protocol.
  - phred scale : indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 up to but excluding 1.8)
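A sample_config file matching these columns might look like this sketch (sample and file names are hypothetical; the columns are tab-separated):
```bash
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    idx name forward_read reverse_read sequencer read_length oriented phred_scale \
    1 sampleA sampleA_R1.fastq.gz sampleA_R2.fastq.gz ILLUMINA 150 0.5 33 \
    2 sampleB sampleB_R1.fastq.gz '' ILLUMINA 100 0.5 33 \
    > data/samples.tsv
```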
## 2. Data Processing
### Raw reads trimming
The first step filters and trims raw reads before alignment. It converts Phred+64 scale fastq files into Phred+33 scale fastq files, and then trims reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Adapters are trimmed, and reads that end up shorter than a third of the read length are filtered out.
In the Results/Summary directory you will find a Log_TrimGalore_summary.tsv file that details the trim_galore statistics.
### Sequence alignment
Trimmed reads will be aligned against the reference genome with [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537), following the multi-sample 2-pass mapping procedure with the GTF reference file.
The STAR 2nd pass alignment will generate bam files in genome and in transcriptome coordinates.
You will find in the Results/Summary directory 2 files reporting alignment statistics: Log_STAR_Aln_1_summary.tsv and Log_STAR_Aln_2_summary.tsv
### Transcriptomic Alignment post processing
Reads with multiple alignments will be filtered out to generate a second transcriptomic bam:
```samtools view -q 255```
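Spelled out, the filtering might look like this (file names are illustrative; MAPQ 255 is the value STAR assigns to uniquely mapped reads):
```bash
samtools view -b -q 255 sample_transcriptome.bam > sample_transcriptome_uniq.bam
```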
### Gene/Isoform expression
Expression will be computed with [RSEM](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323), either on all reads aligned to the transcriptome or on uniquely mapped reads only. For each sample, it will generate gene and isoform results.
You will find in the Results/Summary directory the expression quantification for all samples:
* for genes and isoforms
* for the expected count, TPM and FPKM quantification methods
* based on uniquely mapped reads or on all alignments.
The files are named:
    RSEM\__[multimap/uniqmap]_\_summary_PREFIX\__[genes/isoforms]_\__[TPM/expected\_count/FPKM]_.tsv
PREFIX is the name of your sample_config file defined in the config.yaml file; for example, the gene-level TPM table based on uniquely mapped reads is RSEM_uniqmap_summary_PREFIX_genes_TPM.tsv.
### Genomic Alignment post processing
Genomic alignments will be used to call variants. Each sample's bam will be processed as follows (a sketch of one step follows the list):
- cleaned with CleanSam from GATK
- annotated with AddOrReplaceReadGroups from [Picard tools](https://broadinstitute.github.io/picard/) based on information you provide in the sample config file
- multimapped reads filtered out (samtools view -q 255)
- merged with other bam files belonging to the same sample name (optional)
- PCR and optical duplicates marked with Picard tools
- split alignments converted into separate alignments, with mapping qualities of 250 reassigned to 60, by SplitNCigarReads from GATK
- indels realigned with RealignerTargetCreator and IndelRealigner, and base quality scores recalibrated with BaseRecalibrator and PrintReads from GATK, using the known_vcf file provided in the config.yaml.
The final bam alignment files will be found in the Results/GATK/ directory and will be named: _[sample]_\_uniq_cs_rg_merge_md_split_real_recal.bam
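As an illustration of one of these steps, duplicate marking with Picard (the `I=/O=` syntax matching the picard.jar 2.1.1 dependency) could look like the sketch below; file names are hypothetical and the exact arguments are set by the workflow:
```bash
java -jar picard.jar MarkDuplicates \
    I=sample_uniq_cs_rg_merge.bam \
    O=sample_uniq_cs_rg_merge_md.bam \
    M=sample_md_metrics.txt
```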
### Variant Calling
Variants will be called with HaplotypeCaller from GATK, which will generate one gVCF per sample (Results/GATK/_[sample]_\_.g.vcf.gz files) with the following options:
- -stand_call_conf 20.0
- --min_base_quality_score 10
- --min_mapping_quality_score 20
GenotypeGVCFs from GATK will then take all the gVCF files to generate one VCF file called Results/GATK/PREFIX_full.vcf.gz
Finally, this VCF file will be split into two VCF files, one for SNPs and one for INDELs, using SelectVariants from GATK: Results/GATK/PREFIX_SNP.vcf.gz and Results/GATK/PREFIX_INDEL.vcf.gz
## 3. Dependencies
### 3.1. Links to binaries
The workflow depends on: LIST TO CHECK !
* cutadapt
* trimgalore (version 0.4.5)
* STAR (version 2.5.2b)
* picard.jar (version 2.1.1)
* samtools (version 1.3.1)
* rsem-prepare-reference (RSEM version 1.3.0)
* rsem-calculate-expression (RSEM version 1.3.0)
* GenomeAnalysisTK.jar (version 3.7)
* java (version 8)
* snakemake (version 4.8.0)
* maskFastaFromBed (bedtools version 2.26.0)
* vcf-merge (VCFtools version 0.1.15)
* bgzip (Tabix version 0.2.5)
* tabix (Tabix version 0.2.5)
* python3.6.3 (version 3.6.3)
* fastqc (version 0.11.7)
These programs need to be available in your PATH or in a bin directory specified in the config.yaml file with the bin_dir option.
### 3.2. Using conda mamba
```
conda activate mamba
mamba create -n env_name -c bioconda -c conda-forge cutadapt trim-galore star samtools gatk4 rsem
conda activate /path/to/miniconda/envs/mamba/envs/env_name
module load system/Python-3.6.3
export PERL5LIB=""
```
## 4. Configure and running workflow
- Copy the config_calling.yaml.example file into your working directory and update it according to the instructions inside the file.
- Launch the snakemake command as presented in SLURM.sh. This script explains how to launch a dry run, the full workflow, expression quantification only, or variant calling only.
- If necessary, update resources_SLURM.yaml to define the CPU, memory and cluster partition resources for each rule. The __default__ section gives the minimal resources per job. The other sections are named after rules and define only the resources that differ from the default (see the sketch below).
Resources_SLURM.yaml and SLURM.sh are specific to SLURM clusters. If you use another scheduler you will need to create equivalent files, but you can use these as a starting point.
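For instance, a per-rule override could be added like this (the rule name comes from the resources file; the values are illustrative):
```bash
# Sketch: extend a copy of the resources file with a per-rule override
# (if the rule already has a section, edit it in place instead)
cp resources_SLURM.yaml my_resources.yaml
cat >> my_resources.yaml <<'EOF'
star_aln_pass1 :
  cpu : 16
  mem : "10G"
EOF
```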
This is the official documentation of Snakemake if you need more information: <https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html>
#!/bin/bash
##############################################################################
## launch workflow
# if used on Genologin cluster (INRA Toulouse)
# module load system/Python-3.6.3
WORKFLOW_DIR=.
mkdir -p logs
# This is an example of the snakemake command line to launch a dryrun
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example --dryrun
# This is an example of the snakemake command line to launch the full workflow on a SLURM cluster
# {cluster.cpu}, {cluster.mem} and {cluster.partition} will be replaced by values defined in the resources_SLURM.yaml file
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30
# If you want to launch the workflow partially, indicate the final output files you want (see dryrun log)
#
# For RSEM quantification only
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_TPM.tsv
# For GATK calling only
PREFIX=?? # your sample_config TSV file without extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_calling_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/GATK/${PREFIX}_SNP.vcf.gz \
Results/GATK/${PREFIX}_INDEL.vcf.gz
# bin_dir is a directory containing software binary used in this pipeline:
# cutadapt
# trimgalore (version 0.4.5)
# STAR (version 2.5.2b)
# picard.jar (version 2.1.1)
# samtools (version 1.3.1 )
# rsem-prepare-reference ( RSEM version 1.3.0)
# rsem-calculate-expression ( RSEM version 1.3.0)
# GenomeAnalysisTK.jar (version 3.7)
# java (version 8)
# if bin_dir is not set, the binaries need to be available in the PATH
bin_dir : .
# data_dir contains fastq files described in sample_config file
data_dir : /work/project/1000rnaseqchick/Mathieu/reads/
# sample_config file is a tabular file describing each input fastq file
# sample_config file name will be used as prefix of the output files.
# header columns are : idx name forward_read reverse_read sequencer read_length oriented phred_scale
# forward_read and reverse_read are fastq file names. These files need to be in the data_dir directory
# sample may be paired end or single end.
# - If single end, leave the reverse_read column empty
# - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
# sequencer must be chosen from the list : ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
# oriented is the forward_prob RSEM parameter to indicate :
# 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
# 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
# 0.5 for a non-strand-specific protocol.
# phred scale indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 up to but excluding 1.8)
sample_config : data/1000RNAseq.tsv
# fasta_ref file is the genome reference Fasta file
fasta_ref : data/reference.fa
# gtf_ref file is the genome reference GTF file
gtf_ref : data/reference.gtf
# known_vcf file is set of known variants used to recalibrate bases quality in GATK preprocessing steps RealignerTargetCreator and BaseRecalibrator
known_vcf : data/reference_known_var.vcf.gz
# quality trimming threshold used in trimgalore to remove low quality bases.
trimming_quality : 15
# computing resources, also given to the --cluster-config snakemake option if executed on a cluster
# this yaml file defines default resources (at least mem and cpu) in a __default__ section, plus per-rule sections that override only the resources differing from the default
resources: resources_calling_SLURM.yaml
# bin_dir is a directory containing software binary used in this pipeline:
# cutadapt
# trimgalore (version 0.4.5)
# STAR (version 2.5.2b)
# picard.jar (version 2.1.1)
# samtools (version 1.3.1 )
# rsem-prepare-reference ( RSEM version 1.3.0)
# rsem-calculate-expression ( RSEM version 1.3.0)
# GenomeAnalysisTK.jar (version 3.7)
# java (version 8)
# if bin_dir is not set, the binaries need to be available in the PATH
bin_dir : ./
# data_dir contains fastq files described in sample_config file
data_dir : data
# sample_config file is a tabular file describing each input fastq file
# sample_config file name will be used as prefix of the output files.
# header columns are : idx name forward_read reverse_read sequencer read_length oriented phred_scale
# forward_read and reverse_read are fastq file names. These files need to be in the data_dir directory
# sample may be paired end or single end.
# - If single end, leave the reverse_read column empty
# - If paired, file names need to end with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarily compressed)
# sequencer must be chosen from the list : ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
# oriented is the forward_prob RSEM parameter to indicate :
# 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
# 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand,
# 0.5 for a non-strand-specific protocol.
# phred scale indicates the phred score scale used to encode base quality: either 33 (Sanger and Illumina 1.8+) or 64 (Solexa, Illumina 1.3 up to but excluding 1.8)
sample_config : data/population.tsv.example
# fasta_ref file is the genome reference Fasta file
fasta_ref : data/reference.fa
# gtf_ref file is the genome reference GTF file
gtf_ref : data/reference.gtf
# known_vcf file is set of known variants used to recalibrate bases quality in GATK preprocessing steps RealignerTargetCreator and BaseRecalibrator
known_vcf : data/reference_known_var.vcf.gz
# quality trimming threshold used in trimgalore to remove low quality bases.
trimming_quality : 15
# computing resources, also given to the --cluster-config snakemake option if executed on a cluster
# this yaml file defines default resources (at least mem and cpu) in a __default__ section, plus per-rule sections that override only the resources differing from the default
resources: resources_calling_SLURM.yaml