Skip to content
Snippets Groups Projects
Commit b47fcd24 authored by Maria Bernard's avatar Maria Bernard
Browse files

Snakemake RNSAseqVariantCalling : update documentation workflow and clean up

parent d88486bc
No related branches found
No related tags found
No related merge requests found
## Install snakemake 5 with a venv
```
module load system/Python-3.6.3
python3.6 -m venv env/snakemake-5_venv
module purge
source env/snakemake-5_venv/bin/activate
pip install snakemake
pip install drmaa
pip install pandas
```
## Install conda and singularity
Conda:
```
wget https://repo.continuum.io/miniconda/Miniconda2-4.6.14-Linux-x86_64.sh
sh Miniconda2-4.6.14-Linux-x86_64.sh
```
Restart shell:
```
conda config --set auto_activate_base false
conda config --add channels bioconda
```
Singularity:
```
sudo apt-get install singularity-container
```
## Create singularity image
[see build process](./env/README.md)
\ No newline at end of file
# How to install
# RNASeq Analysis workflow : quantification and SNP calling
## Install snakemake 5 with a venv
__authors__: Maria Bernard (INRA - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue (INRA - PEGASE)
```
module load system/Python-3.6.3
python3.6 -m venv env/snakemake-5_venv
module purge
source env/snakemake-5_venv/bin/activate
pip install snakemake
pip install drmaa
pip install pandas
```
- [RNASeq Analysis workflow : quantification and SNP calling](#rnaseq-analysis-workflow---quantification-and-snp-calling)
* [What it does ?](#what-it-does--)
* [1. Input](#1-input)
* [2. Data Processing](#2-data-processing)
+ [Raw reads trimming](#raw-reads-trimming)
+ [Sequence alignment](#sequence-alignment)
+ [Transcriptomic Alignment post processing](#transcriptomic-alignment-post-processing)
+ [Gene/Isoform expression](#gene-isoform-expression)
+ [Genomic Alignment post processing](#genomic-alignment-post-processing)
+ [Variant Calling](#variant-calling)
* [3. Dependencies](#3-dependencies)
* [4. Configure and running workflow](#4-configure-and-running-workflow)
## Install conda and singularity
## What it does ?
Conda:
```
wget https://repo.continuum.io/miniconda/Miniconda2-4.6.14-Linux-x86_64.sh
sh Miniconda2-4.6.14-Linux-x86_64.sh
```
This pipeline will perform classical RNASeq analysis workflow (trimming, STAR mapping, and RSEM expression quantification) but also call variant based on the [GATK RNASeq recommandations](https://software.broadinstitute.org/gatk/documentation/article.php?id=3891).
Restart shell:
```
conda config --set auto_activate_base false
conda config --add channels bioconda
```
The pipeline is represented in the graph below:
Singularity:
```
sudo apt-get install singularity-container
```
![](/home/maria/workspace/workflows_develop/Snakemake/1000RNASeq_chicken/calling/img/full_dryrun.png)
## Create singularity image
[see build process](./env/README.md)
## 1. Input
# How to run
The user must use the [config_calling.yaml](example/config_calling.yaml) file to provide all necessary inputs for the pipeline:
- A reference fasta file (fasta_ref option)
- A genome reference GTF file (gtf_ref option)
- A set of known variants used to recalibrate bases quality in GATK preprocessing steps RealignerTargetCreator and BaseRecalibrator (known_vcf option)
- A sample reads description TSV file (sample_config option), with the following columns:
- idx : a unique index number to identify each fastq (paire) file
- name : the sample name
- forward_read and reverse_read are fastq file names. These files need to be in a directory precised with the data_dir option
- If single end, leave an empty column in the reverse_read
- If paired, file names need to ends with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarly compressed)
- sequencer : sequencer used to generate the sequences : "ILLUMINA", "SLX", "SOLEXA", "SOLID", "454", "LS454", "COMPLETE", "PACBIO", "IONTORRENT", "CAPILLARY", "HELICOS", "UNKNOWN"
- oriented : to indicate if sequences are generated with strand-specific protocol. It corresponds to the forward_prob RSEM parameter. Indicate :
- 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
- 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand,
- 0.5 for a non-strand-specific protocol.
- phred scale: to indicate the phred score scale use to code base quality: either 33 (Sanger and illumina 1.8+) or 64 (Solexa, illumina 1.3 to 1.8 excluded)
## 2. Data Processing
### Raw reads trimming
First step is to filter and trim raw reads before alignment. This step will convert phred 64 scale fastq files into phred 33 scale fastq files and then trim reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). It will trim adapter and filtered out trimmed reads that are smaller than the third of the read length.
You will find in Result/Summary directory, a Log_TrimGalore_summary.tsv that detail the trim_galor statistics.
### Sequence alignment
Trimmed reads will be align against the reference genome thanks to the [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537) following the multi-sample 2-pass mapping procedure, with the GTF reference file.
STAR 2nd pass alignment will generate bam files in genome and in transcriptome coordinates.
You will find in the Results/Summary directory , 2 files reporting alignment statistics : Log_STAR_Aln_1_summary.tsv and Log_STAR_Aln_2_summary.tsv
### Transcriptomic Alignment post processing
Reads with multiple alignment will be filtered out to generate a seconde transcriptomic bam:
```samtools view -q 255```
### Gene/Isoform expression
Expression will be compute on all reads align on transcriptome or on uniquely mapped reads thanks to [RSEM](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323). For each sample, it will generate gene and isoforms results.
You will find in the Results/Summary directory, expression quantification for all samples:
## Configure workflow
* for gene and isorms
* for expected count, TPM and FPKM quantification method
* based on uniquely mapped reads or all alignment.
### 1. Create sample file based on data/sample.example.csv file
The file will be named :
`sample_config` file is a tabular file describing each input fastq file.
​ RSEM\__[multimap/uniqmap]_\_summary_PREFIX\__[gene/isoforms]_\__[TPM/expected\_count/FPKM]_.tsv
The expected header columns are : `idx name forward_read reverse_read sequencer read_length oriented phred_scale`
PREFIX is the name of your sample_config file define in the config.yaml file.
#### `name` will be used to agregated sample in the final VCF
#### `forward_read` and `reverse_read` are fastq file names. These files need to be in the data_dir directory sample may be paired end or single end.
* If single end leave an empty column in the reverse_read
* If paired, file names need to ends with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarly compressed)
### Genomic Alignment post processing
Genomic alignement will be used to call variants. Each sample will have its bam treated like this:
- cleaned thanks to cleanSam from GATK
- annotated with AddOrReplaceReadGroups from [Picard tools](https://broadinstitute.github.io/picard/) based on information you provide in the sample config file
- multimapped reads filtered out (samtools -q 255)
- merged with other bam files belonging to the same sample name (optionnal)
- with PCR and optical duplicates marked thanks to Picard tools
- with splitted alignment convert in separated alignments and mapping quality of 250 reassigned to 60 with SplitNCigarReads from GATK
- with its indel realign thanks to RealignerTargetCreator and IndelRealigner, and its base quality score recalibrated thanks to BaseRecalibrator and PrintReads from GATK using the known_vcf file provided in the config.yaml.
The final bam alignment files will be find in Results/GATK/ directory and will be called : _[sample]_\_uniq_cs_rg_merge_md_split_real_recal.bam
### Variant Calling
Variants will called thanks to HaplotypeCaller from GATK which will generate one gVCF per sample (Results/GATK/_[sample]_\_.g.vcf.gz files) with the following options:
- -stand_call_conf 20.0
- --min_base_quality_score 10
- --min_mapping_quality_score 20
GenotypeGVCFs from GATK will take all gVCF files to generate one VCF file called Results/GATK/PREFIX_full.vcf.gz
Finally, this VCF file will be splitted into two VCF files for SNP or INDEL using SelectVariant from GATK:Results/GATK/PREFIX_SNP.vcf.gz and Results/GATK/PREFIX_INDEL.vcf.gz
# How to install
The workflow depends on:
* cutadapt
* trimgalore (version 0.4.5)
* STAR (version 2.5.2b)
* picard.jar (version 2.1.1)
* samtools (version 1.3.1 )
* rsem-prepare-reference ( RSEM version 1.3.0)
* rsem-calculate-expression ( RSEM version 1.3.0)
* GenomeAnalysisTK.jar (version 3.7)
* java (version 8)
These dependencies are all included in the singularity environment, or must be available in your PATH or in a directory that you must precise in the [config_calling.yaml](example/config_calling.yaml)
See [How_to_install.md](How_to_install.md) for more details on building singularity environment
# How to run
### Configure workflow
1. Create sample file based on example/sample.tsv file (see [config_calling.yaml](example/config_calling.yaml) or section [1. Input](#1-input):
#### `sequencer`
* ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
- The name of the file will be used to prefix the final results
#### `oriented`
This is the forward_prob RSEM parameter to indicate :
* 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
* 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand,
* 0.5 for a non-strand-specific protocol. (Default: off)
- sample with the same `name` will be aggregated
#### `phred scale` indicate the phred score scale use to code base quality
* either 33 (Sanger and illumina 1.8+)
* or 64 (Solexa, illumina 1.3 to 1.8 excluded)
- fastq files need to be in the `data_dir` specified in the `config.yaml` file (see next point)
### 2. edit config file config_calling.yaml
Copy file `example/config_calling.yaml` and fill it with exepected values :
2. Copy file `example/config_calling.yaml` and fill it with expected values :
For workflow software : you can use either binary in path, directory path with all requiered binary or singularity image.
[see build process](./env/README.md)
For workflow software : you can use either binary in path, directory path with all required binary or singularity image (see [build process](./env/README.md))
Configure input path files:
* sample_config
* fasta_ref
* gtf_ref
* known_vcf
* resources
### 3. set environment variable
3. launch snakemake command as presented in SLURM.sh or Singularity_SLURM.sh to launch the workflow. This script explain how to launch a dryrun, a full run, perform quantification expression only or variant calling only
If you don't use singularity image you must fix path to java jar files for picard tools and GATK (3.8)
```
picard_jar : $PICARD_PATH
gatk_jar : $GATK_PATH
```
## Execution on a cluster
Create script file `launsh_snakemake.sh`:
```
#!/bin/sh
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH -J GATK_RSEM
#SBATCH -o log/%x.out
#SBATCH -e log/%x.err
4. If necessary update [resources_SLURM.yaml](resources_SLURM.yaml) to define cpu memory and cluster partition resources for each rule. The default section is the minimal resources per job. The other section take as name the rules name and define only resources that are different from the default section.
source ./env/snakemake-5_venv/bin/activate
module load system/singularity-3.0.1
mkdir log
snakemake \
--snakefile Snakefile \
--configfile config_calling.yaml \
--jobs 200 --cluster-config example/resources_calling_SLURM.yaml \
--use-singularity \
--cluster "sbatch -p {cluster.queue} -e log/%x.err -o log/%x.out \
--cpus-per-task={cluster.cpu} --time={cluster.time} --mem-per-cpu={cluster.mem}" \
--printshellcmds --latency-wait 30
```
resources_SLURM.yaml, SLURM.sh, Singularity_SLURM.sh are specific to SLURM cluster. Therefore, if you use another cluster, you need to create other files. However, you can use these files to inspire you.
launch command with:
```
sbatch launsh_snakemake.sh
```
This is the official documentation of Snakemake if you need more informations : <https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html>
\ No newline at end of file
#!/bin/bash
##############################################################################
## launch workflow
# if used on Genologin cluster (INRA Toulouse )
# module load system/Python-3.6.3
# Path to the SNAKEMAKE workflow directory
WORKFLOW_DIR=.
mkdir -p logs
# This is an example of the snakemake command line to launch a dryrun
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example --dryrun
# This is an example of the snakemake command line to launch the full workflow on a SLURM cluster
# {cluster.cpu}, {cluster.mem} and {cluster.partition} will be replace by value defined in the resources_SLURM.yaml file
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30
# If you want to launch the workflow partially, indicate the final output files you want (see dryrun log)
#
# For RSEM quantification only
PREFIX=?? # your sample_config TSV file whithout extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_TPM.tsv
# For GATK calling only
PREFIX=?? # your sample_config TSV file whithout extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/GATK/${PREFIX}_SNP.vcf.gz\
Results/GATK/${PREFIX}_INDEL.vcf.gz
#!/bin/bash
##############################################################################
## launch workflow
# if used on Genologin cluster (INRA Toulouse )
# module load system/Python-3.6.3
# module load system/singularity-3.0.1
# Path to the SNAKEMAKE workflow directory
WORKFLOW_DIR=.
mkdir -p logs
# This is an example of the snakemake command line to launch a dryrun
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example --dryrun
# This is an example of the snakemake command line to launch the full workflow on a SLURM cluster
# {cluster.cpu}, {cluster.mem} and {cluster.partition} will be replace by value defined in the resources_SLURM.yaml file
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--use-singularity \
--configfile config_calling.yaml.example \
--cluster-config $WORKFLOW_DIR/resources_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30
# If you want to launch the workflow partially, indicate the final output files you want (see dryrun log)
#
# For RSEM quantification only
PREFIX=?? # your sample_config TSV file whithout extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--use-singularity \
--cluster-config $WORKFLOW_DIR/resources_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_genes_TPM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_expected_count.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_FPKM.tsv \
Results/Summary/RSEM_multimap_summary_${PREFIX}_isoforms_TPM.tsv \
Results/Summary/RSEM_uniqmap_summary_${PREFIX}_isoforms_TPM.tsv
# For GATK calling only
PREFIX=?? # your sample_config TSV file whithout extension
snakemake -s $WORKFLOW_DIR/Snakefile_calling --printshellcmds --jobs 200 \
--configfile config_calling.yaml.example \
--use-singularity \
--cluster-config $WORKFLOW_DIR/resources_SLURM.yaml \
--cluster "sbatch -p {cluster.partition} --cpus-per-task={cluster.cpu} --mem-per-cpu={cluster.mem} --error=logs/%x.stderr --output=logs/%x.stdout " --latency-wait 30 \
Results/Summary/Log_TrimGalore_summary.tsv Results/Summary/Log_STAR_Aln_1_summary.tsv Results/Summary/Log_STAR_Aln_2_summary.tsv \
Results/GATK/${PREFIX}_SNP.vcf.gz\
Results/GATK/${PREFIX}_INDEL.vcf.gz
This diff is collapsed.
################################################################################
#
# SOFTWARE : soft should bin in PATH or in bin_dir or use a singularity image
# bin_dir is a directory containing software binary used in this pipeline:
# bin_dir:
#
### Singularity Environment
# Could be generate with env/rnaseq-gatk-singularity-receipe.txt
singulatiry_img: <ABS_PATH>/rnaseq-gatk-singularity
# for singularity image use following environment variable.
### OR executable in a bin directory
# bin_dir is a directory containing software binary used in this pipeline:
# bin_dir:
# for bin_dir fix path to jar files
# for singularity image use following environment variable.
picard_jar: $PICARD_PATH
gatk_jar: $GATK_PATH
################################################################################
#
# WORKFLOW INPUTS
#
# data_dir contains fastq files described in sample_config file
data_dir: <ABS_PATH>
# sample file see REAME file for detailed explanations
# sample_config file is a tabular file describing each input fastq file
# sample_config file name will be used as prefix of the output files.
# header columns are : idx name forward_read reverse_read sequencer read_length oriented phred_scale
# forward_read and reverse_read are fastq file names. These files need to be in the data_dir directory
# sample may be paired end or single end.
# - If single end, leave an empty column in the reverse_read
# - If paired, file names need to ends with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarly compressed)
# sequencer need to be choosen in the list : ["ILLUMINA","SLX","SOLEXA","SOLID","454","LS454","COMPLETE","PACBIO","IONTORRENT","CAPILLARY","HELICOS","UNKNOWN"]
# oriented is the forward_prob RSEM parameter to indicate :
# 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
# 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand,
# 0.5 for a non-strand-specific protocol.
# phred scale indicate the phred score scale use to code base quality: either 33 (Sanger and illumina 1.8+) or 64 (Solexa, illumina 1.3 to 1.8 excluded)
sample_config : sample.tsv
# fasta_ref file is the genome reference Fasta file
fasta_ref: <ABS_PATH>/reference.fa
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment