README.md

# Allel Specific Expression Analysis workflow

authors__: Maria Bernard (INRAE - SIGENAE) / Kevin Muret - Frederic Jehl - Sandrine Lagarrigue ( AGROCAMPUS OUEST / INRAE - PEGASE)

**<u>!!! STATUT : DEVELOPMENT !!!</u>**

## What it does ? 

This pipeline will perform variant filtering, reads alignment on masked genome and allel specific quantification expression.

 The pipeline is represented in the graph below:

![](img/full_dryrun.png)


## 1. Input

The user must use the [config_calling.yaml](config_calling.yaml.example) file to provide all necessary inputs for the pipeline:

- A masked reference fasta file (fasta_ref option), indexed with samtools faidx and GATK4 CreateSequenceDictionary. We recommend to mask genome with variant filtered bi-allelic and on GATK metrics FS < 30, QD > 2 and AF < 1. In our used case we mask genome on variant construct at the population level, and then we launch this pipeline at the tissue level.

- An genome annotation reference GTF file (gtf_ref option)

- A set of variant to include in the ASE analysis

- A sample reads description TSV file (sample_config option), with the following columns:

  - idx : a unique index number to identify each fastq (paire) file
  - name : the sample name
  - forward_read and reverse_read are fastq file names. These files need to be in a directory precised with the data_dir option
    - If single end, leave an empty column in the reverse_read
    - If paired, file names need to ends with _R1.fastq.gz or _R2.fastq.gz (or fq instead of fastq, and not necessarly compressed)
  - sequencer : sequencer used to generate the sequences : "ILLUMINA", "SLX", "SOLEXA", "SOLID", "454", "LS454", "COMPLETE", "PACBIO", "IONTORRENT", "CAPILLARY", "HELICOS", "UNKNOWN"
  - oriented : to indicate if sequences are generated with strand-specific protocol. It corresponds to the forward_prob RSEM parameter. Indicate : 
    - 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand,
    - 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand,
    - 0.5 for a non-strand-specific protocol. 
  - phred scale: to indicate the phred score scale use to code base quality: either 33 (Sanger and illumina 1.8+) or 64 (Solexa, illumina 1.3 to 1.8 excluded)


## 2. Data Processing

### Raw reads trimming

First step is to filter and trim raw reads before alignment. This step will convert phred 64 scale fastq files into phred 33 scale fastq files and then trim reads with [trim_galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). It will trim adapter and filtered out trimmed reads that are smaller than the third of the read length.

By default the trimming quality threshold is set to 15.

You will find in Result/Summary directory, a Log_TrimGalore_summary.tsv that detail the trim_galor statistics.


### Sequence alignment

Trimmed reads will be align against the reference genome thanks to the [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537) following the multi-sample 2-pass mapping procedure, with the GTF  reference file.

The second alignment pass will take into account new junctions detected from the first pass. Among all samples new canonical junction covered by at least 2 reads in one sample will be kept.

STAR 2nd pass alignment will generate bam files in genome coordinates.

You will find in the Results/Summary directory , 2 files reporting alignment statistics : Log_STAR_Aln_1_summary.tsv and Log_STAR_Aln_2_summary.tsv


### Genomic Alignment post processing

Alignment will be filtered to keep only:

- properly paired alignment in case of paired libraries
- non duplicated alignment ( PCR or OPTICAL duplicates are removed)
- uniquely mapped reads (corresponding to STAR quality score of 255)

Finally reads that contain Ns in their cigar string will be splittd, and alignment score of 255 will be convert to 60.

### Variant Filtration

Variants will be filtered before submitting to ASE.

Only bi-allelic SNP with FS < or equal to 30.0 and QD < or equal to 2.0 will be used in ASE.

Filtering on SNP type is an option, see SNP_filter in the configuration file. By default it is set to False as the calling pipeline generate output VCF file filtered on SNP type.

The pipeline will also add annotation in the INFO field in the final VCF file (see: Results/vcfFiltration/*_FsQdBiall.vcf.gz):

- LOCALISATION will be set to :
  - *jnct5nt* if the variants is closed ( +/- 5bp) to an exon/intron junction
  - *homodimer5* if the variants is included in a 5bp homopolymer
  - *cl3snp35nt* if the variants is in a 35 bp windows that included at least 3 others variants
- GTpopCR will indicate the percentage of known genotype
- DPgt05rPopCR will indicate the percentage of known genotype covered by at least 5 reads

Additionally, the pipeline will generate 3 tables that extract samples GT, DP and AD_REF and AD_ALT genotypes annotations. These tables will include only SNP that have a genotype call rate greater than 50% and a genotype with at least 5 reads coverage call rate greater than 20%.

### Compute Allele Specific Expression

Alleles Specific Expression will be obtained using phASER (https://github.com/secastel/phaser) and ASE Read Counter (https://gatk.broadinstitute.org/hc/en-us/articles/360036714351-ASEReadCounter), on bi-allelic variants filtered on FS < 30, and QD > 2.

ASEReadCounter and phASER returnby sample, allele expression on each SNP.

phASER will also return gene allele specific espression by constructing gene haplotypes by sample, and finally an expression matrix which agglomerate all the samples. 

## 3. Dependencies

The workflow depends on:

* cutadapt (version 2.1)
* trimgalore (version 0.6.5)
* STAR (version 2.6.0c)
* samtools (version 1.9 )
* bcftools (version 1.9)
* bedtools( version 2.29.0)
* tabix (version 0.2.5)
* GenomeAnalysisTK.jar (version 3.7) which include Picard tools and ASE Read Counter
* java (version 8)
* phASER (cloned the 29th of May 2020)

Theses softwares need to be available in you PATH or in a bin directory precised in the config.yaml file with the bin_dir option.


##  4. Configure and running workflow

- Copy the config_calling.yaml.example file into your working directory and update it according to the instructions inside the file.

- launch snakemake command as presented in SLURM.sh to launch the workflow. This script explain how to launch a dryrun, a full run, perform quantification expression only or variant calling only

- If necessary update resources_SLURM.yaml to define cpu memory and cluster partition resources for each rule. The default section is the minimal resources per job. The other section take as name the rules name and define only resources that are different from the default section.


Resources_SLURM.yaml and SLURM.sh are specific to SLURM cluster. Therefore, if you use another cluster, you need to create other files. However, you can use these files to inspire you.
    

This is the official documentation of Snakemake if you need more information: <https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html>