# WGS SNP calling pipeline for paired-end Illumina sequencing
## From raw fastq data to vcf file.
## Pipeline features:
### 1) read preprocessing,
### 2) read QC, fastq report
### 3) merges overlapping reads and keeps unmerged paired-end reads (R1 & R2)
### 4) mapping with bwa,
      a) Maps unmerged reads (PE)
      b) Maps merged reads
      c) Merge the 2 bam files
### 5) mapping QC report (multiqc)
### 6) SNP calling using GATK in parallel mode: reference split by sequence: #tasks = #samples x #chrs
      a) GATK using "Best Practices Workflows" https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows
      b) Using hard filtering as outlined in the GATK docs
       https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set
       https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants
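
The task count in step 6 (#tasks = #samples x #chrs) can be illustrated with a minimal sketch. The sample IDs and sequence names below are hypothetical, and this is not the pipeline's own scheduling code; it only enumerates the (sample, sequence) pairs that can be processed independently.

```python
from itertools import product


def enumerate_tasks(samples, sequences):
    """Return one (sample, sequence) pair per independent calling task."""
    return list(product(samples, sequences))


# Hypothetical sample IDs and reference sequence names.
samples = ["sp1", "sp2", "sp3"]
sequences = ["chr01", "chr02"]

tasks = enumerate_tasks(samples, sequences)
print(len(tasks))  # 3 samples x 2 sequences = 6 tasks
```

A reference with many small scaffolds therefore produces many small tasks, which is worth keeping in mind when sizing the SLURM job limits.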


:exclamation: !! In the reference fasta file, the definition lines must not contain spaces or special characters.
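
A quick check of the reference before launching the pipeline might look like the sketch below. The set of allowed characters (letters, digits, underscore, dot, hyphen) is an assumption, since the exact meaning of "special character" is not specified here; `bad_fasta_headers` is a hypothetical helper, not part of the pipeline.

```python
import re

# Assumption: a "safe" name uses only letters, digits, underscore, dot, hyphen.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def bad_fasta_headers(lines):
    """Return the definition lines (without '>') that are not safe."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            name = line[1:].rstrip("\r\n")
            if not SAFE_NAME.fullmatch(name):
                bad.append(name)
    return bad


# Hypothetical fasta content; in practice pass an open file handle instead.
example = [
    ">chr01\n", "ACGT\n",
    ">chr02 length=1000\n", "ACGT\n",
    ">scaffold|3\n", "AC\n",
]
print(bad_fasta_headers(example))
```

An empty result means every definition line passes the assumed rule; any name returned should be renamed in the fasta file before indexing.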



## Snakemake features: fastq from csv file, config, modules, SLURM

### Workflow steps are described in the dag_rules.pdf
### Snakemake rules are based on [Snakemake modules](https://forgemia.inra.fr/gafl/snakemake_modules)


### Files description:
### 1) Snakefile
    - Snakefile_para_gatk_PEmerged.smk

### 2) Configuration file in yaml format: paths, singularity image paths, parameters, GATK filters, ...
    - config.yaml

### 3) An sbatch file to run the pipeline (to be edited)
    - run_snakemake_pipeline_gatk.slurm

### 4) SLURM directives (#cores, mem, ...) in json format. Can be adjusted if needed
    - cluster.json
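
As an illustration of how per-rule resources are typically resolved, the sketch below assumes cluster.json follows the usual Snakemake cluster-configuration convention, where a `__default__` entry is overridden by rule-specific entries. The rule name `bwa_map` and the keys `cpus-per-task` and `mem` are hypothetical; check the actual cluster.json for the names it uses.

```python
import json

# Hypothetical content; the real cluster.json may use other keys and rules.
cluster_json = """
{
  "__default__": {"cpus-per-task": 1, "mem": "4G"},
  "bwa_map":     {"cpus-per-task": 8}
}
"""


def resources_for(rule, cluster):
    """Merge the default directives with the rule-specific ones."""
    merged = dict(cluster.get("__default__", {}))
    merged.update(cluster.get(rule, {}))
    return merged


cluster = json.loads(cluster_json)
print(resources_for("bwa_map", cluster))
print(resources_for("some_other_rule", cluster))
```

Under this convention, raising memory for one rule only requires adding or editing that rule's entry; all other rules keep the defaults.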

### 5) samples file in csv format
    Must contain at least 2 columns for SE reads and 3 for PE reads (tab-separated)
    SampleName  fq1     fq2
    SampleName : your sample ID
    fq1: fastq file for a given sample
    fq2: read 2 for paired-end reads
    (sp1   ra_R1.fastq.gz   ra_R2.fastq.gz)

    a sample may have multiple fastq files separated by a ','
    (sp1   ra_R1.fastq.gz,rb_R1.fastq.gz   ra_R2.fastq.gz,rb_R2.fastq.gz)

    - samples.csv
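
The layout described above can be read with a few lines of Python. This is a sketch, not the pipeline's own parser: it assumes the file starts with a header row naming the columns `SampleName`, `fq1` and `fq2`, and `read_samples` is a hypothetical helper.

```python
import csv
import io


def read_samples(handle):
    """Map each SampleName to its lists of fq1 and fq2 files."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        fq1 = row["fq1"].split(",")
        # fq2 is absent or empty for single-end samples.
        fq2 = row["fq2"].split(",") if row.get("fq2") else []
        if fq2 and len(fq1) != len(fq2):
            raise ValueError(f"{row['SampleName']}: fq1/fq2 counts differ")
        samples[row["SampleName"]] = {"fq1": fq1, "fq2": fq2}
    return samples


# Same example as above: one sample with two fastq files per read.
text = (
    "SampleName\tfq1\tfq2\n"
    "sp1\tra_R1.fastq.gz,rb_R1.fastq.gz\tra_R2.fastq.gz,rb_R2.fastq.gz\n"
)
samples = read_samples(io.StringIO(text))
print(samples["sp1"]["fq1"])
```

Checking that fq1 and fq2 list the same number of files catches a common editing mistake before any job is submitted.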


## RUN:

### 1) Edit the config.yaml

### 2) Set your samples in the samples.csv

### 3) Adjust the run_snakemake_pipeline_gatk.slurm file

### 4) Run the pipeline in dry-run mode first:
`sbatch run_snakemake_pipeline_gatk.slurm`

### 5) Uncomment the real run line and run the pipeline:
`sbatch run_snakemake_pipeline_gatk.slurm`


### 6) Optionally run post-vcf filtering (example use for GWAS)
    1) filter by read depth
    2) keep only biallelic variants
    3) maximum number of missing samples per site
    4) filter by MAF
    5) keep only polymorphic sites


`sbatch run_post_filters_and_stats.sh`
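
To make the five post-vcf filters concrete, here is a minimal pure-Python sketch of the logic applied to one site. It is not the content of run_post_filters_and_stats.sh: the thresholds (`min_dp`, `max_missing`, `min_maf`) are hypothetical defaults, and treating low-depth genotypes as missing is an assumption about how the depth filter is applied.

```python
def keep_site(alts, genotypes, min_dp=5, max_missing=0.2, min_maf=0.05):
    """Return True if a site passes all five filters.

    alts: list of ALT alleles at the site.
    genotypes: one (allele_a, allele_b, depth) tuple per sample, with
    alleles coded 0 (REF) or 1 (ALT), or None for a missing genotype.
    """
    # 1) read depth: low-depth genotypes are treated as missing (assumption)
    called = [g for g in genotypes if g is not None and g[2] >= min_dp]
    # 2) biallelic only: exactly one ALT allele
    if len(alts) != 1:
        return False
    if not genotypes or not called:
        return False
    # 3) maximum fraction of missing samples at the site
    if 1 - len(called) / len(genotypes) > max_missing:
        return False
    # 4) minor allele frequency over the called alleles
    alleles = [a for g in called for a in g[:2]]
    alt_freq = sum(alleles) / len(alleles)
    maf = min(alt_freq, 1 - alt_freq)
    # 5) polymorphic sites only: both alleles must be observed
    return maf > 0 and maf >= min_maf


# Hypothetical site: 4 samples, all called with enough depth.
site = [(0, 1, 10), (0, 0, 12), (1, 1, 9), (0, 1, 20)]
print(keep_site(["T"], site))
```

With any positive `min_maf`, filter 5 is implied by filter 4; it is kept separate here to mirror the list above.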

#### Documentation is still being written