add README.md

178074c4 · Helene Rimbert · 7a25d13f · 178074c4
Commit 178074c4 authored 4 years ago by Helene Rimbert
--- a/README.md
+++ b/README.md
+# TRANSFERTANNOT pipeline
+Snakemake pipeline used to transfert GFF annotation on a new assembly with a fine target mapping approach.
+
+## Dependancies
+
+### Snakemake
+* Version 5.5.2
+
+### Python3
+* Python: 3.5
+* Biopython: 1.68
+* numpy: 1.15
+* pandas: 0.23
+* pysam: 0.15
+  
+### Genomic Tools
+* Bedtools: 2.27
+* Blat: 36
+* GenomeTools: 1.5.9
+* gffread: 0.9.11
+* GMAP: 2018-05-11
+* NCBI-blast (BLAST+): 2.6
+* Samtools: 1.9
+
+## Prepare the pipeline
+
+### Creating teh configuration file
+
+The configuration file [config.yaml](config.yaml) will contain all the input files required for the pipeline and some other parameters.
+
+* Prepare the query genome data
+
+| Parameter in `config.yaml` | format |description |Example |  
+|----------------------------|--------|------------|--------|  
+|**annotationQuery** | GFF3 | The gff file of the annotation we want to transfert onto the new Target genome| annotationQuery: "/path/to/IWGSC_RefSeqv1_annotation.gff3"|  
+|**featureType** | \[STRING\] | The feature we want to use to anchore the annotation. Here we use the gene feature of the GFF.| featureType: 'gene|
+|**queryFasta** | FASTA | Fasta file of the genome reference sequence. Must match the GFF file used in `annotationQuery` parameter| queryFasta: '/path/to/IWGSC_RefSeqv1_annotation.fasta'|
+|**blastdb** | \[blast database\]|blast db of all mRNAs (all isoforms) of the query annotation. This will be used to rescue genes which have failed in the transfert|blastdb: 'data/IWGSCv1.1_all_mrna'|
+|**chromosomes**|python list|list of all the chromosomes in the query reference genome. This will be used to split all the data per chromosome and speed up the analysis|chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U']|
+
+* Prepare the markers/ISBPs input data
+
+The pipeline uses markers/ISBPs as anchores to target a restricted part of the target genome on which our genes are suspected to be located.
+
+In the first place, it is required to map such markers (ISBPs in our cases) with bwa on your target genome:
+
+```bash
+$ bwa mem IWGSC_RefSeqV2_bwaindex ISBPs.fasta |samtools view -bS - |samtools sort -o ISBPS_on_refseqv2.bam
+```
+
+| Parameter in `config.yaml` | format |description |Example |  
+|----------------------------|--------|------------|--------|  
+|**isbpBam**| BAM|BAM file of the mapping of markers/ISBPs on the target genome|isbpBam: 'ISBPS_on_refseqv2.bam'|
+|**isbpBed**| BED|Initial coordinates of the markers/ISBPs on the query genome|isbpBed: 'data/ISBP_refseqv1.bed'|
+|**mapq**| \[INT\]|Minimum mapping quality of the marker/ISBP to be kept in the anchoring process|mapq: 30|
+|**mismatches**| \[INT\]|Max missmatches allowed for a marker to be kept in the analysis|mismatches: 0|
+
+* Prepare the target genome data
+
+Here, we only need the fasta of the new genome assembly
+
+| Parameter in `config.yaml` | format |description |Example |  
+|----------------------------|--------|------------|--------|  
+|**targetFasta**|FASTA|fasta file of the target genome assembly on which we transfert the annotation|targetFasta: 'data/CS_pesudo_v2.1.fa'|
+
+* Output parameters/settings
+
+| Parameter in `config.yaml` | format |description |Example |  
+|----------------------------|--------|------------|--------|  
+|**results**| \[STRING\] | directory in which all the Snakemake rules will be executed|results: 'results'|
+|**finalPrefix**| \[STRING\] | Prefix for all the final output files (annotaion, mrna/pep fasta sequences ect)|finalPrefix: 'IWGSC_refseqv2.0_annotv2.0'|
+|**chromMapID**| CSV| Mapping file which sets the correspondance between the chromosome names in the GFF and the chromosome ID in the newlygenerated gene IDs|chromMapID: 'data/chromosomeMappingID.csv'|
+
+Example of `chromosomeMappingID.csv` file :
+
+```bash
+$ cat data/chromosomeMappingID.csv
+Chr1A   1A
+Chr1B   1B
+Chr1D   1D
+Chr2A   2A
+Chr2B   2B
+Chr2D   2D
+Chr3A   3A
+Chr3B   3B
+Chr3D   3D
+Chr4A   4A
+Chr4B   4B
+Chr4D   4D
+Chr5A   5A
+Chr5B   5B
+Chr5D   5D
+Chr6A   6A
+Chr6B   6B
+Chr6D   6D
+Chr7A   7A
+Chr7B   7B
+Chr7D   7D
+```
+
+Once all those parameters has been set up, the final configuration file may look like this:
+
+```yaml
+##### QUERY related files/parameters (refseqv1.0)
+# GFF annotatin to transfert
+annotationQuery: 'data/IWGSC_v1.1_20170706.gff3'
+# feature type used for anchoring on target genome
+featureType: 'gene'
+# FASTA of the query (used to check the sequences after the coordinates are calculated on the target genome)
+queryFasta: 'data/161010_Chinese_Spring_v1.0_pseudomolecules.fasta'
+# blastdb of all mrnas. used to rescue genes which have failed in the transfert using the targeted approache
+blastdb: 'data/IWGSCv1.1_all_mrna'
+# map of all chromosome ids --> NEED TO BE UPDATED in another version WITH ONE ARRAY FOR THE QUERY AND ONE ARRAY FOR THE TARGET GENOME ASSEMBLY
+chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U']
+refChrom: ['chr1A', 'chr1B', 'chr1D', 'chr2A', 'chr2B', 'chr2D', 'chr3A', 'chr3B', 'chr3D', 'chr4A', 'chr4B', 'chr4D', 'chr5A', 'chr5B', 'chr5D', 'chr6A', 'chr6B', 'chr6D', 'chr7A', 'chr7B', 'chr7D', 'chrUn']
+
+##### TARGET related files/parameters (refseqv2.1)
+targetFasta: 'data/CS_pesudo_v2.1.fa'
+
+##### ISBP/markers related config and parameters
+# BAM file of markers/ISBPs mapped on the target genome (REFSEQ v2.1)
+isbpBam: 'data/iwgsc_refseqv1_ISBP.bwav2.1.bam'
+# BED file of coordinates on the query genome (REFSEQ v1.0)
+isbpBed: 'data/ISBP_refseqv1.bed'
+# minimum mapping quality of markers on the target genome
+mapq: 30
+# max mismatches per ISBP/marker
+mismatches: 0
+
+##### OUTPUT directory
+results: 'resultsDEV'
+finalPrefix: 'IWGSC_refseqv2.0_annotv2.0'
+# this file contains two columns: the first is the chromosome name as it appears in the genome.fasta of the new reference,
+# and the second the chromosome name as it will appear in the new gene Names
+chromMapID: 'data/chromosomeMappingID.csv'
+```
+
+## Running the pipeline
+
+At first, it is recommended to make a dry-run of the analysis:
+```bash
+$ snakemake -nrp
+```
+This will check all the rules and th parameters in the `config.yaml` file and print all the command lines which would have been executed.  
+If there is no errors, then you can execute the pipeline with:
+
+```bash
+$ snakemake
+```
+
+If you have multiple CPUs available on your computer, you can choose to use them.  
+For example, if you want to use up to 8 CPUs in parallel, you can run:
+
+```bash
+$ snakemake -j 8
+```
+
+If you are on a computer cluster with a job scheduler, you can tell the pipeline to use this scheduler instead of runnin all the processes on the local machine:
+```bash
+$ snakemake -j 32 --cluster sbatch
+```
+
+This will allow to have at most 32 subproccess run through the SLURM scheduler with `sbatch`.
+
+You can generate the diagram of all the processes and dependancies of you analysis:
+```bash
+$ snakemake --dag |dot -T png > dag.png
+```
+This will generate a PNG file of your diagram.  
+![dag.png](dag.png)
+
+If you simply want the global process of the pipeline, you may run:
+```bash
+$ snakemake --rulegraph |dot -T png > rulegraph.png
+```
+This will generate a PNG file of your diagram.  
+![rulegraph.png](rulegraph.png)