# MAGATT pipeline

Marker Assisted Gene Annotation Transfer for Triticeae.

Snakemake pipeline used to transfer a GFF annotation onto a new assembly with a fine-grained targeted mapping approach.

## Install the pipeline

```console
$ git clone https://forgemia.inra.fr/umr-gdec/magatt.git
```

## Dependencies

### Build the magatt environment with conda

We recommend building the environment with conda (developed with Miniconda 3, conda 4.9.2) using the file [environment.yml](environment.yml):

```console
$ conda env create -f environment.yml -n magatt
```

Once created, you can activate the environment with:

```console
$ conda activate magatt
```

All the dependencies installed in the conda environment are listed below:

* Snakemake: 5.5.2
* Python: 3.5
* Biopython: 1.68
* numpy: 1.15
* pandas: 0.23
* pysam: 0.15
* Bedtools: 2.27
* Blat: 36
* Exonerate (fastavalidcds): 2.4.0
* GenomeTools: 1.5.9
* gffread: 0.9.11
* GMAP: 2018-05-11
* NCBI-blast (BLAST+): 2.6
* Samtools: 1.9

## Prepare and run the pipeline

### Creating the configuration file: inputs and other parameters

The configuration file [config.yaml](config.yaml) contains all the input files required by the pipeline, plus a few other parameters.

* Prepare the query genome data

| Parameter in `config.yaml` | Format | Description | Example |
|----------------------------|--------|-------------|---------|
|**annotationQuery**| GFF3 | GFF file of the annotation to transfer onto the new target genome | annotationQuery: '/path/to/IWGSC_RefSeqv1_annotation.gff3' |
|**featureType**| \[STRING\] | Feature used to anchor the annotation; here we use the gene feature of the GFF | featureType: 'gene' |
|**queryFasta**| FASTA | FASTA file of the query genome reference sequence; must match the GFF file given in `annotationQuery` | queryFasta: '/path/to/IWGSC_RefSeqv1_annotation.fasta' |
|**blastdb**| blast database | BLAST database of all mRNAs (all isoforms) of the query annotation, used to rescue genes whose transfer failed | blastdb: 'data/IWGSCv1.1_all_mrna' |
|**chromosomes**| python list | List of all the chromosomes of the query reference genome, used to split the data per chromosome and speed up the analysis | chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U'] |

* Prepare the markers/ISBPs input data

The pipeline uses markers/ISBPs as anchors to target the restricted region of the target genome where each gene is expected to be located. First, map those markers (ISBPs in our case) on your target genome with bwa:

```bash
$ bwa mem IWGSC_RefSeqV2_bwaindex ISBPs.fasta | samtools view -bS - | samtools sort -o ISBPS_on_refseqv2.bam
```

| Parameter in `config.yaml` | Format | Description | Example |
|----------------------------|--------|-------------|---------|
|**isbpBam**| BAM | BAM file of the markers/ISBPs mapped on the target genome | isbpBam: 'ISBPS_on_refseqv2.bam' |
|**isbpBed**| BED | Initial coordinates of the markers/ISBPs on the query genome | isbpBed: 'data/ISBP_refseqv1.bed' |
|**mapq**| \[INT\] | Minimum mapping quality for a marker/ISBP to be kept in the anchoring process | mapq: 30 |
|**mismatches**| \[INT\] | Maximum number of mismatches allowed for a marker to be kept in the analysis | mismatches: 0 |

* Prepare the target genome data

Here, we only need the FASTA of the new genome assembly.

| Parameter in `config.yaml` | Format | Description | Example |
|----------------------------|--------|-------------|---------|
|**targetFasta**| FASTA | FASTA file of the target genome assembly onto which the annotation is transferred | targetFasta: 'data/CS_pesudo_v2.1.fa' |
|**targetGmapIndex**| PATH | Name of the GMAP index directory, used with the `-d` option of `gmapl` | targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16' |
|**targetGmapIndexPath**| PATH | Full path to the directory containing the GMAP index, used with the `-D` option of `gmapl` | targetGmapIndexPath: '/home/data/triticum_aestivum/julius/gmapdb/all/' |

* Output parameters/settings

| Parameter in `config.yaml` | Format | Description | Example |
|----------------------------|--------|-------------|---------|
|**results**| \[STRING\] | Directory in which all the Snakemake rules are executed | results: 'results' |
|**finalPrefix**| \[STRING\] | Prefix of all the final output files (annotation, mRNA/pep FASTA sequences, etc.) | finalPrefix: 'IWGSC_refseqv2.0_annotv2.0' |
|**chromMapID**| CSV | Mapping file giving the correspondence between the chromosome names in the GFF and the chromosome IDs used in the newly generated gene IDs | chromMapID: 'data/chromosomeMappingID.csv' |
|**transferType**| \[STRING\] | Transfer all isoforms or only the representative transcript (.1) of each gene of the reference genome | transferType: 'all' ; transferType: 'first' |

Example of `chromosomeMappingID.csv` file:

```bash
$ cat data/chromosomeMappingID.csv
Chr1A	1A
Chr1B	1B
Chr1D	1D
Chr2A	2A
Chr2B	2B
Chr2D	2D
Chr3A	3A
Chr3B	3B
Chr3D	3D
Chr4A	4A
Chr4B	4B
Chr4D	4D
Chr5A	5A
Chr5B	5B
Chr5D	5D
Chr6A	6A
Chr6B	6B
Chr6D	6D
Chr7A	7A
Chr7B	7B
Chr7D	7D
```

Once all those parameters have been set, the final configuration file may look like this:

```yaml
##### QUERY related files/parameters (refseqv1.0)
# GFF annotation to transfer
annotationQuery: 'data/IWGSC_v1.1_20170706.gff3'
# feature type used for anchoring on the target genome
featureType: 'gene'
# FASTA of the query (used to check the sequences once the coordinates are computed on the target genome)
queryFasta: 'data/161010_Chinese_Spring_v1.0_pseudomolecules.fasta'
# blastdb of all mRNAs, used to rescue genes which have failed the targeted transfer
blastdb: 'data/IWGSCv1.1_all_mrna'
# map of all chromosome ids --> NEED TO BE UPDATED in another version WITH ONE ARRAY FOR THE QUERY AND ONE ARRAY FOR THE TARGET GENOME ASSEMBLY
chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U']
refChrom: ['chr1A', 'chr1B', 'chr1D', 'chr2A', 'chr2B', 'chr2D', 'chr3A', 'chr3B', 'chr3D', 'chr4A', 'chr4B', 'chr4D', 'chr5A', 'chr5B', 'chr5D', 'chr6A', 'chr6B', 'chr6D', 'chr7A', 'chr7B', 'chr7D', 'chrUn']

##### TARGET related files/parameters (refseqv2.1)
targetFasta: 'data/CS_pesudo_v2.1.fa'
# GMAP index of the genome, for the -d option
targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16'
# GMAP index: path to the gmapindex directory, for the -D option
targetGmapIndexPath: '/home/herimbert/gdec/shared/triticum_aestivum/julius/current/gmapdb/all/'

##### ISBP/markers related config and parameters
# BAM file of markers/ISBPs mapped on the target genome (REFSEQ v2.1)
isbpBam: 'data/iwgsc_refseqv1_ISBP.bwav2.1.bam'
# BED file of coordinates on the query genome (REFSEQ v1.0)
isbpBed: 'data/ISBP_refseqv1.bed'
# minimum mapping quality of markers on the target genome
mapq: 30
# max mismatches per ISBP/marker
mismatches: 0

##### OUTPUT directory
results: 'resultsDEV'
finalPrefix: 'IWGSC_refseqv2.0_annotv2.0'
# this file contains two columns: the first is the chromosome name as it appears in the genome.fasta of the new reference,
# and the second the chromosome name as it will appear in the new gene names
chromMapID: 'data/chromosomeMappingID.csv'
```

## Running the pipeline

First, it is recommended to make a dry run of the analysis:

```bash
$ snakemake -nrp
```

This checks all the rules and the parameters of the `config.yaml` file, and prints all the command lines that would be executed.
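Before launching a real run, it can also help to sanity-check the `chromosomeMappingID.csv` file described above. The minimal Python sketch below is a hypothetical helper, not part of MAGATT; it assumes the file is tab-separated, with the chromosome name from the genome FASTA in the first column and the short ID used in the new gene names in the second:

```python
import csv
import io

# Sample content mirroring data/chromosomeMappingID.csv (tab-separated).
# In practice you would open the real file instead of this string.
sample = "Chr1A\t1A\nChr1B\t1B\nChr1D\t1D\n"

def load_chrom_map(handle):
    """Return {fasta_chromosome_name: short_id} from a two-column TSV."""
    chrom_map = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {row!r}")
        fasta_name, short_id = row
        if fasta_name in chrom_map:
            raise ValueError(f"duplicate chromosome name: {fasta_name}")
        chrom_map[fasta_name] = short_id
    return chrom_map

chrom_map = load_chrom_map(io.StringIO(sample))
print(chrom_map["Chr1A"])  # prints 1A
```

A quick check like this catches duplicated or malformed lines before they surface as cryptic errors deep in the pipeline.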
If there are no errors, you can execute the pipeline with:

```bash
$ snakemake
```

If you have multiple CPUs available on your computer, you can choose to use them. For example, to use up to 8 CPUs in parallel:

```bash
$ snakemake -j 8
```

If you are on a compute cluster with a job scheduler, you can tell the pipeline to use this scheduler instead of running all the processes on the local machine:

```bash
$ snakemake -j 32 --cluster sbatch
```

This allows at most 32 jobs to run through the SLURM scheduler with `sbatch`. You can use a custom [cluster.json](cluster.json) JSON file to set up the SBATCH parameters of each rule, and use it with:

```console
$ snakemake -j 32 --cluster-config cluster-hpc2.json --cluster "sbatch -J {cluster.jobName} -c {cluster.c} --mem {cluster.mem} -e {cluster.error} -o {cluster.output} -p gdec" --verbose
```

The pipeline comes with conda environment files which can be used by Snakemake. To enable the use of conda:

```console
$ snakemake --use-conda -j 8 --cluster-config cluster-hpc2.json --cluster "sbatch -p {cluster.partition} -c {cluster.c} -N 1 -t {cluster.time} -J {cluster.jobName} --mem={cluster.mem} -e {cluster.error} -o {cluster.output}"
```

It is possible to force Snakemake to wait a defined amount of time after each job, in case of latency on the filesystem of your cluster/server:

```console
# wait 30 seconds after each job before checking for output files
$ snakemake --latency-wait 30 [...]
```

You can generate the diagram of all the processes and dependencies of your analysis:

```bash
$ snakemake --dag | dot -T png > dag.png
```

This generates a PNG file of the full job graph. If you only want the global flow of the pipeline, you may run:

```bash
$ snakemake --rulegraph | dot -T png > rulegraph.png
```

This generates a PNG file of the rule graph.
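The cluster-configuration file passed via `--cluster-config` maps each rule (plus a `__default__` entry used as a fallback) to SBATCH resources; its keys must match the `{cluster.*}` placeholders used in the `--cluster` command line above (`jobName`, `c`, `mem`, `time`, `error`, `output`, `partition`). The fragment below is an illustrative sketch, not the pipeline's actual `cluster.json`; the rule name `gmapAlign` and all values are hypothetical:

```json
{
    "__default__": {
        "jobName": "magatt.{rule}",
        "partition": "normal",
        "c": 1,
        "mem": "4G",
        "time": "02:00:00",
        "error": "logs/{rule}.{wildcards}.err",
        "output": "logs/{rule}.{wildcards}.out"
    },
    "gmapAlign": {
        "c": 8,
        "mem": "16G"
    }
}
```

Per-rule entries only need to list the values that differ from `__default__`; here the hypothetical `gmapAlign` rule requests more CPUs and memory while inheriting everything else.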