* [Practical course: Transposable Elements identification with The REPET package](#practical-course-transposable-elements-identification-with-the-repet-package)
* [Run the REPET pipelines](#run-the-repet-pipelines)
* [Setup The REPET package environment](#setup-the-repet-package-environment)
* [Start TEdenovo pipeline](#start-tedenovo-pipeline)
* [Alternatively, you can launch the TEdenovo pipeline step by step:](#alternatively-you-can-launch-the-tedenovo-pipeline-step-by-step)
* [Post TEdenovo pipeline](#post-tedenovo-pipeline)
* [Parse MCL clustering results (TEdenovo step 8): create a list (tabulated file) with 2 columns "Cluster_id TE_id"](#parse-mcl-clustering-results-tedenovo-step-8-create-a-list-tabulated-file-with-2-columns-cluster_id-te_id)
* [Get all the annotations done by PASTEC (TEdenovo, step 5) on the Consensus](#get-all-the-annotations-done-by-pastec-tedenovo-step-5-on-the-consensus)
* [Get the multiple-alignment used to build the consensus](#get-the-multiple-alignment-used-to-build-the-consensus)
* [Start TEannot pipeline](#start-teannot-pipeline)
* [Alternatively, you can launch the TEannot.py pipeline step by step:](#alternatively-you-can-launch-the-teannotpy-pipeline-step-by-step)
* [Post TEannot pipeline](#post-teannot-pipeline)
* [TEdenovo consensus library classification corresponding to Chig_refTEs.fa](#tedenovo-consensus-library-classification-corresponding-to-chig_reftesfa)
* [Concatenate all gff files of genome annotation in one](#concatenate-all-gff-files-of-genome-annotation-in-one)
* [Compute statistics of TE genome annotation](#compute-statistics-of-te-genome-annotation)
* [Compute and plot the consensuses coverage](#compute-and-plot-the-consensuses-coverage)
* [Select consensus for the second round of TEannot](#select-consensus-for-the-second-round-of-teannot)
* [Results analysis](#results-analysis)
* [TEdenovo (and post TEdenovo) most interesting output files](#tedenovo-and-post-tedenovo-most-interesting-output-files)
* [TEdenovo output directories](#tedenovo-output-directories)
* [TEdenovo consensus library](#tedenovo-consensus-library)
* [TEdenovo consensus library after filtering of “noCat” consensus built using less than 10 copies and consensus classified as SSR – This library is used as input of TEannot pipeline](#tedenovo-consensus-library-after-filtering-of-nocat-consensus-built-using-less-than-10-copies-and-consensus-classified-as-ssr-this-library-is-used-as-input-of-teannot-pipeline)
* [Classification of TEdenovo consensus library (All consensuses including SSR and noCat built with less than 10 HSPs) according to Wicker classification nomenclature](#classification-of-tedenovo-consensus-library-all-consensuses-including-ssr-and-nocat-built-with-less-than-10-hsps-according-to-wicker-classification-nomenclature)
* [Classification statistics (All consensuses including SSR and noCat built with less than 10 HSPs)](#classification-statistics-all-consensuses-including-ssr-and-nocat-built-with-less-than-10-hsps)
* [MCL clustering output files](#mcl-clustering-output-files)
* [TEannot (and post TEannot) most interesting output files](#teannot-and-post-teannot-most-interesting-output-files)
* [TEannot output directories](#teannot-output-directories)
* [Genome annotation file](#genome-annotation-file)
* [Classification of TEdenovo consensus library corresponding to Chig_refTEs.fa](#classification-of-tedenovo-consensus-library-corresponding-to-chig_reftesfa)
* [Genome annotation global statistics file](#genome-annotation-global-statistics-file)
* [TE annotation statistics per consensus](#te-annotation-statistics-per-consensus)
* [Annexes](#annexes)
* [Additional commands](#additional-commands)
* [Practical course: Manual curation of the transposable elements library](#practical-course-manual-curation-of-the-transposable-elements-library)
* [Compilation of consensus information : classification, genome annotation statistics, MCL clustering](#compilation-of-consensus-information-classification-genome-annotation-statistics-mcl-clustering)
* [Consensus annotation (from PASTEC classifier) using IGV genome browser](#consensus-annotation-from-pastec-classifier-using-igv-genome-browser)
* [Display multiple alignment of HSP used to build the consensus using Jaview](#display-multiple-alignment-of-hsp-used-to-build-the-consensus-using-jaview)
* [Plot genome copies related to a consensus](#plot-genome-copies-related-to-a-consensus)

# Practical course: Transposable Elements identification with [The REPET package](https://forgemia.inra.fr/urgi-anagen/wiki-repet/-/wikis/REPET-V2.5-tutorial)
|
|
|
|
|
|
```plaintext
This tutorial was written by Joelle Amselem and Nathalie Choisne in the frame of Elixir and URGI
TE annotation training sessions, using URGI cloud Virtual Machines.
Note that REPET v2.5 was run on the Colletotrichum higginsianum dataset.
You should adapt the command paths to your environment.
```
|
|
|
|
|
|
## Run the REPET pipelines
|
|
|
|
|
|
### Setup [The REPET package](https://forgemia.inra.fr/urgi-anagen/wiki-repet/-/wikis/REPET-V2.5-tutorial) environment
|
|
|
|
|
|
* Connect to the virtual machine containing the REPET installation:
|
|
|
|
|
|
`ssh -XY guestFormation@IP`
|
|
|
|
|
|
* Your home directory is by default : "/home/guestFormation"
|
|
|
* To start a new project, create a folder with the project name « Chig » :
|
|
|
|
|
|
`mkdir Chig`
|
|
|
|
|
|
* Change directory, then copy and source the environment file used by the REPET software:
|
|
|
|
|
|
`cd Chig `\
|
|
|
`cp ~/data/setEnv.sh ./ `\
|
|
|
`. setEnv.sh`
|
|
|
|
|
|
\-Check the database parameters in the « setEnv.sh » configuration file:
|
|
|
|
|
|
`more setEnv.sh`
|
|
|
|
|
|
`mysql -h $REPET_HOST -u $REPET_USER -p$REPET_PW $REPET_DB`
|
|
|
|
|
|
### Start TEdenovo pipeline
|
|
|
|
|
|
* Create a directory to launch TEdenovo
|
|
|
|
|
|
`mkdir TEdenovo`\
|
|
|
`cd TEdenovo`
|
|
|
|
|
|
* Make a symbolic link (ln -s) to the input FASTA file of the genomic sequences. The genome FASTA file must be named “project_name.fa” (here, “Chig.fa”):
|
|
|
|
|
|
`ln -s ~/data/Chig.fa Chig.fa`
|
|
|
|
|
|
* Make a link (ln -s) to access the databanks used in similarity based classification.
|
|
|
|
|
|
`ln -s ~/data/ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm`\
|
|
|
`ln -s ~/data/rRNA_Eukaryota.fsa`\
|
|
|
`ln -s ~/data/repbase22.05_aaSeq_cleaned_TE.fa`\
|
|
|
`ln -s ~/data/repbase22.05_ntSeq_cleaned_TE.fa`
|
|
|
|
|
|
* Copy the configuration file « TEdenovo.cfg », into your TEdenovo working directory:
|
|
|
|
|
|
(The original TEdenovo.cfg is available at “$REPET_PATH/config/TEdenovo.cfg”)
|
|
|
|
|
|
`cp ~/data/TEdenovo.cfg ./`
|
|
|
|
|
|
\-Check if the configuration file is properly filled before launching TEdenovo:
|
|
|
|
|
|
`gedit TEdenovo.cfg >/dev/null 2>&1 &`
|
|
|
|
|
|
```plaintext
[repet_env]
repet_version: 2.5
...
repet_job_manager: slurm

[project]
project_name: Chig
project_dir: /home/guestFormation/Chig/TEdenovo
…
[detect_features]
…
TE_BLRn: yes
TE_BLRtx: yes
TE_nucl_bank: repbase22.05_ntSeq_cleaned_TE.fsa
TE_BLRx: yes
TE_prot_bank: repbase22.05_aaSeq_cleaned_TE.fsa
TE_HMMER: yes
TE_HMM_profiles: ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm
…
rDNA_BLRn: yes
rDNA_bank: rRNA_Eukaryota.fsa
```
|
|
|
|
|
|
* **The TEdenovo pipeline consists of 8 steps that can be launched with a single command line**:
|
|
|
|
|
|
`nohup launch_TEdenovo.py -P Chig -C TEdenovo.cfg -f MCL >& TEdenovo.log &`
|
|
|
|
|
|
**P**: project name
|
|
|
|
|
|
**f**: clustering program used to find consensus families
|
|
|
|
|
|
* Useful commands to follow the progress of steps
|
|
|
|
|
|
\- job status (under Slurm)
|
|
|
|
|
|
`squeue`
|
|
|
|
|
|
\- the log files. ex:
|
|
|
|
|
|
`more TEdenovo.log`\
|
|
|
`tail TEdenovo.log`
|
|
|
|
|
|
#### Alternatively, you can launch the TEdenovo pipeline step by step:
|
|
|
|
|
|
`nohup TEdenovo.py -P name -C config.cfg -S step -[specific-step-param]`
|
|
|
|
|
|
![400px-TEdenovo_1-2](uploads/bbba43a19a359edc2fc81efe2c21056d/400px-TEdenovo_1-2.png)
|
|
|
|
|
|
* Step 1: Genomic sequences are cut and grouped into batches
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 1 >& runS1.log &`
|
|
|
|
|
|
* Step 2: The genome is aligned to itself using BLAST
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 2 -s Blaster >& runS2.log &`
|
|
|
|
|
|
![400px-TEdenovo_3](uploads/5907451fc26f37fa8cdf4b4d8a08e7d8/400px-TEdenovo_3.png)
|
|
|
|
|
|
* Step 3: The repetitive HSPs from BLAST are clustered by Recon, Grouper and/or Piler
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 3 -s Blaster -c Grouper >& runS3G.log &`
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 3 -s Blaster -c Recon >& runS3R.log &`
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 3 -s Blaster -c Piler >& runS3P.log &`
|
|
|
|
|
|
![400px-TEdenovo_4](uploads/b7eddeae30ffb0ce2e7d13992aff646f/400px-TEdenovo_4.png)
|
|
|
|
|
|
* Step 4: A multiple alignment is computed for each cluster, and a consensus sequence is derived from each multiple alignment
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map >& runS4G.log &`
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map >& runS4R.log &`
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map >& runS4P.log &`
|
|
|
|
|
|
![400px-TEdenovo_5-6-7](uploads/d5a0b42979d45f73a6fb1f25ceb3b6fa/400px-TEdenovo_5-6-7.png)
|
|
|
|
|
|
* Step 5: Particular features are detected on each consensus, such as structural features or homology with known TE, HMM profiles or host genes
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map >& runS5.log &`
|
|
|
|
|
|
MySQL tables are created; they contain the evidence for consensus annotation used by the PASTEC classifier.
|
|
|
|
|
|
* Step 6: The consensuses are classified using Wicker's TEs classification
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map >& runS6.log &`
|
|
|
|
|
|
* Step 7: SSR and under-represented unclassified ("noCat") consensuses are filtered out
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map >& runS7.log &`
|
|
|
|
|
|
* Step 8: The consensuses are clustered into families to facilitate manual curation using Blastclust or MCL
|
|
|
|
|
|
`nohup TEdenovo.py -P Chig -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL >& runS8.log &`
|
|
|
|
|
|
### Post TEdenovo pipeline
|
|
|
|
|
|
#### Parse MCL clustering results (TEdenovo step 8): create a list (tabulated file) with 2 columns "Cluster_id TE_id"
|
|
|
|
|
|
`cd ~/Chig/TEdenovo/Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL`\
|
|
|
`gawk -F"_MCL|_Chig" '{if(/>/){gsub(">","",$0);print "MCL\t"$2"\t"$1"_Chig"$3}}' Chig_sim_denovoLibTEs_filtered_MCL.fa \`\
|
|
|
`| sort -nk2,2 \`\
|
|
|
`| gawk -F"\t" '{print $1$2"\t"$3}' > Chig_sim_denovoLibTEs_filtered_MCL.lst`
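
To make the header parsing above easier to follow, here is a minimal sketch of the same idea on two sample headers taken from the MCL library shown later in this page. It assumes POSIX `awk` rather than `gawk`, and the function name `parse_mcl_headers` is hypothetical (it is not part of REPET).

```shell
# Sketch: turn MCL-tagged FASTA headers (read on stdin) into "MCLn<TAB>TE_id" lines.
# parse_mcl_headers is a hypothetical helper name, not a REPET command.
parse_mcl_headers() {
  awk -F'_MCL|_Chig' '
    /^>/ {
      sub(/^>/, "")                 # drop the FASTA marker; fields are re-split
      print $2 "\t" $1 "_Chig" $3   # cluster number, then the original TE name
    }' |
  sort -t "$(printf '\t')" -k1,1n |   # numeric sort so that MCL2 comes before MCL12
  awk -F'\t' '{ print "MCL" $1 "\t" $2 }'
}

printf '%s\n' '>RLX-incomp_MCL12_Chig-B-G102-Map10' 'GAATTT' \
              '>noCat_MCL2_Chig-B-G1-Map20' 'AGGTAG' | parse_mcl_headers
# MCL2    noCat_Chig-B-G1-Map20
# MCL12   RLX-incomp_Chig-B-G102-Map10
```

The split on `_MCL` or `_Chig` isolates the cluster number as the second field, which is why the project name has to be put back when the TE name is rebuilt.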
|
|
|
|
|
|
#### Get all the annotations done by PASTEC (TEdenovo, step 5) on the Consensus
|
|
|
|
|
|
A GFF file is created for each analysis output of step 5 (detect features). These GFF annotation files can be viewed in a genome browser such as IGV:
|
|
|
|
|
|
* Copy the configuration files « CreateGFF3sForClassifFeatures.cfg » into your working directory:
|
|
|
|
|
|
`cd ~/Chig/TEdenovo `\
|
|
|
`cp ~/data/CreateGFF3sForClassifFeatures.cfg ./`
|
|
|
|
|
|
* Check if the configuration file is properly filled before launching CreateGFF3sForClassifFeatures:
|
|
|
|
|
|
`gedit CreateGFF3sForClassifFeatures.cfg >/dev/null 2>&1 &`
|
|
|
|
|
|
```plaintext
[repet_env]
...
repet_job_manager: slurm

[project]
project_name: Chig
project_dir: /home/guestFormation/Chig/TEdenovo

[gff3_TEdenovo_options]
add_classif_infos: yes
TR: yes
polyA: yes
ORF: yes
TE_BLRn: yes
TE_BLRtx: yes
TE_BLRx: yes
HG_BLRn: no
rDNA_BLRn: yes
tRNA: no
Profiles: yes
SSR: yes

[gff3_TEannot_options]
project_name_teannot: Chig
annotated_copies: no

[other]
original_HSP: yes
```
|
|
|
|
|
|
* Launch the CreateGFF3sForClassifFeatures:
|
|
|
|
|
|
`. ~/data/setEnv.sh `\
|
|
|
`nohup CreateGFF3sForClassifFeatures.py -C CreateGFF3sForClassifFeatures.cfg -f Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered/Chig_sim_denovoLibTEs_filtered.fa -v 3 >& CreateGFF3sForClassifFeatures.log &`
|
|
|
|
|
|
**C**: Configuration file

**f**: Consensus sequences (FASTA file) produced by TEdenovo.
|
|
|
|
|
|
A new directory "Visualization_Files" is created
|
|
|
|
|
|
* Reverse-complement the coordinates of "\*_reversed" consensus
|
|
|
|
|
|
Indeed, the consensus annotations used for classification are computed before step 6, where some consensuses are reverse-complemented. The coordinates of these annotations are not reversed in the database tables, so the GFF files produced by CreateGFF3sForClassifFeatures.py in release 2.5 need a patch (the fix will be included in the next release, REPET v3).
|
|
|
|
|
|
\- Create a new directory for reverse-complemented GFF\
|
|
|
`cd Visualization_Files/; mkdir gff_reversed`
|
|
|
|
|
|
\- Create a file with 2 columns consensus name and length\
|
|
|
` cut -f1,2 ../Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered/classifFileFromList.classif > Chig_sim_denovoLibTEs_filtered.len`
|
|
|
|
|
|
\- Reverse complement\
|
|
|
`` for file in `ls *.gff3`; ``\
|
|
|
`do`\
|
|
|
`grep -P "^#" $file > gff_reversed/$file;`\
|
|
|
`while read TE len;`\
|
|
|
`do gawk -F"\t" '{if($1 ~ /_reversed/ && $1 ~ /'$TE'/){rstart='$len'-$5+1;rend='$len'-$4+1; if($7 ~ /+/){rstr="-"}; if($7 ~ /-/){rstr="+"};OFS="\t";print $1,$2,$3,rstart,rend,$6,rstr,$8,$9}else{if($1 ~ /'$TE'/){print $0}}}' $file; done < Chig_sim_denovoLibTEs_filtered.len >> gff_reversed/$file;`\
|
|
|
`done`
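
The coordinate arithmetic in the loop above can be checked on a single feature. The sketch below is an illustration only: it assumes POSIX `awk`, flips every feature line it receives (the loop above additionally restricts the flip to "_reversed" consensus names), and `flip_gff_coords` is a hypothetical helper name.

```shell
# Sketch: map GFF3 features onto the reverse-complemented consensus.
# $1 is the consensus length; flip_gff_coords is a hypothetical helper name.
flip_gff_coords() {
  awk -F'\t' -v OFS='\t' -v len="$1" '
    /^#/ { print; next }               # keep GFF3 header lines untouched
    {
      new_start = len - $5 + 1         # the old end becomes the new start
      new_end   = len - $4 + 1         # the old start becomes the new end
      strand = $7
      if ($7 == "+") strand = "-"
      else if ($7 == "-") strand = "+"
      $4 = new_start; $5 = new_end; $7 = strand
      print
    }'
}

printf 'TE1_reversed\tsrc\tmatch\t10\t100\t.\t+\t.\tID=x\n' | flip_gff_coords 1590
# TE1_reversed  src  match  1491  1581  .  -  .  ID=x   (tab-separated)
```

For a consensus of length 1590, a feature at 10..100 on the "+" strand lands at 1491..1581 on the "-" strand, so the feature length (91 bp) is preserved.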
|
|
|
|
|
|
#### Get the multiple-alignment used to build the consensus
|
|
|
|
|
|
The "original_HSP: yes" option in the CreateGFF3sForClassifFeatures.cfg config file creates a new directory "Original_HSP_fastaAlignment" with symbolic links to the multiple alignments used to build the consensuses.\
These files can be loaded and browsed in Jalview. Note that they are not reversed, and that a base is kept in the consensus only if it is shared by at least 2 HSPs.
|
|
|
|
|
|
`TEdenovo/Visualization_Files/Original_HSP_fastaAlignment/*.fa_aln`
|
|
|
|
|
|
### Start TEannot pipeline
|
|
|
|
|
|
* Copy the configuration files « TEannot.cfg » into your working directory:
|
|
|
|
|
|
(The original TEannot.cfg file is available at $REPET_PATH/config/TEannot.cfg)
|
|
|
|
|
|
`cd ; cd Chig`\
|
|
|
`mkdir TEannot/; cd TEannot/`\
|
|
|
`cp ~/data/TEannot.cfg ./`
|
|
|
|
|
|
* Check if the configuration file is properly filled before launching TEannot:
|
|
|
|
|
|
`gedit TEannot.cfg >/dev/null 2>&1 &`
|
|
|
|
|
|
```plaintext
|
|
|
[repet_env]
|
|
|
...
|
|
|
repet_job_manager: slurm
|
|
|
|
|
|
[project]
|
|
|
project_name: Chig
|
|
|
project_dir: /home/guestFormation/Chig/TEannot
|
|
|
…
|
|
|
[export]
|
|
|
…
|
|
|
gff3_merge_redundant_features: yes
|
|
|
gff3_compulsory_match_part: yes
|
|
|
gff3_with_genomic_sequence: no
|
|
|
gff3_with_TE_length: yes
|
|
|
gff3_with_classif_info: yes
|
|
|
classif_table_name: Chig_sim_consensus_classif
|
|
|
```
|
|
|
|
|
|
* Link to the TEdenovo consensus library
|
|
|
|
|
|
This library contains the consensuses that remain after filtering out “noCat” consensuses built from fewer than 10 copies and consensuses classified as SSR
|
|
|
|
|
|
`ln -s ../TEdenovo/Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered/Chig_sim_denovoLibTEs_filtered.fa Chig_refTEs.fa`
|
|
|
|
|
|
* Link to the input fasta file of the genomic sequences
|
|
|
|
|
|
`ln -s ~/data/Chig.fa`
|
|
|
|
|
|
* Source the environment before launching the REPET pipeline (needed if you opened a new terminal window after TEdenovo)
|
|
|
|
|
|
`. ~/Chig/setEnv.sh`
|
|
|
|
|
|
* **The TEannot pipeline consists of 8 steps that you can launch with a single command line:**
|
|
|
|
|
|
`nohup launch_TEannot.py -P Chig -C TEannot.cfg -e >& TEannot.log &`
|
|
|
|
|
|
**P**: project_name
|
|
|
|
|
|
#### Alternatively, you can launch the TEannot.py pipeline step by step:
|
|
|
|
|
|
`nohup TEannot.py -P name -C config.cfg -S step -[specific-step-param]`
|
|
|
|
|
|
![400px-TEannot_1-2-3](uploads/bd18a459024a8da599a30eabcea5bffd/400px-TEannot_1-2-3.png)
|
|
|
|
|
|
* Step 1: The first step prepares all the data banks required in the next steps
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 1 >& runS1.log &`
|
|
|
|
|
|
* Step 2: aligns the reference TE sequences on each genomic chunk via BLASTER (high sensitivity, followed by MATCHER) AND/OR REPEATMASKER (cutoff at 200) AND/OR CENSOR (high sensitivity)
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 2 -a BLR >& runS2BLR.log &`
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 2 -a RM >& runS2RM.log &`
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 2 -a CEN >& runS2CEN.log &`
|
|
|
|
|
|
* Step 2 bis: same as step 2, but on randomized sequences, to generate the filtering threshold
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 2 -a BLR -r >& runS2BLRr.log &`
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 2 -a RM -r >& runS2RMr.log &`
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 2 -a CEN -r >& runS2CENr.log &`
|
|
|
|
|
|
* Step 3: filters and combines the HSPs obtained at step 2, i.e. the TE annotations
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 3 -c BLR+RM+CEN >& runS3.log &`
|
|
|
|
|
|
![400px-TEannot_4-5](uploads/c1181974802711024711d716a4d0332d/400px-TEannot_4-5.png)
|
|
|
|
|
|
* Step 4: search for satellites on the genomic sequences via TRF, Mreps and RepeatMasker
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 4 -s TRF >& runS4TRF.log &`
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 4 -s Mreps >& runS4Mreps.log &`
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 4 -s RMSSR >& runS4RMSSR.log &`
|
|
|
|
|
|
* Step 5: merges the SSR annotations from the 3 programs used at the previous step
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 5 >& runS5.log &`
|
|
|
|
|
|
* Step 6: compares the genomic sequences with a data bank (nucleotides or amino-acids, in fasta format, e.g. Repbase Update)

(not mandatory) - Useful when TEs are too degenerate to build "reliable" consensuses
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 6 -b tblastx >& runS6btx.log &`
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 6 -b blastx >& runS6bx.log &`
|
|
|
|
|
|
![400px-TEannot_7](uploads/65ea06363d2e045d12f7c345aeb37380/400px-TEannot_7.png)
|
|
|
|
|
|
* Step 7: performs successive procedures such as removal of redundant TE, removal of SSR annotations included into TE annotations and "long join procedure"
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 7 >& runS7.log &`
|
|
|
|
|
|
* Step 8: export annotations to GFF3 format
|
|
|
|
|
|
`nohup TEannot.py -P Chig -C TEannot.cfg -S 8 -o GFF3 >& runS8.log &`
|
|
|
|
|
|
### Post TEannot pipeline
|
|
|
|
|
|
#### TEdenovo consensus library classification corresponding to Chig_refTEs.fa
|
|
|
|
|
|
`cd ~/Chig/TEannot`\
|
|
|
`gawk '{if(/>/){gsub(">","",$0);print}}' Chig_refTEs.fa >Chig_refTEs.lst`
|
|
|
|
|
|
`egrep -f Chig_refTEs.lst ../TEdenovo/Chig_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/Chig_sim_withoutRedundancy_negStrandReversed_WickerH.classif > Chig_refTEs.classif`
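
Note that `egrep -f` treats each name in the list as a regular expression and matches it anywhere in the line, so a name such as `...-Map2` would also select `...-Map20`, and a `.` in a name matches any character. As a hedged alternative, the sketch below keeps only the rows whose first column is exactly one of the listed names. It assumes the `.classif` file is tab-separated with the consensus name in column 1, and `select_by_name` is a hypothetical helper name.

```shell
# Sketch: exact-match selection of rows by the name in column 1.
# usage: select_by_name names.lst table.tsv   (hypothetical helper, not part of REPET)
select_by_name() {
  awk -F'\t' '
    NR == FNR { keep[$1] = 1; next }   # first file: remember every listed name
    ($1 in keep)                       # second file: print rows whose name is listed
  ' "$1" "$2"
}

# Example call with the files used in this section:
# select_by_name Chig_refTEs.lst ../TEdenovo/Chig_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/Chig_sim_withoutRedundancy_negStrandReversed_WickerH.classif > Chig_refTEs.classif
```

The list file must not be empty, otherwise the `NR == FNR` test also holds for the second file and nothing is printed.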
|
|
|
|
|
|
#### Concatenate all gff files of genome annotation in one
|
|
|
|
|
|
The outputs of TEannot step 8 are genome annotations in GFF3 format (and/or gameXML):
|
|
|
|
|
|
`cd ~/Chig/TEannot `\
|
|
|
`cat Chig_GFF3chr/*.gff3 |grep -v "##" > Chig_refTEs.gff `\
|
|
|
`rm -r Chig_GFF3chr Chig_gameXMLchr`
|
|
|
|
|
|
#### Compute statistics of TE genome annotation
|
|
|
|
|
|
* Launch the "PostAnalyzeTELib.py" script to generate statistics about the genome annotation of the TEs identified by the TEdenovo pipeline.
|
|
|
|
|
|
`. ~/Chig/setEnv.sh `\
|
|
|
`nohup PostAnalyzeTELib.py -a 3 -g 50819261 -p Chig_chr_allTEs_nr_noSSR_join_path -s Chig_refTEs_seq -v 2 >& runPostAnalyze.log &`
|
|
|
|
|
|
**g**: Genome length in bp (here, the total length of the sequences in Chig.fa).

**p**: Project name + "_chr_allTEs_nr_noSSR_join_path"
|
|
|
|
|
|
**s**: Project name + "_refTEs_seq"
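
If the genome length passed to `-g` is not known in advance, it can be estimated from the FASTA file. The sketch below simply sums the lengths of all non-header lines; it assumes POSIX `awk`, and `genome_length` is a hypothetical helper name. Whether PostAnalyzeTELib.py expects exactly this count (for example with or without N stretches) is not stated here, so treat the result as an approximation to check.

```shell
# Sketch: total number of residues in a FASTA file (all sequences summed).
# usage: genome_length Chig.fa   (hypothetical helper, not part of REPET)
genome_length() {
  awk '
    !/^>/ {
      gsub(/[ \t\r]/, "")        # ignore stray whitespace and Windows line endings
      total += length($0)
    }
    END { print total + 0 }      # print 0 for an empty file
  ' "$1"
}
```

For example, `genome_length ~/data/Chig.fa` prints a single integer that can be compared with the value given to `-g`.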
|
|
|
|
|
|
#### Compute and plot the consensuses coverage
|
|
|
|
|
|
* Launch "plotCoverage.py". Each output image file (plotCoverage/\*.png) is a plot of the coordinates of the copies on their respective TE consensus sequence.
|
|
|
|
|
|
`mkdir plotCoverage`
|
|
|
|
|
|
`python $PYTHONPATH/SMART/Java/Python/plotCoverage.py -i Chig_refTEs.gff -f gff3 -q Chig_refTEs.fa --merge -l grey -o plotCoverage/Chig >& runPlotCoverage.log &`
|
|
|
|
|
|
`rm *.Rout`
|
|
|
|
|
|
**i**: Genome annotation file (gff).
|
|
|
|
|
|
**f**: the file format
|
|
|
|
|
|
**q**: the consensus sequences used in the TEannot
|
|
|
|
|
|
**o**: output directory and project_name prefix
|
|
|
|
|
|
#### Select consensus for the second round of TEannot
|
|
|
|
|
|
* Launch "GetSpecificTELibAccordingToAnnotation.py" to select 3 subsets of the consensus library used in the first TEannot run
|
|
|
|
|
|
`nohup GetSpecificTELibAccordingToAnnotation.py -i Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE.tab -t Chig_refTEs_seq -v1 >& GetSpecificTELibAccordingToAnnotation.log &`
|
|
|
|
|
|
**i**: Output file of PostAnalyzeTELib.py (statistics per consensus).
|
|
|
|
|
|
**t**: MySQL table containing the consensus sequences
|
|
|
|
|
|
* Get the number of consensuses per category
|
|
|
|
|
|
`egrep -c ">" Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_*.fa`
|
|
|
|
|
|
```plaintext
|
|
|
Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthCopy.fa:55
|
|
|
Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.fa:51
|
|
|
Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_OneCopyAndMore.fa:101
|
|
|
```
|
|
|
|
|
|
* get the list of consensus with at least one full-length fragment in the genome
|
|
|
|
|
|
`egrep ">" Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.fa |sed 's/>//' > Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.lst`
|
|
|
|
|
|
```plaintext
|
|
|
DHX-incomp-chim_Chig-B-G92-Map4_reversed
|
|
|
DHX-incomp_Chig-B-G87-Map12
|
|
|
DHX-incomp_Chig-B-R43-Map4
|
|
|
DTX-comp_Chig-B-G32-Map20_reversed
|
|
|
DTX-comp_Chig-B-G48-Map19_reversed
|
|
|
DTX-comp_Chig-B-G49-Map20_reversed
|
|
|
DTX-comp_Chig-B-G52-Map5_reversed
|
|
|
DTX-comp_Chig-B-G53-Map20_reversed
|
|
|
DTX-comp_Chig-B-P13.15-Map8
|
|
|
...
|
|
|
```
|
|
|
|
|
|
* This list can be used to restrict any of the previous result files to these consensuses
|
|
|
|
|
|
`grep -F -f Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.lst A_result_file > A_result_file_FLF`
|
|
|
|
|
|
## Results analysis
|
|
|
|
|
|
### TEdenovo (and post TEdenovo) most interesting output files
|
|
|
|
|
|
`cd /home/guestFormation/Chig/TEdenovo`
|
|
|
|
|
|
#### TEdenovo output directories
|
|
|
|
|
|
```plaintext
Chig_db step1: chunks and batches
Chig_Blaster step2: Blaster results
Chig_Blaster_Grouper step3: Grouper clustering
Chig_Blaster_Recon step3: Recon clustering
Chig_Blaster_Piler step3: Piler clustering
Chig_Blaster_Grouper_Map step4: Multiple alignment for each Grouper cluster
Chig_Blaster_Recon_Map step4: Multiple alignment for each Recon cluster
Chig_Blaster_Piler_Map step4: Multiple alignment for each Piler cluster
Chig_Blaster_GrpRecPil_Map_TEclassif/detectFeatures/ step5: Output of all programs used to detect features
Chig_Blaster_GrpRecPil_Map_TEclassif/classifConsensus step6: consensus classification
Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered/ step7: consensus filtered for SSR and under-represented noCat
Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL step8: MCL clustering of consensus
```
|
|
|
|
|
|
#### TEdenovo consensus library
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/Chig_sim_withoutRedundancy_negStrandReversed_WickerH.fa`
|
|
|
|
|
|
```plaintext
|
|
|
>noCat_Chig-B-G1-Map20
|
|
|
AGGTAGCAGGTAAATTGCCAGCCCTCATCTAGTATTTTGCTAGTCTCTAACCTATTTAGG
|
|
|
…
|
|
|
>SSR_Chig-B-G10-Map20
|
|
|
TAATTTATATATATAGTAAGCTGTATATTATATTAATCTATATATAATTTAGTACCTTTC
|
|
|
...
|
|
|
>RLX-incomp_Chig-B-G102-Map10
|
|
|
GAATTTCTTTCCAGAGTGCTTAGGAATTTCTAAGTAAGTTATTTTCCTTTATATAGGTTG
|
|
|
…
|
|
|
```
|
|
|
|
|
|
#### TEdenovo consensus library after filtering of “noCat” consensus built using less than 10 copies and consensus classified as SSR – This library is used as input of TEannot pipeline
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered/Chig_sim_denovoLibTEs_filtered.fa`
|
|
|
|
|
|
```plaintext
|
|
|
>noCat_Chig-B-G1-Map20
|
|
|
AGGTAGCAGGTAAATTGCCAGCCCTCATCTAGTATTTTGCTAGTCTCTAACCTATTTAGG
|
|
|
…
|
|
|
>RLX-incomp_Chig-B-G102-Map10
|
|
|
GAATTTCTTTCCAGAGTGCTTAGGAATTTCTAAGTAAGTTATTTTCCTTTATATAGGTTG
|
|
|
…
|
|
|
```
|
|
|
|
|
|
#### Classification of TEdenovo consensus library (All consensuses including SSR and noCat built with less than 10 HSPs) according to Wicker classification nomenclature
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/Chig_sim_withoutRedundancy_negStrandReversed_WickerH.classif`
|
|
|
|
|
|
\- Legend
|
|
|
|
|
|
```plaintext
|
|
|
Seq_name length strand status class_classif order_classif completeness evidence
|
|
|
```
|
|
|
|
|
|
```plaintext
|
|
|
...
|
|
|
RXX-TRIM-chim_Chig-B-G163-Map3 892 . PotentialChimeric I TRIM NA CI=40; struct=(TElength: <700bps; TermRepeats: termLTR: 442); other=(TermRepeats: termTIR: 441; SSRCoverage=0.15)
|
|
|
noCat_Chig-B-G166-Map3 912 . ok noCat noCat NA CI=NA; struct=(SSRCoverage=0.20)
|
|
|
DTX-incomp_Chig-B-G16-Map9_reversed 1590 - ok II TIR incomplete CI=37; coding=(TE_BLRtx: Mariner-13_SS:ClassII:TIR:Tc1-Mariner:?: 10.98%, Mariner-3_SS:ClassII:TIR:Tc1-Mariner:?: 7.13%, Mariner-4_SS:ClassII:TIR:Tc1-Mariner:?: 8.26%, Mariner-6_SS:ClassII:TIR:Tc1-Mariner:?: 7.13%, Mariner1_AO:ClassII:TIR:Tc1-Mariner:?: 16.66%; profiles: PF03184.14_DDE_1_NA_EN_20.1: 91.71%(91.71%)); struct=(TElength: >1000bps); other=(SSR: (TA)11_end; SSRCoverage=0.51)
|
|
|
noCat_Chig-B-G177-Map3 1089 . ok noCat noCat NA CI=NA; struct=(SSRCoverage=0.63)
|
|
|
```
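
As a quick cross-check of the classification file against the legend above, the consensuses can be counted per order (column 6, `order_classif`). This is a minimal sketch assuming the `.classif` file is tab-separated and has no header line (if it has one, the header would be counted as an extra category); `count_by_order` is a hypothetical helper name.

```shell
# Sketch: number of consensuses per order_classif value (column 6).
# usage: count_by_order file.classif   (hypothetical helper, not part of REPET)
count_by_order() {
  awk -F'\t' '
    { n[$6]++ }                                  # one counter per order name
    END { for (o in n) print o "\t" n[o] }
  ' "$1" | LC_ALL=C sort                         # fixed locale for a stable order
}
```

The counts will not necessarily match the lines of the `classif_stats.txt` file one to one, since that file also reports completeness and chimeric status.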
|
|
|
|
|
|
#### Classification statistics (All consensuses including SSR and noCat built with less than 10 HSPs)
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/Chig_sim_withoutRedundancy_negStrandReversed_WickerH.classif_stats.txt`
|
|
|
|
|
|
```plaintext
|
|
|
DIRS incomp: 1 (0.62%)
|
|
|
DIRS potential chimeric*: 1 (0.62%)
|
|
|
DIRS total (RYX): 1 (0.62%)
|
|
|
LARD potential chimeric*: 3 (1.85%)
|
|
|
LARD total (RXX-LARD): 4 (2.47%)
|
|
|
LINE comp: 4 (2.47%)
|
|
|
LINE incomp: 4 (2.47%)
|
|
|
LINE potential chimeric*: 2 (1.23%)
|
|
|
LINE total (RIX): 8 (4.94%)
|
|
|
LTR comp: 2 (1.23%)
|
|
|
LTR incomp: 26 (16.05%)
|
|
|
LTR potential chimeric*: 2 (1.23%)
|
|
|
LTR total (RLX): 28 (17.28%)
|
|
|
TRIM potential chimeric*: 1 (0.62%)
|
|
|
TRIM total (RXX-TRIM): 3 (1.85%)
|
|
|
|
|
|
ClassI + noCat order: 12 (7.41%)
|
|
|
ClassI + one order: 44 (27.16%)
|
|
|
ClassI potential chimeric*: 9 (5.56%)
|
|
|
ClassI total (RXX): 56 (34.57%)
|
|
|
|
|
|
Helitron incomp: 4 (2.47%)
|
|
|
Helitron potential chimeric*: 1 (0.62%)
|
|
|
Helitron total (DHX): 4 (2.47%)
|
|
|
MITE total (DXX-MITE): 3 (1.85%)
|
|
|
TIR comp: 11 (6.79%)
|
|
|
TIR incomp: 20 (12.35%)
|
|
|
TIR potential chimeric*: 1 (0.62%)
|
|
|
TIR total (DTX): 31 (19.14%)
|
|
|
|
|
|
ClassII + one order: 38 (23.46%)
|
|
|
ClassII potential chimeric*: 2 (1.23%)
|
|
|
ClassII total (DXX): 38 (23.46%)
|
|
|
|
|
|
PotentialHostGene total: 3 (1.85%)
|
|
|
SSR total: 20 (12.35%)
|
|
|
|
|
|
Nb Potential chimeric*: 11 (6.79%)
|
|
|
|
|
|
Nb noCat at class and order levels (noCat): 45 (27.78%)
|
|
|
|
|
|
-------------------------Summary--------------------------------
|
|
|
|
|
|
RXX: 56 (34.57%)
|
|
|
DXX: 38 (23.46%)
|
|
|
PotentialHostGene: 3 (1.85%)
|
|
|
SSR: 20 (12.35%)
|
|
|
noCat: 45 (27.78%)
|
|
|
TOTAL: 162 (100.00%)
|
|
|
```
|
|
|
|
|
|
#### MCL clustering output files
|
|
|
|
|
|
\-Clustering statistics (1st column \[1,2 ..n\] correspond to MCL clusters \[MCL1, MCL2..MCLn\]):
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/Chig_sim_denovoLibTEs_filtered_MCL_statsPerCluster.tab`
|
|
|
|
|
|
```plaintext
|
|
|
cluster sequencesNb sizeOfSmallestSeq sizeOfLargestSeq averageSize medSize
|
|
|
1 10 1828 18549 8169 6870
|
|
|
2 5 444 3020 1484 892
|
|
|
3 5 1489 7092 3138 2384
|
|
|
4 4 1590 1879 1782 1831
|
|
|
5 4 2969 7645 5530 5753
|
|
|
```
|
|
|
|
|
|
\-Clustering global statistics:
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/Chig_sim_denovoLibTEs_filtered_MCL_globalStatsPerCluster.txt`
|
|
|
|
|
|
```plaintext
|
|
|
nb of clusters: 28
|
|
|
nb of clusters with 1 sequence: 4
|
|
|
nb of clusters with 2 sequences: 13
|
|
|
nb of clusters with >2 sequences: 11 (48 sequences)
|
|
|
nb of sequences: 78
|
|
|
nb of sequences in the largest cluster: 10
|
|
|
nb of sequences in the smallest cluster: 1
|
|
|
size of the smallest sequence: 439
|
|
|
size of the largest sequence: 33401
|
|
|
average sequences size: 4365
|
|
|
median sequences size: 2405
|
|
|
```
|
|
|
|
|
|
\-Consensus Library with header containing the cluster name \[MCL1, MCL2..MCLn\]:
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/Chig_sim_denovoLibTEs_filtered_MCL.fa`
|
|
|
|
|
|
```plaintext
|
|
|
>noCat_MCL2_Chig-B-G1-Map20
|
|
|
AGGTAGCAGGTAAATTGCCAGCCCTCATCTAGTATTTTGCTAGTCTCTAACCTATTTAGG
|
|
|
…
|
|
|
>RLX-incomp_MCL12_Chig-B-G102-Map10
|
|
|
GAATTTCTTTCCAGAGTGCTTAGGAATTTCTAAGTAAGTTATTTTCCTTTATATAGGTTG
|
|
|
...
|
|
|
```
|
|
|
|
|
|
\-List (tabulated file) with 2 columns "Cluster_id TE_id" created in the Post TEdenovo pipeline step
|
|
|
|
|
|
`Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/Chig_sim_denovoLibTEs_filtered_MCL.lst`
|
|
|
|
|
|
```plaintext
|
|
|
MCL1 DHX-incomp_Chig-B-G2-Map20
|
|
|
MCL1 DHX-incomp_Chig-B-G87-Map12
|
|
|
MCL1 DHX-incomp_Chig-B-R43-Map4
|
|
|
MCL1 DHX-incomp-chim_Chig-B-G92-Map4_reversed
|
|
|
MCL1 DTX-incomp_Chig-B-G3-Map8
|
|
|
MCL1 DTX-incomp_Chig-B-G43-Map20_reversed
|
|
|
MCL1 DTX-incomp_Chig-B-G59-Map3_reversed
|
|
|
MCL1 DTX-incomp_Chig-B-R36-Map3
|
|
|
MCL1 DTX-incomp-chim_Chig-B-G88-Map7
|
|
|
MCL1 RLX-incomp_Chig-B-P28.6-Map5
|
|
|
MCL2 DTX-incomp_Chig-B-R39-Map11
|
|
|
MCL2 DXX-MITE_Chig-B-R23-Map20
|
|
|
MCL2 noCat_Chig-B-G1-Map20
|
|
|
MCL2 noCat_Chig-B-P1.5-Map20
|
|
|
MCL2 RXX-TRIM-chim_Chig-B-G163-Map3
|
|
|
...
|
|
|
```
|
|
|
|
|
|
### TEannot (and post TEannot) most interesting output files
|
|
|
|
|
|
`cd /home/guestFormation/Chig/TEannot`
|
|
|
|
|
|
#### TEannot output directories
|
|
|
|
|
|
```plaintext
Chig_db step1: chunks and batches
Chig_TEdetect step2 to 7: Censor, RepeatMasker, Blaster on genome sequences and combined results
Chig_TEdetect_rnd step2 : Censor, RepeatMasker, Blaster on random genome sequences and threshold file
Chig_SSRdetect step4 & 5 : TRF, Mreps and RepeatMaskerSSR on genome sequences and combined SSR results
Chig_GFF3chr step8: A gff3 file for each genome sequence annotated
Chig_gameXMLchr step8: A gamexml file for each genome sequence annotated
```
|
|
|
|
|
|
#### Genome annotation file
|
|
|
|
|
|
`Chig_refTEs.gff`
|
|
|
|
|
|
```plaintext
|
|
|
unitig_10 Chig_REPET_TEs match 15811 15915 0.0 + . ID=ms2134_unitig_10_DTX-comp_Chig-B-G52-Map5_reversed;Target=DTX-comp_Chig-B-G52-Map5_reversed 2 107;TargetLength=1865;Identity=73.3
|
|
|
unitig_10 Chig_REPET_TEs match_part 15811 15915 0.0 + . ID=mp2134-1_unitig_10_DTX-comp_Chig-B-G52-Map5_reversed;Parent=ms2134_unitig_10_DTX-comp_Chig-B-G52-Map5_reversed;Target=DTX-comp_Chig-B-G52-Map5_reversed 2 107;Identity=73.3
|
|
|
unitig_10 Chig_REPET_TEs match 17936 18124 0.0 - . ID=ms2135_unitig_10_DTX-comp_Chig-B-G53-Map20_reversed;Target=DTX-comp_Chig-B-G53-Map20_reversed 1691 1880;OtherTargets=DTX-comp_Chig-B-G48-Map19_reversed 1894 1705
|
|
|
unitig_10 Chig_REPET_TEs match_part 17936 18124 0.0 - . ID=mp2135-1_unitig_10_DTX-comp_Chig-B-G53-Map20_reversed;Parent=ms2135_unitig_10_DTX-comp_Chig-B-G53-Map20_reversed;Target=DTX-comp_Chig-B-G53-Map20_reversed 1691 1880
|
|
|
unitig_10 Chig_REPET_TEs match 24809 26695 0.0 + . ID=ms2136_unitig_10_DTX-comp_Chig-B-P13.15-Map8;Target=DTX-comp_Chig-B-P13.15-Map8 5 1891;TargetLength=1892;Identity=100.0
|
|
|
unitig_10 Chig_REPET_TEs match_part 24809 26695 0.0 + . ID=mp2136-1_unitig_10_DTX-comp_Chig-B-P13.15-Map8;Parent=ms2136_unitig_10_DTX-comp_Chig-B-P13.15-Map8;Target=DTX-comp_Chig-B-P13.15-Map8 5 1891;Identity=100.0
|
|
|
unitig_10 Chig_REPET_TEs match 178240 178319 0.0 + . ID=ms2137_unitig_10_DHX-incomp_Chig-B-G2-Map20;Target=DHX-incomp_Chig-B-G2-Map20 5638 5718;TargetLength=12963;Identity=71.6
|
|
|
unitig_10 Chig_REPET_TEs match_part 178240 178319 0.0 + . ID=mp2137-1_unitig_10_DHX-incomp_Chig-B-G2-Map20;Parent=ms2137_unitig_10_DHX-incomp_Chig-B-G2-Map20;Target=DHX-incomp_Chig-B-G2-Map20 5638 5718;Identity=71.6
|
|
|
...
|
|
|
```
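To get a compact view of this file, the `match` features can be reduced to one line per copy. This is a sketch under two assumptions: the file is tab-separated, as the GFF3 specification requires (the listing above shows the tabs as spaces), and the attributes follow the `Target=` and `Identity=` layout shown above. `gff_matches` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of REPET): one line per "match" feature,
# printing sequence, start, end, strand, target consensus and identity.
# Identity is reported as NA when the attribute is absent.
gff_matches() {
    awk -F'\t' -v OFS='\t' '
        $3 == "match" {
            target = "NA"; identity = "NA"
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (attr[i] ~ /^Target=/) {
                    split(substr(attr[i], 8), t, " ")   # keep the name, drop start/end
                    target = t[1]
                } else if (attr[i] ~ /^Identity=/) {
                    identity = substr(attr[i], 10)
                }
            }
            print $1, $4, $5, $7, target, identity
        }' "$1"
}

# Example call:
# gff_matches Chig_refTEs.gff | head
```

The `match_part` lines are skipped on purpose: they repeat the coordinates of their parent `match` when a copy is made of a single fragment.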
|
|
|
|
|
|
#### Classification of TEdenovo consensus library corresponding to Chig_refTEs.fa
|
|
|
|
|
|
`Chig_refTEs.classif`
|
|
|
|
|
|
```plaintext
|
|
|
RLX-incomp_Chig-B-G102-Map10 467 . ok I LTR incomplete CI=7; coding=(TE_BLRx: Copia-1_DPer-I_1p:ClassI:LTR:Copia:?: 10.51%, Copia-5_DAn-I_1p:ClassI:LTR:Copia:?: 10.54%; profiles: _RNaseH_copia_NA_RH_NA: 63.50%(63.50%)); struct=(TElength: <700bps); other=(SSRCoverage=0.44)
|
|
|
noCat_Chig-B-G129-Map17 489 . ok noCat noCat NA CI=NA; struct=(SSRCoverage=0.20)
|
|
|
RXX_Chig-B-G14-Map15 2202 . ok I noCat NA CI=33; coding=(profiles: _RT_maggy_NA_RT_NA: 33.78%(33.78%)); other=(SSRCoverage=0.57)
|
|
|
noCat_Chig-B-G15-Map20 2239 . ok noCat noCat NA CI=NA; struct=(SSRCoverage=0.41)
|
|
|
RXX-TRIM-chim_Chig-B-G163-Map3 892 . PotentialChimeric I TRIM NA CI=40; struct=(TElength: <700bps; TermRepeats: termLTR: 442); other=(TermRepeats: termTIR: 441; SSRCoverage=0.15)
|
|
|
...
|
|
|
```
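The classification file can be tallied quickly from the command line. The sketch below assumes tab-separated columns in the order suggested by the listing above: consensus ID, length, strand, status, class, order, completeness, evidence. Both function names are hypothetical helpers, not REPET commands.

```shell
# Hypothetical helpers (not part of REPET).
# Assumed columns: 1=ID 2=length 3=strand 4=status 5=class 6=order 7=completeness 8=evidence

# Number of consensuses per (class, order) pair.
classif_summary() {
    awk -F'\t' '{ n[$5 "\t" $6]++ } END { for (k in n) print k "\t" n[k] }' "$1" |
        LC_ALL=C sort
}

# Consensuses whose status is not "ok" (for example PotentialChimeric).
classif_not_ok() {
    awk -F'\t' '$4 != "ok" { print $1 "\t" $4 }' "$1"
}

# Example calls:
# classif_summary Chig_refTEs.classif
# classif_not_ok  Chig_refTEs.classif
```

Consensuses flagged as potentially chimeric are natural first candidates for manual inspection.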
|
|
|
|
|
|
#### Genome annotation global statistics file
|
|
|
|
|
|
`Chig_chr_allTEs_nr_noSSR_join_path.globalAnnotStatsPerTE.txt`
|
|
|
|
|
|
```plaintext
|
|
|
nb of sequences: 104
|
|
|
nb of matched sequences: 101
|
|
|
cumulative coverage: 3528765 bp
|
|
|
coverage percentage: 6.94%
|
|
|
|
|
|
total nb of TE fragments: 3036
|
|
|
total nb full-length fragments: 411 (13.54%)
|
|
|
total nb of TE copies: 2785
|
|
|
total nb full-length copies: 448 (16.09%)
|
|
|
families with full-length fragments: 51 (49.04%)
|
|
|
with only one full-length fragment: 10
|
|
|
with only two full-length fragments: 7
|
|
|
with only three full-length fragments: 9
|
|
|
with more than three full-length fragments: 25
|
|
|
families with full-length copies: 55 (52.88%)
|
|
|
with only one full-length copy: 12
|
|
|
with only two full-length copies: 7
|
|
|
with only three full-length copies: 7
|
|
|
with more than three full-length copies: 29
|
|
|
mean of median identity of all families: 85.57 +- 9.93
|
|
|
mean of median length percentage of all families: 25.02 +- 32.68
|
|
|
```
|
|
|
|
|
|
#### TE annotation statistics per consensus
|
|
|
|
|
|
`Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE.tab`
|
|
|
|
|
|
```plaintext
|
|
|
TE length covg frags fullLgthFrags copies fullLgthCopies meanId sdId minId q25Id medId q75Id maxId meanLgth sdLgth minLgth q25Lgth medLgth q75Lgth maxLgth meanLgthPerc sdLgthPerc minLgthPerc q25LgthPerc medLgthPerc q75LgthPerc maxLgthPerc
|
|
|
DHX-incomp-chim_Chig-B-G92-Map4_reversed 18549 116688 64 4 60 4 85.21 9.39 66.90 78.70 84.75 93.50 100.00 1946.32 4893.32 28 47.00 59.50 157.00 18551 10.49 26.38 0.15 0.25 0.32 0.85 100.01
|
|
|
DHX-incomp_Chig-B-G2-Map20 12963 39422 158 0 154 1 78.78 7.91 60.40 73.20 78.80 84.00 100.00 255.99 1270.14 29 48.00 64.00 97.00 12970 1.97 9.80 0.22 0.37 0.49 0.75 100.05
|
|
|
DHX-incomp_Chig-B-G87-Map12 11268 166093 319 11 314 11 84.64 7.81 61.10 79.20 83.05 90.90 100.00 529.07 2168.88 26 39.00 53.00 77.00 11240 4.70 19.25 0.23 0.35 0.47 0.68 99.75
|
|
|
DHX-incomp_Chig-B-R43-Map4 6103 21451 131 1 129 1 74.07 8.01 58.90 68.50 73.90 79.20 99.90 166.33 523.52 41 70.00 95.00 147.00 5963 2.73 8.58 0.67 1.15 1.56 2.41 97.71
|
|
|
DTX-comp_Chig-B-G24-Map4_reversed 1866 459 2 0 2 0 87.50 4.95 84.00 84.00 87.50 91.00 91.00 229.50 167.58 111 111.00 229.50 348.00 348 12.30 8.98 5.95 5.95 12.30 18.65 18.65
|
|
|
...
|
|
|
```
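A common first filter on this table is to keep the consensuses that have at least one full-length copy and to rank them by genome coverage. The sketch below assumes a tab-separated file with one header line and the column order shown above (`covg` in column 3, `copies` in column 6, `fullLgthCopies` in column 7). `full_length_families` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of REPET): consensuses with at least one
# full-length copy, sorted by decreasing genome coverage.
# Output columns: TE, length, covg, copies, fullLgthCopies
full_length_families() {
    awk -F'\t' -v OFS='\t' 'NR > 1 && $7 > 0 { print $1, $2, $3, $6, $7 }' "$1" |
        LC_ALL=C sort -t "$(printf '\t')" -k3,3nr
}

# Example call:
# full_length_families Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE.tab
```

The threshold on `fullLgthCopies` is only an illustration. The selection criteria actually used for the second round of TEannot are those described in the corresponding section of the course.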
|
|
|
|
|
|
## Annexes
|
|
|
|
|
|
### Additional commands
|
|
|
|
|
|
* If you need to restart the REPET pipeline, you must delete all the folders created by REPET and clear the “jobs” table in the database
|
|
|
|
|
|
`cd Chig `\
|
|
|
`# delete one or both pipeline directories, depending on what you need to relaunch `\
|
|
|
`rm -r TEannot `\
|
|
|
`rm -r TEdenovo`
|
|
|
|
|
|
`. setEnv.sh`
|
|
|
|
|
|
`mysql -h $REPET_HOST -u $REPET_USER -p$REPET_PW $REPET_DB`
|
|
|
|
|
|
`mysql> show tables;`
|
|
|
|
|
|
`mysql> select * from jobs;`
|
|
|
|
|
|
`mysql> delete from jobs;`
|
|
|
|
|
|
`mysql> exit`
|
|
|
|
|
|
* To delete all the tables (when relaunching both pipelines):
|
|
|
|
|
|
`ListAndDropTables.py -l "*" -C TEdenovo.cfg -d "*" -v 3`
|
|
|
|
|
|
\->Deleting 30 tables corresponding to '\*'
|
|
|
|
|
|
* To delete only the TEannot tables (when relaunching only TEannot):
|
|
|
|
|
|
`ListAndDropTables.py -l "Chig_chk_" -d "Chig_chk_"`
|
|
|
|
|
|
\->Deleting 9 tables corresponding to 'Chig_chk_'
|
|
|
|
|
|
`ListAndDropTables.py -l "Chig_chr_" -d "Chig_chr_"`
|
|
|
|
|
|
\->Deleting 4 tables corresponding to 'Chig_chr_'
|
|
|
|
|
|
`ListAndDropTables.py -l "Chig_refTEs" -d "Chig_refTEs"`
|
|
|
|
|
|
\->Deleting 2 tables corresponding to 'Chig_refTEs'
|
|
|
|
|
|
# Practical course: Manual curation of the transposable elements library
|
|
|
|
|
|
### Compilation of consensus information : classification, genome annotation statistics, MCL clustering
|
|
|
|
|
|
* Create a table (tab-separated format) containing all the useful information for each consensus
|
|
|
|
|
|
\- Sort the MCL cluster list, the TEannot statistics file and the TEdenovo classification file on consensus ID
|
|
|
|
|
|
`cd ~/Chig/TEannot`
|
|
|
|
|
|
`sort -k2,2 ../TEdenovo/Chig_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/Chig_sim_denovoLibTEs_filtered_MCL.lst > tmp.mcl`
|
|
|
|
|
|
`gawk -F"\t" '{OFS="\t"; if($6>0){print $1,$2,$3,$4,$5,$6,$7,$8,$22}}' Chig_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE.tab | sort -k1,1 >tmp.stat`
|
|
|
|
|
|
`sort -k1,1 Chig_refTEs.classif > tmp.classif`
|
|
|
|
|
|
\- Join tmp.stat column 1 with tmp.mcl column 2
|
|
|
|
|
|
`join -t $'\t' -1 1 -2 2 tmp.stat tmp.mcl > tmp.stat.mcl`
|
|
|
|
|
|
\- Join tmp.stat.mcl column 1 with tmp.classif column 1
|
|
|
|
|
|
`join -t $'\t' -1 1 -2 1 tmp.stat.mcl tmp.classif > tmp.stat.mcl.classif`
|
|
|
|
|
|
\- Add a header to the final tab file
|
|
|
|
|
|
`echo -e "ID\tLength\tGenome_coverage(bp)\tFragments\tFLFragments\tCopies\tFLCopies\tMeanId(%)\tMeanLengthPerc(%)\tMCLcluster\tLength\tstrand\tStatus\tClass\tOrder\tCompletude\tEvidences" |cat - tmp.stat.mcl.classif > Chig_refTEs_stat_mcl_classif.tsv`
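Because `join` silently drops lines that have no partner in the other file, it is worth checking the final table before using it. The sketch below verifies that every line has the expected number of tab-separated fields (17 for the header written above). `check_field_count` is a hypothetical helper name, not a REPET command.

```shell
# Hypothetical helper (not part of REPET): report every line whose number of
# tab-separated fields differs from the expected one.
# $1 = TSV file, $2 = expected number of columns.
# Returns a non-zero status if at least one line is wrong.
check_field_count() {
    awk -F'\t' -v want="$2" '
        NF != want { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit bad + 0 }' "$1"
}

# Example call:
# check_field_count Chig_refTEs_stat_mcl_classif.tsv 17 && echo "table looks consistent"
```

To list the consensuses lost at the first join, `join -t $'\t' -1 1 -2 2 -v 1 tmp.stat tmp.mcl` prints the lines of tmp.stat that have no match in tmp.mcl. Note also that `join` expects both inputs to be sorted with the same collation rules, so `sort` and `join` should run under the same locale settings.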
|
|
|
|
|
|
### Consensus annotation (from PASTEC classifier) using IGV genome browser
|
|
|
|
|
|
* Here we use the IGV genome browser ([download](https://software.broadinstitute.org/software/igv/download)) to display the consensus annotations
|
|
|
|
|
|
All the annotations on each consensus (output of TEdenovo step 5), such as structural features, homology with known TEs and HMM profiles, have been extracted into GFF files using "CreateGFF3sForClassifFeatures.py" (cf. the Post TEdenovo pipeline section). When the PASTEC classifier reverse-complemented a consensus according to the evidence found, the coordinates (start, end) and strand of its annotations were reverse-complemented as well. The GFF files are in the "\~/TEdenovo/Visualization_Files/gff_reversed/" directory.
|
|
|
|
|
|
* Launch IGV
|
|
|
|
|
|
\- There are 3 ways to launch IGV: \
the application installed on your computer, Java Web Start, or on your VM (using the x2go client: igv.sh &)
|
|
|
|
|
|
* Load the consensus sequences:
|
|
|
|
|
|
```plaintext
|
|
|
Menu Genome -> Load genome from file ...
|
|
|
TEdenovo/Visualization_Files/Chig_sim_denovoLibTEs_filtered.fa
|
|
|
```
|
|
|
|
|
|
* Load the gff files corresponding to each track of annotation
|
|
|
|
|
|
```plaintext
|
|
|
Menu File -> Load from File ...
|
|
|
TEdenovo/Visualization_Files/gff_reversed/Chig_TE_BLRx.gff3
|
|
|
Chig_TE_BLRx.gff3
|
|
|
Chig_TE_BLRtx.gff3
|
|
|
Chig_TE_BLRn.gff3
|
|
|
Chig_Profiles.gff3
|
|
|
Chig_TR.gff3
|
|
|
Chig_SSR.gff3
|
|
|
Chig_ORF.gff3
|
|
|
Chig_polyA.gff3
|
|
|
Chig_rDNA_BLRn.gff3
|
|
|
```
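Loading ten tracks by hand is repetitive, so the same steps can be written to an IGV batch file. This is only a sketch: it relies on the IGV batch commands `new`, `genome` and `load`, and it assumes that your IGV version accepts a FASTA path in `genome` and a batch file through `igv.sh -b`. `write_igv_batch` is a hypothetical helper name, and absolute paths are the safer choice.

```shell
# Hypothetical helper (not part of REPET or IGV): write an IGV batch file that
# loads the consensus library as genome, then every .gff3 track of a directory.
# $1 = consensus FASTA, $2 = directory with the .gff3 files, $3 = batch file to write
write_igv_batch() {
    igv_fasta=$1
    igv_gff_dir=$2
    igv_out=$3
    {
        echo "new"
        echo "genome $igv_fasta"
        for f in "$igv_gff_dir"/*.gff3; do
            # Skip the unexpanded pattern when the directory has no .gff3 file
            if [ -e "$f" ]; then
                echo "load $f"
            fi
        done
    } > "$igv_out"
}

# Example call (paths relative to the Chig directory are an assumption):
# write_igv_batch "$PWD/TEdenovo/Visualization_Files/Chig_sim_denovoLibTEs_filtered.fa" \
#                 "$PWD/TEdenovo/Visualization_Files/gff_reversed" igv_batch.txt
# igv.sh -b igv_batch.txt
```

Saving an IGV session after the first manual load, as described in this section, achieves the same result without a batch file.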
|
|
|
|
|
|
* Save the IGV Session
|
|
|
|
|
|
```plaintext
|
|
|
Menu File -> Save Session ...
|
|
|
TEdenovo/Visualization_Files/gff_reversed/igv_session.xml
|
|
|
```
|
|
|
|
|
|
* Reload a saved IGV session
|
|
|
|
|
|
```plaintext
|
|
|
Menu File -> Open Session ...
|
|
|
TEdenovo/Visualization_Files/gff_reversed/igv_session.xml
|
|
|
```
|
|
|
|
|
|
### Display the multiple alignment of HSPs used to build the consensus using Jalview
|
|
|
|
|
|
* Here we use Jalview ([Download](http://www.jalview.org/Download)) to display the multiple alignments used to build the consensuses
|
|
|
|
|
|
If the key "original_HSP: yes" is set in the "\[other\]" section of "\~/TEdenovo/CreateGFF3sForClassifFeatures.cfg", \
the "\~/TEdenovo/Visualization_Files/Original_HSP_fastaAlignment/" directory is created. It contains symbolic links (alias-like) to the original consensus alignments built at TEdenovo step 4.
|
|
|
|
|
|
* Launch Jalview
|
|
|
|
|
|
\- There are 3 ways to launch Jalview: \
the application installed on your computer, Java Web Start, or on your VM (using the x2go client: jalview &)
|
|
|
|
|
|
* Close all the internal windows corresponding to a project opened by default:
|
|
|
|
|
|
```plaintext
|
|
|
Menu File -> Input Alignment -> From File -
|
|
|
```
|
|
|
|
|
|
* Set up your display preferences
|
|
|
|
|
|
```plaintext
|
|
|
Menu Tools -> Preferences -> Visual
|
|
|
```
|
|
|
|
|
|
![400px-Jalview_preference_Visual](uploads/0cb91663076ce458b154c99692629a91/400px-Jalview_preference_Visual.png)
|
|
|
|
|
|
```plaintext
|
|
|
Menu Tools -> Preferences -> Colours
|
|
|
```
|
|
|
|
|
|
![400px-Jalview_preference_Colours](uploads/8ff7e2a2bd55c3416599c777c04f7209/400px-Jalview_preference_Colours.png)
|
|
|
|
|
|
### Plot genome copies related to a consensus
|
|
|
|
|
|
* The output images of plotCoverage.py have been saved in:
|
|
|
|
|
|
```plaintext
|
|
|
~/TEannot/plotCoverage/*
|
|
|
```
|
|