Commit b3cb35f7 authored by Baptiste Imbert

docs: updating README

parent d57a710c
......@@ -2,7 +2,7 @@
This Nextflow pipeline can produce orthogroups from OrthoFinder and synteny blocks from MCScanX, using genome and associated annotation file.
# Requirements
2 modes to run are possible, the first one uses both the genome (FASTA) and the annotation file (GFF3) to extract proteins and run the rest of the pipeline (NOTE: make sure that sequences from the FASTA are folded, it is a requirement for AGAT). The second mode takes directly proteins as an input, along with the GFF3 file. This second mode gives more control over the proteins of interest and saves a bit of time (at least 20 %).
Two modes are available. The first one uses both the genome (FASTA) and the annotation file (GFF3) to extract proteins and run the rest of the pipeline (NOTE: make sure that sequences in the FASTA are folded, as this is a requirement for AGAT/BioPerl). The second mode takes proteins directly as input, along with the GFF3 file. This second mode gives more control over the proteins of interest and saves some time.
## First mode (Genome + Annotation)
Genome files should be in FASTA format.
Annotation files should be in GFF3 format.
......@@ -12,7 +12,7 @@ ID genome gff3 chr_conversion
sp1 input/sp1_genome.fa input/sp1.gff3 input/chr_conversion/sp1.tsv
sp2 input/sp2_genome.fa input/sp2.gff3 input/chr_conversion/sp2.tsv
```
The ID of each species should be unique and will be used to name output files. The `chr_conversion` column is optional, but the header should always be complete.
The ID of each species should be unique and will be used to name output files. The `chr_conversion` column lets you provide a two-column file for each species to change chromosome names in the FASTA and GFF3. This column is optional and its files can be placeholders, BUT the header must remain.
Please see the `example_data/synteny_genome_infiles.tsv` for a working template.
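As a rough illustration of the layout described above, the sample sheet and one conversion table could be written like this. The file name `my_genome_infiles.tsv`, the scaffold names, and the "old name, then new name" column order of the conversion table are assumptions made for this sketch, as is the use of tab separators (inferred from the `.tsv` extension); the provided template remains the reference.

```shell
# Sample sheet: one header line plus one line per species, tab-separated (assumed).
# my_genome_infiles.tsv is a hypothetical name chosen for this sketch.
printf 'ID\tgenome\tgff3\tchr_conversion\n' > my_genome_infiles.tsv
printf 'sp1\tinput/sp1_genome.fa\tinput/sp1.gff3\tinput/chr_conversion/sp1.tsv\n' >> my_genome_infiles.tsv
printf 'sp2\tinput/sp2_genome.fa\tinput/sp2.gff3\tinput/chr_conversion/sp2.tsv\n' >> my_genome_infiles.tsv

# Hypothetical two-column conversion table for sp1 (assumed order: old name, new name).
mkdir -p input/chr_conversion
printf 'scaffold_1\tchr1\nscaffold_2\tchr2\n' > input/chr_conversion/sp1.tsv

# Sanity check: count the header columns, which should stay at four
# even when the chr_conversion column is not used.
head -n 1 my_genome_infiles.tsv | awk -F'\t' '{ print NF }'
```

Keeping the four-column header even when no conversion is wanted matches the requirement that the header must remain complete.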
## Second mode (Protein + Annotation)
Protein files should be in FASTA format.
......@@ -28,17 +28,17 @@ Please see the `example_data/synteny_protein_infiles.tsv` for a working template
## Software
All tools required for the pipeline execution will be installed on launch with Conda (ensure you have it installed), with the exception of MCScanX.
MCScanX should be installed in the bin/ folder of the pipeline, from https://github.com/wyp1125/MCScanX/archive/refs/heads/master.zip, unzipped, then `make` (ensure javac is installed).
In case of difficulty, refer to the steps described here (https://github.com/wyp1125/MCScanX#installation)
All tools required for the pipeline execution will be installed on launch with Conda (ensure you have it installed) or Docker, with the exception of MCScanX.
MCScanX should be installed in the bin/ folder of the pipeline, from https://github.com/wyp1125/MCScanX/archive/refs/heads/master.zip, unzipped, then `make` (ensure javac is installed). In case of difficulty, refer to the steps described here (https://github.com/wyp1125/MCScanX#installation).
The program should then be accessible at bin/MCScanX-master/MCScanX.
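The manual installation steps above can be sketched as a small shell function. This is only an illustration: the function name `install_mcscanx` and its optional path argument are hypothetical, and it assumes `wget`, `unzip`, `make` and `javac` are already available. Nothing is downloaded until the function is called.

```shell
# Sketch of the manual MCScanX install; the body runs in a subshell
# so the caller's working directory is left unchanged.
install_mcscanx() (
    set -e
    pipeline_dir="${1:-.}"          # hypothetical argument: path to the pipeline checkout
    mkdir -p "$pipeline_dir/bin"
    cd "$pipeline_dir/bin"
    wget -O master.zip https://github.com/wyp1125/MCScanX/archive/refs/heads/master.zip
    unzip -q master.zip             # the archive unpacks into MCScanX-master/
    make -C MCScanX-master          # javac must be installed for the build
    # The upstream Makefile is expected to produce a binary named MCScanX.
    ls -l MCScanX-master/MCScanX
)
```

Calling `install_mcscanx /path/to/pipeline` from any directory would then leave the binary under that pipeline's `bin/MCScanX-master/` folder. If a step fails, the upstream installation notes linked above are the place to look.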
# Running the pipeline
Running with the example dataset:
An example dataset is provided to test the pipeline.
With the first mode:
```
nextflow run main.nf -c example_data/example_data.config --convert_chr false --species_genome_files example_data/synteny_genome_infiles.tsv --outdir results_example_data/
```
It is also possible to run the pipeline using only proteins and GFF3 files, in this case use the following command:
With the second mode:
```
nextflow run main.nf -c example_data/example_data.config --species_protein_files example_data/synteny_protein_infiles.tsv --outdir results_example_data/
```
......@@ -49,11 +49,11 @@ Default parameters are described in the `nextflow.config` file. User can either
Output files will be gathered in the outdir directory.
If all `Publish results` are set to `true` in the config file, the following outputs are expected for each species mentioned in the TSV file described in the Requirements section:
- checked_gff/ -> GFF3 file after its verification with agat_convert_sp_gxf2gxf.pl
- converted_chr_names/ -> old to new chromosome names if `convert_chr` is `true`
- checked_gff/ -> GFF3 file after verification with agat_convert_sp_gxf2gxf.pl (if `check_gff` is `true`)
- converted_chr_names/ -> old to new chromosome names (if `convert_chr` is `true`)
- cds/ -> extracted CDS using jcvi.formats.gff load
- longest_isoform_gff/ -> selection of the longest isoform using agat_sp_keep_longest_isoform.pl (publish is set to `false` by default)
- proteins/ -> translated CDS using seqkit translate
- orthology/ -> OrthoFinder results
- synteny/ -> MCScanX results
- orthology/ -> OrthoFinder's results
- synteny/ -> MCScanX's results
More output directories can be created to check the results of each process, by changing the related options in the `nextflow.config` file.
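To see quickly which of the folders listed above were actually published after a run, a small helper along these lines may be useful. The function name `check_outputs` is hypothetical, and the sketch assumes the folders sit directly under the output directory, which may not match the real layout.

```shell
# Report which expected result folders exist under an output directory.
check_outputs() {
    outdir="${1:-results_example_data}"
    for d in checked_gff converted_chr_names cds longest_isoform_gff proteins orthology synteny; do
        if [ -d "$outdir/$d" ]; then
            echo "found    $d"
        else
            # A missing folder is not necessarily an error: some outputs are only
            # published when the matching option is enabled in the config.
            echo "missing  $d"
        fi
    done
}
```

For example, `check_outputs results_example_data/` after the example run would print one line per folder.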