docs: updating README

Commit b3cb35f7 authored 1 year ago by Baptiste Imbert
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 This Nextflow pipeline can produce orthogroups from OrthoFinder and synteny blocks from MCScanX, using genome and associated annotation file.

 # Requirements
-2 modes to run are possible, the first one uses both the genome (FASTA) and the annotation file (GFF3) to extract proteins and run the rest of the pipeline (NOTE: make sure that sequences from the FASTA are folded, it is a requirement for AGAT). The second mode takes directly proteins as an input, along with the GFF3 file. This second mode gives more control over the proteins of interest and saves a bit of time (at least 20 %).
+Two modes are available. The first uses both the genome (FASTA) and the annotation file (GFF3) to extract proteins and run the rest of the pipeline (NOTE: make sure that sequences in the FASTA are folded, as AGAT/BioPerl requires it). The second takes proteins directly as input, along with the GFF3 file. This second mode gives more control over the proteins of interest and saves some time.
 ## First mode (Genome + Annotation)
 Genome files should be in FASTA format.
 Annotation files should be in GFF3 format.
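The folding requirement mentioned in the note above can be met with a small pre-processing step. The following is a minimal sketch, not part of the pipeline: the function name `fold_fasta` and the 60-character line width are assumptions chosen for illustration, since the README does not state a required width.

```python
def fold_fasta(text: str, width: int = 60) -> str:
    """Re-wrap every sequence in a FASTA string to `width` characters per line.

    `width=60` is an assumed default; the README only says sequences must be folded.
    """
    out = []   # output lines (headers and wrapped sequence chunks)
    seq = []   # sequence lines collected for the current record

    def flush():
        # Join the collected sequence lines and emit them in fixed-width chunks.
        if seq:
            joined = "".join(seq)
            out.extend(joined[i:i + width] for i in range(0, len(joined), width))
            seq.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            flush()           # finish the previous record before starting a new one
            out.append(line)  # header lines are kept unchanged
        else:
            seq.append(line)
    flush()
    return "\n".join(out) + "\n"
```

For a file on disk, the same function can be applied to the file contents and the result written to a new FASTA before listing it in the `genome` column.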
@@ -12,7 +12,7 @@ ID    genome                  gff3               chr_conversion
 sp1   input/sp1_genome.fa     input/sp1.gff3     input/chr_conversion/sp1.tsv
 sp2   input/sp2_genome.fa     input/sp2.gff3     input/chr_conversion/sp2.tsv
 ```
-The ID of each species should be unique and will be used to name output files. The `chr_conversion` column is optional, but the header should always be complete.
+The ID of each species should be unique and will be used to name output files. The `chr_conversion` column lets you provide a two-column file for each species to rename chromosomes in the FASTA and GFF3. This column is optional and its files can be placeholders, but the header must remain.
 Please see the `example_data/synteny_genome_infiles.tsv` for a working template.
 ## Second mode (Protein + Annotation)
 Protein files should be in FASTA format.
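To illustrate what the two-column `chr_conversion` file described above is for, here is a hedged sketch of a chromosome rename applied to GFF3 text. It is not the pipeline's own implementation. The column order (old name, then new name), the tab separator, and the function names `load_chr_map` and `rename_gff3_seqids` are assumptions made for the example.

```python
def load_chr_map(tsv_text: str) -> dict:
    """Parse a two-column, tab-separated table into {old_name: new_name}.

    Assumption: column 1 is the old name and column 2 the new name.
    """
    mapping = {}
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        old, new = line.split("\t")[:2]
        mapping[old] = new
    return mapping


def rename_gff3_seqids(gff_text: str, mapping: dict) -> str:
    """Replace the seqid (first column) of each GFF3 feature line using `mapping`.

    Comment and directive lines (starting with '#') are passed through unchanged,
    so `##sequence-region` directives are NOT renamed by this sketch.
    """
    out = []
    for line in gff_text.splitlines():
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        fields = line.split("\t")
        # Names absent from the mapping are left as they are.
        fields[0] = mapping.get(fields[0], fields[0])
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

The FASTA headers would need the same mapping applied so that sequence names stay consistent between the two files.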
@@ -28,17 +28,17 @@ Please see the `example_data/synteny_protein_infiles.tsv` for a working template


 ## Softwares
-All tools required for the pipeline execution will be installed on launch with Conda (ensure you have it installed), with the exception of MCScanX.
-MCScanX should be installed in the bin/ folder of the pipeline, from https://github.com/wyp1125/MCScanX/archive/refs/heads/master.zip, unzipped, then `make` (ensure javac is installed).
-In case of difficulty, refer to the steps described here (https://github.com/wyp1125/MCScanX#installation)
+All tools required for the pipeline execution will be installed on launch with Conda (ensure you have it installed) or Docker, with the exception of MCScanX.
+MCScanX should be installed in the bin/ folder of the pipeline, from https://github.com/wyp1125/MCScanX/archive/refs/heads/master.zip, unzipped, then `make` (ensure javac is installed). In case of difficulty, refer to the steps described here (https://github.com/wyp1125/MCScanX#installation).
 The program should be accessed through bin/MCScanX-master/McScanX.

 # Running the pipeline
-Running with the example dataset:
+An example dataset is provided to test the pipeline.
+With the first mode:
 ```
 nextflow run main.nf -c example_data/example_data.config --convert_chr false --species_genome_files example_data/synteny_genome_infiles.tsv --outdir results_example_data/
 ```
-It is also possible to run the pipeline using only proteins and GFF3 files, in this case use the following command:
+With the second mode:
 ```
 nextflow run main.nf -c example_data/example_data.config --species_protein_files example_data/synteny_protein_infiles.tsv --outdir results_example_data/
 ```
@@ -49,11 +49,11 @@ Default parameters are described in the `nextflow.config` file. User can either
 Output files will be gathered in the outdir directory.
 If all `Publish results` are set to `true` in the config file, the following outputs are expected for each species mentioned in the tsv file in the Requirements section:

-- checked_gff/ -> GFF3 file after its verification with agat_convert_sp_gxf2gxf.pl
-- converted_chr_names/ -> old to new chromosome names if `convert_chr` is `true`
+- checked_gff/ -> GFF3 file after verification with agat_convert_sp_gxf2gxf.pl (if `check_gff` is `true`)
+- converted_chr_names/ -> old to new chromosome names (if `convert_chr` is `true`)
 - cds/ -> extracted CDS using jcvi.formats.gff load
 - longest_isoform_gff/ -> selection of the longest isoform using agat_sp_keep_longest_isoform.pl (publish is set to `false` by default)
 - proteins/ -> translated CDS using seqkit translate
-- orthology/ -> OrthoFinder results
-- synteny/ -> MCScanX results
+- orthology/ -> OrthoFinder's results
+- synteny/ -> MCScanX's results
 More output directories can be created to check the results of each process, by changing the related options in the `nextflow.config` file.
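After a run, the output list above can be checked quickly. The sketch below is an illustration only: the directory names are taken from the list, but the helper name `missing_outputs` and the assumption that these directories sit directly under the `--outdir` path are not confirmed by the README.

```python
from pathlib import Path

# Directory names as listed in the README's output section.
EXPECTED_DIRS = [
    "checked_gff",
    "converted_chr_names",
    "cds",
    "longest_isoform_gff",
    "proteins",
    "orthology",
    "synteny",
]


def missing_outputs(outdir, expected=EXPECTED_DIRS):
    """Return the expected directory names that are absent from `outdir`.

    Assumption: output directories are direct children of `outdir`.
    """
    root = Path(outdir)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_outputs("results_example_data/")` after one of the example runs would list whatever is absent. Some entries can be missing legitimately, for instance `converted_chr_names` when `convert_chr` is `false`, or `longest_isoform_gff`, whose publish option is `false` by default.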