With the availability of inexpensive long-read sequencing and efficient genome assembly software, new genome assemblies are released frequently. It is not rare for a project to produce multiple assemblies of the same or closely related species. Because these assemblies are produced independently, raw assembly files contain contigs, scaffolds or chromosomes in arbitrary order and orientation, and with different sequence names. Depending on the availability of long-range genomic information (Hi-C, optical maps, linked reads), the assemblies are sometimes not at chromosome scale. One option for improving such assemblies is to scaffold them using a reference genome. In addition, for related species, scientists are interested in comparing assemblies with each other at the chromosome level. In these cases, dot-plots are a simple and efficient way to find large genomic rearrangements.
In both cases, GenomOrder allows dot-plot production or scaffolding of your assembly in a fast and simple way.
The pipeline is implemented with Nextflow and can be run with Docker; the code is available on GitHub.
# Statement of need
Newly produced genome assemblies often need to be scaffolded, or at least compared to a reference. The fastest way to do so relies on alignment and visualization tools that are not always compatible with each other and that require multiple parameters. In addition, if multiple assemblies are produced, the same analyses have to be repeated for each new assembly. This is why GenomOrder was designed with the aim of making assembly alignment visualization and scaffolding faster and more accessible. GenomOrder is meant to be used in combination with DGenies, which has already been used in numerous studies [@Cabanettes:2018]. The major advantage of GenomOrder is that it allows quick and easy production of multiple alignments and rearrangements of assemblies while remaining reproducible.
# Material and Methods
## Features
This pipeline is implemented in Nextflow, a portable, reproducible, scalable and parallelizable workflow framework [@Di:2017]. With Nextflow, the GenomOrder pipeline parallelizes and automates a list of processes from a single command line, improving reproducibility and traceability while allowing rapid production. In addition, because the pipeline is developed with multiple features, it allows users to customize command-line options to fit the desired behaviour.
GenomOrder is developed with two main modules: genomic assembly reorganisation and assembly-to-reference comparison.
GenomOrder can reorder, reorient and rename sequences from up to five assemblies according to the given reference assembly. If the reference assembly is at chromosome scale and the other assemblies are not, GenomOrder can scaffold these assemblies into chromosomes.
Given a list of chromosomes, GenomOrder will align the chromosomes sharing the same name and produce all-vs-all chromosome visualisation archives for http://dgenies.toulouse.inra.fr/ [FIGURE X]. In addition, GenomOrder can simply align multiple assemblies against a given reference assembly and quickly produce dot-plot archives for http://dgenies.toulouse.inra.fr/ [FIGURE Y].
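To make the idea of reordering and reorienting sequences against a reference more concrete, the sketch below derives an ordering plan from alignments in PAF format (the default output of minimap2). It is only an illustration under stated assumptions and not necessarily how GenomOrder proceeds: the function name `plan_reordering` is hypothetical, and the rules used (assign each sequence to the reference sequence with the most aligned bases, take the strand covering the most bases, rank by mean aligned position) are one plausible heuristic among others.

```python
from collections import defaultdict


def plan_reordering(paf_lines):
    """Assign each query sequence to a reference sequence, a strand and a rank.

    Hypothetical helper: returns a list of (query, reference, strand) tuples
    sorted by reference name, then by mean aligned position on the reference.
    """
    span_by_strand = defaultdict(int)  # (query, ref, strand) -> aligned bases
    span_total = defaultdict(int)      # (query, ref) -> aligned bases
    pos_sum = defaultdict(float)       # (query, ref) -> span-weighted midpoints

    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:           # PAF has 12 mandatory columns
            continue
        query, strand, ref = fields[0], fields[4], fields[5]
        t_start, t_end = int(fields[7]), int(fields[8])
        span = t_end - t_start
        if span <= 0:
            continue
        span_by_strand[(query, ref, strand)] += span
        span_total[(query, ref)] += span
        pos_sum[(query, ref)] += span * (t_start + t_end) / 2

    # For each query, keep the reference sequence with the most aligned bases.
    best = {}
    for (query, ref), span in span_total.items():
        if query not in best or span > span_total[(query, best[query])]:
            best[query] = ref

    plan = []
    for query, ref in best.items():
        plus = span_by_strand[(query, ref, "+")]
        minus = span_by_strand[(query, ref, "-")]
        strand = "+" if plus >= minus else "-"
        mean_pos = pos_sum[(query, ref)] / span_total[(query, ref)]
        plan.append((ref, mean_pos, query, strand))

    plan.sort()
    return [(query, ref, strand) for ref, _, query, strand in plan]


# Toy alignments: ctgA and ctgB map to chr1, ctgC maps to chr2.
example_paf = [
    "ctgB\t1000\t0\t1000\t-\tchr1\t5000\t3000\t4000\t900\t1000\t60",
    "ctgA\t2000\t0\t2000\t+\tchr1\t5000\t0\t2000\t1900\t2000\t60",
    "ctgC\t500\t0\t500\t+\tchr2\t3000\t100\t600\t480\t500\t60",
    "ctgA\t2000\t0\t100\t-\tchr2\t3000\t0\t100\t90\t100\t10",
]
example_plan = plan_reordering(example_paf)
```

In this toy case the plan places `ctgA` before `ctgB` on `chr1` and marks `ctgB` for reverse-complementing; renaming could then follow the reference name and rank.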
## Workflow
Figure 1 depicts the workflow:
[FIGURE Z]
1. Input. GenomOrder requires at least one assembly FASTA file and one reference FASTA file to produce an alignment and the resulting DGenies visualization files. Optionally, users can provide up to four more assemblies to be aligned to the reference. Additionally, one option allows users to scaffold the input assemblies against the reference, and another aligns chromosomes from the different input assemblies against their equivalents in the other assemblies.
2. Align and produce DGenies backup files. Assemblies are aligned to the reference using minimap2. The FASTA files are then indexed and the alignment file is sorted to produce an archive that can be given as input to DGenies for dot-plot visualization.
...
...
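The alignment and indexing of step 2 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the helper names `fasta_lengths` and `minimap2_command` are hypothetical, the `asm5` preset and the thread count are assumptions, and the exact index layout expected by DGenies is not reproduced here.

```python
import io


def fasta_lengths(handle):
    """Yield (name, length) for each record read from a FASTA file handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            # Keep only the first word of the header as the sequence name.
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def minimap2_command(reference, assembly, threads=4):
    """Build a minimap2 call that writes PAF to standard output.

    The asm5 preset (assembly-to-reference alignment) is an assumption;
    the pipeline may use different options.
    """
    return ["minimap2", "-x", "asm5", "-t", str(threads), reference, assembly]


# Usage on an in-memory FASTA; real input would be opened from a file.
toy_fasta = io.StringIO(">chr1 description\nACGT\nAC\n>chr2\nGGG\n")
toy_index = list(fasta_lengths(toy_fasta))
toy_command = minimap2_command("reference.fasta", "assembly.fasta", threads=2)
```

The command list can be passed to `subprocess.run` with standard output redirected to a `.paf` file, while the name and length pairs provide the information a sequence index needs.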
@@ -65,7 +68,7 @@ Further information about running the pipeline are available on github, in the c
## Output and Error Handling
Output files are arranged in the output folder given with '--output'. Each new folder contains only the main output. Nextflow creates its own work folder for intermediate output. Process logs are stored in the specific folder of each run. If a run stops due to an error, the user can fix the error and then rerun the pipeline with the initial command and the '-resume' option.
## Conclusion and discussions
...
...
@@ -75,6 +78,8 @@ The GenomOrder pipeline is user-friendly and provide a one-step analysis tool. R
# Figures


