Commit 4050b89b authored by Margot Zahm
Readme file

parent 4e1c9daf
README.md 0 → 100644
# nf-benchgapcloser
Pipeline to benchmark gapclosing tools. It compares:
* behavior for gaps of known and unknown length
* expected length
* identity of filled gaps
* running time
* max memory used
It can generate random reads and sequences or take given data as input.
## Quick Start
1. Install [`Nextflow`](https://www.nextflow.io/)
2. Install [`Singularity`](https://sylabs.io/guides/3.6/user-guide/) for full pipeline reproducibility
3. Build the Singularity image
```
cd benchgapcloser
singularity build Singularity.img Singularityfile
```
4. Run the process on a small dataset
```
```
5. Run the process with your data
```
nextflow run benchgapcloser/main.nf --all_gapcloser [--assembly 'assembly.fa' --reads 'reads.fq' --reads_coord 'reads_pos.bed']
```
## Usage
```
nextflow run main.nf [options] --all_gapcloser [--assembly 'assembly.fa' --reads 'reads.fq' --reads_coord 'reads_pos.bed']

Mandatory argument:
  --all_gapcloser             Runs all gapclosing tools for benchmark (GMcloser, LR_Gapcloser and TGS-GapCloser)
  or
  --GM_gapcloser              Runs GMcloser only
  or
  --LR_gapcloser              Runs LR_Gapcloser only
  or
  --TGS_gapcloser             Runs TGS-GapCloser only

Scaffold options:
  --assembly [file]           Path to a FASTA file that contains one sequence without gaps. If not specified, a random sequence is generated.
  --scaffold_length [int]     Length of the randomly generated sequence (default: 30Mb).
  --contig_length [str]       Contig length distribution (mean and stdev, default: '300000 50000')
  --gap_length [str]          Gap length distribution (mean and stdev, default: '20000 5000')

Reads options:
  --reads [file]              Path to a FASTQ file of reads. If not specified, reads are generated using Badread.
  --reads_coord [file]        Path to a BED file of read coordinates on the assembly. It needs a fourth column: read ID.
                              Mandatory when the --reads option is specified
  --quantity [str]            Read depth to generate (default: '50x')
  --length [str]              Fragment length distribution (mean and stdev, default: '15000,13000')
  --identity [str]            Sequencing identity distribution (mean, max and stdev, default: '100,100,0')
  --error_model [str]         Can be "nanopore", "pacbio", "random" or a model filename (default: 'random')
  --qscore_model [str]        Can be "nanopore", "pacbio", "random", "ideal" or a model filename (default: 'random')
  --glitches [str]            Read glitch parameters (rate, size and skip, default: '0,0,0') [more info](https://github.com/rrwick/Badread#glitches)
  --junk_reads [int]          This percentage of reads will be low-complexity junk (default: 0) [more info](https://github.com/rrwick/Badread#junk-and-random-reads)
  --random_reads [int]        This percentage of reads will be random sequence (default: 0) [more info](https://github.com/rrwick/Badread#junk-and-random-reads)
  --chimeras [int]            Percentage at which separate fragments join together (default: 0) [more info](https://github.com/rrwick/Badread#chimeras)
  --start_adapter_seq [str]   Adapter sequence for read starts (default: '')
  --end_adapter_seq [str]     Adapter sequence for read ends (default: '')

General:
  --seed [int]                Random number generator seed for deterministic output (default: different output each time)
  --outdir [str]              Output directory (default: './results/')
```
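To make the `--scaffold_length`, `--contig_length`, `--gap_length` and `--seed` options more concrete, here is a minimal sketch of one way a scaffold layout could be drawn from "mean stdev" strings. It is an illustration only: the function names are hypothetical, the use of a normal distribution is an assumption, and the pipeline's actual generator may work differently.

```python
import random


def parse_mean_stdev(text):
    """Parse a 'mean stdev' string such as '300000 50000' (space or comma separated)."""
    mean, stdev = (float(x) for x in text.replace(",", " ").split())
    return mean, stdev


def layout_scaffold(scaffold_length, contig_length, gap_length, seed=None):
    """Alternate contigs and gaps until scaffold_length is covered.

    Returns (kind, start, end) tuples with half-open coordinates.
    Assumption: lengths follow a normal distribution, floored at 1.
    """
    rng = random.Random(seed)  # a fixed seed gives a reproducible layout
    dists = {
        "contig": parse_mean_stdev(contig_length),
        "gap": parse_mean_stdev(gap_length),
    }
    layout, pos, kind = [], 0, "contig"
    while pos < scaffold_length:
        mean, stdev = dists[kind]
        length = max(1, round(rng.gauss(mean, stdev)))
        end = min(pos + length, scaffold_length)
        layout.append((kind, pos, end))
        pos = end
        kind = "gap" if kind == "contig" else "contig"
    # A scaffold should not end with a gap: drop a trailing gap and
    # extend the previous contig to the end of the scaffold instead.
    if layout and layout[-1][0] == "gap":
        layout.pop()
        _, start, _ = layout[-1]
        layout[-1] = ("contig", start, scaffold_length)
    return layout
```

Calling `layout_scaffold(1_000_000, '300000 50000', '20000 5000', seed=1)` twice returns the same layout, which mirrors the role of `--seed` described above.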
## Input files
The only mandatory parameter is one of `--all_gapcloser`, `--GM_gapcloser`, `--LR_gapcloser` or `--TGS_gapcloser`. It selects the gapclosing tool(s) to run. With no other option, the pipeline generates a random sequence and corresponding random reads.
To run the pipeline on your own sequence, pass it with the `--assembly` parameter. The assembly must be a FASTA file containing a single sequence without gaps. If you have a multi-FASTA file, split it and run the pipeline once per sequence.
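Splitting a multi-FASTA file can be done with a few lines of standard-library Python. The sketch below is not part of the pipeline; the function names are hypothetical, output files are named after the first word of each header, and a gap is assumed to mean any `N` base.

```python
from pathlib import Path


def split_multifasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file.

    Returns the list of written paths, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                # Name the file after the first word of the header,
                # falling back to a counter if the header is empty.
                words = line[1:].split()
                name = words[0] if words else f"seq{len(written) + 1}"
                path = out_dir / f"{name}.fa"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written


def has_gaps(fasta_path):
    """Return True if any sequence line contains N or n (assumed gap symbol)."""
    with open(fasta_path) as fasta:
        return any(
            "N" in line.upper() for line in fasta if not line.startswith(">")
        )
```

Each file returned by `split_multifasta` can then be given to `--assembly` in a separate run, after checking it with `has_gaps`.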
To run the pipeline on your own reads, pass them with the `--reads` parameter. Reads cannot be specified without the associated assembly. You must also give the coordinates of the reads on your assembly in BED format with the `--reads_coord` option; the fourth BED column must hold the read ID.
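Before launching a run, it can help to check that the BED file matches the reads. The following sketch is not part of the pipeline and uses hypothetical function names. It assumes a plain four-line-per-record FASTQ and a tab-separated BED file, and it also assumes that every read should have a BED entry, which the pipeline may or may not require.

```python
def read_fastq_ids(fastq_path):
    """Collect read IDs (first word after '@') from a 4-line-per-record FASTQ."""
    ids = set()
    with open(fastq_path) as fastq:
        for i, line in enumerate(fastq):
            if i % 4 == 0:  # header lines are every fourth line
                words = line[1:].split()
                if words:
                    ids.add(words[0])
    return ids


def check_reads_bed(bed_path, fastq_path):
    """Return a list of problems found in the BED file (empty if none)."""
    read_ids = read_fastq_ids(fastq_path)
    problems = []
    seen = set()
    with open(bed_path) as bed:
        for n, line in enumerate(bed, start=1):
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                problems.append(f"line {n}: expected at least 4 columns")
                continue
            _chrom, start, end, read_id = fields[:4]
            if not (start.isdigit() and end.isdigit() and int(start) < int(end)):
                problems.append(f"line {n}: invalid coordinates")
            if read_id not in read_ids:
                problems.append(f"line {n}: read ID {read_id!r} not in FASTQ")
            seen.add(read_id)
    # Assumption: each read is expected to have coordinates.
    for missing in sorted(read_ids - seen):
        problems.append(f"read {missing!r} has no BED entry")
    return problems
```

An empty list from `check_reads_bed('reads_pos.bed', 'reads.fq')` means the two files are consistent under these assumptions.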
## Output files
Output files are stored in the output directory specified by the `--outdir` option (default: `./results`). It contains:
* `report.html`: An HTML report showing the efficiency of each gapcloser.
* `pipeline_trace.txt`: A table of each process run by Nextflow, with information such as memory used and running time.
* `data/`: A directory with the CSV files used to generate the report, the assembly gapclosed by each gapcloser, and the sequence and reads generated or given to the pipeline.
* `images/`: A directory with images of each gap and the reads mapped to these regions. There is one subdirectory for each gapcloser tool.
* `plots/`: A directory with plots generated for the report.
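The trace table can also be inspected outside the report. The sketch below reads it with the standard `csv` module, assuming the file is tab-separated with a header row, as Nextflow trace files usually are. The column names `name`, `realtime` and `peak_rss` are assumptions: they depend on how tracing is configured in this pipeline, and missing columns are returned as `None`.

```python
import csv


def read_trace(trace_path, columns=("name", "realtime", "peak_rss")):
    """Return one dict per process with only the requested columns.

    Values are kept as the strings found in the file (no unit conversion).
    Columns missing from the file are reported as None.
    """
    with open(trace_path, newline="") as trace:
        reader = csv.DictReader(trace, delimiter="\t")
        return [{col: row.get(col) for col in columns} for row in reader]
```

For example, `for row in read_trace('results/pipeline_trace.txt'): print(row)` lists the running time and peak memory recorded for each process.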
## Dependencies
If you do not use the Singularity image, install the following tools and libraries before running the workflow.
### Gapclosing tools
* [GMcloser](https://sourceforge.net/projects/gmcloser/)
* [LR_Gapcloser](https://github.com/CAFS-bioinformatics/LR_Gapcloser)
* [TGS-GapCloser](https://github.com/BGI-Qingdao/TGS-GapCloser)
Careful: these tools have dependencies of their own that are not listed in this Dependencies section. Check their requirements when you install them.
### Other tools
* [badread](https://github.com/rrwick/Badread)
* [bedtools](https://bedtools.readthedocs.io)
* [blat](https://github.com/djhshih/blat)
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2)
* [samtools](http://www.htslib.org/download/)
### Python modules
* [biopython](https://biopython.org/)
* [cython](https://cython.org/)
* [GenomeView](https://github.com/nspies/genomeview)
* [numpy](https://numpy.org/)
* [pysam](https://pysam.readthedocs.io/en/latest/index.html)
* [pytz](http://pytz.sourceforge.net/)
* [scipy](https://www.scipy.org/)
### R libraries
* [ggplot2](https://rdrr.io/cran/ggplot2/)
* [ggpubr](https://rdrr.io/cran/ggpubr/)
* [rmarkdown](https://rdrr.io/cran/rmarkdown/)