Commit 263e2acc authored by MARTIN Pierre's avatar MARTIN Pierre

Merge branch 'dsl2' into 'dev'

Merge dsl2 in dev to finalize dsl1 to dsl2 transition

See merge request !12
parents 02a9545e 82951f63
Pipeline #46464 skipped
*.nf gitlab-language=groovy
\ No newline at end of file
# metagWGS: Documentation

## Introduction

**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp; PacBio HiFi reads, single-end).

### Pipeline graphical representation

The workflow processes raw data from `.fastq/.fastq.gz` inputs and/or assemblies (contigs) in `.fa/.fasta` format, and uses the modules represented in this figure:

![](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/raw/master/docs/Pipeline.png)

### metagWGS steps

metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis:
* `S01_CLEAN_QC` (can be stopped at with `--stop_at_clean`; can be skipped with `--skip_clean`)
  * trims adapter sequences and deletes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
  * removes host contaminants ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
  * controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
  * makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_kaiju_results.py))
* `S02_ASSEMBLY` (can be stopped at with `--stop_at_assembly`)
  * assembles cleaned reads (combined with the `S01_CLEAN_QC` step) or raw reads (with the `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
  * assesses the quality of the assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
  * deduplicates cleaned reads (combined with the `S01_CLEAN_QC` step) or raw reads (with the `--skip_clean` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `S03_FILTERING` (can be stopped at with `--stop_at_filtering`; can be skipped with `--skip_assembly`)
  * filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `S04_STRUCTURAL_ANNOT` (can be stopped at with `--stop_at_structural_annot`)
  * makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Rename_contigs_and_genes.py))
* `S05_ALIGNMENT`
  * aligns reads to the contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/))
  * aligns the protein sequences of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `S06_FUNC_ANNOT` (can be skipped with `--skip_func_annot`)
  * makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/cd_hit_produce_table_clstr.py))
  * quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Quantification_clusters.py))
  * makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI` (can be skipped with `--skip_taxo_affi`)
  * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
  * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
  * counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_contig_lineage.py))
* `S08_BINNING` (not yet implemented)
  * binning strategies for assemblies and co-assemblies

All steps are launched one after another by default. Use the `--stop_at_[STEP]` and `--skip_[STEP]` parameters to tailor execution to your needs, as in the sketch below.

An HTML report file is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
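For example, a launch command might look like this minimal sketch. Only `--skip_clean` and `--stop_at_assembly` come from the step list above; the `main.nf` entry point, the profile name and the `--reads` pattern are placeholders used for illustration, so check the usage documentation for the real input parameters:

```bash
# Hypothetical sketch: skip the cleaning step and stop the pipeline once
# assembly is done. main.nf, the profile and --reads are assumed placeholders.
nextflow run main.nf -profile singularity \
    --reads 'data/*_R{1,2}.fastq.gz' \
    --skip_clean \
    --stop_at_assembly
```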
...@@ -49,28 +49,15 @@ Two [Singularity](https://sylabs.io/docs/) containers are available making insta
## Documentation

The metagWGS documentation can be found in the following pages:

* [Installation](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md)
  * The pipeline installation procedure.
* [Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md)
  * An overview of how the pipeline works, how to run it, and a description of all the command-line flags.
* [Output](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/output.md)
  * An overview of the different output files and directories produced by the pipeline.
* [Use case](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md)
  * A tutorial to learn how to launch the pipeline on a test dataset on the [genologin cluster](http://bioinfo.genotoul.fr/).
* [Functional tests](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md)
  * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.
report_comment: >
    This report has been generated by the <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">genotoul-bioinfo/metagwgs</a>
    analysis pipeline. For information about how to interpret these results, please see the
    <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">documentation</a>.

extra_fn_clean_trim:
    - "hifi_"
    - '.count_reads_on_contigs'
    - '_scaffolds'
    - '.txt'
    - '.contigs'
    - '.sort'

module_order:
    - fastqc:
        name: 'FastQC'
        path_filters:
            - '*hifi_*.zip'
    - quast:
        name: 'Quast primary assembly'
        info: 'This section of the report shows quast results after assembly'
        path_filters:
            - '*quast_hifi/*/report.tsv'
    - prokka
    - featureCounts

prokka_fn_snames: True
prokka_table: True

featurecounts:
    fn: '*.summary'
    shared: true

table_columns_visible:
    FastQC:
        percent_duplicates: False
        percent_gc: False
...@@ -34,11 +34,6 @@ module_order:
        info: 'This section reports the reads alignment against the host genome with bwa.'
        path_filters:
            - '*.no_filter.flagstat'
    - samtools:
        name: 'Reads after host reads filter'
        info: 'This section reports the cleaned reads alignment against the host genome with bwa.'
...@@ -54,12 +49,12 @@ module_order:
        name: 'Quast primary assembly'
        info: 'This section of the report shows quast results after assembly'
        path_filters:
            - '*quast_primary/*/report.tsv'
    - quast:
        name: 'Quast filtered assembly'
        info: 'This section of the report shows quast results after filtering of the assembly'
        path_filters:
            - '*quast_filtered/*/report.tsv'
    - samtools:
        name: 'Reads after deduplication'
        info: 'This section reports the deduplicated reads alignment against contigs with bwa.'
...
...@@ -79,7 +79,7 @@ to_write = []
# contig_renames[old_name] = new_name
# rewrite of the fasta
with open(args.fnaFile, "r") as fnaFile,\
     open(args.outFNAFile, "w") as outFNA_handle:
    for record in SeqIO.parse(fnaFile, "fasta"):
        try :
...@@ -112,7 +112,13 @@ with open(args.file) as gffFile,\
            # Generate correspondence.
            old_prot_name = feature.qualifiers['ID'][0].replace("_gene","")
            prot_number = old_prot_name.split("_")[-1]

            # All sub-features of a gene are expected to share a single type
            # (e.g. CDS, tRNA); that type becomes part of the new gene name.
            subfeat_types = {subfeat.type for subfeat in feature.sub_features}
            assert len(subfeat_types) == 1, f'Subfeatures have different types {subfeat_types}'
            subfeat_type = subfeat_types.pop()

            new_prot_name = f"{new_ctg_name}.{subfeat_type}_{prot_number}"
            prot_names[old_prot_name] = new_prot_name
            fh_prot_table.write(old_prot_name + "\t" + new_prot_name + "\n")
...@@ -134,7 +140,7 @@ with open(args.file) as gffFile,\
    GFF.write(to_write, out_handle)

with open(args.fastaFile, "r") as handle,\
     open(args.outFAAFile, "w") as outFasta_handle:
    for record in SeqIO.parse(handle, "fasta"):
        try :
...@@ -147,7 +153,7 @@ with open(args.fastaFile, "r") as handle,\
            pass

with open(args.ffnFile, "r") as handle,\
     open(args.outFFNFile, "w") as outFFN_handle:
    for record in SeqIO.parse(handle, "fasta"):
        try :
...
#!/usr/bin/env python
"""----------------------------------------------------------------------------
Script Name: best_hit_diamond.py
Description: Keep the best DIAMOND hits (best bitscore) for each gene/protein
Input files: DIAMOND output file (.m8)
Created By: Joanna Fourquet
Date: 2021-01-13
-------------------------------------------------------------------------------
"""
# Metadata
__author__ = 'Joanna Fourquet \
- GenPhySE - NED'
__copyright__ = 'Copyright (C) 2021 INRAE'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Module imports (only the ones actually used by this script).
try:
    import argparse
    import sys
    from collections import defaultdict
except ImportError as error:
    print(error)
    exit(1)
def read_blast_input(blastinputfile):
    # Example m8 line:
    #c1.Prot_00001 EFK63346.1 100.0 85 0 0 1 85 62 146 1.6e-36 158.3 85 \
    # 146 EFK63346.1 LOW QUALITY PROTEIN: hypothetical protein HMPREF9008_04720, partial [Parabacteroides sp. 20_3]
    # Columns: queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,
    # queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle
    score = defaultdict(float)
    best_lines = defaultdict(list)
    nmatches = defaultdict(int)

    for line in open(blastinputfile):
        (queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,
         queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore,
         queryLength, subjectLength, subjectTitle) = line.rstrip().split("\t")

        if nmatches[queryId] == 0 or float(bitScore) > score[queryId]:
            # First match for this query, or a strictly better bitscore:
            # (re)start the list of best lines.
            score[queryId] = float(bitScore)
            best_lines[queryId] = [line]
        elif float(bitScore) == score[queryId]:
            # Ex-aequo bitscore: keep this hit as well.
            best_lines[queryId].append(line)
        nmatches[queryId] += 1

    return best_lines
def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("aln_input_file",
        help = "file with blast/diamond matches, expected format m8: \
                queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, \
                queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore")
    parser.add_argument('-o', '--output_file', type = str,
        default = "best_hit.tsv", help = "string specifying the output file")
    args = parser.parse_args(argv)

    best_lines = read_blast_input(args.aln_input_file)
    with open(args.output_file, "w") as out:
        for query_id in best_lines:
            for line in best_lines[query_id]:
                out.write(line)
    print("Finished")

if __name__ == "__main__":
    main(sys.argv[1:])
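A usage sketch for this script (the input file name is a placeholder for a 15-column DIAMOND tabular output):

```bash
# Hypothetical example: keep, for each query, only the hit line(s) with the
# best bitscore. sample.m8 must contain the 15 columns listed in the docstring.
./best_hit_diamond.py sample.m8 -o sample_best_hit.tsv
```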
#!/usr/bin/env python
#USAGE: ./combine_tables.py <BUSCO_table> <QUAST_table>
import pandas as pd
from sys import stdout
from sys import argv
# Read files
file1 = pd.read_csv(argv[1], sep="\t")
file2 = pd.read_csv(argv[2], sep="\t")
# Merge files
result = pd.merge(file1, file2, left_on="GenomeBin", right_on="Assembly", how='outer')
# Print to stdout
result.to_csv(stdout, sep='\t')
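A usage sketch (both input names are placeholders; the merged table is written to stdout, so redirect it to a file):

```bash
# Hypothetical example: join a BUSCO summary and a QUAST report on their
# GenomeBin/Assembly columns and save the merged table.
./combine_tables.py busco_summary.tsv quast_report.tsv > busco_quast.tsv
```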
...@@ -3,119 +3,130 @@
"""--------------------------------------------------------------------
Script Name: merge_contig_quantif_perlineage.py
Description: merge quantifications and lineage into one matrix for one sample.
Input files: depth from samtools coverage and lineage percontig.tsv file.
Created By: Joanna Fourquet
Date: 2021-01-19
-----------------------------------------------------------------------
"""

# Metadata.
__author__ = 'Joanna Fourquet, Jean Mainguy'
__copyright__ = 'Copyright (C) 2021 INRAE'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'

from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
import pandas as pd
import logging


def parse_arguments():
    # Manage parameters.
    parser = ArgumentParser(description = 'Script which merges \
quantifications and lineage into one matrix for one sample.',
        formatter_class = ArgumentDefaultsHelpFormatter)

    parser.add_argument('-s', '--sam_coverage', required = True,
        help = 'depth per contig from the samtools coverage tool.')
    parser.add_argument('-c', '--contig_tax_affi', required = True,
        help = '.percontig.tsv file.')
    parser.add_argument('-o', '--output_name', required = True,
        help = 'Name of output file containing counts of contigs and reads \
        for each lineage.')
    parser.add_argument('-v', '--version', action = 'version',
        version = __version__)
    parser.add_argument('--verbose', help = 'increase output verbosity',
        action = 'store_true')
    args = parser.parse_args()
    return args


def main():
    args = parse_arguments()

    if args.verbose:
        logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.DEBUG)
        logging.info('Mode verbose ON')
    else:
        logging.basicConfig(format="%(levelname)s: %(message)s")

    sam_coverage_file = args.sam_coverage
    contig_taxaffi_file = args.contig_tax_affi
    output_name = args.output_name

    ranks = ["superkingdom", "phylum", "order", "class",
             "family", "genus", "species"]

    logging.info("Read and merge tables")
    cov_df = pd.read_csv(sam_coverage_file, delimiter='\t')
    contig_taxaffi_df = pd.read_csv(contig_taxaffi_file, delimiter='\t', dtype=str)
    logging.debug(cov_df)
    logging.debug(contig_taxaffi_df)

    depth_tax_contig_df = pd.merge(cov_df, contig_taxaffi_df,
        left_on='#rname', right_on='#contig', how='outer')

    # Fill NaN values to keep unmapped contigs.
    depth_tax_contig_df['consensus_lineage'] = depth_tax_contig_df['consensus_lineage'].fillna('Unknown')
    depth_tax_contig_df['tax_id_by_level'] = depth_tax_contig_df['tax_id_by_level'].fillna(1)
    depth_tax_contig_df['consensus_tax_id'] = depth_tax_contig_df['consensus_tax_id'].fillna(1)

    logging.info("group by lineage")
    groupby_cols = ['consensus_lineage', 'consensus_tax_id', 'tax_id_by_level']
    depth_lineage_df = depth_tax_contig_df.groupby(groupby_cols).agg({
        '#rname': [';'.join, 'count'],
        'numreads': 'sum',
        'meandepth': 'mean'}).reset_index()

    depth_lineage_df.columns = ['lineage_by_level', 'consensus_tax_id', 'tax_id_by_level',
                                'name_contigs', 'nb_contigs', 'nb_reads', 'depth']

    logging.info(f"Write out {output_name}.tsv")
    depth_lineage_df.to_csv(f"{output_name}.tsv", sep="\t", index=False)

    # Split lineage by rank.
    ranks_taxid = [f"{r}_taxid" for r in ranks]
    ranks_lineage = [f"{r}_lineage" for r in ranks]

    try:
        depth_lineage_df[ranks_taxid] = depth_lineage_df['tax_id_by_level'].str.split(pat=";", expand=True)
        depth_lineage_df[ranks_lineage] = depth_lineage_df["lineage_by_level"].str.split(pat=";", expand=True)
    except ValueError:
        # Manage the case when lineage_by_level is only equal to
        # "Unable to found taxonomy consensus" or "Unknown".
        df_noaffi = pd.DataFrame("no_affi", index=range(len(depth_lineage_df)),
                                 columns=ranks_taxid + ranks_lineage)
        depth_lineage_df = pd.concat([depth_lineage_df, df_noaffi], axis=1)

    depth_lineage_df = depth_lineage_df.fillna(value='no_affi')

    # Group by each rank and write the resulting table.
    levels_columns = ['tax_id_by_level', 'lineage_by_level', 'name_contigs',
                      'nb_contigs', 'nb_reads', 'depth']

    logging.info("group by rank")