Commit 263e2acc authored by MARTIN Pierre's avatar MARTIN Pierre

Merge branch 'dsl2' into 'dev'

Merge dsl2 in dev to finalize dsl1 to dsl2 transition

See merge request !12
parents 02a9545e 82951f63
*.config linguist-language=groovy
*.nf linguist-language=groovy
\ No newline at end of file
*.nf gitlab-language=groovy
# metagWGS
# metagWGS: Documentation
## Introduction
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp).
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp; PacBio HiFi reads, single-end).
### Pipeline graphical representation
The workflow processes raw data from `.fastq` or `.fastq.gz` inputs and do the modules represented into this figure:
![](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/raw/dev/docs/Pipeline.png)
The workflow processes raw data from `.fastq/.fastq.gz` inputs and/or assemblies (contigs) in `.fa/.fasta` format, and uses the modules represented in this figure:
![](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/raw/master/docs/Pipeline.png)
### metagWGS steps
metagWGS is splitted into different steps that correspond to different parts of the bioinformatics analysis:
metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis:
* `01_clean_qc` (can ke skipped)
* `S01_CLEAN_QC` (can be stopped at with `--stop_at_clean`; can be skipped with `--skip_clean`)
* trims adapter sequences and removes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
* removes host contaminant reads ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
* makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_kaiju_results.py))
* `02_assembly`
* assembles cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
* makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_kaiju_results.py))
* `S02_ASSEMBLY` (can be stopped at with `--stop_at_assembly`)
* assembles cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
* assesses the quality of the assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
* deduplicates cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `03_filtering` (can be skipped)
* filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `04_structural_annot`
* makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Rename_contigs_and_genes.py))
* `05_alignment`
* deduplicates cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `S03_FILTERING` (can be stopped at with `--stop_at_filtering`; can be skipped with `--skip_assembly`)
* filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `S04_STRUCTURAL_ANNOT` (can be stopped at with `--stop_at_structural_annot`)
* makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Rename_contigs_and_genes.py))
* `S05_ALIGNMENT`
* aligns reads to the contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/))
* aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `06_func_annot`
* makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/cd_hit_produce_table_clstr.py))
* quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Quantification_clusters.py))
* makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_functional_annotation.py))
* `07_taxo_affi`
* taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py))
* taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py))
* counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_contig_lineage.py))
* `08_binning` from [nf-core/mag 1.0.0](https://github.com/nf-core/mag/releases/tag/1.0.0)
* makes binning of contigs ([MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/))
* assesses bins ([BUSCO](https://busco.ezlab.org/) + [metaQUAST](http://quast.sourceforge.net/metaquast) + [summary_busco.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/summary_busco.py) and [combine_tables.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/combine_tables.py) from [nf-core/mag](https://github.com/nf-core/mag))
* taxonomically affiliates the bins ([BAT](https://github.com/dutilh/CAT))
* `S06_FUNC_ANNOT` (can be skipped with `--skip_func_annot`)
* makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/cd_hit_produce_table_clstr.py))
* quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Quantification_clusters.py))
* makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI` (can be skipped with `--skip_taxo_affi`)
* taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
* taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
* counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_contig_lineage.py))
* `S08_BINNING` (not yet implemented)
* binning strategies for assemblies and co-assemblies
All steps run one after another by default. Use the `--stop_at_[STEP]` and `--skip_[STEP]` parameters to control which steps are executed, as sketched below.
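For illustration, the effect of these flags on the step sequence can be sketched in a few lines of Python. This is a minimal model of the gating described above, using the step names from the list; the real control flow lives in the Nextflow workflow, not here:

```python
# Minimal sketch of the step gating described above (illustrative only).
STEPS = ["S01_CLEAN_QC", "S02_ASSEMBLY", "S03_FILTERING",
         "S04_STRUCTURAL_ANNOT", "S05_ALIGNMENT",
         "S06_FUNC_ANNOT", "S07_TAXO_AFFI", "S08_BINNING"]

def plan(stop_at=None, skip=()):
    """Return the ordered list of steps that would run."""
    selected = []
    for step in STEPS:
        if step not in skip:        # e.g. --skip_clean drops S01_CLEAN_QC
            selected.append(step)
        if step == stop_at:         # e.g. --stop_at_filtering ends the run here
            break
    return selected

print(plan(stop_at="S03_FILTERING", skip={"S01_CLEAN_QC"}))
# ['S02_ASSEMBLY', 'S03_FILTERING']
```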
An HTML report is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
@@ -49,28 +49,15 @@ Two [Singularity](https://sylabs.io/docs/) containers are available making insta
## Documentation
metagWGS documentation is available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/dev/docs).
## License
metagWGS is distributed under the GNU General Public License v3.
## Copyright
2021 INRAE
## Funded by
Anti-Selfish (Labex ECOFECT – N° 00002455-CT15000562)
France Génomique National Infrastructure (funded as part of Investissement d’avenir program managed by Agence Nationale de la Recherche, contract ANR-10-INBS-09)
With participation of SeqOccIn members financed by FEDER-FSE MIDI-PYRENEES ET GARONNE 2014-2020.
## Citation
metagWGS has been presented at JOBIM 2020:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, C. Noirot, C. Klopp, P. Pinton, S. Combes, C. Hoede, G. Pascal.
https://www.sfbi.fr/sites/sfbi.fr/files/jobim/jobim2020/posters/compressed/jobim2020_poster_9.pdf
metagWGS has been presented at JOBIM 2019 and at Genotoul Biostat Bioinfo day:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.
The metagWGS documentation can be found in the following pages:
* [Installation](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md)
* The pipeline installation procedure.
* [Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md)
* An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
* [Output](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/output.md)
* An overview of the different output files and directories produced by the pipeline.
* [Use case](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md)
* A tutorial to learn how to launch the pipeline on a test dataset on [genologin cluster](http://bioinfo.genotoul.fr/).
* [Functional tests](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md)
* (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.
report_comment: >
This report has been generated by the <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">genotoul-bioinfo/metagwgs</a>
analysis pipeline. For information about how to interpret these results, please see the
<a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">documentation</a>.
extra_fn_clean_trim:
- "hifi_"
- '.count_reads_on_contigs'
- '_scaffolds'
- '.txt'
- '.contigs'
- '.sort'
module_order:
- fastqc:
name: 'FastQC'
path_filters:
- '*hifi_*.zip'
- quast:
name: 'Quast primary assembly'
info: 'This section of the report shows quast results after assembly'
path_filters:
- '*quast_hifi/*/report.tsv'
- prokka
- featureCounts
prokka_fn_snames: True
prokka_table: True
featurecounts:
fn: '*.summary'
shared: true
table_columns_visible:
FastQC:
percent_duplicates: False
percent_gc: False
......@@ -34,11 +34,6 @@ module_order:
info: 'This section reports on the alignment of reads against the host genome with bwa.'
path_filters:
- '*.no_filter.flagstat'
- samtools:
name : 'Reads aln on host genome'
info: 'This section of the cleaned reads alignement against host genome with bwa.'
path_filters:
- '*host_filter/*'
- samtools:
name: 'Reads after host reads filter'
info: 'This section reports on the alignment of cleaned reads against the host genome with bwa.'
......@@ -54,12 +49,12 @@ module_order:
name: 'Quast primary assembly'
info: 'This section of the report shows quast results after assembly'
path_filters:
- '*_all_contigs_QC/*'
- '*quast_primary/*/report.tsv'
- quast:
name: 'Quast filtered assembly'
info: 'This section of the report shows quast results after filtering of the assembly'
path_filters:
- '*_select_contigs_QC/*'
- '*quast_filtered/*/report.tsv'
- samtools:
name: 'Reads after deduplication'
info: 'This section reports on the alignment of deduplicated reads against contigs with bwa.'
@@ -79,7 +79,7 @@ to_write = []
# contig_renames[old_name] = new_name
# rewrite the fasta
with open(args.fnaFile, "rU") as fnaFile,\
with open(args.fnaFile, "r") as fnaFile,\
open(args.outFNAFile, "w") as outFNA_handle:
for record in SeqIO.parse(fnaFile, "fasta"):
try:
@@ -112,7 +112,13 @@ with open(args.file) as gffFile,\
# Generate correspondence
old_prot_name = feature.qualifiers['ID'][0].replace("_gene","")
prot_number = old_prot_name.split("_")[-1]
new_prot_name = new_ctg_name + "." + prot_prefix + prot_number
subfeat_types = {subfeat.type for subfeat in feature.sub_features}
assert len(subfeat_types) == 1, f'Subfeatures have different types: {subfeat_types}'
subfeat_type = subfeat_types.pop()
new_prot_name = f"{new_ctg_name}.{subfeat_type}_{prot_number}"
prot_names[old_prot_name] = new_prot_name
fh_prot_table.write(old_prot_name + "\t" + new_prot_name + "\n")
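# --- Illustrative sketch (hypothetical identifiers, not from the GFF) ------
# The renaming scheme above maps an old Prokka-style protein identifier to
# <new_contig_name>.<subfeature_type>_<number>:
old_prot_name = "PROKKA_00042"   # assumed old-style identifier
new_ctg_name = "sample1_c12"     # renamed contig
subfeat_type = "CDS"             # the single subfeature type asserted above
prot_number = old_prot_name.split("_")[-1]
print(f"{new_ctg_name}.{subfeat_type}_{prot_number}")  # sample1_c12.CDS_00042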
@@ -134,7 +140,7 @@ with open(args.file) as gffFile,\
GFF.write(to_write, out_handle)
with open(args.fastaFile, "rU") as handle,\
with open(args.fastaFile, "r") as handle,\
open(args.outFAAFile, "w") as outFasta_handle:
for record in SeqIO.parse(handle, "fasta"):
try:
@@ -147,7 +153,7 @@ with open(args.fastaFile, "rU") as handle,\
pass
with open(args.ffnFile, "rU") as handle,\
with open(args.ffnFile, "r") as handle,\
open(args.outFFNFile, "w") as outFFN_handle:
for record in SeqIO.parse(handle, "fasta"):
try:
#!/usr/bin/env python
"""----------------------------------------------------------------------------
Script Name: best_hit_diamond.py
Description: Keep the best DIAMOND hits (highest bitscore) for each gene/protein
Input files: Diamond output file (.m8)
Created By: Joanna Fourquet
Date: 2021-01-13
-------------------------------------------------------------------------------
"""
# Metadata
__author__ = 'Joanna Fourquet \
- GenPhySE - NED'
__copyright__ = 'Copyright (C) 2021 INRAE'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Module imports
try:
    import argparse
    import sys
    from collections import defaultdict
except ImportError as error:
    print(error)
    exit(1)
def read_blast_input(blastinputfile):
    # Example m8 line:
    # c1.Prot_00001 EFK63346.1 100.0 85 0 0 1 85 62 146 1.6e-36 158.3 85 \
    #  146 EFK63346.1 LOW QUALITY PROTEIN: hypothetical protein HMPREF9008_04720, partial [Parabacteroides sp. 20_3]
    # Fields: queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,
    # queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle
    score = defaultdict(float)      # best bitscore seen so far for each query
    best_lines = defaultdict(list)  # m8 lines reaching that bitscore

    with open(blastinputfile) as infile:
        for line in infile:
            (queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,
             queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore,
             queryLength, subjectLength, subjectTitle) = line.rstrip().split("\t")
            bit_score = float(bitScore)
            if queryId not in best_lines or bit_score > score[queryId]:
                # First hit for this query, or a strictly better one: start over.
                score[queryId] = bit_score
                best_lines[queryId] = [line]
            elif bit_score == score[queryId]:
                # Tie with the current best hit: keep this line as well.
                best_lines[queryId].append(line)
    return best_lines
def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("aln_input_file",
                        help="file with blast/diamond matches in m8 format: \
queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, \
queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore")
    parser.add_argument('-o', '--output_file', type=str, default="best_hit.tsv",
                        help="string specifying the output file")
    args = parser.parse_args()
    out_lines = read_blast_input(args.aln_input_file)
    with open(args.output_file, "w") as out:
        for query_id in out_lines:
            for line in out_lines[query_id]:
                out.write(line)
    print("Finished")

if __name__ == "__main__":
    main(sys.argv[1:])
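# --- Illustrative check (not part of the original script; data fabricated) ---
# Running read_blast_input on toy m8 lines with equal best bitscores shows
# that ties are kept alongside the single best hit.
import tempfile

m8_lines = [
    "q1\ts1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t150.0\t100\t100\thit one",
    "q1\ts2\t95.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t200.0\t100\t100\thit two",
    "q1\ts3\t95.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t200.0\t100\t100\thit three",
]
with tempfile.NamedTemporaryFile("w", suffix=".m8", delete=False) as tmp:
    tmp.write("\n".join(m8_lines) + "\n")

best = read_blast_input(tmp.name)
print(best["q1"])  # both 200.0-bitscore lines (s2 and s3) are kept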
#!/usr/bin/env python
# USAGE: ./combine_tables.py <BUSCO_table> <QUAST_table>
import pandas as pd
from sys import stdout
from sys import argv
# Read files
file1 = pd.read_csv(argv[1], sep="\t")
file2 = pd.read_csv(argv[2], sep="\t")
# Merge files
result = pd.merge(file1, file2, left_on="GenomeBin", right_on="Assembly", how='outer')
# Print to stdout
result.to_csv(stdout, sep='\t')
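# --- Illustrative check (not part of the original script; values made up) ---
# The outer merge keeps bins that appear in only one of the two tables,
# filling the missing columns with NaN.
busco = pd.DataFrame({"GenomeBin": ["bin.1", "bin.2"], "%Complete": [97.5, 64.0]})
quast = pd.DataFrame({"Assembly": ["bin.1", "bin.3"], "N50": [45210, 12033]})
merged = pd.merge(busco, quast, left_on="GenomeBin", right_on="Assembly", how="outer")
print(merged)  # bin.2 and bin.3 are both kept, with NaN where data is missing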
@@ -3,119 +3,130 @@
"""--------------------------------------------------------------------
Script Name: merge_contig_quantif_perlineage.py
Description: merges quantifications and lineage into one matrix for one sample.
Input files: idxstats file, depth from mosdepth (bed.gz) and lineage percontig.tsv file.
Input files: depth from samtools coverage and lineage percontig.tsv file.
Created By: Joanna Fourquet
Date: 2021-01-19
-----------------------------------------------------------------------
"""
# Metadata.
__author__ = 'Joanna Fourquet \
- GenPhySE - NED'
__author__ = 'Joanna Fourquet, Jean Mainguy'
__copyright__ = 'Copyright (C) 2021 INRAE'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev.
# Modules importation.
try:
    import argparse
    import re
    import sys
    import pandas as pd
    import numpy as np
    from datetime import datetime
except ImportError as error:
    print(error)
    exit(1)
# Print time.
print(str(datetime.now()))
# Manage parameters.
parser = argparse.ArgumentParser(description = 'Script which \
merge quantifications and lineage into one matrice for one sample.')
parser.add_argument('-i', '--idxstats_file', required = True, \
help = 'idxstats file.')
parser.add_argument('-m', '--mosdepth_file', required = True, \
help = 'depth per contigs from mosdepth (regions.bed.gz).')
parser.add_argument('-c', '--percontig_file', required = True, \
help = '.percontig.tsv file.')
parser.add_argument('-o', '--output_name', required = True, \
help = 'Name of output file containing counts of contigs and reads \
for each lineage.')
parser.add_argument('-v', '--version', action = 'version', \
version = __version__)
args = parser.parse_args()
# Recovery of idxstats file.
idxstats = pd.read_csv(args.idxstats_file, delimiter='\t', header=None)
idxstats.columns = ["contig","len","mapped","unmapped"]
# Recovery of mosdepth file; remove start/end columns
mosdepth = pd.read_csv(args.mosdepth_file, delimiter='\t', header=None,compression='gzip')
mosdepth.columns = ["contig","start","end","depth"]
mosdepth.drop(["start","end"], inplace=True,axis=1)
# Recovery of .percontig.tsv file.
percontig = pd.read_csv(args.percontig_file, delimiter='\t', dtype=str)
# Merge idxstats and .percontig.tsv files.
merge = pd.merge(idxstats,percontig,left_on='contig',right_on='#contig', how='outer')
# Add depth
merge = pd.merge(merge,mosdepth,left_on='contig',right_on='contig', how='outer')
# Fill NaN values to keep unmapped contigs.
merge['consensus_lineage'] = merge['consensus_lineage'].fillna('Unknown')
merge['tax_id_by_level'] = merge['tax_id_by_level'].fillna(1)
merge['consensus_tax_id'] = merge['consensus_tax_id'].fillna(1)
# Group by lineage and sum number of reads and contigs.
res = merge.groupby(['consensus_lineage','consensus_tax_id', 'tax_id_by_level']).agg({'contig' : [';'.join, 'count'], 'mapped': 'sum', 'depth': 'mean'}).reset_index()
res.columns=['lineage_by_level', 'consensus_tax_id', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'nb_reads', 'depth']
# Fill NaN values with 0.
res.fillna(0, inplace=True)
# Split by taxonomic level
res_split_tax_id = res.join(res['tax_id_by_level'].str.split(pat=";",expand=True))
res_split_tax_id.columns=['consensus_lineage', 'consensus_taxid', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'depth', 'nb_reads', "superkingdom_tax_id", "phylum_tax_id", "order_tax_id", "class_tax_id", "family_tax_id", "genus_tax_id", "species_tax_id"]
res_split_tax_id.fillna(value='no_affi', inplace = True)
print(res_split_tax_id.head())
res_split = res_split_tax_id.join(res_split_tax_id['consensus_lineage'].str.split(pat=";",expand=True))
res_split.columns=['consensus_lineage', 'consensus_taxid', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'nb_reads', 'depth', "superkingdom_tax_id", "phylum_tax_id", "order_tax_id", "class_tax_id", "family_tax_id", "genus_tax_id", "species_tax_id", "superkingdom_lineage", "phylum_lineage", "order_lineage", "class_lineage", "family_lineage", "genus_lineage", "species_lineage"]
res_split.fillna(value='no_affi', inplace = True)
levels_columns=['tax_id_by_level','lineage_by_level','name_contigs','nb_contigs', 'nb_reads', 'depth']
level_superkingdom = res_split.groupby(['superkingdom_tax_id','superkingdom_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_superkingdom.columns=levels_columns
level_phylum = res_split.groupby(['phylum_tax_id','phylum_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_phylum.columns=levels_columns
level_order = res_split.groupby(['order_tax_id','order_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_order.columns=levels_columns
level_class = res_split.groupby(['class_tax_id','class_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_class.columns=levels_columns
level_family = res_split.groupby(['family_tax_id','family_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_family.columns=levels_columns
level_genus = res_split.groupby(['genus_tax_id','genus_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_genus.columns=levels_columns
level_species = res_split.groupby(['species_tax_id','species_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_species.columns=levels_columns
# Write merge data frame in output files.
res.to_csv(args.output_name + ".tsv", sep="\t", index=False)
level_superkingdom.to_csv(args.output_name + "_by_superkingdom.tsv", sep="\t", index=False)
level_phylum.to_csv(args.output_name + "_by_phylum.tsv", sep="\t", index=False)
level_order.to_csv(args.output_name + "_by_order.tsv", sep="\t", index=False)
level_class.to_csv(args.output_name + "_by_class.tsv", sep="\t", index=False)
level_family.to_csv(args.output_name + "_by_family.tsv", sep="\t", index=False)
level_genus.to_csv(args.output_name + "_by_genus.tsv", sep="\t", index=False)
level_species.to_csv(args.output_name + "_by_species.tsv", sep="\t", index=False)
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
import pandas as pd
import logging
def parse_arguments():
    """Manage parameters."""
    parser = ArgumentParser(description='Script which merges quantifications '
                            'and lineage into one matrix for one sample.',
                            formatter_class=ArgumentDefaultsHelpFormatter)
    parser.add_argument('-s', '--sam_coverage', required=True,
                        help='depth per contig from the samtools coverage tool.')
    parser.add_argument('-c', '--contig_tax_affi', required=True,
                        help='.percontig.tsv file.')
    parser.add_argument('-o', '--output_name', required=True,
                        help='name of the output file containing counts of contigs '
                             'and reads for each lineage.')
    parser.add_argument('-v', '--version', action='version', version=__version__)
    parser.add_argument('--verbose', help='increase output verbosity',
                        action='store_true')
    return parser.parse_args()
def main():
    args = parse_arguments()
    if args.verbose:
        logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.DEBUG)
        logging.info('Verbose mode ON')
    else:
        logging.basicConfig(format="%(levelname)s: %(message)s")

    sam_coverage_file = args.sam_coverage
    contig_taxaffi_file = args.contig_tax_affi
    output_name = args.output_name

    ranks = ["superkingdom", "phylum", "order", "class",
             "family", "genus", "species"]

    logging.info("Read and merge tables")
    cov_df = pd.read_csv(sam_coverage_file, delimiter='\t')
    contig_taxaffi_df = pd.read_csv(contig_taxaffi_file, delimiter='\t', dtype=str)
    logging.debug(cov_df)
    logging.debug(contig_taxaffi_df)

    depth_tax_contig_df = pd.merge(cov_df, contig_taxaffi_df,
                                   left_on='#rname', right_on='#contig', how='outer')

    # Fill NaN values to keep unmapped contigs.
    depth_tax_contig_df['consensus_lineage'] = depth_tax_contig_df['consensus_lineage'].fillna('Unknown')
    depth_tax_contig_df['tax_id_by_level'] = depth_tax_contig_df['tax_id_by_level'].fillna(1)
    depth_tax_contig_df['consensus_tax_id'] = depth_tax_contig_df['consensus_tax_id'].fillna(1)

    logging.info("Group by lineage")
    groupby_cols = ['consensus_lineage', 'consensus_tax_id', 'tax_id_by_level']
    depth_lineage_df = depth_tax_contig_df.groupby(groupby_cols).agg({
        '#rname': [';'.join, 'count'],
        'numreads': 'sum',
        'meandepth': 'mean'}).reset_index()
    depth_lineage_df.columns = ['lineage_by_level', 'consensus_tax_id', 'tax_id_by_level',
                                'name_contigs', 'nb_contigs', 'nb_reads', 'depth']

    logging.info(f"Write out {output_name}.tsv")
    depth_lineage_df.to_csv(f"{output_name}.tsv", sep="\t", index=False)

    # Split the lineage per taxonomic rank.
    ranks_taxid = [f"{r}_taxid" for r in ranks]
    ranks_lineage = [f"{r}_lineage" for r in ranks]
    try:
        depth_lineage_df[ranks_taxid] = depth_lineage_df['tax_id_by_level'].str.split(pat=";", expand=True)
        depth_lineage_df[ranks_lineage] = depth_lineage_df["lineage_by_level"].str.split(pat=";", expand=True)
    except ValueError:
        # Handle the case where lineage_by_level only contains
        # "Unable to found taxonomy consensus" or "Unknown".
        df_noaffi = pd.DataFrame("no_affi", index=range(len(depth_lineage_df)),
                                 columns=ranks_taxid + ranks_lineage)
        depth_lineage_df = pd.concat([depth_lineage_df, df_noaffi], axis=1)

    depth_lineage_df = depth_lineage_df.fillna(value='no_affi')

    # Group by each rank and write the resulting table.
    levels_columns = ['tax_id_by_level', 'lineage_by_level', 'name_contigs',
                      'nb_contigs', 'nb_reads', 'depth']
    logging.info("Group by rank")
    for rank in ranks:
        depth_rank_lineage_df = depth_lineage_df.groupby([f'{rank}_taxid', f'{rank}_lineage']).agg({
            'name_contigs': [';'.join],
            'nb_contigs': 'sum',
            'nb_reads': 'sum',
            'depth': 'mean'}).reset_index()
        depth_rank_lineage_df.columns = levels_columns
        depth_rank_lineage_df['rank'] = rank
        logging.info(f"Write out {output_name}_by_{rank}.tsv")
        depth_rank_lineage_df.to_csv(f"{output_name}_by_{rank}.tsv", sep="\t", index=False)

if __name__ == '__main__':
    main()
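# --- Illustrative check (not part of the original script; data fabricated) ---
# The rank-wise aggregation in main() can be sanity-checked on a toy frame:
# contigs sharing a phylum are concatenated and their counts summed.
import pandas as pd

toy = pd.DataFrame({
    "phylum_taxid": ["1239", "1239", "976"],
    "phylum_lineage": ["Firmicutes", "Firmicutes", "Bacteroidetes"],
    "name_contigs": ["c1", "c2", "c3"],
    "nb_contigs": [1, 1, 1],
    "nb_reads": [120, 80, 40],
    "depth": [10.0, 6.0, 3.0],
})
by_phylum = toy.groupby(["phylum_taxid", "phylum_lineage"]).agg({
    "name_contigs": [";".join],
    "nb_contigs": "sum",
    "nb_reads": "sum",
    "depth": "mean"}).reset_index()
print(by_phylum)  # Firmicutes: c1;c2, 2 contigs, 200 reads, mean depth 8.0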
@@ -22,7 +22,8 @@ regexes = {
'Prokka': ['v_prokka.txt', r"prokka (\S+)"],
'Kaiju': ['v_kaiju.txt', r"Kaiju (\S+)"],
'Samtools': ['v_samtools.txt', r"samtools (\S+)"],
'Bedtools': ['v_bedtools.txt', r"bedtools v(\S+)"]
'Bedtools': ['v_bedtools.txt', r"bedtools v(\S+)"],
'Eggnog-Mapper': ['v_eggnogmapper.txt', r"emapper-(\S+)"]
}
results = OrderedDict()
results['metagWGS'] = '<span style="color:#999999;\">N/A</span>'
@@ -44,6 +45,7 @@ results['Prokka'] = '<span style="color:#999999;\">N/A</span>'
results['Kaiju'] = '<span style="color:#999999;\">N/A</span>'
results