Commit d90c0efa authored by Celine Noirot

Merge branch 'dev' into 'master'

DSL2

Closes #108

See merge request !15
parents 54210901 529e7714
Pipeline #48016 skipped
@@ -5,31 +5,17 @@ image:
stages:
- build
- deploy
# Build Singularity container bwa_v0.7.17.sif
singularity-image:
singularity-image-metag:
stage: build
script:
- singularity build metagWGS.sif env/Singularity_recipe_metagWGS
- singularity build eggnog_mapper.sif env/Singularity_recipe_eggnog_mapper
artifacts:
paths:
- metagWGS.sif
- eggnog_mapper.sif
only:
changes:
- .gitlab-ci.yml
- env/*
- singularity push --docker-username "${CI_REGISTRY_USER}" --docker-password "${CI_REGISTRY_PASSWORD}" metagWGS.sif oras://"$CI_REGISTRY_IMAGE"/"$CI_PROJECT_NAME":"$CI_COMMIT_TAG"
when: manual
# Push the image template.sif on the registry
deploy:
stage: deploy
singularity-image-eggnog:
stage: build
script:
- singularity push --docker-username "${CI_REGISTRY_USER}" --docker-password "${CI_REGISTRY_PASSWORD}" metagWGS.sif oras://"$CI_REGISTRY_IMAGE"/"$CI_PROJECT_NAME":"$CI_COMMIT_TAG"
- singularity build eggnog_mapper.sif env/Singularity_recipe_eggnog_mapper
- singularity push --docker-username "${CI_REGISTRY_USER}" --docker-password "${CI_REGISTRY_PASSWORD}" eggnog_mapper.sif oras://"$CI_REGISTRY_IMAGE"/eggnog_mapper:"$CI_COMMIT_TAG"
only:
changes:
- .gitlab-ci.yml
- env/*
when: manual
\ No newline at end of file
when: manual
# metagWGS
# metagWGS: Documentation
## Introduction
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp).
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150 bp; PacBio HiFi reads, single-end).
### Pipeline graphical representation
The workflow processes raw data from `.fastq` or `.fastq.gz` inputs and do the modules represented into this figure:
The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) in `.fa/.fasta` format, and uses the modules represented in this figure:
![](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/raw/master/docs/Pipeline.png)
### metagWGS steps
metagWGS is splitted into different steps that correspond to different parts of the bioinformatics analysis:
metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis:
* `01_clean_qc` (can be skipped)
* `S01_CLEAN_QC` (can be stopped at with `--stop_at_clean`; can be skipped with `--skip_clean`)
* trims adapter sequences and removes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
* removes host contaminants ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* checks the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
* performs taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_kaiju_results.py))
* `02_assembly`
* assembles cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
* `S02_ASSEMBLY` (can be stopped at with `--stop_at_assembly`)
* assembles cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
* assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
* deduplicates cleaned reads (combined with `01_clean_qc` step) or raw reads (combined with `--skip_01_clean_qc` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `03_filtering` (can be skipped)
* deduplicates cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `S03_FILTERING` (can be stopped at with `--stop_at_filtering`; can be skipped with `--skip_assembly`)
* filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `04_structural_annot`
* `S04_STRUCTURAL_ANNOT` (can be stopped at with `--stop_at_structural_annot`)
* performs structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Rename_contigs_and_genes.py))
* `05_alignment`
* `S05_ALIGNMENT`
* aligns reads to the contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/))
* aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `06_func_annot`
* `S06_FUNC_ANNOT` (can be skipped with `--skip_func_annot`)
* performs per-sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/cd_hit_produce_table_clstr.py))
* quantifies reads that align to the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Quantification_clusters.py))
* performs functional annotation of genes and quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_functional_annotation.py))
* `07_taxo_affi`
* `S07_TAXO_AFFI` (can be skipped with `--skip_taxo_affi`)
* taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
* taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
* counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_contig_lineage.py))
* `08_binning` from [nf-core/mag 1.0.0](https://github.com/nf-core/mag/releases/tag/1.0.0)
* makes binning of contigs ([MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/))
* assesses bins ([BUSCO](https://busco.ezlab.org/) + [metaQUAST](http://quast.sourceforge.net/metaquast) + [summary_busco.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/summary_busco.py) and [combine_tables.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/combine_tables.py) from [nf-core/mag](https://github.com/nf-core/mag))
* taxonomically affiliates the bins ([BAT](https://github.com/dutilh/CAT))
* `S08_BINNING` (not yet implemented)
* binning strategies for assemblies and co-assemblies
All steps are launched one after another by default. Use the `--stop_at_[STEP]` and `--skip_[STEP]` parameters to tailor execution to your needs: for example, `--stop_at_assembly` ends the run after the `S02_ASSEMBLY` step, and `--skip_func_annot --skip_taxo_affi` runs everything except those two annotation steps.
A report html file is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
@@ -49,26 +49,15 @@ Three [Singularity](https://sylabs.io/docs/) containers are available making ins
## Documentation
metagWGS documentation is available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/master/docs).
## License
metagWGS is distributed under the GNU General Public License v3.
## Copyright
2021 INRAE
## Funded by
Anti-Selfish (Labex ECOFECT – N° 00002455-CT15000562)
France Génomique National Infrastructure (funded as part of Investissement d’avenir program managed by Agence Nationale de la Recherche, contract ANR-10-INBS-09)
With participation of SeqOccIn members financed by FEDER-FSE MIDI-PYRENEES ET GARONNE 2014-2020.
## Citation
metagWGS has been presented at JOBIM 2020:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, C. Noirot, C. Klopp, P. Pinton, S. Combes, C. Hoede, G. Pascal.
https://www.sfbi.fr/sites/sfbi.fr/files/jobim/jobim2020/posters/compressed/jobim2020_poster_9.pdf
metagWGS has been presented at JOBIM 2019 and at Genotoul Biostat Bioinfo day:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.
The metagWGS documentation can be found in the following pages:
* [Installation](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md)
* The pipeline installation procedure.
* [Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md)
* An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
* [Output](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/output.md)
* An overview of the different output files and directories produced by the pipeline.
* [Use case](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md)
* A tutorial to learn how to launch the pipeline on a test dataset on [genologin cluster](http://bioinfo.genotoul.fr/).
* [Functional tests](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md)
* (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.
report_comment: >
This report has been generated by the <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">genotoul-bioinfo/metagwgs</a>
analysis pipeline. For information about how to interpret these results, please see the
<a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">documentation</a>.
extra_fn_clean_trim:
- "hifi_"
- '.count_reads_on_contigs'
- '_scaffolds'
- '.txt'
- '.contigs'
- '.sort'
module_order:
- fastqc:
name: 'FastQC'
path_filters:
- '*fastqc.zip'
- quast:
name: 'Quast primary assembly'
info: 'This section of the report shows quast results after assembly'
path_filters:
- '*quast_hifi/*/report.tsv'
- quast:
name: 'Quast filtered assembly'
info: 'This section of the report shows quast results after filtering of assembly'
path_filters:
- '*quast_filtered/*/report.tsv'
- prokka
- featureCounts
prokka_fn_snames: True
prokka_table: True
featurecounts:
fn: '*.summary'
shared: true
table_columns_visible:
FastQC:
percent_duplicates: False
percent_gc: False
@@ -34,11 +34,6 @@ module_order:
info: 'This section reports the alignment of reads against the host genome with bwa.'
path_filters:
- '*.no_filter.flagstat'
- samtools:
name : 'Reads aln on host genome'
info: 'This section of the cleaned reads alignement against host genome with bwa.'
path_filters:
- '*host_filter/*'
- samtools:
name: 'Reads after host reads filter'
info: 'This section reports the alignment of cleaned reads against the host genome with bwa.'
@@ -54,12 +49,12 @@ module_order:
name: 'Quast primary assembly'
info: 'This section of the report shows quast results after assembly'
path_filters:
- '*_all_contigs_QC/*'
- '*quast_primary/*/report.tsv'
- quast:
name: 'Quast filtered assembly'
info: 'This section of the report shows quast results after filtering of assembly'
path_filters:
- '*_select_contigs_QC/*'
- '*quast_filtered/*/report.tsv'
- samtools:
name: 'Reads after deduplication'
info: 'This section reports the alignment of deduplicated reads against contigs with bwa.'
@@ -145,7 +145,7 @@ print(str(datetime.now()))
# For each count file (output of featureCounts), read its lines one by one,
# extract the gene name and count number, and increment the corresponding
# value in d_count_by_g_clstr.
for (count_idx,counts_path) in enumerate(files_of_counts):
for (count_idx,counts_path) in enumerate(sorted(files_of_counts)):
with open(counts_path) as fh:
for f_gene_counts in fh:
if f_gene_counts.startswith('#') \
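Iterating over the count files in sorted order makes the `(count_idx, file)` pairing deterministic, so repeated runs build the gene-cluster count table with the same file-to-index mapping. A minimal sketch of the effect, with hypothetical featureCounts output names (the real script fills `d_count_by_g_clstr` from each file):

```python
# Hypothetical file names; the point is that enumerate() over sorted() yields
# the same (index, file) pairs regardless of how the input list was ordered.
files_of_counts = ["sampleB.featureCounts", "sampleA.featureCounts"]
for count_idx, counts_path in enumerate(sorted(files_of_counts)):
    print(count_idx, counts_path)
# 0 sampleA.featureCounts
# 1 sampleB.featureCounts
```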
@@ -79,7 +79,7 @@ to_write = []
# contig_renames[old_name] = newname
# rewriting of the fasta
with open(args.fnaFile, "rU") as fnaFile,\
with open(args.fnaFile, "r") as fnaFile,\
open(args.outFNAFile, "w") as outFNA_handle:
for record in SeqIO.parse(fnaFile, "fasta"):
try :
@@ -112,7 +112,13 @@ with open(args.file) as gffFile,\
# Generate correspondence
old_prot_name = feature.qualifiers['ID'][0].replace("_gene","")
prot_number = old_prot_name.split("_")[-1]
new_prot_name = new_ctg_name + "." + prot_prefix + prot_number
subfeat_types = {subfeat.type for subfeat in feature.sub_features}
assert len(subfeat_types) == 1, f'Subfeatures have different types {subfeat_types}'
subfeat_type = subfeat_types.pop()
new_prot_name = f"{new_ctg_name}.{subfeat_type}_{prot_number}"
prot_names[old_prot_name] = new_prot_name
fh_prot_table.write(old_prot_name + "\t" + new_prot_name + "\n")
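A small illustration of the renaming scheme introduced above, with hypothetical contig and gene identifiers (the real names come from the Prokka GFF and the contig rename table):

```python
# Hypothetical inputs: a Prokka gene identifier (after stripping the "_gene"
# suffix) and an already-renamed contig whose sub-features are all "CDS",
# as enforced by the assert above.
old_prot_name = "PROKKA_00012"
new_ctg_name = "sample1_c1"
subfeat_type = "CDS"

prot_number = old_prot_name.split("_")[-1]
new_prot_name = f"{new_ctg_name}.{subfeat_type}_{prot_number}"
print(new_prot_name)  # sample1_c1.CDS_00012
```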
@@ -134,7 +140,7 @@ with open(args.file) as gffFile,\
GFF.write(to_write, out_handle)
with open(args.fastaFile, "rU") as handle,\
with open(args.fastaFile, "r") as handle,\
open(args.outFAAFile, "w") as outFasta_handle:
for record in SeqIO.parse(handle, "fasta"):
try :
@@ -147,7 +153,7 @@ with open(args.fastaFile, "rU") as handle,\
pass
with open(args.ffnFile, "rU") as handle,\
with open(args.ffnFile, "r") as handle,\
open(args.outFFNFile, "w") as outFFN_handle:
for record in SeqIO.parse(handle, "fasta"):
try :
#!/usr/bin/env python
"""----------------------------------------------------------------------------
Script Name: best_hit_diamond.py
Description: Keep the best diamond hits (highest bitscore) for each gene/protein
Input files: Diamond output file (.m8)
Created By: Joanna Fourquet
Date: 2021-01-13
-------------------------------------------------------------------------------
"""
# Metadata
__author__ = 'Joanna Fourquet \
- GenPhySE - NED'
__copyright__ = 'Copyright (C) 2021 INRAE'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Modules importation
try:
    import argparse
    import pandas as p
    import re
    import sys
    import os
    import operator
    from collections import defaultdict
    from collections import OrderedDict
    from collections import Counter
    from matplotlib import pyplot
except ImportError as error:
    print(error)
    exit(1)
def read_blast_input(blastinputfile):
    #c1.Prot_00001 EFK63346.1 100.0 85 0 0 1 85 62 146 1.6e-36 158.3 85 \
    # 146 EFK63346.1 LOW QUALITY PROTEIN: hypothetical protein HMPREF9008_04720, partial [Parabacteroides sp. 20_3]
    #queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,
    #queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle
    score = defaultdict(float)
    best_lines = defaultdict(list)
    nmatches = defaultdict(int)
    for line in open(blastinputfile):
        (queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, \
         queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle) \
            = line.rstrip().split("\t")
        if (nmatches[queryId] == 0):
            score[queryId] = float(bitScore)
            nmatches[queryId] += 1
            best_lines[queryId] = [line]
        else:
            if (nmatches[queryId] > 0 and float(bitScore) > score[queryId]):
                score[queryId] = float(bitScore)
                best_lines[queryId] = [line]
            else:
                if (nmatches[queryId] > 0 and float(bitScore) == score[queryId]):
                    best_lines[queryId].append(line)
    return(best_lines)
def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("aln_input_file", \
        help = "file with blast/diamond matches expected format m8 \
        \nqueryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,\
        queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore")
    parser.add_argument('-o','--output_file', type = str, \
        default = "best_hit.tsv", help = ("string specifying output file"))
    args = parser.parse_args()

    out_lines = read_blast_input(args.aln_input_file)

    with open(args.output_file, "w") as out:
        for id in out_lines.keys():
            for line in out_lines[id]:
                out.write(line)
    print("Finished")

if __name__ == "__main__":
    main(sys.argv[1:])
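A quick way to check the tie-keeping behaviour of `read_blast_input`, assuming the script is importable as `best_hit_diamond` (its name in the repository may differ) and that its dependencies (pandas, matplotlib) are installed. All values below are made up; the `.m8` rows carry the 15 columns listed in the comments above:

```python
# Two hits for gene.1 share the top bitscore, so both lines are kept;
# gene.2 has a single hit, which is kept as its best hit.
from best_hit_diamond import read_blast_input  # assumed module name

rows = [
    "gene.1\tEFK63346.1\t100.0\t85\t0\t0\t1\t85\t62\t146\t1.6e-36\t158.3\t85\t146\tprotein A",
    "gene.1\tABC00001.1\t98.8\t85\t1\t0\t1\t85\t62\t146\t2.1e-36\t158.3\t85\t146\tprotein B",
    "gene.2\tXYZ00002.1\t90.0\t80\t8\t0\t1\t80\t10\t89\t1.0e-20\t100.1\t80\t150\tprotein C",
]
with open("tiny.m8", "w") as fh:
    fh.write("\n".join(rows) + "\n")

best = read_blast_input("tiny.m8")
assert len(best["gene.1"]) == 2 and len(best["gene.2"]) == 1
```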
#!/usr/bin/env python
#USAGE: ./combine_tables.py <BUSCO_table> <QUAST_table>
import pandas as pd
from sys import stdout
from sys import argv
# Read files
file1 = pd.read_csv(argv[1], sep="\t")
file2 = pd.read_csv(argv[2], sep="\t")
# Merge files
result = pd.merge(file1, file2, left_on="GenomeBin", right_on="Assembly", how='outer')
# Print to stdout
result.to_csv(stdout, sep='\t')
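For reference, a minimal illustration of the outer merge used above, with made-up column values (the real tables come from the BUSCO summary and the QUAST report):

```python
import pandas as pd

# Made-up summaries: bin.2 appears only in the BUSCO table, bin.3 only in the QUAST table.
busco = pd.DataFrame({"GenomeBin": ["bin.1", "bin.2"], "Completeness": [95.0, 60.2]})
quast = pd.DataFrame({"Assembly": ["bin.1", "bin.3"], "N50": [45210, 12034]})

# how='outer' keeps bins present in only one of the two tables (NaN elsewhere).
merged = pd.merge(busco, quast, left_on="GenomeBin", right_on="Assembly", how="outer")
print(merged.to_csv(sep="\t"))
```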
@@ -22,16 +22,15 @@ __status__ = 'dev'
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter, FileType
import logging
import sys
import csv
def get_hits_with_highest_bitscore(hits):
highest_bitscore = max([float(hit['bitScore']) for hit in hits])
return [hit for hit in hits if float(hit['bitScore']) == highest_bitscore]
highest_bitscore = max([float(hit['bitscore']) for hit in hits])
return [hit for hit in hits if float(hit['bitscore']) == highest_bitscore]
def get_all_hits_per_query(blast_result_file, header_list):
def get_all_hits_per_query(blast_result_file):
# Assertion: Hits are already sorted by query in the diamond output.
# Both commands should output the same number of lines:
# cut -f1 blast_result_file | uniq | wc -l
@@ -39,7 +38,7 @@ def get_all_hits_per_query(blast_result_file, header_list):
with open(blast_result_file) as in_fl:
result_reader = csv.DictReader(in_fl, delimiter='\t', fieldnames=header_list)
result_reader = csv.DictReader(in_fl, delimiter='\t')
query_ids_processed = []
@@ -49,13 +48,13 @@ def get_all_hits_per_query(blast_result_file, header_list):
for hit in result_reader:
if not current_query_id:
current_query_id = hit['queryId']
current_query_id = hit['qseqid']
if current_query_id and current_query_id != hit['queryId']:
if current_query_id and current_query_id != hit['qseqid']:
yield hits
hits = []
current_query_id = hit['queryId']
current_query_id = hit['qseqid']
assert current_query_id not in query_ids_processed, f"Queries are not sorted in the blast result. Query {current_query_id} is found in different parts of the file."
query_ids_processed.append(current_query_id)
@@ -68,8 +67,8 @@ def get_all_hits_per_query(blast_result_file, header_list):
def is_identity_and_coverage_ok(hit, min_identity, min_coverage):
qcovhsp = (int(hit["queryEnd"]) - int(hit["queryStart"]) + 1) / int(hit['queryLength'])
if float(hit['percIdentity']) >= min_identity or qcovhsp >= min_coverage:
qcovhsp = (int(hit["qend"]) - int(hit["qstart"]) + 1) / int(hit['qlen'])
if float(hit['pident']) >= min_identity or qcovhsp >= min_coverage:
return True
return False
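A worked example of the filter above, with illustrative numbers (the keys follow the new diamond header names):

```python
# An 85-residue HSP on a 146-residue query covers (85 - 1 + 1) / 146 ≈ 0.58 of it,
# so the hit passes when min_coverage <= 0.58 or when pident >= min_identity.
hit = {"qstart": "1", "qend": "85", "qlen": "146", "pident": "97.5"}
qcovhsp = (int(hit["qend"]) - int(hit["qstart"]) + 1) / int(hit["qlen"])
print(round(qcovhsp, 2))  # 0.58
```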
@@ -81,8 +80,7 @@ def parse_arguments():
parser.add_argument('aln_input_file',
help="File with blast/diamond matches expected format m8 \
\nqueryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,\
queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore")
\nqseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen stitle")
parser.add_argument('-o', '--output_file', type=str,
default="best_hit.tsv", help=("string specifying output file path"))
@@ -111,20 +109,19 @@ def main():
else:
logging.basicConfig(format="%(levelname)s: %(message)s")
headers = "queryId subjectId percIdentity alnLength mismatchCount gapOpenCount queryStart queryEnd subjectStart subjectEnd eVal bitScore queryLength subjectLength subjectTitle"
header_list = headers.split(' ')
blast_result = args.aln_input_file
outfile = args.output_file
min_coverage = args.min_coverage
min_identity = args.min_identity
best_hit_count = 0
query_count_with_low_hit = 0
with open(blast_result) as f:
header = f.readline().rstrip().split('\t')
with open(outfile, 'w') as out_fl:
writer = csv.DictWriter(out_fl, fieldnames=header_list, delimiter='\t')
for query_i, query_hits in enumerate(get_all_hits_per_query(blast_result, header_list)):
writer = csv.DictWriter(out_fl, delimiter='\t',fieldnames=header)
writer.writeheader()
for query_i, query_hits in enumerate(get_all_hits_per_query(blast_result)):
if query_i % 10000 == 0:
logging.info(f'{query_i} queries processed... ')
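The rewrite assumes the diamond result now starts with a header line: `csv.DictReader` with no explicit `fieldnames` takes its field names from that first row, and the same header list is reused for the `DictWriter`. A minimal self-contained illustration (file name and columns are made up):

```python
import csv

# Made-up two-column tabular file with a header line.
with open("mini.tsv", "w") as fh:
    fh.write("qseqid\tbitscore\n")
    fh.write("gene.1\t158.3\n")

with open("mini.tsv") as in_fl:
    reader = csv.DictReader(in_fl, delimiter='\t')  # field names come from the first row
    first_hit = next(reader)

print(first_hit["qseqid"], first_hit["bitscore"])  # gene.1 158.3
```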
@@ -80,7 +80,7 @@ with open(args.list_of_file_diamond) as fdiamond_list:
concat_eggnog_mapper_files = pd.DataFrame()
# Concatenate annotation files.
for (annotations_idx,annotations_path) in enumerate(files_of_annotations):
for (annotations_idx,annotations_path) in enumerate(sorted(files_of_annotations)):
eggnog_mapper_file = pd.read_csv(annotations_path, delimiter='\t', decimal='.',skiprows=4)
concat_eggnog_mapper_files = pd.concat([concat_eggnog_mapper_files, eggnog_mapper_file])
@@ -88,7 +88,7 @@ for (annotations_idx,annotations_path) in enumerate(files_of_annotations):
concat_diamond_files = pd.DataFrame()
# Concatenate diamond files.
for (diamond_idx,diamond_path) in enumerate(diamond_files):
for (diamond_idx,diamond_path) in enumerate(sorted(diamond_files)):
diamond_columns = ["qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore","qlen","slen","stitle"]
diamond_file = pd.read_csv(diamond_path, delimiter='\t', decimal='.', header=None, names=diamond_columns)
diamond_file.loc[:,"sseqid"] = 'https://www.ncbi.nlm.nih.gov/protein/' + diamond_file.loc[:,"sseqid"]
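The `sseqid` rewrite above simply prepends the NCBI protein URL to each accession, so the merged annotation table carries clickable links; a minimal illustration with made-up accessions:

```python
import pandas as pd

# Made-up accessions standing in for the diamond sseqid column.
diamond_file = pd.DataFrame({"sseqid": ["EFK63346.1", "WP_000001.1"]})
diamond_file.loc[:, "sseqid"] = 'https://www.ncbi.nlm.nih.gov/protein/' + diamond_file.loc[:, "sseqid"]
print(diamond_file["sseqid"].tolist())
# ['https://www.ncbi.nlm.nih.gov/protein/EFK63346.1', 'https://www.ncbi.nlm.nih.gov/protein/WP_000001.1']
```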
@@ -3,119 +3,130 @@
"""--------------------------------------------------------------------
Script Name: merge_contig_quantif_perlineage.py
Description: merge quantifications and lineage into one matrix for one sample.
Input files: idxstats file, depth from mosdepth (bed.gz) and lineage percontig.tsv file.
Input files: depth from samtools coverage and lineage percontig.tsv file.
Created By: Joanna Fourquet
Date: 2021-01-19
-----------------------------------------------------------------------
-----------------------------------------------------------------------
"""
# Metadata.
__author__ = 'Joanna Fourquet \
- GenPhySE - NED'
__author__ = 'Joanna Fourquet, Jean Mainguy'
__copyright__ = 'Copyright (C) 2021 INRAE'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev.
# Modules importation.
try:
    import argparse
    import re
    import sys
    import pandas as pd
    import numpy as np
    from datetime import datetime
except ImportError as error:
    print(error)
    exit(1)
# Print time.
print(str(datetime.now()))
# Manage parameters.
parser = argparse.ArgumentParser(description = 'Script which \
merges quantifications and lineage into one matrix for one sample.')
parser.add_argument('-i', '--idxstats_file', required = True, \
help = 'idxstats file.')
parser.add_argument('-m', '--mosdepth_file', required = True, \
help = 'depth per contigs from mosdepth (regions.bed.gz).')
parser.add_argument('-c', '--percontig_file', required = True, \
help = '.percontig.tsv file.')
parser.add_argument('-o', '--output_name', required = True, \
help = 'Name of output file containing counts of contigs and reads \
for each lineage.')
parser.add_argument('-v', '--version', action = 'version', \
version = __version__)
args = parser.parse_args()
# Recovery of idxstats file.
idxstats = pd.read_csv(args.idxstats_file, delimiter='\t', header=None)
idxstats.columns = ["contig","len","mapped","unmapped"]
# Recovery of mosdepth file; remove start/end columns
mosdepth = pd.read_csv(args.mosdepth_file, delimiter='\t', header=None,compression='gzip')
mosdepth.columns = ["contig","start","end","depth"]
mosdepth.drop(["start","end"], inplace=True,axis=1)
# Recovery of .percontig.tsv file.
percontig = pd.read_csv(args.percontig_file, delimiter='\t', dtype=str)
# Merge idxstats and .percontig.tsv files.
merge = pd.merge(idxstats,percontig,left_on='contig',right_on='#contig', how='outer')
# Add depth
merge = pd.merge(merge,mosdepth,left_on='contig',right_on='contig', how='outer')
# Fill NaN values to keep unmapped contigs.
merge['consensus_lineage'] = merge['consensus_lineage'].fillna('Unknown')
merge['tax_id_by_level'] = merge['tax_id_by_level'].fillna(1)
merge['consensus_tax_id'] = merge['consensus_tax_id'].fillna(1)
# Group by lineage and sum number of reads and contigs.
res = merge.groupby(['consensus_lineage','consensus_tax_id', 'tax_id_by_level']).agg({'contig' : [';'.join, 'count'], 'mapped': 'sum', 'depth': 'mean'}).reset_index()
res.columns=['lineage_by_level', 'consensus_tax_id', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'nb_reads', 'depth']
# Fill NaN values with 0.
res.fillna(0, inplace=True)
# Split by taxonomic level
res_split_tax_id = res.join(res['tax_id_by_level'].str.split(pat=";",expand=True))
res_split_tax_id.columns=['consensus_lineage', 'consensus_taxid', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'depth', 'nb_reads', "superkingdom_tax_id", "phylum_tax_id", "order_tax_id", "class_tax_id", "family_tax_id", "genus_tax_id", "species_tax_id"]
res_split_tax_id.fillna(value='no_affi', inplace = True)
print(res_split_tax_id.head())
res_split = res_split_tax_id.join(res_split_tax_id['consensus_lineage'].str.split(pat=";",expand=True))
res_split.columns=['consensus_lineage', 'consensus_taxid', 'tax_id_by_level', 'name_contigs', 'nb_contigs', 'nb_reads', 'depth', "superkingdom_tax_id", "phylum_tax_id", "order_tax_id", "class_tax_id", "family_tax_id", "genus_tax_id", "species_tax_id", "superkingdom_lineage", "phylum_lineage", "order_lineage", "class_lineage", "family_lineage", "genus_lineage", "species_lineage"]
res_split.fillna(value='no_affi', inplace = True)
levels_columns=['tax_id_by_level','lineage_by_level','name_contigs','nb_contigs', 'nb_reads', 'depth']
level_superkingdom = res_split.groupby(['superkingdom_tax_id','superkingdom_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_superkingdom.columns=levels_columns
level_phylum = res_split.groupby(['phylum_tax_id','phylum_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_phylum.columns=levels_columns
level_order = res_split.groupby(['order_tax_id','order_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_order.columns=levels_columns
level_class = res_split.groupby(['class_tax_id','class_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_class.columns=levels_columns
level_family = res_split.groupby(['family_tax_id','family_lineage']).agg({'name_contigs' : [';'.join], 'nb_contigs' : 'sum', 'nb_reads' : 'sum', 'depth': 'mean'}).reset_index()
level_family.columns=levels_columns