Commit 263e2acc authored by Pierre MARTIN

Merge branch 'dsl2' into 'dev'

Merge dsl2 into dev to finalize the dsl1-to-dsl2 transition

See merge request !12
parents 02a9545e 82951f63
Pipeline #46464 skipped
*.config linguist-language=groovy
*.nf linguist-language=groovy
*.nf gitlab-language=groovy
# metagWGS: Documentation
## Introduction
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150 bp; PacBio HiFi reads, single-end).
### Pipeline graphical representation
The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) in `.fa/.fasta` format, and uses the modules represented in this figure:
![](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/raw/master/docs/Pipeline.png)
### metagWGS steps
metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis:
* `S01_CLEAN_QC` (can be stopped at with `--stop_at_clean`; can be skipped with `--skip_clean`)
  * trims adapter sequences and removes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
  * removes host contaminant reads ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
  * checks the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
  * performs a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_kaiju_results.py))
* `S02_ASSEMBLY` (can be stopped at with `--stop_at_assembly`)
  * assembles cleaned reads (combined with the `S01_CLEAN_QC` step) or raw reads (combined with the `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
  * assesses the quality of the assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
  * deduplicates cleaned reads (combined with the `S01_CLEAN_QC` step) or raw reads (combined with the `--skip_clean` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `S03_FILTERING` (can be stopped at with `--stop_at_filtering`; can be skipped with `--skip_assembly`)
  * filters out contigs with a low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `S04_STRUCTURAL_ANNOT` (can be stopped at with `--stop_at_structural_annot`)
  * performs a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Rename_contigs_and_genes.py))
* `S05_ALIGNMENT`
  * aligns reads to the contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/))
  * aligns the protein sequences of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `S06_FUNC_ANNOT` (can be skipped with `--skip_func_annot`)
  * performs a per-sample and a global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/cd_hit_produce_table_clstr.py))
  * quantifies reads that align to the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/Quantification_clusters.py))
  * performs a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI` (can be skipped with `--skip_taxo_affi`)
  * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
  * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/aln2taxaffi.py))
  * counts the number of reads and contigs for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/bin/quantification_by_contig_lineage.py))
* `S08_BINNING` (not yet implemented)
  * binning strategies for assemblies and co-assemblies
All steps are launched one after another by default. Use the `--stop_at_[STEP]` and `--skip_[STEP]` parameters to tailor execution to your needs.
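For example, `--stop_at_clean` halts the workflow after the `S01_CLEAN_QC` step, and `--skip_func_annot` omits `S06_FUNC_ANNOT`.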
An HTML report is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
@@ -49,28 +49,15 @@ Two [Singularity](https://sylabs.io/docs/) containers are available making installation easier
## Documentation
The metagWGS documentation can be found in the following pages:
* [Installation](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/installation.md)
  * The pipeline installation procedure.
* [Usage](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/usage.md)
  * An overview of how the pipeline works, how to run it, and a description of all the command-line flags.
* [Output](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/output.md)
  * An overview of the different output files and directories produced by the pipeline.
* [Use case](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/docs/use_case.md)
  * A tutorial to learn how to launch the pipeline on a test dataset on the [genologin cluster](http://bioinfo.genotoul.fr/).
* [Functional tests](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/master/functional_tests/README.md)
  * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.
report_comment: >
This report has been generated by the <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">genotoul-bioinfo/metagwgs</a>
analysis pipeline. For information about how to interpret these results, please see the
<a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">documentation</a>.
extra_fn_clean_trim:
- "hifi_"
- '.count_reads_on_contigs'
- '_scaffolds'
- '.txt'
- '.contigs'
- '.sort'
module_order:
- fastqc:
name: 'FastQC'
path_filters:
- '*hifi_*.zip'
- quast:
name: 'Quast primary assembly'
info: 'This section of the report shows quast results after assembly'
path_filters:
- '*quast_hifi/*/report.tsv'
- prokka
- featureCounts
prokka_fn_snames: True
prokka_table: True
featurecounts:
fn: '*.summary'
shared: true
table_columns_visible:
FastQC:
percent_duplicates: False
percent_gc: False
@@ -34,11 +34,6 @@ module_order:
info: 'This section reports on the read alignment against the host genome with bwa.'
path_filters:
- '*.no_filter.flagstat'
- samtools:
name: 'Reads after host reads filter'
info: 'This section reports on the cleaned read alignment against the host genome with bwa.'
@@ -54,12 +49,12 @@ module_order:
name: 'Quast primary assembly'
info: 'This section of the report shows quast results after assembly'
path_filters:
- '*quast_primary/*/report.tsv'
- quast:
name: 'Quast filtered assembly'
info: 'This section of the report shows quast results after filtering of the assembly'
path_filters:
- '*quast_filtered/*/report.tsv'
- samtools:
name: 'Reads after deduplication'
info: 'This section reports on deduplicated read alignment against contigs with bwa.'
@@ -79,7 +79,7 @@ to_write = []
# contig_renames[old_name] = new_name
# rewrite the fasta file
with open(args.fnaFile, "rU") as fnaFile,\
with open(args.fnaFile, "r") as fnaFile,\
open(args.outFNAFile, "w") as outFNA_handle:
for record in SeqIO.parse(fnaFile, "fasta"):
try :
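The `"rU"` to `"r"` changes in this file drop Python 2's universal-newline flag, which is deprecated in Python 3 (plain `"r"` behaves identically). For context, here is a minimal sketch of the renaming pass this hunk belongs to, assuming Biopython is installed; the `contig_renames` mapping and file names are hypothetical (the real mapping is built earlier in the script):

```python
from Bio import SeqIO

contig_renames = {"NODE_1_length_5000_cov_12.5": "sample1_c1"}  # hypothetical mapping

with open("assembly.fna") as fna, open("renamed.fna", "w") as out:
    for record in SeqIO.parse(fna, "fasta"):
        try:
            record.id = contig_renames[record.id]   # rename known contigs
            record.description = ""                 # drop the stale description
        except KeyError:
            pass                                    # keep unknown IDs unchanged
        SeqIO.write(record, out, "fasta")
```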
@@ -112,7 +112,13 @@ with open(args.file) as gffFile,\
# Generate correspondence
old_prot_name = feature.qualifiers['ID'][0].replace("_gene","")
prot_number = old_prot_name.split("_")[-1]
subfeat_types = {subfeat.type for subfeat in feature.sub_features}
assert len(subfeat_types) == 1, f'Subfeatures have different types: {subfeat_types}'
subfeat_type = subfeat_types.pop()
new_prot_name = f"{new_ctg_name}.{subfeat_type}_{prot_number}"
prot_names[old_prot_name] = new_prot_name
fh_prot_table.write(old_prot_name + "\t" + new_prot_name + "\n")
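This hunk replaces the fixed `Prot_` prefix with the feature type taken from the GFF subfeatures, so a CDS and a tRNA on the same contig get distinct name families. A toy illustration of the new naming rule (all inputs here are hypothetical):

```python
# Toy version of the naming rule above; inputs are hypothetical.
def rename_gene(new_ctg_name, old_prot_name, subfeat_type):
    prot_number = old_prot_name.split("_")[-1]  # keep the original gene counter
    return f"{new_ctg_name}.{subfeat_type}_{prot_number}"

assert rename_gene("c1", "PROKKA_00042", "CDS") == "c1.CDS_00042"
assert rename_gene("c1", "PROKKA_00043", "tRNA") == "c1.tRNA_00043"
```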
@@ -134,7 +140,7 @@ with open(args.file) as gffFile,\
GFF.write(to_write, out_handle)
with open(args.fastaFile, "rU") as handle,\
with open(args.fastaFile, "r") as handle,\
open(args.outFAAFile, "w") as outFasta_handle:
for record in SeqIO.parse(handle, "fasta"):
try :
@@ -147,7 +153,7 @@ with open(args.fastaFile, "rU") as handle,\
pass
with open(args.ffnFile, "rU") as handle,\
with open(args.ffnFile, "r") as handle,\
open(args.outFFNFile, "w") as outFFN_handle:
for record in SeqIO.parse(handle, "fasta"):
try :
@@ -43,7 +43,7 @@ except ImportError as error:
# Variables
# These are identities normalized with query coverage:
MIN_IDENTITY_TAXA = (0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95)
# Fraction of weights needed to assign at a specific level,
# a measure of consensus at that level.
@@ -60,12 +60,14 @@ d_taxonomy = {}
# Define taxonomy main levels
global main_level
main_level = \
["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
# SAME AS Script_renameVcn.py
prot_prefix = "CDS_"
# Definition of the class Node
class Node:
def __init__(self):
self.tax_id = 0 # Number of the tax id.
@@ -73,11 +75,12 @@ class Node:
self.children = [] # List of the children of this node
self.tip = 0 # Tip=1 if it's a terminal node, 0 if not.
self.name = "" # Name of the node: taxa if it's a terminal node,
# its number if not.
self.level = "None"
def genealogy(self): # Trace genealogy from root to leaf
ancestors = [] # Initialize the list of all nodes
# from root to leaf.
tax_id = self.tax_id # Define leaf
while 1:
if tax_id in d_taxonomy:
@@ -91,9 +94,10 @@ class Node:
ancestors.append(tax_id)
break
return ancestors # Return the list
def fullnamelineage(self):  # Trace genealogy from root to leaf
ancestors = [] # Initialise the list of all nodes
# from root to leaf.
tax_id = self.tax_id # Define leaf
while 1:
if tax_id in d_taxonomy:
@@ -104,15 +108,16 @@ class Node:
if tax_id == "1":
break
ancestors.reverse()
return "; ".join(ancestors) # Return the list
return "; ".join(ancestors) # Return the list
def genealogy_main_level(self):
ancestors = ["None"] * 7 # Initialise the list of all nodes
# from root to leaf.
tax_id = self.tax_id
while 1:
if tax_id in d_taxonomy:
cur_level = d_taxonomy[tax_id].level
if cur_level in main_level:
ancestors[main_level.index(cur_level)] = tax_id
tax_id = d_taxonomy[tax_id].parent
else:
@@ -120,27 +125,30 @@ class Node:
if tax_id == "1":
# If it is the root, we reached the end.
break
return ancestors  # Return the list
def lineage_main_level(self):
ancestors = ["None"] * 7 # Initialise the list of all nodes
# from root to leaf.
ancestors_tax_id = ["None"] * 7 # Initialise the list of all nodes
tax_id = self.tax_id
while 1:
if tax_id in d_taxonomy:
cur_level = d_taxonomy[tax_id].level
if cur_level in main_level:
ancestors[main_level.index(cur_level)] = d_taxonomy[tax_id].name
ancestors_tax_id[main_level.index(cur_level)] = str(tax_id)
tax_id = d_taxonomy[tax_id].parent
else:
break
if tax_id == "1":
# If it is the root, we reached the end.
break
return ("; ".join(ancestors), "; ".join(ancestors_tax_id))# Return the two lists
return ("; ".join(ancestors), "; ".join(ancestors_tax_id)) # Return the two lists
# Function to find the common ancestor of two or more nodes
def common_ancestor(node_list):
global d_taxonomy
# Define the whole genealogy of the first node
@@ -150,15 +158,16 @@ def common_ancestor(node_list):
list2 = d_taxonomy[node].genealogy()
ancestral_list = []
for i in list1:
if i in list2:  # Identify nodes common to the two genealogies
ancestral_list.append(i)
list1 = ancestral_list  # Reassign ancestral_list to list1.
# Finally, the first node of the ancestral_list is the common ancestor
# of all nodes.
common_ancestor = ancestral_list[0]
# Return a node
return common_ancestor
def load_taxonomy(directory):
# Load taxonomy
global d_taxonomy
@@ -177,9 +186,8 @@ def load_taxonomy(directory):
d_name_by_tax_id[tax_id] = name # ... and load them
d_name_by_tax_id_reverse[name] = tax_id # ... into dictionaries
# Load taxonomy NCBI file ("nodes.dmp")
with open(os.path.join(directory, "nodes.dmp"), "r") as taxonomy_file:
for line in taxonomy_file:
line = line.rstrip().replace("\t", "")
tab = line.split("|")
@@ -194,10 +202,10 @@ def load_taxonomy(directory):
if tax_id not in d_taxonomy:
d_taxonomy[tax_id] = Node()
d_taxonomy[tax_id].tax_id = tax_id # Assign tax_id
d_taxonomy[tax_id].parent = tax_id_parent  # Assign tax_id parent
d_taxonomy[tax_id].name = name # Assign name
d_taxonomy[tax_id].level = str(tab[2].strip())  # Assign level
if tax_id_parent in d_taxonomy:
children = d_taxonomy[tax_id].children # If parent is already in the object
children.append(tax_id) # ... we found its children
d_taxonomy[tax_id].children = children # ... so add them to the parent.
@@ -205,6 +213,7 @@ def load_taxonomy(directory):
# END Functions for taxonomy taxdump.tar.gz
################################################
def read_query_length_file(query_length_file):
lengths = {}
for line in open(query_length_file):
@@ -212,16 +221,17 @@ def read_query_length_file(query_length_file):
lengths[queryid] = float(length)
return lengths
def read_blast_input(blastinputfile, lengths, min_identity, max_matches, min_coverage):
# c1.Prot_00001 EFK63346.1 100.0 85 0 0 1 85 62 146 1.6e-36 158.3 85 \
# 146 EFK63346.1 LOW QUALITY PROTEIN: hypothetical protein HMPREF9008_04720, partial [Parabacteroides sp. 20_3]
#queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,
#queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle
matches = defaultdict(list)
accs = Counter()
nmatches = Counter()
with open(blastinputfile) as blast_handler:
reader = csv.DictReader(blast_handler, delimiter='\t')
@@ -231,24 +241,24 @@ def read_blast_input(blastinputfile, lengths, min_identity, max_matches, min_coverag
# queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore, queryLength, subjectLength, subjectTitle) \
# = line.rstrip().split("\t")
if aln['sseqid'].startswith("gi|"):
m = re.search(r"gi\|.*?\|.*\|(.*)\|", aln['sseqid'])
acc = m.group(1)
else:
acc = aln['sseqid']
qLength = lengths[aln['qseqid']]
alnLength_in_query = abs(int(aln['qend']) - int(aln['qstart'])) + 1
fHit = float(alnLength_in_query) / qLength
coverage = fHit * 100
fHit *= float(aln['pident']) / 100.0
fHit = min(1.0, fHit)
#hits[queryId] = hits[queryId] + 1
if float(aln['pident']) > min_identity and nmatches[aln['qseqid']] < max_matches and float(coverage) > min_coverage:
matches[aln['qseqid']].append((acc, fHit))
nmatches[aln['qseqid']] += 1
accs[acc] += 1
return (OrderedDict(sorted(matches.items(), key=lambda t: t[0])), list(accs.keys()))
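`read_blast_input` scores each match as `fHit`: the fraction of the query covered by the alignment, scaled by percent identity and capped at 1.0. A worked example using the sample m8 line quoted in the comment above:

```python
# Values from the sample hit: qstart=1, qend=85, query length 85, 100% identity.
qstart, qend, qlen, pident = 1, 85, 85.0, 100.0

aln_length_in_query = abs(qend - qstart) + 1      # 85 aligned positions
coverage = aln_length_in_query / qlen * 100       # 100.0% query coverage
fhit = min(1.0, (aln_length_in_query / qlen) * (pident / 100.0))
assert fhit == 1.0  # identity normalized by query coverage
```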
def map_accessions(accs, mapping_file):
@@ -263,53 +273,55 @@ def map_accessions(accs, mapping_file):
return mappings
def get_consensus(collate_table):
# From the collated hits, retrieve the consensus tax_id and lineage
# Compute the best lineage consensus
for depth in range(6, -1, -1):
collate = collate_table[depth]
dWeight = sum(collate.values())
sortCollate = sorted(list(collate.items()), key=operator.itemgetter(1), reverse=True)
nL = len(collate)
if nL > 0:
dP = 0.0
if dWeight > 0.0:
dP = float(sortCollate[0][1]) / dWeight
if dP > MIN_FRACTION:
(fullnamelineage_text, fullnamelineage_ids) = d_taxonomy[str(sortCollate[0][0])].lineage_main_level()
tax_id_keep = str(sortCollate[0][0])
return (tax_id_keep, fullnamelineage_text, fullnamelineage_ids)
return (1,"Unable to found taxonomy consensus",1)
return (1, "Unable to found taxonomy consensus", 1)
def main(argv):
parser = argparse.ArgumentParser()
parser.add_argument("aln_input_file", \
help = "file with blast/diamond matches expected format m8 \
parser.add_argument("aln_input_file",
help="file with blast/diamond matches expected format m8 \
\nqueryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,\
queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore")
parser.add_argument("query_length_file", \
help = "tab delimited file of query lengths")
parser.add_argument('-a','--acc_taxaid_mapping_file', \
help = "mapping from accession to taxaid gzipped")
parser.add_argument('-t','--taxonomy', \
help = "path of taxdump.tar.gz extracted directory")
parser.add_argument('-o','--output_file', type = str, \
default = "taxonomyassignation", help = ("string specifying output file"))
parser.add_argument('-i','--identity', default = 60, \
help = "percentage of identity")
parser.add_argument('-m','--max_matches', default = 10, \
help = "max number of matches to analyze")
parser.add_argument('-c','--min_coverage', default = 70, \
help = "percentage of coverage")
parser.add_argument("query_length_file",
help="tab delimited file of query lengths")
parser.add_argument('-a', '--acc_taxaid_mapping_file',
help="mapping from accession to taxaid gzipped")
parser.add_argument('-t', '--taxonomy',
help="path of taxdump.tar.gz extracted directory")
parser.add_argument('-o', '--output_file', type=str,
default="taxonomyassignation", help=("string specifying output file"))
parser.add_argument('-i', '--identity', default=60,
help="percentage of identity")
parser.add_argument('-m', '--max_matches', default=10,
help="max number of matches to analyze")
parser.add_argument('-c', '--min_coverage', default=70,
help="percentage of coverage")
args = parser.parse_args()
lengths = read_query_length_file(args.query_length_file)
print("Finished reading lengths file")
nb_total_prot = len(lengths)
(matches, accs) = read_blast_input(args.aln_input_file, lengths,
args.identity, args.max_matches, args.min_coverage)
print("Finished reading in blast results file")
nb_prot_annotated = len(matches)
@@ -320,10 +332,10 @@ def main(argv):
print("Finished loading taxa directory " + args.taxonomy)
re_contig = re.compile('(.*)\.' + prot_prefix)
with open(args.output_file + ".pergene.tsv", "w") as out, \
open(args.output_file + ".percontig.tsv", "w") as outpercontig, \
open(args.output_file + ".warn.tsv", "w") as outdisc:
# Write header
out.write("#prot_id\tconsensus_tax_id\tconsensus_lineage\ttax_id_by_level\n")
outpercontig.write("#contig\tconsensus_tax_id\tconsensus_lineage\ttax_id_by_level\n")
outdisc.write("#prot_id\tlist nr hit not found in taxo\n")
@@ -340,45 +352,49 @@ def main(argv):
contig_id = None
for prot, matchs in list(matches.items()):
hit_sorted = sorted(matchs, key=lambda x: x[1], reverse=True)
####
# handle contig consensus
match = re_contig.match(prot)
if match:
contig_id = match.group(1)
if prev_contig is None:
prev_contig = contig_id
if prev_contig != contig_id:
###
(tax_id_keep, fullnamelineage_text, fullnamelineage_ids) = get_consensus(collate_hits_per_contig)
count_genealogy_contig[d_taxonomy[str(tax_id_keep)].level] += 1
outpercontig.write(prev_contig + "\t" + str(tax_id_keep) + "\t" +
fullnamelineage_text + "\t" + str(fullnamelineage_ids) + "\n")
collate_hits_per_contig = list()
for depth in range(7):
collate_hits_per_contig.append(Counter())
prev_contig = contig_id
best_hit = hit_sorted[0][0]
fHit = hit_sorted[0][1]
if mapping[best_hit] > -1:
# Retrieve hit taxa id
tax_id = mapping[best_hit]
if str(tax_id) in d_taxonomy:  # taxid in taxonomy?
# Retrieve lineage at main levels only (skip "no rank" entries)
hits = d_taxonomy[str(tax_id)].genealogy_main_level()
for depth in range(7):
if hits[depth] != "None":
weight = (fHit - MIN_IDENTITY_TAXA[depth]) / (1.0 - MIN_IDENTITY_TAXA[depth])
weight = max(weight, 0.0)
if weight > 0.0:
# could put a transform in here
collate_hits_per_contig[depth][hits[depth]] += weight
# end handle contig consensus
####
####
# Handle a protein: initialize variables
added_matches = []
collate_hits = list()
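The per-level weight computed above rescales `fHit` against that level's identity threshold, so a hit only contributes at taxonomic levels whose threshold it clears. A worked example for `fHit = 0.80`:

```python
# Thresholds from superkingdom (depth 0) to species (depth 6), as defined above.
MIN_IDENTITY_TAXA = (0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95)

fHit = 0.80
for depth, threshold in enumerate(MIN_IDENTITY_TAXA):
    weight = max((fHit - threshold) / (1.0 - threshold), 0.0)
    print(depth, round(weight, 3))
# depth 0 -> 0.667, depth 3 -> 0.333, depths 4-6 -> 0.0:
# this hit supports an order-level call but says nothing at genus or species.
```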