Commit 685d8cd9 authored by Joanna Fourquet's avatar Joanna Fourquet

Merge branch 'dev' into 'master'

5!

See merge request !5
parents 879e75d7 bb124b05
@@ -6,7 +6,7 @@
The workflow processes raw data from FastQ inputs and performs the following steps:
* controls the quality of data (FastQC and MultiQC)
* trims adapters sequences and clean reads (Cutadapt, Sickle)
* trims adapter sequences and removes low-quality reads (Cutadapt, Sickle)
* removes contaminant reads (BWA-MEM, samtools, bedtools)
* performs a taxonomic classification of reads (Kaiju MEM and KronaTools)
* assembles cleaned reads (metaSPAdes or megahit)
@@ -15,10 +15,11 @@ The workflow processes raw data from FastQ inputs and do following steps:
* clusters genes at sample and global level (cd-hit and a home-made Python script)
* quantifies reads on genes (BWA index, BWA-MEM and featureCounts)
* builds a quantification table (home-made Python script)
* performs a taxonomic affiliation of contigs (DIAMOND and a home-made Python script)
The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
It will come with a [Singularity](https://sylabs.io/docs/) container making installation trivial and results highly reproducible.
It can be run with a [Singularity](https://sylabs.io/docs/) container making installation trivial and results highly reproducible.
# Schematic representation
@@ -40,9 +41,13 @@ metagWGS requires all the following tools. They must be installed and copied or
* [Megahit](https://github.com/voutcn/megahit) v1.1.3
* [Prokka](https://github.com/tseemann/prokka) v1.13.4 - WARNING: always use the latest release
* [Cd-hit](http://weizhongli-lab.org/cd-hit/) v4.6.8
* [Samtools](http://www.htslib.org/) v0.1.19
* [Samtools](http://www.htslib.org/) v1.9
* [Bedtools](https://bedtools.readthedocs.io/en/latest/) v2.27.1
* [Subread](http://subread.sourceforge.net/) v1.6.0
* Python3 BCBio (+ BCBio.GFF), Bio (+ Bio.Seq, Bio.SeqRecord, Bio.SeqFeature) and pprint libraries
* [Singularity](https://sylabs.io/docs/) v3.0.1
* [DIAMOND](https://github.com/bbuchfink/diamond) v0.9.22
* Python3 Pandas
# Installation
## Install NextFlow
@@ -66,19 +71,27 @@ mv nextflow ~/bin/
```
git clone git@forgemia.inra.fr:genotoul-bioinfo/metagwgs.git
```
* **Configure profiles**
* **Configure profiles with modules**
A configuration file ([nextflow.config](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/blob/dev/nextflow.config)) has been developed to run the pipeline on a local machine or on a SLURM cluster.
To use these configurations, run the pipeline with the following parameters:
* `-profile standard` runs metagWGS on a local machine.
* `-profile local,modules` runs metagWGS on a local machine without Singularity container (with modules).
* `-profile cluster_slurm` runs metagWGS on a SLURM cluster.
* `-profile slurm,modules` runs metagWGS on a SLURM cluster without Singularity container (with modules).
* **Reproducibility with a Singularity container**
* **Configure profiles with a Singularity container**
A [Singularity](https://sylabs.io/docs/) container will be soon available to run the pipeline metagWGS.
A [Singularity](https://sylabs.io/docs/) container is available for this pipeline.
All information about how we built it is available in [this Wiki page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/wikis/Singularity%20container).
To use the configuration profiles with the Singularity container, run the pipeline with the following parameters:
* `-profile local,singularity` runs metagWGS on a local machine with our Singularity container (in addition to the `-with-singularity singularity/metagWGS_with_dependancies.img` parameter).
* `-profile slurm,singularity` runs metagWGS on a SLURM cluster with our Singularity container (in addition to the `-with-singularity singularity/metagWGS_with_dependancies.img` parameter).
# Usage
@@ -87,43 +100,88 @@ A [Singularity](https://sylabs.io/docs/) container will be soon available to run
A basic command line running the pipeline is:
```bash
./nextflow run -profile [standard or cluster_slurm] main.nf --reads '*_{R1,R2}.fastq.gz' --assembly [metaspades or megahit]
./nextflow run -profile [local,modules or slurm,modules or local,singularity or slurm,singularity] main.nf --reads '*_{R1,R2}.fastq.gz' --assembly [metaspades or megahit] [-with-singularity singularity/metagWGS_with_dependancies.img]
```
`'*_{R1,R2}.fastq.gz'` runs the pipeline with all the R1.fastq.gz and R2.fastq.gz file pairs available in your working directory.
WARNING: the user must choose between metaspades and megahit for the assembly step.
The choice can be based on CPU and memory availability: metaspades needs more CPUs and memory than megahit, but our tests showed that its assembly metrics are better.
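For illustration only, the following Python sketch (not part of the pipeline; the file names are hypothetical) shows how such a pattern groups R1/R2 files into one pair per sample:

```python
# Illustration only: group hypothetical FastQ files into R1/R2 pairs per sample,
# mimicking what a pattern like '*_{R1,R2}.fastq.gz' selects.
import re
from collections import defaultdict

files = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz",
         "sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz"]

pairs = defaultdict(dict)
for name in files:
    match = re.match(r"(.+)_(R[12])\.fastq\.gz$", name)
    if match:
        sample, mate = match.groups()
        pairs[sample][mate] = name

# {'sampleA': {'R1': 'sampleA_R1.fastq.gz', 'R2': 'sampleA_R2.fastq.gz'}, 'sampleB': {...}}
print(dict(pairs))
```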
## Other parameters
Other parameters are available:
```
Mode:
--mode: Paired-end ('pe') or single-end ('se') reads. Default: 'pe'. Single-end mode has not been developped yet.
Trimming options:
--adapter1 Sequence of adapter 1. Default: Illumina TruSeq adapter.
--adapter2 Sequence of adapter 2. Default: Illumina TruSeq adapter.
Quality options:
--qualityType Sickle supports three types of quality values: Illumina, Solexa, and Sanger. Default: 'sanger'.
Alignment options:
--db_alignment Alignment data base.
Taxonomic classification options (to avoid kaiju indexation provide following files):
--kaiju_nodes File nodes.dmp built with kaiju-makedb.
--kaiju_db File kaiju_db_refseq.fmi built with kaiju-makedb.
--kaiju_names File names.dmp built with kaiju-makedb.
Other options:
--outdir The output directory where the results will be saved.
--help Show the help message and exit.
Skip
Usage:
The typical command for running the pipeline is as follows:
nextflow run -profile standard main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades
Mandatory arguments:
--reads Path to input data (must be surrounded with quotes).
--assembly Tool used for assembly: 'metaspades' or 'megahit'.
-profile Configuration profile to use.
Available: local,modules (local with module load), slurm,modules (run pipeline on slurm cluster with module load),
local,singularity (local with Singularity container) and slurm,singularity (run pipeline on slurm cluster with Singularity container)
Options:
Mode:
--mode: Paired-end ('pe') or single-end ('se') reads. Default: 'pe'.
Trimming options:
--adapter1 Sequence of adapter 1. Default: Illumina TruSeq adapter.
--adapter2 Sequence of adapter 2. Default: Illumina TruSeq adapter.
Quality option:
--qualityType Sickle supports three types of quality values: Illumina, Solexa, and Sanger. Default: 'sanger'.
Alignment options:
--db_host Host database already indexed (bwa index).
Taxonomic classification options:
--kaiju_nodes File nodes.dmp built with kaiju-makedb.
--kaiju_db File kaiju_db_refseq.fmi built with kaiju-makedb.
--kaiju_names File names.dmp built with kaiju-makedb.
--diamond_bank NR DIAMOND database.
--accession2taxid FTP address of file prot.accession2taxid.gz.
--taxdump FTP address of file taxdump.tar.gz.
Clustering option:
--percentage_identity Sequence identity threshold. Default: 0.95.
Other options:
--outdir The output directory where the results will be saved.
--help Show this message and exit.
Software versions:
Cutadapt v1.15
Sickle v1.33
FastQC v0.11.7
MultiQC v1.5
BWA 0.7.17
Python v3.6.3
Kaiju v1.7.0
SPAdes v3.11.1
megahit v1.1.3
prokka v1.13.4
cdhit v4.6.8
samtools v1.9
bedtools v2.27.1
subread v1.6.0
diamond v0.9.22
Skip
--skip_sickle Skip sickle process.
--skip_kaiju_index Skip the build of the kaiju database (index_db_kaiju process).
--skip_taxonomy Skip the taxonomy process.
```
@@ -141,6 +199,7 @@ The pipeline will create the following files in your working directory:
** results/05_Annotation: files .gff, .ffn, .fna, .faa, etc after prokka annotation and .gff, .ffn, .fna, .faa files with renamed contigs and genes
** results/06_Clustering: cd-hit results for each sample, correspondence table of intermediate clusters and genes, cd-hit results at global level and correspondence table of global clusters and intermediate clusters (table_clstr.txt)
** results/07_Quantification: .bam and .bam.bai files after reads alignment on contigs, .count files (featureCounts counts), .summary file (featureCounts summary), .output file (featureCounts output), Correspondence_global_clstr_contigs.txt (correspondence table between global clusters and genes), Clusters_Count_table_all_samples.txt (quantification table of aligned reads for each global cluster and for each sample); a toy sketch of this aggregation is shown after this list.
** results/08_Diamond: diamond results and taxonomic affiliation consensus for proteins and contigs.
* .nextflow_log # Log file from Nextflow
* # Other nextflow hidden files, eg. history of pipeline runs and old logs.
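For illustration, a minimal Python sketch (made-up contig, cluster and sample names; the pipeline's own Script_quantification_clusters.py implements the real logic) of how read counts are summed per global cluster and per sample to build Clusters_Count_table_all_samples.txt:

```python
# Minimal sketch: sum featureCounts read counts per global cluster and per sample,
# using made-up contig-to-cluster assignments and counts.
contig_to_cluster = {"ctg1": "Cluster_0", "ctg2": "Cluster_0", "ctg3": "Cluster_1"}
counts_per_sample = {
    "sampleA": {"ctg1": 10, "ctg2": 5, "ctg3": 2},
    "sampleB": {"ctg1": 0, "ctg2": 7, "ctg3": 9},
}

table = {}
for sample, counts in counts_per_sample.items():
    for contig, n in counts.items():
        cluster = contig_to_cluster[contig]
        table.setdefault(cluster, {}).setdefault(sample, 0)
        table[cluster][sample] += n

# {'Cluster_0': {'sampleA': 15, 'sampleB': 7}, 'Cluster_1': {'sampleA': 2, 'sampleB': 9}}
print(table)
```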
@@ -148,13 +207,29 @@ The pipeline will create the following files in your working directory:
## How to run demonstration on genologin cluster
* Test data are available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/tree/master/test)
* BWA index of human reference genome is available at /bank/bwadb/ensembl_homo_sapiens_genome
* kaiju database index file are avaiblable at /bank/kaijudb/kaijudb_Juin2019/
* BWA index of human reference genome is available at /work/bank/bwadb/ensembl_homo_sapiens_genome
* kaiju database index files are available at /work/bank/kaijudb/kaijudb_Juin2019/
WARNING: note that on genologin, modules are named "bioinfo/Name-version". We encountered compatibility issues when loading several modules at the same time. To avoid this problem, for some processes we created a single module which loads the different required modules:
* bioinfo_bwa_samtools_subread
* bioinfo_bwa_samtools
You can run the pipeline as follows without the Singularity container:
```bash
./nextflow run -profile cluster_slurm_modules main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp
./nextflow run -profile slurm,modules main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp
```
You can run the pipeline as follows with our Singularity container:
You can run the pipeline as follows:
```bash
./nextflow run -profile cluster_slurm main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /bank/kaijudb/kaijudb_Juin2019/names.dmp
module load system/singularity-3.0.1
./nextflow run -profile slurm,singularity main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp -with-singularity singularity/metagWGS_with_dependancies.img
```
# License
@@ -167,6 +242,6 @@ metagWGS is distributed under the GNU General Public License v3.
# Citation
metagWGS will be presented at JOBIM 2019:
metagWGS has been presented at JOBIM 2019 and at the Genotoul Biostat Bioinfo day:
"Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.
\ No newline at end of file
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.
\ No newline at end of file
#!/usr/bin/env python
"""----------------------------------------------------------------------------------------------------------------------------------------------------------
Script Name: Script_histogram_kaiju.py
Description: Generates histogram of the distribution of reads according
to the size of the matches
Input files:
Verbose file generated by Kaiju.
Created By: Joanna Fourquet
Date: 2019-09-09
----------------------------------------------------------------------------------------------------------------------------------------------------------
"""
# Metadata
__author__ = 'Joanna Fourquet - Plateforme bioinformatique Toulouse'
__copyright__ = 'Copyright (C) 2019 INRA'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Modules importation
try:
import argparse
import re
import sys
import os
from collections import Counter
from collections import OrderedDict
from matplotlib import pyplot
except ImportError as error:
print(error)
exit(1)
# Manage parameters
parser = argparse.ArgumentParser(description = \
'Generates histogram of the distribution of reads according \
to the size of the matches.')
parser.add_argument('-i', '--input', required = True, help = \
'Verbose file generated by Kaiju.')
args = parser.parse_args()
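# Expected input (assumption based on the parsing in main() below): Kaiju verbose
# output, one read per line, where column 1 is the classification status (C/U)
# and column 4 is the length or score of the best match.
# Hypothetical invocation: python Script_histogram_kaiju.py -i sample1_kaiju_verbose.out
# make_figure() below plots the per-match-length values and saves them as
# <input name>_Kaiju_MEM_<counts|normalized>.pdf.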
def make_figure(ord_dictionary, name_sample, type_figure):
pyplot.bar(list(ord_dictionary.keys()), ord_dictionary.values())
axes = pyplot.gca()
axes.xaxis.set_ticks(range(4,44,5))
axes.xaxis.set_ticklabels(range(15,51,5))
pyplot.xlabel("Match length")
pyplot.ylabel("Number of reads (" + type_figure + ")")
pyplot.title(name_sample)
pyplot.savefig(name_sample + "_Kaiju_MEM_" + type_figure + ".pdf")
pyplot.close()
def main(argv):
with open(args.input) as kaiju_file:
# Variable initialization.
d_nb_reads_by_match_length = Counter()
for entire_line_kaiju_file in kaiju_file:
line_kaiju_file = entire_line_kaiju_file.split()
# Take into account only classified reads.
if line_kaiju_file[0] == "C":
# Record length of the match.
d_nb_reads_by_match_length[line_kaiju_file[3]] += 1
# For all other match lengths, initialize the value to 0.
for i in range(11,51):
if str(i) not in d_nb_reads_by_match_length.keys():
d_nb_reads_by_match_length[str(i)] = 0
d_nb_reads_by_match_length_ord = OrderedDict(sorted(d_nb_reads_by_match_length.items(), key=lambda t: t[0]))
print(d_nb_reads_by_match_length_ord)
# Figure.
make_figure(d_nb_reads_by_match_length_ord, args.input, "counts")
# Normalized reads by total number of reads.
sum_reads = sum(d_nb_reads_by_match_length_ord.values())
d_norm_reads_by_match_length = Counter()
for match_length, nb_reads in d_nb_reads_by_match_length_ord.items():
d_norm_reads_by_match_length[match_length] = nb_reads / sum_reads
d_norm_reads_by_match_length_ord = OrderedDict(d_norm_reads_by_match_length)
print(d_norm_reads_by_match_length_ord)
# Figure.
make_figure(d_norm_reads_by_match_length_ord, args.input, "normalized")
if __name__ == "__main__":
main(sys.argv[1:])
#!/usr/bin/env Rscript
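# Reads one or more Kaiju verbose output files, keeps classified reads (V1 == "C")
# and plots histograms of match lengths (column V4), raw and normalized, as
# <input name>_Kaiju_MEM_counts.jpg and <input name>_Kaiju_MEM_normalized.jpg.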
args = commandArgs(trailingOnly=TRUE)
# test if there is at least one argument: if not, return an error
if (length(args)==0) {
stop("At least one argument must be supplied (input file).n", call.=FALSE)
} else if (length(args)>=1) {
for (i in 1:length(args))
{
tab_init <- read.table(args[i], header=FALSE, sep="\t", fill=TRUE, col.names = paste0("V",seq_len(7)))
print(head(tab_init))
tab <- tab_init[tab_init$V1=="C",]
print(head(tab))
tab_hist <- hist(tab$V4, breaks=seq(10,50,1), plot=FALSE)
print(paste0(args[i], " breaks and counts sum"))
print(tab_hist$breaks)
print(sum(tab_hist$counts))
print(tab_hist$counts)
jpeg(paste0(args[i], '_Kaiju_MEM_counts.jpg'))
plot(tab_hist, ylim=c(0,15000000))
dev.off()
tab_hist$counts <- tab_hist$counts/sum(tab_hist$counts)
jpeg(paste0(args[i], "_Kaiju_MEM_normalized.jpg"))
plot(tab_hist, ylim=c(0,0.25))
dev.off()
print(paste0(args[i], " breaks and counts sum"))
print(tab_hist$breaks)
print(sum(tab_hist$counts))
print(tab_hist$counts)
}
}
#!/bin/env python
#!/usr/bin/env python
"""----------------------------------------------------------------------------------------------------------------------------------------------------------
Script Name: Script_quantification_clusters.py
"""--------------------------------------------------------------------
Script Name: Script_quantification_clusters_opti.py
Description: Create a file which joins
table with global cluster id and intermediate clusters id
table with global cluster id and intermediate cluster id
to table with intermediate cluster id and contigs id.
Create a file which contains
sum of reads aligned
to each contig of a cluster.
Input files:
1st input file: table_clstr.txt (table with cluster id and corresponding intermediate cluster ids)
2nd input file: file containing list of file names generated with 1st cd-hit for each sample (intermediate cluster id and contig id).
3rd input file: file containing list of file names generated with featureCounts for each sample (.featureCounts.count files)
Created By: Joanna Fourquet
1st input file: table_clstr.txt (table with cluster id
and corresponding intermediate cluster ids)
2nd input file: file containing list of file names
generated with 1st cd-hit for each sample
(intermediate cluster id and contig id).
3rd input file: file containing list of file names
generated with featureCounts for each sample
(.featureCounts.count files)
Created By: Joanna Fourquet and Celine Noirot
Date: 2019-04-11
----------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------
"""
# Metadata
__author__ = 'Fourquet Joanna - Plateforme bioinformatique Toulouse'
# Metadata.
__author__ = 'Joanna Fourquet, Celine Noirot \
- Plateforme bioinformatique Toulouse'
__copyright__ = 'Copyright (C) 2019 INRA'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Status: dev.
# Modules importation
# Modules importation.
try:
import argparse
import re
@@ -38,109 +43,119 @@ except ImportError as error:
print(error)
exit(1)
# Print date
# Print time.
print(str(datetime.now()))
# Print input arguments
# Manage parameters
parser =argparse.ArgumentParser(description='Script which create a correspondence table between global cluster id and contig id and a table with number of aligned reads in each sample and for each global cluster id.')
parser.add_argument('-t', '--table', required=True, help='Correspondence table between global cluster id and intermediate cluster id.')
parser.add_argument('-l', '--listoffileCluster', required=True, help='List of files containing correspondence tables between cluster intermediate cluster id and contig id per sample.')
parser.add_argument('-c', '--listoffileCounts', required=True, help='List of files storing read counts for each contig per sample.')
parser.add_argument('-oc', '--outputCounts', required=True, help='Name of output file containing counts for each global cluster id and each sample.')
parser.add_argument('-oID', '--outputID', required=True, help='Name of output file containing correspondence table between global cluster id and contig id.')
parser.add_argument('-v', '--version', action='version', version=__version__)
# Manage parameters.
parser = argparse.ArgumentParser(description = 'Script which creates a \
correspondence table between global cluster id and contig id and \
a table with the number of aligned reads in each sample for each \
global cluster id.')
args = parser.parse_args()
parser.add_argument('-t', '--table_of_corespondences', required = True, \
help = 'Correspondence table between global cluster \
id and intermediate cluster id.')
parser.add_argument('-l', '--list_of_file_clusters', required = True, \
help = 'List of files containing correspondence tables between \
intermediate cluster id and contig id per sample.')
parser.add_argument('-c', '--list_of_file_counts', required = True, \
help = 'List of files storing read counts for each contig per sample.')
# Initialization of dictionnary dict_cltr_count.
# It will contains for each cluster the sum of reads
# corresponding to contigs belonging to this cluster for each sample.
dict_cltr_count={}
parser.add_argument('-oc', '--output_counts', required = True, \
help = 'Name of output file containing counts \
for each global cluster id and each sample.')
parser.add_argument('-oid', '--output_id', required = True, \
help = 'Name of output file containing correspondence table \
between global cluster id and contig id.')
parser.add_argument('-v', '--version', action = 'version', \
version = __version__)
args = parser.parse_args()
# Recovery of the list of file names.
with open(args.listoffileCounts) as fcounts_list:
with open(args.list_of_file_counts) as fcounts_list:
files_of_counts = fcounts_list.read().split()
# The dictionnary dict_cltr_global_cltr will store intermediate cluster id (key) and global cluster id (value) of file after argument -t.
# Dictionnaries dict_cltr_global_cltr and dict_cltr_count initialization.
dict_cltr_global_cltr = {}
dict_cltr_count = {}
# For all variable names:
# g_clstr: global cluster,
# int_clstr: intermediate cluster,
# ctg: contig.
# Dictionaries d_g_clstr_id_by_int_clstr_id
# and d_count_by_g_clstr initialization.
d_g_clstr_id_by_int_clstr_id = {}
d_count_by_g_clstr = {}
with open(args.table) as fp:
for cluster in fp:
# If we are not at the end of file
# we store intermediate cluster id of the reading line in "key" variable
# and global cluster id in "value" variable of the dictionnary.
glob_cluster, *int_cluster = cluster.split()
for c in int_cluster :
dict_cltr_global_cltr[c]=glob_cluster
# Initialization of dict_cltr_count keys with the name of keys of dict (name of clusters).
# Initialization of dict_cltr_count values at 0.
dict_cltr_count[glob_cluster] = [0]*len(files_of_counts)
with open(args.table_of_corespondences) as fp:
for g_clstr_int_clstr_line in fp:
g_clstr, *int_clstr = g_clstr_int_clstr_line.split()
for clstr in int_clstr :
d_g_clstr_id_by_int_clstr_id[clstr] = g_clstr
d_count_by_g_clstr[g_clstr] = [0]*len(files_of_counts)
print(dict_cltr_global_cltr)
print(dict_cltr_count)
print(d_g_clstr_id_by_int_clstr_id)
print(d_count_by_g_clstr)
# Print date.
print(str(datetime.now()))
# The dictionnary dict_contig_global_cltr will contain for each global cluster id (key)
# the corresponding contigs id (values).
# Initialization of dictionnary dict_contig_global_cltr.
dict_contig_global_cltr={}
# Opening of file after -l argument.
# This file contains the list of sample files names which contains
# correspondence between intermediate cluster id and contig id.
with open(args.listoffileCluster) as fcluster_list:
files_of_contigs = fcluster_list.read().split()
print(files_of_contigs)
# For each line of each sample file, store the contig id (value) in the dictionnary
# dict_contig_global_cltr.
# The key of dict_contig_global_cltr is the value of dict_cltr_global_cltr (global cluster id).
for cluster_contigs_path in files_of_contigs:
print(cluster_contigs_path)
with open(cluster_contigs_path) as fh:
for file in fh:
line_split = file.split()
print(line_split)
intermediate_cluster_id = line_split[0]
contig_id_from_cluster_contigs_path = line_split[1]
if 'dict_contig_global_cltr[contig_id_from_cluster_contigs_path]' not in dict_contig_global_cltr:
# Initialization of dictionary d_g_clstr_id_by_ctg_id.
d_g_clstr_id_by_ctg_id = {}
# Store in files_of_int_clstr_id_ctg_id the list of sample file names
# which contain the correspondence between intermediate cluster id and contig id.
with open(args.list_of_file_clusters) as fcluster_list:
files_of_int_clstr_id_ctg_id = fcluster_list.read().split()
print(files_of_int_clstr_id_ctg_id)
# For each line of each sample file in files_of_int_clstr_id_ctg_id,
# store the contig id (key) in the dictionary
# d_g_clstr_id_by_ctg_id.
# The value of d_g_clstr_id_by_ctg_id is the value of
# d_g_clstr_id_by_int_clstr_id (global cluster id).
for int_clstr_ctg_path in files_of_int_clstr_id_ctg_id:
print(int_clstr_ctg_path)
with open(int_clstr_ctg_path) as fh:
for file_int_clstr_ctg in fh:
line_int_clstr_ctg = file_int_clstr_ctg.split()
print(line_int_clstr_ctg)
int_clstr_id = line_int_clstr_ctg[0]
ctg_id_from_clstr_ctg_path = line_int_clstr_ctg[1]
if \
'd_g_clstr_id_by_ctg_id[ctg_id_from_clstr_ctg_path]' \
not in d_g_clstr_id_by_ctg_id:
print("if")
dict_contig_global_cltr[contig_id_from_cluster_contigs_path] = dict_cltr_global_cltr[intermediate_cluster_id]
d_g_clstr_id_by_ctg_id[ctg_id_from_clstr_ctg_path] \
= d_g_clstr_id_by_int_clstr_id[int_clstr_id]
else:
dict_contig_global_cltr[contig_id_from_cluster_contigs_path].append(dict_cltr_global_cltr[intermediate_cluster_id])
d_g_clstr_id_by_ctg_id[ctg_id_from_clstr_ctg_path]\
.append(d_g_clstr_id_by_int_clstr_id[int_clstr_id])
print(dict_contig_global_cltr)
print(d_g_clstr_id_by_ctg_id)
# Print date.
print(str(datetime.now()))
# For each count file (output of featureCounts), reading of lines one by one,
# recovery of name of contig and count number and incrementing of corresponding value in dict_cltr_count.
for (ech_idx,counts_path) in enumerate(files_of_counts):
# recovery of the contig name and count, and incrementing of the corresponding
# value in d_count_by_g_clstr.
for (count_idx,counts_path) in enumerate(files_of_counts):
with open(counts_path) as fh:
for f_contig_counts in fh:
# Test: if the line of file contains '#' or 'bam', this line doesn't contain counts so it must be passed.
if f_contig_counts.startswith('#') or f_contig_counts.startswith('Geneid'): continue
# Recovery of contig id and corresponding count.
line_split = f_contig_counts.split()
contig_id=line_split[0].split("_gene")[0]
contig_count=int(line_split[6])
dict_cltr_count[dict_contig_global_cltr[contig_id]][ech_idx] += contig_count
for f_ctg_counts in fh:
if f_ctg_counts.startswith('#') \
or f_ctg_counts.startswith('Geneid'):
continue
line_ctg_counts_split = f_ctg_counts.split()
ctg_id = line_ctg_counts_split[0].split("_gene")[0]
ctg_count = int(line_ctg_counts_split[6])
d_count_by_g_clstr[d_g_clstr_id_by_ctg_id[ctg_id]]\
[count_idx] += ctg_count
# Print date.
print(str(datetime.now()))
@@ -149,28 +164,33 @@ print(str(datetime.now()))
# Write in the output files.
#######################################
# Open output file.
with open(args.outputID,"w") as foutput_res_table:
# Write output file containing correspondence table
# between global cluster id and contig id.
with open(args.output_id,"w") as foutput_res_table:
# Heading of output file: name of columns.