Commit 782d2a00 authored by Celine Noirot's avatar Celine Noirot

Merge branch 'dev' into 'master'

Dev

See merge request !1
parents ebf9db5b 3d4f2775
.nextflow*
work/
# metagWGS
# Introduction
**metagWGS** is a Nextflow bioinformatics analysis pipeline for metagenomic shotgun sequencing data (Illumina HiSeq3000, paired-end, 2 x 150 bp).
The workflow processes raw data from FastQ inputs and performs the following steps:
* control the quality of the data (FastQC and MultiQC)
* trim adapter sequences and clean reads (Cutadapt, Sickle)
* remove contaminants (BWA-MEM, samtools, bedtools)
* taxonomically classify reads (Kaiju MEM and KronaTools)
* assemble cleaned reads and annotate contigs (metaSPAdes or MEGAHIT, Prokka)
* rename contigs and genes (in-house Python script)
* cluster genes at sample and global level (CD-HIT and in-house Python script)
* quantify reads on genes (BWA index, BWA-MEM and featureCounts)
* build a quantification table (in-house Python script)

The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
It will soon come with a Singularity container, making installation trivial and results highly reproducible.
# Prerequisites
metagWGS requires the following tools. They must be installed and available in a directory of your $PATH:
* [Nextflow](https://www.nextflow.io/docs/latest/index.html#) v19.01.0
* [Cutadapt](https://cutadapt.readthedocs.io/en/stable/#) v1.15
* [Sickle](https://github.com/najoshi/sickle) v1.33
* [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.11.7
* [MultiQC](https://multiqc.info/) v1.5
* [BWA](http://bio-bwa.sourceforge.net/) v0.7.17
* [Python](https://www.python.org/) v3.6.3
* [Kaiju](https://github.com/bioinformatics-centre/kaiju) v1.7.0
* [SPAdes](https://github.com/ablab/spades) v3.11.1
* [megahit](https://github.com/voutcn/megahit) v1.1.3
* [prokka](https://github.com/tseemann/prokka) v1.13.4 - WARNING: always use the latest release
* [cdhit](http://weizhongli-lab.org/cd-hit/) v4.6.8
* [samtools](http://www.htslib.org/) v0.1.19
* [bedtools](https://bedtools.readthedocs.io/en/latest/) v2.27.1
* [subread](http://subread.sourceforge.net/) v1.6.0
# Installation
## Install Nextflow
Nextflow runs on most POSIX systems (Linux, macOS, etc.). It can be installed by running the following commands:
```bash
# Make sure that Java v8+ is installed:
java -version
# Install Nextflow
curl -fsSL get.nextflow.io | bash
# Add Nextflow binary to your PATH:
mv nextflow ~/bin/
# OR system-wide installation:
# sudo mv nextflow /usr/local/bin
```
## Install workflow
* **Retrieve workflow sources**
```
git clone git@forgemia.inra.fr:genotoul-bioinfo/metagwgs.git
```
* **Configure profiles**
A configuration file ([nextflow.config](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/blob/dev/nextflow.config)) has been developed to run the pipeline on a local machine or on a SLURM cluster.
To use these configurations, run the pipeline with one of the following parameters:
* `-profile standard` runs metagWGS on a local machine.
* `-profile cluster_slurm` runs metagWGS on a SLURM cluster.
* **Reproducibility with a Singularity container**
A [Singularity](https://sylabs.io/docs/) container will soon be available to run the metagWGS pipeline.
# Usage
## Basic usage
A basic command line running the pipeline is:
```bash
nextflow run -profile [standard or cluster_slurm] main.nf --reads '*_{R1,R2}.fastq.gz' --assembly [metaspades or megahit]
```
`'*_{R1,R2}.fastq.gz'` runs the pipeline on all the `R1.fastq.gz` and `R2.fastq.gz` files available in your working directory.
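For example, a run on a SLURM cluster with the metaSPAdes assembler (sample file names are hypothetical) could look like:
```bash
nextflow run -profile cluster_slurm main.nf \
    --reads 'sampleA_{R1,R2}.fastq.gz' \
    --assembly metaspades
```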
## Other parameters
Other parameters are available:
```
Mode:
--mode Paired-end ('pe') or single-end ('se') reads. Default: 'pe'. Single-end mode has not been developed yet.
Trimming options:
--adapter1 Sequence of adapter 1. Default: Illumina TruSeq adapter.
--adapter2 Sequence of adapter 2. Default: Illumina TruSeq adapter.
Quality options:
--qualityType Sickle supports three types of quality values: Illumina, Solexa, and Sanger. Default: 'sanger'.
Alignment options:
--db_alignment Alignment database.
Taxonomic classification options (to avoid the Kaiju indexation step, provide the following files):
--kaiju_nodes File nodes.dmp built with kaiju-makedb.
--kaiju_db File kaiju_db_refseq.fmi built with kaiju-makedb.
--kaiju_names File names.dmp built with kaiju-makedb.
Other options:
--outdir The output directory where the results will be saved.
--help Show the help message and exit.
```
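As an illustration (paths are hypothetical), a run reusing a pre-built Kaiju database and writing results to a custom directory could look like:
```bash
nextflow run -profile cluster_slurm main.nf \
    --reads '*_{R1,R2}.fastq.gz' \
    --assembly megahit \
    --kaiju_nodes /path/to/nodes.dmp \
    --kaiju_db /path/to/kaiju_db_refseq.fmi \
    --kaiju_names /path/to/names.dmp \
    --outdir results
```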
## Generated files
The pipeline will create the following files in your working directory:
```
* work # Directory containing the nextflow working files
* results # Directory containing result files
* .nextflow.log # Log file from Nextflow
* # Other Nextflow hidden files, e.g. history of pipeline runs and old logs.
```
# License
metagWGS is distributed under the GNU General Public License v3.
# Copyright
2019 INRA
# Citation
metagWGS has not been published yet.
# Author: Ian Korf, Genome Center, UC Davis
# This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
# This software is provided AS IS, without warranty of any kind.
use strict;
use warnings;
package FAlite;
use strict;
sub new {
    my ($class, $fh) = @_;
    if (ref $fh !~ /GLOB/)
        {die ref $fh, "\n", "FAlite ERROR: expect a GLOB reference\n"}
    my $this = bless {};
    $this->{FH} = $fh;
    while(<$fh>) {last if $_ =~ /\S/} # not supposed to have blanks, but...
    my $firstline = $_;
    if (not defined $firstline) {warn "FAlite: Empty\n"; return $this}
    if ($firstline !~ /^>/) {warn "FAlite: Not FASTA formatted\n"; return $this}
    $this->{LASTLINE} = $firstline;
    chomp $this->{LASTLINE};
    return $this;
}
sub nextEntry {
    my ($this) = @_;
    return 0 if not defined $this->{LASTLINE};
    my $fh = $this->{FH};
    my $def = $this->{LASTLINE};
    my @seq;
    my $lines_read = 0;
    while(<$fh>) {
        $lines_read++;
        if ($_ =~ /^>/) {
            $this->{LASTLINE} = $_;
            chomp $this->{LASTLINE};
            last;
        }
        push @seq, $_;
    }
    return 0 if $lines_read == 0;
    chomp @seq;
    my $entry = FAlite::Entry::new($def, \@seq);
    return $entry;
}
package FAlite::Entry;
use overload '""' => 'all';
sub new {
    my ($def, $seqarry) = @_;
    my $this = bless {};
    $this->{DEF} = $def;
    $this->{SEQ} = join("", @$seqarry);
    $this->{SEQ} =~ s/\s//g; # just in case more spaces
    return $this;
}
sub def {shift->{DEF}}
sub seq {shift->{SEQ}}
sub all {my $e = shift; return $e->{DEF}."\n".$e->{SEQ}."\n"}
1;
__END__
=head1 NAME
FAlite;
=head1 SYNOPSIS
 use FAlite;
 my $fasta = new FAlite(\*STDIN);
 while(my $entry = $fasta->nextEntry) {
     $entry->def;
     $entry->seq;
 }
=head1 DESCRIPTION
FAlite is a package for parsing FASTA files and databases. The FASTA format is
widely used in bioinformatics. It consists of a definition line followed by
sequence with an arbitrary number of lines and line lengths.
A FASTA file looks like this:

 >identifier descriptive text
 GAATTC

A FASTA database looks like this:

 >identifier1 some text describing this entry
 GAATTC
 ACTAGT
 >identifier2 some text describing this entry
 AAACCT
 GCTAAT
=head2 Object
FAlite has two kinds of objects, the file and the entry.
 my $fasta_file = new FAlite(\*STDIN); # or any other filehandle

 $entry = $fasta_file->nextEntry; # single fasta file

 while(my $entry = $fasta_file->nextEntry) {
     # canonical form of use for fasta database
 }

The entry has two attributes (def and seq).

 $entry->def; # access the def line
 $entry->seq; # access the sequence
 "$entry";    # overload to fasta file ($entry->def . "\n" . $entry->seq)
=head1 AUTHOR
Ian Korf (ikorf@sapiens.wustl.edu, http://sapiens.wustl.edu/~ikorf)
=head1 ACKNOWLEDGEMENTS
This software was developed at the Genome Sequencing Center at Washington
University, St. Louis, MO.
=head1 COPYRIGHT
Copyright (C) 1999 Ian Korf. All Rights Reserved.
=head1 DISCLAIMER
This software is provided "as is" without warranty of any kind.
=cut
#!/usr/bin/env Rscript
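# Usage sketch (assumed invocation; script and input file names are hypothetical):
#   Rscript kaiju_mem_histograms.R sample1_kaiju_MEM.out sample2_kaiju_MEM.out
# Each argument is expected to be a Kaiju MEM output table; for every input file the
# script plots raw and normalized histograms of match lengths for classified reads.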
args = commandArgs(trailingOnly=TRUE)
# test if there is at least one argument: if not, return an error
if (length(args)==0) {
  stop("At least one argument must be supplied (input file).\n", call.=FALSE)
} else if (length(args)>=1) {
  for (i in 1:length(args))
  {
    tab_init <- read.table(args[i], header=FALSE, sep="\t", fill=TRUE, col.names = paste0("V",seq_len(7)))
    print(head(tab_init))
    # Keep only reads classified by Kaiju (status column V1 == "C").
    tab <- tab_init[tab_init$V1=="C",]
    print(head(tab))
    # Histogram of the match lengths/scores (column V4), with breaks from 10 to 50.
    tab_hist <- hist(tab$V4, breaks=seq(10,50,1), plot=FALSE)
    print(paste0(args[i], " breaks and counts sum"))
    print(tab_hist$breaks)
    print(sum(tab_hist$counts))
    print(tab_hist$counts)
    jpeg(paste0(args[i], '_Kaiju_MEM_counts.jpg'))
    plot(tab_hist, ylim=c(0,15000000))
    dev.off()
    # Normalize counts to frequencies before plotting the normalized histogram.
    tab_hist$counts <- tab_hist$counts/sum(tab_hist$counts)
    jpeg(paste0(args[i], "_Kaiju_MEM_normalized.jpg"))
    plot(tab_hist, ylim=c(0,0.25))
    dev.off()
    print(paste0(args[i], " breaks and counts sum"))
    print(tab_hist$breaks)
    print(sum(tab_hist$counts))
    print(tab_hist$counts)
  }
}
#!/usr/bin/env python
"""----------------------------------------------------------------------------------------------------------------------------------------------------------
Script Name: Script_quantification_clusters.py
Description: Create a file which joins the table of global cluster ids and
    intermediate cluster ids with the per-sample tables of intermediate
    cluster ids and contig ids.
    Create a file which contains the sum of reads aligned to the contigs
    of each cluster, for each sample.
Input files:
1st input file: table_clstr.txt (table with cluster id and corresponding intermediate cluster ids)
2nd input file: file containing list of file names generated with 1st cd-hit for each sample (intermediate cluster id and contig id).
3rd input file: file containing list of file names generated with featureCounts for each sample (.featureCounts.count files)
Created By: Joanna Fourquet
Date: 2019-04-11
----------------------------------------------------------------------------------------------------------------------------------------------------------
"""
# Metadata
__author__ = 'Fourquet Joanna - Plateforme bioinformatique Toulouse'
__copyright__ = 'Copyright (C) 2019 INRA'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Modules importation
try:
    import argparse
    import re
    import sys
    from datetime import datetime
except ImportError as error:
    print(error)
    exit(1)
# Print date
print(str(datetime.now()))
# Print input arguments
# Manage parameters
parser = argparse.ArgumentParser(description='Script which creates a correspondence table between global cluster ids and contig ids, and a table with the number of aligned reads in each sample for each global cluster id.')
parser.add_argument('-t', '--table', required=True, help='Correspondence table between global cluster id and intermediate cluster id.')
parser.add_argument('-l', '--listoffileCluster', required=True, help='List of files containing correspondence tables between intermediate cluster id and contig id per sample.')
parser.add_argument('-c', '--listoffileCounts', required=True, help='List of files storing read counts for each contig per sample.')
parser.add_argument('-oc', '--outputCounts', required=True, help='Name of output file containing counts for each global cluster id and each sample.')
parser.add_argument('-oID', '--outputID', required=True, help='Name of output file containing correspondence table between global cluster id and contig id.')
parser.add_argument('-v', '--version', action='version', version=__version__)
args = parser.parse_args()
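# Example invocation (hypothetical file names, for illustration only):
#   python Script_quantification_clusters.py -t table_clstr.txt \
#       -l list_of_cluster_tables.txt -c list_of_featureCounts_files.txt \
#       -oc clusters_counts.tsv -oID clusters_contigs.tsv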
# Initialization of dictionary dict_cltr_count.
# For each global cluster it will contain the sum of reads of the contigs
# belonging to this cluster, for each sample.
dict_cltr_count = {}

# Recovery of the list of count file names.
with open(args.listoffileCounts) as fcounts_list:
    files_of_counts = fcounts_list.read().split()

# The dictionary dict_cltr_global_cltr will store intermediate cluster ids (keys)
# and global cluster ids (values) read from the file given with the -t argument.
# Dictionaries dict_cltr_global_cltr and dict_cltr_count initialization.
dict_cltr_global_cltr = {}
dict_cltr_count = {}
with open(args.table) as fp:
    for cluster in fp:
        # Each line contains a global cluster id followed by the intermediate
        # cluster ids belonging to it: store each intermediate cluster id (key)
        # with its global cluster id (value) in the dictionary.
        glob_cluster, *int_cluster = cluster.split()
        for c in int_cluster:
            dict_cltr_global_cltr[c] = glob_cluster
        # Initialization of dict_cltr_count: one entry per global cluster,
        # with one count (starting at 0) per sample.
        dict_cltr_count[glob_cluster] = [0]*len(files_of_counts)
print(dict_cltr_global_cltr)
print(dict_cltr_count)
# Print date.
print(str(datetime.now()))
# The dictionary dict_contig_global_cltr will contain for each contig id (key)
# the corresponding global cluster id (value).
# Initialization of dictionary dict_contig_global_cltr.
dict_contig_global_cltr = {}

# Opening of the file given with the -l argument.
# This file contains the list of per-sample file names which contain the
# correspondence between intermediate cluster ids and contig ids.
with open(args.listoffileCluster) as fcluster_list:
    files_of_contigs = fcluster_list.read().split()
print(files_of_contigs)
# For each line of each sample file, store the contig id (key) in the dictionary
# dict_contig_global_cltr with the corresponding global cluster id (value),
# looked up in dict_cltr_global_cltr from the intermediate cluster id.
for cluster_contigs_path in files_of_contigs:
    print(cluster_contigs_path)
    with open(cluster_contigs_path) as fh:
        for line in fh:
            line_split = line.split()
            print(line_split)
            intermediate_cluster_id = line_split[0]
            contig_id_from_cluster_contigs_path = line_split[1]
            # Each contig is expected to appear only once across the sample files.
            if contig_id_from_cluster_contigs_path not in dict_contig_global_cltr:
                dict_contig_global_cltr[contig_id_from_cluster_contigs_path] = dict_cltr_global_cltr[intermediate_cluster_id]
print(dict_contig_global_cltr)
# Print date.
print(str(datetime.now()))
# For each count file (output of featureCounts), read the lines one by one,
# recover the contig name and count number, and increment the corresponding value in dict_cltr_count.
for (ech_idx, counts_path) in enumerate(files_of_counts):
    with open(counts_path) as fh:
        for f_contig_counts in fh:
            # Header lines (starting with '#' or 'Geneid') do not contain counts, so they are skipped.
            if f_contig_counts.startswith('#') or f_contig_counts.startswith('Geneid'): continue
            # Recovery of contig id and corresponding count.
            line_split = f_contig_counts.split()
            contig_id = line_split[0].split("_gene")[0]
            contig_count = int(line_split[6])
            dict_cltr_count[dict_contig_global_cltr[contig_id]][ech_idx] += contig_count
# Print date.
print(str(datetime.now()))
#######################################
# Write in the output files.
#######################################
# Open output file.
with open(args.outputID, "w") as foutput_res_table:
    # Heading of output file: name of columns.
    foutput_res_table.write("id_cluster"+"\t"+"id_contig"+"\n")
    # Write the cluster id and contig id pairs contained in dict_contig_global_cltr to the output file, line by line.
    for key_dict_contig_global_cltr, value_dict_contig_global_cltr in dict_contig_global_cltr.items():
        foutput_res_table.write(value_dict_contig_global_cltr+"\t"+key_dict_contig_global_cltr+"\n")
# Print date.
print(str(datetime.now()))
# Open output file.
with open(args.outputCounts, "w") as foutput_res_counts:
    # Heading of output file: name of columns.
    foutput_res_counts.write("id_cluster\t"+"\t".join(files_of_counts)+"\n")
    # Write the cluster ids and the counts for each sample contained in dict_cltr_count to the output file, line by line.
    for key_dict_cltr_count, value_dict_cltr_count in dict_cltr_count.items():
        foutput_res_counts.write(key_dict_cltr_count+"\t"+"\t".join([str(i) for i in value_dict_cltr_count])+"\n")
# Print date.
print(str(datetime.now()))
#!/usr/bin/env python
"""----------------------------------------------------------------------------------------------------------------------------------------------------------
Script Name: Script_rename.py
Description: Rename contigs and genes in GFF file generated by PROKKA.
Input files:
GFF file produced by PROKKA.
Created By: Céline Noirot and Joanna Fourquet
Date: 2019-06-12
----------------------------------------------------------------------------------------------------------------------------------------------------------
"""
# Metadata
__author__ = 'Céline Noirot and Fourquet Joanna - Plateforme bioinformatique Toulouse'
__copyright__ = 'Copyright (C) 2019 INRA'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Modules importation
try:
    import argparse
    from BCBio import GFF
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.SeqFeature import SeqFeature, FeatureLocation
    import pprint
    from BCBio.GFF import GFFExaminer
    from Bio import SeqIO
except ImportError as error:
    print(error)
    exit(1)
# Manage parameters
parser = argparse.ArgumentParser(description='Script which renames contigs and genes in the GFF file generated by PROKKA.')
parser.add_argument('-f', '--file', required=True, help='GFF file generated by PROKKA.')
parser.add_argument('-faa', '--fastaFile', required=True, help='Fasta of predicted sequence (aa) generated by PROKKA.')
parser.add_argument('-ffn', '--ffnFile', required=True, help='Fasta of predicted sequence (nuc) generated by PROKKA.')
parser.add_argument('-p', '--prefix', required=True, help='Contig name prefix.')
parser.add_argument('-oGFF', '--outGFFFile', required=True, help='Name of output GFF file.')
parser.add_argument('-oFAA', '--outFAAFile', required=True, help='Filename of renamed predicted fasta sequences (aa).')
parser.add_argument('-oFFN', '--outFFNFile', required=True, help='Filename of renamed predicted fasta sequences (nuc).')
parser.add_argument('-oprottable', '--outProtein', default="protein_table.csv", help='Filename for the protein names correspondence table.')
parser.add_argument('-oconttable', '--outContig', default="contig_table.csv", help='Filename for the contig names correspondence table.')
args = parser.parse_args()
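# Example invocation (hypothetical file names, for illustration only):
#   python Script_rename.py -f sample.gff -faa sample.faa -ffn sample.ffn \
#       -p sampleA_c -oGFF sample_renamed.gff -oFAA sample_renamed.faa -oFFN sample_renamed.ffn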
protein_names={}
contig_prefix = args.prefix
prot_prefix = "Prot_"
to_write = []
with open(args.file) as gffFile,\
     open(args.outGFFFile, "w") as out_handle,\
     open(args.outProtein, "w") as fh_prot_table,\
     open(args.outContig, "w") as fh_cont_table,\
     open(args.outContig+".sed", "w") as fh_cont_sed:
    for rec in GFF.parse(gffFile):
        # Access to contig id
        old_contig_name = rec.id
        new_contig_name = contig_prefix + old_contig_name.split("_")[-1]
        rec.id = new_contig_name
        fh_cont_table.write(old_contig_name+"\t"+new_contig_name+"\n")
        fh_cont_sed.write("s/"+old_contig_name+"/"+new_contig_name+"/\n")
        # Access to features
        for f_index, feature in enumerate(rec.features):
            if feature.qualifiers['source'][0] != "minced:0.2.0":
                # Generate correspondence between old and new protein names
                old_prot_name = feature.qualifiers['ID'][0].replace("_gene","")
                prot_number = old_prot_name.split("_")[-1]
                new_prot_name = new_contig_name + "." + prot_prefix + prot_number
                protein_names[old_prot_name] = new_prot_name
                fh_prot_table.write(old_prot_name+"\t"+new_prot_name+"\n")
                # Initialise fields of the "gene" feature (the parent)
                rec.features[f_index].qualifiers["ID"] = new_prot_name+"_gene"
                rec.features[f_index].qualifiers["locus_tag"] = [new_prot_name]
                # Annotations (not prokka lines) are in sub_features
                for fsub_index, sub_feature in enumerate(feature.sub_features):
                    # Update IDs of the sub-feature
                    rec.features[f_index].sub_features[fsub_index].qualifiers["ID"] = new_prot_name
                    rec.features[f_index].sub_features[fsub_index].qualifiers["Parent"] = []
                    rec.features[f_index].sub_features[fsub_index].qualifiers["locus_tag"] = [new_prot_name]
                    rec.features[f_index].sub_features[fsub_index].qualifiers["protein_id"] = [new_prot_name]
        to_write.append(rec)
    # Write the whole GFF only once, after all records have been renamed
    print(to_write)
    GFF.write(to_write, out_handle)
with open(args.fastaFile, "rU") as handle,\
     open(args.outFAAFile, "w") as outFasta_handle:
    for record in SeqIO.parse(handle, "fasta"):
        try:
            old = record.id
            record.id = protein_names[record.id]
            record.description = record.description.replace(old + " ", "")
            SeqIO.write(record, outFasta_handle, "fasta")
        except KeyError:
            print("Warning FAA file: protein " + record.id + " discarded, no new name defined")
with open(args.ffnFile, "rU") as handle,\
     open(args.outFFNFile, "w") as outFFN_handle:
    for record in SeqIO.parse(handle, "fasta"):
        try:
            old = record.id
            record.id = protein_names[record.id]
            record.description = record.description.replace(old + " ", "")
            SeqIO.write(record, outFFN_handle, "fasta")
        except KeyError:
            print("Warning FFN file: protein " + record.id + " discarded, no new name defined")