Commit 95a94cd6 authored by Celine Noirot

merge dev, handle conflict

parents 3db940bc 3b9d9340
Pipeline #26955 passed with stages in 15 minutes and 16 seconds
@@ -11,12 +11,23 @@ stages:
singularity-image:
stage: build
script:
- singularity build metagWGS.sif env/Singularity_recipe_metagWGS_dependancies
- singularity build metagWGS.sif env/Singularity_recipe_metagWGS
- singularity build eggnog_mapper.sif env/Singularity_recipe_eggnog_mapper
artifacts:
paths:
- metagWGS.sif
- eggnog_mapper.sif
only:
changes:
- .gitlab-ci.yml
- env/*
# Push the metagWGS.sif and eggnog_mapper.sif images to the registry
deploy:
stage: deploy
script:
- singularity push --docker-username "${CI_REGISTRY_USER}" --docker-password "${CI_REGISTRY_PASSWORD}" metagWGS.sif oras://"$CI_REGISTRY_IMAGE"/"$CI_PROJECT_NAME":"$CI_COMMIT_TAG"
- singularity push --docker-username "${CI_REGISTRY_USER}" --docker-password "${CI_REGISTRY_PASSWORD}" eggnog_mapper.sif oras://"$CI_REGISTRY_IMAGE"/eggnog_mapper:"$CI_COMMIT_TAG"
only:
changes:
- .gitlab-ci.yml
- env/*
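As a usage note for the deploy stage above: the pushed images can be retrieved from the registry with `singularity pull`. A minimal sketch, where the registry path and tag are placeholders standing in for the job's `$CI_REGISTRY_IMAGE` value and a tagged release:
```bash
# Hypothetical pull of the images pushed by the deploy stage;
# replace <registry image path> and <tag> with real values.
singularity pull metagWGS.sif oras://<registry image path>/metagwgs:<tag>
singularity pull eggnog_mapper.sif oras://<registry image path>/eggnog_mapper:<tag>
```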
# **metagWGS**
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A519.10.0-brightgreen.svg)](https://www.nextflow.io/)
[![pipeline status](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/badges/master/pipeline.svg)](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/commits/master)
# Introduction
**metagWGS** is a Nextflow bioinformatics analysis pipeline used for Metagenomic Shotgun Sequencing data (Illumina HiSeq3000, paired, 2*150bp).
The workflow processes raw data from FastQ inputs and performs the following steps:
* controls the quality of data (FastQC and MultiQC)
* trims adapter sequences and deletes low quality reads (Cutadapt, Sickle)
* suppresses contaminants (BWA mem, samtools, bedtools)
* makes a taxonomic classification of reads (kaiju MEM and kronaTools)
* assembles cleaned reads (metaSPAdes or megahit)
* annotates contigs (prokka)
* renames contigs and genes (home-made python script)
* clusters genes at sample and global levels (cd-hit and home-made python script)
* quantifies reads on genes (BWA index, BWA-MEM and featureCounts)
* makes a quantification table (home-made python script)
* makes a taxonomic affiliation of contigs (DIAMOND and home-made python script)
# metagWGS
## Introduction
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp).
### Pipeline graphical representation
The workflow processes raw data from `.fastq` or `.fastq.gz` inputs and runs the modules represented in this figure:
![](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/raw/dev/docs/Pipeline.png)
### metagWGS steps
metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis (a minimal usage sketch follows this list):
* `01_clean_qc` (can be skipped)
* trims adapter sequences and deletes low quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
* suppresses host contaminants ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
* makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_kaiju_results.py))
* `02_assembly`
* assembles cleaned reads (combined with the `01_clean_qc` step) or raw reads (combined with the `--skip_01_clean_qc` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
* assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
* deduplicates cleaned reads (combined with the `01_clean_qc` step) or raw reads (combined with the `--skip_01_clean_qc` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `03_filtering` (can be skipped)
* filters contigs with low CPM value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `04_structural_annot`
* makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Rename_contigs_and_genes.py))
* `05_alignment`
* aligns reads to the contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/))
* aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `06_func_annot`
* makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/cd_hit_produce_table_clstr.py))
* quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Quantification_clusters.py))
* makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_functional_annotation.py))
* `07_taxo_affi`
* taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py))
* taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py))
* counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_idxstats_percontig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_idxstats_percontig_lineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_contig_lineage.py))
* `08_binning` from [nf-core/mag 1.0.0](https://github.com/nf-core/mag/releases/tag/1.0.0)
* makes binning of contigs ([MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/))
* assesses bins ([BUSCO](https://busco.ezlab.org/) + [metaQUAST](http://quast.sourceforge.net/metaquast) + [summary_busco.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/summary_busco.py) and [combine_tables.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/combine_tables.py) from [nf-core/mag](https://github.com/nf-core/mag))
* taxonomically affiliates the bins ([BAT](https://github.com/dutilh/CAT))
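As referenced above, a minimal usage sketch tying these steps together, assuming the `--skip_01_clean_qc` parameter mentioned in `02_assembly` and the basic invocation described in the Usage section below:
```bash
# Sketch: assemble raw reads directly, skipping the 01_clean_qc step.
# Profile, reads pattern and assembler are illustrative values from this README.
nextflow run main.nf \
  -profile slurm,singularity \
  --reads 'test/*_{R1,R2}.fastq.gz' \
  --assembly megahit \
  --skip_01_clean_qc
```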
An HTML report is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
It can be run with a [Singularity](https://sylabs.io/docs/) container making installation trivial and results highly reproducible.
# Schematic representation
![](/docs/Pipeline.png)
# Prerequisites
metagWGS requires all the following tools. They must be installed and copied or moved to a directory in your $PATH:
* [Nextflow](https://www.nextflow.io/docs/latest/index.html#) v19.01.0
* [Cutadapt](https://cutadapt.readthedocs.io/en/stable/#) v1.15
* [Sickle](https://github.com/najoshi/sickle) v1.33
* [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.11.7
* [MultiQC](https://multiqc.info/) v1.5
* [BWA](http://bio-bwa.sourceforge.net/) v0.7.17
* [Python](https://www.python.org/) v3.6.3
* [Kaiju](https://github.com/bioinformatics-centre/kaiju) v1.7.0
* [SPAdes](https://github.com/ablab/spades) v3.11.1
* [Megahit](https://github.com/voutcn/megahit) v1.1.3
* [Prokka](https://github.com/tseemann/prokka) v1.13.4 - WARNING: always use the latest release
* [Cd-hit](http://weizhongli-lab.org/cd-hit/) v4.6.8
* [Samtools](http://www.htslib.org/) v1.9
* [Bedtools](https://bedtools.readthedocs.io/en/latest/) v2.27.1
* [Subread](http://subread.sourceforge.net/) v1.6.0
* Python3 BCBio (+ BCBio.GFF), Bio (+ Bio.Seq, Bio.SeqRecord, Bio.SeqFeature) and pprint libraries
* [Singularity](https://sylabs.io/docs/) v3.0.1
* [DIAMOND](https://github.com/bbuchfink/diamond) v0.9.22
* Python3 Pandas
# Installation
## Install NextFlow
Nextflow runs on most POSIX systems (Linux, macOS, etc.). It can be installed by running the following commands:
```bash
# Make sure that Java v8+ is installed:
java -version
# Install Nextflow
curl -fsSL https://get.nextflow.io | bash
# Add Nextflow binary to your PATH:
mv nextflow ~/bin/
# OR system-wide installation:
# sudo mv nextflow /usr/local/bin
```
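A quick sanity check after installation (prints the installed Nextflow version):
```bash
# Confirm Nextflow is on your PATH and runs
nextflow -version
```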
## Install workflow
* **Retrieve workflow sources**
```bash
git clone git@forgemia.inra.fr:genotoul-bioinfo/metagwgs.git
```
* **Configure profiles with modules**
A configuration file ([nextflow.config](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/blob/dev/nextflow.config)) has been developed to run the pipeline on a local machine or on a SLURM cluster.
To use these configurations, run the pipeline with the following parameters:
* `-profile local,modules` runs metagWGS on a local machine without the Singularity container (using modules).
* `-profile slurm,modules` runs metagWGS on a SLURM cluster without the Singularity container (using modules).
* **Configure profiles with a Singularity container**
A [Singularity](https://sylabs.io/docs/) container is available for this pipeline.
All information about how we built it is on [this Wiki page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/wikis/Singularity%20container).
To use the configuration profiles with the Singularity container, run the pipeline with the following parameters (an example follows the list):
* `-profile local,singularity` runs metagWGS on a local machine with our Singularity container (in addition to the `-with-singularity singularity/metagWGS_with_dependancies.img` parameter).
* `-profile slurm,singularity` runs metagWGS on a SLURM cluster with our Singularity container (in addition to the `-with-singularity singularity/metagWGS_with_dependancies.img` parameter).
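For instance, minimal sketches of both modes (the bracketed part is a placeholder for the remaining parameters shown in the Usage section):
```bash
# Without the container, using modules on a SLURM cluster
./nextflow run main.nf -profile slurm,modules [other parameters]
# With our Singularity container on a SLURM cluster
./nextflow run main.nf -profile slurm,singularity \
  -with-singularity singularity/metagWGS_with_dependancies.img [other parameters]
```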
# Usage
## Basic usage
A basic command line running the pipeline is:
```bash
./nextflow run main.nf \
-profile slurm,singularity \
--reads 'test/*_{R1,R2}.fastq.gz' \
--assembly [metaspades or megahit] \
[-with-singularity env/metagWGS.sif]
```
`'*_{R1,R2}.fastq.gz'` runs the pipeline on all the R1.fastq.gz and R2.fastq.gz files available in your working directory.
WARNING: you must choose between metaspades and megahit for the assembly step (see the sketch below).
The choice can be based on CPU and memory availability: metaSPAdes needs more CPUs and memory than Megahit, but our tests showed that its assembly metrics are better.
Two [Singularity](https://sylabs.io/docs/) containers are available, making installation trivial and results highly reproducible.
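To make the assembler trade-off concrete, the same sketch with each tool (only the `--assembly` value changes):
```bash
# metaSPAdes: needs more CPUs and memory; better assembly metrics in our tests
./nextflow run main.nf -profile slurm,singularity --reads 'test/*_{R1,R2}.fastq.gz' --assembly metaspades
# Megahit: lighter resource requirements
./nextflow run main.nf -profile slurm,singularity --reads 'test/*_{R1,R2}.fastq.gz' --assembly megahit
```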
## Other parameters
Other parameters are available:
```
Usage:
The typical command for running the pipeline is as follows:
nextflow run -profile standard main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades
Mandatory arguments:
--reads Path to input data (must be surrounded with quotes).
--assembly Tool used for assembly: 'metaspades' or 'megahit'.
-profile Configuration profile to use.
Available: local,modules (local with module load), slurm,modules (run pipeline on slurm cluster with module load),
local,singularity (local with Singularity container) and slurm,singularity (run pipeline on slurm cluster with Singularity container)
Options:
Mode:
--mode Paired-end ('pe') or single-end ('se') reads. Default: 'pe'.
Trimming options:
--adapter1 Sequence of adapter 1. Default: Illumina TruSeq adapter.
--adapter2 Sequence of adapter 2. Default: Illumina TruSeq adapter.
Quality option:
--qualityType Sickle supports three types of quality values: Illumina, Solexa, and Sanger. Default: 'sanger'.
Alignment options:
--db_host Host database already indexed (bwa index).
Taxonomic classification options:
--kaiju_nodes File nodes.dmp built with kaiju-makedb.
--kaiju_db File kaiju_db_refseq.fmi built with kaiju-makedb.
--kaiju_names File names.dmp built with kaiju-makedb.
--diamond_bank NR DIAMOND bank.
--accession2taxid FTP address of the file prot.accession2taxid.gz.
--taxdump FTP address of the file taxdump.tar.gz.
Clustering option:
--percentage_identity Sequence identity threshold. Default: 0.95.
Other options:
--outdir The output directory where the results will be saved.
--help Show this message and exit.
Software versions:
Cutadapt v1.15
Sickle v1.33
FastQC v0.11.7
MultiQC v1.5
BWA 0.7.17
Python v3.6.3
Kaiju v1.7.0
SPAdes v3.11.1
megahit v1.1.3
prokka v1.13.4
cdhit v4.6.8
samtools v1.9
bedtools v2.27.1
subread v1.6.0
diamond v0.9.22
Skip options:
--skip_sickle Skip the sickle process.
--skip_kaiju_index Skip the build of the kaiju database (index_db_kaiju process).
--skip_taxonomy Skip the taxonomy process.
```
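As an illustration, a hypothetical invocation combining several of the options above (every path below is a placeholder, not a real location):
```bash
./nextflow run main.nf \
  -profile slurm,modules \
  --reads '*_{R1,R2}.fastq.gz' \
  --assembly megahit \
  --mode pe \
  --qualityType sanger \
  --db_host /path/to/host_bwa_index \
  --percentage_identity 0.95 \
  --outdir my_results
```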
## Generated files
The pipeline will create the following files in your working directory:
```
* work # Directory containing the nextflow working files
* results # Directory containing result files
** results/01_Cleaned_raw_data: cleaned raw data files (after cutadapt or cutadapt+sickle, and after human reads removal)
** results/02_Quality_control: MultiQC file
** results/03_Classification_Kaiju: index database files (if the index_db_kaiju process is not skipped) and kaiju files (kaiju result files, Krona charts, histograms, kaiju results for each node of the taxonomy tree)
** results/04_Assembly: assembly files and assembly metrics
** results/05_Annotation: .gff, .ffn, .fna, .faa, etc. files after prokka annotation, and .gff, .ffn, .fna, .faa files with renamed contigs and genes
** results/06_Clustering: cd-hit results for each sample, correspondence table of intermediate clusters and genes, cd-hit results at global level, and correspondence table of global clusters and intermediate clusters (table_clstr.txt)
** results/07_Quantification: .bam and .bam.bai files after reads alignment on contigs, .count files (featureCounts count), .summary file (featureCounts summary), .output file (featureCounts output), Correspondence_global_clstr_contigs.txt (correspondence table between global clusters and genes), Clusters_Count_table_all_samples.txt (quantification table of aligned reads for each global cluster and each sample).
** results/08_Diamond: diamond results and taxonomic affiliation consensus for proteins and contigs.
* .nextflow_log # Log file from Nextflow
* # Other nextflow hidden files, eg. history of pipeline runs and old logs.
```
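For example, after a successful run you could inspect the main outputs like this (a sketch based on the layout above):
```bash
# MultiQC quality-control report
ls results/02_Quality_control/
# Quantification table of aligned reads per global cluster and sample
head results/07_Quantification/Clusters_Count_table_all_samples.txt
```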
## How to run demonstration on genologin cluster
* Test data are available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/tree/master/test)
* The BWA index of the human reference genome is available at /work/bank/bwadb/ensembl_homo_sapiens_genome
* Kaiju database index files are available at /work/bank/kaijudb/kaijudb_Juin2019/
WARNING: note that on genologin, modules are named "bioinfo/Name-version". We encountered compatibility issues when loading several modules at the same time. To avoid this problem, for some processes we created a single module that loads the required modules:
* bioinfo_bwa_samtools_subread
* bioinfo_bwa_samtools
You can run the pipeline as follows without the Singularity container:
```bash
./nextflow run -profile cluster_slurm_modules main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp
./nextflow run -profile slurm,modules main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp
```
You can run the pipeline as follows with our Singularity container:
```bash
module load system/singularity-3.0.1
./nextflow run main.nf \
-profile slurm,singularity \
--reads '*_{R1,R2}.fastq.gz' \
--assembly metaspades \
--skip_kaiju_index \
--kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp \
--kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi \
--kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp \
-with-singularity singularity/metagWGS_with_dependancies.img
```
## Documentation
metagWGS documentation is available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/dev/docs).
## License
metagWGS is distributed under the GNU General Public License v3.
## Copyright
2019 INRA, 2021 INRAE
## Citation
metagWGS has been presented at JOBIM 2020:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, C. Noirot, C. Klopp, P. Pinton, S. Combes, C. Hoede, G. Pascal.
https://www.sfbi.fr/sites/sfbi.fr/files/jobim/jobim2020/posters/compressed/jobim2020_poster_9.pdf
metagWGS has also been presented at JOBIM 2019 and at the Genotoul Biostat Bioinfo day:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.
report_comment: >
This report has been generated by the <a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">genotoul-bioinfo/metagwgs</a>
analysis pipeline. For information about how to interpret these results, please see the
<a href="https://forgemia.inra.fr/genotoul-bioinfo/metagwgs" target="_blank">documentation</a>.
extra_fn_clean_trim:
- "cleaned_"
- "raw_"
- '_kept_contigs'
- '.count_reads_on_contigs.flagstat'
- '.no_filter.flagstat'
- '.host_filter.flagstat'
- '_scaffolds'
- '.txt'
- '_R1'
- '.contigs'
- '.sort'
- '_select_contigs'
module_order:
- fastqc:
name: 'FastQC (raw)'
path_filters_exclude:
- '*cleaned_*.zip'
- cutadapt
- sickle:
path_filters:
- '*_sickle.log'
- samtools:
name: 'Reads aln on host genome'
info: 'This section reports the cleaned reads alignment against the host genome with bwa.'
path_filters:
- '*host_filter/*'
- fastqc:
name: 'FastQC (cleaned)'
info: 'This section of the report shows FastQC results after adapter trimming and cleaning.'
path_filters:
- '*cleaned_*.zip'
- kaiju
- quast:
name: 'Quast primary assembly'
info: 'This section of the report shows quast results after assembly'
path_filters:
- '*_all_contigs_QC/*'
- samtools:
name: 'Reads before host reads filter'
info: 'This section reports the reads alignment against the host genome with bwa (not sure this is interesting).'
path_filters:
- '*.no_filter.flagstat'
- samtools:
name: 'Reads after host reads filter'
info: 'This section reports the cleaned reads alignment against the host genome with bwa (not sure this is interesting).'
path_filters:
- '*.host_filter.flagstat'
- samtools:
name: 'Reads after deduplication'
info: 'This section reports the deduplicated reads alignment against contigs with bwa.'
path_filters:
- '*.count_reads_on_contigs.flagstat'
- quast:
name: 'Quast filtered assembly'
info: 'This section of the report shows quast results after assembly filtering.'
path_filters:
- '*_select_contigs_QC/*'
- prokka
- featureCounts
prokka_fn_snames: True
prokka_table: True
featurecounts:
fn: '*.summary'
shared: true
table_columns_visible:
FastQC (raw):
percent_duplicates: False
percent_gc: False
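This configuration is read by MultiQC when the report is generated; a minimal sketch, assuming the file above is saved as `multiqc_config.yaml` alongside the results:
```bash
# Build the HTML report from a results directory with the custom configuration
multiqc -c multiqc_config.yaml results/
```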
# Author: Ian Korf, Genome Center, UC Davis
# This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
# This software is provided AS IS, without warranty of any kind.
use strict;
use warnings;
package FAlite;
use strict;
sub new {
my ($class, $fh) = @_;
if (ref $fh !~ /GLOB/)
{die ref $fh, "\n", "FAlite ERROR: expect a GLOB reference\n"}
my $this = bless {};
$this->{FH} = $fh;
while(<$fh>) {last if $_ =~ /\S/} # not supposed to have blanks, but...
my $firstline = $_;
if (not defined $firstline) {warn "FAlite: Empty\n"; return $this}
if ($firstline !~ /^>/) {warn "FAlite: Not FASTA formatted\n"; return $this}
$this->{LASTLINE} = $firstline;
chomp $this->{LASTLINE};
return $this;
}
sub nextEntry {
my ($this) = @_;
return 0 if not defined $this->{LASTLINE};
my $fh = $this->{FH};
my $def = $this->{LASTLINE};
my @seq;
my $lines_read = 0;
while(<$fh>) {
$lines_read++;
if ($_ =~ /^>/) {
$this->{LASTLINE} = $_;
chomp $this->{LASTLINE};
last;
}
push @seq, $_;
}
return 0 if $lines_read == 0;
chomp @seq;
my $entry = FAlite::Entry::new($def, \@seq);
return $entry;
}
package FAlite::Entry;
use overload '""' => 'all';
sub new {
my ($def, $seqarry) = @_;
my $this = bless {};
$this->{DEF} = $def;
$this->{SEQ} = join("", @$seqarry);
$this->{SEQ} =~ s/\s//g; # just in case more spaces
return $this;
}
sub def {shift->{DEF}}
sub seq {shift->{SEQ}}
sub all {my $e = shift; return $e->{DEF}."\n".$e->{SEQ}."\n"}
1;
__END__
=head1 NAME
FAlite;
=head1 SYNOPSIS
use FAlite;
my $fasta = new FAlite(\*STDIN);
while(my $entry = $fasta->nextEntry) {
$entry->def;
$entry->seq;
}
=head1 DESCRIPTION
FAlite is a package for parsing FASTA files and databases. The FASTA format is
widely used in bioinformatics. It consists of a definition line followed by
sequence with an arbitrary number of lines and line lengths.
A FASTA file looks like this:
>identifier descriptive text
GAATTC
A FASTA database looks like this:
>identifier1 some text describing this entry
GAATTC
ACTAGT
>identifier2 some text describing this entry
AAACCT
GCTAAT
=head2 Object
FAlite has two kinds of objects, the file and the entry.
my $fasta_file = new FAlite(\*STDIN); # or any other filehandle
$entry = $fasta_file->nextEntry; # single fasta file
while(my $entry = $fasta_file->nextEntry) {
# canonical form of use for fasta database
}
The entry has two attributes (def and seq).
$entry->def; # access the def line
$entry->seq; # access the sequence
"$entry"; # overload to fasta file ($entry->def . "\n" . $entry->seq)
=head1 AUTHOR
Ian Korf (ikorf@sapiens.wustl.edu, http://sapiens.wustl.edu/~ikorf)
=head1 ACKNOWLEDGEMENTS
This software was developed at the Genome Sequencing Center at Washington
University, St. Louis, MO.
=head1 COPYRIGHT
Copyright (C) 1999 Ian Korf. All Rights Reserved.
=head1 DISCLAIMER
This software is provided "as is" without warranty of any kind.
=cut
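Beyond the synopsis above, FAlite can also be driven from the shell; a minimal sketch, assuming `FAlite.pm` is on `PERL5LIB` and `contigs.fa` is any FASTA file:
```bash
# Count the entries in a FASTA file with FAlite
perl -MFAlite -e 'my $fa = new FAlite(\*STDIN); my $n = 0; $n++ while $fa->nextEntry; print "$n\n";' < contigs.fa
```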
#!/usr/bin/env python
"""----------------------------------------------------------------------------
Script Name: Filter_contig_per_cpm.py
Description: Calculates the CPM normalization of mapped reads for each \
contig and returns contigs which have a CPM > cutoff in .fa.
Input files: Samtools idxstats output file, .fasta file of contigs.
Created By: Joanna Fourquet
Date: 2020-04-01
-------------------------------------------------------------------------------
"""
# Metadata
__author__ = 'Joanna Fourquet \
- GenPhySE - Team NED'
__copyright__ = 'Copyright (C) 2020 INRA'
__license__ = 'GNU General Public License'
__version__ = '0.1'
__email__ = 'support.bioinfo.genotoul@inra.fr'
__status__ = 'dev'
# Status: dev
# Modules importation
try:
import argparse
import sys
import pandas as p
import numpy as np
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import pprint
from Bio import SeqIO
except ImportError as error:
print(error)
exit(1)
################################################
# Function
################################################
def cpm(counts):
    # Library size: total mapped reads over all contigs
    # (third column of the samtools idxstats table).
    N = np.sum(counts.iloc[:,2], axis=0)
    # Mapped-read count of each contig.
    C = counts.iloc[:,2]
    # Counts per million: normalize each contig's count by the library size.
    cpm_values = 1e6 * C / N
    return(cpm_values)
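# Worked example of the formula above (illustration only, not part of the
# original script): for three contigs with mapped-read counts [10, 90, 900]
# in the third idxstats column, N = 1000, so
# cpm = 1e6 * [10, 90, 900] / 1000 = [10000.0, 90000.0, 900000.0].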
def main(argv):
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--samtools_idxstats", \
required = True, help = "samtools idxstats file containing contig id, \
sequence length, number of mapped reads or fragments, \
number of unmapped reads or fragments")
parser.add_argument('-f', '--fasta_file', required = True, help = \
'fasta file containing sequences of contigs.')
parser.add_argument("-c", "--cutoff_cpm", required = True, \
help = "Minimum number of reads in a contig")
parser.add_argument("-s", "--select", \
help = "Name of outpout .fa file containing contigs which passed cpm cutoff")
parser.add_argument("-d", "--discard", \
help = "Name of outpout .fa file containing contigs which don't passed cpm cutoff")
args = parser.parse_args()
# Read input table
raw_counts = p.read_table(args.samtools_idxstats,sep ='\t',header = None, comment="*")