From 411689ae70492f987ce21e8258d89f9632db2e52 Mon Sep 17 00:00:00 2001
From: DUGAT-BONY Eric <eric.dugat-bony@inrae.fr>
Date: Fri, 17 Jan 2025 12:32:38 +0100
Subject: [PATCH] Update file README.md

---
 README.md | 32 +++++---------------------------
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/README.md b/README.md
index 7e96d00..d016d9f 100644
--- a/README.md
+++ b/README.md
@@ -76,7 +76,7 @@ done
 ```
 
 ### Classification of prophage sequences
-Prophage sequences were classified [PhaBox](https://github.com/KennthShang/PhaBOX) version 2.0 and the results were save in `results/PhaBox`. Here is the command line:
+Prophage sequences were classified using [PhaBox](https://github.com/KennthShang/PhaBOX) version 2.0 and the results were saved in `results/PhaBox`. Here is the command line:
 ```
 conda activate phabox
 python ~/work/PhaBox/PhaBOX/main.py --contigs results/GENOMAD/All_provirus.fna --threads 8 --len 3000 --rootpth results/PhaBox/ --out out_provirus/ --dbdir ~/work/PhaBox/PhaBOX/database/ --parampth ~/work/PhaBox/PhaBOX/parameters/ --scriptpth ~/work/PhaBox/PhaBOX/scripts/
@@ -92,55 +92,33 @@ All contigs larger than 2 kb were concatenated and dereplicated to retain only o
 The abundance table was processed using the Virome_analyses_2024.rmd script in RStudio to create phyloseq objects for DNA- and RNA-amplified samples.
 
 ### Read QC
-For each sample, raw reads were quality quality checked using [FastQC](https://github.com/s-andrews/FastQC) version 0.11.9, as follow:
+For each sample, raw reads were quality checked using [FastQC](https://github.com/s-andrews/FastQC) version 0.11.9, and the results were saved in `results/FastQC_report`. Here is the command line:
 ```
-mkdir results/FastQC_report
-
 conda activate fastqc-0.11.9
 fastqc /RAW_DATA/virome/Sample1_R1.fastq.gz -o results/FastQC_report/
 fastqc /RAW_DATA/virome/Sample1_R2.fastq.gz -o results/FastQC_report/
 conda deactivate
 ```
 
-#### Merging sample sequenced twice
-Firstly, we need to decompressed fastq sample to concatenated them.
-Then we creat a text file with the number of read for each raw fastq file before concatenation.
-Then, the fastq files were concatenated by cat function into a new fastq file. And the raw data were deleted.
-The last command ligne allow to control the number of reads of each fasts files after concatenation.
+### Merging sample sequenced twice
+Sequence files were uncompressed and concatenated using the following command lines:
 ```
 gunzip /RAW_DATA/virome/*.fastq.gz
-
 mkdir results/stat/
-
 for file in /RAW_DATA/virome/*_R1.fastq ; do sample=$(echo $(basename $file)); grep -c ^@ $sample | awk -v var=$sample '{print var "\t" $0}'; done > results/stat/Nbre_reads_raw_sequencing.txt
-
 cat /RAW_DATA/virome/Sample1_sequencing1_R1.fastq /RAW_DATA/virome/Sample1_sequencing2_R1.fastq > /RAW_DATA/virome/Sample1_R1.fastq
 cat /RAW_DATA/virome/Sample1_sequencing1_R2.fastq /RAW_DATA/virome/Sample1_sequencing2_R2.fastq > /RAW_DATA/virome/Sample1_R2.fastq
 rm /RAW_DATA/virome/Sample1_sequencing1_R1.fastq
 rm /RAW_DATA/virome/Sample1_sequencing2_R1.fastq
 rm /RAW_DATA/virome/Sample1_sequencing1_R2.fastq
 rm /RAW_DATA/virome/Sample1_sequencing2_R2.fastq
-
 for file in /RAW_DATA/virome/*.fastq ; do sample=$(echo $(basename $file)); grep -c ^@ $sample | awk -v var=$sample '{print var "\t" $0}'; done > results/stat/Nbre_reads_after_conc.txt
 ```
 
 ### Host decontamination
-We remove all reads that match to cabbage, carrot and turnip genome in order to eliminate host genome for a better assembly.
-For cabbage the host genome is Brassica oleracea var. oleracea genome NCBI RefSeq GCF_000695525.1, GenBak assembly GCA_000695525.1. Sequencing date May 27, 2014.
-For the carrots samples the reference genome is Daucus carota subsp. sativus NCBI RefSeq GCF_001625215.1, GenBak assembly GCA_001625215.1. Sequencing date May 6, 2016
-For turnips samples chose as reference genome is Brassica rapa subsp. rapa, GenBak assembly GCA_018901965.1. Sequencing date May 6, 2016
-Note: This scaffold-level genome assembly includes 655 scaffolds and no assembled chromosomes.
-
-Each reads were aligned against the reference genome by command ligne extracted from the book : The plant microbiome methods and protocols, Lilia C. Carvalhais and Paul G. Dennis, Methods in Molecular Biology,2021 ISBN 978-1-0716-1039-8
+Host decontamination involved mapping reads against the genomes of Brassica oleracea var. oleracea (NCBI RefSeq GCF_000695525.1, GenBank assembly GCA_000695525.1), Daucus carota subsp. sativus (NCBI RefSeq GCF_001625215.1, GenBank assembly GCA_001625215.1) and Brassica rapa subsp. rapa (GenBank assembly GCA_018901965.1) using [BWA](https://github.com/lh3/bwa) version 0.7.17. The alignment was then processed using [samtools](https://github.com/samtools/samtools) version 1.12 and [bedtools](https://github.com/arq5x/bedtools2) version 2.30.0 to discard reads aligning with the host genomes. We recorded the number of reads discarded. The whole procedure was embedded in the script `reads_host_decontamination.sh`. The results were saved in `Host_decontamination`. Here is the command line used to run the script:
 
-Then we control the reads number of each sample to identify the proportion of host reads discarded
-
-Directory reference genome:
-RAW_DATA/Reference_genome/
 ```
-mkdir results/Host_decontamination
-
-chmod u+x scripts/reads_host_decontamination.sh
 sh scripts/reads_host_decontamination.sh
 ```
 
-- 
GitLab