From 411689ae70492f987ce21e8258d89f9632db2e52 Mon Sep 17 00:00:00 2001
From: DUGAT-BONY Eric <eric.dugat-bony@inrae.fr>
Date: Fri, 17 Jan 2025 12:32:38 +0100
Subject: [PATCH] Update file README.md

---
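Note on the read-count loops kept as context in the "Merging sample sequenced twice" block: `grep -c ^@` can overcount, because a FASTQ quality line may also begin with `@`, and the loop passes the basename rather than the full path to `grep`. A minimal sketch of an alternative is given below, assuming uncompressed FASTQ files with exactly four lines per record. The function names `count_reads` and `count_reads_table` are hypothetical and are not part of the repository.

```shell
# Hypothetical helper: count reads in one uncompressed FASTQ file.
# Assumes four lines per record (no wrapped sequence lines).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical helper: print "file name<TAB>read count" for every
# .fastq file found in the directory given as first argument.
count_reads_table() {
    for file in "$1"/*.fastq; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$file" ] || continue
        printf '%s\t%s\n' "$(basename "$file")" "$(count_reads "$file")"
    done
}
```

It could be used as `count_reads_table /RAW_DATA/virome > results/stat/Nbre_reads_after_conc.txt`, with the paths taken from the README.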
 README.md | 32 +++++---------------------------
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/README.md b/README.md
index 7e96d00..d016d9f 100644
--- a/README.md
+++ b/README.md
@@ -76,7 +76,7 @@ done
 ```
 
 ### Classification of prophage sequences
-Prophage sequences were classified [PhaBox](https://github.com/KennthShang/PhaBOX) version 2.0 and the results were save in `results/PhaBox`. Here is the command line:
+Prophage sequences were classified using [PhaBox](https://github.com/KennthShang/PhaBOX) version 2.0 and the results were saved in `results/PhaBox`. Here is the command line:
 ```
 conda activate phabox
 python ~/work/PhaBox/PhaBOX/main.py --contigs results/GENOMAD/All_provirus.fna --threads 8 --len 3000 --rootpth results/PhaBox/ --out out_provirus/ --dbdir ~/work/PhaBox/PhaBOX/database/ --parampth ~/work/PhaBox/PhaBOX/parameters/ --scriptpth ~/work/PhaBox/PhaBOX/scripts/
@@ -92,55 +92,33 @@ All contigs larger than 2 kb were concatenated and dereplicated to retain only o
 The abundance table was processed using the Virome_analyses_2024.rmd script in RStudio to create phyloseq objects for DNA- and RNA-amplified samples.
 
 ### Read QC
-For each sample, raw reads were quality quality checked using [FastQC](https://github.com/s-andrews/FastQC) version 0.11.9, as follow:
+For each sample, raw reads were quality checked using [FastQC](https://github.com/s-andrews/FastQC) version 0.11.9, and the results were saved in `results/FastQC_report`. Here is the command line:
 ```
-mkdir results/FastQC_report
-
 conda activate fastqc-0.11.9
 fastqc /RAW_DATA/virome/Sample1_R1.fastq.gz -o results/FastQC_report/
 fastqc /RAW_DATA/virome/Sample1_R2.fastq.gz -o results/FastQC_report/
 conda deactivate
 ```
 
-#### Merging sample sequenced twice 
-Firstly, we need to decompressed fastq sample to concatenated them.
-Then we creat a text file with the number of read for each raw fastq file before concatenation.
-Then, the fastq files were concatenated by cat function into a new fastq file. And the raw data were deleted.
-The last command ligne allow to control the number of reads of each fasts files after concatenation.
+### Merging samples sequenced twice
+Sequence files were decompressed and concatenated, and reads were counted before and after concatenation, using the following command lines:
 ```
 gunzip /RAW_DATA/virome/*.fastq.gz
-
 mkdir results/stat/
-
 for file in /RAW_DATA/virome/*_R1.fastq ; do sample=$(echo $(basename $file)); grep -c ^@ $sample | awk -v var=$sample '{print var "\t" $0}'; done > results/stat/Nbre_reads_raw_sequencing.txt
-
 cat /RAW_DATA/virome/Sample1_sequencing1_R1.fastq /RAW_DATA/virome/Sample1_sequencing2_R1.fastq > /RAW_DATA/virome/Sample1_R1.fastq
 cat /RAW_DATA/virome/Sample1_sequencing1_R2.fastq /RAW_DATA/virome/Sample1_sequencing2_R2.fastq > /RAW_DATA/virome/Sample1_R2.fastq
 rm /RAW_DATA/virome/Sample1_sequencing1_R1.fastq
 rm /RAW_DATA/virome/Sample1_sequencing2_R1.fastq
 rm /RAW_DATA/virome/Sample1_sequencing1_R2.fastq
 rm /RAW_DATA/virome/Sample1_sequencing2_R2.fastq
-
 for file in /RAW_DATA/virome/*.fastq ; do sample=$(echo $(basename $file)); grep -c ^@ $sample | awk -v var=$sample '{print var "\t" $0}'; done > results/stat/Nbre_reads_after_conc.txt
 ```
 
 ### Host decontamination
-We remove all reads that match to cabbage, carrot and turnip genome in order to eliminate host genome for a better assembly.
-For cabbage the host genome is Brassica oleracea var. oleracea genome NCBI RefSeq GCF_000695525.1, GenBak assembly GCA_000695525.1. Sequencing date May 27, 2014.
-For the carrots samples the reference genome is Daucus carota subsp. sativus  NCBI RefSeq GCF_001625215.1, GenBak assembly GCA_001625215.1. Sequencing date May 6, 2016
-For turnips samples chose as reference genome is Brassica rapa subsp. rapa, GenBak assembly GCA_018901965.1. Sequencing date May 6, 2016
-Note: This scaffold-level genome assembly includes 655 scaffolds and no assembled chromosomes.
-
-Each reads were aligned against the reference genome by command ligne extracted from the book : The plant microbiome methods and protocols, Lilia C. Carvalhais and Paul G. Dennis, Methods in Molecular Biology,2021 ISBN 978-1-0716-1039-8
+Host decontamination consisted of mapping reads against the genomes of Brassica oleracea var. oleracea (NCBI RefSeq GCF_000695525.1, GenBank assembly GCA_000695525.1), Daucus carota subsp. sativus (NCBI RefSeq GCF_001625215.1, GenBank assembly GCA_001625215.1) and Brassica rapa subsp. rapa (GenBank assembly GCA_018901965.1) using [BWA](https://github.com/lh3/bwa) version 0.7.17. The alignments were then processed using [samtools](https://github.com/samtools/samtools) version 1.12 and [bedtools](https://github.com/arq5x/bedtools2) version 2.30.0 to discard reads aligning to the host genomes, and the number of discarded reads was recorded. The whole procedure is embedded in the script `reads_host_decontamination.sh`. The results were saved in `Host_decontamination`. Here is the command line used to run the script:
 
-Then we control the reads number of each sample to identify the proportion of host reads discarded
-
-Directory reference genome: 
-RAW_DATA/Reference_genome/
 ```
-mkdir results/Host_decontamination
-
-chmod u+x scripts/reads_host_decontamination.sh
 sh scripts/reads_host_decontamination.sh
 ```
 
-- 
GitLab