Sebastien Theil · 45e0da90
--- a/tutorial.md
+++ b/tutorial.md
-# ANOMALY Workflow tutorial.
+# ANOMALY step-by-step tutorial.

-This workflow is inpired by [DADA2 workflow](https://benjjneb.github.io/dada2/tutorial.html) and [Phyloseq analysis](https://joey711.github.io/phyloseq/index.html).

-ANOMALY improvements:
- A versatile and customizable workflow, from fastq treatment to statistics analysis.
- Integrating contaminant filtering step with decontam R package.
- Taxonomic assignment based on IDTAXA with two databases, with validation step to keep best assignment.
- Integrated statistical analysis (ANOVA, permanova, non-parametric test).
- Multiple differential analysis methods (DESeq2, metacoder, metagenomeSeq)
+## Help

-
-## Clone ANOMALY repository
-```{bash}
-# Set ANOMALYpath
-ANOMALYpath=/pathtoanomaly/
-git clone git@forgemia.inra.fr:umrf/anomaly.git
-cd /PATHTO/anomaly/test
-```
+To get more detailed script options, just launch the script without any options.

 ## Tests samples

-36 fastq subsamples of amplified 16S DNA extracted from cow teat skin surface are available in `./reads/` directory. There are 15 control samples for contamination filtering.
+We provide 36 fastq samples of amplified 16S DNA extracted from cow teat skin surface and 15 control samples (blank extraction) for contamination filtering.

 Here are a preview of samples metadatas:

-| sample-id | nom  | lot  | temps | origin | type   | extraction | temps_lot |
-|-----------|------|------|-------|--------|--------|------------|-----------|
-| 1181-T1   | 1181 | lot3 | T1    | trayon | sample | 2          | T1_lot3   |
-| 1181-T3   | 1181 | lot3 | T3    | trayon | sample | 4          | T3_lot3   |
-| 1181-T6   | 1181 | lot3 | T6    | trayon | sample | 9          | T6_lot3   |
-| 1181-T9   | 1181 | lot3 | T9    | trayon | sample | 14         | T9_lot3   |
-| 153-T1    | 153  | lot2 | T1    | trayon | sample | 1          | T1_lot2   |
-| 153-T3    | 153  | lot2 | T3    | trayon | sample | 5          | T3_lot2   |
-| 153-T6    | 153  | lot2 | T6    | trayon | sample | 9          | T6_lot2   |
+| sample-id | lot  | temps | type   | temps_lot | quant_reading |
+|-----------|------|------ |--------|-----------|---------------|
+| 1181-T1   | lot3 | T1    | sample | T1_lot3   |          500  |
+| 1181-T3   | lot3 | T3    | sample | T3_lot3   |          500  |
+| 1181-T6   | lot3 | T6    | sample | T6_lot3   |          500  |
+| 1181-T9   | lot3 | T9    | sample | T9_lot3   |          500  |
+| TT1       |      |       | control |          |            10 |
+| TT2       |      |       | control |           |           10 |
+| TT3       |      |       | control |           |           10 |


 ## DADA2
+First we will process all paired-read files with Dada2 to create ASV.
+
 ```{bash}
 Rscript $ANOMALYpath/dada2/dada2_process.R -a 16S -p ./reads -q TRUE -c TRUE
 ```

 Main output:
- read_tracking.csv as below
+- `read_tracking.csv` that summarize the read number after each filtering step.

-|      | input | filtered | denoisedF | denoisedR | merged | nonchim |
+|         | input | filtered | denoisedF | denoisedR | merged | nonchim |
 |---------|-------|----------|-----------|-----------|--------|---------|
-| 1181-T1 | 10000 | 6644     | 6298      | 6344      | 4726   | 4245    |
-| 1181-T3 | 10000 | 6370     | 5834      | 5784      | 4247   | 4049    |
-| 1181-T6 | 10000 | 6581     | 6486      | 6509      | 6110   | 6089    |
-| 1181-T9 | 10000 | 6652     | 6422      | 6402      | 5570   | 5419    |
-| 153-T1  | 10000 | 6579     | 5880      | 6120      | 4027   | 3773    |
-| 153-T3  | 10000 | 6630     | 6093      | 6097      | 4247   | 3975    |
-
- robjects.Rdata with raw otu table and representative sequences.
+| 1181-T1 |  1000 |      679 |       608 |       622 |    505 |     505 |
+| 1181-T3 |  1000 |      643 |       511 |       456 |    380 |     380 |
+| 1181-T6 |  1000 |      702 |       662 |       681 |    621 |     620 |
+| 1181-T9 |  1000 |      648 |       559 |       563 |    475 |     473 |
+| 153-T1  |  1000 |      688 |       553 |       552 |    382 |     382 |
+| 153-T3  |  1000 |      666 |       450 |       430 |    318 |     318 |
+
+The sample names extracted from the file name. We consider as sample name anything that is before the first underscore. This must match the sample names that are in sample metadata files.
+*input*: raw read number.
+*filtered*: after dada2 filtering step: no N's in sequence, low quality, and phiX.
+*denoisedF* & *denoisedR*: after denoising. Forward & Reverse.
+*merged*: after merging R1 & R2.
+*nonchim*: after chimeras filtering.
+
+- `dada2_robjects.Rdata` with raw ASV table and representative sequences in objects `otu.table`, `seqtab.export` & `seqtab.nochim`.
+- `raw_asv-table.csv`
+- `rep-seqs.fna`

 ## Taxonomic assignment

-This script uses IDTAXA function from DECIPHER package, and allows to use 2 differents databases. It keeps the best assignation on 2 criteria, resolution and confidence. The final taxonomy is validated by multiple ancestors taxa and incongruity correction step.
+This script uses IDTAXA function from DECIPHER package, and allows to use 2 differents databases. It keeps the best assignation on 2 criteria, resolution (depth) and confidence. The final taxonomy is validated by multiple ancestors taxa and incongruity correction step.
 ```{bash}
-DB1=./bank/SILVA_SSU_r132_March2018.RData
-DB2=./bank/DAIRYdb_v1.2.0_20190222_IDTAXA.RData
-
-Rscript $ANOMALYpath/taxonomy/assign_taxo.R -r dada2_out/robjects.Rdata \
-  -o tax_out -i ${DB1},${DB2}
+Rscript $ANOMALYpath/taxonomy/assign_taxo.R -r dada2_out/dada2_robjects.Rdata \
+  -o tax_out -i /home/db/dairydb/DAIRYdb_v1.1.2_20180914_IDTAXA.RData,/home/db/silva/SILVA_SSU_r132_March2018.RData
 ```

 Main output:
- robjects.Rdata with taxonomy in phyloseq format.
- csv table with the different assignations.
- csv table with final assignation.
+- `taxo_robjects.Rdata` with taxonomy in phyloseq format in `tax.table` object.
+- `final_tax_table.csv` the final assignation table.
+- `allDB_tax_table.csv` assignations from the two databases.
+

+## Phylogenetic Tree

-## Tree generation
-Using phangorn and DECIPHER packages.
+The phylogenetic tree from the representative sequences is generated using phangorn and DECIPHER packages.
 ```{bash}
-Rscript $ANOMALYpath/tools/generate_tree.R -r dada2_out/robjects.Rdata -o tree_out/
+Rscript $ANOMALYpath/tools/generate_tree.R -r dada2_out/dada2_robjects.Rdata -o tree_out/
 ```
 Main output:
- robjects.Rdata with phylogenetic tree in phyloseq format.
+- `tree_robjects.Rdata` with phylogenetic `tree` object in phyloseq format.
+
+## Phyloseq object
+
+To create a phyloseq object, we need to merge four objects and one file:
+- the asv table `otu.table` and  the representative sequences `seqtab.nochim` from `dada2_robjects.Rdata`
+- a taxonomy table `taxo_robjects.Rdata` from `taxo_robjects.Rdata`
+- the phylogenetic tree `tree` from `tree_robjects.Rdata`
+- metadata from `sample-metadata.csv`

-## Phyloseq object generation
 ```{bash}
-Rscript $ANOMALYpath/tools/generate_phyloseq.R -r tree_out/robjects.Rdata \
-  -t dada2_out/robjects.Rdata -q tax_out/robjects.Rdata \
+Rscript $ANOMALYpath/tools/generate_phyloseq.R -r tree_out/tree_robjects.Rdata \
+  -t dada2_out/dada2_robjects.Rdata -q tax_out/taxo_robjects.Rdata \
  -m sample-metadata.csv
 ```
 Main output:
- robjects.Rdata with phyloseq object grouping otu table, metadata, taxonomy, tree and sequences. Phyloseq object in rdata is named `data`.
+- `robjects.Rdata` with phyloseq object in `data` for raw counts and `data_rel` for relative abundance.

 ## Decontamination

-`decontam.r` scripts uses `decontam` R package with control samples to filter contaminants. Then it proceeds to a basic frequency (<0.005%) and prevalence in samples (2) filtering on otus.
+`decontam.r` scripts uses [decontam](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) R package with control samples to filter contaminants.
+The decontam package offers two main methods, frequency and prevalence (and then you can combine those methods). For frequency method, it is mandatory to have the dna concentration of each sample in phyloseq (and hence in the `sample-metadata.csv`).
+"_In this method, the distribution of the frequency of each sequence feature as a function of the input DNA concentration is used to identify contaminants._"
+In the prevalence methods no need of DNA quantification.
+"_In this method, the prevalence (presence/absence across samples) of each sequence feature in true positive samples is compared to the prevalence in negative controls to identify contaminants._"
+
+Tips: sequencing plateforms often quantify the DNA before sequencing, but do not automaticaly give the information. Just ask for it ;).
+
+Our script integrates the basics ASV frequency (nb_reads_ASV/nb_total_reads) and prevalence (nb_sample_ASV/nb_total_sample) filtering.
+As in our lab we had a known recurrent contaminant we included an option to filter out ASV based on they taxa names.

 ```{bash}
 Rscript $ANOMALYpath/filtering/decontam.r -r phyloseq/robjects.Rdata \
@@ -99,21 +111,40 @@ Rscript $ANOMALYpath/filtering/decontam.r -r phyloseq/robjects.Rdata \
 ```

 Main output:
- robjects.Rdata with contaminant filtered phyloseq object
- `Exclu_out.csv`  list filtered otus for each filtering step.
- Kronas before and after filtering.
+- `robjects.Rdata` with contaminant filtered phyloseq object named `data`
+- `Exclu_out.csv`  list of filtered ASVs for each filtering step.
+- Kronas [before](test/krona_no_filtering.html) and [after](test/krona_filtering.html) filtering.
+- `raw_asv-table.csv` & `relative_asv-table.csv`
+- `venndiag_filtering.png`
+
+## Plots, diversity and statistics
+
+!!! We are currently developping a ShinyApp to visualize your data, sub-select your samples/taxons and do all those analyses interactively !!!
+[ExploreMetabar](https://forgemia.inra.fr/umrf/exploremetabar)
+
+### Rarefaction curves and composition

+This script generates an html page, with rarefaction curves, composition plots of raw and relative abundances.

-## Statistic inferences
-### Plot composition
+The option `-a` and `-b` designates the columns you want to represent in the plot. Here we want to observe sample by 'temps' split the graph by the other column ''
 Using [phyloseq-extended repository](https://github.com/mahendra-mariadassou/phyloseq-extended)

+`-i` option sets the taxa level at which you want your plots, you can set multiple levels separated with comma.
+
 ```{bash}
 Rscript $ANOMALYpath/plot_stats/bars.R -r decontam_out/robjects.Rdata \
  -a temps -b lot -i Family,Genus -e lot -s TRUE
 ```
-![barplot](figures/plot_bar_Genus.png)
-![plot composition](figures/plot_composition_lot3_Genus.png)
+#### Rarefaction curves
+![rare](figures/rarefaction_curves.png)
+
+#### Composition
+##### Raw read number
+![barplot](figures/plot_bar_family.png)
+##### Relative abundance
+![plot composition](figures/plot_composition_family.png)
+
+


 ### Alpha diversity analysis
@@ -204,12 +235,12 @@ Number of permutations: 1000

 Terms added sequentially (first to last)

-          Df SumsOfSqs  MeanSqs F.Model      R2   Pr(>F)    
+          Df SumsOfSqs  MeanSqs F.Model      R2   Pr(>F)
 Depth      1   0.15713 0.157128  4.3868 0.09074 0.002997 **
 temps      3   0.51220 0.170733  4.7667 0.29580 0.000999 ***
-lot        2   0.02350 0.011749  0.3280 0.01357 0.977023    
-Residuals 29   1.03872 0.035818         0.59988             
-Total     35   1.73155                  1.00000             
+lot        2   0.02350 0.011749  0.3280 0.01357 0.977023
+Residuals 29   1.03872 0.035818         0.59988
+Total     35   1.73155                  1.00000
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

@@ -301,5 +332,5 @@ Rscript $ANOMALYpath/tools/ASVenn.R -r ./decontam_out/robjects.Rdata -b lot
 ```
 Rscript $ANOMALYpath/tools/phy2cyto.R -r ./decontam_out/robjects.Rdata -a lot
 ```
-Example of bipartite network on real data: 
+Example of bipartite network on real data:
 ![boxplot](figures/sif_tab1.tsv.png)