From 9b203cf7c4083a6f846f9bbc4763ddad82061ce3 Mon Sep 17 00:00:00 2001
From: Robert Bossy <Robert.Bossy@inra.fr>
Date: Mon, 5 Apr 2021 16:43:45 +0200
Subject: [PATCH] update README

---
 README.md | 91 ++++++++++++++++++++++++-------------------------------
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/README.md b/README.md
index 26623a4..5a00867 100644
--- a/README.md
+++ b/README.md
@@ -1,81 +1,70 @@
 # BacDive Utils
 
-Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.
+Tools for downloading the DSMZ catalog through BacDive and merging it with the NCBI Taxonomy.
 
-## Python utilities
+## Workflow
 
-### `dsmz.py`
+### 0. Configuration
 
-A library containing base classes for wrapping REST APIs provided by DSMZ.
+Edit the `config.yaml` file, and set the following variables:
 
-See: https://www.dsmz.de/services/online-tools
+* `BACDIVE_USER`: username for the BacDive API
+* `BACDIVE_PASSWORD_FILE`: file containing the password for the BacDive API
+* `ALVISNLP`: path to the AlvisNLP binary
+* `REWRITE_TAXONOMY`: path to the `rewrite-taxonomy` binary
 
-### `bacdive.py`
 
-Python wrapper for downloading DSMZ catalog entries.
-This script may be used as a library or in command-line.
+### 1. NCBI Taxonomy download
 
-See: https://bacdive.dsmz.de/api/bacdive/
-
-### `pnu.py`
+```shell
+snakemake -j 1 -s ncbi-download.snakefile
+```
 
-Python wrapper for downloading LPSN taxa.
-This script may be used as a library or in command-line.
+**ETA: 2 minutes**
 
-See: https://bacdive.dsmz.de/api/pnu/
+This will download the Taxonomy archive from the NCBI FTP server, then extract it.
 
-## Configuration
+The download is anonymous and does not require any registration.
 
-In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
+### 2. DSMZ catalog download
 
-Once the registration process is completed, write the BacDive password into a file.
-**Set this file permissions as `0600` in order to prevent other users to access it**.
+```shell
+snakemake -j 1 -s dsmz-download.snakefile
+```
 
-Open `config.yml` and set the following variables:
+**ETA: 10 hours**
 
-* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail registered to BacDive and the password file.
-* `ALVISNLP`: the path to AlvisNLP binary.
-* `TAXA_FILE`: path to the `taxa+id_microorganisms.txt` file.
-* `OUTDIR`: the path to the directory where to store results.
+This will download the whole DSMZ catalog via the BacDive service.
 
-## Download DSMZ entries
+In order to use the BacDive API, you must [register](https://bacdive.dsmz.de/api/bacdive/registration/register/).
+Set the variable `BACDIVE_USER` in the `config.yaml` file to the registered user name.
+Store the password in a file (do not forget to remove read permissions for group and others), then set the variable `BACDIVE_PASSWORD_FILE` to the path of that file.
 
-Ì€``shell
+### 3. Match DSMZ strains to NCBI taxonomy
 
-snakemake --snakefile dsmz-download.snakefile --cores 1
+```shell
+snakemake -j 1 -s dsmz-match.snakefile
 ```
 
-**Downloading the whole DSMZ takes a long time**.
-Depending on the bandwith, it takes between 6 and 10 hours.
+**ETA: 2 minutes**
 
-As in february 2021, the whole catalog includes more than 80k entries and takes approximately 700MB in XML format.
+This will look for a suitable place in the NCBI Taxonomy for each DSMZ strain.
 
-## Match DSMZ strains to NCBI
+The output of this step consists of 5 files:
+* `bacdive-to-taxid.txt`: a mapping file from the BacDive identifier of the strain to the taxon identifier
+* `dispatch-report.txt`: a report of the dispatch for each DSMZ strain
+* `dsmz-nodes.dmp` and `dsmz-names.dmp`: files in the format of the NCBI Taxonomy with additional nodes and synonyms
+* `warnings.txt`: issues that need attention
 
-```shell
 
-snakemake --snakefile dsmz-match.snakefile --cores 1
-```
-
-This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt` and whichever file specified in the configuration file as `TAXA_FILE`.
-
-### Output files
+### 4. Rewrite taxonomy
 
-The DSMZ match results are written in the `dsmz-match` directory in the output directory.
-
-#### `dispatch-report.txt`
+```shell
+snakemake -j 1 -s rewrite-taxonomy.snakefile
+```
 
-A tabular file where each line represents a DSMZ catalog entry.
+**ETA: 12 minutes**
 
-* `ENTRY`: name of the catalog entry file.
-* `DISPATCH`: dipatch decision
-  * `append`: new strain appended to NCBI species or NCBI subspecies
-  * `append-species`: new strain and lineage appended to NCBI superspecific taxon
-  * `equivalent`: equivalence with NCBI strain
-  * `type material`: equivelence with NCBI species or subspecies given as type strain
-  * `no-number`: strain has no designation or number
-  * `fail`: no match at any taxonomic level
+This will write the merged taxonomy in a format suitable for text projection.
 
-#### `dsmz-nodes.txt` 
 
-Files in the format of NCBI Taxonomy `nodes.dmp` and `names.dmp` that contains all additions.
-- 
GitLab