update README

Commit 9b203cf7 authored 3 years ago by Robert Bossy
--- a/README.md
+++ b/README.md
 # BacDive Utils

-Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.
+Taxonomy management tools.

-## Python utilities
+## Workflow

-### `dsmz.py`
+### 0. Configuration

-A library containing base classes for wrapping REST APIs provided by DSMZ.
+Edit the `config.yaml` file, and set the following variables:

-See: https://www.dsmz.de/services/online-tools
+* `BACDIVE_USER`: username for the BacDive API
+* `BACDIVE_PASSWORD_FILE`: file containing the password for the BacDive API
+* `ALVISNLP`: path to the AlvisNLP binary
+* `REWRITE_TAXONOMY`: path to the `rewrite-taxonomy` binary

-### `bacdive.py`

-Python wrapper for downloading DSMZ catalog entries.
-This script may be used as a library or in command-line.
+### 1. NCBI Taxonomy download

-See: https://bacdive.dsmz.de/api/bacdive/
-
-### `pnu.py`
+```shell
+snakemake -j 1 -s ncbi-download.snakefile
+```

-Python wrapper for downloading LPSN taxa.
-This script may be used as a library or in command-line.
+**ETA: 2 minutes**

-See: https://bacdive.dsmz.de/api/pnu/
+This will download the Taxonomy archive from the NCBI FTP server, then unzip the archive.

-## Configuration
+The download is anonymous and does not require any registration.

-In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
+### 2. DSMZ catalog download

-Once the registration process is completed, write the BacDive password into a file.
-**Set this file's permissions to `0600` to prevent other users from accessing it**.
+```shell
+snakemake -j 1 -s dsmz-download.snakefile
+```

-Open `config.yml` and set the following variables:
+**ETA: 10 hours**

-* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail registered to BacDive and the password file.
-* `ALVISNLP`: the path to AlvisNLP binary.
-* `TAXA_FILE`: path to the `taxa+id_microorganisms.txt` file.
-* `OUTDIR`: the path to the directory where to store results.
+This will download the whole DSMZ catalog via the BacDive service.

-## Download DSMZ entries
+In order to use the BacDive API, you must [register](https://bacdive.dsmz.de/api/bacdive/registration/register/).
+Set the variable `BACDIVE_USER` in the `config.yaml` file to the registered user name.
+Store the password in a file (do not forget to remove read rights from group and others), then set the variable `BACDIVE_PASSWORD_FILE` to the path of that file.
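
A minimal sketch of the password-file step, assuming a POSIX shell. The default path `bacdive-password.txt` and the placeholder password are hypothetical and not taken from the repository; whether the file may end with a newline is also an assumption.

```shell
# Hypothetical path: use the value you give BACDIVE_PASSWORD_FILE in config.yaml.
password_file="${BACDIVE_PASSWORD_FILE:-bacdive-password.txt}"

# Write the password in a subshell with a restrictive umask, so a newly
# created file is never readable by group or others, even briefly.
(
    umask 077
    printf '%s\n' 'your-bacdive-password' > "$password_file"
)

# Make the permissions explicit (this also covers a file that already existed):
# read/write for the owner, nothing for group and others.
chmod 600 "$password_file"
ls -l "$password_file"
```

The `ls -l` line should start with `-rw-------` if the permissions were applied.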

-̀``shell
+### 3. Match DSMZ strains to NCBI taxonomy

-snakemake --snakefile dsmz-download.snakefile --cores 1
+```shell
+snakemake -j 1 -s dsmz-match.snakefile
 ```

-**Downloading the whole DSMZ takes a long time**.
-Depending on the bandwidth, it takes between 6 and 10 hours.
+**ETA: 2 minutes**

-As of February 2021, the whole catalog includes more than 80k entries and takes approximately 700MB in XML format.
+This will look for a suitable place in the NCBI Taxonomy for each DSMZ strain.

-## Match DSMZ strains to NCBI
+The output of this step contains 5 files:
+* `bacdive-to-taxid.txt`: a mapping file from the BacDive identifier of the strain to the taxon identifier
+* `dispatch-report.txt`: a report of the dispatch for each DSMZ strain
+* `dsmz-nodes.dmp` and `dsmz-names.dmp`: files in the format of the NCBI Taxonomy with additional nodes and synonyms
+* `warnings.txt`: issues that need attention
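
A quick way to confirm that step 3 produced all five files is sketched below. The function name `check_dsmz_outputs` and its directory argument are hypothetical; this README does not state where the files are written, so pass the actual output directory.

```shell
# check_dsmz_outputs DIR: report each of the five expected output files
# that is missing from DIR (defaults to the current directory).
# Returns 0 when all files are present, 1 otherwise.
check_dsmz_outputs() {
    dir="${1:-.}"
    missing=0
    for f in bacdive-to-taxid.txt dispatch-report.txt \
             dsmz-nodes.dmp dsmz-names.dmp warnings.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_dsmz_outputs path/to/output && echo "all outputs present"`.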

-```shell

-snakemake --snakefile dsmz-match.snakefile --cores 1
-```
-
-This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt` and whichever file specified in the configuration file as `TAXA_FILE`.
-
-### Output files
+### 4. Rewrite taxonomy

-The DSMZ match results are written in the `dsmz-match` directory in the output directory.
-
-#### `dispatch-report.txt`
+```shell
+snakemake -j 1 -s rewrite-taxonomy.snakefile
+```

-A tabular file where each line represents a DSMZ catalog entry.
+**ETA: 12 minutes**

-* `ENTRY`: name of the catalog entry file.
-* `DISPATCH`: dispatch decision
-  * `append`: new strain appended to NCBI species or NCBI subspecies
-  * `append-species`: new strain and lineage appended to NCBI superspecific taxon
-  * `equivalent`: equivalence with NCBI strain
-  * `type material`: equivalence with NCBI species or subspecies given as type strain
-  * `no-number`: strain has no designation or number
-  * `fail`: no match at any taxonomic level
+This will write the merged taxonomy in a format suitable for text projection.

-#### `dsmz-nodes.txt` 

-Files in the format of NCBI Taxonomy `nodes.dmp` and `names.dmp` that contains all additions.
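
The four workflow steps introduced by this change can be chained into one script, as sketched here. It uses only the `snakemake -j 1 -s <snakefile>` invocation shown in the README; the function name `run_workflow` is hypothetical, and the sketch assumes `config.yaml` has already been filled in.

```shell
# run_workflow: run the four snakefiles in README order (steps 1 to 4),
# stopping at the first step that fails.
run_workflow() {
    for step in ncbi-download dsmz-download dsmz-match rewrite-taxonomy; do
        echo "running $step.snakefile"
        snakemake -j 1 -s "$step.snakefile" || {
            echo "step failed: $step" >&2
            return 1
        }
    done
}
```

Calling `run_workflow` from the repository root runs everything in sequence; going by the ETAs in the README, the DSMZ catalog download (about 10 hours) dominates the total time.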