update README

9b203cf7 · Robert Bossy · 89d61310 · 9b203cf7
Commit 9b203cf7 authored 3 years ago by Robert Bossy
--- a/README.md
+++ b/README.md
 # BacDive Utils
-Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.
+Taxonomy management tools.
-## Python utilities
+## Workflow
-### `dsmz.py`
+### 0. Configuration
-A library containing base classes for wrapping REST APIs provided by DSMZ.
+Edit the `config.yaml` file, and set the following variables:
-See: https://www.dsmz.de/services/online-tools
+* `BACDIVE_USER`: username for the BacDive API
+* `BACDIVE_PASSWORD_FILE`: file containing the password for the BacDive API
+* `ALVISNLP`: path to the AlvisNLP binary
+* `REWRITE_TAXONOMY`: path to the `rewrite-taxonomy` binary
-### `bacdive.py`
-Python wrapper for downloading DSMZ catalog entries.
+### 1. NCBI Taxonomy download
-This script may be used as a library or in command-line.
-See: https://bacdive.dsmz.de/api/bacdive/
+```shell
+snakemake -j 1 -s ncbi-download.snakefile
-### `pnu.py`
+```
-Python wrapper for downloading LPSN taxa.
+**ETA: 2 minutes**
-This script may be used as a library or in command-line.
-See: https://bacdive.dsmz.de/api/pnu/
+This will download the Taxonomy archive from the NCBI FTP server, then unzip the archive.
-## Configuration
+The download is anonymous and does not require any registration.
-In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
+### 2. DSMZ catalog download
-Once the registration process is completed, write the BacDive password into a file.
+```shell
-**Set this file's permissions to `0600` to prevent other users from accessing it**.
+snakemake -j 1 -s dsmz-download.snakefile
+```
-Open `config.yml` and set the following variables:
+**ETA: 10 hours**
-* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail registered to BacDive and the password file.
+This will download the whole DSMZ catalog via the BacDive service.
-* `ALVISNLP`: the path to AlvisNLP binary.
-* `TAXA_FILE`: path to the `taxa+id_microorganisms.txt` file.
-* `OUTDIR`: the path to the directory where to store results.
-## Download DSMZ entries
+In order to use the BacDive API, you must [register](https://bacdive.dsmz.de/api/bacdive/registration/register/).
+Set the variable `BACDIVE_USER` in the `config.yaml` file to the registered user name.
+Store the password in a file (remember to remove read permissions for group and others), then set the variable `BACDIVE_PASSWORD_FILE` to the path of that file.
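As a minimal sketch of the password-file step, the commands below write the password to a file that only its owner can read or write. The path `./bacdive-password.txt` and the `BACDIVE_PASSWORD` environment variable are illustrative assumptions, not names used by this repository; whether the tools tolerate a trailing newline in the file is also an assumption.

```shell
# Hypothetical path and variable names, chosen for illustration only.
PASSWORD_FILE="${PASSWORD_FILE:-./bacdive-password.txt}"
BACDIVE_PASSWORD="${BACDIVE_PASSWORD:-change-me}"

# Write the file inside a subshell with a restrictive umask, so a newly
# created file is never readable by group or others, even briefly.
( umask 077; printf '%s\n' "$BACDIVE_PASSWORD" > "$PASSWORD_FILE" )

# Enforce mode 0600 explicitly in case the file already existed
# with looser permissions.
chmod 600 "$PASSWORD_FILE"
```

The resulting path is the value to give to `BACDIVE_PASSWORD_FILE` in `config.yaml`.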
-```shell
+### 3. Match DSMZ strains to NCBI taxonomy
-snakemake --snakefile dsmz-download.snakefile --cores 1
+```shell
+snakemake -j 1 -s dsmz-match.snakefile
 ```
-**Downloading the whole DSMZ takes a long time**.
+**ETA: 2 minutes**
-Depending on the bandwidth, it takes between 6 and 10 hours.
-As of February 2021, the whole catalog includes more than 80k entries and takes approximately 700MB in XML format.
+This will look for a suitable place in the NCBI Taxonomy for each DSMZ strain.
-## Match DSMZ strains to NCBI
+The output of this step contains 5 files:
+* `bacdive-to-taxid.txt`: a mapping file from the BacDive identifier of the strain to the taxon identifier
+* `dispatch-report.txt`: a report of the dispatch for each DSMZ strain
+* `dsmz-nodes.dmp` and `dsmz-names.dmp`: files in the format of the NCBI Taxonomy with additional nodes and synonyms
+* `warnings.txt`: issues that need attention
-```shell
-snakemake --snakefile dsmz-match.snakefile --cores 1
+### 4. Rewrite taxonomy
-```
-This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt` and whichever file specified in the configuration file as `TAXA_FILE`.
-### Output files
-The DSMZ match results are written in the `dsmz-match` directory in the output directory.
+```shell
+snakemake -j 1 -s rewrite-taxonomy.snakefile
-#### `dispatch-report.txt`
+```
-A tabular file where each line represents a DSMZ catalog entry.
+**ETA: 12 minutes**
-* `ENTRY`: name of the catalog entry file.
+This will write the merged taxonomy in a format suitable for text projection.
-* `DISPATCH`: dispatch decision
-  * `append`: new strain appended to NCBI species or NCBI subspecies
-  * `append-species`: new strain and lineage appended to NCBI superspecific taxon
-  * `equivalent`: equivalence with NCBI strain
-  * `type material`: equivalence with NCBI species or subspecies given as type strain
-  * `no-number`: strain has no designation or number
-  * `fail`: no match at any taxonomic level
-#### `dsmz-nodes.txt` 
-Files in the format of NCBI Taxonomy `nodes.dmp` and `names.dmp` that contains all additions.