From 9b203cf7c4083a6f846f9bbc4763ddad82061ce3 Mon Sep 17 00:00:00 2001
From: Robert Bossy <Robert.Bossy@inra.fr>
Date: Mon, 5 Apr 2021 16:43:45 +0200
Subject: [PATCH] update README

---
 README.md | 91 ++++++++++++++++++++++++------------------------------
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/README.md b/README.md
index 26623a4..5a00867 100644
--- a/README.md
+++ b/README.md
@@ -1,81 +1,70 @@
 # BacDive Utils
-Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.
+Taxonomy management tools.
-## Python utilities
+## Workflow
-### `dsmz.py`
+### 0. Configuration
-A library containing base classes for wrapping REST APIs provided by DSMZ.
+Edit the `config.yaml` file, and set the following variables:
-See: https://www.dsmz.de/services/online-tools
+* `BACDIVE_USER`: username for the BacDive API
+* `BACDIVE_PASSWORD_FILE`: file containing the password for the BacDive API
+* `ALVISNLP`: path to the AlvisNLP binary
+* `REWRITE_TAXONOMY`: path to the `rewrite-taxonomy` binary
-### `bacdive.py`
-Python wrapper for downloading DSMZ catalog entries.
-This script may be used as a library or in command-line.
+### 1. NCBI Taxonomy download
-See: https://bacdive.dsmz.de/api/bacdive/
-
-### `pnu.py`
+```shell
+snakemake -j 1 -s ncbi-download.snakefile
+```
-Python wrapper for downloading LPSN taxa.
-This script may be used as a library or in command-line.
+**ETA: 2 minutes**
-See: https://bacdive.dsmz.de/api/pnu/
+This will download the Taxonomy archive from the NCBI FTP server, then unzip the archive.
-## Configuration
+The download is anonymous and does not require any registration.
-In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
+### 2. DSMZ catalog download
-Once the registration process is completed, write the BacDive password into a file.
-**Set this file permissions as `0600` in order to prevent other users to access it**.
+```shell
+snakemake -j 1 -s dsmz-download.snakefile
+```
-Open `config.yml` and set the following variables:
+**ETA: 10 hours**
-* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail registered to BacDive and the password file.
-* `ALVISNLP`: the path to AlvisNLP binary.
-* `TAXA_FILE`: path to the `taxa+id_microorganisms.txt` file.
-* `OUTDIR`: the path to the directory where to store results.
+This will download the whole DSMZ catalog via the BacDive service.
-## Download DSMZ entries
+In order to use the BacDive API, you must [register](https://bacdive.dsmz.de/api/bacdive/registration/register/).
+Set the variable `BACDIVE_USER` in the `config.yaml` file to the registered user name.
+You must also store the password in a file (remember to remove read permissions for group and others), then set the variable `BACDIVE_PASSWORD_FILE` to the path of that file.
-̀``shell
+### 3. Match DSMZ strains to NCBI taxonomy
-snakemake --snakefile dsmz-download.snakefile --cores 1
+```shell
+snakemake -j 1 -s dsmz-match.snakefile
 ```
-**Downloading the whole DSMZ takes a long time**.
-Depending on the bandwith, it takes between 6 and 10 hours.
+**ETA: 2 minutes**
-As in february 2021, the whole catalog includes more than 80k entries and takes approximately 700MB in XML format.
+This will look for a suitable place in the NCBI Taxonomy for each DSMZ strain.
-## Match DSMZ strains to NCBI
+The output of this step contains 5 files:
+* `bacdive-to-taxid.txt`: a mapping file from the BacDive identifier of the strain to the taxon identifier
+* `dispatch-report.txt`: a report of the dispatch for each DSMZ strain
+* `dsmz-nodes.dmp` and `dsmz-names.dmp`: files in the format of the NCBI Taxonomy with additional nodes and synonyms
+* `warnings.txt`: things to pay attention to
-```shell
-snakemake --snakefile dsmz-match.snakefile --cores 1
-```
-
-This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt` and whichever file specified in the configuration file as `TAXA_FILE`.
-
-### Output files
+### 4. Rewrite taxonomy
-The DSMZ match results are written in the `dsmz-match` directory in the output directory.
-
-#### `dispatch-report.txt`
+```shell
+snakemake -j 1 -s rewrite-taxonomy.snakefile
+```
-A tabular file where each line represents a DSMZ catalog entry.
+**ETA: 12 minutes**
-* `ENTRY`: name of the catalog entry file.
-* `DISPATCH`: dipatch decision
- * `append`: new strain appended to NCBI species or NCBI subspecies
- * `append-species`: new strain and lineage appended to NCBI superspecific taxon
- * `equivalent`: equivalence with NCBI strain
- * `type material`: equivelence with NCBI species or subspecies given as type strain
- * `no-number`: strain has no designation or number
- * `fail`: no match at any taxonomic level
+This will write the merged taxonomy in a format suitable for text projection.
-#### `dsmz-nodes.txt`
-Files in the format of NCBI Taxonomy `nodes.dmp` and `names.dmp` that contains all additions.
-- 
GitLab