diff --git a/README.md b/README.md
index 8fa3927de4623d6f0b2e77ac7333e4d3004a19a5..809ebdc5a9a4b30c16c0447e60881baef5de9b68 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,86 @@
 
 Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.
 
+## Python utilities
+
+### `dsmz.py`
+
+A library containing base classes for wrapping REST APIs provided by DSMZ.
+
+See: https://www.dsmz.de/services/online-tools
+
+### `bacdive.py`
+
+Python wrapper for downloading DSMZ catalog entries.
+This script may be used as a library or from the command line.
+
+See: https://bacdive.dsmz.de/api/bacdive/
+
+### `pnu.py`
+
+Python wrapper for downloading LPSN taxa.
+This script may be used as a library or from the command line.
+
+See: https://bacdive.dsmz.de/api/pnu/
+
+## Configuration
+
+In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
+
+Once the registration process is completed, write the BacDive password into a file.
+**Set this file's permissions to `0600` to prevent other users from accessing it**.
+
+Open `config.yml` and set the following variables:
+
+* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail registered to BacDive and the password file.
+* `ALVISNLP`: the path to the AlvisNLP binary.
+* `TAXA_FILE`: the path to the `taxa+id_microorganisms.txt` file.
+* `OUTDIR`: the path to the directory where results are stored.
+
+## Download DSMZ entries
+
+```shell
+
+snakemake --snakefile dsmz-download.snakefile --cores 1
+```
+
+**Downloading the whole DSMZ catalog takes a long time**.
+Depending on the bandwidth, it takes between 6 and 10 hours.
+
+As of February 2021, the whole catalog includes more than 80k entries and takes approximately 700 MB in XML format.
+
+## Match DSMZ strains to NCBI
+
 ```shell
-snakemake -j 1
+
+snakemake --snakefile dsmz-match.snakefile --cores 1
 ```
+
+This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt` and whichever file is specified in the configuration file as `TAXA_FILE`.
+
+### Output files
+
+The DSMZ match results are written to the `dsmz-match` directory inside the output directory.
+
+#### `report.txt`
+
+A tabular file where each line represents a match.
+
+* `BACDIVE ID`: BacDive entry identifier; **this is different from the DSM catalogue number**.
+* `FIELD`: strain form, either `catalog-number`, `species`, or `species-and-number`.
+* `NAME`: strain form that matches.
+* `NCBI TAXID`: matched taxon identifier in the NCBI taxonomy.
+* `NCBI CANONICAL`: canonical name of the matched taxon in the NCBI taxonomy.
+* `NCBI RANK`: taxonomic rank of the matched taxon in the NCBI taxonomy.
+
+The last three columns are empty if there is no match.
+
+#### The decision tree
+
+1. Try to match
+
+
+
+#### `equivalent-strains.txt`
+
+This file shows equivalence between catalog entries and the NCBI
\ No newline at end of file
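
The `report.txt` layout documented in this diff could be read from Python roughly as follows. This is a minimal sketch, not part of the repository. It assumes the file is tab-separated, has no header row, and stores the columns in the order the README lists them. `read_report` and `is_matched` are hypothetical helper names, and the sample rows are made up for illustration.

```python
import csv
import io

# Column names in the order the README lists them.
# Assumption: the file stores them in this order, tab-separated, without a header row.
COLUMNS = [
    "BACDIVE ID",
    "FIELD",
    "NAME",
    "NCBI TAXID",
    "NCBI CANONICAL",
    "NCBI RANK",
]


def read_report(handle):
    """Yield one dict per non-empty line of a report.txt-like stream (hypothetical helper)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # Pad short rows so unmatched entries still carry all six keys.
        padded = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, padded))


def is_matched(record):
    """A record counts as matched when its NCBI TAXID column is non-empty."""
    return record["NCBI TAXID"] != ""


# Made-up sample rows: one matched entry, one with the last three columns empty.
SAMPLE = (
    "1\tspecies\tExample name\t123\tExample name\tspecies\n"
    "2\tcatalog-number\tDSM 0\t\t\t\n"
)

records = list(read_report(io.StringIO(SAMPLE)))
unmatched = [r["BACDIVE ID"] for r in records if not is_matched(r)]
print(unmatched)
```

To use it on real output, pass an open file handle instead of the `io.StringIO` sample. If the actual file starts with a header row or uses a different delimiter, adjust `read_report` accordingly.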