From e6dff400785d612244971e50452a5952a2d1249e Mon Sep 17 00:00:00 2001
From: Robert Bossy <Robert.Bossy@inra.fr>
Date: Wed, 31 Mar 2021 11:31:15 +0200
Subject: [PATCH] completed readme

---
 README.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 8fa3927..809ebdc 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,86 @@
 
 Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.
 
+## Python utilities
+
+### `dsmz.py`
+
+A library containing base classes for wrapping REST APIs provided by DSMZ.
+
+See: https://www.dsmz.de/services/online-tools
+
+### `bacdive.py`
+
+Python wrapper for downloading DSMZ catalog entries.
+This script may be used as a library or from the command line.
+
+See: https://bacdive.dsmz.de/api/bacdive/
+
+### `pnu.py`
+
+Python wrapper for downloading LPSN taxa.
+This script may be used as a library or from the command line.
+
+See: https://bacdive.dsmz.de/api/pnu/
+
+## Configuration
+
+In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
+
+Once the registration process is completed, write the BacDive password into a file.
+**Set this file's permissions to `0600` to prevent other users from accessing it**.
+
+Open `config.yml` and set the following variables:
+
+* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail registered to BacDive and the password file.
+* `ALVISNLP`: the path to the AlvisNLP binary.
+* `TAXA_FILE`: the path to the `taxa+id_microorganisms.txt` file.
+* `OUTDIR`: the path to the directory where results are stored.
+
+## Download DSMZ entries
+
+```shell
+
+snakemake --snakefile dsmz-download.snakefile --cores 1
+```
+
+**Downloading the whole DSMZ catalog takes a long time**.
+Depending on the bandwidth, it takes between 6 and 10 hours.
+
+As of February 2021, the whole catalog includes more than 80k entries and takes up approximately 700 MB in XML format.
+
+## Match DSMZ strains to NCBI
+
 ```shell
-snakemake -j 1
+
+snakemake --snakefile dsmz-match.snakefile --cores 1
 ```
+
+This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt`, and whichever file is specified as `TAXA_FILE` in the configuration file.
+
+### Output files
+
+The DSMZ match results are written to the `dsmz-match` directory inside the output directory.
+
+#### `report.txt`
+
+A tabular file where each line represents a match.
+
+* `BACDIVE ID`: BacDive entry identifier; **this is different from the DSM catalogue number**.
+* `FIELD`: strain form, either `catalog-number`, `species`, or `species-and-number`.
+* `NAME`: strain form that matches.
+* `NCBI TAXID`: matched taxon identifier in the NCBI taxonomy.
+* `NCBI CANONICAL`: canonical name of the matched taxon in the NCBI taxonomy.
+* `NCBI RANK`: taxonomic rank of the matched taxon in the NCBI taxonomy.
+
+The last three columns are empty if there is no match.
+
+#### The decision tree
+
+1. Try to match
+
+
+
+#### `equivalent-strains.txt`
+
+This file shows equivalence between catalog entries and the NCBI
\ No newline at end of file
-- 
GitLab