completed readme

e6dff400 · Robert Bossy · 3ff12318 · e6dff400
Commit e6dff400 authored 3 years ago by Robert Bossy
--- a/README.md
+++ b/README.md
@@ -2,6 +2,86 @@

 Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.

+## Python utilities
+
+### `dsmz.py`
+
+A library containing base classes for wrapping REST APIs provided by DSMZ.
+
+See: https://www.dsmz.de/services/online-tools
+
+### `bacdive.py`
+
+Python wrapper for downloading DSMZ catalog entries.
+This script may be used as a library or from the command line.
+
+See: https://bacdive.dsmz.de/api/bacdive/
+
+### `pnu.py`
+
+Python wrapper for downloading LPSN taxa.
+This script may be used as a library or from the command line.
+
+See: https://bacdive.dsmz.de/api/pnu/
+
+## Configuration
+
+In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
+
+Once the registration process is completed, write the BacDive password into a file.
+**Set this file's permissions to `0600` to prevent other users from accessing it**.
+
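A minimal sketch of the password-file step above. The file name `bacdive-password.txt` and the placeholder password are assumptions, not taken from the repository. The README does not say whether a trailing newline in the file is tolerated, so this sketch writes none.

```shell
# Hypothetical file name; reference whatever path you choose
# from BACDIVE_PASSWORD_FILE in config.yml.
PASSWORD_FILE="bacdive-password.txt"

# Create the file empty, with owner-only permissions from the start.
( umask 077 && : > "$PASSWORD_FILE" )

# Write the password (placeholder value) without a trailing newline.
printf '%s' 'REPLACE_WITH_YOUR_PASSWORD' > "$PASSWORD_FILE"

# Make the mode explicit: read/write for the owner, nothing for others.
chmod 600 "$PASSWORD_FILE"

# Show the resulting permissions.
ls -l "$PASSWORD_FILE"
```

Setting the `umask` before creating the file avoids a short window in which the file would exist with default, more permissive, permissions.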
+Open `config.yml` and set the following variables:
+
+* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail address registered with BacDive and the path to the password file.
+* `ALVISNLP`: the path to the AlvisNLP binary.
+* `TAXA_FILE`: the path to the `taxa+id_microorganisms.txt` file.
+* `OUTDIR`: the path to the directory in which results are stored.
+
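The variables listed above might look as follows once set. This is a sketch under two assumptions: that the variables are top-level keys in the YAML file, and that every value shown is a placeholder to be replaced. The example is written to a separate, hypothetical file `config.example.yml` so that the real `config.yml` is not overwritten.

```shell
# Write an example configuration to a separate, hypothetical file.
cat > config.example.yml <<'EOF'
# All values below are placeholders.
BACDIVE_USER: "user@example.org"
BACDIVE_PASSWORD_FILE: "/path/to/bacdive-password.txt"
ALVISNLP: "/path/to/alvisnlp"
TAXA_FILE: "/path/to/taxa+id_microorganisms.txt"
OUTDIR: "/path/to/output"
EOF

# Print the example for review.
cat config.example.yml
```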
+## Download DSMZ entries
+
+```shell
+
+snakemake --snakefile dsmz-download.snakefile --cores 1
+```
+
+**Downloading the whole DSMZ takes a long time**.
+Depending on the bandwidth, it takes between 6 and 10 hours.
+
+As of February 2021, the whole catalog includes more than 80k entries and takes approximately 700 MB in XML format.
+
+## Match DSMZ strains to NCBI
+
 ```shell
-snakemake -j 1
+
+snakemake --snakefile dsmz-match.snakefile --cores 1
 ```
+
+This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt`, and whichever file is specified in the configuration file as `TAXA_FILE`.
+
+### Output files
+
+The DSMZ match results are written to the `dsmz-match` directory inside the output directory.
+
+#### `report.txt`
+
+A tabular file where each line represents a match.
+
+* `BACDIVE ID`: BacDive entry identifier; **this is different from the DSM catalogue number**.
+* `FIELD`: strain form, either `catalog-number`, `species`, or `species-and-number`.
+* `NAME`: strain form that matches.
+* `NCBI TAXID`: matched taxon identifier in the NCBI taxonomy.
+* `NCBI CANONICAL`: canonical name of the matched taxon in the NCBI taxonomy.
+* `NCBI RANK`: taxonomic rank of the matched taxon in the NCBI taxonomy.
+
+The last three columns are empty if there is no match.
+
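The column layout above lends itself to a quick summary with `awk`. The sketch below assumes that `report.txt` is tab-separated, that its first line is a header, and that the columns appear in the order listed; none of this is confirmed by the README. It runs on a tiny made-up sample file, `report.sample.txt`, whose rows are invented for illustration only.

```shell
# Build a made-up sample with the six columns described above
# (tab-separated with a header line: both are assumptions).
printf 'BACDIVE ID\tFIELD\tNAME\tNCBI TAXID\tNCBI CANONICAL\tNCBI RANK\n' > report.sample.txt
printf '1\tspecies\tEscherichia coli\t562\tEscherichia coli\tspecies\n' >> report.sample.txt
printf '2\tcatalog-number\tDSM 0000\t\t\t\n' >> report.sample.txt

# Count rows with and without an NCBI match, skipping the header.
# A row counts as unmatched when its fourth column (NCBI TAXID) is empty.
awk -F '\t' 'NR > 1 { if ($4 != "") matched++; else unmatched++ }
             END { printf "matched=%d unmatched=%d\n", matched, unmatched }' report.sample.txt
```

To try it on real results, point the `awk` command at the `report.txt` file in the `dsmz-match` directory instead of the sample.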
+#### The decision tree
+
+1. Try to match 
+
+
+
+#### `equivalent-strains.txt`
+
+This file shows equivalence between catalog entries and the NCBI 
\ No newline at end of file