Commit 9b203cf7 authored by Robert Bossy
update README

parent 89d61310
# BacDive Utils # BacDive Utils
Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy. Taxonomy managing tools.
## Python utilities ## Workflow
### `dsmz.py` ### 0. Configuration
A library containing base classes for wrapping REST APIs provided by DSMZ. Edit the `config.yaml` file, and set the following variables:
See: https://www.dsmz.de/services/online-tools * `BACDIVE_USER`: username for the BacDive API
* `BACDIVE_PASSWORD_FILE`: file containing the password for the BacDive API
* `ALVISNLP`: path to the AlvisNLP binary
* `REWRITE_TAXONOMY`: path to the `rewrite-taxonomy` binary
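
The file named by `BACDIVE_PASSWORD_FILE` should be readable by its owner only. The following is a minimal sketch of one way to create it, assuming a POSIX shell. The function name `write_password_file` and the path in the usage line are hypothetical and are not part of this repository.

```shell
# write_password_file FILE
# Hypothetical helper: copy the password from standard input into FILE,
# making sure the file is readable and writable by its owner only.
write_password_file() {
    (
        umask 077        # a newly created file gets mode 0600
        cat > "$1"
    )
    chmod 600 "$1"       # also tighten the mode if FILE already existed
}
```

Possible usage: run `write_password_file "$HOME/.bacdive-password"`, type the password, then press Ctrl-D. The same path would then be given as `BACDIVE_PASSWORD_FILE` in the configuration file. Whether the download script tolerates a trailing newline in this file is not stated here, so check before relying on it.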
### `bacdive.py`
Python wrapper for downloading DSMZ catalog entries. ### 1. NCBI Taxonomy download
This script may be used as a library or from the command line.
See: https://bacdive.dsmz.de/api/bacdive/ ```shell
snakemake -j 1 -s ncbi-download.snakefile
### `pnu.py` ```
Python wrapper for downloading LPSN taxa. **ETA: 2 minutes**
This script may be used as a library or from the command line.
See: https://bacdive.dsmz.de/api/pnu/ This will download the Taxonomy archive from the NCBI FTP server, then unzip the archive.
## Configuration The download is anonymous and does not require any registration.
In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/ ### 2. DSMZ catalog download
Once the registration process is completed, write the BacDive password into a file. ```shell
**Set this file's permissions to `0600` to prevent other users from accessing it**. snakemake -j 1 -s dsmz-download.snakefile
```
Open `config.yml` and set the following variables: **ETA: 10 hours**
* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail address registered with BacDive and the password file. This will download the whole DSMZ catalog via the BacDive service.
* `ALVISNLP`: the path to the AlvisNLP binary.
* `TAXA_FILE`: path to the `taxa+id_microorganisms.txt` file.
* `OUTDIR`: the path to the directory in which to store results.
## Download DSMZ entries In order to use the BacDive API, you must [register](https://bacdive.dsmz.de/api/bacdive/registration/register/).
Set the variable `BACDIVE_USER` in the `config.yaml` file to the registered user name.
You must also store the password in a file (remember to remove read rights from group and others), then set the variable `BACDIVE_PASSWORD_FILE` to the path of that file.
̀``shell ### 3. Match DSMZ strains to NCBI taxonomy
snakemake --snakefile dsmz-download.snakefile --cores 1 ```shell
snakemake -j 1 -s dsmz-match.snakefile
``` ```
**Downloading the whole DSMZ takes a long time**. **ETA: 2 minutes**
Depending on the bandwidth, it takes between 6 and 10 hours.
As of February 2021, the whole catalog includes more than 80k entries and takes approximately 700 MB in XML format. This will look for a suitable place in the NCBI Taxonomy for each DSMZ strain.
## Match DSMZ strains to NCBI The output of this step contains 5 files:
* `bacdive-to-taxid.txt`: a mapping file from the BacDive identifier of the strain to the taxon identifier
* `dispatch-report.txt`: a report of the dispatch for each DSMZ strain
* `dsmz-nodes.dmp` and `dsmz-names.dmp`: files in the format of the NCBI Taxonomy with additional nodes and synonyms
* `warnings.txt`: issues that need attention
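
As a quick sanity check after the match step, one could verify that all five files listed above were produced. This is only a sketch: the function name `check_match_outputs` is hypothetical, and the directory that holds the files is an assumption to adapt to your configuration.

```shell
# check_match_outputs DIR
# Hypothetical helper: report any of the five expected output files that is
# absent from DIR. Returns 0 when all are present, 1 otherwise.
check_match_outputs() {
    dir="$1"
    missing=0
    for f in bacdive-to-taxid.txt dispatch-report.txt \
             dsmz-nodes.dmp dsmz-names.dmp warnings.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Possible usage: `check_match_outputs path/to/match/output && echo "all outputs present"`. The check only tests for existence, since an empty `warnings.txt` may be a legitimate result.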
```shell
snakemake --snakefile dsmz-match.snakefile --cores 1 ### 4. Rewrite taxonomy
```
This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt` and whichever file is specified in the configuration file as `TAXA_FILE`.
### Output files
The DSMZ match results are written in the `dsmz-match` directory in the output directory. ```shell
snakemake -j 1 -s rewrite-taxonomy.snakefile
#### `dispatch-report.txt` ```
A tabular file where each line represents a DSMZ catalog entry. **ETA: 12 minutes**
* `ENTRY`: name of the catalog entry file. This will write the merged taxonomy in a format suitable for text projection.
* `DISPATCH`: dispatch decision
  * `append`: new strain appended to NCBI species or NCBI subspecies
  * `append-species`: new strain and lineage appended to NCBI superspecific taxon
  * `equivalent`: equivalence with NCBI strain
  * `type material`: equivalence with NCBI species or subspecies given as type strain
  * `no-number`: strain has no designation or number
  * `fail`: no match at any taxonomic level
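
To get an overview of how the strains were dispatched, the decisions can be tallied. The sketch below assumes that `dispatch-report.txt` is tab-separated and starts with a header line containing a `DISPATCH` column. Neither assumption is confirmed here, and the function name `tally_dispatch` is hypothetical.

```shell
# tally_dispatch FILE
# Hypothetical helper: count the entries per dispatch decision.
# Assumes a tab-separated FILE whose first line names the columns.
tally_dispatch() {
    awk -F '\t' '
        # Header line: find which column is named DISPATCH.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "DISPATCH") col = i; next }
        # Data lines: count each decision (skipped if no DISPATCH column was found).
        col     { count[$col]++ }
        END     { for (d in count) printf "%s\t%d\n", d, count[d] }
    ' "$1" | sort
}
```

Possible usage: `tally_dispatch path/to/dispatch-report.txt`. A large `fail` or `no-number` count would be a reason to look at the report more closely.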
#### `dsmz-nodes.txt`
Files in the format of NCBI Taxonomy `nodes.dmp` and `names.dmp` that contain all additions.