Commit 9b203cf7 authored by Robert Bossy

update README

parent 89d61310
# BacDive Utils
Command-line utilities for downloading [BacDive](https://bacdive.dsmz.de/api/bacdive/) entries and merging PNU with the NCBI taxonomy.
Taxonomy management tools.
## Python utilities
## Workflow
### `dsmz.py`
### 0. Configuration
A library containing base classes for wrapping REST APIs provided by DSMZ.
Edit the `config.yaml` file, and set the following variables:
See: https://www.dsmz.de/services/online-tools
* `BACDIVE_USER`: username for the BacDive API
* `BACDIVE_PASSWORD_FILE`: file containing the password for the BacDive API
* `ALVISNLP`: path to the AlvisNLP binary
* `REWRITE_TAXONOMY`: path to the `rewrite-taxonomy` binary
### `bacdive.py`
Python wrapper for downloading DSMZ catalog entries.
This script may be used as a library or from the command line.
### 1. NCBI Taxonomy download
See: https://bacdive.dsmz.de/api/bacdive/
### `pnu.py`
```shell
snakemake -j 1 -s ncbi-download.snakefile
```
Python wrapper for downloading LPSN taxa.
This script may be used as a library or from the command line.
**ETA: 2 minutes**
See: https://bacdive.dsmz.de/api/pnu/
This will download the Taxonomy archive from the NCBI FTP server, then unzip the archive.
## Configuration
The download is anonymous and does not require any registration.
In order to use the BacDive API, you must register a valid e-mail address: https://bacdive.dsmz.de/api/bacdive/registration/register/
### 2. DSMZ catalog download
Once the registration process is completed, write the BacDive password into a file.
**Set this file's permissions to `0600` to prevent other users from accessing it**.
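As a minimal sketch of the permission advice above, the password file could also be created from Python. The helper names `write_password_file` and `is_private` are hypothetical and are not part of these utilities; whether the tools expect a trailing newline in the file is not stated here, so the sketch writes the password exactly as given.

```python
import os
import stat
import tempfile


def write_password_file(path: str, password: str) -> None:
    """Write the password to path with owner-only permissions (0600)."""
    # A newly created file gets mode 0600 immediately (subject to the umask).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(password)
    # A pre-existing file keeps its old mode on open, so enforce 0600 explicitly.
    os.chmod(path, 0o600)


def is_private(path: str) -> bool:
    """Return True if group and others have no permission bits set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    # Demonstration in a throwaway directory; the file name is an assumption.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bacdive-password.txt")
        write_password_file(target, "not-a-real-password")
        print(is_private(target))
```

The path written this way is the value to give to `BACDIVE_PASSWORD_FILE`.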
```shell
snakemake -j 1 -s dsmz-download.snakefile
```
Open `config.yaml` and set the following variables:
**ETA: 10 hours**
* `BACDIVE_USER` and `BACDIVE_PASSWORD_FILE`: the e-mail registered to BacDive and the password file.
* `ALVISNLP`: the path to the AlvisNLP binary.
* `TAXA_FILE`: path to the `taxa+id_microorganisms.txt` file.
* `OUTDIR`: the path to the directory in which to store results.
This will download the whole DSMZ catalog via the BacDive service.
## Download DSMZ entries
In order to use the BacDive API, you must [register](https://bacdive.dsmz.de/api/bacdive/registration/register/).
Set the `BACDIVE_USER` variable in the `config.yaml` file to the registered user name.
Write the password to a file (remember to remove read permission for group and others), then set the `BACDIVE_PASSWORD_FILE` variable to the path of that file.
```shell
snakemake --snakefile dsmz-download.snakefile --cores 1
```
### 3. Match DSMZ strains to NCBI taxonomy
```shell
snakemake -j 1 -s dsmz-match.snakefile
```
**Downloading the whole DSMZ takes a long time**.
Depending on the bandwidth, it takes between 6 and 10 hours.
**ETA: 2 minutes**
As of February 2021, the whole catalog includes more than 80k entries and takes approximately 700 MB in XML format.
This will look for a suitable place in the NCBI Taxonomy for each DSMZ strain.
## Match DSMZ strains to NCBI
The output of this step contains 5 files:
* `bacdive-to-taxid.txt`: a mapping file from the BacDive identifier of the strain to the taxon identifier
* `dispatch-report.txt`: a report of the dispatch for each DSMZ strain
* `dsmz-nodes.dmp` and `dsmz-names.dmp`: files in the format of the NCBI Taxonomy with additional nodes and synonyms
* `warnings.txt`: issues that need attention
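The mapping file listed above could be loaded with a few lines of Python. This is only a sketch: the layout of `bacdive-to-taxid.txt` is not described here, so the code assumes two tab-separated columns (BacDive identifier, then taxon identifier) and no header row. The function name `read_bacdive_to_taxid` is hypothetical.

```python
def read_bacdive_to_taxid(lines):
    """Build a dict from BacDive identifier to taxon identifier.

    Assumption: each non-empty line holds at least two tab-separated
    columns, BacDive identifier first and taxon identifier second.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        columns = line.split("\t")
        bacdive_id, taxid = columns[0], columns[1]
        # Identifiers are kept as strings; nothing here guarantees they are numeric.
        mapping[bacdive_id] = taxid
    return mapping
```

Because the function accepts any iterable of lines, it can be called with an open file, for example `read_bacdive_to_taxid(open(path, encoding="utf-8"))`, where `path` points into the output directory.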
```shell
snakemake --snakefile dsmz-match.snakefile --cores 1
```
This step uses the following files: `dsmz-match.plan`, `bacdive2alvisnlp.xslt` and whichever file is specified in the configuration file as `TAXA_FILE`.
### Output files
### 4. Rewrite taxonomy
The DSMZ match results are written in the `dsmz-match` directory in the output directory.
#### `dispatch-report.txt`
```shell
snakemake -j 1 -s rewrite-taxonomy.snakefile
```
A tabular file where each line represents a DSMZ catalog entry.
**ETA: 12 minutes**
* `ENTRY`: name of the catalog entry file.
* `DISPATCH`: dispatch decision
  * `append`: new strain appended to NCBI species or NCBI subspecies
  * `append-species`: new strain and lineage appended to NCBI superspecific taxon
  * `equivalent`: equivalence with NCBI strain
  * `type material`: equivalence with NCBI species or subspecies given as type strain
  * `no-number`: strain has no designation or number
  * `fail`: no match at any taxonomic level
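To get a quick overview of the dispatch decisions, the report can be tallied per `DISPATCH` value. The sketch below assumes that `dispatch-report.txt` is tab-separated and starts with a header row naming the `ENTRY` and `DISPATCH` columns; neither detail is confirmed here. The function name `count_dispatch` is hypothetical.

```python
import csv
from collections import Counter


def count_dispatch(lines):
    """Count catalog entries per dispatch decision.

    Assumption: tab-separated input whose first line is a header
    containing a DISPATCH column.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["DISPATCH"] for row in reader)
```

A high count for `fail` or `no-number` would point to entries worth checking by hand.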
This will write the merged taxonomy in a format suitable for text projection.
#### `dsmz-nodes.txt`
Files in the format of NCBI Taxonomy `nodes.dmp` and `names.dmp` that contain all additions.
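Since these files follow the NCBI Taxonomy dump layout, they can be read like a regular `nodes.dmp`: fields are separated by `\t|\t`, each line ends with `\t|`, and the first three fields are the taxon identifier, the parent identifier and the rank. The following is a minimal sketch under that assumption; the function names `parse_nodes` and `lineage` are hypothetical.

```python
def parse_nodes(lines):
    """Map each taxon identifier to a (parent identifier, rank) pair."""
    nodes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.endswith("\t|"):
            line = line[:-2]  # drop the end-of-line terminator
        fields = line.split("\t|\t")
        tax_id, parent_id, rank = fields[0], fields[1], fields[2]
        nodes[tax_id] = (parent_id, rank)
    return nodes


def lineage(nodes, tax_id):
    """Return the identifiers from tax_id up to the root.

    In the NCBI dump the root node is its own parent, which ends the walk.
    A parent missing from nodes also ends the walk.
    """
    path = [tax_id]
    while tax_id in nodes and nodes[tax_id][0] != tax_id:
        tax_id = nodes[tax_id][0]
        path.append(tax_id)
    return path
```

Walking the lineage of an added strain is one way to check where it was attached in the merged taxonomy.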