epmc-crawler

Get articles (XML full text) from https://europepmc.org/ using DOIs, PMIDs, and/or PMCIDs.
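
As a rough illustration of what fetching one article involves, the sketch below builds the full-text XML URL of the public Europe PMC RESTful web service for a single PMCID. It is not taken from this repository: the helper name `epmc_fulltext_url` is hypothetical, and the snakefiles may retrieve articles differently. DOIs and PMIDs would first have to be resolved to a PMCID, for example through the same service's search endpoint.

```shell
# Base URL of the Europe PMC RESTful web service.
EPMC_REST="https://www.ebi.ac.uk/europepmc/webservices/rest"

# Hypothetical helper (not part of epmc-crawler): print the full-text XML URL
# for one PMCID, e.g. PMC3258128.
epmc_fulltext_url() {
    printf '%s/%s/fullTextXML\n' "$EPMC_REST" "$1"
}

# Example download, left commented out because it needs network access:
# curl -fsS "$(epmc_fulltext_url PMC3258128)" -o PMC3258128.xml
```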

install

git clone https://forgemia.inra.fr/mandiayba/epmc-crawler.git
cd epmc-crawler
conda env create -f softwares/envs/snakemake-5.13.0-env.yaml

usage (on migale)

  • corpus from a list of dois
conda activate snakemake-5.13.0-env

snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 60 --jobs 2 --snakefile get_corpus_from_doiss.snakefile all --cluster "qsub -v PYTHONPATH='' -l mem_free=4G -V -cwd -e log/ -o log/ -q short.q -pe thread 2" --config DOIS_FILE=data/pmids.txt CORPUS_FOLDER=data/corpus_1
  • corpus from a list of pmids
conda activate snakemake-5.13.0-env

snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 60 --jobs 2 --snakefile get_corpus_from_pmids.snakefile all --cluster "qsub -v PYTHONPATH='' -l mem_free=4G -V -cwd -e log/ -o log/ -q short.q -pe thread 2" --config PMID_FILE=data/pmids.txt CORPUS_FOLDER=data/corpus_2
  • corpus from a list of pmcids
conda activate snakemake-5.13.0-env

snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 60 --jobs 2 --snakefile get_corpus_from_pmcids.snakefile all --cluster "qsub -v PYTHONPATH='' -l mem_free=4G -V -cwd -e log/ -o log/ -q short.q -pe thread 2" --config PMCID_FILE=data/pmcids.txt CORPUS_FOLDER=data/corpus_3
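
Before launching any of the commands above, it can help to tidy the identifier list and make sure the `log/` directory named in the qsub options exists. The sketch below is one way to do that. It assumes one identifier per line in the input file, which this README does not confirm, and `clean_id_list` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: normalize an identifier list before handing it to a pipeline.
# Assumes one identifier per line (an assumption, not documented by the project).
clean_id_list() {
    # $1: raw input file, $2: cleaned output file
    # Remove carriage returns, trim surrounding whitespace,
    # then drop blank lines and duplicates while keeping the original order.
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | awk 'NF && !seen[$0]++' > "$2"
}

# The qsub options write job logs to log/, so create the directory up front.
mkdir -p log
```

For example, `clean_id_list raw_pmcids.txt data/pmcids.txt` (where `raw_pmcids.txt` is a placeholder name) would produce a list that can be passed as `PMCID_FILE`.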

other usages

  • join annotation results with other metadata using the PMCID
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 60 --jobs 2 --snakefile joint_results.snakefile all --cluster "qsub -v PYTHONPATH='' -l mem_free=4G -V -cwd -e log/ -o log/ -q short.q -pe thread 2"

todo

  • apply the pipelines to the lists of DOIs, PMIDs, and PMCIDs collected by Open16S project members
  • integrate the pipelines into the omnicrobe workflow
  • handle exceptions (DOIs, PMIDs, or texts missing from Europe PMC) and add constraints to improve article extraction
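
One possible first step toward the exception handling listed above is to reject malformed identifiers before any job is submitted. The sketch below classifies an identifier by its shape only: a DOI starts with `10.` and contains a `/`, a PMCID is `PMC` followed by digits, and a PMID is all digits. Both helper names are hypothetical, the input is assumed to hold one identifier per line, and the check says nothing about whether the article actually exists in Europe PMC.

```shell
# Hypothetical helper: classify an identifier by its shape.
# Prints one of: doi, pmcid, pmid, unknown.
id_kind() {
    case "$1" in
        10.*/*)
            echo doi ;;
        PMC*)
            # Whatever follows "PMC" must be a non-empty run of digits.
            case "${1#PMC}" in
                ''|*[!0-9]*) echo unknown ;;
                *)           echo pmcid ;;
            esac ;;
        ''|*[!0-9]*)
            echo unknown ;;
        *)
            echo pmid ;;
    esac
}

# Hypothetical helper: print the lines of file $1 whose kind is not $2.
report_unexpected() {
    while IFS= read -r id || [ -n "$id" ]; do
        [ "$(id_kind "$id")" = "$2" ] || printf '%s\n' "$id"
    done < "$1"
}
```

For example, `report_unexpected data/pmcids.txt pmcid` would list every entry that does not look like a PMCID. Blank lines are reported as well, so an empty result means the file is clean in this narrow sense.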