diff --git a/config/README.md b/config/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0631078321a45633a8c45cabf5979c7f57c6a9ec
--- /dev/null
+++ b/config/README.md
@@ -0,0 +1,9 @@
+# Workflow Configs
+This folder contains configs used in the project
+
+## configs into [config.yaml](config.yaml)
+is used to define global variables to be used into the snakefiles
+
+## sge profile into [profile/](profile)
+
+a sge profile that configures Snakemake to run on the migale SGE Cluster. We are re-using one of the [Snakemake-Profiles](https://github.com/Snakemake-Profiles/doc. The profile defines default options to be used to submit and run jobs on the migale cluster. It also defines options to be used to run rules that requires special configs like queue selection, memory requirements, etc.
diff --git a/config/profile/config.yaml b/config/profile/config.yaml
index 24ce768869bcdabd838d332d6d7d452e44def424..ef8ebd59b83513aad831060961e15be8f200c02b 100755
--- a/config/profile/config.yaml
+++ b/config/profile/config.yaml
@@ -1,7 +1,7 @@
-restart-times: 1
+restart-times: 2
 cluster: "sge-submit.py"
 cluster-status: "sge-status.py"
 jobscript: "jobscript.sh"
-jobs: 80
+jobs: 100
 max-status-checks-per-second: 1
 latency-wait: 60
diff --git a/docs/1-preprocess-ontology.md b/docs/1-preprocess-ontology.md
index 1b40575d06b5879ee6d334c79680f266df9c69ae..555142ba9d2f44e376e87eabf735ee3b14e6acce 100644
--- a/docs/1-preprocess-ontology.md
+++ b/docs/1-preprocess-ontology.md
@@ -8,9 +8,8 @@ This pipeline analyzes the ontologies, cuts the desired branches and produces th
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile preprocess-ontology.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile preprocess-ontology.snakefile all \
+--profile config/profile
 ```
 
 ## **Display the DAG**
diff --git a/docs/2-generate-concept-path.md b/docs/2-generate-concept-path.md
index 4ccc5dcc731d53da6bac751dc9d5ce41968d207d..f941833c48a85ecd6258504a2f78beaeb1fdb932 100644
--- a/docs/2-generate-concept-path.md
+++ b/docs/2-generate-concept-path.md
@@ -8,9 +8,8 @@ The pipeline generates the concept paths from the structure of the Ontobiotope o
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile generate_concept_path.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile generate_concept_path.snakefile all \
+--profile config/profile
 ```
 
 ## **Display the DAG**
diff --git a/docs/3-process-cirm-data.md b/docs/3-process-cirm-data.md
index 4bc8a28209bf4b5a9b959e4017aa6ec49d526245..b7ee8aef2e3277747fb3a62a2fa3b51a53faf70f 100644
--- a/docs/3-process-cirm-data.md
+++ b/docs/3-process-cirm-data.md
@@ -8,9 +8,8 @@ The pipeline extracts microorganisms, habitats of texts from CIRM. CIRM texts ar
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile process_CIRM_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_CIRM_corpus.snakefile all \
+--profile config/profile
 ```
 
 ## **Display the DAG**
diff --git a/docs/4-process-genbank-data.md b/docs/4-process-genbank-data.md
index 8958f414ba36c47b6d3879a3a9885413d9d89aa8..d2d7e62ab9e413acbfa108e165941cff758d3fb4 100644
--- a/docs/4-process-genbank-data.md
+++ b/docs/4-process-genbank-data.md
@@ -8,9 +8,8 @@ The pipeline extracts microorganisms, habitats of texts from GenBank. GenBank te
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile process_GenBank_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_GenBank_corpus.snakefile all \
+--profile config/profile
 ```
 
 ## **Display the DAG**
diff --git a/docs/5-process-dsmz-data.md b/docs/5-process-dsmz-data.md
index 012e1285e217be448a1b71962b08ff02fa9127d3..caa7c03d1ba54d7f249ff41ac823e7f0a12e1887 100644
--- a/docs/5-process-dsmz-data.md
+++ b/docs/5-process-dsmz-data.md
@@ -8,9 +8,8 @@ The pipeline extracts microorganisms, habitats of texts from DSMZ. DSMZ texts to
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile process_DSMZ_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_DSMZ_corpus.snakefile all \
+--profile config/profile
 ```
 
 ## **Display the DAG**
diff --git a/docs/6-process-pubmed-data.md b/docs/6-process-pubmed-data.md
index 71a0e2b2d283301789370b5247301bd33f46134c..db66afd21a7c6599576b2abe08d071d952377db1 100644
--- a/docs/6-process-pubmed-data.md
+++ b/docs/6-process-pubmed-data.md
@@ -11,9 +11,8 @@ The bacthes are automatically scanned by the pipeline.
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
---snakefile process_PubMed_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_PubMed_corpus.snakefile all \
+--profile config/profile
 ```
 
 
diff --git a/docs/run.md b/docs/run.md
index 4adfd09d5775b3c6f834add8acba973d7b61e485..13a781eb74064dc662da73de00fbe9b7032ef8db 100644
--- a/docs/run.md
+++ b/docs/run.md
@@ -23,12 +23,12 @@ Pubmed corpus is to split into several batches to put in the `corpora/microbes-2
 * The expander config file (expander.xml) is required to create the index expander for AlvisIR in **step 6.**
 
 ## Run the pipelines
-The commands to run the steps look like the following. They must be executed from the project home dir with the `snakemake-5.13.0-env` activated.
+The commands to run the steps look like the following. They must be executed from the project home dir with the `snakemake-5.13.0-env` activated. We use the sge profile available into **config/profile** to run on the migale cluster.
+
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 10 --jobs \
---snakefile [STEP].snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile [STEP].snakefile all \
+--profile config/profile \
 --dry-run
 ```
 * `[STEP]` to be replaced by `preprocess-ontology` or `generate_concept_path` or `process_CIRM_corpus` or `process_GenBank_corpus` or `process_DSMZ_corpus` or `process_PubMed_corpus`
@@ -37,15 +37,16 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 * use option `--use-singularity` and `--use-conda` to manage the singularity images and the conda environments.
 * use option `--reason` to print the reason for each executed rule.
 * use option `--latency-wait` to set the number of seconds to wait for an output file.
-* Use option `--jobs` to set the number of jobs to tun in parallel.
-* use option `--cluster` to configure the SGE cluster. It is not used when you run locally.
+* Use option `--jobs` to set the number of jobs to tun in parallel.
+* Use option `--profile` to set the sge profile that configures the snakemake runs on migale cluster.
 
-These others options can be useful
-* use option `--forceall` if you want to force the execution of all the rules
-* use option `--delete-all-output` to remove the already calculated outputs
-* use option `--unlock` to unlock files
-* use option `--directory` to set the execution directory
-* use option `--report` to create an HTML report with results and statistics
+These others options could be useful
+* use option `--forceall` if you want to force the execution of all the rules.
+* use option `--delete-all-output` to remove the already calculated outputs.
+* use option `--unlock` to unlock files.
+* use option `--directory` to set the execution directory.
+* use option `--report` to create an HTML report with results and statistics.
+* use option `--cluster` to configure the SGE cluster if you don't want to use option `--profile`.
 
 More options for snakemake [here](https://snakemake.readthedocs.io/en/v5.13.0/api_reference/snakemake.html)
 
@@ -64,45 +65,40 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile generate_concept_path.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile generate_concept_path.snakefile all \
+--profile config/profile
 ```
 
 ### **step 3.** `process CIRM data`
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile process_CIRM_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_CIRM_corpus.snakefile all \
+--profile config/profile
 ```
 
 ### **step 4.** `process GenBank data`
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile process_GenBank_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_GenBank_corpus.snakefile all \
+--profile config/profile
 ```
 
 ### **step 5.** `process DSMZ data`
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
---snakefile process_DSMZ_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_DSMZ_corpus.snakefile all \
+--profile config/profile
 ```
 
 ### **step 6.** `process Pubmed Data`
 
 ```
 snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
---snakefile process_PubMed_corpus.snakefile \
---cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
---restart-times 4 all
+--snakefile process_PubMed_corpus.snakefile all \
+--profile config/profile
 ```
 
 ### **all steps.** `run all steps at once`
@@ -118,4 +114,4 @@ Generate a report.
 snakemake --verbose --printshellcmds --use-singularity --nolock --reason --latency-wait 30 --jobs 100 \
 --snakefile all.snakefile all \
 --report report.html
-```
\ No newline at end of file
+```
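
Review note: after this change the per-step commands differ only in the snakefile name, so they could be generated rather than copied. The following is a minimal sketch of that idea, not part of the patch. `build_command` and `STEPS` are hypothetical names; the flags and step names are taken from `docs/run.md`. It leaves out `--jobs` and `--latency-wait` on the assumption that the values in `config/profile/config.yaml` should apply.

```python
import shlex

# Step names as listed in docs/run.md.
STEPS = [
    "preprocess-ontology",
    "generate_concept_path",
    "process_CIRM_corpus",
    "process_GenBank_corpus",
    "process_DSMZ_corpus",
    "process_PubMed_corpus",
]


def build_command(step, profile="config/profile", dry_run=False):
    """Hypothetical helper: assemble the snakemake argument list for one step."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step}")
    cmd = [
        "snakemake", "--nolock", "--verbose", "--printshellcmds",
        "--use-singularity", "--use-conda", "--reason",
        "--snakefile", f"{step}.snakefile", "all",
        "--profile", profile,
    ]
    if dry_run:
        # Only show what would be executed, as in the docs/run.md example.
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Print one shell-quoted dry-run command per step (needs Python 3.8+ for shlex.join).
    for step in STEPS:
        print(shlex.join(build_command(step, dry_run=True)))
```

The argument list can be passed to `subprocess.run` as-is; printing it first with `--dry-run` mirrors the check recommended in `docs/run.md`.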