Week 1 :
Cluster
SSH connection :
- SSH key creation (local) (see the sketch after this list) :
- ssh-keygen
- alias in .bashrc
- SSH key creation (cluster) :
- ssh-keygen
- Cbib work on cortex :
- module load sinteractive
- sinteractive
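A minimal sketch of the SSH setup (the host name <cluster> and the alias name are placeholders; adapt them to your account) :
user@001:~$ ssh-keygen -t ed25519                                  # create the key pair locally
user@001:~$ ssh-copy-id $USER@<cluster>                            # install the public key on the cluster
user@001:~$ echo "alias cbib='ssh $USER@<cluster>'" >> ~/.bashrc   # login shortcut in .bashrc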
Cbib
- Home : /home (personal data), 40 TB, snapshots + replication
- Scratch : /scratch/user, temporary data to compute, 70 TB. Warning : no backup + automatic cleaning process
Access to databases : /mnt/cbib/bank, regularly kept up to date (Blast, GenBank, UniProt, RefSeq, ...)
Restoring your data : snapshots (/home/.snapshots/ or /mnt/cbib/.snapshots/) are sorted by date; browse them to view/copy/restore your data.
Data transfer (no git) : scp or rsync (Ex : rsync -e ssh -avz --progress ./rep $USER@<cluster>:rep/)
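For instance, to copy a folder to the cluster and bring results back (<cluster> is again a placeholder host) :
user@001:~$ scp -r ./rep $USER@<cluster>:rep/                          # push a folder to the cluster
user@001:~$ rsync -e ssh -avz --progress $USER@<cluster>:rep/ ./rep/   # pull results back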
Useful links :
https://services.cbib.u-bordeaux.fr/doc/infra/#module
https://services.cbib.u-bordeaux.fr/redmine/projects/documentation-publique/wiki/Table_des_mati%C3%A8res
Mcia curta
- FS_tmp : local temporary space on the compute nodes
- FS_home : data space for users' data
- FS_scratch : parallel file system for job data. NB : this space is not for storage; files are regularly cleaned by an automatic system
The default partition on Curta is compute : add #SBATCH --constraint=bigmem to target the bigmem nodes, as in the sketch below.
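A job script header requesting a bigmem node might then start like this (a sketch; job name and resources are illustrative) :
#!/bin/bash
#SBATCH -J myjob
#SBATCH --constraint=bigmem
#SBATCH --time=01:00:00
#SBATCH -c 1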
IRODS : #TODO
Useful links :
https://redmine.mcia.fr/projects/cluster-curta/wiki
https://redmine.mcia.fr/projects/irods/wiki/IRODS
Jobs
Useful commands list :
- sbatch [Job].sh
- squeue -u $USER
- scontrol show job [jobid]
- scancel [jobid]
Output : slurm-[jobid].out
Error output : slurm-[jobid].err (with the -e option; by default errors go to the .out file)
Modules :
- module av (available modules in the cluster)
- module load modulename
- module list (module loaded list)
- module purge (unload modules)
Slurm job script example (job.sh), running Snakemake with a singularity container :
#!/bin/bash
## Run Snakemake with a singularity container
## Author Domitille COQ--ETCHEGARAY
## 25/02/2020
#SBATCH -J snakemake_test
#SBATCH --time=00:05:00
#SBATCH -c 1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=128000M
#SBATCH -o snakemake.%N.%j.out
#SBATCH -e snakemake.%N.%j.err
#############################Module loading#############################
module load snakemake
########################################################################
## Cluster config file with parameters for each rule of the Snakefile
CLUSTER_CONFIG=config_files/cluster.json
## Slurm command used to launch each new job (i.e. rule instance) created by Snakemake.
## Argument values come from the cluster config file.
CLUSTER="sbatch --mem={cluster.mem} --ntasks-per-node {cluster.npernode} -t {cluster.time} -n {cluster.ntasks} -c {cluster.c} -J {cluster.jobname} -o snake_subjob_log/{cluster.jobname}.%N.%j.out -e snake_subjob_log/{cluster.jobname}.%N.%j.err"
## Use at most N cores in parallel (default: 1). If N is omitted or 'all', the limit is set to the number of available cores.
MAX_CORES=100
## Create a log directory for all the slurm output files
mkdir -p snake_subjob_log
## Create the directed acyclic graph (DAG) of the workflow
snakemake -s Snakefile --dag | dot -Tpng > dag.png
## Launch the workflow : --use-singularity runs each rule inside the container,
## --cluster-config gives the per-rule cluster configuration, --cluster the sbatch shell command
snakemake -s Snakefile --use-singularity -j $MAX_CORES --cluster-config $CLUSTER_CONFIG --cluster "$CLUSTER"
## Create a final report
snakemake -s Snakefile --report smk_report.html
## Useful information to print
echo '########################################'
echo 'Date:' $(date --iso-8601=seconds)
echo 'User:' $USER
echo 'Host:' $HOSTNAME
echo 'Job Name:' $SLURM_JOB_NAME
echo 'Job ID:' $SLURM_JOB_ID
echo 'Array task ID:' ${SLURM_ARRAY_TASK_ID}
echo 'Number of nodes assigned to job:' $SLURM_JOB_NUM_NODES
echo 'Total number of cores for job (?):' $SLURM_NTASKS
echo 'Number of requested cores per node:' $SLURM_NTASKS_PER_NODE
echo 'Nodes assigned to job:' $SLURM_JOB_NODELIST
echo 'Directory:' $(pwd)
## Detail Information:
echo 'scontrol show job:'
scontrol show job $SLURM_JOB_ID
echo '########################################'
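After the job finishes, the standard output and error are in the files named by the -o and -e directives above; %N and %j are replaced by the node name and the job id :
user@cluster001:~$ cat snakemake.<node>.<jobid>.out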
Conda environment
Conda 4.8.2
Create an environment from file : environment.yml
Example :
channels:
- conda-forge
- bioconda
dependencies:
- bioconda::snakemake-minimal =5.4.5
- python =3.6
- jinja2 =2.10
- networkx =2.1
- matplotlib =2.2.3
- graphviz =2.38.0
- bcftools =1.9
- samtools =1.9
- bwa =0.7.17
- pysam =0.15.0
- channels : the channels from which your packages will be installed
- dependencies : the packages you want installed in your environment.
user@001:~$ conda env create -n myEnv -f environment.yml
user@001:~$ conda activate myEnv
user@001:~$ conda env list
base /home/user/anaconda3
myEnv * /home/user/anaconda3/envs/myEnv
Useful commands list :
- conda env create (with yml file)
- conda activate
- conda deactivate
- conda remove --name myEnv --all (remove the selected environment)
- conda create
- conda list (package list)
- conda env list (conda envs list)
Environments are created by default in the folder /home/user/anaconda3/envs/. You can change the path with the --prefix argument, which can be useful to create an environment specific to a project, as in the sketch below.
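For instance (the path is hypothetical) :
user@001:~$ conda env create --prefix ./envs/tuto -f environment.yml
user@001:~$ conda activate ./envs/tuto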
Useful links :
https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
Snakemake
Snakemake 5.8.2
Snakefile example :
##Author Domitille COQ--ETCHEGARAY
##25/02/2020
## Path to the config file, which defines parameters such as the paths of the first input files
configfile: "config_files/config.yml"
## Global directive telling all the rules to run their shell commands inside the singularity container
singularity: "img/snakemake_tuto.sif"
## Rule all (job 0) is always executed first and defines the target file of the workflow
rule all:
    input:
        "plots/quals.svg"

rule bwa_map:
    # Paths of the input files
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    # Path of the output file (created by Snakemake)
    output:
        "mapped_reads/{sample}.bam"
    # Additional parameters depending on the wildcard value :
    # annotate aligned reads with so-called read groups, which contain metadata like the sample name
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    # Log of the rule, written to a file
    log:
        "tool_logs/bwa_mem/{sample}.log"
    # Extra information such as the wall clock time of the rule
    benchmark:
        "benchmarks/bwa_mem/{sample}.bwa.benchmark.txt"
    # Number of cores allowed for the rule
    threads: 8
    # Memory allowed for the rule
    resources:
        mem_mb=4000
    # Shell command executed by the rule
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} "
        "| samtools view -Sb - > {output}) 2> {log}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    log:
        "tool_logs/samtools_sort/{sample}.log"
    benchmark:
        "benchmarks/samtools_sort/{sample}.sams.benchmark.txt"
    shell:
        "(samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}) 2> {log}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    log:
        "tool_logs/samtools_index/{sample}.log"
    benchmark:
        "benchmarks/samtools_idx/{sample}.sami.benchmark.txt"
    shell:
        "(samtools index {input}) 2> {log}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    log:
        "tool_logs/bcftools_call/bcf.log"
    benchmark:
        "benchmarks/bcftools_call/bcf.log"
    shell:
        "(bcftools mpileup -f {input.fa} {input.bam} "
        "| bcftools call -mv - > {output}) 2> {log}"

rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    # Script executed by the rule
    script:
        "scripts/plot_quals.py"
Config file example (config_files/config.yml) :
samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
    C: data/samples/C.fastq
The workflow is composed of rules defined in a Snakefile, as shown above. A rule is divided into different parts : input files, output files, and a shell command or a script. Many other directives can be added to a rule. Snakemake then determines the dependencies between rules from the file names and builds a DAG (directed acyclic graph) showing which jobs can be parallelized.
Specific rule :
- rule all : thanks to this rule you don't need to specify the target file in the snakemake shell command. Warning : this rule can't have wildcards ! If no target file is provided on the command line or in rule all, Snakemake takes the first rule as target.
In this example we use several rule directives :
- benchmark : provides information such as the wall clock time of a job.
- input : the paths of the input files
- log : can be used as input for other rules, like any output file. Useful to detect errors in rules; the output of each job is also kept in a file instead of only being printed in the terminal.
- output : the name of the output file (Snakemake will create it)
- params : specifies additional parameters depending on the wildcard values
- shell : a shell command that will be executed to create the output file.
- script : executes a script, given its path. This directive only works with R, Python and Julia scripts.
- threads : number of cores allowed for the rule
Global directives (used for all the rules) :
- configfile : (json or yml) defines a dictionary of parameters, for example the paths of the input files.
- singularity : path to a singularity container; the shell commands will be executed in this image.
- include : path to another snakefile; can be useful to split the workflow
Many other directives exist; see the Snakemake documentation linked below.
Some options :
- temp : marks an output file as temporary, e.g. temp("mapped_reads/{sample}.bam"); it is deleted once all the rules that use it have finished.
- protected : marks an output file as protected, e.g. protected("sorted_reads/{sample}.bam"), so it can't be deleted by mistake.
Useful command lines :
- snakemake --dag | dot -Tpng > dag.png
- snakemake -np (dry run)
- snakemake --forceall
- snakemake --delete-all-output (remove all created files, to begin from scratch)
WARNING : when a singularity image or a conda environment is used, don't forget !!
- snakemake --use-singularity
- snakemake --use-conda
Useful links :
https://snakemake.readthedocs.io/en/v5.8.2/index.html
https://snakemake.readthedocs.io/en/v5.8.2/snakefiles/rules.html
Cluster execution :
A snakemake workflow can be run on an HPC. For that you need to add different arguments to your snakemake command :
- --cluster "sbatch ..." : the slurm command submitted to the cluster for each new job (rule instance) of the workflow. In this argument you can reference parameters from a cluster config file (json or yml), described below.
- --cluster-config file : the cluster config file (json or yml) detailing the arguments of the slurm command given in --cluster. It works like an object : for example {cluster.jobname} corresponds to the value of "jobname" in your json file.
- -j : the maximum number of cores allowed to snakemake for the parallelization of jobs.
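Putting it together (a sketch; the flags mirror the job script shown earlier) :
snakemake -s Snakefile --use-singularity -j 100 \
    --cluster-config config_files/cluster.json \
    --cluster "sbatch --mem={cluster.mem} -c {cluster.c} -t {cluster.time} -J {cluster.jobname}"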
Singularity
Local installation : 3.5
Build Images from scratch :
Here we explain how to create a .sif container (not a sandbox).
Create a singularity definition file (.def) :
Your image will be built according to the definition written in this file.
Definition file example (sing.def) :
Bootstrap: docker
From: continuumio/miniconda3
IncludeCmd: yes
%files
environment.yml
%post
apt-get update && apt-get install -y procps && apt-get clean -y
/opt/conda/bin/conda env create -n myEnv -f /environment.yml
/opt/conda/bin/conda clean -a
%environment
export PATH=/opt/conda/bin:$PATH
. /opt/conda/etc/profile.d/conda.sh
conda activate myEnv
%runscript
echo "Hello World"
%help
Tools for Snakemake tutorial
%labels
Author Domitille COQ--ETCHEGARAY
How to create the image :
In your folder you need to have these particular files :
- a definition file (.def)
- an environment file (.yml) if you create a conda environment in your container
With this following command you will create your image :
user@001:~$ sudo singularity build snakemake_tuto.sif sing.def
Then, to try it, you can open a shell within the image you just created :
user@001:~$ singularity shell snakemake_tuto.sif
A definition file is separated into two parts : the header and the body.
Header : description of the core operating system to build.
- Bootstrap : determines the bootstrap agent used to create the base operating system. Ex : library pulls a container from the Container Library (https://cloud.sylabs.io/library), docker pulls from Docker Hub (https://hub.docker.com/).
- From : defines which base image will be installed, for example Ubuntu or Debian. Here we install a specific image for a conda environment (https://cloud.sylabs.io/library/_container/5e33375916506c7b1638e577).
- IncludeCmd : if included, and if no %runscript is specified, the CMD of the Docker image takes precedence and is used as runscript.
Body :
Sections (defined by a %) :
- files : copies files from your local folder into the container
- environment : definition of environment variables.
- post : where you can download from the internet, e.g. with apt-get. You can install new software and libraries, and create configuration files or new directories.
- runscript : content executed when the container image is run.
- help : metadata displayed by run-help.
- labels : section used to add metadata to your container, as general name-value pairs.
- Many other sections exist; see the Singularity documentation linked below.
In this case the singularity image is launched by a Snakefile : we use the shell of the image and the tools within it, so we don't need to run the image with the shell command ourselves. See the Singularity documentation for further container usage.
Useful commands :
- singularity run snakemake_tuto.sif
- singularity run-help snakemake_tuto.sif
- singularity inspect snakemake_tuto.sif
- singularity shell snakemake_tuto.sif
- sudo singularity build snakemake_tuto.sif sing.def
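For instance, singularity exec (not listed above) runs a single command inside the image, which is handy to check the tools it contains :
user@001:~$ singularity exec snakemake_tuto.sif snakemake --version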
Useful links :
https://sylabs.io/guides/3.5/user-guide/index.html
https://sylabs.io/guides/3.5/user-guide/build_a_container.html?highlight=definition%20file
Test Example
The following explanation shows how to build an example Snakemake workflow using a singularity container on an HPC cluster.
Local run
From scratch :
Create a project folder on your computer.
All the following files will be in it; also add your data to this folder.
You will create, step by step, all the files that you need :
- You can obtain the data of this example with these commands :
user@001:~$ wget https://github.com/snakemake/snakemake-tutorial-data/archive/v5.4.5.tar.gz
user@001:~$ tar -xf v5.4.5.tar.gz --strip 1
- Conda environment :
- Create an environment file (environment.yml) like in the example above.
- Singularity container :
- Create a definition file (sing.def) like in the example above.
Warning : if your container needs files, make sure they are in the same folder as your definition file, or give the right path.
- Build the image with the following command :
user@001:~$ sudo singularity build snakemake_tuto.sif sing.def
- Your image (snakemake_tuto.sif) is now built with your specific conda environment.
- Snakemake workflow :
- Create a Snakefile like in the example above.
As you can see in this example you need a config.yml file. This file is very dependent on your project and data; for our example we will use the config file shown earlier.
Warning : be careful with the path of the config file, and with the path of the singularity image if you use one.
- In this example you will use the following script to create a plot; don't forget to add it to your folder as scripts/plot_quals.py.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile

# histogram of the quality scores of the variant calls
quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)
plt.savefig(snakemake.output[0])
Your folder needs to have the following structure to work with our example.
user@001:~$ tree
.
├── config_files
│ └── config.yml
├── data
│ ├── genome.fa
│ ├── genome.fa.amb
│ ├── genome.fa.ann
│ ├── genome.fa.bwt
│ ├── genome.fa.fai
│ ├── genome.fa.pac
│ ├── genome.fa.sa
│ └── samples
│ ├── A.fastq
│ ├── B.fastq
│ └── C.fastq
├── img
│ ├── environment.yml
│ ├── sing.def
│ └── snakemake_tuto.sif
├── scripts
│ └── plot_quals.py
└── Snakefile
At this step you can already run the example locally with the following shell command.
user@001:~$ snakemake --use-singularity
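Tip : before the real run, a dry run (snakemake -np, from the command list above) shows what would be executed without running anything :
user@001:~$ snakemake -np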
HPC Run
To run our Snakemake workflow on an HPC rather than locally, we need to create two more files.
- Create a bash script (Slurm job, job.sh) like in the example above.
- Create a cluster config file (cluster.json) like in the following example.
{ "__default__" : { "jobname": "default", "c" : 1, "ntasks" : 1, "npernode" : 1, "mem": 4000, "time": "00:02:00" }, "bwa_map" : { "jobname": "bwa", "c": 8, "ntasks": 1, "npernode" : 1, "mem": 4000, "time": "00:02:00" }, "samtools_sort" : { "jobname": "samsort", "c": 1, "ntasks": 1, "npernode" : 1, "mem": 4000, "time": "00:02:00" }, "samtools_index" : { "jobname": "samidx", "c": 1, "ntasks": 1, "npernode" : 1, "mem": 4000, "time": "00:02:00" }, "bcftools_call" : { "jobname": "bcfcall", "c": 1, "ntasks": 1, "npernode" : 1, "mem": 4000, "time": "00:02:00" }, "plot_quals" : { "jobname": "plot", "c": 1, "ntasks": 1, "npernode" : 1, "mem": 4000, "time": "00:02:00" } }
user@cluster001:~$ tree
.
├── config_files
│   ├── cluster.json
│   └── config.yml
├── data
│   ├── genome.fa
│   ├── genome.fa.amb
│   ├── genome.fa.ann
│   ├── genome.fa.bwt
│   ├── genome.fa.fai
│   ├── genome.fa.pac
│   ├── genome.fa.sa
│   └── samples
│       ├── A.fastq
│       ├── B.fastq
│       └── C.fastq
├── img
│   ├── environment.yml
│   ├── sing.def
│   └── snakemake_tuto.sif
├── job.sh
├── scripts
│   └── plot_quals.py
└── Snakefile
Now you need to get your folder onto the HPC :
- You already created all the files on the HPC.
- You can use rsync.
- You can create a gitlab project. (Recommended)
HPC : CBIB
- It is recommended not to work on the head node. To avoid that, use the sinteractive module :
user@cluster001:~$ module load sinteractive
user@cluster001:~$ sinteractive
Now you can launch your workflow with the following command :
user@cluster001:~$ sbatch job.sh
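You can then follow the main job and the sub-jobs spawned by Snakemake :
user@cluster001:~$ squeue -u $USER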
DONE !!
ForgeMia
Container :
You can add containers to the container registry of your project.
For now, it works with Docker containers.
#TODO Work with a singularity container :
user : your gitlab username
passwd : a personal access token created on your gitlab
singularity push --docker-username user --docker-password passwd container.sif oras://gitlab-registry/user/project:latest
Useful links :
https://souchal.pages.in2p3.fr/hugo-perso/2019/09/20/tutorial-singularity-and-docker/
https://forgemia.inra.fr/adminforgemia/doc-public/-/wikis/Gitlab-Container-Registry
Data :
https://docs.gitlab.com/ce/administration/lfs/manage_large_binaries_with_git_lfs.html