SLURM
About SLURM
SLURM is a cluster management and job scheduling system for Linux clusters. At IBBA, we maintain a very small cluster instance which can be used to test pipelines and scripts by allocating resources and exploiting all the computational power we have. Here is the description of SLURM from the official documentation:
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
SLURM (or any other scheduling system) prevents a server from being loaded beyond its capacity and guarantees the resources reserved for a script or an executable: the scheduler starts a job only when the requested resources are free and ensures that no other job competes for them.
Cluster environment @IBBA
A minimal SLURM cluster installation is composed of a master node, which is responsible for managing job queues (or rather partitions, in SLURM terminology), one or more working nodes, which are the resources where jobs are executed, and one or more controller nodes, which are the hosts from which users submit their jobs. Here at IBBA we have one controller node, which is the same host people use when they log in to their remote account, and two working nodes, which are used for job execution. There is also a master instance that people can access; however, this instance is very limited in size and must be used only to manage jobs. Moreover, the same management utilities are also available from the controller node, so you don’t need to log in to the master node to manage your jobs.
The other prerequisite for using SLURM is user consistency between controller and working nodes. In our local installation users are managed through an LDAP server and home directories are mounted through NFS on all cluster nodes: this means that a script or a piece of software installed and working on the controller host is expected to work in the same way on the master and worker nodes. You don’t need to move files from the controller node to the working nodes, since the same environment is available on every cluster node. For example, a conda environment configured on the login node is expected to work on a worker node too, because your home folder is mounted at the same path across the whole cluster. Moreover, singularity is installed on every worker and login node.
Important
From here on and for the rest of this documentation, we assume that the user is logged in to the controller node: users can’t log in to the working nodes, with the only exception of interactive jobs.
Using SLURM
Get information on partitions
Partitions in the SLURM ecosystem are the equivalent of queues in other cluster managers, like PBS: they group working nodes into logical sets and define job queues. Each partition can define some constraints, for example CPU or time limits, allowed users and so on. Jobs are allocated within a partition in priority order until the resources provided by the working nodes are exhausted: after that, a job remains in a pending state, waiting for resources to become available again.
To get information on partitions, you can use the sinfo command:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
long* up infinite 2 mix node[1-2]
testing up 4:00:00 2 mix node[1-2]
In this example there are two partitions available: the long partition is
the default partition (indicated by the * symbol next to its name): this
means that jobs will be executed on this partition when the partition name is
omitted during job submission. The testing partition imposes
some restrictions, for example no job can last more than 4 hours. The STATE
column is important since it gives information about cluster usage. A mix
state, like in this example, means that the nodes are only partially allocated,
so the partition is not fully loaded. Other possible states are idle, which
means that no job is running on the nodes, alloc, when the nodes are fully
allocated, and down, when nodes are not available. Please check the sinfo
documentation for a more detailed explanation of this command.
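The STATE column lends itself to a quick check before submitting a job. The following is a minimal sketch, assuming the default sinfo column layout shown above; the function name partitions_with_capacity is hypothetical and not part of SLURM. It reads sinfo output on standard input and prints the partitions whose nodes are in the idle or mix state:

```shell
# Print the partitions that still have free capacity (STATE idle or mix).
# Assumes the default sinfo layout: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
partitions_with_capacity() {
    awk 'NR > 1 && ($5 == "idle" || $5 == "mix") {
        sub(/\*$/, "", $1)   # drop the "*" that marks the default partition
        print $1
    }' | sort -u
}

# Usage on the controller node:
#   sinfo | partitions_with_capacity
```

Reading from standard input keeps the function usable on saved sinfo output as well as on the live command.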
Hint
The SLURM scheduler installed in our infrastructure is more sophisticated than
a generic FIFO scheduler: a job that declares a time limit can be executed with
a higher priority than a job without one. Submitting jobs to the testing
partition can be useful when debugging pipelines or scripts.
Get information on jobs
You can get information about running and pending jobs with the squeue command:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
559 long nf-bwa_( cozzip R 1:34:39 1 node1
560 long nf-bwa_( cozzip R 1:34:39 1 node1
558 long nf-bwa_( cozzip R 1:34:42 1 node1
557 long nf-bwa_( cozzip R 3:32:23 1 node2
The ST column represents the job status: R means running, while PD
stands for pending (a job waiting to be executed). You can also get detailed
information with the sacct command by providing the job ID through the -j
parameter, like in the following example:
$ sacct -j 557 --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed
JobID JobName NTasks NodeList CPUTime ReqMem Elapsed
------------ ---------- -------- --------------- ---------- ---------- ----------
557 nf-bwa_(E+ node2 14:54:40 8G 03:43:40
557.batch batch 1 node2 14:54:40 03:43:40
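When many jobs are queued, a per-state summary of the squeue listing can be easier to read than the full table. The following is a minimal sketch, assuming the default squeue layout shown above, where ST is the fifth column, and job names that contain no spaces; the function name count_jobs_by_state is hypothetical:

```shell
# Count jobs per state from squeue output read on standard input.
# Assumes the default squeue layout (ST is the fifth column)
# and job names without spaces.
count_jobs_by_state() {
    awk 'NR > 1 { count[$5]++ }
         END { for (state in count) print state, count[state] }' | sort
}

# Usage on the controller node (-u restricts the listing to one user):
#   squeue -u "$USER" | count_jobs_by_state
```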
Get information on resources
You can get detailed information on partitions, nodes and jobs with
the scontrol show command followed by the resource you need.
For example, to collect information on partitions, you can do the following:
$ scontrol show partitions
PartitionName=long
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
PartitionName=testing
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
If you need information about a specific resource, append its name to
the scontrol show <resource> command. For example, to collect information on
node1, you can do the following:
$ scontrol show nodes node1
NodeName=node1 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUTot=16 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node1 NodeHostName=node1 Version=21.08.5
OS=Linux 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022
RealMemory=32000 AllocMem=0 FreeMem=9310 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=long,testing
BootTime=2022-07-08T10:53:31 SlurmdStartTime=2022-07-21T12:22:43
LastBusyTime=2022-07-21T12:35:09
CfgTRES=cpu=16,mem=32000M,billing=16
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
scontrol can also be used to manage cluster entities; however, as an end user
you aren’t allowed to modify the cluster environment. Please see the scontrol
man pages to understand what you can do with this command.
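The key=value output of scontrol show is straightforward to post-process. The following is a minimal sketch that computes the free CPUs of a single node as CPUTot minus CPUAlloc, using the field names from the node1 output above; the function name free_cpus is hypothetical, and the sketch assumes the output of exactly one node on standard input:

```shell
# Print the number of free CPUs of a node: CPUTot minus CPUAlloc.
# Reads the key=value output of "scontrol show nodes <name>" on standard input.
free_cpus() {
    awk '{
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "CPUAlloc") alloc = kv[2]
            if (kv[1] == "CPUTot")   total = kv[2]
        }
    } END { print total - alloc }'
}

# Usage on the controller node:
#   scontrol show nodes node1 | free_cpus
```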
Migrating from Torque/PBS to SLURM
Torque/PBS and SLURM provide similar capabilities, so you can search for documentation like Migrating from Torque to Slurm, Migrating From PBS or PBS to Slurm Conversion Cheat Sheet to compare the commands of the two scheduler ecosystems.
Submitting jobs
You are allowed to submit jobs to all partitions; however, they are configured
for different purposes. For example, in the testing partition you aren’t allowed to
submit a job exceeding the 4-hour time limit, since this partition is intended
for testing purposes. If you have no idea when your job is expected
to finish, you will need to submit it to the default long partition, which has
no time limit. Moreover, partitions are configured to apply some default values
to submitted jobs, for example by limiting RAM usage when it is not specified.
Considering this, you are required to declare your needs clearly by
allocating your resources: declaring more than you really need could leave your
jobs waiting for resources to become available, while declaring less than required
will result in a failed job.
Hint
Submitting jobs is the only way to get access to the computational power of working nodes, since users are not allowed to log in to them and the controller node is not intended to support long or intensive tasks.
Warning
Partitions are configured to allow 4 GB of RAM for each allocated CPU: if your process requires more than this default limit and you don’t request more memory, it will fail.
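To make the default concrete: the DefMemPerCPU=4096 value reported by scontrol show partitions is expressed in megabytes, so the memory granted by default grows with the number of CPUs requested. The following is a minimal sketch of that arithmetic; the function name default_mem_mb is hypothetical:

```shell
# Default memory (in MB) granted to a job that requests N CPUs and no --mem,
# assuming DefMemPerCPU=4096 as reported by "scontrol show partitions".
default_mem_mb() {
    echo $(( $1 * 4096 ))
}

default_mem_mb 1   # prints 4096
default_mem_mb 2   # prints 8192
```

A process that needs, say, 12 GB on 2 CPUs would exceed the 8192 MB default and should request memory explicitly, for example with --mem=12G.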
Allocate resources and submit a command immediately
You can allocate and submit a job with srun, for example:
srun <command>
will allocate the default resources for a job and execute <command> once
the job starts. After the command has finished, the job terminates and releases the
allocated resources. You can change the number of CPUs or the memory required
with the --cpus-per-task and --mem parameters, for example:
srun --cpus-per-task 2 --mem=4G <command>
or shorter:
srun -c 2 --mem=4G <command>
The partition can be specified with the -p or --partition option:
srun -c 2 --mem=4G -p testing <command>
Hint
srun allocates resources and can run commands in parallel, one copy per
task. You may use srun with MPI programs.
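A program started inside a job often needs to know how many CPUs were granted, for example to set its thread count. The following is a minimal sketch, assuming the SLURM_CPUS_PER_TASK environment variable, which SLURM sets in the job environment when --cpus-per-task (or -c) is specified; the function name allocated_threads is hypothetical:

```shell
# Print the number of CPUs granted per task, falling back to 1 when the
# variable is unset (for example when the script runs outside SLURM).
allocated_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Usage inside a script executed by the job, so that the variable is read
# on the working node (my_tool and --threads are placeholders):
#   my_tool --threads "$(allocated_threads)"
```

The function has to be evaluated inside the job: calling it on the srun command line itself would read the environment of the controller node instead.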
Interactive jobs
Interactive jobs can be launched by passing the --pty option with bash as the command, like this:
srun -c 2 --mem=4G -p testing --pty bash
You don’t need to specify any other command when launching an interactive job: when an
interactive job starts, it opens a shell on the working node, in which
you can run your commands. When you have completed your task, you have to exit
the interactive session to free the resources.
Warning
Resources are limited, so it’s important that you free them when you have
finished your tasks by leaving the interactive job console with the exit
command.
A different approach is to allocate resources with salloc and then call srun
with the desired command. salloc opens a new terminal session in which your
resources stay allocated until you leave it with the exit command. Inside that
session, call srun --pty bash (without any other options, since the resources are
already allocated) to start your shell in the interactive job:
$ salloc --cpus-per-task 2 --mem=4G
salloc: Granted job allocation 901
$ srun --pty /bin/bash
$ <command 1>
$ <command 2>
...
$ exit
$ exit
salloc: Relinquishing job allocation 901
salloc: Job allocation 901 has been revoked.
Warning
When you allocate resources with salloc, they are granted to you as stated
by the salloc output, even if you don’t call srun. You will need to exit
once to leave the interactive session started by srun --pty bash, and exit one
more time to free your allocated resources. Resources are not released until
the message Job allocation <job id> has been revoked. is displayed.
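Since an interactive shell looks much like the login shell, it can help to check whether an allocation is still active before leaving the terminal. The following is a minimal sketch, assuming the SLURM_JOB_ID environment variable, which SLURM sets in the environment of the jobs it starts; the function name in_slurm_job is hypothetical:

```shell
# Succeed when the current shell runs inside a SLURM job allocation.
# SLURM sets SLURM_JOB_ID in the environment of the jobs it starts.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "inside job ${SLURM_JOB_ID}: remember to exit when done"
else
    echo "not inside a job: no resources are held by this shell"
fi
```

Note that salloc also sets this variable in the shell it opens, so the check tells you that an allocation exists, not which host you are on.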
Creating a sbatch script
Creating a sbatch script is the recommended way to plan and execute complex
scripts on clusters. A sbatch script is a kind of bash script in which we can
specify resources using #SBATCH comments with the salloc or srun parameters
we saw before. After that, we can specify the commands to execute. Here is a simple
template for a sbatch job:
#!/bin/bash
#SBATCH --job-name=serial_job_test # Job name
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=1 # Request 1 CPU per task
#SBATCH --mem=1gb # Job memory request
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.log # Standard output and error log
<command 1>
<command 2>
Next, you can submit your sbatch script with the sbatch command. You can override
the parameters specified in the script by providing the appropriate options at launch
time:
sbatch --cpus-per-task 2 --mem=4G <sbatch script>
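On submission, sbatch reports the ID of the new job with a line of the form Submitted batch job <job id>. The following is a minimal sketch for capturing that ID, so it can be passed later to squeue, sacct or scancel; the function name submitted_job_id and the script name my_job.sh are hypothetical:

```shell
# Extract the job ID from the "Submitted batch job <job id>" line printed by sbatch.
submitted_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on the controller node (my_job.sh is a placeholder script name):
#   job_id=$(sbatch my_job.sh | submitted_job_id)
#   sacct -j "$job_id"
```

sbatch also provides a --parsable option that prints the job ID without the surrounding text, which may be preferable in scripts.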
Cancelling a job
You can cancel a job using scancel and specifying a job id:
scancel <job id>
Or you can cancel all your submitted jobs with -u:
scancel -u <your username>
It is also possible to select the jobs to cancel by state or other attributes. Please check
the scancel documentation.
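As an illustration of cancelling by state, the following minimal sketch lists the IDs of pending jobs from the squeue listing, so they can be reviewed before being cancelled. It assumes the default squeue layout, where ST is the fifth column, and job names without spaces; the function name pending_job_ids is hypothetical:

```shell
# Print the IDs of pending jobs (ST equal to PD) from squeue output on standard input.
# Assumes the default squeue layout and job names without spaces.
pending_job_ids() {
    awk 'NR > 1 && $5 == "PD" { print $1 }'
}

# Usage on the controller node: review the list first, then cancel.
#   squeue -u "$USER" | pending_job_ids
#   squeue -u "$USER" | pending_job_ids | xargs -r scancel
```

When no review step is needed, scancel can do the filtering itself through its --state option, for example --state=PENDING combined with -u.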
SLURM as Nextflow executor
SLURM can be configured as the default executor for a Nextflow pipeline, using
the environment variable NXF_EXECUTOR:
export NXF_EXECUTOR=slurm
This is sufficient to let Nextflow submit jobs through the SLURM controller, without
modifying your pipeline. Alternatively, simply add process.executor = "slurm"
to the nextflow.config file. See the Nextflow
SLURM executor documentation
for more information about the available options.
Hint
The NXF_EXECUTOR environment variable is already set on our SLURM clients.
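To verify the setting in your own session, the following minimal sketch prints the executor selected through the environment. It assumes that Nextflow falls back to its local executor when NXF_EXECUTOR is unset; the function name nextflow_env_executor is hypothetical, and any process.executor setting in nextflow.config is not inspected:

```shell
# Print the executor selected through the environment, or "local" when
# NXF_EXECUTOR is unset. Settings in nextflow.config are not inspected.
nextflow_env_executor() {
    echo "${NXF_EXECUTOR:-local}"
}

nextflow_env_executor
```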