SLURM

About SLURM

SLURM is a cluster management and job scheduling system for Linux clusters. At IBBA, we maintain a very small cluster instance that can be used to test pipelines and scripts by allocating resources and exploiting all the computational power we have. Here is the description of SLURM from the official documentation:

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

SLURM (or any other scheduling system) can be used to avoid loading a server beyond its capacity and to reserve resources for a script or an executable: the scheduler starts a job only when the requested resources are free and ensures that no other job competes for them.

Cluster environment @IBBA

A minimal SLURM cluster installation is composed of a master node, which is responsible for managing job queues (or rather partitions, in SLURM terminology), one or more working nodes, which are the resources where jobs are executed, and one or more controller nodes, which are the hosts from which users submit their jobs. Here at IBBA we have one controller node, which is the same host people use when they log in to their remote account, and two working nodes, which are used for job execution. There is also a master instance that users can access; however, this instance is very limited in size and must be used only to manage jobs. Moreover, the same management utilities are also available on the controller node, so you don’t need to log in to the master node to manage your jobs.

The other prerequisite for using SLURM is user consistency between controller and working nodes. In our local installation users are managed through an LDAP server, and home directories are mounted through NFS on all cluster nodes: this means that a script or a piece of software installed and working on the controller host is expected to work in the same way on the master and worker nodes. You don’t need to move files from the controller node to the working nodes, since the same environment is available on every cluster node. For example, a conda environment configured on the login node is expected to work on a worker node as well, since your home folder is mounted at the same path everywhere. Moreover, Singularity is installed on every worker and login node.

Important

From here on, this documentation assumes that the user is logged in to the controller node: users can’t log in to the working nodes, with the only exception of interactive jobs.

Using SLURM

Get information on partitions

Partitions in the SLURM ecosystem are the equivalent of queues in other cluster managers, like PBS: they group working nodes into logical sets and define job queues. Each partition can define some constraints, for example CPU or time limits, allowed users and so on. Jobs are allocated within a partition in priority order until the resources provided by the working nodes are exhausted: after that, a job remains in a pending state, waiting for resources to become available again.

To get information on partitions, you can use the sinfo command:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
long*        up   infinite      2    mix node[1-2]
testing      up    4:00:00      2    mix node[1-2]

In this example there are two partitions available. The long partition is the default partition (indicated by the * symbol next to its name): jobs are executed on this partition when the partition name is omitted during job submission. The testing partition imposes some restrictions, for example no job can last more than 4 hours. The STATE column is important since it gives information about cluster usage. A mix state, like in this example, means that the nodes are only partially allocated, so the partition is not fully loaded. Other available states are idle, which means that no jobs are running on the nodes, alloc, when the nodes are fully allocated, and down, when nodes are not available. Please check the sinfo documentation for a more detailed explanation of this command.
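As a rough sketch of how this output can be read from a script, the helper below picks the default partition by looking for the trailing * in the first column. The function name default_partition is hypothetical (it is not part of SLURM), and the input is the sample output shown above rather than a live query.

```shell
#!/bin/bash
# default_partition: hypothetical helper, not a SLURM command.
# Reads sinfo output on stdin and prints the name of the default
# partition, i.e. the one whose name ends with '*'.
default_partition() {
    awk 'NR > 1 && $1 ~ /\*$/ { sub(/\*$/, "", $1); print $1; exit }'
}

# Sample output copied from the sinfo example above.
sample='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
long*        up   infinite      2    mix node[1-2]
testing      up    4:00:00      2    mix node[1-2]'

printf '%s\n' "$sample" | default_partition
```

On the controller node the same helper could be fed directly with: sinfo | default_partition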

Hint

The SLURM scheduler installed in our infrastructure is more sophisticated than a generic FIFO scheduler: a job with a time limit definition can be executed with a higher priority than a job without it. Submitting jobs to the testing partition can be useful when debugging pipelines or scripts.

Get information on jobs

You can get information about running and pending jobs with the squeue command:

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  559      long nf-bwa_(   cozzip  R    1:34:39      1 node1
  560      long nf-bwa_(   cozzip  R    1:34:39      1 node1
  558      long nf-bwa_(   cozzip  R    1:34:42      1 node1
  557      long nf-bwa_(   cozzip  R    3:32:23      1 node2

The ST column represents the job status: R means running, while PD stands for pending (a job waiting to be executed). You can also get detailed information with the sacct command by providing the job ID through the -j parameter, as in the following example:

$ sacct -j 557 --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed
JobID           JobName   NTasks        NodeList    CPUTime     ReqMem    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ----------
557          nf-bwa_(E+                    node2   14:54:40         8G   03:43:40
557.batch         batch        1           node2   14:54:40              03:43:40
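When many jobs are queued, a per-state summary can be easier to read than the full list. The sketch below assumes the state codes are printed one per line, which squeue can do with the -h (no header) and -o '%t' (compact state) options; the helper name count_states and the input values are only illustrative.

```shell
#!/bin/bash
# count_states: hypothetical helper, not a SLURM command.
# Reads one job state code per line (R, PD, ...) and prints
# how many jobs are in each state, sorted by state code.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Illustrative input: three running jobs and one pending job.
printf 'R\nR\nR\nPD\n' | count_states
```

On the controller node, your own jobs could be summarized with: squeue -h -u "$USER" -o '%t' | count_states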

Get information on resources

You can get detailed information on partitions, nodes and jobs with the scontrol show command followed by the type of resource you need. For example, to collect information on partitions, you can do the following:

$ scontrol show partitions
PartitionName=long
  AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
  AllocNodes=ALL Default=YES QoS=N/A
  DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
  MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
  Nodes=node[1-2]
  PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
  OverTimeLimit=NONE PreemptMode=OFF
  State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
  JobDefaults=(null)
  DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED

PartitionName=testing
  AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
  AllocNodes=ALL Default=NO QoS=N/A
  DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
  MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
  Nodes=node[1-2]
  PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
  OverTimeLimit=NONE PreemptMode=OFF
  State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
  JobDefaults=(null)
  DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED

If you need information about a specific resource, add its name after the scontrol show <resource> command. For example, to collect information on node1, you can do the following:

$ scontrol show nodes node1
NodeName=node1 Arch=x86_64 CoresPerSocket=8
  CPUAlloc=0 CPUTot=16 CPULoad=0.00
  AvailableFeatures=(null)
  ActiveFeatures=(null)
  Gres=(null)
  NodeAddr=node1 NodeHostName=node1 Version=21.08.5
  OS=Linux 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022
  RealMemory=32000 AllocMem=0 FreeMem=9310 Sockets=2 Boards=1
  State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
  Partitions=long,testing
  BootTime=2022-07-08T10:53:31 SlurmdStartTime=2022-07-21T12:22:43
  LastBusyTime=2022-07-21T12:35:09
  CfgTRES=cpu=16,mem=32000M,billing=16
  AllocTRES=
  CapWatts=n/a
  CurrentWatts=0 AveWatts=0
  ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
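Since scontrol prints Key=Value pairs, a single value can be pulled out with standard text tools. The following is a minimal sketch: scontrol_field is a hypothetical helper name, the input is a subset of the node1 output above, and values that contain spaces (such as the OS field) are not handled.

```shell
#!/bin/bash
# scontrol_field: hypothetical helper, not a SLURM command.
# Reads scontrol output on stdin and prints the value of the key
# given as first argument. Values containing spaces are not supported.
scontrol_field() {
    tr ' ' '\n' |
        awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# A few lines copied from the "scontrol show nodes node1" example above.
sample='NodeName=node1 Arch=x86_64 CoresPerSocket=8
  CPUAlloc=0 CPUTot=16 CPULoad=0.00
  RealMemory=32000 AllocMem=0 FreeMem=9310 Sockets=2 Boards=1
  CfgTRES=cpu=16,mem=32000M,billing=16'

printf '%s\n' "$sample" | scontrol_field CPUTot
printf '%s\n' "$sample" | scontrol_field FreeMem
```

On the controller node the helper could be used as: scontrol show nodes node1 | scontrol_field CPUTot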

scontrol can also be used to manage cluster entities; however, as an end user you aren’t allowed to modify the cluster environment. Please see the scontrol man pages to understand what you can do with this command.

Migrating from Torque/PBS to SLURM

Torque/PBS and SLURM provide similar capabilities, so you can search for documentation like Migrating from Torque to Slurm, Migrating From PBS or PBS to Slurm Conversion Cheat Sheet to compare commands between the two scheduler ecosystems.

Submitting jobs

You are allowed to submit jobs to all partitions; however, they are configured for different purposes. For example, in the testing partition you aren’t allowed to submit a job exceeding the partition time limit (4 hours), since this partition is intended for testing purposes. If you have no idea when your job is expected to finish, you will need to submit it to the default long partition, which has no time limit. Moreover, partitions are configured to apply some default values to submitted jobs, for example by limiting RAM usage when it is not specified. Considering this, you are required to declare your needs clearly when allocating resources: declaring more than you really require could leave your jobs waiting for resources to become available, while declaring less than required will result in a failed job.

Hint

Submitting jobs is the only way to get access to the computational power of working nodes, since users are not allowed to log in to them and the controller node is not intended to support long or intensive tasks.

Warning

Partitions are configured to allow 4 GB of RAM for each allocated CPU: if your process requires more than this default limit and you don’t request more memory explicitly, it will fail.
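The default memory of a job therefore scales with the number of CPUs requested. The small sketch below works through the arithmetic using the DefMemPerCPU=4096 value reported by scontrol show partitions; the helper name default_mem_mb and the 16 GB requirement are hypothetical values chosen for illustration.

```shell
#!/bin/bash
# Default memory per allocated CPU in MB (DefMemPerCPU in the partition settings).
DEF_MEM_PER_CPU_MB=4096

# default_mem_mb: hypothetical helper that prints the memory (in MB)
# a job gets by default when it requests the given number of CPUs.
default_mem_mb() {
    echo $(( $1 * DEF_MEM_PER_CPU_MB ))
}

default_mem_mb 2    # prints 8192

# Hypothetical job: 2 CPUs, but the process needs about 16 GB.
cpus=2
needed_mb=16384
if [ "$needed_mb" -gt "$(default_mem_mb "$cpus")" ]; then
    echo "request memory explicitly, e.g. --mem=${needed_mb}M"
fi
```

In this example the default allocation (8192 MB) is lower than what the process needs, so the memory has to be requested with the --mem option.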

Allocate resources and submit a command immediately

You can allocate and submit a job with srun, for example:

srun <command>

will allocate the default resources for a job and will execute <command> once the job starts. After executing the command, the job terminates and releases the allocated resources. You can change the number of CPUs or the memory required with the --cpus-per-task and --mem parameters, for example:

srun --cpus-per-task 2 --mem=4G <command>

or shorter:

srun -c 2 --mem=4G <command>

The partition can be specified with the -p or --partition option:

srun -c 2 --mem=4G -p testing <command>

Hint

srun allocates resources and executes commands in parallel. You may use srun with MPI programs.

Interactive jobs

Interactive jobs can be launched with the --pty bash option like this:

srun -c 2 --mem=4G -p testing --pty bash

You don’t need to specify a command when launching an interactive job: when an interactive job starts, it opens a new terminal on the working node, in which you can do your work. When you have completed your task, you have to exit the interactive session to free the resources.

Warning

Resources are limited, so it’s important that you free them when you have finished your tasks by leaving the interactive job console with the exit command.

A different approach is to allocate resources with salloc and then call srun with the desired command. salloc opens a new terminal session in which your resources stay allocated until you leave it with the exit command. Within that session, call srun --pty bash (without any other options, since the resources are already allocated) to start a terminal in the interactive job:

$ salloc --cpus-per-task 2 --mem=4G
salloc: Granted job allocation 901
$ srun --pty /bin/bash
$ <command 1>
$ <command 2>
...
$ exit
$ exit
salloc: Relinquishing job allocation 901
salloc: Job allocation 901 has been revoked.

Warning

When you allocate resources with salloc, they are granted to you as stated by the salloc output, even if you don’t call srun. You will need to exit once to leave the interactive session started by srun --pty bash, and exit one more time to free your allocated resources. Resources are not freed until the message Job allocation <job id> has been revoked. is displayed.

Creating a sbatch script

Creating a sbatch script is the recommended way to plan and execute complex scripts on clusters. A sbatch script is a kind of bash script in which we can specify resources using #SBATCH comments with the salloc or srun parameters we saw before. After that, we can specify the commands to execute. Here is a simple template for a sbatch job:

#!/bin/bash
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --cpus-per-task=1             # Declare 1 CPU per task
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.log   # Standard output and error log

<command 1>
<command 2>
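Inside the script, the number of CPUs granted to the job can be passed on to the program being run, so that it does not start more threads than were allocated. The sketch below relies on the SLURM_CPUS_PER_TASK environment variable, which SLURM sets when --cpus-per-task is specified; the job name, the log file name and the commented command line are placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=threads_demo       # Job name (placeholder)
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --cpus-per-task=2             # Declare 2 CPUs per task
#SBATCH --mem=4G                      # Job memory request
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=threads_demo_%j.log  # Standard output and error log

# SLURM_CPUS_PER_TASK is set by SLURM when --cpus-per-task is given;
# fall back to 1 so the script also runs outside of a job.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${threads} thread(s)"

# Placeholder: pass the value to your own program, for example
# <command> --threads "${threads}"
```

Keeping the thread count tied to the allocation means that overriding --cpus-per-task at submission time is enough to rescale the job.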

Next, you can submit your sbatch script with the sbatch command. You can override the parameters specified in the script by providing the appropriate options at launch time:

sbatch --cpus-per-task 2 --mem=4G <sbatch script>
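When a submission is scripted, it is often useful to keep the job ID for later use with squeue, sacct or scancel. With the --parsable option sbatch prints only the job ID, optionally followed by a semicolon and the cluster name. The helper below is a hypothetical sketch that extracts the ID from such a string; the values passed to it are illustrative.

```shell
#!/bin/bash
# job_id_from_parsable: hypothetical helper, not a SLURM command.
# Takes the output of "sbatch --parsable" ("<jobid>" or "<jobid>;<cluster>")
# and prints only the job ID.
job_id_from_parsable() {
    printf '%s\n' "${1%%;*}"
}

# Illustrative values; on the cluster the argument would come from sbatch.
job_id_from_parsable "901"
job_id_from_parsable "901;mycluster"
```

On the controller node this could be combined as: jid=$(job_id_from_parsable "$(sbatch --parsable <sbatch script>)") followed by squeue -j "$jid"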

Cancelling a job

You can cancel a job using scancel and specifying a job id:

scancel <job id>

Or you can cancel all your submitted jobs with -u:

scancel -u <your username>

It is possible to filter jobs by state or other attributes. Please check the scancel documentation.
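As an example of such a filter, the sketch below builds a command that cancels only the pending jobs of a user by combining the --user and --state options of scancel. The wrapper name cancel_pending_cmd is hypothetical, and it only prints the command instead of running it, so that it can be reviewed first.

```shell
#!/bin/bash
# cancel_pending_cmd: hypothetical dry-run wrapper, not a SLURM command.
# Prints the scancel command that would cancel all PENDING jobs
# of the given user, leaving running jobs untouched.
cancel_pending_cmd() {
    echo scancel --user="$1" --state=PENDING
}

# Show the command for the current user (falls back to id -un if USER is unset).
cancel_pending_cmd "${USER:-$(id -un)}"
```

Copy the printed command to the prompt, or pipe it to bash, once you are sure it does what you expect.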

SLURM as Nextflow executor

SLURM can be configured as the default executor for a Nextflow pipeline, using the environment variable NXF_EXECUTOR:

export NXF_EXECUTOR=slurm

This is sufficient to let Nextflow submit jobs through the SLURM controller, without modifying your pipeline. Alternatively, simply add process.executor = "slurm" to the nextflow.config file. See the Nextflow SLURM executor documentation for more information about the available options.
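A configuration file can also carry the partition and the resources requested for each process. The following sketch writes a minimal nextflow.config into a scratch directory; the queue, cpus, memory and time values are assumptions chosen for illustration (queue maps to the SLURM partition), and the scratch directory stands in for your pipeline folder.

```shell
#!/bin/bash
# Scratch directory used only for this example; in practice the file
# lives next to your pipeline.
workdir="$(mktemp -d)"

# Minimal configuration: SLURM executor, testing partition and
# illustrative per-process resources.
cat > "${workdir}/nextflow.config" <<'EOF'
process {
    executor = 'slurm'
    queue    = 'testing'   // SLURM partition
    cpus     = 2
    memory   = '4 GB'
    time     = '1h'
}
EOF

cat "${workdir}/nextflow.config"
```

Individual processes in a pipeline can still override these defaults with their own directives.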

Hint

The NXF_EXECUTOR environment variable is already set on our SLURM clients.