SLURM ===== .. contents:: Table of Contents About SLURM ----------- SLURM is a cluster management and job scheduling system for linux clusters. At IBBA, we maintain a very small cluster instance which can be used to test pipelines and scripts by allocating resources and exploiting all the computational power we have. Here's SLURM description from the `official documentation `_: .. epigraph:: Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. SLURM (or any other scheduling system) can be used to avoid loading a server more than its capacity and to reserve and ensure resources to be used when running a script or an executable: the scheduler will start a job only when resources are free and will ensure that no other jobs compete for the same resources. Cluster environment @IBBA ------------------------- A minimal SLURM cluster installation is composed by a *master* node, which is responsible to manage job *queues* (or better *partitions* in a SLURM ecosystem), one or more *working* nodes, which are the resources were jobs are executed, and one or more *controller* nodes, which are the instances in which users can submit their jobs. Here at IBBA we have a *controller* node which is the same host people uses when login to their remote account, and two working nodes which are used for job execution. There is also a *master* instance where people can access, however this instance is very limited in size and must be used only to manage jobs. Moreover, the same management utilities are available also from the controller node, so you don't need to login into master node to manage your jobs. The other prerequisite to use slurm is users consistency between controller and working nodes: in our local installation users are managed through an LDAP server, and *homes* are mounted through NFS on all cluster nodes: this means that a script or a software installed and working on the controller host is supposed to work in the same way even in master and worker nodes. You don't need to move files from controller node to working nodes, the same environment should be applied in each cluster nodes. This means that a conda environment configured in the *login* node is supposed to work also in a *worker* node, since your home folder is mounted in the same position in all the environment. Moreover :doc:`singularity <../general/singularity>` is installed in each worker and login nodes. .. important:: From here and for the rest of this documentation, we suppose that the user is logged on the *controller* node: user can't login to the *working* nodes, with the only exception of :ref:`interactive jobs `. Using SLURM ----------- Get information on partitions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The *partitions* in SLURM ecosystem are the same as *queues* in other cluster management, like *PBS*: they group working nodes into logical sets and define job queues. Each partition can define some constraints, for example cpu or time limits, user allowed and so on. Priority ordered job are allocated within a partition until the resources provided by the working nodes are exausted: after that a job will remain in a pending state, waiting for resource to be available again. To get information on partitions, you can use the ``sinfo`` command:: $ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST long* up infinite 2 mix node[1-2] testing up 4:00:00 2 mix node[1-2] In this example there are two partitions available: the ``long`` partition is the default partition (indicated through the ``*`` symbol near its name): this means that jobs will be executed on this partition when omitting *partition name* during job submission. ``testing`` partition is another partition which imposes some restrictions, for example no job can last more than 4 hours. The ``STATE`` columns is important since it gives information about cluster usage. A ``mix`` state like in this example means that the partition is not fully loaded. Other available stated are ``idle`` which means that the queue is empty, ``alloc`` when the queue if full and ``down`` when nodes are not available. Please check ``sinfo`` documentation to get a better explanation of such command. .. hint:: The SLURM scheduler installed in our infrastructure is more sophisticated than a generic *FIFO* scheduler: a job with a time limit definition can be executed with an higher priority than a job without it. Submitting jobs through the ``testing`` partition can be useful when debugging pipelines or scripts. Get information on jobs ~~~~~~~~~~~~~~~~~~~~~~~ You can have information about jobs (running/pending) with the ``squeue`` command:: $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 559 long nf-bwa_( cozzip R 1:34:39 1 node1 560 long nf-bwa_( cozzip R 1:34:39 1 node1 558 long nf-bwa_( cozzip R 1:34:42 1 node1 557 long nf-bwa_( cozzip R 3:32:23 1 node2 The ``ST`` column represents the job *status*: ``R`` means *running* while ``PD`` stand for *pending* job (a job waiting to be executed). You can have also detailed information with ``sacct`` command by providing the *job ID* through the ``-j`` parameter like in the following example:: $ sacct -j 557 --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed JobID JobName NTasks NodeList CPUTime ReqMem Elapsed ------------ ---------- -------- --------------- ---------- ---------- ---------- 557 nf-bwa_(E+ node2 14:54:40 8G 03:43:40 557.batch batch 1 node2 14:54:40 03:43:40 Get information on resources ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can have detailed information on ``partitions``, ``nodes`` and ``jobs`` with the ``scontrol show`` command followed by the resource you need. For example, to collect information on partitions, you can do the following:: $ scontrol show partitions PartitionName=long AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=node[1-2] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED PartitionName=testing AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=node[1-2] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED If you require information relying on resource name, you can use the proper *name* after the ``scontrol show `` command, for example to collect information on ``node1``, you can do the following:: $ scontrol show nodes node1 NodeName=node1 Arch=x86_64 CoresPerSocket=8 CPUAlloc=0 CPUTot=16 CPULoad=0.00 AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null) NodeAddr=node1 NodeHostName=node1 Version=21.08.5 OS=Linux 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022 RealMemory=32000 AllocMem=0 FreeMem=9310 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=long,testing BootTime=2022-07-08T10:53:31 SlurmdStartTime=2022-07-21T12:22:43 LastBusyTime=2022-07-21T12:35:09 CfgTRES=cpu=16,mem=32000M,billing=16 AllocTRES= CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s ``scontrol`` can also be used to manage cluster entities, however as a final user you aren't allowed modifying the cluster environment. Please see the ``scontrol`` *manpages* to understand what you can do with this instruction. Migrating from Torque/PBS to SLURM ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Torque/PBS and SLURM provide similar capabilities, so you can search for documentation like `Migrating from Torque to Slurm `_, `Migrating From PBS `_ or `PBS to Slurm Conversion Cheat Sheet `_ to have a comparison between commands for the two scheduler ecosystem. Submitting jobs --------------- You are allowed to submit jobs in all partitions, however they are configured for different purposes. For example, in the ``testing`` partition you aren't allowed to submit a job exceeding the default time-limit, since this partition is intended for testing purpose. If your don't have an idea on when jour job is expected to finish, you will need to submit jour job in the default ``long`` partition with no time limits. Moreover partitions are configured to apply some default values to the submitted job, for example by limiting the RAM usage when not specified. Considering this, you are enforced to declare clearly your needs by allocating your resources: declaring more than you really require could result in jobs waiting for resources to come, while declaring less than required will result in a failed job. .. hint:: Submitting jobs is the only way to get access to the computational power of *working* nodes, since users are not allowed to log in into them and the *controller* node is not intended to support long or intensive tasks. .. warning:: Partitions are configured for allowing *4Gb* of RAM memory for each CPU allocated, if your process requires more than this default limit, it will fail. Allocate resources and submit a command immediately ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can allocate and submit a job with ``srun``, for example:: srun will allocate the default resource for a job and will execute ```` once the job starts. After executing command, the job will terminate and will release the allocated resources. You can change the number of CPUs or the memory required with the ``--cpus-per-task`` and ``--mem`` parameters, for example:: srun --cpus-per-task 2 --mem=4G or shorter:: srun -c 2 --mem=4G Partition can be specified with the ``-p`` or ``--partition`` command:: srun -c 2 --mem=4G -p testing .. hint:: ``srun`` will allocate resources and will execute commands in *parallel*. You may use ``srun`` with *MPI* programs .. _interactive-jobs: Interactive jobs ~~~~~~~~~~~~~~~~ Interactive jobs can be launched with the ``--pty bash`` option like this:: srun -c 2 --mem=4G -p testing --pty bash you don't need to specify a command when launching an interactive job: when an interactive jobs start, it will open a new terminal on the *working* node in which you can do all the stuff. When you have completed your task, you have to ``exit`` the interactive session to free resources. .. warning:: Resources are limited, so it's important that you free resource when have you finished your tasks by leaving the interactive job console with the ``exit`` command. A different approach is to allocate resources with ``salloc`` and then call ``srun`` with the desidered command. However, this approach will result in a new terminal session, in which resources are allocated until exiting terminal with ``exit`` command. The ``salloc`` will open a new terminal in which your resources are allocated, then you have to call ``srun --pty bash`` (without any other options, since they are already allocated) to start your new terminal session in the interactive job:: $ salloc --cpus-per-task 2 --mem=4G salloc: Granted job allocation 901 $ srun --pty /bin/bash $ $ ... $ exit $ exit salloc: Relinquishing job allocation 901 salloc: Job allocation 901 has been revoked. .. warning:: when you allocate a resource with ``salloc``, you will grant resource as stated by ``salloc`` output, even if you don't call ``srun``. You will need to ``exit`` once for the interactive session called by ``srun --pty bash`` and ``exit`` one more time to free your allocated resources. Resources will not be free until the message ``Job allocation has been revoked.`` is displayed. Creating a sbatch script ~~~~~~~~~~~~~~~~~~~~~~~~ Creating a *sbatch* script if the recommended way to plan and execute complex script on clusters. A *sbatch* script is a kind of *bash* script in which we can specify resources using ``#SBATCH`` comment with the ``salloc`` or ``srun`` parameters we saw before. After that, we can specify the command to execute. Here is a simple template for a sbatch job: .. code-block:: bash #!/bin/bash #SBATCH --job-name=serial_job_test # Job name #SBATCH --ntasks=1 # Run on a single task #SBATCH --cpus-per-task=1 # Declare 1 CPUs per task #SBATCH --mem=1gb # Job memory request #SBATCH --time=00:05:00 # Time limit hrs:min:sec #SBATCH --output=serial_test_%j.log # Standard output and error log Next, you can submit your *sbatch* script with ``sbatch`` command. You can override the parameters specified in scripts by providing the appropriate parameter at launch time:: sbatch --cpus-per-task 2 --mem=4G Cancelling a job ---------------- You can cancel a job using ``scancel`` and specifying a *job id*:: scancel Or you can cancel **all** your submitted jobs with ``-u``:: scancel -u It is possible to filter out job by *state* or other attributes. Please check ``scancel`` documentation. SLURM as Nextflow executor -------------------------- SLURM can be configured as the default executor for a Nextflow pipeline, using the environment variable ``NXF_EXECUTOR``: .. code-block:: bash export NXF_EXECUTOR=slurm This is sufficient to let Nextflow submit jobs through SLURM controller, without modifying your pipeline. In alternative simply add ``process.executor = "slurm"`` in the ``nextflow.config`` file. See Nextflow `SLURM executor documentation `_ to get more information about available options. .. hint:: ``NXF_EXECUTOR`` environment variable is already set in our slurm *clients*