Running a job on the cluster

Why use the scheduler (Slurm)?

Using Roaring Thunder as an example, the cluster has one login node (roaringthunder) and many worker nodes behind a private network switch. In our case, there are 65 worker nodes (56 compute, 5 big-mem, 4 GPU). To run on the worker nodes, we submit a batch script to the scheduler.

Don't run long, resource-intensive jobs on the login node!! Use the login node to test your application and code, then deploy to the worker nodes.

On thunder, the scheduler is SLURM, which is the most commonly used scheduler on HPC batch systems. SLURM is open source and there are many good sources of information for it on the internet, for example at https://slurm.schedmd.com/documentation.html. On the schedmd site, we have found the two-page summary very useful, https://slurm.schedmd.com/pdfs/summary.pdf. An even briefer "cheat sheet" is https://www.chpc.utah.edu/presentations/SlurmCheatsheet.pdf. Before running a job, it can be useful to run a few basic SLURM commands. For example, squeue shows the state of the job queue, both running and pending jobs.

[chad.julius@login ~]$ squeue
JOBID PARTITION     NAME     USER ST       TIME NODES NODELIST(REASON)
18233   compute  openmpi jeffrey.  R 1-17:18:26     5 node[020-024]
18238    bigmem PLATANUS software  R 1-15:13:00     1 big-mem002
18248    bigmem PLATANUS software  R 1-00:36:07     1 big-mem003
18256   compute starccm. dillon.p  R    1:44:58     6 node[001-006]
18257   compute starccm. weston.c  R      49:39     6 node[007-012]
18258   compute starccm2 weston.c  R      43:47     6 node[013-018]


We see there are a few jobs running, some on just a single node, some on multiple nodes. In this case, all jobs are running, as shown by the R status under the ST header. If all nodes were occupied, we might also see some jobs in the pending state, PD.
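
You can narrow the squeue listing with filter options; for example (the username here is just illustrative):

[chad.julius@login ~]$ squeue -u chad.julius   # only the given user's jobs
[chad.julius@login ~]$ squeue -t PD            # only pending jobs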


The PARTITION header refers to the SLURM partition; a partition is a group of nodes. When you submit your job, it will choose from the nodes in the partition you select (or from the default partition, if you don't select one).


We can see partition and node information with the sinfo command.

[root@login lib]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all          up 14-00:00:0     18    mix big-mem005,node[001,006-007,009-013,021,029,031,036-037,043,054-056]
all          up 14-00:00:0     43  alloc big-mem[001-004],node[002-005,008,014-020,022-028,030,032-035,038-042,044-053]
all          up 14-00:00:0      4   idle gpu[001-004]
compute*     up 14-00:00:0     17    mix node[001,006-007,009-013,021,029,031,036-037,043,054-056]
compute*     up 14-00:00:0     39  alloc node[002-005,008,014-020,022-028,030,032-035,038-042,044-053]
bigmem       up 14-00:00:0      1    mix big-mem005
bigmem       up 14-00:00:0      4  alloc big-mem[001-004]
gpu          up 14-00:00:0      4   idle gpu[001-004]

The output of sinfo shows the partitions and what the nodes in each partition are doing; alloc means jobs are running on the node, idle means it is available to accept jobs. We have four partitions configured. The default partition is compute, which is why it is marked with an asterisk. This partition contains 56 compute nodes that each have 40 cores and 196GB of RAM. The bigmem partition contains five nodes that each have 3TB of RAM and 80 cores. The gpu partition contains 4 GPU nodes, each with 40 cores and 724GB of RAM; gpu001-003 each contain 2 NVIDIA P100s and gpu004 contains 2 NVIDIA V100s. The all partition includes all the nodes.
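
If you want to verify the core and memory counts yourself, sinfo can print them with a custom format string; the field selection below is just one reasonable choice:

[chad.julius@login ~]$ sinfo --format="%P %.5c %.8m %G"   # partition, CPUs per node, memory (MB) per node, GRES (GPUs)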


More information about squeue, sinfo and other SLURM commands can be obtained with the Linux man command, for example man sinfo.


Other important SLURM commands are sbatch, to submit a SLURM job; scancel, to cancel a job; sacct, to see accounting information about a job; and srun, which can be used to initiate an interactive SLURM job.


[chad.julius@login ~]$ srun --partition=compute --pty bash
[chad.julius@node019 ~]$


The --pty option means execute in a pseudo-terminal, and bash is the name of the command to execute; bash is the Linux command-line interpreter. In this case, your interactive job is just like a regular SLURM job, but connected to a terminal you can type commands into. Since we have not specified otherwise, in the example above it will be given the default number of processes (one), the default maximum walltime, etc. You can specify non-default values with more options on the srun invocation.
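
If the defaults are not enough for your interactive work, you can pass the usual resource options to srun as well; the values below are only an illustration:

[chad.julius@login ~]$ srun --partition=compute --ntasks=4 --time=2:00:00 --pty bash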


Example SLURM job
For examples in general on the system, you can browse through the /examples folder. Not all of these are guaranteed to work; some have been tested, while others have been put there just to give a general idea. For our example here, let's use a MATLAB example in /examples/matlab. The MATLAB script we are going to run is in the file matrixsoln.m.


% file matrixsoln.m
% MATLAB matrix linear solution (Ax=b) problem, to illustrate multithreading.
%
A=rand(20000);
b=rand(20000,1);
tic
x=A\b;
toc
norm(A*x-b)


This MATLAB script creates a fairly big random 2D array and vector, and then solves the Ax=b matrix problem using MATLAB's built-in solver, which is implicitly multithreaded. So we will request an entire compute node in our SLURM script, matlab.slurm:

#!/bin/bash
#SBATCH --job-name=matlab # Job name
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=40 # CPUs per node (MAX=40 for CPU nodes and 80 for GPU)
#SBATCH --output=out-%j-%N.log # Standard output (log file)
#SBATCH --partition=compute # Partition/Queue
#SBATCH --time=72:00:00 # Maximum walltime
module purge
module use /cm/shared/modulefiles_local
module load shared
module load slurm/18.08.4
module load matlab/R2018b
module list
date
time matlab -nodisplay < matrixsoln.m
date

The SLURM script is a bash script. The lines starting with #SBATCH are comments to bash, but when the script is submitted to SLURM they are interpreted as resource requests. Resources are things such as processors, memory, time, etc. There are many possible resources one can request; the above script just gives a simple example. The two lines below specify the nodes and tasks (processors) we want:

#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=40 # CPUs per node (MAX=40 for CPU nodes and 80 for GPU)


Since we know in this case (because of prior testing) that MATLAB will try to grab all processors, we ask for all 40 on a node.

The partition selection is important: we know we want a compute node, and if we submitted to the all partition instead, we could wind up with a big-mem or gpu node, which we don't want.


The time specification is the maximum walltime (real, elapsed time) the running job can have. When that time is reached, if your job is still running, it will be terminated. After the #SBATCH resource-request lines, we have module commands, which load the modules we need for this job. In general, we always want to keep the first four module lines as they are, then add the specific extra modules we want for this job; matlab, in this case.
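
There are other #SBATCH directives you may find useful; a few common ones are sketched below (the values and e-mail address are placeholders, not recommendations):

#SBATCH --mem=100G                    # memory per node
#SBATCH --error=err-%j-%N.log         # separate file for standard error
#SBATCH --mail-type=END,FAIL          # e-mail when the job ends or fails
#SBATCH --mail-user=first.last@example.edu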


Running the SLURM job
To run the example, first create a local folder, then copy the MATLAB script and the SLURM script into it.
[chad.julius@login ~]$ mkdir matlabtest
[chad.julius@login ~]$ cd matlabtest/
[chad.julius@login matlabtest]$ cp /examples/matlab/matlab.slurm .
[chad.julius@login matlabtest]$ cp /examples/matlab/matrixsoln.m .
[chad.julius@login matlabtest]$ ls
matlab.slurm matrixsoln.m
[chad.julius@login matlabtest]$

Now submit the job.

[chad.julius@login matlabtest]$ sbatch matlab.slurm
Submitted batch job 18262
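
If you submit from a script and want to capture the job ID, sbatch's --parsable option prints just the ID (the variable name here is our own choice):

[chad.julius@login matlabtest]$ JOBID=$(sbatch --parsable matlab.slurm)
[chad.julius@login matlabtest]$ squeue -j $JOBID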

After submitting the job, you can use squeue to see if it is in the system.


[chad.julius@login matlabtest]$ squeue
JOBID PARTITION     NAME     USER ST       TIME NODES NODELIST(REASON)
18233   compute  openmpi jeffrey.  R 1-18:47:01     5 node[020-024]
18238    bigmem PLATANUS software  R 1-16:41:35     1 big-mem002
18248    bigmem PLATANUS software  R 1-02:04:42     1 big-mem003
18256   compute starccm. dillon.p  R    3:13:33     6 node[001-006]
18262   compute   matlab chad.jul  R       0:06     1 node007

We can see it running, job number 18262, on node007.
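
While the job is still running, you can watch its progress; for example, with the job ID and log file name from above:

[chad.julius@login matlabtest]$ squeue -j 18262                 # status of just this job
[chad.julius@login matlabtest]$ tail -f out-18262-node007.log   # follow the output as it is written

If something went wrong, scancel 18262 would remove the job from the system. When the job is done, the complete output file will be in the folder; display it in the terminal with the more command.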

[chad.julius@login matlabtest]$ ll
total 0
-rw-r--r-- 1 chad.julius domain users 540 Feb 14 14:19 matlab.slurm
-rw-r--r-- 1 chad.julius domain users 159 Feb 14 14:19 matrixsoln.m
-rw-r--r-- 1 chad.julius domain users 564 Feb 14 14:21 out-18262-node007.log


[chad.julius@login matlabtest]$
[chad.julius@login matlabtest]$ more out-18262-node007.log


Currently Loaded Modules:
1) shared 2) slurm/17.11.8 3) matlab/R2018b


Thu Feb 14 14:21:04 CST 2019


< M A T L A B (R) >
Copyright 1984-2018 The MathWorks, Inc.
R2018b (9.5.0.944444) 64-bit (glnxa64)
August 28, 2018


To get started, type doc.
For product information, visit www.mathworks.com.
>> >> >> >> >> >> >> >> Elapsed time is 13.110310 seconds.
>>
ans =
1.2815e-08
>>
real 0m54.310s
user 3m47.598s
sys 1m24.934s
Thu Feb 14 14:21:58 CST 2019
[chad.julius@login matlabtest]$
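
After the job completes, sacct can report accounting information for it; the field list below is just one reasonable selection:

[chad.julius@login matlabtest]$ sacct -j 18262 --format=JobID,JobName,Partition,Elapsed,State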

Disk quotas and policies
Each user has a folder created in /home upon first login, with a 100 GB quota per user. Users that need more disk space to run can request a folder be created in the scratch area, /gpfs/scratch, part of the GPFS shared file system. No user quotas are implemented in scratch, but a data expiration policy will eventually be applied and older data will be deleted. No backups are currently running; eventually we will have tape backups of home, but not scratch.
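
To check how much of your home quota you are using, a quick approximation is the standard Linux du command:

[chad.julius@login ~]$ du -sh ~   # total size of your home folder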