Computing and GPU cluster – CS Teaching Labs

The Teaching Labs have a small computing cluster. It is meant for teaching distributed computing, scientific computing, GPU programming, and the like; it is neither powerful enough nor intended for production computation. Access is allowed only to students registered in specific courses.

Cluster systems can be accessed only through the Slurm workload manager; direct login (e.g. with ssh) is not allowed. See below for a list of basic Slurm commands. Your instructor should offer details, in particular which partitions are available to your course.

Nodes and partitions

The cluster contains these nodes:

squid01 – squid07

Two twelve-core Intel Xeon Silver 4310 CPUs running at 2.10 GHz, with 18MiB of cache
256GiB of memory
Two NVIDIA GeForce RTX 2070 SUPER GPUs, each with 8GiB of memory; the NVIDIA CUDA 12.2 software suite (same as on Teaching Labs workstations with GPUs).
10Gbps Ethernet networking to every other squidnn system; 10Gbps uplink to other teach.cs systems.

coral01 – coral08

An eight-core Intel Xeon Silver 4208 CPU running at 2.10 GHz, with 11MiB of cache
96GiB of memory
Two NVIDIA RTX A4000 GPUs, each with 16GiB of memory; the NVIDIA CUDA 12.2 software suite (same as on Teaching Labs workstations with GPUs).
10Gbps Ethernet networking to every other coralnn system; 10Gbps uplink to other teach.cs systems.

prawn01 – prawn14

An eight-core Intel Xeon Silver 4208 CPU running at 2.10 GHz, with 11MiB of cache
96GiB of memory
Two NVIDIA GeForce RTX 2080 Ti GPUs, each with 11GiB of memory; the NVIDIA CUDA 12.2 software suite (same as on Teaching Labs workstations with GPUs).
10Gbps Ethernet networking to every other prawnnn system; 10Gbps uplink to other teach.cs systems.

Students in a given course will have access only to certain partitions, made from specific subsets of these nodes. Your instructor will tell you which. The sinfo command will show the partitions you’re allowed to use.

Basic Slurm Commands

Here is a summary of frequently-used Slurm commands. They may be run from any Teaching Labs workstation or remote-access system. See the official documentation for details and tutorials.

Status Commands

sinfo [-p partition]
List all partitions you’re allowed to use, or a specific partition if you’re allowed there, with information about which hosts are up and which are in use.

squeue [-p partition]
List all current sessions (whether batch jobs started with sbatch or interactive sessions via srun or salloc) on any partition you’re allowed to use, or on the named partition. The listing includes the user responsible, the real time the session has been running, and which nodes it is using. If a session is waiting for resources before running, the node list will say (Resources).
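As a rough illustration of the two status commands, the shell function below checks one partition and then lists the sessions on it. The function name cluster_status and the default partition coral are made up for this sketch; -u is squeue's standard option for limiting the listing to named users.

```shell
# Show the state of one partition and the sessions running or waiting on it.
# cluster_status is a hypothetical helper name, not part of Slurm.
cluster_status() {
    partition=${1:-coral}   # example default; use a partition your course may access
    sinfo -p "$partition"                    # which nodes are up and which are in use
    squeue -p "$partition"                   # every session on that partition
    squeue -p "$partition" -u "$(id -un)"    # only your own sessions
}
```

Calling cluster_status prawn would query the prawn partition instead, provided your course is allowed to use it.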

Running batch jobs

This is the best way to run long computations.

sbatch -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-o outfile] [-e errfile] [script]
Queue a batch job that runs the named script; if no script is named, sbatch reads the script from standard input. The options are:

-N nodecount

Allocate nodecount nodes (computers) to this job. The default is 1, which is usually also the maximum.

-c ncpus

Allocate ncpus CPU cores to this job; default 1.

--gres gpu

Allocate a GPU for this job; default none. No other job will be allowed to use that GPU while yours is running.

-o outfile

Store standard output in outfile.

-e errfile

Store standard error in errfile.

If enough resources (e.g. enough nodes) are available, the job will be started right away; otherwise it will wait.

Beware that your job will be restricted to a single host CPU core unless you use the -c option, and a GPU will be available only if you use --gres gpu. If a node has two GPUs, two jobs may run on it at the same time, each using one; arrangements are made to point your CUDA code at the GPU allocated to your job.
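A batch script can carry its own options on #SBATCH comment lines, so they need not be repeated on the command line. The following is a minimal sketch, not a course-supplied template: the partition coral, the core count, and the file names job.out and job.err are placeholders, and it assumes the cluster exports CUDA_VISIBLE_DEVICES to identify the allocated GPU, which is Slurm's usual mechanism but is not confirmed here.

```shell
#!/bin/sh
#SBATCH -p coral
#SBATCH -c 4
#SBATCH --gres gpu
#SBATCH -o job.out
#SBATCH -e job.err

# Print what the job was given. Slurm sets these variables inside a job;
# the fallbacks after ":-" only apply when the script is run by hand.
# CUDA_VISIBLE_DEVICES is an assumption about how this cluster is configured.
report() {
    echo "job ${SLURM_JOB_ID:-none} on $(hostname): ${SLURM_CPUS_PER_TASK:-1} core(s), GPU ${CUDA_VISIBLE_DEVICES:-none}"
}

report
# The real computation would follow here.
```

Saved as, say, job.sh in your home directory, it would be submitted with sbatch job.sh. Options given on the sbatch command line take precedence over the #SBATCH lines.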

Running commands interactively

To avoid tying up nodes when others have work to do, interactive commands should be used only for short tests and debugging, not for long computations. An interactive session may be automatically interrupted after a certain amount of real time (ten minutes, say).

If enough resources (e.g. enough nodes) are available, the interactive session will be started right away; otherwise it will print a message like
srun: job 208 queued and waiting for resources
and wait. Hit the interrupt key to give up on waiting.

srun -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-o outfile] [-e errfile] [--pty] [-l] [-I] command

Allocate nodes, cores, and GPUs as for sbatch from partition, and run command on each node. Give each node a copy of srun’s standard input; send standard output and error to srun’s own unless redirected with the -o and -e options. Outfile and errfile must be in directories accessible from any Teaching Labs system, not just the host where srun is called; in particular, system directories like /tmp won’t work.

If -l is given, label each line of standard output or error with a decimal task number (different on each node) followed by a colon.

Normally, if there aren’t enough nodes or GPUs or cores, srun will wait until enough are available. If -I is given, srun will give up immediately instead.

For example:

$ srun -p coral -N 3 -l hostname
0: coral01
1: coral02
2: coral03
$
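For a short debugging session it can be handy to get a shell on a node rather than run a single command. The sketch below combines options from the synopsis above: --pty asks srun to attach a pseudo-terminal so the shell behaves interactively, and -I makes it give up at once if no node is free. The function name gpu_shell, the default partition coral, and the choice of two cores and bash are illustrative only.

```shell
# Ask for an interactive shell with two cores and one GPU.
# gpu_shell is a hypothetical helper name, not part of Slurm.
gpu_shell() {
    partition=${1:-coral}   # example default; use a partition your course may access
    # -I: fail immediately rather than queue; --pty: run the shell on a pseudo-terminal
    srun -p "$partition" -c 2 --gres gpu -I --pty bash
}
```

Leaving the shell with exit ends the session and frees the GPU for others.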

salloc -p partition [-N nodecount] [-c ncpus] [--gres gpu] [-I] [command]
Allocate nodes, cores, and GPUs as for sbatch from partition, then run command (default your login shell) with environment variables describing the allocation. Calling srun under control of salloc will run tasks within the allocation:

srun -N nodecount command runs command on each of nodecount hosts chosen from within the allocation;
srun command runs it on every allocated host.

Option -I is like that in srun: don’t wait if too few hosts are available.

For example:

$ salloc -N 5 -p prawn
salloc: Granted job allocation 192
$ srun hostname
prawn05
prawn02
prawn04
prawn03
prawn01
$ srun -N 3 hostname
prawn02
prawn01
prawn03
$ exit
salloc: Relinquishing job allocation 192
$
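The environment variables that salloc sets can be inspected from the shell it starts. This small sketch prints three that Slurm documents for salloc (SLURM_JOB_ID, SLURM_JOB_NUM_NODES, and SLURM_JOB_NODELIST); the function name show_allocation and the fallback text are invented for the example.

```shell
# Summarize the current allocation; meant to be called from the shell
# that salloc starts. show_allocation is a hypothetical helper name.
show_allocation() {
    # The text after ":-" is printed only when the variable is unset,
    # i.e. when the function is called outside an allocation.
    echo "job:   ${SLURM_JOB_ID:-not inside an allocation}"
    echo "nodes: ${SLURM_JOB_NUM_NODES:-0} (${SLURM_JOB_NODELIST:-none})"
}
```

Slurm usually writes the node list in a compressed form, so a five-node allocation may appear as a single bracketed range rather than five names.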