Quest Slurm Quick Start

On May 1st, 2019 the Quest scheduler will change from Moab to Slurm. Researchers using Quest will have to update their job submission scripts and use different commands to run under the new scheduler.

A schedule of Slurm workshops offered during the transition period is available. For more information, please see the project page.

This page contains information for using the Slurm test cluster on Quest.

To view Slurm training videos, visit Quest Slurm Scheduler Training Materials.

Quick Start

Please log in to the Slurm test cluster and update your job submission scripts for Slurm in advance of May 1, 2019. All job submission scripts that currently run on Quest must be modified to run under the new Slurm scheduler. To access the Slurm test cluster on Quest, log in to slurmtest.northwestern.edu:

ssh <netid>@slurmtest.northwestern.edu

From the command line, you can submit jobs to the Slurm scheduler. When submitting jobs, use your existing allocations and queue names. During the testing period, Slurm jobs will not be debited against your production allocations.

The 200 nodes in the test Slurm cluster each have 20 cores and 128GB of RAM; submission scripts may need to be modified to reflect these resource limits. The maximum walltime accepted on the Slurm test cluster is 48 hours. For memory allocation purposes, each core in the pilot cluster is assigned 6GB of RAM.

Before you submit your Slurm job, modify your existing job submission script to change the Moab directives into Slurm directives. Most submission scripts are straightforward to modify, but implementing advanced features in Slurm may take additional time and effort.

Simple Batch Job Submission Script Conversion Example

Slurm uses #SBATCH scheduler directives in job submission scripts, and the sbatch command to submit those scripts at the command line. To modify your job scripts to work with Slurm, you'll need to edit all lines that currently begin with #MSUB. In addition to replacing each #MSUB with #SBATCH, at a minimum you'll need to edit the lines specifying the queue, nodes, and walltime. If you have additional scheduler directives, please see the full list of Slurm job directive commands and their analogues in Moab.

Example Moab job submission script
#!/bin/bash
#MSUB -A b1042               ## account
#MSUB -q genomics            ## queue name 
#MSUB -l nodes=1:ppn=1       ## number of nodes and cores
#MSUB -l walltime=00:10:00   ## walltime
#MSUB -N sample_job          ## name of job

module load python           ## Load modules

python helloworld.py         ## Run program
                        
Example Slurm job submission script
#!/bin/bash
#SBATCH -A b1042             ## account (unchanged)
#SBATCH -p genomics          ## "-p" instead of "-q"
#SBATCH -N 1                 ## number of nodes
#SBATCH -n 1                 ## number of cores
#SBATCH -t 00:10:00          ## walltime
#SBATCH --job-name=sample_job ## name of job ("--job-name" instead of "-N")

module purge all             ## purge environment modules
module load python           ## Load modules (unchanged)

python helloworld.py         ## Run program (unchanged)
                        
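Because each core on the test cluster is budgeted at 6GB of RAM, jobs that need more memory can request it explicitly. The --mem and --mem-per-cpu directives shown below are standard Slurm options; the values are placeholders for illustration only.

#SBATCH --mem=24G            ## total memory per node for the job

or, to specify memory per core instead:

#SBATCH --mem-per-cpu=12G    ## memory per requested core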

Environment

Note that when you submit your job, Slurm passes your current environment variables to the compute nodes, including any modules you've loaded on the command line before submitting. By comparison, Moab does not pass environment variables, but it does source your ~/.bashrc file on the compute node before running the job submission script. Because Slurm and Moab use different methods to replicate a user's environment on the compute nodes, scripts that rely on environment variables may behave in unexpected ways or fail.
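If your script depends on a clean, predictable environment, one option is to rebuild that environment inside the script itself. The sketch below combines the standard Slurm --export option with a module purge; treat it as an illustration of the approach, not a required configuration.

#!/bin/bash
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:10:00
#SBATCH --export=NONE        ## do not propagate the submission shell's environment variables

module purge all             ## start from a clean set of modules
module load python           ## load exactly the modules the job needs

python helloworld.py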

Queues == Partitions

Note that in Slurm, a "queue" is called a "partition". Moving forward, what we have historically called "partitions" on Quest - Quest5, Quest6 and Quest8 - will now be referred to as "architectures".

Submitting a Batch Job

On the command line, sbatch or qsub replaces msub for job submission. To use sbatch to submit a job to the Slurm scheduler:

sbatch job_script.sh
Submitted batch job 546723

or, in cases where you only want the job number returned, you can use qsub:

qsub job_script.sh
546723

Both sbatch and qsub can be used to submit a batch script to Slurm. Note that not all functionality of qsub is provided under Slurm, making sbatch the preferred command for submitting jobs.  

Slurm will reject the job at submission time if there are requests or constraints within the job submission script that Slurm cannot meet. This gives the user the opportunity to examine the rejected job request and resubmit it with the necessary corrections. With Slurm, if a job number is returned at the time of job submission, the job will run although it may experience a wait time in the queue depending on how busy the system is.
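sbatch also accepts most scheduler options directly on the command line, where they take precedence over the #SBATCH directives inside the script. For example, reusing the account and queue names from the sample script above:

sbatch --account=b1042 --partition=genomics --time=00:30:00 job_script.sh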

Submitting an Interactive Job

For interactive jobs, srun replaces msub. To launch an interactive job from the command line, use the srun command:

srun --pty --account=<account> --time=<hh:mm:ss> --partition=<queue_name> --mem=<xG> bash

This launches a terminal session on a compute node as a single-core job. To request additional cores for multi-threaded applications, include the -N and -n flags:

srun --pty -N 1 -n 6 --account=<account> --time=<hh:mm:ss> --partition=<queue_name> --mem=<xG> bash
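For example, a six-core, one-hour interactive session on the genomics queue under the b1042 allocation (values taken from the sample script above; substitute your own account, queue, time, and memory) could be requested with:

srun --pty -N 1 -n 6 --account=b1042 --partition=genomics --time=01:00:00 --mem=12G bash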

For additional information on interactive jobs under Slurm, please see Submitting a Job on Quest.

Monitoring Jobs

squeue replaces both showq and qstat. To use squeue to see all jobs:

squeue

To see just your jobs:

squeue -u <NetID>

squeue returns information on jobs in the Slurm queue:

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
546723     short slurm2.s  jon9348  R    INVALID      1 qnode4017
546711     short high-thr  akh9585  R       2:34      3 qnode[4180-4181,4196]
546712     short high-thr  akh9585  R       2:34      3 qnode[4078,4086,4196]
Field      Description
JOBID      Number assigned to the job upon submission
PARTITION  The queue (also called partition) that the job is running in
NAME       Name of the job submission script
USER       NetID of the user who submitted the job
ST         State of the job: "R" for Running or "PD" for Pending (Idle)
TIME       hours:minutes:seconds the job has been running; can show INVALID for the first few minutes
NODES      Number of nodes the job resides on
NODELIST   Names of the nodes the job is running on
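squeue also accepts standard Slurm filtering options; for example, the commands below (shown for illustration) restrict the output to your pending jobs and display their estimated start times:

squeue -u <NetID> -t PENDING     ## show only your pending jobs
squeue -u <NetID> --start        ## show estimated start times for pending jobs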

Cancelling Jobs

To cancel a single job use scancel:

scancel <job_ID_number>

To cancel all of your jobs:

scancel -u <NetID>
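scancel also supports standard Slurm filters; for example, the commands below (shown for illustration) cancel only your pending jobs, or jobs with a particular name:

scancel -u <NetID> --state=PENDING    ## cancel only your pending jobs
scancel --name=<job_name>             ## cancel jobs with the given name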

For additional job commands, please see Common Job Commands.
