Specifying Memory for Jobs on Quest

This document outlines how to modify job submission scripts to request the correct amount of memory for jobs on Quest.

Starting September 20, Quest will enforce limits on the random-access memory (RAM) reserved by jobs. Jobs running on Quest's compute nodes will no longer be able to access memory that they did not reserve in their job submission scripts. Previously, jobs that used more memory than they reserved could encroach on other jobs' memory, sometimes causing nodes to crash and jobs to be lost. Once these limits are in place, researchers should notice improved reliability of the compute nodes and more consistent job completion.

Why reserve memory for a job?
To ensure that their jobs continue to run successfully after memory limits are enforced, researchers should identify how much RAM their jobs use and request that amount of memory in their job submission scripts. Jobs that do not reserve enough memory for their programs may run very slowly or fail to complete. To run successfully on Quest, a job's submission script should specify the amount of memory the job will use on the compute nodes. For example, to reserve a total of 10 GB of memory per node, include the Slurm directive:

#SBATCH --mem=10G

If memory is not specified, jobs are assigned 3.25 GB of memory per requested core by default. This may not be the right amount for your job; it is best practice to request the amount of memory your job actually needs.
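For reference, a minimal job submission script with a memory request might look like the sketch below. The account and partition names are placeholders, and the module and program lines are examples only; substitute the values appropriate for your own work.

#!/bin/bash
#SBATCH --account=<your_allocation>    # placeholder: your allocation/account name
#SBATCH --partition=<partition_name>   # placeholder: the partition your allocation can use
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --mem=10G                      # reserve 10 GB of memory per node

module load python                     # example only: load the modules your job needs
python my_script.py                    # example only: run your program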

How much memory should I reserve for my job?
To identify how much memory your job requires, you will need a job ID number from a similar job that you have already run successfully on Quest. If your job has not run on Quest or you do not have a job ID number, run your job and let it run to completion before continuing. To see your recent job ID numbers, use the sacct -X command. More information on sacct is available in its man page (man sacct).
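For example, the following sacct invocation lists your recent jobs with their IDs, names, states, elapsed times, and requested memory; the --format fields shown are one reasonable selection, not the only one:

sacct -X --format=JobID,JobName,State,Elapsed,ReqMem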

Once your job has finished successfully, run the command:

seff <job_id_number>

This will return output similar to this:

Job ID: 767731
Cluster: quest
User/Group: abc1234/abc1234
State: COMPLETED (exit code 0)
Cores: 2
CPU Utilized: 00:10:00
CPU Efficiency: 100.00% of 00:10:00 core-walltime
Job Wall-clock time: 00:10:00
Memory Utilized: 3.00 GB
Memory Efficiency: 150.00% of 2.00 GB

Look at the last two lines. "Memory Utilized" reports the highest amount of memory the job used at any one time, in this example 3 GB. "Memory Efficiency" is the percentage of the reserved memory that was actually used; here, 2 GB was reserved but 3 GB was used, giving an efficiency of 150%. Efficiency over 100% means the job used more memory than it reserved, and you must modify your job submission script to reserve more. This job should reserve more than the 3 GB it actually used. For example:

#SBATCH --mem=4G

which requests four gigabytes of total memory per node for the job. Under Slurm, memory can also be reserved per core:

#SBATCH --mem-per-cpu=2G

As this job ran on 2 cores, this also reserves 4 GB of total memory for the job. Note that only one of these #SBATCH memory options may be specified in a job submission script.
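For instance, a sketch of the per-core form for this two-core job, assuming the cores are requested with --ntasks, would be:

#SBATCH --ntasks=2            # two cores
#SBATCH --mem-per-cpu=2G      # 2 GB per core, 4 GB total for the job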

Reserving slightly more memory than your job uses is best practice, as the same job may require somewhat different amounts of memory depending on the data it processes in each run. However, if you reserve significantly more memory than your job requires, it may wait longer in the queue than necessary while the scheduler finds a node with that much memory free, and memory that is reserved but not used wastes shared resources that other researchers could be using.

Note that if the job State is FAILED or CANCELLED, the Memory Efficiency percentage reported by seff will be inaccurate. Use the seff command only on jobs that COMPLETED successfully.
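If a job did not complete and you still need an estimate of its memory usage, one option is to query sacct directly; the MaxRSS field reports the peak resident memory of each job step. The invocation below is a sketch showing one possible field selection:

sacct -j <job_id_number> --format=JobID,State,ReqMem,MaxRSS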
