Managing Jobs on Quest

How to manage batch jobs after they’ve been submitted on Quest.

Please note: Job resource requirements in the submission script, such as the number of cores or nodes or the amount of memory, are recorded by the scheduler when the job is submitted. Any changes made to the job script (jobscript.sh in the example on Submitting a Job on Quest) after the job has been submitted with msub will have no effect on the job. After submitting a job, you can only hold, release, kill, or modify the parameters of the job using the Moab commands listed below.

If your job isn't running or ended before you expected, see Troubleshooting Jobs for some tips.

Job ID

When you submit a job on Quest using msub, the scheduler returns a job ID and queues the job for execution.

For example, if a user submitted jobscript.sh using msub:

[abc123@quser10 ~]$ msub jobscript.sh

If the job was submitted successfully, a job ID will be returned:

13870586

You can use this job ID later to monitor the job.

In most cases, the job ID will be only numbers. In some cases, the job ID may start with "Moab.", for example:

Moab.13870586

"Moab." is part of the job ID. There are two common cases when this might occur:

  • Jobs submitted as part of a job array will be assigned a job ID prefixed by "Moab." This is normal.
  • Jobs that are rejected by the Torque resource manager may be assigned a job ID prefixed by "Moab." This can happen, for example, if you request more cores per node than are available on Quest. If your job ID begins with "Moab." and you aren't using a job array, then you should use the checkjob command to investigate problems with your job.  Also see Troubleshooting Jobs on Quest for additional details.
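The job ID handling described above can be sketched in a submission wrapper. This is a minimal sketch, not part of Quest's tooling: the helper names clean_job_id and has_moab_prefix are hypothetical, and it assumes msub may surround the job ID with blank lines or spaces.

```shell
#!/bin/sh
# Hypothetical helpers for working with the job ID printed by msub.

# Strip blank lines and surrounding whitespace from msub output,
# leaving only the job ID (e.g. "13870586" or "Moab.13870586").
clean_job_id() {
    tr -d '[:space:]'
}

# Succeed (exit status 0) if the job ID starts with "Moab.".
has_moab_prefix() {
    case "$1" in
        Moab.*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Typical use on a login node (commented out because it submits a real job):
#   jobid=$(msub jobscript.sh | clean_job_id)
#   if has_moab_prefix "$jobid"; then
#       # Expected for job arrays; otherwise worth investigating.
#       checkjob -v "$jobid"
#   fi
```

Keeping the job ID in a variable makes it easy to pass to checkjob or mjobctl later in the same session.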

Job Status

After submitting a job, you can execute the showq command or the checkjob command to check the status of your job. On Quest, submitted jobs are analyzed and queued by the scheduler. When a job is sent to the scheduler, it is first checked by a resource manager. The resource manager ensures that you have enough resources, particularly compute hours, on the system in order to run your job.

If enough resources exist in your allocation, the job is forwarded to the scheduler to be put in the queue. Note that if there is a typo in your job submission script, it may be flagged by the resource manager, and your job will be rejected and placed on BatchHold.

When the scheduler receives a job, it will prioritize your job relative to other jobs currently in the queue. The accounting system assumes that your job will run with the amount of time and number of cores that you specified in your job submission script. If your job requires less time than you specified, the accounting system only charges you for the time used on the system.
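To get a rough sense of how much a request reserves up front, the arithmetic can be sketched as below. This assumes that compute hours are counted as cores multiplied by walltime in hours; that formula is an assumption for illustration and is not stated in this document, and the helper name core_hours is hypothetical.

```shell
#!/bin/sh
# Hypothetical estimate of the compute hours a request reserves.
# ASSUMPTION: compute hours = cores * walltime in hours.

# Usage: core_hours <cores> <HH:MM:SS>
core_hours() {
    awk -v cores="$1" -v walltime="$2" 'BEGIN {
        split(walltime, part, ":")
        seconds = part[1] * 3600 + part[2] * 60 + part[3]
        printf "%.2f\n", cores * seconds / 3600
    }'
}

# 24 cores requested for 2 hours 30 minutes
core_hours 24 02:30:00
```

Under this assumption, a job that finishes early is charged for less than the estimate, since only the time actually used is counted.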

If your allocation does not have enough compute hours left to run your job, the job will be placed on BatchHold or SystemHold. Jobs in a BatchHold or SystemHold state will remain in this state until you cancel the job or a system administrator intervenes, either by adding enough compute hours for your job to run or by redirecting your job to another account that you can access. If your job is under a BatchHold or SystemHold and you need assistance from a system administrator, please contact quest-help@northwestern.edu for help.

Generally, the more resources a job requires, the longer it may sit in the queue until the necessary resources become free and the job can be scheduled. Full access nodes are dedicated resources, so the access criteria, queues, job duration limits, and job size limits for these nodes are different. See Full Access Job Commands for specialized information.

Commonly Used Commands

The showq Command

The showq command (without any options) displays the job queues for all users on Quest. To quickly access information about your specific job(s), there are options to filter the results (you can combine multiple options):

Command                          Description
showq -u <netID>                 Show only jobs belonging to the specified user
showq -r                         Show running jobs
showq -i                         Show idle jobs
showq -b                         Show blocked jobs
showq -w acct=<allocationID>     Show only jobs belonging to the specified account
showq --help                     See documentation and additional options

The output of the showq command groups jobs into three categories: active, eligible, and blocked. Active jobs are running. Eligible jobs are currently idle and will be considered by the scheduler when additional computing resources become available. There is a limit of 30 idle jobs per user. If you have submitted more than 30 jobs and they weren't scheduled immediately, then some of the jobs will appear in the Blocked list; these jobs will be moved to the Eligible list as space becomes available. Jobs may also appear on the Blocked list if they were submitted to an expired allocation, if the allocation has insufficient compute hours to complete the job, or if the job has other errors or resource limit issues. Use the checkjob command to get more information.
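Because of the 30 idle job limit, it can be handy to count how many of your jobs appear in a showq listing. The sketch below assumes that each job line of showq output carries the job ID in the first column and the username in the second; that layout, and the helper name count_user_jobs, are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count job lines for one user in showq output.
# ASSUMPTION: job lines have the job ID in column 1 and the username in column 2.

# Usage: showq -i -u <netID> | count_user_jobs <netID>
count_user_jobs() {
    awk -v user="$1" '$2 == user { n++ } END { print n + 0 }'
}

# Example on a login node (commented out because it needs the Moab client):
#   idle=$(showq -i -u abc123 | count_user_jobs abc123)
#   echo "Idle jobs: $idle of 30 allowed"
```

Header and summary lines are skipped because their second column does not match the username.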

The checkjob Command

The checkjob command displays detailed information about a submitted job’s status and diagnostic information that can be useful for troubleshooting submission issues. It can also be used to obtain useful information about completed jobs such as the allocated nodes, resources used, and exit codes. The -v flag is useful for gathering additional diagnostic information about your job.

Example usage:

checkjob -v <jobID>

where you can get your <jobID> using the showq commands above.

Example for a Successfully Running Job

Note in the output below that:

  • The State is listed as Running (State: Running)
  • The amount of walltime used is listed with the amount of walltime requested at the bottom (Reservation '19936802' (-00:00:25 -> 00:04:35 Duration: 00:05:00))
  • The node name(s) are listed (Allocated Nodes: [qnode5056:1])
[<netid>@quser13 ~]$ checkjob -v 19936802
job 19936802 (RM job '19936802.qsched03.quest.it.northwestern.edu')

AName: testjob5
State: Running 
Creds:  user:<netid>  group:<netid>  account:<allocationID>  class:short  qos:<allocationID>qos
WallTime:   00:00:00 of 00:05:00
SubmitTime: Mon May 15 15:10:05
  (Time Queued  Total: 00:01:06  Eligible: 00:00:00)

StartTime: Mon May 15 15:11:11
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1

Req[0]  TaskCount: 1  Partition: quest5
TasksPerNode: 1  NodeCount:  1

Allocated Nodes:
[qnode5056:1]


SystemID:   Moab
SystemJID:  Moab.1670310
Notification Events: JobFail  Notification Address: <email>
Task Distribution: qnode5056
IWD:            $HOME
SubmitDir:      $HOME
UMask:          0002 
Executable:     /opt/moab/spool/moab.job.EXIRwo

OutputFile:     qsched03.quest.it.northwestern.edu:/home/<netid>/testjob5.o19936802
StartCount:     1
Partition List: quest3,quest4,quest5,quest6
SrcRM:          internal  DstRM: quest  DstRMJID: 19936802.qsched03.quest.it.northwestern.edu
Flags:          BACKFILL,ADVRES:admin1,SUSPENDABLE,GLOBALQUEUE,JOINSTDERRTOSTDOUT
Attr:           BACKFILL,checkpoint
Variables:      UsageRecord=9924734
StartPriority:  257
IterationJobRank: 0
PE:             1.00
Reservation '19936802' (-00:00:25 -> 00:04:35  Duration: 00:05:00)
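
When checking jobs from a script, the fields highlighted above can be pulled out of the checkjob output with awk. This is a minimal sketch based only on the "State:" and "WallTime:" lines shown in the example output; the helper names job_state and job_walltime_limit are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers that read `checkjob -v <jobID>` output on stdin.

# Print the value of the "State:" line, e.g. "Running" or "Idle".
job_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# Print the requested walltime from a line such as
# "WallTime:   00:00:00 of 00:05:00" (the fourth field).
job_walltime_limit() {
    awk '$1 == "WallTime:" { print $4; exit }'
}

# Example on a login node (commented out because it needs the Moab client):
#   checkjob -v 19936802 | job_state
```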

Example for a Job with an Error

This example is for a job that requested more cores per node than are available on Quest. See Quest Technical Specifications for details on the nodes. In the submission script, 30 cores per node were requested with the line:

#MSUB -l nodes=1:ppn=30

The first indication of a problem with the job was that the job ID begins with "Moab."

Note in the output below that:

  • The State is listed as Idle (State: Idle)
  • The NOTE: entries at the bottom tell you that the requested tasks/procs (cores) is greater than the maximum number of cores per node (PPN) for each of Quest's partitions.
[<netid>@quser13 ~]$ checkjob -v Moab.1670323
job Moab.1670323

AName: testjob5
State: Idle 
Creds:  user:<netid>  group:<netid>  account:<allocationID>  qos:<allocationID>qos
WallTime:   00:00:00 of 00:05:00
SubmitTime: Mon May 15 15:14:30
  (Time Queued  Total: 00:00:40  Eligible: 00:00:00)

TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 30
Total Requested Nodes: 1

Req[0]  TaskCount: 30  Partition: ALL
TasksPerNode: 30  NodeCount:  1


SystemID:   Moab
SystemJID:  Moab.1670323
Notification Events: JobFail  Notification Address: <email>

IWD:            /home/<netid>
SubmitDir:      /home/<netid>
UMask:          0002 
Executable:     /home/<netid>/testjob.sh

Partition List: quest3,quest4,quest5,quest6,SHARED
SrcRM:          internal
Flags:          ADVRES:admin1,GLOBALQUEUE,JOINSTDERRTOSTDOUT
StartPriority:  0
IterationJobRank: 0
PE:             30.00
NOTE:  job violates constraints for partition quest3 (Max partition ppn exceeded on partition "quest3". TasksPerNode(30) * Procs(1) > Max Partition PPN(16))

NOTE:  job violates constraints for partition quest4 (Max partition ppn exceeded on partition "quest4". TasksPerNode(30) * Procs(1) > Max Partition PPN(20))

NOTE:  job violates constraints for partition quest5 (Max partition ppn exceeded on partition "quest5". TasksPerNode(30) * Procs(1) > Max Partition PPN(24))

NOTE:  job violates constraints for partition quest6 (Max partition ppn exceeded on partition "quest6". TasksPerNode(30) * Procs(1) > Max Partition PPN(28))

This job will remain in the idle state indefinitely and will never run. You must cancel this job.
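
The NOTE: lines can also tell you what the largest valid request would have been. The sketch below extracts the "Max Partition PPN" values from checkjob output like the example above and prints the largest one; the helper name max_partition_ppn is hypothetical, and the pattern relies on the NOTE: wording shown in this example.

```shell
#!/bin/sh
# Hypothetical helper: read `checkjob -v <jobID>` output on stdin and print
# the largest "Max Partition PPN(N)" value found in its NOTE: lines.
max_partition_ppn() {
    sed -n 's/.*Max Partition PPN(\([0-9][0-9]*\)).*/\1/p' | sort -n | tail -n 1
}

# Example on a login node (commented out because it needs the Moab client):
#   checkjob -v Moab.1670323 | max_partition_ppn
```

For the output shown above, the result would be 28, the limit reported for the quest6 partition, so a request of ppn=28 or fewer could be placed on at least one partition.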

Cancelling Jobs

You can cancel one or all of your jobs with mjobctl. Proceed with caution, as this cannot be undone, and you will not be prompted for confirmation after issuing the command.

Cancel a single job using the job number:

mjobctl -c <jobID>

Cancel all of your jobs:

mjobctl -c -w user=<your_netID>
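
Since mjobctl does not ask for confirmation, you may prefer to wrap the single-job cancel in your own prompt. This is a minimal sketch: the function name cancel_job_confirm and the MJOBCTL variable are hypothetical conventions of this example, not Moab features.

```shell
#!/bin/sh
# Hypothetical wrapper: ask before cancelling a single job.
# MJOBCTL may be pointed at a stub command for testing; it defaults to mjobctl.
cancel_job_confirm() {
    printf 'Cancel job %s? [y/N] ' "$1" >&2
    read -r answer
    case "$answer" in
        y|Y|yes|YES)
            "${MJOBCTL:-mjobctl}" -c "$1"
            ;;
        *)
            echo "Job $1 left untouched." >&2
            return 1
            ;;
    esac
}

# Example on a login node:
#   cancel_job_confirm 13870586
```

Anything other than an explicit yes leaves the job alone, which matches the cautious default suggested above.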

Additional mjobctl Commands

The Moab job control command (mjobctl) is used for holding, releasing, and canceling jobs, or changing the parameters of a submitted job. You can place your job in a “user hold” state after the job has been submitted with

mjobctl -h <jobID>

Jobs placed in a “user hold” state will appear in the output of showq and checkjob commands. You can then release your job with

mjobctl -r <jobID>

Moab permits modification of some job parameters after job submission and before the job starts running. These parameters include:

  • Account
  • Queue
  • Job name
  • Wall clock limit

In general, the syntax for modifying an attribute is

mjobctl -m attr=value <jobID>

Some examples are provided in the list below:

Command                                      Description
mjobctl -m reqawduration+=600 <jobID>        Add 10 minutes (60*10=600 seconds) to walltime
mjobctl -h <jobID>                           Place job on hold
mjobctl -r <jobID>                           Release hold
mjobctl -c <jobID>                           Delete (cancel) the job
mjobctl -m account=<allocationID> <jobID>    Change the allocation
mjobctl -m queue=short <jobID>               Change queue to short
mjobctl -m depend=1000 <jobID>               Change job to depend upon job number 1000
mjobctl -m userprio-=100 <jobID>             Reduce priority by 100

For a complete listing of mjobctl options, see the official mjobctl documentation.
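
The walltime example above takes its value in seconds, so a small helper can do the minutes-to-seconds conversion for you. This sketch only prints the command rather than running it; the function name extend_walltime_cmd is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the mjobctl command that adds the given
# number of minutes to a job's walltime (reqawduration is in seconds).
# Usage: extend_walltime_cmd <minutes> <jobID>
extend_walltime_cmd() {
    minutes=$1
    jobid=$2
    seconds=$((minutes * 60))
    printf 'mjobctl -m reqawduration+=%s %s\n' "$seconds" "$jobid"
}

# Add 10 minutes to job 13870586
extend_walltime_cmd 10 13870586
```

Printing the command first lets you check the job ID and the number of seconds before pasting the line into your shell.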

Keywords: Quest, batch, job, status, commands, moab
Doc ID: 70710
Owner: Research Computing
Group: Northwestern
Created: 2017-02-15 15:25 CST
Updated: 2018-11-02 08:16 CST
Sites: Northwestern