Managing Jobs on Quest
How to manage batch jobs after they’ve been submitted on Quest.
Please note: Job resource requirements, such as the number of cores or nodes or the amount of memory, are recorded by the scheduler at submission time. Any changes made to the job script (jobscript.sh in the example on Submitting a Job on Quest) after the job has been submitted with msub will have no effect on the job. After submitting a job, you can only hold, release, kill, or modify job parameters using the Moab commands in the list below.
If your job isn't running or ended before you expected, see Troubleshooting Jobs for some tips.
When you submit jobs on Quest using msub, the scheduler returns the job ID and queues it for execution.
For example, if a user submitted jobscript.sh using msub:
[abc123@quser10 ~]$ msub jobscript.sh
If the job was submitted successfully, a job ID will be returned. You can use this job ID later to monitor the job.
In most cases, the job ID will consist only of numbers (for example, 19936802). In some cases, the job ID may start with "Moab." (for example, Moab.1670323); "Moab." is part of the job ID. There are two common cases when this might occur:
- Jobs submitted as part of a job array will be assigned a job ID prefixed by "Moab." This is normal.
- Jobs that are rejected by the Torque resource manager may be assigned a job ID prefixed by "Moab." This can happen, for example, if you request more cores per node than are available on Quest. If your job ID begins with "Moab." and you aren't using a job array, then you should use the checkjob command to investigate problems with your job. Also see Troubleshooting Jobs on Quest for additional details.
After submitting a job, you can execute the showq command or the checkjob command to check the status of your job. On Quest, submitted jobs are analyzed and queued by the scheduler. When a job is sent to the scheduler, it is first checked by a resource manager, which verifies that your allocation has enough resources, particularly compute hours, to run your job.
If your allocation has enough resources, the job is forwarded to the scheduler to be placed in the queue. It is important to note that if there is a typo in your job submission script, it may be flagged by the resource manager, and your job will be rejected and placed on BatchHold.
When the scheduler receives a job, it will prioritize your job relative to other jobs currently in the queue. The accounting system assumes that your job will run with the amount of time and number of cores that you specified in your job submission script. If your job requires less time than you specified, the accounting system only charges you for the time used on the system.
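As a concrete illustration of how this charging works (the numbers below are hypothetical, not from the source): if you request 4 hours of walltime on 8 cores but the job finishes in 1 hour, your allocation is charged for the core-hours actually used, not the core-hours requested.

```shell
# Hypothetical example: allocation usage is charged on actual walltime used,
# not on the walltime requested in the submission script.
cores=8
requested_hours=4   # walltime requested in the submission script
used_hours=1        # walltime the job actually ran
echo "Requested: $((cores * requested_hours)) core-hours; charged: $((cores * used_hours)) core-hours"
# prints: Requested: 32 core-hours; charged: 8 core-hours
```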
If you lack enough compute hours in your allocation to run your job, it will be placed on BatchHold or SystemHold. Jobs in a BatchHold or SystemHold state will remain there until you cancel the job or a system administrator intervenes, either to add enough compute hours for your job to run or to redirect your job to another account you can access. If your job is under a BatchHold or SystemHold and you need assistance from a system administrator, please contact firstname.lastname@example.org for help.
Generally, the more resources a job requires, the longer it may sit in the queue until the necessary resources become free and can be scheduled. Full access nodes are dedicated resources, so the access criteria, queues, job duration, and job size limits for these nodes are different. See Full Access Job Commands for specialized information.
Commonly Used Commands
The showq Command
The showq command (without any options) displays the job queues for all users on Quest. To quickly access information about your specific job(s), there are options to filter the results (you can combine multiple options):
| Command | Description |
| --- | --- |
| showq -u <netID> | Show only jobs belonging to the specified user |
| showq -r | Show running jobs |
| showq -i | Show idle jobs |
| showq -b | Show blocked jobs |
| showq -w acct=<allocationID> | Show only jobs belonging to the specified account |
| showq --help | See documentation and additional options |
The output of the showq command groups jobs into three categories: active, eligible, and blocked. Active jobs are running. Eligible jobs are currently idle and are being considered by the scheduler as additional computing resources become available. There is a limit of 30 idle jobs per user: if you have submitted more than 30 jobs and they weren't scheduled immediately, some of the jobs will appear in the blocked list and will be moved to the eligible list as space becomes available. Jobs may also appear in the blocked list if they were submitted to an expired allocation, if there are insufficient compute hours on the allocation to complete the job, or if the job has other errors or resource limit issues. Use the checkjob command to get more information.
The checkjob Command
The checkjob command displays detailed information about a submitted job's status and diagnostic information that can be useful for troubleshooting submission issues. It can also be used to obtain useful information about completed jobs, such as the allocated nodes, resources used, and exit codes. The -v flag is useful for gathering additional diagnostic information about your job:
checkjob -v <jobID>
where you can get your <jobID> using the showq commands above.
Example for a Successfully Running Job
Note in the output below that:
- The State is listed as Running (State: Running)
- The amount of walltime used is listed with the amount of walltime requested at the bottom (Reservation '19936802' (-00:00:25 -> 00:04:35 Duration: 00:05:00))
- The node name(s) are listed (Allocated Nodes: [qnode5056:1])
[<netid>@quser13 ~]$ checkjob -v 19936802
job 19936802 (RM job '19936802.qsched03.quest.it.northwestern.edu')
AName: testjob5
State: Running
Creds:  user:<netid>  group:<netid>  account:<allocationID>  class:short  qos:<allocationID>qos
WallTime:   00:00:00 of 00:05:00
SubmitTime: Mon May 15 15:10:05
  (Time Queued  Total: 00:01:06  Eligible: 00:00:00)
StartTime: Mon May 15 15:11:11
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1
Req TaskCount: 1
Partition: quest5
TasksPerNode: 1
NodeCount:  1
Allocated Nodes: [qnode5056:1]
SystemID:   Moab
SystemJID:  Moab.1670310
Notification Events: JobFail
Notification Address: <email>
Task Distribution: qnode5056
IWD:            $HOME
SubmitDir:      $HOME
UMask:          0002
Executable:     /opt/moab/spool/moab.job.EXIRwo
OutputFile:     qsched03.quest.it.northwestern.edu:/home/<netid>/testjob5.o19936802
StartCount:     1
Partition List: quest3,quest4,quest5,quest6
SrcRM:          internal
DstRM:          quest
DstRMJID:       19936802.qsched03.quest.it.northwestern.edu
Flags:          BACKFILL,ADVRES:admin1,SUSPENDABLE,GLOBALQUEUE,JOINSTDERRTOSTDOUT
Attr:           BACKFILL,checkpoint
Variables:      UsageRecord=9924734
StartPriority:  257
IterationJobRank: 0
PE:             1.00
Reservation '19936802' (-00:00:25 -> 00:04:35  Duration: 00:05:00)
Example for a Job with an Error
This example is for a job that requested more cores per node than are available on Quest. See Quest Technical Specifications for details on the nodes. In the submission script, 30 cores per node were requested with the line:
#MSUB -l nodes=1:ppn=30
The first indication of a problem with this job is that its job ID begins with "Moab."
Note in the output below that:
- The State is listed as Idle (State: Idle)
- The NOTE: entries at the bottom tell you that the requested tasks/procs (cores) is greater than the maximum number of cores per node (PPN) for each of Quest's partitions.
[<netid>@quser13 ~]$ checkjob -v Moab.1670323
job Moab.1670323
AName: testjob5
State: Idle
Creds:  user:<netid>  group:<netid>  account:<allocationID>  qos:<allocationID>qos
WallTime:   00:00:00 of 00:05:00
SubmitTime: Mon May 15 15:14:30
  (Time Queued  Total: 00:00:40  Eligible: 00:00:00)
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 30
Total Requested Nodes: 1
Req TaskCount: 30
Partition: ALL
TasksPerNode: 30
NodeCount:  1
SystemID:   Moab
SystemJID:  Moab.1670323
Notification Events: JobFail
Notification Address: <email>
IWD:            /home/<netid>
SubmitDir:      /home/<netid>
UMask:          0002
Executable:     /home/<netid>/testjob.sh
Partition List: quest3,quest4,quest5,quest6,SHARED
SrcRM:          internal
Flags:          ADVRES:admin1,GLOBALQUEUE,JOINSTDERRTOSTDOUT
StartPriority:  0
IterationJobRank: 0
PE:             30.00
NOTE:  job violates constraints for partition quest3 (Max partition ppn exceeded on partition "quest3". TasksPerNode(30) * Procs(1) > Max Partition PPN(16))
NOTE:  job violates constraints for partition quest4 (Max partition ppn exceeded on partition "quest4". TasksPerNode(30) * Procs(1) > Max Partition PPN(20))
NOTE:  job violates constraints for partition quest5 (Max partition ppn exceeded on partition "quest5". TasksPerNode(30) * Procs(1) > Max Partition PPN(24))
NOTE:  job violates constraints for partition quest6 (Max partition ppn exceeded on partition "quest6". TasksPerNode(30) * Procs(1) > Max Partition PPN(28))
This job will remain in the idle state indefinitely and will never run. You must cancel this job.
You can cancel one or all of your jobs with mjobctl. Proceed with caution, as this cannot be undone, and you will not be prompted for confirmation after issuing the command.
Cancel a single job using the job number:
mjobctl -c <jobID>
Cancel all of your jobs:
mjobctl -c -w user=<your_netID>
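Once a job blocked by an oversized resource request (like the ppn=30 example above) has been canceled, the remedy is to resubmit with a per-node core count that at least one partition can satisfy; per the NOTE lines in the checkjob output, the largest per-node core count reported is 28 (quest6). A corrected submission script header might look like the following sketch (the walltime line is illustrative; choose a ppn value within the limits of the partition you target):

```shell
#!/bin/bash
# Corrected resource request (illustrative): 28 cores per node is the largest
# per-node count accepted by any partition in the checkjob NOTEs above.
#MSUB -l nodes=1:ppn=28
#MSUB -l walltime=00:05:00
```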
Additional mjobctl Commands
The Moab job control command (mjobctl) is used for holding, releasing, and canceling jobs, or for changing the parameters of a submitted job. You can place your job in a "user hold" state after it has been submitted with
mjobctl -h <jobID>
Jobs placed in a "user hold" state will still appear in the output of the showq and checkjob commands. You can then release your job with
mjobctl -u <jobID>
Moab permits modification of some job parameters after job submission and before the job starts running. These parameters include:
- Job name
- Wall clock limit
- Account (allocation)
- Queue
- Job dependencies
- User priority
In general, the syntax for modifying an attribute is
mjobctl -m attr=value <jobID>
Some examples are provided in the list below:
| Command | Description |
| --- | --- |
| mjobctl -m reqawduration+=600 <jobID> | Add 10 minutes (60*10=600 seconds) to the walltime |
| mjobctl -h <jobID> | Place job on hold |
| mjobctl -u <jobID> | Release hold |
| mjobctl -c <jobID> | Delete (cancel) the job |
| mjobctl -m account=<allocationID> <jobID> | Change the allocation |
| mjobctl -m queue=short <jobID> | Change queue to short |
| mjobctl -m depend=1000 <jobID> | Change job to depend upon job number 1000 |
| mjobctl -m userprio-=100 <jobID> | Reduce priority by 100 |
For a complete listing of mjobctl options, see the official mjobctl documentation.
- Submitting a Job on Quest
- Managing Full Access Allocations on Quest
- Checking Processor and Memory Utilization for Jobs on Quest
- Troubleshooting Jobs on Quest