Quest Troubleshooting: Why won't my interactive job start?

This page demonstrates troubleshooting an interactive job on Quest that won't start.

Example of submitting an interactive job:

[<netid>@quser13 ~]$ msub -I -l nodes=20:ppn=20
qsub: waiting for job 19903252.qsched03.quest.it.northwestern.edu to start

If your interactive job stays in the waiting state after you request it and you are never placed on an assigned node, use the checkjob command to get information about the job's status. In a separate terminal, run:

checkjob 19903252

If the scheduler hasn't fully processed your job yet, then you will see:

ERROR: invalid job specified: 19903252

In that case, wait a little longer for the scheduler to complete its work, then run checkjob again.
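Rather than rerunning checkjob by hand, the wait can be scripted. The following is a minimal sketch, not an official Quest tool: the function name wait_until_known, the polling interval, and the retry limit are all illustrative assumptions. It relies only on the "invalid job specified" error text shown above.

```shell
# Poll checkjob until the scheduler recognizes the job.
# Usage: wait_until_known JOBID [INTERVAL_SECONDS] [MAX_TRIES]
# (hypothetical helper; the defaults below are arbitrary choices)
wait_until_known() {
    jobid=$1
    interval=${2:-10}
    max_tries=${3:-30}
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # While the job is still unknown, checkjob reports
        # "ERROR: invalid job specified: <jobid>".
        if ! checkjob "$jobid" 2>&1 | grep -q 'invalid job specified'; then
            return 0    # the scheduler now knows about the job
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1            # gave up after max_tries attempts
}
```

For example, "wait_until_known 19903252 && checkjob 19903252" would print the full status once the scheduler has processed the job.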

If resources aren't available for your job, you will see output like:

job 19903252

AName: STDIN
State: Idle 
Creds:  user:<netid>  group:<netid>  account:<allocation>  class:short  qos:<allocation>qos
WallTime:   00:00:00 of 4:00:00
SubmitTime: Wed May 10 16:49:41
  (Time Queued  Total: 00:01:23  Eligible: 00:01:08)

TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 20

Req[0]  TaskCount: 20  Partition: ALL
NodeCount:  20


SystemID:   Moab
SystemJID:  Moab.1655957
Notification Events: JobFail

Partition List: quest3,quest4,quest5,quest6
Flags:          ADVRES:admin1,SUSPENDABLE,INTERACTIVE
Attr:           INTERACTIVE,checkpoint
StartPriority:  258
IterationJobRank: 135
rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job req cannot run in partition quest3 (insufficient procs: 20 procs needed, 16 procs available)

rejected for CPU          - (null)
rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job req cannot run in partition quest4 (available procs do not meet requirements: 20 procs needed, 0 procs found)
idle procs: 1967  feasible procs:   0

Node Rejection Summary: [CPU: 3][State: 85][Reserved: 149]

available for 24 tasks     - qnode[5056-5060]
rejected for CPU          - (null)
rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job cannot run in partition quest5 (insufficient idle nodes: 20 nodes needed, 5 nodes available)

rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job req cannot run in partition quest6 (available procs do not meet requirements: 20 procs needed, 0 procs found)
idle procs: 2699  feasible procs:   0

Node Rejection Summary: [State: 66][Reserved: 118]

PROBLEM: In this case, checkjob tells us that there are not enough nodes or cores available to satisfy the request. That isn't surprising for this example, where we deliberately requested an unrealistic 20 nodes with 20 cores per node in order to produce a job stuck in this state. But Quest is routinely very busy, so even smaller jobs that request a more realistic amount of resources may not be able to start immediately.
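The per-partition reasons are easy to miss in the full checkjob output. As a rough sketch, the NOTE lines can be condensed to one line per partition. The function name summarize_notes is hypothetical, and the pattern assumes the NOTE lines have the same layout as in the output above.

```shell
# Read checkjob output on stdin and print "partition: reason"
# for every NOTE line of the form
#   NOTE:  job ... cannot run in partition <name> (<reason>)
# (hypothetical helper; assumes the layout shown in the example output)
summarize_notes() {
    sed -n 's/^NOTE:  *job.* cannot run in partition \([^ ]*\) (\(.*\))$/\1: \2/p'
}
```

It would be used as "checkjob 19903252 | summarize_notes". For the example job, the quest3 line would read "quest3: insufficient procs: 20 procs needed, 16 procs available".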

SOLUTIONS: Wait for more processors to free up, or cancel the job and resubmit it with a smaller number of nodes or cores.
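As an illustration of a smaller request, the sketch below only prints the msub command line so it can be reviewed before it is run. The function name build_request and the values 1 and 4 are assumptions chosen for the example, not recommended sizes. Note that the node count and ppn are joined with a colon.

```shell
# Print (but do not run) an interactive msub request for NODES nodes
# with PPN cores per node. Hypothetical helper for illustration only.
build_request() {
    nodes=$1
    ppn=$2
    printf 'msub -I -l nodes=%s:ppn=%s\n' "$nodes" "$ppn"
}

# Example: one node with four cores (illustrative values)
build_request 1 4
```

Copy the printed line to the prompt to submit it.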

To cancel an interactive job that is waiting, type Control-C. You will be asked whether you want to terminate the job.

Keywords: research computing, quest, interactive, hold, qsub
Doc ID: 78609
Owner: Research Computing
Group: Northwestern
Created: 2017-12-07 14:46 CST
Updated: 2017-12-07 14:46 CST
Sites: Northwestern