Quest Troubleshooting: Why won't my interactive job start?
This page demonstrates troubleshooting an interactive job on Quest that won't start.
Example of submitting an interactive job:
[<netid>@quser13 ~]$ msub -I -l nodes=20:ppn=20
qsub: waiting for job 19903252.qsched03.quest.it.northwestern.edu to start
If your interactive job stays in the waiting state instead of placing you on an assigned node, use the checkjob command with the job's ID number to get information about its status. In a separate terminal:
[<netid>@quser13 ~]$ checkjob 19903252
If the scheduler hasn't fully processed your job yet, then you will see:
ERROR: invalid job specified: 19903252
You need to wait longer for the scheduler to complete its work.
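The check above can be scripted. The following is a minimal sketch, not an official Quest tool: job_known is a hypothetical helper name, and it assumes only that the error text is the message shown above.

```shell
#!/usr/bin/env bash
# job_known is a hypothetical helper, not part of Moab or Quest.
# It reads checkjob output on stdin and succeeds (exit status 0) only when
# that output does NOT contain the "invalid job specified" error shown above.
job_known() {
  ! grep -q 'ERROR: invalid job specified'
}
```

For example, checkjob 19903252 2>&1 | job_known && echo "the scheduler has processed the job" prints the message only once the error no longer appears. The 2>&1 is there because this sketch does not assume which output stream carries the error.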
If resources aren't available for your job, you will see output like:
job 19903252

AName: STDIN
State: Idle
Creds: user:<netid> group:<netid> account:<allocation> class:short qos:<allocation>qos
WallTime: 00:00:00 of 4:00:00
SubmitTime: Wed May 10 16:49:41
(Time Queued Total: 00:01:23 Eligible: 00:01:08)

TemplateSets: DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 20

Req TaskCount: 20 Partition: ALL
NodeCount: 20

SystemID: Moab
SystemJID: Moab.1655957
Notification Events: JobFail

Partition List: quest3,quest4,quest5,quest6
Flags: ADVRES:admin1,SUSPENDABLE,INTERACTIVE
Attr: INTERACTIVE,checkpoint
StartPriority: 258
IterationJobRank: 135

rejected for State - (null)
rejected for Reserved - (null)
NOTE: job req cannot run in partition quest3 (insufficient procs: 20 procs needed, 16 procs available)

rejected for CPU - (null)
rejected for State - (null)
rejected for Reserved - (null)
NOTE: job req cannot run in partition quest4 (available procs do not meet requirements: 20 procs needed, 0 procs found)
idle procs: 1967 feasible procs: 0
Node Rejection Summary: [CPU: 3][State: 85][Reserved: 149]

available for 24 tasks - qnode[5056-5060]
rejected for CPU - (null)
rejected for State - (null)
rejected for Reserved - (null)
NOTE: job cannot run in partition quest5 (insufficient idle nodes: 20 nodes needed, 5 nodes available)

rejected for State - (null)
rejected for Reserved - (null)
NOTE: job req cannot run in partition quest6 (available procs do not meet requirements: 20 procs needed, 0 procs found)
idle procs: 2699 feasible procs: 0
Node Rejection Summary: [State: 66][Reserved: 118]
PROBLEM: In this case, checkjob tells us that there are not enough nodes or cores to satisfy the request. That isn't surprising for this example, where we deliberately requested an unrealistic 20 nodes with 20 cores per node in order to produce a job in this state. But Quest is routinely a very busy cluster, so even smaller jobs that request a more realistic amount of resources may not be able to start immediately.
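The lines that support this diagnosis are the ones beginning with NOTE:, one per partition. To pull them out of the long checkjob output, a small filter can help. This is a minimal sketch: why_not_running is a hypothetical helper name, and it assumes each explanation sits on its own line containing the text NOTE: as in the output above.

```shell
#!/usr/bin/env bash
# why_not_running is a hypothetical helper, not part of Moab or Quest.
# It reads checkjob output on stdin and prints only the lines containing
# "NOTE:", which state why the job cannot run in each partition.
why_not_running() {
  grep 'NOTE:'
}
```

For example, checkjob 19903252 | why_not_running would reduce the output above to its four NOTE lines, one each for quest3, quest4, quest5 and quest6.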
SOLUTIONS: Wait for more processors to free up, or cancel the job and resubmit with a smaller number of nodes or cores.
To cancel an interactive job that is waiting, type Control-C. You will be asked whether you want to terminate the job.