Troubleshooting Jobs on Quest
How to know if something is wrong with your job on Quest and how to fix it.
There are two common issues with jobs on Quest: a job isn't running or a job ended before you expected. This page has some tips for figuring out what may be going wrong. For additional help, contact email@example.com.
Why isn't my job running?
Common reasons why a job or jobs may not be running are:
- Your allocation is expired or does not have enough compute hours remaining to cover the walltime and cores you requested. Use the checkproject utility to check your allocation resources. See this example of output from such a situation for more details.
- You requested more resources (memory or cores) per node than are available on Quest. See Managing Jobs for the checkjob command, and Checking Processor and Memory Utilization for Jobs to check the resources used by your job.
- You submitted a large number of jobs, and some are listed as "blocked." This is normal. You can only have 30 jobs eligible to be scheduled at any given time. As jobs are scheduled, jobs will be moved from "blocked" to "eligible."
- Quest is busy.
- If you are running under a general access Research Allocation I or Research Allocation II, you are using shared resources. If you can keep your jobs under four hours of walltime, your jobs will have access to more nodes than longer jobs.
- If you are running under a full access allocation with dedicated nodes, see Full Access Allocation Management for commands to check your nodes and jobs. Make sure you've specified the buyin queue (or another queue specific to your allocation) when submitting your job.
- For interactive jobs, see Why won't my interactive job start?
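To judge whether an allocation can cover a request, it can help to estimate the compute hours the request would consume. The sketch below assumes hours are charged as nodes × cores per node × walltime in hours; that formula and the function name `estimate_compute_hours` are illustrative assumptions, not part of Quest's tooling. Compare the result with what checkproject reports.

```shell
#!/bin/sh
# Estimate the compute hours a job request would consume.
# Assumption: hours = nodes * cores per node * walltime (in hours).
# Arguments: $1 = nodes, $2 = cores per node, $3 = walltime as HH:MM:SS
estimate_compute_hours() {
    echo "$3" | awk -F: -v nodes="$1" -v ppn="$2" '{
        secs = $1 * 3600 + $2 * 60 + $3      # walltime in seconds
        printf "%.1f\n", nodes * ppn * secs / 3600
    }'
}

# Example: 2 nodes, 4 cores per node, 4 hours of walltime
estimate_compute_hours 2 4 04:00:00   # prints 32.0
```

A request of this size needs at least 32 compute hours remaining in the allocation under the stated assumption.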
Why did my job stop/abort/fail?
Check the Output File
The first place to look is the output/error file for your job. See Checking the Job Output File for an example. The Job Output File will be in the directory from which you submitted your job. Even if you directed output from your script/program to another location, there is still an output file with information about the job itself. By default, the output file is named <jobname>.o<jobID>. If you haven't joined the output and error files, there may also be an error file named <jobname>.e<jobID>. You can use the cat command to print the contents to the terminal, or open the file in your preferred text editor.
A Job exit value: 0 means a successfully completed job (as far as the scheduler is concerned). Any other value indicates a potential error. The exit value comes from Torque. The list of PBS Exit Codes is fairly cryptic, but you can contact firstname.lastname@example.org for assistance deciphering a non-zero exit code.
If you have an exit code of 0 but still think your job ended early or had a problem, there may have been an error in your program code/script. Check any additional output and files you may have written from your code if there isn't additional information in the job output/error file.
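The exit value check described above can be scripted. This is a minimal sketch that assumes the job output file contains a line of the form `Job exit value: <number>`; the function names `job_exit_value` and `report_job_status` are hypothetical helpers, not Quest commands.

```shell
#!/bin/sh
# Print the exit value recorded in a job output file.
# Assumption: the file contains a line like "Job exit value: 0".
job_exit_value() {
    sed -n 's/.*Job exit value:[[:space:]]*\(-\{0,1\}[0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Summarize what the exit value suggests about the job.
report_job_status() {
    code=$(job_exit_value "$1")
    if [ -z "$code" ]; then
        echo "no exit value found in $1"
    elif [ "$code" -eq 0 ]; then
        echo "scheduler reports success (exit value 0)"
    else
        echo "possible error (exit value $code)"
    fi
}

# Usage (replace with your own output file):
#   report_job_status <jobname>.o<jobID>
```

An exit value of 0 only reflects the scheduler's view, so the program's own output still needs to be checked.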
Besides errors in your script, your job may be aborted by the system if it is still running when the walltime limit you requested (or the upper walltime limit for the queue) is reached, or if you use more cores than you requested. The latter can happen with programs that are multi-threaded.
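For multi-threaded programs, one way to stay within the requested cores is to cap the thread count in the job script. The sketch below assumes the program honors `OMP_NUM_THREADS` (true for OpenMP programs, not for all threaded software) and that Torque sets `PBS_NUM_PPN` to the cores requested per node; the function name `limit_threads` is hypothetical.

```shell
#!/bin/sh
# Cap the thread count at the number of cores requested per node.
# Assumptions: PBS_NUM_PPN is set by the scheduler inside a job, and the
# program reads OMP_NUM_THREADS. Falls back to 1 thread when PBS_NUM_PPN is unset.
limit_threads() {
    OMP_NUM_THREADS="${PBS_NUM_PPN:-1}"
    export OMP_NUM_THREADS
}

# Call this in the job script before launching the program.
limit_threads
```

Programs that use a different threading mechanism usually have their own option or environment variable for the same purpose.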
Out of Disk Space
Your job could also fail if you exceed your storage quota in your home or projects directory.
Check how much space you're using in your home directory with
du -h --max-depth=0 ~
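To compare that usage against a quota, a small helper can report it as a percentage. This is only a sketch: the function name `usage_percent` and the quota figure in the usage comment are placeholders, so substitute the quota that applies to your directory.

```shell
#!/bin/sh
# Print the space used by a directory as a whole-number percentage of a quota.
# Arguments: $1 = directory, $2 = quota in kilobytes (placeholder; use your real quota)
usage_percent() {
    used_kb=$(du -sk "$1" | awk '{print $1}')   # total usage in kilobytes
    echo $(( used_kb * 100 / $2 ))
}

# Usage with a hypothetical quota of 80 GB (83886080 KB):
#   usage_percent ~ 83886080
```

A result near 100 suggests that files should be removed or moved before the next job writes output.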
Check how much space is used in your projects directory with
See Also
- Checking the output file when your job fails
- Why won't my interactive job start?
- Insufficient hours or expired allocation
- Submitting a Job on Quest
- Managing Jobs on Quest
- Checking Processor and Memory Utilization for Jobs on Quest