Monitoring and Managing Batch Jobs

These are some basic commands for monitoring and modifying batch jobs while they're queued or running.

NERSC has developed a new tool called qs for monitoring and viewing the state of batch jobs on Genepool. For details, see "Monitoring jobs with qs".

| Action | How to do it | Comment |
| --- | --- | --- |
| Get a listing of your jobs and their states | qs -u | If you skip the -u option, you'll get all the jobs on Genepool/Phoebe. |
| | qstat -u user_name | If you skip the -u option, you'll only see jobs for your username. |
| | qstat_long -u user_name | Regular qstat truncates job names to 10 characters. If you need the full name, use qstat_long. |
| Get detailed info about a specific job | qstat -j job_ID | You can get the job_ID by listing your jobs as described above. |
| See how much CPU time a job has used | qstat -j job_ID | Look at the next-to-last line, or grep the output for "usage". Note that in the memory usage, GBs stands for gigabyte-seconds. |
| Kill a specific job | qdel job_ID | If qdel doesn't work, try qdel -f job_ID. |
| Kill all your jobs | qdel -u user_name | |
| Select a job to run first | qalter -js NN job_ID (NN is some positive number) | In UGE you control the relative priority of your jobs by adjusting their job shares. A larger job share results in a higher priority. |
| Use multiple job slots for your job | qalter -pe pe_slots NN job_ID (NN is some positive number) | Set NN to the number of job slots your job needs, to prevent overloading the node. For example, if you are running a multithreaded job, set NN to the number of threads. |
| Clear jobs in Eqw state | qmod -cj job_ID | The Eqw state means the job started but there was some error. Check the error with "qstat -j job_ID"; it will be listed near the end of the output. Fix it if necessary before clearing the job, or it will just go back into the Eqw state again. This can only be done from the Genepool login nodes (not the gpints). |
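
Several of the actions above come down to filtering the output of qstat, so they can be wrapped in small shell functions. The following is a minimal sketch, not an official NERSC tool. The function names (job_usage, job_error, eqw_jobs, clear_eqw) are hypothetical. It also assumes that the state is the fifth column of the qstat job listing and that error lines in "qstat -j" output begin with "error reason", which is typical for Grid Engine but should be checked against the output on your system.

```shell
# Hypothetical helpers; only qstat and qmod themselves come from the table above.

# Print the resource-usage line (cpu, mem, io, ...) that qstat -j reports for a job.
job_usage() {
    qstat -j "$1" | grep -w "usage"
}

# Print the error reason(s) recorded for a job.
# Assumption: these lines start with "error reason" in qstat -j output.
job_error() {
    qstat -j "$1" | grep "^error reason"
}

# List the IDs of a user's jobs that are in the Eqw state.
# Assumption: the state is column 5 of the qstat listing (job names without spaces).
eqw_jobs() {
    qstat -u "$1" | awk '$5 == "Eqw" { print $1 }'
}

# Show the error for a job, then clear it only if the user confirms with "y".
clear_eqw() {
    job_error "$1" || { echo "no error reason found for job $1" >&2; return 1; }
    printf 'Clear error state of job %s? [y/N] ' "$1"
    read -r answer
    if [ "$answer" = "y" ]; then
        qmod -cj "$1"
    else
        echo "job $1 left unchanged"
    fi
}
```

After sourcing these definitions on a Genepool login node, "eqw_jobs user_name" would list the affected job IDs and "clear_eqw job_ID" would display the recorded error before asking whether to clear it. Showing the error first follows the advice above to fix the cause before clearing, since otherwise the job returns to the Eqw state.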