National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

SGE Batch System


SGE is the batch system used at PDSF.

At this point, all nodes (with the exception of some special-purpose nodes) are subject to a wallclock limit equivalent to 1 day. On the fastest nodes (3 GHz) the limit is 1 day; on slower nodes it is extended by the processor speed ratio (e.g. 3 days on a 1 GHz node). An announcement will be sent out when this changes.
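
The scaling rule above can be sketched as a small calculation. This is only an illustration, assuming the limit scales linearly with the ratio between the fastest clock speed (3 GHz) and the node's clock speed; the helper name wallclock_limit_hours is hypothetical and is not a PDSF or SGE command, and the scheduler's actual limits may be set differently.

```shell
#!/bin/bash
# Hypothetical helper: estimate the wallclock limit in hours for a node,
# assuming linear scaling with the processor speed ratio.
wallclock_limit_hours() {
    local node_mhz="$1"      # node clock speed in MHz, e.g. 1000 for 1 GHz
    local fastest_mhz=3000   # fastest nodes (3 GHz), from the text above
    local base_hours=24      # 1 day on the fastest nodes
    echo $(( base_hours * fastest_mhz / node_mhz ))
}

wallclock_limit_hours 3000   # prints 24 (1 day)
wallclock_limit_hours 1000   # prints 72 (3 days)
```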

If you are planning to submit thousands of short jobs (<10 minutes) concurrently, consider using job arrays if possible to reduce SGE accounting overhead. There is an entry in the FAQ that describes how to use SGE job arrays.
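
As a rough illustration of the idea, here is a minimal sketch of a job array script. It assumes the standard -t option of qsub and the SGE_TASK_ID variable that SGE sets for each array task; the script name array_job.sh, the helper input_for_task and the input file naming scheme are hypothetical. The FAQ entry remains the reference for PDSF-specific details.

```shell
#!/bin/bash
# Hypothetical array job script (call it array_job.sh).
# One submission covers 1000 tasks instead of 1000 separate jobs:
#   qsub -V -t 1-1000 array_job.sh

# Map a task index to an input file; the naming scheme is an assumption.
input_for_task() {
    echo "input_${1}.dat"
}

# SGE sets SGE_TASK_ID to the index of the array task being run.
# The default of 1 only lets the script run outside the batch system for testing.
task_id="${SGE_TASK_ID:-1}"

echo "task ${task_id} processing $(input_for_task "$task_id")"
```

Each task runs the same script and uses its own index to pick its share of the work.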

NOTE: Full SGE documentation, manuals and How-Tos are available from the Sun site, but the short reference below should help you get started.

Action: submit a job
How to do it in SGE: qsub -V script
Comment: In SGE you always have to submit a script, not an executable. If you want your job to inherit all the environment variables of the submitting shell, you have to request it with the -V option. Note: your job will not inherit your LD_LIBRARY_PATH (even if you specify -V).

Action: submit a job with a dvio requirement
How to do it in SGE: qsub -hard -l dv<#>io=1 script
Comment: Replace <#> with the number associated with your diskvault.

Action: submit a job to the debug queue
How to do it in SGE: qsub -q debug.q script
Comment: The debug queue has only a few nodes and a one-hour time limit.

Action: submit a job that depends on other jobs
How to do it in SGE: qsub -hold_jid [job_ID|job_name] script
Comment: SGE only checks whether [job_ID|job_name] has finished before letting your job start, and it only lets you "AND" job IDs/job names.

Action: get e-mail from your job upon completion
How to do it in SGE: No e-mail is sent by default; add the -m option of qsub to request e-mail.
Comment: See the man pages for details.

Action: check on your job (running or pending)
How to do it in SGE: qstat -u user_name
Comment: If you skip the -u option, you'll get all scheduled and running jobs.
How to do it in SGE: qstat_lite
Comment: "Lite" version, easier on the batch system, but updated only every 5 minutes. Recommended if you can live with this 5-minute uncertainty.
How to do it in SGE: qstat_long
Comment: Regular qstat truncates job names to 10 characters. If you need the full name, use qstat_long.

Action: check on your finished job
How to do it in SGE: qacct -o user_name -j
Comment: If you skip the -o option (note: -o here, not -u as with qstat), you'll get a summary of all the jobs run by all users during the last accounting period. Don't forget the -j option; without it, you'll just get your own grand total.

Action: kill a job
How to do it in SGE: qdel job_ID
Comment: If qdel doesn't work, try qdel -f job_ID.

Action: kill all your jobs
How to do it in SGE: qdel -u user_name
Comment: You can also use -u all. If an administrator ran it, it would kill all running and pending jobs for everybody; since you do not have enough privileges, it will kill only your own. However, it is bad practice and a dangerous habit.

Action: start an interactive session on a batch node
How to do it in SGE: qsh
Comment: Note that batch system commands like qstat are not available on the batch nodes.

Action: start an interactive session on a specific batch node
How to do it in SGE: qsh -l h=pc<#> -now n
Comment: Use a lower-case l (the letter L) in the command. Replace <#> with the node of your choice. "-now n" means that you are willing to wait if the node of your choice is not immediately available.

Action: select a job to run first
How to do it in SGE: qalter -js NN job_ID
Comment: NN is some positive number. In SGE you control the relative priority of your jobs by adjusting their job shares. A larger job share results in a higher priority. Also see the FAQ.
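
Several of the submission options above can be combined. The following is a minimal sketch of a two-stage job chain, assuming the standard qsub options -N (job name) and -m e (e-mail when the job ends) in addition to the -V and -hold_jid options listed above. The function name submit_chain, the job names stage1 and stage2, and the script names are hypothetical. By default the sketch only prints the commands it would run, so it can be tried safely away from the batch system.

```shell
#!/bin/bash
# RUN defaults to "echo", so the qsub commands are printed, not executed.
# Set RUN to an empty string (RUN= ./chain.sh) to actually submit.
RUN="${RUN-echo}"

# Hypothetical helper: submit two scripts so that the second waits for the first.
submit_chain() {
    local first_script="$1"
    local second_script="$2"
    # Stage 1: give the job a name so that stage 2 can refer to it.
    $RUN qsub -V -N stage1 "$first_script"
    # Stage 2: waits until the job named stage1 has finished;
    # -m e requests an e-mail when stage 2 ends.
    $RUN qsub -V -N stage2 -hold_jid stage1 -m e "$second_script"
}

submit_chain prepare.sh analyze.sh
```

Referring to the first job by name avoids having to parse the job ID from the output of qsub.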


Page last modified: Wed, 30 Apr 2008 22:48:49 GMT
Page URL: http://www.nersc.gov/nusers/systems/PDSF/software/SGE.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
