Batch Queues on halem
Job environment
The life cycle of a batch job begins when
you submit the job to LSF (Load
Sharing Facility) using the LSF
GUI (xlsbatch) or the command line
(bsub). Either method allows you
to specify many options to modify
the default behavior.
If you do not submit your job to LSF, it
will run interactively on the login
nodes. However, we strongly discourage
you from running interactive jobs
(serial or parallel) this way.
Even one job with a large memory
requirement (over 1 GB) will significantly
slow down interactive performance.
The login nodes are intended first
and foremost for submitting batch
jobs, and secondarily for general
interactive use by all users, such as
editing files or compiling code.
Note that interactive jobs that consume
system resources unreasonably because
they are running in ways other
than those described in
this guide may be killed at any
moment without warning by the system
administrator.
Jobs submitted on halem are allocated whole
nodes (multiples of 4 CPUs). If
your CPU request is not a multiple
of four, it will automatically
be rounded up to the nearest multiple
of four to allow for whole-node
allocation. The only queue that
allows 1-CPU allocation is datamove,
and this queue is intended to be
used only for moving data to and from halem.
To allow for maximum data transfer rates, this
queue only runs on the file-serving nodes of
halem. Non-data-movement jobs that are run
in this queue may be
killed at any time.
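As an illustrative sketch, a single-CPU transfer job could be submitted to the datamove queue directly from the command line; the sponsor code, source path, and destination path below are placeholder assumptions:

```shell
# Submit a 1-CPU, 30-minute data-transfer job to the datamove queue
# (k1234 and both paths are placeholders -- substitute your own).
% bsub -P k1234 -q datamove -n 1 -W 00:30 \
       cp /scr/userid/results.tar /archive/userid/results.tar
```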
halem batch queues
Jobs are submitted from the login nodes on
halema and executed on the compute
nodes assigned by the batch subsystem.
The table below summarizes halem
queue names and their parameters.
Use the command
bqueues -l queue_name
to see detailed information about
any batch queue. Issuing the command
bqueues -u userid
will list all the batch queues
to which userid is
allowed to submit jobs.
Queue | Priority | Default Run Limit (min) | Max Run Limit (min) | Max Job Size (CPU) | Job Limit per User (CPU) | Job Limit per Queue (CPU) | Hosts
debug | 31 | 5 | 15 | 32 | - | 32 | hlm100
general | 30 | 10 | 180 | 64 | 128 | 440 | hlm125
general_lng | 30 | 10 | 720 | 64 | 128 | 228 | hlm100
datamove | 30 | 30 | 60 | 1 | 8 | 16 | halema
giss_b | 30 | 720 | 1800 | 96 | - | 300 | hlmd-e
geos_int | 30 | 60 | 480 | 64 | - | 256 | hlmf-h1
gmao_pproc | 30 | 60 | 60 | 4 | 12 | 16 | halema
gmao_hi | 30 | 360 | 720 | 560 | - | 0 | hlm125
gmao_big | 29 | 360 | 360 | 320 | - | 0 | hlm125
gmao_long | 29 | 1440 | 1440 | 256 | - | 0 | hlm125
gmao_short | 28 | 120 | 120 | 560 | - | 0 | hlm125
gmao_lo | 28 | 360 | 360 | 560 | - | 0 | hlm125
background | 25 | 60 | 240 | 128 | 192 | 540 | hlm100
special_b | 31 | 360 | 720 | 852 | - | 852 | hlm125
special | 25 | 120 | 1500 | 128 | 128 | 128 | hlm125
Note that the queue structure
and limits above can change at
any time. To display a detailed
list of your limits, issue the
command
bqueues -l -u userid
In the long (-l) listing of the
bqueues command, the column named
MAX lists the maximum number
of CPUs available for a queue.
The MAXIMUM LIMITS section defines
RUNLIMIT, the maximum run time
allowed for any job in the queue,
and the PROCLIMIT line further defines
how many processors one job may use:
the leftmost number is the minimum,
the middle number is the default,
and the rightmost number is the
maximum number of processors
for a single job.
The default run limit will be used for batch
jobs submitted without the "bsub -W <time_limit>" option
or without specifying "#BSUB -W time_limit" in
the job scripts. Specifying a run time limit
is highly recommended for all batch jobs.
If no queue name is specified when a batch
job is submitted, it will go into the general
queue by default.
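For example, a submission that names both the queue and a run time limit might look like the following sketch (the sponsor code and script name are placeholders):

```shell
# Submit a job script to the general queue with a 1.5-hour run limit.
# k1234 and myscript are placeholders -- use your own sponsor code
# and batch script name.
% bsub -P k1234 -q general -W 01:30 < myscript
```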
Interactive batch environment
Interactive batch session with pseudo-terminal
support
Interactive batch session with pseudo-terminal
support is the preferred method
for submitting an interactive batch
job if you do not require X-windows.
It avoids sending encrypted graphics
over the network, which can sometimes
be slow. To start a session, at
the prompt issue the command:
% bsub -P sponsor_code -q queue_name -Is -n num_cpus shell_name
The key option is "-Is", which
submits a batch interactive job
and creates a pseudo-terminal with
shell mode support when the job starts. For
example,
% bsub -P b000 -q general -Is -n 4 /usr/dlocal/bin/tcsh
The
above example assumes that your
sponsor code account is b000, you
are submitting a 4-processor batch
interactive job to the general queue, and you
want tcsh as the interactive shell when your
job starts.
Interactive batch session with remote
display of X-windows
If your interactive
batch session requires remote
display of X-windows,
- Verify that
X-window display works on
your local system. How you do
this varies from system to system.
For non-Unix-based systems, you
may need to ensure that some
X-server is installed on your
system and running. Contact your
local system administrator and/or
the NCCS
User Services Group if
you need assistance.
- Enable X11
forwarding by using the "-X" option
when you connect to the login
server:
ssh -X userid@login.nccs.nasa.gov
- After
logging into halem, requesting
an interactive batch session
is the same as described above
for pseudo-terminal support.
- You may test your ability to run
an X application using a simple
command such as
/usr/bin/X11/xclock
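The steps above can be combined into one session; a sketch, reusing the placeholder sponsor code and shell path from the earlier pseudo-terminal example:

```shell
# From your local workstation: connect with X11 forwarding enabled.
ssh -X userid@login.nccs.nasa.gov

# On halem: request a 4-CPU interactive batch session
# (b000 is a placeholder sponsor code).
bsub -P b000 -q general -Is -n 4 /usr/dlocal/bin/tcsh

# Once the session starts, verify that X applications display locally.
/usr/bin/X11/xclock
```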
Batch request using script
To
submit a job for execution on halem,
create an LSF batch script. The simplest
script might look like:
#!/bin/csh
#BSUB -P k1234
#BSUB -J jjob
#BSUB -n 4
#BSUB -W 00:06
#BSUB -q giss_sm_b
#BSUB -i /scr/userid/Poisson_Hybrid/omp/inp
#BSUB -o /scr/userid/Poisson_Hybrid/omp/bsout.%J
#BSUB -e /scr/userid/Poisson_Hybrid/omp/bserr.%J
echo "This job runs on `hostname`."
echo " ---- starting job output"
chdir /scr/userid/Poisson_Hybrid/omp
prun -c 4 poisson_omp
Then to submit
a script for execution, type
% bsub < script_name
Please note that the "<" sign
must be used before the batch
file name. If you omit the "<" and
simply type "bsub script_name", none
of the #BSUB directives in the
job script will be processed, and unexpected
job behavior may result. Always use "<" when
submitting your batch script via the bsub command.
Some of the option flags for the bsub commands
are:
- -P, to assign a sponsor
code. To get your sponsor
code, use the command getsponsor.
- -J, to assign
a name to your job.
- '-n numprocs', to submit a parallel
job with a minimum of 'numprocs' processors.
- -W, to set the run time limit
of the batch job.
- -q, to submit the
job to the specified queue.
- -i, to get the standard input
for the job from the specified file.
- -e, to append the
standard error output of the
job to the specified file.
- -o, to direct the job
output to the
specified file. If you use -o
without -e, the standard
error of the job is stored in
the output file. If you use the
special character %J in the
name of the output/error file,
then %J is replaced by the job
ID of the job.
These options may be specified on the command
line or in the batch
script by preceding them with
the "#BSUB" directive. See "man
bsub"
for further information on bsub
command options.
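Several of these flags can be combined on one command line; a sketch with placeholder sponsor code, job name, and paths:

```shell
# Submit myscript (placeholder name) as an 8-CPU job named myjob,
# with a 1-hour run limit, in the general queue. %J in the output
# and error file names is replaced by the job ID.
% bsub -P k1234 -J myjob -n 8 -W 01:00 -q general \
       -o /scr/userid/out.%J -e /scr/userid/err.%J < myscript
```

Options given on the command line take effect alongside any #BSUB directives already in the script.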
A word on running jobs
conditionally
The LSF bsub command
provides a '-w' flag (not to
be confused with the capital -W)
that uses dependency conditions
to control when a job is allowed to start.
For example, the script:
#!/bin/csh
#BSUB -P k32
#BSUB -J job3
#BSUB -n 8
#BSUB -W 00:06
#BSUB -q general
#BSUB -o /scr/m/est/out3.%J
#BSUB -e /scr/m/est/err3.%J
#BSUB -w 'done("job1") && done("job2")'
prun -n 8 job3.exe
would only allow job3 to
run if job1 and job2 had completed
successfully. The dependency condition inside
the single quotes must evaluate to true
for the job to be allowed to
run. The logical operators && (AND),
|| (OR), and ! (NOT), as well as parentheses,
can be used to build the conditional
expression. For example,
'done("job1") && done("job2")'
runs the 3rd job if the 1st AND 2nd job are
successful,
'exit("job1") && exit("job2")'
runs the 3rd job if the 1st AND 2nd job both
fail, and
'ended("job1") && ended("job2")'
runs the 3rd job when the 1st AND 2nd job
have finished, regardless of whether they ran
normally or failed.
Note that job names must be enclosed in double
quotes, while an LSF job ID can be used in parentheses
without the double quotes.
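For instance, using placeholder job IDs instead of job names, the same dependency could be written as:

```shell
# 12345 and 12346 are placeholder job IDs; job IDs need no quotes
# inside the dependency expression.
% bsub -w 'done(12345) && done(12346)' < job3_script
```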
The bkill command may be used to delete jobs
that are queued or running. Use "bkill <jobid>" to
kill a previously submitted job. "bkill
0" may also be used to kill all your jobs
that are running or pending. See "man
bkill" for more information on the bkill
command.
Monitoring jobs
The jobs on halem can be monitored via RMS
or LSF commands. RMS provides a
set of commands for running parallel
programs and monitoring their execution.
LSF handles the integration with
RMS and the allocation of resources,
so you normally do not need
to issue RMS-specific
commands other than prun to run your executables.
Use the "rinfo" command
to see the status of running jobs
and allocated resources.
A group of commands is also available to display
information about LSF jobs and
hosts including bjobs, bqueues,
bhist, bpeek, and bhosts. For example,
% bjobs -u all -q general
displays all
currently running jobs for all
users in a specific queue, in
this case 'general', and
% bjobs -a -u userid
displays information
about jobs by 'userid' in all
states, including jobs
that finished recently, within
the interval specified by CLEAN_PERIOD
in lsb.params (the default period
is 1 hour). For jobs that finished more than
CLEAN_PERIOD seconds ago, use the bhist command.
You
can find your jobs within
some time frame by issuing the
command
% bhist -T 2002/8/16/00:40,2002/8/16/17:50 -n 0 -N 1.00 -u userid
and then
find detailed descriptions on
each job by
% bhist -n 0 -N 1.00 -l JOBID
Checkpointing jobs
Currently there is no system-level checkpointing
for interactive or batch jobs.
You are responsible for generating
output files and restart files
in case the system behaves abnormally.
Contact the NCCS
User Services Group if
you are interested in adding checkpoint
capability to your application.
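In the absence of system-level checkpointing, one common application-level pattern is a batch script that runs the model against its latest restart file and resubmits itself until the work is done. A minimal sketch, in which the sponsor code, paths, executable name, sentinel file, and script name are all placeholder assumptions:

```shell
#!/bin/csh
#BSUB -P k1234
#BSUB -J chain_job
#BSUB -n 4
#BSUB -W 06:00
#BSUB -q general
chdir /scr/userid/run_dir
# The application is assumed to read any existing restart file on
# startup and to write a fresh one before the run limit expires.
prun -n 4 model.exe
# Resubmit this script to continue from the latest restart file,
# unless the run has signaled completion (run.complete is a
# placeholder sentinel file written by the application).
if (! -e run.complete) then
    bsub < chain_job.csh
endif
```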