Batch Queues on halem
Job environment
The life cycle of a batch job begins when
you submit the job to LSF (Load
Sharing Facility) using the LSF
GUI (xlsbatch) or the command line
(bsub). Either method allows you
to specify many options to modify
the default behavior.
If you do not submit your job to LSF, it
will run interactively on the login
nodes. However, we strongly discourage
you from running interactive jobs
(serial or parallel) this way.
Even one job with a large memory
requirement (over 1 GB) will significantly
slow down interactive performance.
The login nodes are intended first
and foremost for submitting batch
jobs, and secondarily for general
interactive use by all users, such as
editing files or compiling code.
Note that interactive jobs that consume
system resources unreasonably because
they are running in ways other
than those described in
this guide may be killed at any
moment without warning by the system
administrator.
Jobs submitted on halem are allocated whole
nodes (multiples of 4 CPUs). If
your CPU request is not a multiple
of four, it will automatically
be rounded up to the nearest multiple
of four to allow for whole-node
allocation. The only queue that
allows 1-CPU allocation is datamove,
and this queue is intended to be
used only for moving data to and from halem.
To allow for maximum data transfer rates, this
queue only runs on the file-serving nodes of
halem. Non-data-movement jobs that are run
in this queue may be
killed at any time.
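As an illustrative sketch, a single-CPU transfer job could be submitted to the datamove queue directly from the command line; the sponsor code, source path, and destination path below are placeholder assumptions:

```shell
# Submit a 1-CPU, 30-minute data-transfer job to the datamove queue
# (k1234 and both paths are placeholders -- substitute your own).
% bsub -P k1234 -q datamove -n 1 -W 00:30 \
       cp /scr/userid/results.tar /archive/userid/results.tar
```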
halem batch queues
Jobs are submitted from the login nodes on
halema and executed on the compute
nodes assigned by the batch subsystem.
The table below summarizes halem
queue names and their parameters.
Use the command
bqueues -l queue_name
to see detailed information about
any batch queue. Issuing the command
bqueues -u userid
will list all the batch queues
to which userid is
allowed to submit jobs.
Queue | Priority | Default Run Limit (min) | Max Run Limit (min) | Max Job Size (CPU) | Job Limit per User (CPU) | Job Limit per Queue (CPU) | Hosts
debug | 31 | 5 | 15 | 32 | - | 32 | hlm100
general | 30 | 10 | 180 | 64 | 128 | 440 | hlm125
general_lng | 30 | 10 | 720 | 64 | 128 | 228 | hlm100
datamove | 30 | 30 | 60 | 1 | 8 | 16 | halema
giss_b | 30 | 720 | 1800 | 96 | - | 300 | hlmd-e
geos_int | 30 | 60 | 480 | 64 | - | 256 | hlmf-h1
gmao_pproc | 30 | 60 | 60 | 4 | 12 | 16 | halema
gmao_hi | 30 | 360 | 720 | 560 | - | 0 | hlm125
gmao_big | 29 | 360 | 360 | 320 | - | 0 | hlm125
gmao_long | 29 | 1440 | 1440 | 256 | - | 0 | hlm125
gmao_short | 28 | 120 | 120 | 560 | - | 0 | hlm125
gmao_lo | 28 | 360 | 360 | 560 | - | 0 | hlm125
background | 25 | 60 | 240 | 128 | 192 | 540 | hlm100
special_b | 31 | 360 | 720 | 852 | - | 852 | hlm125
special | 25 | 120 | 1500 | 128 | 128 | 128 | hlm125
Note that the queue structure
and limits above can change at
any time. To display a detailed
list of your limits, issue the
command
bqueues -l -u userid
In the long (-l) listing of the
bqueues command, the column named
MAX lists the maximum number
of CPUs available for a queue.
The MAXIMUM LIMITS section defines
RUNLIMIT, the maximum run time
allowed for any job in the queue,
and the PROCLIMIT line further defines
how many processors one job may use:
the leftmost number is the minimum,
the middle number is the default,
and the rightmost number is the
maximum number of processors
for a single job.
The default run limit will be used for batch
jobs submitted without the "bsub -W <time_limit>" option
or without specifying "#BSUB -W time_limit" in
the job scripts. Specifying a run time limit
is highly recommended for all batch jobs.
If no queue name is specified when a batch
job is submitted, it will go into the general
queue by default.
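For example, a submission that names both the queue and a run time limit might look like the following sketch (the sponsor code and script name are placeholders):

```shell
# Submit a job script to the general queue with a 1.5-hour run limit.
# k1234 and myscript are placeholders -- use your own sponsor code
# and batch script name.
% bsub -P k1234 -q general -W 01:30 < myscript
```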
Interactive batch environment
Interactive batch session with pseudo-terminal
support
Interactive batch session with pseudo-terminal
support is the preferred method
for submitting an interactive batch
job if you do not require X-windows.
It avoids sending encrypted graphics
over the network, which can sometimes
be slow. To start a session, at
the prompt issue the command:
% bsub -P sponsor_code -q queue_name -Is -n num_cpus shell_name
The key option is "-Is", which
submits a batch interactive job
and creates a pseudo-terminal with
shell mode support when the job starts. For
example,
% bsub -P b000 -q general -Is -n 4 /usr/dlocal/bin/tcsh
The
above example assumes that your
sponsor code account is b000, you
are submitting a 4-processor batch
interactive job to the general queue, and you
want tcsh as the interactive shell when your
job starts.
Interactive batch session with remote
display of X-windows
If your interactive
batch session requires remote
display of X-windows,
- Verify that
X-window display works on
your local system. How you do
this varies from system to system.
For non-Unix-based systems, you
may need to ensure that some
X-server is installed on your
system and running. Contact your
local system administrator and/or
the NCCS
User Services Group if
you need assistance.
- Enable X11
forwarding by using the "-X" option
when you connect to the login
server:
ssh -X userid@login.nccs.nasa.gov
- After
logging into halem, requesting
an interactive batch session
is the same as described above
for pseudo-terminal support.
- You may test your ability to run
an X application using a simple
command such as
/usr/bin/X11/xclock
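The steps above can be combined into one session; a sketch, reusing the placeholder sponsor code and shell path from the earlier pseudo-terminal example:

```shell
# From your local workstation: connect with X11 forwarding enabled.
ssh -X userid@login.nccs.nasa.gov

# On halem: request a 4-CPU interactive batch session
# (b000 is a placeholder sponsor code).
bsub -P b000 -q general -Is -n 4 /usr/dlocal/bin/tcsh

# Once the session starts, verify that X applications display locally.
/usr/bin/X11/xclock
```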
Batch request using script
To
submit a job for execution on halem,
create an LSF batch script. The simplest
script might look like:
#!/bin/csh
#BSUB -P k1234
#BSUB -J jjob
#BSUB -n 4
#BSUB -W 00:06
#BSUB -q giss_sm_b
#BSUB -i /scr/userid/Poisson_Hybrid/omp/inp
#BSUB -o /scr/userid/Poisson_Hybrid/omp/bsout.%J
#BSUB -e /scr/userid/Poisson_Hybrid/omp/bserr.%J
echo "This job runs on `hostname`."
echo " ---- starting job output"
chdir /scr/userid/Poisson_Hybrid/omp
prun -c 4 poisson_omp
Then to submit
a script for execution, type
% bsub < script_name
Please note that the "<" sign
must be used before the batch
file name. If you omit the "<" and
simply type "bsub script_name", none
of the #BSUB directives in the
job script will be processed, and unexpected
job behavior may result. Always use "<" when
submitting your batch script via the bsub command.
Some of the option flags for the bsub commands
are:
- -P, to assign a sponsor
code. To get your sponsor
code, use the command getsponsor.
- -J, to assign
a name to your job.
- '-n numprocs', to submit a parallel
job with a minimum of 'numprocs' processors.
- -W, to set the run time limit
of the batch job.
- -q, to submit the
job to the specified queue.
- -i, to get the standard input
for the job from the specified file.
- -e, to append the
standard error output of the
job to the specified file.
- -o, to direct the job
output to the
specified file. If you use -o
without -e, the standard
error of the job is stored in
the output file. If you use the
special character %J in the
name of the output/error file,
then %J is replaced by the job
ID of the job.
These options may be specified on the command
line or in the batch
script by preceding them with
the "#BSUB" directive. See "man
bsub"
for further information on bsub
command options.
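Several of these flags can be combined on one command line; a sketch with placeholder sponsor code, job name, and paths:

```shell
# Submit myscript (placeholder name) as an 8-CPU job named myjob,
# with a 1-hour run limit, in the general queue. %J in the output
# and error file names is replaced by the job ID.
% bsub -P k1234 -J myjob -n 8 -W 01:00 -q general \
       -o /scr/userid/out.%J -e /scr/userid/err.%J < myscript
```

Options given on the command line take effect alongside any #BSUB directives already in the script.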
A word on running jobs
conditionally
The LSF bsub command
provides a '-w' flag (not to
be confused with the capital -W)
that uses dependency conditions
to control when a job is allowed to start.
For example, the script:
#!/bin/csh
#BSUB -P k32
#BSUB -J job3
#BSUB -n 8
#BSUB -W 00:06
#BSUB -q general
#BSUB -o /scr/m/est/out3.%J
#BSUB -e /scr/m/est/err3.%J
#BSUB -w 'done("job1") && done("job2")'
prun -n 8 job3.exe
would only allow job3 to
run if job1 and job2 had completed
successfully. The dependency condition inside
the single quotes must evaluate to true
for the job to be allowed to
run. The logical operators && (AND),
|| (OR), and ! (NOT), as well as parentheses,
can be used to build the conditional
expression. For example,
'done("job1") && done("job2")'
runs the 3rd job if the 1st AND 2nd job are
successful,
'exit("job1") && exit("job2")'
runs the 3rd job if the 1st AND 2nd job both
fail, and
'ended("job1") && ended("job2")'
runs the 3rd job when the 1st AND 2nd job
have finished, regardless of whether they ran
normally or failed.
Note that job names must be enclosed in double
quotes, while an LSF job ID can be used in parentheses
without the double quotes.
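For instance, using placeholder job IDs instead of job names, the same dependency could be written as:

```shell
# 12345 and 12346 are placeholder job IDs; job IDs need no quotes
# inside the dependency expression.
% bsub -w 'done(12345) && done(12346)' < job3_script
```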
The bkill command may be used to delete jobs
that are queued or running. Use "bkill <jobid>" to
kill a previously submitted job. "bkill
0" may also be used to kill all your jobs
that are running or pending. See "man
bkill" for more information on the bkill
command.
Monitoring jobs
The jobs on halem can be monitored via RMS
or LSF commands. RMS provides a
set of commands for running parallel
programs and monitoring their execution.
LSF handles the integration with
RMS and the allocation of resources,
so you normally do not need
to issue RMS-specific
commands other than prun to run your executables.
Use the "rinfo" command
to see the status of running jobs
and allocated resources.
A group of commands is also available to display
information about LSF jobs and
hosts including bjobs, bqueues,
bhist, bpeek, and bhosts. For example,
% bjobs -u all -q general
displays all
currently running jobs for all
users in a specific queue, in
this case 'general', and
% bjobs -a -u userid
displays information
about jobs by 'userid' in all
states, including jobs
that finished recently, within
the interval specified by CLEAN_PERIOD
in lsb.params (the default period
is 1 hour). For jobs that finished more than
CLEAN_PERIOD seconds ago, use the bhist command.
You
can find your jobs within
some time frame by issuing the
command
% bhist -T 2002/8/16/00:40,2002/8/16/17:50 -n 0 -N 1.00 -u userid
and then
find detailed descriptions on
each job by
% bhist -n 0 -N 1.00 -l JOBID
Checkpointing jobs
Currently there is no system-level checkpointing
for interactive or batch jobs.
You are responsible for generating
output files and restart files
in case the system behaves abnormally.
Contact the NCCS
User Services Group if
you are interested in adding checkpoint
capability to your application.
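In the absence of system-level checkpointing, one common application-level pattern is a batch script that runs the model against its latest restart file and resubmits itself until the work is done. A minimal sketch, in which the sponsor code, paths, executable name, sentinel file, and script name are all placeholder assumptions:

```shell
#!/bin/csh
#BSUB -P k1234
#BSUB -J chain_job
#BSUB -n 4
#BSUB -W 06:00
#BSUB -q general
chdir /scr/userid/run_dir
# The application is assumed to read any existing restart file on
# startup and to write a fresh one before the run limit expires.
prun -n 4 model.exe
# Resubmit this script to continue from the latest restart file,
# unless the run has signaled completion (run.complete is a
# placeholder sentinel file written by the application).
if (! -e run.complete) then
    bsub < chain_job.csh
endif
```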