NASA Center for Computational Sciences


Batch Queues on halem

+ Job environment
+ halem batch queues
+ Interactive batch environment
+ Batch request using script
+ Monitoring jobs
+ Checkpointing jobs

Job environment

The life cycle of a batch job begins when you submit the job to LSF (Load Sharing Facility) using the LSF GUI (xlsbatch) or the command line (bsub). Either method allows you to specify many options to modify the default behavior.

If you do not submit your job to LSF, it will run interactively on the login nodes. However, we strongly discourage running interactive jobs (serial or parallel) this way. Even one job with a large memory requirement (over 1 GB) will significantly slow down interactive performance for all users. The login nodes are intended first and foremost for submitting batch jobs, and secondarily for general interactive work such as editing files or compiling code.

Note that interactive jobs that consume unreasonable amounts of system resources because they are run in ways other than those described in this guide may be killed at any time, without warning, by the system administrator.

Jobs submitted on halem are allocated whole nodes (multiples of 4 CPUs). If your CPU request is not a multiple of four, it will automatically be rounded up to the nearest multiple of four to allow for whole-node allocation. The only queue that allows 1-CPU allocation is datamove, and this queue is intended to be used only for moving data to and from halem. To allow for maximum data transfer rates, this queue only runs on the file-serving nodes of halem. Non-data-movement jobs that are run in this queue may be killed at any time.
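The rounding described above can be sketched with simple shell arithmetic (illustration only; LSF performs this rounding for you automatically when it allocates whole nodes):

```shell
# Round a CPU request up to the next multiple of 4, as halem does
# for whole-node allocation. The request sizes below are examples.
for n in 1 4 6 9; do
    echo "$n CPUs requested -> $(( (n + 3) / 4 * 4 )) CPUs allocated"
done
```

So a request for 6 CPUs is charged as 8, and a request for 9 as 12.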



halem batch queues

Jobs are submitted from the login nodes on halema and executed on the compute nodes assigned by the batch subsystem. The table below summarizes halem queue names and their parameters. Use the command

bqueues -l queue_name

to see detailed information about any batch queue. Issuing the command

bqueues -u userid

will list all the batch queues to which userid is allowed to submit jobs.

Queue        Priority   Default Run   Max Run       Max Job      Job Limit        Job Limit         Hosts
                        Limit (min)   Limit (min)   Size (CPU)   per User (CPU)   per Queue (CPU)
debug        31         5             15            32           -                32                hlm100
general      30         10            180           64           128              440               hlm125
general_lng  30         10            720           64           128              228               hlm100
datamove     30         30            60            1            8                16                halema
giss_b       30         720           1800          96           -                300               hlmd-e
geos_int     30         60            480           64           -                256               hlmf-h1
gmao_pproc   30         60            60            4            12               16                halema
gmao_hi      30         360           720           560          -                0                 hlm125
gmao_big     29         360           360           320          -                0                 hlm125
gmao_long    29         1440          1440          256          -                0                 hlm125
gmao_short   28         120           120           560          -                0                 hlm125
gmao_lo      28         360           360           560          -                0                 hlm125
background   25         60            240           128          192              540               hlm100
special_b    31         360           720           852          -                852               hlm125
special      25         120           1500          128          128              128               hlm125

Note that the queue structure and limits above can change at any time. To display a detailed list of your limits, issue the command

bqueues -l -u userid

In the long (-l) listing of the bqueues command, the column named MAX lists the maximum number of CPUs available to the queue. The MAXIMUM LIMITS section defines RUNLIMIT (the maximum run time for any job in the queue), and the PROCLIMIT line defines how many processors a job may use: the leftmost number is the minimum, the middle number is the default, and the rightmost number is the maximum number of processors that can be used by one job.
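As a hypothetical illustration (the values below are invented, not taken from any actual halem queue), a PROCLIMIT line of the form "1 4 64" reported by "bqueues -l" would be read as follows:

```shell
# Hypothetical PROCLIMIT values from a "bqueues -l" listing:
# minimum, default, and maximum processors per job, in that order.
proclimit="1 4 64"
set -- $proclimit
echo "min=$1 default=$2 max=$3 processors per job"
```

A job submitted to such a queue without "-n" would get 4 processors by default, and could request at most 64.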

The default run limit will be used for batch jobs submitted without the "bsub -W <time_limit>" option or without specifying "#BSUB -W time_limit" in the job scripts. Specifying a run time limit is highly recommended for all batch jobs.

If no queue name is specified when a batch job is submitted, it will go into the general queue by default.



Interactive batch environment

Interactive batch session with pseudo-terminal support

Interactive batch session with pseudo-terminal support is the preferred method for submitting an interactive batch job if you do not require X-windows. It avoids sending encrypted graphics over the network, which can sometimes be slow. To start a session, at the prompt issue the command:

% bsub -P sponsor_code -q queue_name -Is -n num_cpus shell_name

The key option is "-Is", which submits a batch interactive job and creates a pseudo-terminal with shell mode support when the job starts. For example,

% bsub -P b000 -q general -Is -n 4 /usr/dlocal/bin/tcsh

The above example assumes that your sponsor code account is b000, you are submitting a 4-processor batch interactive job to the general queue, and you want tcsh as the interactive shell when your job starts.

Interactive batch session with remote display of X-windows

If your interactive batch session requires remote display of X-windows,

  • Verify that X-window display works on your local system. How you do this varies from system to system. For non-Unix-based systems, you may need to ensure that an X server is installed and running. Contact your local system administrator and/or the NCCS User Services Group if you need assistance.
  • Enable X11 forwarding by using the "-X" option when you connect to the login server:
    ssh -X userid@login.nccs.nasa.gov
  • After logging into halem, requesting an interactive batch session is the same as described above for pseudo-terminal support.
  • You may test your ability to run an X application using a simple command such as
    /usr/bin/X11/xclock



Batch request using script

To submit a job for execution on halem, create an LSF batch script. The simplest script might look like:

#!/bin/csh
#BSUB -P k1234
#BSUB -J jjob
#BSUB -n 4
#BSUB -W 00:06
#BSUB -q giss_sm_b
#BSUB -i /scr/userid/Poisson_Hybrid/omp/inp
#BSUB -o /scr/userid/Poisson_Hybrid/omp/bsout.%J
#BSUB -e /scr/userid/Poisson_Hybrid/omp/bserr.%J

echo "This job runs on `hostname`."
echo " ---- starting job output"

chdir /scr/userid/Poisson_Hybrid/omp
prun -c 4 poisson_omp

Then to submit a script for execution, type

% bsub < script_name

Please note that the "<" sign must precede the batch file name. If you type "bsub script_name" without the "<", none of the #BSUB directives in the job script will be processed, and unexpected job behavior may result. Some of the option flags for the bsub command are:

  • -P, to assign a sponsor code. To get your sponsor code, use the command getsponsor.
  • -J, to assign a name to your job.
  • -n numprocs, to submit a parallel job with numprocs as the minimum number of processors.
  • -W, to set the run time limit of the batch job.
  • -q, to submit the job to the specified queue.
  • -i, to read the standard input for the job from the specified file.
  • -e, to append the standard error output of the job to the specified file.
  • -o, to direct the job output to the specified file. If you use -o without -e, the standard error of the job is stored in the output file. If you use the special character %J in the name of the output/error file, %J is replaced by the job ID.

These options may be specified on the command line or in the batch script by preceding them with the "#BSUB" directive. See "man bsub" for further information on bsub command options.

A word on running jobs conditionally

The LSF bsub command provides a '-w' flag (not to be confused with the capital -W) to use dependency conditions to control the placement of a job. For example, the script:

#!/bin/csh
#BSUB -P k32
#BSUB -J job3
#BSUB -n 8
#BSUB -W 00:06
#BSUB -q general
#BSUB -o /scr/m/est/out3.%J
#BSUB -e /scr/m/est/err3.%J
#BSUB -w 'done("job1") && done("job2")'

prun -n 8 job3.exe

would only allow job3 to run if job1 and job2 had completed successfully. The dependency condition inside the single quotes must evaluate to true for the job to be allowed to run. The logical operators && (AND), || (OR), and ! (NOT), as well as parentheses, can be used to build the conditional expression. For example,

'done("job1") && done("job2")'

runs the 3rd job if the 1st AND 2nd job are successful,

'exit("job1") && exit("job2")'

runs the 3rd job if the 1st AND 2nd job both fail, and

'ended("job1") && ended("job2")'

runs the 3rd job when the 1st AND 2nd job have finished, regardless of whether they ran normally or failed.

An LSF job ID may also be used inside the parentheses; unlike a job name, it is not enclosed in double quotes.

The bkill command may be used to delete jobs that are queued or running. Use "bkill <jobid>" to kill a previously submitted job. "bkill 0" may also be used to kill all your jobs that are running or pending. See "man bkill" for more information on the bkill command.



Monitoring jobs

Jobs on halem can be monitored via RMS or LSF commands. RMS provides a set of commands for running parallel programs and monitoring their execution. LSF handles the integration with RMS and the allocation of resources, so you normally do not need to issue RMS-specific commands other than prun to run your executables. Use the "rinfo" command to see the status of running jobs and allocated resources.

A group of commands is also available to display information about LSF jobs and hosts including bjobs, bqueues, bhist, bpeek, and bhosts. For example,

% bjobs -u all -q general

displays all currently running jobs for all users for a specific queue, in this case 'general', and

% bjobs -a -u userid

displays information about jobs submitted by 'userid' in all states, including jobs that finished recently, within the interval specified by CLEAN_PERIOD in lsb.params (the default is 1 hour). For jobs that finished more than CLEAN_PERIOD seconds ago, use the bhist command.

You can find your jobs within some time frame by issuing the command

% bhist -T 2002/8/16/00:40,2002/8/16/17:50 -n 0 -N 1.00 -u userid

and then find detailed descriptions on each job by

% bhist -n 0 -N 1.00 -l JOBID



Checkpointing jobs

Currently there is no system-level checkpointing for interactive or batch jobs. You are responsible for generating output files and restart files in case the system behaves abnormally. Contact the NCCS User Services Group if you are interested in adding checkpoint capability to your application.
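One common application-level approach, sketched below under the assumption that your application can resume from a saved step counter, is to record progress to a restart file after each unit of work so that a resubmitted job continues where the previous one stopped. The file name and step logic here are hypothetical:

```shell
#!/bin/sh
# Sketch of application-level restart logic (hypothetical example).
RESTART=restart.dat

# Resume from the last completed step if a restart file exists.
if [ -f "$RESTART" ]; then
    step=$(cat "$RESTART")
else
    step=0
fi

while [ "$step" -lt 5 ]; do
    step=$((step + 1))
    # ... one unit of real work would go here ...
    echo "$step" > "$RESTART"   # record progress after each step
done
echo "finished at step $step"
```

If the job is killed mid-run, resubmitting the same script skips the steps already recorded in restart.dat rather than starting over.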


NASA Curator: Mason Chang,
NCCS User Services Group (301-286-9120)
NASA Official: Phil Webster, High-Performance
Computing Lead, GSFC Code 606.2