
Running Jobs on Franklin

Important Notice

NERSC is upgrading Franklin to a quad-core XT4 system from July to October 2008. Please refer to the Franklin Quad Core Upgrade Plan for detailed timelines and for changes to the user environment and to running jobs.

Essential Points about Running Jobs on Franklin

  • Please remember to use aprun to launch tasks onto the compute nodes. That is the ONLY way tasks are launched on the compute nodes of Franklin.
  • This is a fundamental difference from Jacquard and Bassi. On those systems, anything in a batch job runs on compute nodes held by the job. But on Franklin, any shell command in a batch script, including aprun itself, is executed on a login node which is shared with other users, NOT on the compute nodes held by the job.
  • Users accumulate MPP charges based on the wall-clock time that the batch job holds the compute nodes, even if they are all idle while the batch job shell commands and other serial tasks (e.g. hsi, make, compiles, IDL) run on shared login nodes.
  • Batch jobs that run processes generating significant memory or CPU loads on the login nodes are discouraged. Large parallel makes, data analysis using IDL, or large script tasks using Python or Perl (whether run interactively or from a batch script) can significantly load a login node.

Reporting Slow Response Time

If you experience slow response time on a login node, please email the output of the following command sequence to consult@nersc.gov:

hostname; pwd; uptime


Overview

Franklin has 9,672 compute nodes for High Performance Computing (HPC) jobs and an additional 16 nodes for user logins and shell access. Each node has a dual-core Opteron processor and 4 GB of memory. When you SSH to Franklin you are randomly connected to a login node, which runs a full Linux operating system and from which you can edit files, compile code, submit jobs, etc. Compute jobs are launched through a Cray-specific command and execute exclusively on the compute nodes, which run a high-performance - but limited - version of Linux named "Compute Node Linux," or CNL.

Three software packages are used by the Franklin system to schedule and run HPC jobs:

  • Torque, which is a batch system framework
  • Moab, the job scheduler
  • ALPS, which runs the job

These are the steps typically followed to run an application code:

  1. A batch script is written using a text editor.
  2. The user verifies that the batch script contains the desired keywords and the ALPS aprun job-launch command (similar to the MPICH mpirun command).
  3. The batch script is submitted to the batch system using the Torque qsub command.

Users should run batch jobs out of their $SCRATCH directory, rather than $HOME, for better I/O performance. NGF (/project) directories are not available from the compute nodes and cannot be used for I/O.
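
For example, a minimal sequence for staging and submitting a run from $SCRATCH (the directory name my_run and the file names a.out and batch_script are only illustrative) is:

franklin% mkdir $SCRATCH/my_run
franklin% cp ./a.out ./batch_script $SCRATCH/my_run
franklin% cd $SCRATCH/my_run
franklin% qsub batch_script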

Franklin Interactive Jobs

Command Line Interactive Jobs

Command line interactive jobs are not available at this time. All the compute nodes are configured as batch nodes. Interactive jobs should be run as interactive batch jobs; see below.

Interactive Batch Jobs

Interactive batch jobs are useful for debugging purposes. The following command requests 4 PEs from the interactive batch class using account reponame (other Torque keywords including other queue classes could also be used).

 
franklin% qsub -I -V -q interactive -l mppwidth=4 [-l mppnppn=num_of_tasks_per_node] [-A reponame] 
franklin% cd $PBS_O_WORKDIR 

The qsub command starts a new shell in the user's home directory. The directory from which the job was submitted is stored in the environment variable $PBS_O_WORKDIR; after changing to it, the user can launch parallel programs interactively with the aprun utility. Jobs can run in either dual core or single core mode. In this example, 2 nodes (with 2 cores per node by default) from the "interactive" batch class have been requested. The "-V" option ensures that the environment settings from your terminal are inherited by the batch job. Specifying the account to charge with "-A reponame" is optional.

A user can run a 4 processor job using 4 nodes in single core mode or using 2 nodes in dual core mode. On Franklin, dual core mode is the default. The following example will run on 4 nodes in single core mode:

 
franklin% aprun -n 4 -N 1 ./a.out 

The following example will run on 2 nodes with both cores on a node:

 
franklin% aprun -n 4 -N 2 ./a.out 

Franklin Batch Jobs

Sample Batch Script

A batch script - a text file with Torque directives and job commands - is required to submit parallel jobs. Torque directive lines, which tell the batch system how to run a job, begin with #PBS. The following is an example batch script submitting to the debug queue, requesting 2 nodes using 4 processors with the default dual core option, with a 10 minute wall clock limit.

#PBS -q debug
#PBS -l mppwidth=4
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 4 ./a.out

Here is another example requesting 4 processors using 4 nodes with only 1 core per node:

#PBS -q debug
#PBS -l mppwidth=4
#PBS -l mppnppn=1
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 4 -N 1 ./a.out

Notice that the number specified for "-l mppwidth" should always match the "-n" option for aprun. Since dual core mode is the default mode, the following script is equivalent to the first example:

#PBS -q debug
#PBS -l mppwidth=4
#PBS -l mppnppn=2
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 4 -N 2 ./a.out

The Torque keyword "#PBS -l mppnppn=2" and the aprun option "-N 2" are optional in dual core mode. The following table lists the most important corresponding "aprun" and "#PBS -l" options:

aprun option    #PBS -l option    Description
-n 4            -l mppwidth=4     Width (number of PEs)
-N 1            -l mppnppn=1      Number of PEs per node

In the sample scripts above, the line cd $PBS_O_WORKDIR changes the current working directory to the directory from which the script was submitted. NERSC recommends running jobs from $SCRATCH instead of $HOME. The easiest way to run a job from $SCRATCH is to submit the job from the $SCRATCH directory. Alternatively, a user may replace cd $PBS_O_WORKDIR with cd $SCRATCH in the batch script.

Torque Keywords

The following table lists recommended and useful Torque keywords. For an expanded list of Torque job options and keywords, see the Torque qsub documentation, but keep in mind that it describes a generic Torque implementation and not all options are relevant to Franklin.

Recommended Torque Options/Directives

Option                          Default           Description
-l mppwidth=total_tasks         1                 Always specify the total number of MPI tasks or instances of the executable. (Cray XT4 specific)
-l walltime=HH:MM:SS            00:30:00          Always specify the maximum wallclock time for your job.
-q queue                        batch             See Batch queues below.
-N job_name                     Job script name   Job name: up to 15 printable, non-whitespace characters.

Useful Torque Options/Directives

Option                          Default                   Description
-l mppnppn=cores_per_node       2                         Use cores_per_node cores per node. (Cray XT4 specific)
-l mppdepth=threads_per_node    1                         Run threads_per_node threads per node; use for OpenMP. (Cray XT4 specific)
-A repo                         Default repo              Charge this job to repo.
-e filename                     <script_name>.e<job_id>   Write STDERR to filename.
-o filename                     <script_name>.o<job_id>   Write STDOUT to filename.
-j [eo|oe]                      Do not merge              Merge STDOUT and STDERR: oe merges into standard output; eo merges into standard error.
-m [a|b|e|n]                    a                         E-mail notification: a = send mail when the job is aborted by the system; b = send mail when the job begins; e = send mail when the job ends; n = do not send mail. Options a, b, e may be combined.
-S shell                        Login shell               Specify shell as the scripting language to use.
-V                              Do not import             Export the current environment variables into the batch job environment.

All options may be specified either as qsub command-line options or as directives in the batch script on #PBS option lines.

STDOUT and STDERR

While your job is running, standard output (STDOUT) and standard error (STDERR) are written to a file (or files) in a system directory, and this output is copied to your submission directory only when the job completes. To view it during the run, merge STDERR into STDOUT (with the Torque -j keyword) and redirect the aprun output to a file. For example:

...
#PBS -j oe
...
     aprun -n 64 ./a.out >& my_output_file          (for csh/tcsh)
or:  aprun -n 64 ./a.out > my_output_file 2>&1      (for bash)

Submit, Delete, Hold, and Release Jobs

To submit a job for execution, type


% qsub batchscript

where batchscript is the name of the batch script. The output of the qsub command will include the jobid. Users should record this information, as it is very useful in debugging job failures.

To delete a previously submitted job, type


franklin% qdel jobid

where jobid is the job ID produced by the qsub command.

To hold a previously submitted job, type


franklin% qhold jobid

To release a previously held job, type


franklin% qrls jobid

Job Steps and Dependencies

The qsub option -W depend=dependency_list, or the equivalent Torque keyword #PBS -W depend=dependency_list, establishes job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the job just submitted can only be executed after the dependent job(s) have terminated without an error.

For example, to run batch job2 only after batch job1 succeeds,


franklin% qsub job1
297873.nid00003

franklin% qsub -W depend=afterok:297873.nid00003 job2 

or:


franklin% qsub job1
297873.nid00003

franklin% cat job2 
#PBS -q debug
#PBS -l mppwidth=4
#PBS -l walltime=0:30:00
#PBS -W depend=afterok:297873.nid00003
#PBS -j oe
 
cd $PBS_O_WORKDIR
aprun -n 4 ./a.out

franklin% qsub job2

The second job will remain in batch "Held" status until job1 has run successfully. Note that job2 must be submitted while job1 is still in the batch system, either running or queued. If job1 has exited before job2 is submitted, job2 will never be released from the "Held" status.

It is also possible to submit the second job from within the first job's (job1's) batch script, using the environment variable $PBS_JOBID:

#PBS -q debug
#PBS -l mppwidth=4
#PBS -l walltime=0:30:00
#PBS -j oe
 
cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2
aprun -n 4 ./a.out

Please refer to qsub man page for other -W depend=dependency_list options including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.
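
For example, a sketch of a chain in which job2 should run once job1 terminates, whether or not job1 succeeded, uses afterany (the job ID shown is only illustrative):

franklin% qsub job1
297874.nid00003

franklin% qsub -W depend=afterany:297874.nid00003 job2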

Running Multiple Parallel Jobs Sequentially

Multiple parallel jobs can be run sequentially in a single batch job. Be sure to set the Torque keyword "mppwidth" to the LARGEST number of nodes needed by any of the jobs, times 2 (cores per node). For example, the following sample script reserves 10 cores (5 nodes, enough for the largest run, b.out) and runs three executables in sequential order:

#PBS -q debug
#PBS -l mppwidth=10
#PBS -l walltime=0:30:00
#PBS -j oe
  
cd $PBS_O_WORKDIR
aprun -n 4 ./a.out 
aprun -n 10 ./b.out 
aprun -n 6 ./c.out 

Running Multiple Parallel Jobs Simultaneously

Multiple parallel jobs can be run simultaneously in a single batch job. Be sure to set the Torque keyword "mppwidth" to the TOTAL number of nodes needed by these jobs, times 2 (cores per node). (Note: this is not simply the total number of cores needed, since only one executable can run on a given node.) For example, the following sample script reserves 30 cores (15 nodes times 2, not simply 4+15+9=28 cores) and runs three executables simultaneously: a.out on 2 nodes, b.out on 8 nodes, and c.out on 5 nodes.

#PBS -q debug
#PBS -l mppwidth=30
#PBS -l walltime=0:30:00
#PBS -j oe
  
cd $PBS_O_WORKDIR
aprun -n 4 ./a.out & 
aprun -n 15 ./b.out &
aprun -n 9 ./c.out & 
wait 

Running MPMD (Multi Programming Multi Data) Jobs

To run an MPMD job, use aprun option " -n pes executable1 : -n pes executable2 : ...". All the executables share a single MPI_COMM_WORLD.

For example, the following command runs a.out on 4 cores and b.out on 8 cores:

aprun -n 4 ./a.out : -n 8 ./b.out

Note that the number of nodes needed for each executable must be calculated separately, since only one executable can run on a given node. The total number of nodes needed for the job is the sum of the nodes needed by each executable.

For example, the following command runs a.out on 3 cores and b.out on 9 cores, and the total number of nodes (with default dual core) needed is 2 + 5 = 7, thus mppwidth needs to be set to 14, instead of simply adding 3 and 9:

#PBS -q debug
#PBS -l mppwidth=14
#PBS -l walltime=0:30:00
#PBS -j oe
  
cd $PBS_O_WORKDIR
aprun -n 3 ./a.out : -n 9 ./b.out

Running Hybrid MPI/OpenMP Jobs

Franklin has 2 cores sharing the memory on each node, and OpenMP is supported within a node. To use OpenMP, the code must be compiled with the compiler option "-mp=nonuma". A Torque batch script needs to specify the keywords "-l mppnppn=1" (one MPI task per node) and "-l mppdepth=2" (2 threads per MPI task). You also need to set the OpenMP environment variable OMP_NUM_THREADS to 2 and use the "-N 1" option on the "aprun" command.

Please refer to the NERSC web pages for a sample Fortran MPI/OpenMP source code and the batch script used on Franklin.
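
As a minimal sketch, a batch script for such a hybrid job with 4 MPI tasks, one task per node and 2 OpenMP threads per task, could look like the following; the executable name hybrid.x and the csh-style setenv line are only illustrative (use "export OMP_NUM_THREADS=2" under bash):

#PBS -q debug
#PBS -l mppwidth=4
#PBS -l mppnppn=1
#PBS -l mppdepth=2
#PBS -l walltime=00:10:00
#PBS -j oe

cd $PBS_O_WORKDIR
setenv OMP_NUM_THREADS 2
aprun -n 4 -N 1 ./hybrid.x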

Undelivered Batch Output

Sometimes the batch system fails to deliver the stdout/stderr files back to the user. Once a night, the orphaned output files of a user's jobs will be placed in the user's $SCRATCH/Undelivered_Batch_Output directory. The directory will be created if it does not yet exist. Output files there are identified by the job ID.

Job Exit Summary

NERSC has implemented utilities to track and categorize user job exit codes. The Job Exit Summary is displayed at the end of each job's standard output file in the following format:


   -------------------------- Batch Job Report ------------------------------

   Job Id:         5504377.nid00003
   User Name:      yunhe
   Group Name:
   Job Name:       mm2
   Session ID:     18905
   Resource List:  walltime=00:10:00
   Queue Name:     debug
   Account String: mpccc
 
Job Exit Summary:
   APINFO_SUCCESS:  application completed with no detectable system errors

The current possible job exit categories are:

  • APINFO_SUCCESS: application completed with no detectable system errors
  • APINFO_TORQUEWALLTIME: job hit wallclock time limit
  • APINFO_APRUNWIDTH: error with width parameters to aprun
  • APINFO_NODEFAIL: a node has failed, aborting your application
  • APINFO_MPICHUNEXBUFFERSIZE: new values for MPICH_UNEX_BUFFER_SIZE are required
  • APINFO_ENOENT: a critical file could not be located
  • APINFO_LIBSMA: shared memory problem
  • APINFO_SIGTERM: job was killed
  • APINFO_NOAPRUN: no aprun could be found for this job
  • APINFO_UNKNOWN: application had a non-zero exit code, and the error could not be determined
  • APINFO_NOTRACE: unable to locate the aprun for this job
  • APINFO_SHMEMATOMIC: shmem atomic operations are currently not supported
  • APINFO_DISKQUOTA: disk quota exceeded
  • APINFO_SIGSEGV: Segmentation violation
  • APINFO_CLAIM: node count exceeds reservation claim
  • APINFO_MPIABORT: application called MPI_Abort
  • APINFO_NIDTERM: compute node initiated termination, possible out of memory condition
  • APINFO_ROMIO: ROMIO-IO level error
  • APINFO_MPIIO: MPI-IO level error

Batch Queue Classes and Policies

Please refer to Batch Queue Classes and Policies page for detailed batch job submission classes and NERSC queue policies for Franklin.

Monitoring Jobs on Franklin

Franklin's current queue look is displayed on the web and updated every 10 minutes. A completed jobs list shows jobs that finished yesterday or earlier.

There are a few other Torque commands and Cray XT utilities that could be used to monitor batch jobs on Franklin:

qstat -a

Use qstat -a instead of qstat for more complete information. This command lists the jobs in submission order. Please see qstat man page for more information.


franklin% qstat -a
nid00003: 
                                                            Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
525573.nid00003      user1  reg_big    job1       15734   --   --    --  06:30 R 02:14
525579.nid00003      user2  reg_smal   job2       15164   --   --    --  36:00 Q  --
525580.nid00003      user3  debug      job3b      15556   --   --    --  00:30 R 00:17
...

qs

The NERSC qs command gives queue status information tailored to Franklin. It displays a terminal formatted summary of running and queued jobs. The qs command takes -u username and -w options that allow decreasing or increasing the amount of information reported. Please see qs man page for more information.


franklin% qs
 JOBID ST      USER        NAME  SIZE       REQ      USED            SUBMIT
 525565  R     user1       job1  3120   10:30:00  05:13:02   May 23 16:42:51
 525568  R     user2       job2   100   02:00:00  01:12:53   May 23 10:23:32
 525567  H     user3       job3   512   24:00:00         -   May 22 19:02:54
 ...

apstat

The apstat command gives the number of up nodes and idle nodes, as well as a list of currently pending and running jobs. The apstat -r command displays all of the node reservations. Please see the apstat man page for more information.

showq

The showq command lists jobs in three categories: active jobs, eligible jobs, and blocked jobs. Jobs are listed in priority order. showq -i lists details of all eligible jobs. A job that already has a resource allocation is listed with a "*" mark next to its job ID. Please see the showq command overview for more information.

showstart

The showstart command takes a job ID as its argument and displays an estimated start time for the job. There are several estimation methods: reservation (the default), priority, and historical (based on similar jobs with the same processor count and wall time requirement). The estimates produced by each method can differ. showstart -e all shows all of the estimates.


franklin% showstart -e all 2542
job 2542 requires 512 procs for 11:00:00
 
Estimated Rsv based start in                 5:34:28 on Tue Sep 11 17:45:52
Estimated Rsv based completion in           16:34:28 on Wed Sep 12 04:45:52
 
Estimated Priority based start in           22:28:05 on Wed Sep 12 10:39:29
Estimated Priority based completion in    1:09:28:05 on Wed Sep 12 21:39:29
 
Estimated Historical based start in         16:24:07 on Wed Sep 12 04:35:31
Estimated Historical based completion in  1:03:24:07 on Wed Sep 12 15:35:31
 
Best Partition: franklin

Please note that showstart only estimates the EARLIEST time a job would start, ASSUMING this job is the HIGHEST priority job in the queue. If a job already has a resource reservation (use showq -i to find out), showstart will display the correct reservation time for this job, which is a more reliable (if conservative) prediction. Please see the showstart command overview for more information.

checkjob

The checkjob command takes a job id as its argument. It displays the details of a job, such as why it is in a certain state.


franklin% checkjob 542
checking job 542
 ...
job cannot run  (job has hold in place)
job cannot run  (insufficient idle procs:  0 available) 

The above gives the reason why this job is in a blocked state. Please see checkjob command overview for more information.

xtshowcabs

The xtshowcabs command shows the current allocation and status of the system's nodes and gives information about each running job. The output displays the position of each node in the System Interconnection Network and represents the application running on the node with a symbol assigned for the particular execution of xtshowcabs. Please see the man page for more information.

Memory Usage Considerations

Each Franklin compute node has 4 GB (4096 MB) of physical memory, but not all of that memory is available to user programs. Compute Node Linux (the kernel), the Lustre file system software, and message passing library buffers all consume memory, as does loading the executable itself. Thus the precise memory available to an application varies; approximately 3584 MB (3.5 GB) can be allocated from within an MPI program using both cores per node, i.e., 1792 MB (1.75 GB) per MPI task on average. Using 1 core per node, an MPI program can allocate up to about 3672 MB (3.58 GB) per task.

A user can change MPI buffer sizes by setting certain MPICH environment variables. See the man page for intro_mpi for more details.
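
For example, if a job fails in the APINFO_MPICHUNEXBUFFERSIZE category listed above, the unexpected-message buffer can be enlarged by setting MPICH_UNEX_BUFFER_SIZE (in bytes) in the batch script before the aprun line; the value below is only an illustration:

setenv MPICH_UNEX_BUFFER_SIZE 120000000      (for csh/tcsh)
export MPICH_UNEX_BUFFER_SIZE=120000000      (for bash)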

Currently, there are no user-level error messages when a job runs out of memory; a job that does may appear to terminate for no obvious reason. Out-of-memory jobs sometimes leave nodes in an unhealthy state that affects future jobs landing on those nodes (a bug has been reported to Cray). Users are encouraged to evaluate their memory requirements carefully, either via internal checking in their codes or with tools: CrayPat can track heap usage, and IPM also tracks memory usage.

MPI Task Distribution on Nodes

The distribution of MPI tasks on the nodes can be written to the standard output file by setting the environment variable PMI_DEBUG to 1. Users can control the distribution of MPI tasks on the nodes with the environment variable MPICH_RANK_REORDER_METHOD. The default task distribution in dual core mode is SMP-style placement, which corresponds to setting MPICH_RANK_REORDER_METHOD to 1. For example, 8 MPI tasks would be distributed as follows:

             Node 1           Node 2           Node 3           Node 4
             Core 1  Core 2   Core 1  Core 2   Core 1  Core 2   Core 1  Core 2
MPI Rank     0       1        2       3        4       5        6       7

Setting MPICH_RANK_REORDER_METHOD to 2 would allow a folded-rank placement of MPI tasks:

             Node 1           Node 2           Node 3           Node 4
             Core 1  Core 2   Core 1  Core 2   Core 1  Core 2   Core 1  Core 2
MPI Rank     0       7        1       6        2       5        3       4

Setting the MPI environment variable MPICH_RANK_REORDER_METHOD to 3 selects a custom placement of MPI ranks, defined in a user-supplied file named MPICH_RANK_ORDER. See the intro_mpi man page for more information.
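
As a sketch, these variables can be set in the batch script before the aprun line; the settings below (echo the task map to standard output and use folded-rank placement) are only an example:

setenv PMI_DEBUG 1                        (for csh/tcsh)
setenv MPICH_RANK_REORDER_METHOD 2
aprun -n 8 ./a.out

or:

export PMI_DEBUG=1                        (for bash)
export MPICH_RANK_REORDER_METHOD=2
aprun -n 8 ./a.out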

CNL malloc Environment Variables

The CNL kernel provides the following runtime malloc tunable environment variables to control how the system memory allocation routine "malloc" behaves (note the trailing underscores):

  • MALLOC_TRIM_THRESHOLD_
  • MALLOC_TOP_PAD_
  • MALLOC_MMAP_THRESHOLD_
  • MALLOC_MMAP_MAX_

The two variables that have been found most useful are MALLOC_MMAP_MAX_ and MALLOC_TRIM_THRESHOLD_. The recommended settings for these two variables are:

  • MALLOC_TRIM_THRESHOLD_ = 536870912
  • MALLOC_MMAP_MAX_ = 0

Setting MALLOC_MMAP_MAX_ limits the number of 'internal' mmap regions. A setting of 0 means that the program will not use any non-heap (mmap) regions at all, instead of the default maximum of 64. This eliminates the system calls to mmap/munmap.

MALLOC_TRIM_THRESHOLD_ is the amount of free space that must exist at the top of the heap after a free() before malloc returns the memory to the OS. Raising it helps performance by reducing the system-time overhead of calls to sbrk/brk. The default setting of 128 KBytes is much too low for a node with 4 GBytes of memory running one application; we suggest setting it to 0.5 GBytes.
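
A minimal sketch of applying the recommended settings in a batch script, assuming variables set before the aprun command are passed through to the compute nodes (bash syntax shown; use setenv under csh/tcsh):

export MALLOC_TRIM_THRESHOLD_=536870912
export MALLOC_MMAP_MAX_=0
aprun -n 4 ./a.out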

Please refer to Cray document CNL malloc Environment Variables for more information.

