Running Jobs on Franklin

Important Notice

NERSC is upgrading Franklin to a quad-core XT4 system from July to October 2008. Please refer to the Franklin Quad Core Upgrade Plan for detailed timelines and for changes to the user environment and to running jobs. See also: Essential Points about Running Jobs on Franklin.
Reporting Slow Response Time

If you experience slow response time on a login node, please email the output of the following command sequence to consult@nersc.gov:

hostname; pwd; uptime
Overview

Franklin has 9,672 compute nodes for High Performance Computing (HPC) jobs and an additional 16 nodes for user logins and shell access. Each node has a dual-core Opteron processor and 4 GB of memory. When you SSH to Franklin you are randomly connected to a login node, which runs a full Linux operating system and from which you can edit files, compile code, submit jobs, etc. Compute jobs are launched through a Cray-specific command and execute exclusively on the compute nodes, which run a high-performance - but limited - version of Linux named "Compute Node Linux," or CNL.

Three software packages are used by the Franklin system to schedule and run HPC jobs: Torque (the batch system), Moab (the scheduler), and ALPS (the application launcher, invoked with the aprun command).

These are the steps typically followed to run an application code: compile the code on a login node, create a batch script containing Torque directives and an aprun command, and submit the script with qsub.
Users should run batch jobs out of their $SCRATCH directory, rather than $HOME, for better I/O performance. NGF (/project) directories are not available from the compute nodes and cannot be used for I/O.

Franklin Interactive Jobs
Command Line Interactive Jobs

Command line interactive jobs are not available at this time. All the compute nodes are configured as batch nodes. Interactive jobs should be run as interactive batch jobs; see below.

Interactive Batch Jobs

Interactive batch jobs are useful for debugging purposes. The following command requests 4 PEs from the interactive batch class under account reponame (other Torque keywords, including other queue classes, could also be used):

franklin% qsub -I -V -q interactive -l mppwidth=4 [-l mppnppn=num_of_tasks_per_node] [-A reponame]
franklin% cd $PBS_O_WORKDIR

The qsub command starts a new shell in the user's home directory. The directory from which the job was submitted is stored in the environment variable $PBS_O_WORKDIR; after changing to it, the user can launch interactive jobs with the aprun utility. Jobs can run in either dual core or single core mode. In this example, 2 nodes (with 2 cores per node by default) from the "interactive" batch class have been requested. The "-V" option ensures that the environment settings from your terminal are inherited by your batch job; specifying the account to be charged with "-A reponame" is optional.

A user can run a 4 processor job using 4 nodes in single core mode or using 2 nodes in dual core mode. On Franklin, dual core mode is the default. The following example will run on 4 nodes in single core mode:

franklin% aprun -n 4 -N 1 ./a.out

The following example will run on 2 nodes with both cores on each node:

franklin% aprun -n 4 -N 2 ./a.out

Franklin Batch Jobs

Sample Batch Script
A batch script - a text file with Torque directives and job commands - is required to submit parallel jobs. Torque directive lines, which tell the batch system how to run a job, begin with #PBS. The following example batch script submits to the debug queue, requesting 4 processors on 2 nodes with the default dual core option and a 10 minute wall clock limit.
#PBS -q debug
#PBS -l mppwidth=4
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 4 ./a.out

Here is another example requesting 4 processors using 4 nodes with only 1 core per node:
#PBS -q debug
#PBS -l mppwidth=4
#PBS -l mppnppn=1
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 4 -N 1 ./a.out

Notice that the number specified for "-l mppwidth" always matches the "-n" option for aprun. Since dual core mode is the default mode, the following script is equivalent to the first example:
#PBS -q debug
#PBS -l mppwidth=4
#PBS -l mppnppn=2
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 4 -N 2 ./a.out

The Torque keyword "#PBS -l mppnppn=2" and the aprun option "-N 2" are optional in dual core mode. The most important corresponding "aprun" and "#PBS -l" options are:

aprun option    #PBS -l keyword    Meaning
-n              mppwidth           total number of tasks
-N              mppnppn            number of tasks per node
-d              mppdepth           number of threads per task
In the sample scripts above, the line "cd $PBS_O_WORKDIR" changes the current working directory to the directory from which the script was submitted. NERSC recommends running jobs from $SCRATCH instead of $HOME. The easiest way to run a job from $SCRATCH is to submit the job from the $SCRATCH directory. Alternatively, a user may replace "cd $PBS_O_WORKDIR" with "cd $SCRATCH" in the batch script.

Torque Keywords

The following table lists recommended and useful Torque keywords. For an expanded list of Torque job options and keywords see the Torque qsub documentation, but keep in mind that it describes a generic Torque implementation and not all options are relevant to Franklin.
All options may be specified either as qsub command-line options or as directives in the batch script on "#PBS option" lines.

STDOUT and STDERR

While your job is running, standard output (STDOUT) and standard error (STDERR) are written to a file (or files) in a system directory. This output is copied to your submission directory only when the job completes. To view it during the run, merge the stderr stream into stdout (with a Torque keyword) and redirect your output to a file. For example:

...
#PBS -j oe
...
aprun -n 64 ./a.out >& my_output_file      (for csh/tcsh)

or:

aprun -n 64 ./a.out > my_output_file 2>&1  (for bash)

Submit, Delete, Hold, and Release Jobs

To submit a job for execution, type

franklin% qsub batchscript

where batchscript is the name of the batch script. The output of the qsub command will include the jobid. Users should record this information, as it is very useful in debugging job failures.

To delete a previously submitted job, type

franklin% qdel jobid

where jobid is the job id produced by the qsub command.

To hold a previously submitted job, type

franklin% qhold jobid

To release a previously held job, type

franklin% qrls jobid

Job Steps and Dependencies

The qsub option "-W depend=dependency_list" (or the Torque keyword "#PBS -W depend=dependency_list") declares job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the job just submitted can only execute after the listed job(s) have terminated without error. For example, to run batch job2 only after batch job1 succeeds:

franklin% qsub job1
297873.nid00003
franklin% qsub -W depend=afterok:297873.nid00003 job2

or:

franklin% qsub job1
297873.nid00003
franklin% cat job2
#PBS -q debug
#PBS -l mppwidth=4
#PBS -l walltime=0:30:00
#PBS -W depend=afterok:297873.nid00003
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 4 ./a.out
franklin% qsub job2

The second job will remain in the batch "Held" status until job1 has run successfully.
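Rather than copying the job id by hand, the id printed by qsub can be captured in a shell variable. A minimal sketch, assuming a bash login shell (job1 and job2 are the batch scripts from the example above):

```shell
# Capture the job id that qsub prints (e.g. "297873.nid00003"),
# then submit the dependent job against it.
JOBID=$(qsub job1)
qsub -W depend=afterok:${JOBID} job2
```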
Note that job2 has to be submitted while job1 is still in the batch system, either running or in the queue. If job1 has exited before job2 is submitted, job2 will not be released from the "Held" status. It is also possible to submit the second job from within its dependent job's (job1's) batch script, using the environment variable $PBS_JOBID:
#PBS -q debug
#PBS -l mppwidth=4
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2
aprun -n 4 ./a.out

Please refer to the qsub man page for other -W depend=dependency_list options, including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.

Running Multiple Parallel Jobs Sequentially

Multiple parallel jobs can be run sequentially in one single batch job. Be sure to set the Torque keyword "mppwidth" to the LARGEST number of nodes needed by any of the jobs, times 2. For example, the following sample script will reserve 10 cores and run three executables in sequential order:
#PBS -q debug
#PBS -l mppwidth=10
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 4 ./a.out
aprun -n 10 ./b.out
aprun -n 6 ./c.out

Running Multiple Parallel Jobs Simultaneously

Multiple parallel jobs can be run simultaneously in one single batch job. Be sure to set the Torque keyword "mppwidth" to the TOTAL number of nodes needed by these jobs, times 2. (Note: not simply the total number of cores needed, since only one executable can run on a single node.) For example, the following sample script will reserve 30 cores (not simply 4+15+9=28) and run three executables simultaneously: a.out on 2 nodes, b.out on 8 nodes, and c.out on 5 nodes.
#PBS -q debug
#PBS -l mppwidth=30
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 4 ./a.out &
aprun -n 15 ./b.out &
aprun -n 9 ./c.out &
wait
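The mppwidth bookkeeping in the sequential and simultaneous examples above can be sketched in shell arithmetic (a hypothetical helper for illustration, not a NERSC utility; dual core nodes assumed):

```shell
# Hypothetical helper: cores -> nodes on dual-core Franklin nodes, rounding up.
cores_per_node=2
nodes_for() {
  echo $(( ($1 + cores_per_node - 1) / cores_per_node ))
}

# Sequential example above: aprun -n 4, -n 10, -n 6.
# mppwidth = (largest node count) * cores_per_node.
seq_max=0
for n in 4 10 6; do
  nd=$(nodes_for "$n")
  if [ "$nd" -gt "$seq_max" ]; then seq_max=$nd; fi
done
seq_mppwidth=$(( seq_max * cores_per_node ))
echo "sequential: mppwidth=$seq_mppwidth"      # 5 nodes * 2 = 10

# Simultaneous example above: aprun -n 4, -n 15, -n 9.
# mppwidth = (total node count) * cores_per_node.
sim_nodes=0
for n in 4 15 9; do
  sim_nodes=$(( sim_nodes + $(nodes_for "$n") ))
done
sim_mppwidth=$(( sim_nodes * cores_per_node ))
echo "simultaneous: mppwidth=$sim_mppwidth"    # (2+8+5) nodes * 2 = 30
```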
Running MPMD (Multiple Program Multiple Data) Jobs

To run an MPMD job, use the aprun syntax "-n pes executable1 : -n pes executable2 : ...". All the executables share a single MPI_COMM_WORLD. For example, the following command runs a.out on 4 cores and b.out on 8 cores:
aprun -n 4 ./a.out : -n 8 ./b.out

Please note that the number of nodes needed for each executable should be calculated separately, since only one executable can run on each node. The number of nodes needed for the job is the total across all executables. For example, the following script runs a.out on 3 cores and b.out on 9 cores; the total number of (default dual core) nodes needed is 2 + 5 = 7, so mppwidth needs to be set to 14, instead of simply 3 + 9:
#PBS -q debug
#PBS -l mppwidth=14
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 3 ./a.out : -n 9 ./b.out

Running Hybrid MPI/OpenMP Jobs

Franklin has 2 cores sharing the memory on each node, and OpenMP is supported within the node. To use OpenMP, compile the code with the compiler option "-mp=nonuma". A Torque batch script needs to specify the keywords "mppnppn=1" (one MPI task per node) and "mppdepth=2" (2 threads per MPI task). You also need to set the OpenMP environment variable OMP_NUM_THREADS to 2 and use the "-N 1" option for the "aprun" command. A sample Fortran MPI/OpenMP source code, along with the batch script used on Franklin, is available.

Undelivered Batch Output

Sometimes the batch system fails to deliver the stdout/stderr files back to the user. Once a night, the orphaned output files of a user's jobs will be placed in the user's $SCRATCH/Undelivered_Batch_Output directory. The directory will be created if it does not yet exist. Output files there are identified by the job id.

Job Exit Summary

NERSC has implemented some utilities to track and categorize user job exit codes. The Job Exit Summary is displayed at the end of each job's standard output file in the following format:

-------------------------- Batch Job Report ------------------------------
Job Id: 5504377.nid00003
User Name: yunhe
Group Name:
Job Name: mm2
Session ID: 18905
Resource List: walltime=00:10:00
Queue Name: debug
Account String: mpccc
Job Exit Summary: APINFO_SUCCESS: application completed with no detectable system errors
Each job is assigned to one of several exit categories; APINFO_SUCCESS, shown above, indicates that the application completed with no detectable system errors.
Batch Queue Classes and Policies

Please refer to the Batch Queue Classes and Policies page for detailed batch job submission classes and NERSC queue policies for Franklin.

Monitoring Jobs on Franklin

Franklin's current queue look is displayed on the web, updated every 10 minutes. A completed jobs list shows jobs that finished yesterday and before. There are a few other Torque commands and Cray XT utilities that can be used to monitor batch jobs on Franklin:

qstat -a

Use qstat -a instead of qstat for more complete information. This command lists the jobs in submission order. Please see the qstat man page for more information.

franklin% qstat -a
nid00003:
                                                             Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
525573.nid00003      user1    reg_big  job1        15734  --  --     -- 06:30 R 02:14
525579.nid00003      user2    reg_smal job2        15164  --  --     -- 36:00 Q    --
525580.nid00003      user3    debug    job3b       15556  --  --     -- 00:30 R 00:17
...

qs

The NERSC qs command gives queue status information tailored to Franklin. It displays a terminal-formatted summary of running and queued jobs. The qs command takes -u username and -w options that decrease or increase the amount of information reported. Please see the qs man page for more information.

franklin% qs
JOBID  ST USER   NAME  SIZE REQ      USED     SUBMIT
525565 R  user1  job1  3120 10:30:00 05:13:02 May 23 16:42:51
525568 R  user2  job2   100 02:00:00 01:12:53 May 23 10:23:32
525567 H  user3  job3   512 24:00:00 -        May 22 19:02:54
...
apstat

The apstat command gives the number of up nodes and idle nodes, along with a list of currently pending and running jobs. The "apstat -r" command displays all node reservations. Please see the apstat man page for more information.
showq

The showq command lists jobs in three categories: active jobs, eligible jobs, and blocked jobs. This command lists jobs in priority order. "showq -i" lists details of all eligible jobs. A job that already has a resource allocation is listed with a "*" mark next to its job id. Please see the showq command overview for more information.

showstart

The showstart command takes a job id as its argument. It displays an estimated start time for the job (or for a similar job with the same processor and wall time requirements). There are a few estimation methods: historical, reservation (the default), and priority. The estimated times from the different methods can differ; "showstart -e all" shows all estimates.

franklin% showstart -e all 2542
job 2542 requires 512 procs for 11:00:00
Estimated Rsv based start in               5:34:28 on Tue Sep 11 17:45:52
Estimated Rsv based completion in         16:34:28 on Wed Sep 12 04:45:52
Estimated Priority based start in         22:28:05 on Wed Sep 12 10:39:29
Estimated Priority based completion in  1:09:28:05 on Wed Sep 12 21:39:29
Estimated Historical based start in       16:24:07 on Wed Sep 12 04:35:31
Estimated Historical based completion in 1:03:24:07 on Wed Sep 12 15:35:31
Best Partition: franklin

Please note that showstart only estimates the EARLIEST time a job could start, ASSUMING the job is the highest priority job in the queue. If a job already has a resource reservation (use showq -i to find out), showstart will display the actual reservation time for that job, which is a more reliable estimate. Please see the showstart command overview for more information.

checkjob

The checkjob command takes a job id as its argument. It displays the details of a job, such as why it is in a certain state.

franklin% checkjob 542
checking job 542
...
job cannot run (job has hold in place)
job cannot run (insufficient idle procs: 0 available)

The above gives the reason why this job is in a blocked state. Please see the checkjob command overview for more information.
xtshowcabs

The xtshowcabs command shows the current allocation and status of the system's nodes and gives information about each running job. The output displays the position of each node in the System Interconnection Network and represents the application running on each node with a symbol assigned for that particular execution of xtshowcabs. Please see the man page for more information.

Memory Usage Considerations

Each Franklin compute node has 4 GB (4096 MB) of physical memory, but not all of that memory is available to user programs. Compute Node Linux (the kernel), the Lustre file system software, and message passing library buffers all consume memory, as does loading the executable into memory. Thus the precise memory available to an application varies: approximately 3584 MB (3.5 GB) can be allocated from within an MPI program using both cores per node, i.e., 1792 MB (1.75 GB) per MPI task on average. Using 1 core per node, an MPI program can allocate up to about 3672 MB (3.58 GB) per task. A user can change MPI buffer sizes by setting certain MPICH environment variables; see the intro_mpi man page for more details.

Currently, there are no user-level error messages when a job runs out of memory; a job may seem to terminate without obvious reason when it does. Sometimes out-of-memory jobs may leave some nodes in an unhealthy state that affects future jobs landing on these nodes (a bug reported to Cray). Users are encouraged to evaluate their memory requirements carefully, via internal checking in their codes or with tools: CrayPat can track heap usage, and IPM also tracks memory usage.

MPI Task Distribution on Nodes

The distribution of MPI tasks on the nodes can be written to the standard output file by setting the environment variable PMI_DEBUG to 1. Users can control the distribution of MPI tasks on the nodes using the environment variable MPICH_RANK_REORDER_METHOD.
The default task distribution in dual core mode is SMP-style placement, which corresponds to setting the environment variable MPICH_RANK_REORDER_METHOD to 1. For example, 8 MPI tasks would be distributed as follows:

Node 1: ranks 0,1    Node 2: ranks 2,3    Node 3: ranks 4,5    Node 4: ranks 6,7
Setting MPICH_RANK_REORDER_METHOD to 2 gives a folded-rank placement of MPI tasks:

Node 1: ranks 0,7    Node 2: ranks 1,6    Node 3: ranks 2,5    Node 4: ranks 3,4
Setting the MPI environment variable MPICH_RANK_REORDER_METHOD to 3 requests a custom placement of MPI ranks, defined in a user-supplied MPICH_RANK_ORDER file. See the intro_mpi man page for more information.

CNL malloc Environment Variables

The CNL kernel provides runtime tunable environment variables, MALLOC_MMAP_MAX_ and MALLOC_TRIM_THRESHOLD_, to control how the system memory allocation routine "malloc" behaves (note the trailing underscores):
Setting MALLOC_MMAP_MAX_ limits the number of internal mmap regions. A setting of 0 means that the program will not use any non-heap mapped regions (the default value is 64); this eliminates the system calls to mmap/munmap. MALLOC_TRIM_THRESHOLD_ is the amount of free space that must exist at the top of the heap after a free() before malloc will return the memory to the OS. Setting MALLOC_TRIM_THRESHOLD_ helps performance by reducing system time overhead, i.e. by reducing the number of calls to sbrk/brk. The default setting of 128 KBytes is much too low for a node with 4 GBytes of memory and one application; we suggest setting it to 0.5 GBytes. Please refer to the Cray document CNL malloc Environment Variables for more information.
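The suggested settings above can be sketched as a batch-script fragment (bash/ksh syntax shown; under csh/tcsh use "setenv VAR value" instead; 536870912 bytes = 0.5 GB):

```shell
# Apply the suggested CNL malloc tunables before launching the application.
export MALLOC_MMAP_MAX_=0               # do not use mmap'd regions at all
export MALLOC_TRIM_THRESHOLD_=536870912 # keep up to 0.5 GB of freed heap
echo "MALLOC_MMAP_MAX_=$MALLOC_MMAP_MAX_"
echo "MALLOC_TRIM_THRESHOLD_=$MALLOC_TRIM_THRESHOLD_"
# ...then launch as usual, e.g.:  aprun -n 4 ./a.out
```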
Page last modified: Mon, 15 Sep 2008 22:28:31 GMT
Page URL: http://www.nersc.gov/nusers/systems/franklin/running_jobs/
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov