MPP2 Details
Contents: Configuration - Access - File Systems - Environment - Compilers - Modules - MPI - Job Submission - NWChem jobs - Sample Script - Interactive jobs - Time Allocation Accounts - Job Policies - FAQ
Configuration
MPP2 is a 11.8 TFlops system that consists of 980 Hewlett-Packard Longs Peak nodes (of which 944 will be used for batch processing) with dual Intel 1.5 GHz Itanium-2 processors (also called Madison) and HP's zx1 chipset. The Madison processors are 64-bit processors with a theoretical peak of 6 GFlops. There are two types of nodes on the system, FatNodes (10 Gbytes of memory, i.e. 5 Gbytes per processor, and 430 Gbytes of local disk space) and ThinNodes (10 Gbytes of memory, i.e. 5 Gbytes per processor, and 10 Gbytes of local disk space). Fast interprocessor communication between the processors is obtained using a single rail QSNetII/Elan-4 interconnect from Quadrics. The system runs a version of Linux based on Red Hat Linux Advanced Server. A global 53 Tbyte Lustre file system is available to all the processors. Processor allocation is scheduled using the LSF resource manager.
Access [top]
Accessing MPP2 with SecurID®
For security reasons access to the Molecular Science Computing Facility is obtained through one-time passcodes using SecurID® cards.
Remote access to MPP2 is a two-step process: the user first logs onto mpp2e.emsl.pnl.gov (an IA-32 node), and from there logs onto MPP2 using the normal Username/Kerberos_password combination. The step-by-step procedure is presented below. You must have a SecurID® card from MSCF/PNNL and have followed the initialization procedure before you try to log onto mpp2e.
From Linux or Unix systems
{Note: Our machines use protocol 2, you may need to use ssh2 or ssh -2 for it to work}
- type the following at the window prompt: ssh <Username>@mpp2e.emsl.pnl.gov
- when prompted for the passcode, enter your PIN and SecurID® number
- once logged in on mpp2e, enter the command: ssh mpp2
- when prompted for your password, enter your Kerberos password for MPP2
From PC or Mac systems
You will need at least version 5.3 build 23 of the SSH software from F-Secure (the current version is 5.4 build 54). When connecting to MSCF's machines, the authentication method must be set to "Keyboard Interactive". Alternatively, you can use PuTTY on Win32 platforms.
- Start F-Secure or PuTTY
- Set Host name to mpp2e.emsl.pnl.gov
- Set User Name to your Username on MPP2
- Set Authentication method to "Keyboard Interactive" (the default for PuTTY)
- Click on 'Connect' {F-Secure} or 'Open' {PuTTY}
- when prompted for the passcode, enter your PIN and SecurID® number
- once logged in on mpp2e, enter the command: ssh mpp2
- when prompted for your password, enter your Kerberos password for MPP2
File Systems [top]
There are four file systems available on the cluster:
- Local file system mounted as /scratch on each of the compute nodes. This is a non-persistent storage area provided to a parallel job running on that node. After the job terminates, the data may be removed immediately by the next job that is assigned the node.
- NFS file system mounted as /home. This is where user home directories and files are located. This file system uses RAID-5 for reliability. Nightly backups are scheduled.
- Global file system mounted as /dtemp. This is where users should put restart files and files needed for post analysis. The file system has an aggregate write rate of 3.2 Gbyte/s. This is long-term global scratch space; files older than 60 days will be deleted. /dtemp does NOT get backed up.
- AFS file system mounted as /msrc on the front-end (non-compute) nodes. You can access your AFS files if you have an EMSL AFS account. The AFS commands (klog,tokens,fs,pts) should already be in your path.
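As an illustration, a job might stage its input to the node-local /scratch area and keep restart files on the global /dtemp space. The paths and file names below are hypothetical; this is only a sketch of the intended usage pattern:

```shell
# Hypothetical staging pattern for the MPP2 file systems.

# Stage the input from /home to the node-local scratch disk:
cp ~/project/input.dat /scratch/input.dat

# Keep restart files on the global /dtemp scratch space, which is
# visible from every node (files older than 60 days are deleted,
# and /dtemp is NOT backed up):
cp /scratch/restart.dat /dtemp/$USER/restart.dat
```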
Environment [top]
Software development and application runs require a matched set of compilers, communication libraries, math libraries, and tools; these pieces are not interchangeable with other pieces in the software development suite and are updated regularly. To make software more supportable and environment setup more automatic, i.e. to increase the ease of use for the user community, we have adopted "modules" as a way to present packages of software that work together and to describe the required dependencies among software packages.
Environment setup through modules [top]
The loading of a module environment provides the user with the correct paths to commands, compilers, and libraries, and sets up the necessary environment variables. The default module environment, which uses integer*8 (i.e. -i8) as the default integer size, is loaded at login time.
Various commands are available to probe your environment and to switch, add, or change (pieces of) the user environment:
- module help provides list of available module commands
- module list prints modules currently loaded
(We recommend that users add this command to their job submission scripts as a way of recording the runtime environment)
- mpp_modules shows the available "pnnl_env" module packages, which provide the user with a complete software development environment of compilers and libraries that have been tested to work together
- module avail provides an extensive list of available compilers, libraries, etc. No guarantee can be given that each of these packages will work in conjunction with the others, other than those listed in pnnl_env and some modules listed below
- module add|load adds a new module to the user's environment
- module switch swaps one module for another in the user's environment
- module rm|unload removes a module from the user's environment
- module purge removes all modules from the user's environment
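A typical session using the commands above might look like the following (the commands are those listed; the order is only illustrative):

```shell
# Show what is currently loaded (useful to record in job scripts):
module list

# See which tested "pnnl_env" packages are available:
mpp_modules

# Switch from the default -i8 environment to the -i4 variant:
module swap pnnl_env pnnl_env/i4

# Remove everything and start from a clean slate:
module purge
```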
Notes using modules
- By default the environment is set up to use integer*8 (-i8) in Fortran. If your code requires integer*4 (-i4), you need to switch to the pnnl_env/i4 module: module swap pnnl_env pnnl_env/i4
- The pre-May environment with the Intel 7.1 compiler and integer*4 as a default is still available for a limited time to enable backward compatibility. Users are strongly urged to recompile their codes with the Intel 8.1 compiler and the current default environment. You can access the old environment by typing: module swap pnnl_env pnnl_env/old
- Users who use non-default modules listed by "module avail" should make sure those libraries are loaded in their job script before running the binary
- The default environment variables MPI_INCLUDE, MPI_LIB, and MLIB_LIB are set by the modules and can be used in makefiles. Environment variables (for example LD_LIBRARY_PATH) set in your .cshrc, .bashrc, etc. will override the defaults set by "modules" at login. Users are strongly encouraged to remove unnecessary variables from these files. Additional libraries are available for expert users.
- Normally the module command sets up environment variables that will be inherited by subsequent scripts. If you find that you need to run module commands from within a job submission script (which is the case if you are not using the default environment), you may find that the module command isn't available. To be able to use the module commands from within a script, you may need to first source the initialization package:
- If you're a bash shell user you should use: . /home/mscf/sw/modules/init/bash
- If you're a csh user you should use: source /home/mscf/sw/modules/init/csh
- Modules are available for libraries such as PETSc and the Global Array Tools
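Putting the notes above together, a csh job script that needs a non-default module might begin with something like this (the module swap shown is the -i4 example from the notes):

```shell
#!/bin/csh
# Make the module command available inside the batch script:
source /home/mscf/sw/modules/init/csh

# Switch to the integer*4 environment before running the binary:
module swap pnnl_env pnnl_env/i4

# Record the runtime environment in the job output:
module list
```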
Compilers [top]
The primary compilers are Intel's ifort (for Fortran) and icc (for C). The current version, 8.1, is installed on the system. The following compiler options and libraries will enhance the performance of the codes you compile:
- -tpp2 tells the compiler to optimize for Intel's Itanium-2 processor
- -O2 is the default optimization. -O3 might or might not improve the performance, you'll have to test your own code to see what works best.
- -i4 or -i8 defines the default length of the integers used. -i8 is the default option in the default environment. If you have a need to use -i4, please read the modules documentation on how to set the appropriate environment.
- -r4 or -r8 defines the default length of the reals used. -r8 is the default option.
- -Vaxlib, a portability library, should be included at the linking stage.
- HP's MLIB libraries contain optimized subroutines including all BLAS 1, 2, and 3 subroutines, sparse BLAS subroutines, a collection of commonly used dense and sparse linear system solvers, including LAPACK. To use these fast libraries you will have to link them in the following manner:
- 32-bit (or -i4): -L$(MLIB_LIB) -lveclib -llapack -lguide -lpthread
- 64-bit (or -i8): -L$(MLIB_LIB) -lveclib8 -llapack8 -lguide -lpthread
The MLIB_LIB environment variable is set by the module environment to point to the appropriate location.
HP MLIB for Linux contains SMP parallelism. The implementation is based on OpenMP calls and, therefore, the option "-openmp" is required when linking MLIB (VECLIB or LAPACK). To enable parallelism, users must set the environment variable MLIB_NUMBER_OF_THREADS to a number greater than 1. The variable OMP_NUM_THREADS effectively acts as an upper bound on the parallelism.
For NWChem a good choice is to use:
- setenv OMP_NUM_THREADS 1
- setenv MLIB_NUMBER_OF_THREADS 1
Some other applications might want to use the multi-threaded capability of MLIB and, hence, the environment variables will have to be set to higher values as discussed above.
More information about the HP MLIB library can be found on the HP website.
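For example, a 64-bit (-i8) Fortran build against MLIB, following the link lines given above, might look like this (the source and program names are hypothetical):

```shell
# Compile and link a Fortran program against HP MLIB (64-bit integers).
# -openmp is required when linking MLIB; MLIB_LIB is set by modules.
ifort -tpp2 -i8 -O2 -openmp -o myprog myprog.f \
      -L$MLIB_LIB -lveclib8 -llapack8 -lguide -lpthread -Vaxlib

# Run MLIB serially within each process (the NWChem recommendation):
setenv OMP_NUM_THREADS 1
setenv MLIB_NUMBER_OF_THREADS 1
```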
Additional options can be found in the man pages, by typing "ifort -help", or in the Intel Online Compiler Documentation. One additional option that might be important for users who transfer binary files between systems (to SGIs, for example) is the environment variable that forces the code to read and write Big Endian binary files:
setenv F_UFMTENDIAN big
Intel's idb parallel debugger is available on the system. The GNU gdb debugger can be used to debug individual processes of a parallel program on each processor. In addition the TotalView debugger (find it at /home/scicons/apps/totalview.6.3.1-0) and the Vampir performance analyzer (find it at /home/scicons/apps/vampir) are available for debugging and performance enhancement purposes.
MPI [top]
The primary communication protocol for running parallel jobs is MPI. The MPI libraries, based on MPICH, have been implemented by Quadrics on top of the Elan3 (or Elan4) interconnect. There are a number of ways you can compile your parallel codes:
- Using mpicc, mpif77, or mpif90
To compile MPI programs with the Intel compilers, one can use mpicc, mpif77, or mpif90. Using these compiler scripts ensures that the proper MPI libraries and include files are included at compile time.
- Using the Intel compiler and setting your own flags, paths, and libraries
Setting up your own compiler and linker structure requires you to use the default environment variables that have been set by modules. The following lines will provide the correct include and library files:
- -I$(MPI_INCLUDE) will provide a path to the necessary include files.
- -L$(MPI_LIB) -lmpifarg -lmpi -lelan will link in all the necessary libraries.
The environment variables MPI_INCLUDE and MPI_LIB are set by the module environment to point to the appropriate locations.
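The two compilation routes described above might look like this in practice (the source and program names are hypothetical):

```shell
# Route 1: let the wrapper script supply the MPI paths and libraries:
mpif90 -o hello_mpi hello_mpi.f90

# Route 2: call the Intel compiler directly, using the module-provided
# environment variables for the include and library paths:
ifort -I$MPI_INCLUDE -o hello_mpi hello_mpi.f90 \
      -L$MPI_LIB -lmpifarg -lmpi -lelan
```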
Job Submission [top]
Platform's LSF is the batch scheduler and resource manager used to submit and run jobs. Its commands are very similar to those of NQS and PBS; for example, bsub, bjobs, and bkill work similarly to the PBS commands qsub, qstat, and qdel. These three commands, along with showq, rinfo, and window, are probably the only batch commands you will ever use on MPP2. The format of the job submission script is discussed in the next section.
To submit a LSF jobfile:
- % bsub < jobfile
To view the LSF queue:
- % bjobs
- Note: by default only your jobs are shown. Use "bjobs -u all" to see all submitted jobs.
- Note: "bjobs -l" gives you a more detailed view of your job.
An alternative, easier-to-read view of the LSF queue:
- % showq
To remove a jobfile from the queue:
- % bkill <jobid>
- Note: the jobid can be found from bjobs or showq. The jobids shown by rinfo should not be used for this purpose.
An overview of the processor status can be obtained from:
- % rinfo
- Note: a summary of the system can be obtained using "rinfo -nl".
Check how many processors are available:
- % window
Submitting NWChem batch jobs [top]
When running NWChem calculations, users are encouraged to submit their jobs through the llnw script (available at /home/scicons/bin/llnw). This script sets up the job script and the running environment, and makes sure the appropriate files get copied from and to your working directory.
Sample Script for Batch Jobs [top]
Here is a csh example of an LSF jobfile for submitting a batch parallel job. Replace the placeholder items (in angle brackets) with your own account and job information.
#!/bin/csh
#BSUB -P <account>
#BSUB -n <number of processors>
#BSUB -m <type of processors>
#BSUB -W 4:00
#BSUB -J <jobname>
#BSUB -i <input file>
#BSUB -e sample.err.%J
#BSUB -o sample.out.%J
#BSUB -u your_email@pnl.gov
#BSUB -N
#############################################################################
# Copy files to /scratch (if necessary)
#############################################################################
foreach host ($LSB_HOSTS)
  rcp <your file> ${host}:/scratch/<your file>
end
#############################################################################
# Run code (or multiple codes by repeating the prun command)
#############################################################################
prun -n <number of processors> your_program.x
#############################################################################
# Copy back important files to working_directory
#############################################################################
foreach host ($LSB_HOSTS)
  rcp ${host}:/scratch/<file to be copied> <your working dir>/<file to be copied>
end
The bsub options in the script above will be discussed briefly:
- -P <account> specifies the account name to be used to run the calculations. A default can be set via the environment variable LSB_DEFAULTPROJECT. Note: This default overrules any option that is set by hand by the user.
- -n <##> specifies the number (##) of processors the job should be run on.
- -m <type> specifies the type of processor you want to run on, i.e. FatNodes (8 Gbyte RAM + 430 Gbyte local scratch). If the processor choice is not important, this line can be left out of the job script.
- -W <HH:MM> specifies the wall clock time for the job.
- -J <jobname> assigns the jobname to your job. Note: no spaces can be used in the job name specification.
- -i <inputfile> defines the name of the standard input file used by the job. The default is standard input, so this parameter only needs to be defined when an alternative input source is needed.
- -e <sample.err.%J> defines the name of the error file produced by the job.
- -o <sample.out.%J> defines the name of the output file produced by the job. %J will attach the job ID, or when a jobname has been specified the jobname will be attached.
- -u <address> specifies the email address to send job information to. A default can be set via the environment variable LSB_MAILTO.
- -N specifies that an email should be sent when the job is finished.
There are many more options that can be specified; for those, please read the man pages of the bsub command.
The prun command in the job script specifies the parallel run. The options are:
- -n <number of processors>. This option specifies the number of processors you would like to run on. This number should be less than or equal to the number specified in #BSUB -n. When this option is absent, prun assumes the value specified by #BSUB -n.
- -N <number of nodes>. This option specifies the number of nodes you would like to run on and should not be used except in special cases (for example, if you would like to run on 1 processor per node). This number should be less than or equal to the number specified in #BSUB -n divided by 2. Do remember that the combination of -n and -N should make sense.
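For example, on an allocation of 8 processors (#BSUB -n 8), the following prun invocations would be consistent with the rules above (the binary name follows the sample script):

```shell
# Use all 8 allocated processors (equivalent to omitting -n):
prun -n 8 your_program.x

# Run one process per node: 4 processes spread over 4 dual-processor
# nodes, so -N is at most 8/2 = 4:
prun -n 4 -N 4 your_program.x
```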
Alternative ways can be used to copy files to the scratch disks. For example, if the same file needs to be copied to all nodes, one could use the following line:
pdsh -f 30 -w `nodes c $LSB_HOSTS` cp <your file> /scratch/<your file>
Running Interactive Jobs [top]
MPP2 has a pseudo-interactive queue of 32 processors available for software testing and debugging. The per-job limit in this queue is 8 processors and 30 minutes of run time. To start an interactive job, use the following command:
- % bsub -n <# proc> -W <4:00> -P <account> -Is <csh>
Except for -Is, the options are described above. The -Is flag specifies that an interactive job should be started; its argument is the type of interactive shell you would like opened (i.e. csh or bash).
After obtaining the processors, you can start a parallel job using the prun command.
Note: this queue is a pseudo-interactive queue. Nodes are obtained from LSF in the same way batch jobs are. This means that there could be a delay in your interactive job actually starting (due to processors not being available or multiple people waiting for interactive nodes). In general interactive processors should be available in 30 minutes.
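A complete interactive session might therefore look like this (the account name and processor count are illustrative):

```shell
# Request 8 processors for 30 minutes in the interactive queue:
bsub -n 8 -W 0:30 -P my_account -Is csh

# ... wait for LSF to allocate the nodes; a csh prompt then appears ...

# Inside the interactive shell, launch the parallel run:
prun -n 8 your_program.x
```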
Time Allocation Accounts [top]
The Time Allocation Account needed to submit both batch and interactive jobs can best be seen as a bank account holding the CPU hours that have been allotted to the project you are involved in. The name of the account can be obtained from your project PI, or by typing gbalance -h. This command will show you the account name and the number of hours available on this account to you and the other users on the account. If no accounts are shown, please contact the MSCF Consulting team. Some users are involved in multiple projects and have multiple account names to choose from. Please make sure you use the appropriate account for the job you are planning to submit. If you are not sure which account to use, please contact your PI.
Job Policies [top]
The primary objective of the MSCF is to provide teraflop computing resources for grand-challenge computational problems. The job scheduling policy has been established to place a higher priority on effective throughput and turnaround of large jobs. To maximize system flexibility, all jobs are submitted to a single queue. The job scheduler controls the allocation of compute processors to the user's job and will place the job in one of the four available queues: short, normal, large, and idle. This allocation is governed by a number of policy constraints, all of which must be satisfied. For more information on MSCF policies, please see User Policies.
Job Policy Constraints:
- Maximum number of running jobs: 3
- Maximum number of queued jobs: 8
- Maximum number of processors per job: 1800
- Minimum number of processors per job: 8
There is also a set of default values that limit the time a single job with a particular number of processors can use; these are shown below.
Number of Processors in a Single Job | Time Limit | Notes |
---|---|---|
512 - 1800 | 48 wall clock hours | These jobs will be placed ahead of the jobs in the queues below, i.e. they will receive highest priority. |
256-512 | 48 wall clock hours | These jobs will be placed ahead of the jobs in the queues below, i.e. they will receive higher priority. |
33 - 255 | 48 wall clock hours | Normal priority jobs. Note that many of these jobs will backfill with the large jobs in the larger queues. |
8 - 32 | 36 wall clock hours | Normal priority jobs. Note that many of these jobs will backfill with the large jobs in the larger queues. |
1-8 | 30 minutes | Test / Interactive queue, the 32 processors in this queue are reserved on the ThinNodes only. |
Idle queue:
The idle queue provides an opportunity for projects that have exhausted their regular allocation to use processors that are idle on the machine. The primary purpose of this queue is to increase machine usage and to help such projects get some computations done. The only limit on the idle queue is that jobs must run for 90 minutes or less. Time used in the idle queue is tracked in the GOLD accounting system and is designated as a "Charge Limit" for the project. Projects that qualify for the idle queue will have time assigned in the "CreditLimit" column of the "gbalance -h -u <UserID>" command. If your job needs to create a restart file, be sure it gets written before the 90-minute window terminates. To submit a job to the idle queue, include the directive #BSUB -q "idle" in your job script. To see jobs in the idle queue you will need the -x flag for showq; for example, "showq -x | grep <UserID>" will find all of your jobs in all queues on MPP2. Send an e-mail to mscf-consulting@emsl.pnl.gov for more information or with any questions.
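A minimal idle-queue jobfile, following the sample script format earlier in this document, might contain something like the following (placeholders as before):

```shell
#!/bin/csh
#BSUB -P <account>
#BSUB -n 8
#BSUB -q "idle"
# Stay under the 90-minute idle-queue limit:
#BSUB -W 1:30

# Make sure any restart file is written before the window closes:
prun -n 8 your_program.x
```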
SIGHTS special purpose queue:
In addition to the queue limitations mentioned above, users can request access to a special purpose queue called Scientific Impact Generated by High Teraflop Simulations (SIGHTS). The SIGHTS queue is for compute jobs that require resources beyond the normal queue limits for MPP2 and that serve uniquely impactful, cutting-edge PNNL/EMSL mission science opportunities which cannot be performed at any other computing facility. SIGHTS jobs should require the use of 1024 processors or more, up to the capacity of MPP2. SIGHTS jobs are not automatically set in the MPP2 queue. SIGHTS jobs can be submitted anytime after approval and will be tended by an MSCF scientific consultant and operations personnel to assist in successful job completion. Requests to run during the week of a monthly outage must be submitted by 12 noon on the Thursday before the scheduled outage.
Access to the SIGHTS queue is by request only and is subject to time availability. All requests are submitted to the MSCF consulting group for review; please use the keyword "SIGHTS". In your request, please provide a short (one to two page) description of what you plan to do and how you plan to do it. Upon receipt of the request, a consultant will be assigned to the job. The consultant will work with the users to be sure the job is ready, and the consultants and operations staff will watch all SIGHTS jobs to be sure they are running correctly. Details about SIGHTS jobs:
- Minimum number of processors is 1024.
- No time limit.
- Jobs can be started at any time, or after a scheduled system outage.
- These jobs incur no allocation cost; all hours charged to the job will be returned to the user.
Short pool:
The short pool of 16 reserved ThinNode processors allows users to run small, short jobs to test or debug their codes. Interactive or test jobs are limited to a maximum of 8 processors and a 30-minute time limit per job. Note: the reserved processors in this pool are ThinNodes; hence, if you request FatNodes in your job, you will have to wait for FatNode processors to become available.
These constraints are used as system default values. If you require resources beyond these limits (more processors, longer run times), please have your Principal Investigator contact the MSCF Computer Projects Manager, and the appropriate user account can be configured with exceptions to override the default values.
FAQ [top]
Below is a list of frequently asked questions. This list will grow as more user questions arise.
- Error message "Bad packet length 1349676916" at login.
The "Bad packet length" error message is generated when the user tries to log in using a secure shell version that uses protocol 1. On MPP2 the user is required to log in with protocol version 2, using ssh (with protocol 2 as the default), ssh2, or ssh -2.
We invite you to log in, exercise the system and report any problems/issues that you have with the machine.
For application software, hardware and/or system software questions/problems please contact the MSCF-consulting group through the web mscf-consulting form (internal users only) or send email to mscf-consulting@emsl.pnl.gov.