Bassi Benchmark Code Profiles
NERSC uses a suite of codes to monitor, compare, and verify performance on Bassi. Some characteristics of the codes, measured on Bassi, follow. The measurements quoted below were made using profiled and instrumented runs, so absolute performance is somewhat lower than for uninstrumented runs.
- NPB Serial Benchmarks: FT, MG, SP
- NPB Parallel Benchmarks: FT, MG, SP
- Application Benchmarks: CAM 3.0, GTC, PARATEC
- Component Benchmarks: MEMRATE, PIORAW, FFTW Full Configuration Test
Serial Benchmarks
NPB 2.3 FT CLASS B
Reference: NASA Ames Research Center
Metric | Value
---|---
Approximate wallclock run time | 100 sec
Maximum memory | 1,792 MB
Instructions per load/store | 2.09
Algebraic floating point operations | 94,125 M
Floating point instructions/all instructions | 0.35
Approximate floating point performance | 1,000 MFlops/sec
Approximate floating point percent of peak | 13%
NPB 2.3 MG CLASS B
Reference: NASA Ames Research Center
Metric | Value
---|---
Approximate wallclock run time | 15 sec
Maximum memory | 512 MB
Instructions per load/store | 1.98
Algebraic floating point operations | 21,526 M
Floating point instructions/all instructions | 0.34
Approximate floating point performance | 1,400 MFlops/sec
Approximate floating point percent of peak | 18%
NPB 2.3 SP CLASS B
Reference: NASA Ames Research Center
Metric | Value
---|---
Approximate wallclock run time | 653 sec
Maximum memory | 512 MB
Instructions per load/store | 1.82
Algebraic floating point operations | 359,390 M
Floating point instructions/all instructions | 0.33
Approximate floating point performance | 550 MFlops/sec
Approximate floating point percent of peak | 7%
Parallel Benchmarks
NPB 2.4 FT CLASS D
Reference: NASA Ames Research Center
Solves a 3-D partial differential equation using FFTs. This kernel captures the essence of many "spectral" codes and is a rigorous test of long-distance communication performance.
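The spectral approach the FT kernel exercises can be sketched in one dimension: transform the right-hand side to Fourier space, scale each mode by the inverse of the operator's symbol, and transform back. The sketch below solves the periodic Poisson problem u'' = f with a naive pure-Python DFT; the function names are illustrative, and the benchmark itself of course runs distributed 3-D FFTs across many tasks.

```python
# 1-D spectral Poisson solve: u'' = f with periodic boundary conditions.
# Forward DFT, divide mode k by -(k)^2, inverse DFT. Naive O(n^2) DFT
# for clarity; FT uses fast O(n log n) FFTs.
import cmath, math

def dft(x, sign=-1):
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * math.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def poisson_periodic(f, length=2 * math.pi):
    n = len(f)
    fhat = dft(f, -1)
    uhat = [0j] * n
    for k in range(1, n):                  # k = 0 mode: mean fixed to zero
        kk = k if k <= n // 2 else k - n   # signed wavenumber
        uhat[k] = -fhat[k] / (kk * 2 * math.pi / length) ** 2
    u = dft(uhat, +1)
    return [v.real / n for v in u]         # inverse-DFT normalization

# usage: f = -sin(x) gives u = sin(x)
n = 32
xs = [2 * math.pi * j / n for j in range(n)]
u = poisson_periodic([-math.sin(x) for x in xs])
```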
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 186 sec | | |
Memory (GB) | 120.6 | 1.884 | 1.768 | 2.008
Instructions per load/store | 1.95 | | |
Algebraic floating point operations | 9.49x10^12 | 1.48x10^11 | 1.48x10^11 | 1.48x10^11
Floating point instructions/all instructions | 0.36 | | |
Approximate floating point performance (Gflops/sec) | 51.0 | 0.797 | |
Approximate floating point percent of peak | 10.5% | | |
Percent of total run time spent in MPI routines | | 23.2% | 19.5% | 24.6%
MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Alltoall | 2.704e+03 | 4.225e+01 | 3.33 | 3.633e+01 | 4.440e+01 | 97.790 | 22.720 |
MPI_Reduce | 5.984e+01 | 9.350e-01 | 128.15 | 8.349e-04 | 5.181e+00 | 2.164 | 0.503 |
MPI_Barrier | 1.195e+00 | 1.868e-02 | 61.12 | 4.625e-05 | 4.492e-02 | 0.043 | 0.010 |
MPI_Bcast | 7.341e-02 | 1.147e-03 | 18.59 | 3.123e-05 | 1.378e-03 | 0.003 | 0.001 |
NPB 2.4 MG CLASS D
Reference: NASA Ames Research Center
A simplified multigrid kernel. It requires highly structured long-distance communication and tests both short- and long-distance data communication. The V-cycle multigrid algorithm is used to obtain an approximate solution to the discrete Poisson problem on a grid with periodic boundary conditions.
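The V-cycle idea can be sketched in one dimension: smooth, restrict the residual to a coarser grid, solve there recursively, then interpolate the correction back and smooth again. The sketch below uses a 1-D Dirichlet problem -u'' = f for brevity (the benchmark works on a 3-D periodic grid), with hypothetical helper names.

```python
# Minimal 1-D multigrid V-cycle for -u'' = f, Dirichlet boundaries.
def smooth(u, f, h, sweeps=2):
    # Gauss-Seidel on the standard 3-point stencil (interior points only)
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u

def residual(u, f, h):
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def v_cycle(u, f, h):
    n = len(u) - 1
    if n == 2:                       # coarsest grid: one unknown, solve exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    smooth(u, f, h)                  # pre-smoothing
    r = residual(u, f, h)
    nc = n // 2
    rc = [0.0] * (nc + 1)            # full-weighting restriction
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    ec = v_cycle([0.0] * (nc + 1), rc, 2.0 * h)
    for i in range(1, n):            # linear interpolation of the correction
        if i % 2 == 0:
            u[i] += ec[i // 2]
        else:
            u[i] += 0.5 * (ec[i // 2] + ec[i // 2 + 1])
    return smooth(u, f, h)           # post-smoothing

# usage: -u'' = 1 on [0,1]; exact discrete solution is x(1-x)/2
n = 64
h = 1.0 / n
u = [0.0] * (n + 1)
for _ in range(40):
    u = v_cycle(u, [1.0] * (n + 1), h)
```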
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 40 sec | | |
Memory (GB) | 48.5 | 0.758 | 0.758 | 0.758
Instructions per load/store | 1.97 | | |
Algebraic floating point operations | 3.27x10^12 | 5.11x10^10 | 5.11x10^10 | 5.11x10^10
Floating point instructions/all instructions | 0.34 | | |
Approximate floating point performance (Gflops/sec) | 81.7 | 1.27 | |
Approximate floating point percent of peak | 16.8% | | |
Percent of total run time spent in MPI routines | | 4.52% | 3.7% | 5.1%
MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Send | 9.225e+01 | 1.441e+00 | 7.57 | 1.237e+00 | 1.688e+00 | 78.676 | 3.559 |
MPI_Wait | 2.265e+01 | 3.540e-01 | 23.35 | 2.130e-01 | 5.721e-01 | 19.320 | 0.874 |
MPI_Irecv | 1.111e+00 | 1.736e-02 | 9.69 | 1.403e-02 | 2.181e-02 | 0.948 | 0.043 |
MPI_Barrier | 6.087e-01 | 9.510e-03 | 28.95 | 2.966e-03 | 1.401e-02 | 0.519 | 0.023 |
MPI_Allreduce | 5.539e-01 | 8.654e-03 | 13.16 | 6.693e-03 | 1.197e-02 | 0.472 | 0.021 |
MPI_Bcast | 7.558e-02 | 1.181e-03 | 11.95 | 7.010e-05 | 1.206e-03 | 0.064 | 0.003 |
MPI_Reduce | 8.214e-04 | 1.283e-05 | 47.53 | 8.345e-06 | 4.220e-05 | 0.001 | 0.000 |
NPB 2.4 SP CLASS D
Reference: NASA Ames Research Center
Three sets of uncoupled systems of equations are solved, first in x, then in y, and finally in z. The systems are scalar pentadiagonal. In the multi-partition algorithm used, each processor is responsible for several disjoint sub-blocks of points ("cells") of the grid. The cells are arranged such that for each direction of the line solve phase the cells belonging to a certain processor will be evenly distributed along the direction of solution. This allows each processor to perform useful work throughout a line solve, instead of being forced to wait for the partial solution to a line from another processor before beginning work. Additionally, the information from a cell is not sent to the next processor until all sections of linear equation systems handled in this cell have been solved. Therefore the granularity of communications is kept large and fewer messages are sent.
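The line solves at the heart of SP invert scalar pentadiagonal systems. A minimal serial sketch of one such solve by Gaussian elimination (no pivoting, assuming diagonal dominance; the benchmark's Fortran solver works cell-by-cell across the multi-partition decomposition, which this single-system sketch does not attempt to show):

```python
# Solve a scalar pentadiagonal system by banded Gaussian elimination.
def penta_solve(e, c, d, a, b, rhs):
    # bands of row i: e[i] (2nd sub), c[i] (1st sub), d[i] (main),
    # a[i] (1st super), b[i] (2nd super); out-of-range entries unused
    n = len(d)
    e, c, d, a, b, rhs = (list(v) for v in (e, c, d, a, b, rhs))
    for i in range(n - 1):                      # forward elimination
        m = c[i + 1] / d[i]
        d[i + 1] -= m * a[i]
        a[i + 1] -= m * b[i]
        rhs[i + 1] -= m * rhs[i]
        if i + 2 < n:                           # zero the 2nd sub-diagonal too
            m = e[i + 2] / d[i]
            c[i + 2] -= m * a[i]
            d[i + 2] -= m * b[i]
            rhs[i + 2] -= m * rhs[i]
    x = [0.0] * n                               # back substitution
    x[n - 1] = rhs[n - 1] / d[n - 1]
    x[n - 2] = (rhs[n - 2] - a[n - 2] * x[n - 1]) / d[n - 2]
    for i in range(n - 3, -1, -1):
        x[i] = (rhs[i] - a[i] * x[i + 1] - b[i] * x[i + 2]) / d[i]
    return x

# usage: 3x3 diagonally dominant system with known solution [1, -2, 3]
x = penta_solve(e=[0.0, 0.0, 0.3], c=[0.0, 1.0, 1.0], d=[4.0, 4.0, 4.0],
                a=[1.0, 1.0, 0.0], b=[0.3, 0.0, 0.0],
                rhs=[2.9, -4.0, 10.3])
```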
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 780 sec | | |
Memory (GB) | 32.6 | 0.509 | 0.509 | 0.509
Instructions per load/store | 1.81 | | |
Algebraic floating point operations | 3.23x10^13 | 5.05x10^11 | 5.05x10^11 | 5.05x10^11
Floating point instructions/all instructions | 0.33 | | |
Approximate floating point performance (Gflops/sec) | 41.5 | 0.648 | |
Approximate floating point percent of peak | 8.5% | | |
Percent of total run time spent in MPI routines | | 3.25% | 2.9% | 3.6%
MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Waitall | 1.586e+03 | 2.479e+01 | 4.82 | 2.197e+01 | 2.711e+01 | 97.736 | 3.182 |
MPI_Isend | 3.133e+01 | 4.895e-01 | 12.05 | 3.930e-01 | 6.290e-01 | 1.930 | 0.063 |
MPI_Irecv | 4.352e+00 | 6.800e-02 | 5.77 | 5.827e-02 | 7.815e-02 | 0.268 | 0.009 |
MPI_Bcast | 4.518e-01 | 7.060e-03 | 12.60 | 5.317e-05 | 7.184e-03 | 0.028 | 0.001 |
MPI_Allreduce | 3.444e-01 | 5.381e-03 | 38.75 | 1.118e-03 | 1.071e-02 | 0.021 | 0.001 |
MPI_Barrier | 2.639e-01 | 4.124e-03 | 18.51 | 4.942e-04 | 5.381e-03 | 0.016 | 0.001 |
MPI_Reduce | 1.342e-03 | 2.096e-05 | 157.37 | 9.775e-06 | 1.657e-04 | 0.000 | 0.000 |
GTC (Gyrokinetic Toroidal Code)
Reference: Princeton Plasma Physics Lab Theory Department
The Gyrokinetic Toroidal Code (GTC) was developed to study the dominant mechanism for energy transport in fusion devices, namely, plasma microturbulence. GTC solves the gyroaveraged Vlasov-Poisson system of equations using the particle-in-cell (PIC) approach. This method makes use of particles to sample the distribution function of the plasma system under study. The particles interact with each other only through a self-consistent field described on a grid such that no binary forces need to be calculated. The equations of motion to be solved for the particles are simple ordinary differential equations and are easily solved using a second order Runge-Kutta algorithm.
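The second-order Runge-Kutta step used for the particle equations of motion can be sketched with the midpoint rule; this is a generic scalar-ODE illustration, not GTC's actual field equations:

```python
# One midpoint-rule (second-order Runge-Kutta) step for dy/dt = f(t, y).
import math

def rk2_step(f, t, y, dt):
    k1 = f(t, y)                                # slope at the start
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)     # slope at the midpoint
    return y + dt * k2

# usage: integrate dy/dt = y from t = 0 to t = 1; result approaches e
y, t, dt = 1.0, 0.0, 0.001
for _ in range(1000):
    y = rk2_step(lambda t, y: y, t, y, dt)
    t += dt
```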
The main tasks of the PIC method at each time step are as follows: The charge of each particle is distributed among its nearest grid points according to the current position of that particle; this is called the scatter operation. The Poisson equation is then solved on the grid to obtain the electrostatic potential at each point. The force acting on each particle is then calculated from the potential at the nearest grid points; this is the "gather" operation. Next, the particles are "moved" by using the equations of motion. These steps are repeated until the end of the simulation.
GTC has been highly optimized for cache-based super-scalar machines such as the IBM SP. The data structure and loop ordering have been arranged for maximum cache reuse, which is the most important method of achieving higher performance on this type of processor. In GTC, the main bottleneck is the charge deposition, or scatter operation, mentioned above, and this is also true for most particle codes. The classic scatter algorithm consists of a loop over the particles, finding the nearest grid points surrounding each particle position. A fraction of the particle's charge is assigned to the grid points proportionally to their distance from the particle's position. The charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex since one is dealing with fast gyrating particles whose motion is described by charged rings tracked by their guiding center. This results in a larger number of operations, since several points are picked on the rings and each of them has its own neighboring grid points.
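The classic scatter loop described above can be sketched in 1-D with linear ("cloud-in-cell") weighting: each particle deposits its charge on the two grid points bracketing it, in proportion to its distance from each. GTC's actual deposition from several points on each gyro-ring is more elaborate; the function name here is illustrative.

```python
# 1-D charge deposition (scatter) with linear weighting on a periodic grid.
def deposit_charge(positions, charge, n_grid, length):
    rho = [0.0] * n_grid
    h = length / n_grid
    for x in positions:
        s = (x % length) / h                 # position in grid units
        j = int(s)                           # left grid point
        w = s - j                            # fractional distance to it
        rho[j] += charge * (1.0 - w)         # nearer point gets the larger share
        rho[(j + 1) % n_grid] += charge * w  # periodic wrap for the right point
    return rho

# usage: a particle midway between points 1 and 2 splits its charge evenly
rho = deposit_charge([1.5], 1.0, 4, 4.0)
```

Note that total charge is conserved exactly, since the two weights sum to one for every particle.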
GTC Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 164 sec | | |
Memory (GB) | 17.0 | 0.266 | 0.263 | 0.274
Instructions per load/store | 2.51 | | |
Algebraic floating point operations | 7.06x10^12 | 1.10x10^11 | 1.09x10^11 | 1.12x10^11
Floating point instructions/all instructions | 0.28 | | |
Approximate floating point performance (Gflops/sec) | 43.1 | 0.673 | |
Approximate floating point percent of peak | 8.9% | | |
Percent of total run time spent in MPI routines | | 5.22% | 2.3% | 8.5%
GTC MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Allreduce | 2.676e+02 | 4.182e+00 | 54.98 | 9.639e-01 | 8.752e+00 | 48.877 | 2.551 |
MPI_Sendrecv | 2.639e+02 | 4.124e+00 | 65.22 | 1.625e+00 | 1.153e+01 | 48.204 | 2.516 |
MPI_Bcast | 8.042e+00 | 1.257e-01 | 12.67 | 3.002e-04 | 1.278e-01 | 1.469 | 0.077 |
MPI_Gather | 7.763e+00 | 1.213e-01 | 8.06 | 1.077e-01 | 1.524e-01 | 1.418 | 0.074 |
MPI_Reduce | 1.714e-01 | 2.678e-03 | 108.97 | 6.495e-04 | 8.418e-03 | 0.031 | 0.002 |
MPI_Recv | 9.179e-04 | 1.434e-05 | 0.00 | 9.179e-04 | 9.179e-04 | 0.000 | 0.000 |
MPI_Send | 6.447e-04 | 1.007e-05 | 417.04 | 5.484e-06 | 5.889e-05 | 0.000 | 0.000 |
PARATEC
PARATEC (Parallel Total Energy Code) performs ab-initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. The pseudopotentials are of the standard norm-conserving variety (typically Hamann-Schlüter-Chiang or Troullier-Martins). Forces and stress can be easily calculated with PARATEC and used to relax the atoms into their equilibrium positions. The total energy minimization of the electrons is performed by the self-consistent field (SCF) method using Broyden, Pulay-Kerker or the newly developed Pulay-Kerker-Thomas-Fermi charge mixing schemes. In the SCF method, the electronic minimization is performed by unconstrained conjugate gradient algorithms on the Mauri-Galli-Ordejon (MGO) functional or the generalized Rayleigh quotient.
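The SCF loop iterates the charge density to self-consistency, and charge mixing damps each update to keep the iteration stable. A toy sketch of the fixed-point idea with simple linear mixing; the Broyden and Pulay-Kerker schemes named above are more sophisticated accelerations of the same iteration, and the `step` map here is a made-up stand-in for "density in, density out".

```python
# Toy self-consistent-field loop with linear charge mixing.
def scf(step, rho0, alpha=0.4, tol=1e-10, max_iter=200):
    rho = rho0
    for _ in range(max_iter):
        rho_out = step(rho)
        if abs(rho_out - rho) < tol:                 # self-consistency reached
            return rho_out
        rho = (1 - alpha) * rho + alpha * rho_out    # damped (mixed) update
    return rho

# usage: a scalar fixed-point map with solution rho* = 2
rho = scf(lambda r: 0.5 * r + 1.0, rho0=0.0)
```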
PARATEC Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 600 sec | | |
Memory (GB) | 51.0 | 0.796 | 0.761 | 1.258
Instructions per load/store | 3.35 | | |
Algebraic floating point operations | 1.81x10^14 | 2.83x10^12 | 2.819x10^12 | 2.835x10^12
Floating point instructions/all instructions | 0.60 | | |
Approximate floating point performance (Gflops/sec) | 304 | 4.750 | |
Approximate floating point percent of peak | 63% | | |
Percent of total run time spent in MPI routines | | 12.8% | 11.5% | 14.0%
Main PARATEC MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Wait | 2.452e+03 | 3.832e+01 | 4.21 | 3.513e+01 | 4.154e+01 | 49.523 | 6.384 |
MPI_Allreduce | 1.129e+03 | 1.764e+01 | 16.91 | 1.103e+01 | 2.435e+01 | 22.796 | 2.939 |
MPI_Isend | 6.663e+02 | 1.041e+01 | 3.29 | 9.699e+00 | 1.119e+01 | 13.455 | 1.734 |
MPI_Bcast | 3.757e+02 | 5.871e+00 | 6.95 | 2.758e+00 | 6.177e+00 | 7.587 | 0.978 |
MPI_Irecv | 1.438e+02 | 2.248e+00 | 3.82 | 2.067e+00 | 2.378e+00 | 2.905 | 0.374 |
MPI_Recv | 1.362e+02 | 2.128e+00 | 39.99 | 5.677e-01 | 4.699e+00 | 2.750 | 0.354 |
CAM 3.0
The Community Atmosphere Model (CAM) is the latest in a series of global atmosphere models developed at NCAR for the weather and climate research communities. CAM also serves as the atmospheric component of the Community Climate System Model (CCSM).
CAM was configured as follows:
$CAM_ROOT/models/atm/cam/bld/configure -res 128x256 -cam_bld build \
    -cflags='-qlargepage' -fc='mpxlf90_r' \
    -cflags='-qlargepage' -fopt='-qtune=auto -qarch=auto -O3 -strict' \
    -nc_inc="$NETCDF_INCLUDE" -nc_lib="$NETCDF_LIB" -smp -spmd -notest
CAM 3.0 Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 16 | | |
Approximate wallclock run time | 1400 sec | | |
Memory (GB) | 8.63 | 0.539 | 0.523 | 0.780
Instructions per load/store | 2.20 | | |
Algebraic floating point operations | 1.06x10^13 | 6.62x10^11 | 6.238x10^11 | 7.791x10^11
Floating point instructions/all instructions | 0.31 | | |
Approximate floating point performance (Gflops/sec) | 7.54 | 0.471 | |
Approximate floating point percent of peak | 4% | | |
Percent of total run time spent in MPI routines | | 17.2% | 7.4% | 21.6%
Main CAM MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Alltoallv | 2.77e+03 | 1.73e+02 | 27.9 | 8.52e+01 | 2.80e+02 | 71.56 | 12.30 |
MPI_Allgatherv | 8.36e+02 | 5.22e+01 | 48.7 | 2.73e+00 | 6.81e+01 | 21.61 | 3.71 |
MPI_Bcast | 7.87e+01 | 4.92e+00 | 26.1 | 1.38e-01 | 5.40e+00 | 2.03 | 0.35 |
MPI_Gatherv | 7.10e+01 | 4.43e+00 | 11.1 | 2.82e+00 | 5.10e+00 | 1.83 | 0.31 |
MPI_Scatterv | 5.70e+01 | 3.56e+00 | 25.1 | 2.00e-01 | 3.80e+00 | 1.47 | 0.25 |
PIORAW Disk Read/Write Bandwidth Test
The NERSC PIORAW benchmark tests read and write bandwidth, using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB and individual file sizes of 2.5 GB.
Filesystem | MPI Tasks | Nodes | Aggregate Read BW (MB/sec) | Read BW per task (MB/sec) | Aggregate Write BW (MB/sec) | Write BW per task (MB/sec) |
---|---|---|---|---|---|---|
/scratch | 32 | 32 | 4401 | 138 | 4114 | 129 |
/scratch | 32 | 16 | 4400 | 138 | 4110 | 129 |
/scratch | 32 | 8 | 3796 | 119 | 3727 | 116 |
/scratch | 32 | 4 | 1874 | 58.6 | 2760 | 86.2 |
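The essence of the per-task measurement is simple: stream a file to disk in fixed-size buffers, time it, and report MB/sec. A single-process sketch follows, with tiny stand-in sizes (the real test writes 2.5 GB per task in 2 MB buffers across many MPI tasks, which this sketch does not attempt to reproduce).

```python
# Time a buffered sequential write and report the resulting bandwidth.
import os, time, tempfile

def write_bandwidth(path, buffer_size, total_size):
    buf = b"\0" * buffer_size
    start = time.perf_counter()
    with open(path, "wb") as fh:
        written = 0
        while written < total_size:
            fh.write(buf)
            written += buffer_size
        fh.flush()
        os.fsync(fh.fileno())            # include the time to reach the disk
    elapsed = time.perf_counter() - start
    return total_size / elapsed / 1e6    # MB/sec

# usage: 4 MB written in 64 KB buffers to a temporary file
path = os.path.join(tempfile.gettempdir(), "piodemo.dat")
bw = write_bandwidth(path, buffer_size=64 * 1024, total_size=4 * 1024 * 1024)
os.remove(path)
```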
MEMRATE
The MEMRATE benchmark is a facsimile of the STREAM/StarSTREAM benchmark. NERSC uses the TRIAD routine to characterize memory bandwidth. The TRIAD is a simple loop of the form:
do i = 1, n
   w(i) = u(i) + scale*v(i)
end do
This measures the ability to "stream" data to and from main memory. The result is quoted in MB/sec per MPI process. The measurement is taken using all 8 CPUs per node concurrently, which puts an extreme stress on the memory subsystem.
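The metric itself can be illustrated with a timed TRIAD loop: three eight-byte arrays are touched per iteration, so bytes moved per pass are counted as 3 x 8 x n. Pure Python is far too slow to stress the memory subsystem the way the Fortran benchmark does, so treat this only as an illustration of how the MB/sec figure is derived.

```python
# Timed TRIAD loop: w(i) = u(i) + scale*v(i), reported as MB/sec.
import time

def triad_mbytes_per_sec(n=200_000, scale=3.0, passes=5):
    u = [1.0] * n
    v = [2.0] * n
    w = [0.0] * n
    start = time.perf_counter()
    for _ in range(passes):
        for i in range(n):               # the TRIAD kernel itself
            w[i] = u[i] + scale * v[i]
    elapsed = time.perf_counter() - start
    return 3 * 8 * n * passes / elapsed / 1e6   # 3 arrays x 8 bytes x n

rate = triad_mbytes_per_sec()
```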
MEMRATE Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 8 | | |
Approximate wallclock run time | 300 sec | | |
Memory (GB) | 10.0 | 1.25 | 1.25 | 1.25
FFTW Full Configuration Test
The purpose of the Full Configuration Test is to demonstrate the ability of the system to efficiently run applications that utilize all compute processors. The FCT is a scalable parallel code based on FFTW. The 3-D array size is 2048x2048x2048.