
Bassi Benchmark Code Profiles

 

NERSC maintains a suite of codes that has been used to monitor, compare, and verify performance on Bassi. Some characteristics of these codes, measured on Bassi, follow. The measurements quoted below were made using profiled and instrumented runs, so absolute performance is somewhat lower than for uninstrumented runs.

Serial Benchmarks


NPB 2.3 FT CLASS B

Reference: NASA Ames Research Center

Approximate wallclock run time: 100 sec
Maximum memory: 1,792 MB
Instructions per load/store: 2.09
Algebraic floating point operations: 94,125 M
Floating point instructions/all instructions: 0.35
Approximate floating point performance: 1,000 MFlops/sec
Approximate floating point percent of peak: 13%


NPB 2.3 MG CLASS B

Reference: NASA Ames Research Center

Approximate wallclock run time: 15 sec
Maximum memory: 512 MB
Instructions per load/store: 1.98
Algebraic floating point operations: 21,526 M
Floating point instructions/all instructions: 0.34
Approximate floating point performance: 1,400 MFlops/sec
Approximate floating point percent of peak: 18%


NPB 2.3 SP CLASS B

Reference: NASA Ames Research Center

Approximate wallclock run time: 653 sec
Maximum memory: 512 MB
Instructions per load/store: 1.82
Algebraic floating point operations: 359,390 M
Floating point instructions/all instructions: 0.33
Approximate floating point performance: 550 MFlops/sec
Approximate floating point percent of peak: 7%

Parallel Benchmarks


NPB 2.4 FT CLASS D

Reference: NASA Ames Research Center

A 3-D partial differential equation solution using FFTs. This kernel performs the essence of many "spectral" codes. It is a rigorous test of long-distance communication performance.
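MPI_Alltoall dominates the profile below because the distributed 3-D FFT is built around a global transpose between local 1-D FFT stages. The following sketch is not the NPB source; it only illustrates that exchange pattern for a hypothetical NX x NY x NZ complex array distributed by z-planes, with the local FFT stages indicated by comments.

/* Sketch of the transpose step that dominates NPB FT's communication.
 * This is NOT the NPB source; array dimensions and data layout are
 * hypothetical and the local 1-D FFTs are only indicated by comments. */
#include <mpi.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64   /* assume NZ is divisible by the number of tasks */

int main(int argc, char **argv)
{
    int ntasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int planes = NZ / ntasks;                      /* z-planes owned here  */
    size_t nlocal = (size_t)NX * NY * planes * 2;  /* complex = 2 doubles  */
    double *slab = malloc(nlocal * sizeof(double));
    double *work = malloc(nlocal * sizeof(double));
    for (size_t i = 0; i < nlocal; i++) slab[i] = (double)rank;

    /* 1. FFTs in x and y are purely local to each z-slab (not shown).    */

    /* 2. Global transpose: every task exchanges an equal-sized block with
     *    every other task so that each task ends up owning complete
     *    "pencils" in z.  This single collective is where FT spends
     *    nearly all of its MPI time in the Bassi profile.                */
    int blk = (int)(nlocal / ntasks);
    MPI_Alltoall(slab, blk, MPI_DOUBLE, work, blk, MPI_DOUBLE,
                 MPI_COMM_WORLD);

    /* 3. FFTs in z are now local again (not shown).                      */

    free(slab);
    free(work);
    MPI_Finalize();
    return 0;
}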

Task Sum    Task Mean    Task Min.    Task Max.
Number of MPI tasks 64      
Approximate wallclock run time:   186 sec    
Memory (GB) 120.6 1.884 1.768 2.008
Instructions per load/store:   1.95    
Algebraic floating point operations: 9.49x10^12 1.48x10^11 1.48x10^11 1.48x10^11
Floating point instructions/all instructions:   0.36    
Approximate floating point performance (Gflops/sec): 51.0 0.797    
Approximate floating point percent of peak:   10.5%    
Percent of total run time spent in MPI routines:   23.2% 19.5% 24.6%

MPI Routine Timings (sec)

Call    Sum    Average    CV (%)    Minimum    Maximum    % of MPI    % of wall
MPI_Alltoall 2.704e+03 4.225e+01 3.33 3.633e+01 4.440e+01 97.790 22.720
MPI_Reduce 5.984e+01 9.350e-01 128.15 8.349e-04 5.181e+00 2.164 0.503
MPI_Barrier 1.195e+00 1.868e-02 61.12 4.625e-05 4.492e-02 0.043 0.010
MPI_Bcast 7.341e-02 1.147e-03 18.59 3.123e-05 1.378e-03 0.003 0.001


NPB 2.4 MG CLASS D

Reference: NASA Ames Research Center

A simple multigrid kernel. It requires highly structured long-distance communication and tests both short- and long-distance data communication. The V-cycle multigrid algorithm is used to obtain an approximate solution to the discrete Poisson problem on a grid with periodic boundary conditions.
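The sketch below is a one-dimensional illustration of the V-cycle structure, not the NPB source (which works on a 3-D grid in parallel and uses periodic boundaries). It uses weighted-Jacobi smoothing, full-weighting restriction, and linear-interpolation prolongation on -u'' = f with homogeneous Dirichlet boundaries and a toy right-hand side.

/* One-dimensional V-cycle sketch: illustrates the multigrid structure that
 * NPB MG applies to the 3-D Poisson problem.  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Weighted-Jacobi smoothing for (2u[i]-u[i-1]-u[i+1])/h^2 = f[i] */
static void smooth(double *u, const double *f, int n, double h, int sweeps)
{
    double *t = malloc((n + 1) * sizeof(double));
    for (int s = 0; s < sweeps; s++) {
        for (int i = 1; i < n; i++)
            t[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
        for (int i = 1; i < n; i++)
            u[i] += (2.0 / 3.0) * (t[i] - u[i]);
    }
    free(t);
}

static void vcycle(double *u, const double *f, int n, double h)
{
    if (n <= 2) {                        /* coarsest grid: direct solve   */
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1]);
        return;
    }
    smooth(u, f, n, h, 3);                                /* pre-smooth   */

    int nc = n / 2;
    double *r  = calloc(n + 1,  sizeof(double));
    double *rc = calloc(nc + 1, sizeof(double));
    double *ec = calloc(nc + 1, sizeof(double));
    for (int i = 1; i < n; i++)                           /* residual     */
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    for (int i = 1; i < nc; i++)                          /* restrict     */
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);

    vcycle(ec, rc, nc, 2.0 * h);                          /* coarse solve */

    for (int i = 0; i < nc; i++) {                        /* prolong+add  */
        u[2 * i]     += ec[i];
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]);
    }
    smooth(u, f, n, h, 3);                                /* post-smooth  */
    free(r); free(rc); free(ec);
}

int main(void)
{
    const int n = 128;                     /* intervals (power of two)       */
    const double pi = 3.14159265358979323846, h = 1.0 / n;
    double *u = calloc(n + 1, sizeof(double));
    double *f = malloc((n + 1) * sizeof(double));
    for (int i = 0; i <= n; i++)
        f[i] = sin(pi * i * h);            /* exact solution: sin(pi x)/pi^2 */

    for (int c = 0; c < 10; c++)           /* a few V-cycles                 */
        vcycle(u, f, n, h);

    printf("u(0.5) = %.6f  (exact %.6f)\n", u[n / 2], 1.0 / (pi * pi));
    free(u); free(f);
    return 0;
}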

Task Sum    Task Mean    Task Min.    Task Max.
Number of MPI tasks 64      
Approximate wallclock run time:   40 sec    
Memory (GB) 48.5 0.758 0.758 0.758
Instructions per load/store:   1.97    
Algebraic floating point operations: 3.27x10^12 5.11x10^10 5.11x10^10 5.11x10^10
Floating point instructions/all instructions:   0.34    
Approximate floating point performance (Gflops/sec): 81.7 1.27    
Approximate floating point percent of peak:   16.8%    
Percent of total run time spent in MPI routines:   4.52% 3.7% 5.1%

MPI Routine Timings (sec)

Call    Sum    Average    CV (%)    Minimum    Maximum    % of MPI    % of wall
MPI_Send 9.225e+01 1.441e+00 7.57 1.237e+00 1.688e+00 78.676 3.559
MPI_Wait 2.265e+01 3.540e-01 23.35 2.130e-01 5.721e-01 19.320 0.874
MPI_Irecv 1.111e+00 1.736e-02 9.69 1.403e-02 2.181e-02 0.948 0.043
MPI_Barrier 6.087e-01 9.510e-03 28.95 2.966e-03 1.401e-02 0.519 0.023
MPI_Allreduce 5.539e-01 8.654e-03 13.16 6.693e-03 1.197e-02 0.472 0.021
MPI_Bcast 7.558e-02 1.181e-03 11.95 7.010e-05 1.206e-03 0.064 0.003
MPI_Reduce 8.214e-04 1.283e-05 47.53 8.345e-06 4.220e-05 0.001 0.000


NPB 2.4 SP CLASS D

Reference: NASA Ames Research Center

Three sets of uncoupled systems of equations are solved, first in x, then in y, and finally in z. The systems are scalar pentadiagonal. In the multi-partition algorithm used, each processor is responsible for several disjoint sub-blocks of points ("cells") of the grid. The cells are arranged such that for each direction of the line solve phase the cells belonging to a certain processor will be evenly distributed along the direction of solution. This allows each processor to perform useful work throughout a line solve, instead of being forced to wait for the partial solution to a line from another processor before beginning work. Additionally, the information from a cell is not sent to the next processor until all sections of linear equation systems handled in this cell have been solved. Therefore the granularity of communications is kept large and fewer messages are sent.
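Within each cell, the line-solve kernel is a scalar pentadiagonal system. The sketch below shows the banded forward elimination and back substitution such a solve performs on a small test system; it is illustrative only and omits the multi-partition communication described above.

/* Scalar pentadiagonal solve: banded Gaussian elimination without pivoting.
 * Bands for row i: l2[i] (col i-2), l1[i] (col i-1), d[i] (diagonal),
 * u1[i] (col i+1), u2[i] (col i+2).  Illustrative only; not the NPB SP
 * source. */
#include <stdio.h>

#define N 8

static void penta_solve(double *l2, double *l1, double *d,
                        double *u1, double *u2, double *rhs, double *x, int n)
{
    for (int i = 0; i < n - 1; i++) {
        double m1 = l1[i + 1] / d[i];          /* eliminate row i+1, col i */
        d[i + 1]   -= m1 * u1[i];
        u1[i + 1]  -= m1 * u2[i];
        rhs[i + 1] -= m1 * rhs[i];
        if (i + 2 < n) {
            double m2 = l2[i + 2] / d[i];      /* eliminate row i+2, col i */
            l1[i + 2]  -= m2 * u1[i];
            d[i + 2]   -= m2 * u2[i];
            rhs[i + 2] -= m2 * rhs[i];
        }
    }
    for (int i = n - 1; i >= 0; i--) {         /* back substitution        */
        double s = rhs[i];
        if (i + 1 < n) s -= u1[i] * x[i + 1];
        if (i + 2 < n) s -= u2[i] * x[i + 2];
        x[i] = s / d[i];
    }
}

int main(void)
{
    /* Small diagonally dominant test system with known solution x = 1. */
    double l2[N], l1[N], d[N], u1[N], u2[N], rhs[N], x[N];
    for (int i = 0; i < N; i++) {
        l2[i] = (i >= 2)     ? -1.0 : 0.0;
        l1[i] = (i >= 1)     ? -1.0 : 0.0;
        u1[i] = (i <= N - 2) ? -1.0 : 0.0;
        u2[i] = (i <= N - 3) ? -1.0 : 0.0;
        d[i]  = 6.0;
        rhs[i] = d[i] + l2[i] + l1[i] + u1[i] + u2[i];  /* row sum => x=1 */
    }
    penta_solve(l2, l1, d, u1, u2, rhs, x, N);
    for (int i = 0; i < N; i++)
        printf("x[%d] = %.6f\n", i, x[i]);
    return 0;
}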

Task Sum    Task Mean    Task Min.    Task Max.
Number of MPI tasks 64      
Approximate wallclock run time:   780 sec    
Memory (GB) 32.6 0.509 0.509 0.509
Instructions per load/store:   1.81    
Algebraic floating point operations: 3.23x10^13 5.05x10^11 5.05x10^11 5.05x10^11
Floating point instructions/all instructions:   0.33    
Approximate floating point performance (Gflops/sec): 41.5 0.648    
Approximate floating point percent of peak:   8.5%    
Percent of total run time spent in MPI routines:   3.25% 2.9% 3.6%

MPI Routine Timings (sec)

Call    Sum    Average    CV (%)    Minimum    Maximum    % of MPI    % of wall
MPI_Waitall 1.586e+03 2.479e+01 4.82 2.197e+01 2.711e+01 97.736 3.182
MPI_Isend 3.133e+01 4.895e-01 12.05 3.930e-01 6.290e-01 1.930 0.063
MPI_Irecv 4.352e+00 6.800e-02 5.77 5.827e-02 7.815e-02 0.268 0.009
MPI_Bcast 4.518e-01 7.060e-03 12.60 5.317e-05 7.184e-03 0.028 0.001
MPI_Allreduce 3.444e-01 5.381e-03 38.75 1.118e-03 1.071e-02 0.021 0.001
MPI_Barrier 2.639e-01 4.124e-03 18.51 4.942e-04 5.381e-03 0.016 0.001
MPI_Reduce 1.342e-03 2.096e-05 157.37 9.775e-06 1.657e-04 0.000 0.000


GTC (Gyrokinetic Toroidal Code)

Reference: Princeton Plasma Physics Lab Theory Department

The Gyrokinetic Toroidal Code (GTC) was developed to study the dominant mechanism for energy transport in fusion devices, namely, plasma microturbulence. GTC solves the gyroaveraged Vlasov-Poisson system of equations using the particle-in-cell (PIC) approach. This method makes use of particles to sample the distribution function of the plasma system under study. The particles interact with each other only through a self-consistent field described on a grid such that no binary forces need to be calculated. The equations of motion to be solved for the particles are simple ordinary differential equations and are easily solved using a second order Runge-Kutta algorithm.

The main tasks of the PIC method at each time step are as follows: The charge of each particle is distributed among its nearest grid points according to the current position of that particle; this is called the scatter operation. The Poisson equation is then solved on the grid to obtain the electrostatic potential at each point. The force acting on each particle is then calculated from the potential at the nearest grid points; this is the "gather" operation. Next, the particles are "moved" by using the equations of motion. These steps are repeated until the end of the simulation.

GTC has been highly optimized for cache-based super-scalar machines such as the IBM SP. The data structure and loop ordering have been arranged for maximum cache reuse, which is the most important method of achieving higher performance on this type of processor. In GTC, the main bottleneck is the charge deposition, or scatter operation, mentioned above, and this is also true for most particle codes. The classic scatter algorithm consists of a loop over the particles, finding the nearest grid points surrounding each particle position. A fraction of the particle's charge is assigned to the grid points proportionally to their distance from the particle's position. The charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex since one is dealing with fast gyrating particles for which motion is described by charged rings being tracked by their guiding center. This results in a larger number of operations, since several points are picked on the rings and each of them has its own neighboring grid points.
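For reference, the classic one-dimensional scatter loop described above can be sketched as follows. This is illustrative only: GTC deposits charge from several points on each gyro-ring rather than from a single particle position, and the grid and particle counts here are arbitrary.

/* Classic 1-D PIC charge-deposition (scatter) sketch with linear weighting.
 * Illustrative only; not the GTC source. */
#include <stdio.h>
#include <stdlib.h>

#define NGRID 64          /* number of grid cells (hypothetical)       */
#define NPART 10000       /* number of particles  (hypothetical)       */

int main(void)
{
    double rho[NGRID + 1] = {0.0};          /* charge accumulated on grid */
    double dx = 1.0 / NGRID;                /* grid spacing on [0,1]      */
    double q  = 1.0 / NPART;                /* charge per particle        */

    srand(12345);
    for (int p = 0; p < NPART; p++) {
        double x = (double)rand() / RAND_MAX;   /* particle position      */
        int    j = (int)(x / dx);               /* left grid point        */
        if (j >= NGRID) j = NGRID - 1;
        double w = (x - j * dx) / dx;           /* distance-based weight  */

        /* Scatter: split the charge between the two nearest grid points,
         * proportionally to the distance from each.                      */
        rho[j]     += q * (1.0 - w);
        rho[j + 1] += q * w;
    }

    for (int j = 0; j <= NGRID; j += 8)
        printf("rho[%2d] = %.4f\n", j, rho[j]);
    return 0;
}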

GTC Summary Profile

Task Sum    Task Mean    Task Min.    Task Max.
Number of MPI tasks 64      
Approximate wallclock run time:   164 sec    
Memory (GB) 17.0 0.266 0.263 0.274
Instructions per load/store:   2.51    
Algebraic floating point operations: 7.06x10^12 1.10x10^11 1.09x10^11 1.12x10^11
Floating point instructions/all instructions:   0.28    
Approximate floating point performance (Gflops/sec): 43.1 0.673    
Approximate floating point percent of peak:   8.9%    
Percent of total run time spent in MPI routines:   5.22% 2.3% 8.5%

GTC MPI Routine Timings (sec)

Call    Sum    Average    CV (%)    Minimum    Maximum    % of MPI    % of wall
MPI_Allreduce 2.676e+02 4.182e+00 54.98 9.639e-01 8.752e+00 48.877 2.551
MPI_Sendrecv 2.639e+02 4.124e+00 65.22 1.625e+00 1.153e+01 48.204 2.516
MPI_Bcast 8.042e+00 1.257e-01 12.67 3.002e-04 1.278e-01 1.469 0.077
MPI_Gather 7.763e+00 1.213e-01 8.06 1.077e-01 1.524e-01 1.418 0.074
MPI_Reduce 1.714e-01 2.678e-03 108.97 6.495e-04 8.418e-03 0.031 0.002
MPI_Recv 9.179e-04 1.434e-05 0.00 9.179e-04 9.179e-04 0.000 0.000
MPI_Send 6.447e-04 1.007e-05 417.04 5.484e-06 5.889e-05 0.000 0.000


PARATEC

PARATEC (Parallel Total Energy Code) performs ab-initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. The pseudopotentials are of the standard norm-conserving variety (typically Hamann-Schlüter-Chiang or Troullier-Martins). Forces and stress can be easily calculated with PARATEC and used to relax the atoms into their equilibrium positions. The total energy minimization of the electrons is performed by the self-consistent field (SCF) method using Broyden, Pulay-Kerker or the newly developed Pulay-Kerker-Thomas-Fermi charge mixing schemes. In the SCF method, the electronic minimization is performed by unconstrained conjugate gradient algorithms on the Mauri-Galli-Ordejon (MGO) functional or the generalized Rayleigh quotient.
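Schematically, the SCF procedure is a fixed-point iteration on the charge density with a mixing step between iterations. The toy sketch below uses simple linear mixing on a scalar problem purely to show that structure; the map g() is a hypothetical stand-in for the Kohn-Sham output density, and PARATEC's Broyden and Pulay-Kerker schemes are far more sophisticated.

/* Toy illustration of an SCF-style fixed-point loop with linear mixing.
 * g() is a hypothetical stand-in for "output density from the Kohn-Sham
 * solve"; real codes such as PARATEC mix full 3-D charge densities with
 * Broyden or Pulay-Kerker schemes rather than a single scalar. */
#include <stdio.h>
#include <math.h>

static double g(double rho) { return cos(rho); }  /* stand-in output map */

int main(void)
{
    double rho = 0.5;            /* initial guess for the "density"      */
    double alpha = 0.4;          /* mixing parameter                     */
    for (int it = 1; it <= 100; it++) {
        double rho_out = g(rho);                 /* "solve" step          */
        double resid   = fabs(rho_out - rho);    /* self-consistency test */
        rho = (1.0 - alpha) * rho + alpha * rho_out;   /* linear mixing   */
        printf("iter %3d  rho = %.8f  residual = %.2e\n", it, rho, resid);
        if (resid < 1e-10) break;
    }
    return 0;
}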

PARATEC Summary Profile

Task Sum    Task Mean    Task Min.    Task Max.
Number of MPI tasks 64      
Approximate wallclock run time:   600 sec    
Memory (GB) 51.0 0.796 0.761 1.258
Instructions per load/store:   3.35    
Algebraic floating point operations: 1.81x10^14 2.83x10^12 2.819x10^12 2.835x10^12
Floating point instructions/all instructions:   0.60    
Approximate floating point performance (Gflops/sec): 304 4.750    
Approximate floating point percent of peak:   63%    
Percent of total run time spent in MPI routines:   12.8% 11.5% 14.0%

Main PARATEC MPI Routine Timings (sec)

Call    Sum    Average    CV (%)    Minimum    Maximum    % of MPI    % of wall
MPI_Wait 2.452e+03 3.832e+01 4.21 3.513e+01 4.154e+01 49.523 6.384
MPI_Allreduce 1.129e+03 1.764e+01 16.91 1.103e+01 2.435e+01 22.796 2.939
MPI_Isend 6.663e+02 1.041e+01 3.29 9.699e+00 1.119e+01 13.455 1.734
MPI_Bcast 3.757e+02 5.871e+00 6.95 2.758e+00 6.177e+00 7.587 0.978
MPI_Irecv 1.438e+02 2.248e+00 3.82 2.067e+00 2.378e+00 2.905 0.374
MPI_Recv 1.362e+02 2.128e+00 39.99 5.677e-01 4.699e+00 2.750 0.354


CAM 3.0

The Community Atmosphere Model (CAM) is the latest in a series of global atmosphere models developed at NCAR for the weather and climate research communities. CAM also serves as the atmospheric component of the Community Climate System Model (CCSM).

CAM was configured as follows:

$CAM_ROOT/models/atm/cam/bld/configure -res 128x256 -cam_bld build \
        -cflags='-qlargepage' -fc='mpxlf90_r' \
        -cflags='-qlargepage' -fopt='-qtune=auto -qarch=auto -O3 -strict' \
        -nc_inc="$NETCDF_INCLUDE" -nc_lib="$NETCDF_LIB" -smp -spmd -notest

CAM 3.0 Summary Profile

Task Sum    Task Mean    Task Min.    Task Max.
Number of MPI tasks 16      
Approximate wallclock run time:   1400 sec    
Memory (GB) 8.63 0.539 0.523 0.780
Instructions per load/store:   2.20    
Algebraic floating point operations: 1.06x10^13 6.62x10^11 6.238x10^11 7.791x10^11
Floating point instructions/all instructions:   0.31    
Approximate floating point performance (Gflops/sec): 7.54 0.471    
Approximate floating point percent of peak:   4%    
Percent of total run time spent in MPI routines:   17.2% 7.4% 21.6%

Main CAM MPI Routine Timings (sec)

Call    Sum    Average    CV (%)    Minimum    Maximum    % of MPI    % of wall
MPI_Alltoallv 2.77e+03 1.73e+02 27.9 8.52e+01 2.80e+02 71.56 12.30
MPI_Allgatherv 8.36e+02 5.22e+01 48.7 2.73e+00 6.81e+01 21.61 3.71
MPI_Bcast 7.87e+01 4.92e+00 26.1 1.38e-01 5.40e+00 2.03 0.35
MPI_Gatherv 7.10e+01 4.43e+00 11.1 2.82e+00 5.10e+00 1.83 0.31
MPI_Scatterv 5.70e+01 3.56e+00 25.1 2.00e-01 3.80e+00 1.47 0.25

PIORAW Disk Read/Write Bandwidth Test

The NERSC PIORAW benchmark tests read and write bandwidth, using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB and individual file sizes of 2.5 GB.
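The sketch below illustrates a one-file-per-task write measurement in the spirit of PIORAW. It is not the NERSC benchmark source: the file size is deliberately small here, the output name piotest.<rank> is made up for the example, and a production test would also flush data to disk before stopping the clock.

/* One-file-per-task write-bandwidth sketch in the spirit of PIORAW.
 * Not the NERSC benchmark source: buffer and file sizes are reduced
 * (PIORAW on Bassi used 2 MB buffers and 2.5 GB files per task), and
 * the output path "piotest.<rank>" is hypothetical. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSZ (2 * 1024 * 1024)   /* 2 MB buffer, as in the Bassi test */
#define NBUF  16                  /* 32 MB per file in this sketch     */

int main(int argc, char **argv)
{
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    char *buf = malloc(BUFSZ);
    memset(buf, rank & 0xff, BUFSZ);
    char fname[64];
    snprintf(fname, sizeof(fname), "piotest.%d", rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    FILE *fp = fopen(fname, "wb");
    for (int i = 0; i < NBUF; i++)
        fwrite(buf, 1, BUFSZ, fp);
    fclose(fp);                        /* no fsync here; a real test should
                                          also force the data to disk      */
    double t1 = MPI_Wtime(), tmax;
    MPI_Reduce(&t1, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double mbytes = (double)ntasks * NBUF * BUFSZ / 1.0e6;
        printf("aggregate write bandwidth: %.1f MB/sec\n",
               mbytes / (tmax - t0));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}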

Filesystem MPI Tasks Nodes Aggregate Read BW (MB/sec) Read BW per task (MB/sec) Aggregate Write BW (MB/sec) Write BW per task (MB/sec)
/scratch 32 32 4401 138 4114 129
/scratch 32 16 4400 138 4110 129
/scratch 32 8 3796 119 3727 116
/scratch 32 4 1874 58.6 2760 86.2

 

MEMRATE

The MEMRATE benchmark is a facsimile of the STREAM/StarSTREAM benchmark. NERSC uses the TRIAD routine to characterize memory bandwidth. The TRIAD is a simple loop of the form:

	do i = 1, n
	   w(i) = u(i) + scale*v(i)
	end do

This measures the ability to "stream" data to and from main memory. The result is quoted in MB/sec per MPI process. The measurement is taken using all 8 CPUs per node concurrently, which places extreme stress on the memory subsystem.
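The MB/sec figure follows from the loop above: each iteration reads u(i) and v(i) and writes w(i), i.e. three 8-byte words of memory traffic. The C sketch below shows the timing and bandwidth arithmetic for a single process; it is illustrative only, since the actual MEMRATE harness runs the loop concurrently on all 8 CPUs of a node under MPI.

/* Serial triad-timing sketch showing how the MB/sec figure is computed.
 * Illustrative only; not the MEMRATE source. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (8 * 1024 * 1024)       /* large enough to exceed the caches */

int main(void)
{
    double *u = malloc(N * sizeof(double));
    double *v = malloc(N * sizeof(double));
    double *w = malloc(N * sizeof(double));
    double scale = 3.0;
    for (long i = 0; i < N; i++) { u[i] = 1.0; v[i] = 2.0; w[i] = 0.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)          /* the triad loop */
        w[i] = u[i] + scale * v[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* 3 arrays touched x 8 bytes per element per iteration */
    double mbytes = 3.0 * 8.0 * N / 1.0e6;
    printf("triad bandwidth: %.1f MB/sec (w[0]=%g)\n", mbytes / secs, w[0]);
    free(u); free(v); free(w);
    return 0;
}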

MEMRATE Summary Profile

Task Sum    Task Mean    Task Min.    Task Max.
Number of MPI tasks 8      
Approximate wallclock run time:   300 sec    
Memory (GB) 10.0 1.25 1.25 1.25

FFTW Full Configuration Test

The purpose of the Full Configuration Test is to demonstrate the ability of the system to efficiently run applications that utilize all compute processors. The FCT is a scalable parallel code based on FFTW. The 3-D array size is 2048x2048x2048.
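A minimal sketch of a distributed, in-place 3-D complex transform in the style of the FCT is shown below, written against the FFTW 3 MPI interface. The Bassi-era test may have used a different FFTW release and decomposition, and the problem size here is reduced from 2048^3 so that the example runs in modest memory.

/* Distributed 3-D complex FFT sketch using the FFTW 3 MPI interface.
 * Illustrative only: the NERSC FCT uses a 2048^3 array across all
 * compute processors; N is reduced here so the example is easy to run. */
#include <fftw3-mpi.h>
#include <stdio.h>

#define N 128    /* 128^3 instead of 2048^3 for this sketch */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    ptrdiff_t local_n0, local_0_start;
    ptrdiff_t alloc = fftw_mpi_local_size_3d(N, N, N, MPI_COMM_WORLD,
                                             &local_n0, &local_0_start);
    fftw_complex *data = fftw_alloc_complex(alloc);

    fftw_plan plan = fftw_mpi_plan_dft_3d(N, N, N, data, data,
                                          MPI_COMM_WORLD, FFTW_FORWARD,
                                          FFTW_ESTIMATE);

    /* Fill the locally owned slab (planes local_0_start .. +local_n0-1). */
    for (ptrdiff_t i = 0; i < local_n0 * N * N; i++) {
        data[i][0] = 1.0;    /* real part      */
        data[i][1] = 0.0;    /* imaginary part */
    }

    double t0 = MPI_Wtime();
    fftw_execute(plan);                      /* in-place forward 3-D FFT */
    double t1 = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("%dx%dx%d FFT time: %.3f sec\n", N, N, N, t1 - t0);

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}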