Bassi Benchmark Code Profiles
NERSC uses a suite of codes to monitor, compare, and verify performance on Bassi. Some characteristics of the codes, measured on Bassi, follow. The measurements quoted below were made using profiled and instrumented runs, so absolute performance is somewhat lower than for uninstrumented runs.
- NPB Serial Benchmarks: FT, MG, SP
- NPB Parallel Benchmarks: FT, MG, SP
- Application Benchmarks: CAM 3.0, GTC, PARATEC
- Component Benchmarks: MEMRATE, PIORAW, FFTW Full Configuration Test
Serial Benchmarks
NPB 2.3 FT CLASS B
Reference: NASA Ames Research Center
Metric | Value
---|---
Approximate wallclock run time | 100 sec
Maximum memory | 1,792 MB
Instructions per load/store | 2.09
Algebraic floating point operations | 94,125 M
Floating point instructions/all instructions | 0.35
Approximate floating point performance | 1,000 MFlops/sec
Approximate floating point percent of peak | 13%
NPB 2.3 MG CLASS B
Reference: NASA Ames Research Center
Metric | Value
---|---
Approximate wallclock run time | 15 sec
Maximum memory | 512 MB
Instructions per load/store | 1.98
Algebraic floating point operations | 21,526 M
Floating point instructions/all instructions | 0.34
Approximate floating point performance | 1,400 MFlops/sec
Approximate floating point percent of peak | 18%
NPB 2.3 SP CLASS B
Reference: NASA Ames Research Center
Metric | Value
---|---
Approximate wallclock run time | 653 sec
Maximum memory | 512 MB
Instructions per load/store | 1.82
Algebraic floating point operations | 359,390 M
Floating point instructions/all instructions | 0.33
Approximate floating point performance | 550 MFlops/sec
Approximate floating point percent of peak | 7%
Parallel Benchmarks
NPB 2.4 FT CLASS D
Reference: NASA Ames Research Center
Solves a 3-D partial differential equation using FFTs. This kernel captures the essence of many "spectral" codes and is a rigorous test of long-distance communication performance.
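The spectral approach the FT kernel exercises can be sketched in one dimension: transform the right-hand side to Fourier space, scale each mode by the inverse of the operator's symbol, and transform back. The sketch below solves the periodic Poisson problem u'' = f with a naive pure-Python DFT; the function names are illustrative, and the benchmark itself of course runs distributed 3-D FFTs across many tasks.

```python
# 1-D spectral Poisson solve: u'' = f with periodic boundary conditions.
# Forward DFT, divide mode k by -(k)^2, inverse DFT. Naive O(n^2) DFT
# for clarity; FT uses fast O(n log n) FFTs.
import cmath, math

def dft(x, sign=-1):
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * math.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def poisson_periodic(f, length=2 * math.pi):
    n = len(f)
    fhat = dft(f, -1)
    uhat = [0j] * n
    for k in range(1, n):                  # k = 0 mode: mean fixed to zero
        kk = k if k <= n // 2 else k - n   # signed wavenumber
        uhat[k] = -fhat[k] / (kk * 2 * math.pi / length) ** 2
    u = dft(uhat, +1)
    return [v.real / n for v in u]         # inverse-DFT normalization

# usage: f = -sin(x) gives u = sin(x)
n = 32
xs = [2 * math.pi * j / n for j in range(n)]
u = poisson_periodic([-math.sin(x) for x in xs])
```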
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 186 sec | | |
Memory (GB) | 120.6 | 1.884 | 1.768 | 2.008
Instructions per load/store | 1.95 | | |
Algebraic floating point operations | 9.49x10^12 | 1.48x10^11 | 1.48x10^11 | 1.48x10^11
Floating point instructions/all instructions | 0.36 | | |
Approximate floating point performance (Gflops/sec) | 51.0 | 0.797 | |
Approximate floating point percent of peak | 10.5% | | |
Percent of total run time spent in MPI routines | | 23.2% | 19.5% | 24.6%
MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Alltoall | 2.704e+03 | 4.225e+01 | 3.33 | 3.633e+01 | 4.440e+01 | 97.790 | 22.720 |
MPI_Reduce | 5.984e+01 | 9.350e-01 | 128.15 | 8.349e-04 | 5.181e+00 | 2.164 | 0.503 |
MPI_Barrier | 1.195e+00 | 1.868e-02 | 61.12 | 4.625e-05 | 4.492e-02 | 0.043 | 0.010 |
MPI_Bcast | 7.341e-02 | 1.147e-03 | 18.59 | 3.123e-05 | 1.378e-03 | 0.003 | 0.001 |
NPB 2.4 MG CLASS D
Reference: NASA Ames Research Center
A simplified multigrid kernel. It requires highly structured long-distance communication and tests both short- and long-distance data communication. The V-cycle multigrid algorithm is used to obtain an approximate solution to the discrete Poisson problem on a grid with periodic boundary conditions.
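The V-cycle idea can be sketched in one dimension: smooth, restrict the residual to a coarser grid, solve there recursively, then interpolate the correction back and smooth again. The sketch below uses a 1-D Dirichlet problem -u'' = f for brevity (the benchmark works on a 3-D periodic grid), with hypothetical helper names.

```python
# Minimal 1-D multigrid V-cycle for -u'' = f, Dirichlet boundaries.
def smooth(u, f, h, sweeps=2):
    # Gauss-Seidel on the standard 3-point stencil (interior points only)
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u

def residual(u, f, h):
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def v_cycle(u, f, h):
    n = len(u) - 1
    if n == 2:                       # coarsest grid: one unknown, solve exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    smooth(u, f, h)                  # pre-smoothing
    r = residual(u, f, h)
    nc = n // 2
    rc = [0.0] * (nc + 1)            # full-weighting restriction
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    ec = v_cycle([0.0] * (nc + 1), rc, 2.0 * h)
    for i in range(1, n):            # linear interpolation of the correction
        if i % 2 == 0:
            u[i] += ec[i // 2]
        else:
            u[i] += 0.5 * (ec[i // 2] + ec[i // 2 + 1])
    return smooth(u, f, h)           # post-smoothing

# usage: -u'' = 1 on [0,1]; exact discrete solution is x(1-x)/2
n = 64
h = 1.0 / n
u = [0.0] * (n + 1)
for _ in range(40):
    u = v_cycle(u, [1.0] * (n + 1), h)
```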
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 40 sec | | |
Memory (GB) | 48.5 | 0.758 | 0.758 | 0.758
Instructions per load/store | 1.97 | | |
Algebraic floating point operations | 3.27x10^12 | 5.11x10^10 | 5.11x10^10 | 5.11x10^10
Floating point instructions/all instructions | 0.34 | | |
Approximate floating point performance (Gflops/sec) | 81.7 | 1.27 | |
Approximate floating point percent of peak | 16.8% | | |
Percent of total run time spent in MPI routines | | 4.52% | 3.7% | 5.1%
MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Send | 9.225e+01 | 1.441e+00 | 7.57 | 1.237e+00 | 1.688e+00 | 78.676 | 3.559 |
MPI_Wait | 2.265e+01 | 3.540e-01 | 23.35 | 2.130e-01 | 5.721e-01 | 19.320 | 0.874 |
MPI_Irecv | 1.111e+00 | 1.736e-02 | 9.69 | 1.403e-02 | 2.181e-02 | 0.948 | 0.043 |
MPI_Barrier | 6.087e-01 | 9.510e-03 | 28.95 | 2.966e-03 | 1.401e-02 | 0.519 | 0.023 |
MPI_Allreduce | 5.539e-01 | 8.654e-03 | 13.16 | 6.693e-03 | 1.197e-02 | 0.472 | 0.021 |
MPI_Bcast | 7.558e-02 | 1.181e-03 | 11.95 | 7.010e-05 | 1.206e-03 | 0.064 | 0.003 |
MPI_Reduce | 8.214e-04 | 1.283e-05 | 47.53 | 8.345e-06 | 4.220e-05 | 0.001 | 0.000 |
NPB 2.4 SP CLASS D
Reference: NASA Ames Research Center
Three sets of uncoupled systems of equations are solved, first in x, then in y, and finally in z. The systems are scalar pentadiagonal. In the multi-partition algorithm used, each processor is responsible for several disjoint sub-blocks of points ("cells") of the grid. The cells are arranged such that for each direction of the line solve phase the cells belonging to a certain processor will be evenly distributed along the direction of solution. This allows each processor to perform useful work throughout a line solve, instead of being forced to wait for the partial solution to a line from another processor before beginning work. Additionally, the information from a cell is not sent to the next processor until all sections of linear equation systems handled in this cell have been solved. Therefore the granularity of communications is kept large and fewer messages are sent.
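The line solves at the heart of SP invert scalar pentadiagonal systems. A minimal serial sketch of one such solve by Gaussian elimination (no pivoting, assuming diagonal dominance; the benchmark's Fortran solver works cell-by-cell across the multi-partition decomposition, which this single-system sketch does not attempt to show):

```python
# Solve a scalar pentadiagonal system by banded Gaussian elimination.
def penta_solve(e, c, d, a, b, rhs):
    # bands of row i: e[i] (2nd sub), c[i] (1st sub), d[i] (main),
    # a[i] (1st super), b[i] (2nd super); out-of-range entries unused
    n = len(d)
    e, c, d, a, b, rhs = (list(v) for v in (e, c, d, a, b, rhs))
    for i in range(n - 1):                      # forward elimination
        m = c[i + 1] / d[i]
        d[i + 1] -= m * a[i]
        a[i + 1] -= m * b[i]
        rhs[i + 1] -= m * rhs[i]
        if i + 2 < n:                           # zero the 2nd sub-diagonal too
            m = e[i + 2] / d[i]
            c[i + 2] -= m * a[i]
            d[i + 2] -= m * b[i]
            rhs[i + 2] -= m * rhs[i]
    x = [0.0] * n                               # back substitution
    x[n - 1] = rhs[n - 1] / d[n - 1]
    x[n - 2] = (rhs[n - 2] - a[n - 2] * x[n - 1]) / d[n - 2]
    for i in range(n - 3, -1, -1):
        x[i] = (rhs[i] - a[i] * x[i + 1] - b[i] * x[i + 2]) / d[i]
    return x

# usage: 3x3 diagonally dominant system with known solution [1, -2, 3]
x = penta_solve(e=[0.0, 0.0, 0.3], c=[0.0, 1.0, 1.0], d=[4.0, 4.0, 4.0],
                a=[1.0, 1.0, 0.0], b=[0.3, 0.0, 0.0],
                rhs=[2.9, -4.0, 10.3])
```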
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 780 sec | | |
Memory (GB) | 32.6 | 0.509 | 0.509 | 0.509
Instructions per load/store | 1.81 | | |
Algebraic floating point operations | 3.23x10^13 | 5.05x10^11 | 5.05x10^11 | 5.05x10^11
Floating point instructions/all instructions | 0.33 | | |
Approximate floating point performance (Gflops/sec) | 41.5 | 0.648 | |
Approximate floating point percent of peak | 8.5% | | |
Percent of total run time spent in MPI routines | | 3.25% | 2.9% | 3.6%
MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Waitall | 1.586e+03 | 2.479e+01 | 4.82 | 2.197e+01 | 2.711e+01 | 97.736 | 3.182 |
MPI_Isend | 3.133e+01 | 4.895e-01 | 12.05 | 3.930e-01 | 6.290e-01 | 1.930 | 0.063 |
MPI_Irecv | 4.352e+00 | 6.800e-02 | 5.77 | 5.827e-02 | 7.815e-02 | 0.268 | 0.009 |
MPI_Bcast | 4.518e-01 | 7.060e-03 | 12.60 | 5.317e-05 | 7.184e-03 | 0.028 | 0.001 |
MPI_Allreduce | 3.444e-01 | 5.381e-03 | 38.75 | 1.118e-03 | 1.071e-02 | 0.021 | 0.001 |
MPI_Barrier | 2.639e-01 | 4.124e-03 | 18.51 | 4.942e-04 | 5.381e-03 | 0.016 | 0.001 |
MPI_Reduce | 1.342e-03 | 2.096e-05 | 157.37 | 9.775e-06 | 1.657e-04 | 0.000 | 0.000 |
GTC (Gyrokinetic Toroidal Code)
Reference: Princeton Plasma Physics Lab Theory Department
The Gyrokinetic Toroidal Code (GTC) was developed to study the dominant mechanism for energy transport in fusion devices, namely, plasma microturbulence. GTC solves the gyroaveraged Vlasov-Poisson system of equations using the particle-in-cell (PIC) approach. This method makes use of particles to sample the distribution function of the plasma system under study. The particles interact with each other only through a self-consistent field described on a grid such that no binary forces need to be calculated. The equations of motion to be solved for the particles are simple ordinary differential equations and are easily solved using a second order Runge-Kutta algorithm.
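The second-order Runge-Kutta step used for the particle equations of motion can be sketched with the midpoint rule; this is a generic scalar-ODE illustration, not GTC's actual field equations:

```python
# One midpoint-rule (second-order Runge-Kutta) step for dy/dt = f(t, y).
import math

def rk2_step(f, t, y, dt):
    k1 = f(t, y)                                # slope at the start
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)     # slope at the midpoint
    return y + dt * k2

# usage: integrate dy/dt = y from t = 0 to t = 1; result approaches e
y, t, dt = 1.0, 0.0, 0.001
for _ in range(1000):
    y = rk2_step(lambda t, y: y, t, y, dt)
    t += dt
```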
The main tasks of the PIC method at each time step are as follows: The charge of each particle is distributed among its nearest grid points according to the current position of that particle; this is called the scatter operation. The Poisson equation is then solved on the grid to obtain the electrostatic potential at each point. The force acting on each particle is then calculated from the potential at the nearest grid points; this is the "gather" operation. Next, the particles are "moved" by using the equations of motion. These steps are repeated until the end of the simulation.
GTC has been highly optimized for cache-based super-scalar machines such as the IBM SP. The data structure and loop ordering have been arranged for maximum cache reuse, which is the most important method of achieving higher performance on this type of processor. In GTC, the main bottleneck is the charge deposition, or scatter operation, mentioned above, and this is also true for most particle codes. The classic scatter algorithm consists of a loop over the particles, finding the nearest grid points surrounding each particle position. A fraction of the particle's charge is assigned to the grid points proportionally to their distance from the particle's position. The charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex since one is dealing with fast gyrating particles whose motion is described by charged rings tracked by their guiding center. This results in a larger number of operations, since several points are picked on the rings and each of them has its own neighboring grid points.
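The classic scatter loop described above can be sketched in 1-D with linear ("cloud-in-cell") weighting: each particle deposits its charge on the two grid points bracketing it, in proportion to its distance from each. GTC's actual deposition from several points on each gyro-ring is more elaborate; the function name here is illustrative.

```python
# 1-D charge deposition (scatter) with linear weighting on a periodic grid.
def deposit_charge(positions, charge, n_grid, length):
    rho = [0.0] * n_grid
    h = length / n_grid
    for x in positions:
        s = (x % length) / h                 # position in grid units
        j = int(s)                           # left grid point
        w = s - j                            # fractional distance to it
        rho[j] += charge * (1.0 - w)         # nearer point gets the larger share
        rho[(j + 1) % n_grid] += charge * w  # periodic wrap for the right point
    return rho

# usage: a particle midway between points 1 and 2 splits its charge evenly
rho = deposit_charge([1.5], 1.0, 4, 4.0)
```

Note that total charge is conserved exactly, since the two weights sum to one for every particle.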
GTC Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 164 sec | | |
Memory (GB) | 17.0 | 0.266 | 0.263 | 0.274
Instructions per load/store | 2.51 | | |
Algebraic floating point operations | 7.06x10^12 | 1.10x10^11 | 1.09x10^11 | 1.12x10^11
Floating point instructions/all instructions | 0.28 | | |
Approximate floating point performance (Gflops/sec) | 43.1 | 0.673 | |
Approximate floating point percent of peak | 8.9% | | |
Percent of total run time spent in MPI routines | | 5.22% | 2.3% | 8.5%
GTC MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Allreduce | 2.676e+02 | 4.182e+00 | 54.98 | 9.639e-01 | 8.752e+00 | 48.877 | 2.551 |
MPI_Sendrecv | 2.639e+02 | 4.124e+00 | 65.22 | 1.625e+00 | 1.153e+01 | 48.204 | 2.516 |
MPI_Bcast | 8.042e+00 | 1.257e-01 | 12.67 | 3.002e-04 | 1.278e-01 | 1.469 | 0.077 |
MPI_Gather | 7.763e+00 | 1.213e-01 | 8.06 | 1.077e-01 | 1.524e-01 | 1.418 | 0.074 |
MPI_Reduce | 1.714e-01 | 2.678e-03 | 108.97 | 6.495e-04 | 8.418e-03 | 0.031 | 0.002 |
MPI_Recv | 9.179e-04 | 1.434e-05 | 0.00 | 9.179e-04 | 9.179e-04 | 0.000 | 0.000 |
MPI_Send | 6.447e-04 | 1.007e-05 | 417.04 | 5.484e-06 | 5.889e-05 | 0.000 | 0.000 |
PARATEC
PARATEC (Parallel Total Energy Code) performs ab-initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. The pseudopotentials are of the standard norm-conserving variety (typically Hamann-Schlüter-Chiang or Troullier-Martins). Forces and stress can be easily calculated with PARATEC and used to relax the atoms into their equilibrium positions. The total energy minimization of the electrons is performed by the self-consistent field (SCF) method using Broyden, Pulay-Kerker or the newly developed Pulay-Kerker-Thomas-Fermi charge mixing schemes. In the SCF method, the electronic minimization is performed by unconstrained conjugate gradient algorithms on the Mauri-Galli-Ordejon (MGO) functional or the generalized Rayleigh quotient.
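The SCF loop iterates the charge density to self-consistency, and charge mixing damps each update to keep the iteration stable. A toy sketch of the fixed-point idea with simple linear mixing; the Broyden and Pulay-Kerker schemes named above are more sophisticated accelerations of the same iteration, and the `step` map here is a made-up stand-in for "density in, density out".

```python
# Toy self-consistent-field loop with linear charge mixing.
def scf(step, rho0, alpha=0.4, tol=1e-10, max_iter=200):
    rho = rho0
    for _ in range(max_iter):
        rho_out = step(rho)
        if abs(rho_out - rho) < tol:                 # self-consistency reached
            return rho_out
        rho = (1 - alpha) * rho + alpha * rho_out    # damped (mixed) update
    return rho

# usage: a scalar fixed-point map with solution rho* = 2
rho = scf(lambda r: 0.5 * r + 1.0, rho0=0.0)
```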
PARATEC Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 64 | | |
Approximate wallclock run time | 600 sec | | |
Memory (GB) | 51.0 | 0.796 | 0.761 | 1.258
Instructions per load/store | 3.35 | | |
Algebraic floating point operations | 1.81x10^14 | 2.83x10^12 | 2.819x10^12 | 2.835x10^12
Floating point instructions/all instructions | 0.60 | | |
Approximate floating point performance (Gflops/sec) | 304 | 4.750 | |
Approximate floating point percent of peak | 63% | | |
Percent of total run time spent in MPI routines | | 12.8% | 11.5% | 14.0%
Main PARATEC MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Wait | 2.452e+03 | 3.832e+01 | 4.21 | 3.513e+01 | 4.154e+01 | 49.523 | 6.384 |
MPI_Allreduce | 1.129e+03 | 1.764e+01 | 16.91 | 1.103e+01 | 2.435e+01 | 22.796 | 2.939 |
MPI_Isend | 6.663e+02 | 1.041e+01 | 3.29 | 9.699e+00 | 1.119e+01 | 13.455 | 1.734 |
MPI_Bcast | 3.757e+02 | 5.871e+00 | 6.95 | 2.758e+00 | 6.177e+00 | 7.587 | 0.978 |
MPI_Irecv | 1.438e+02 | 2.248e+00 | 3.82 | 2.067e+00 | 2.378e+00 | 2.905 | 0.374 |
MPI_Recv | 1.362e+02 | 2.128e+00 | 39.99 | 5.677e-01 | 4.699e+00 | 2.750 | 0.354 |
CAM 3.0
The Community Atmosphere Model (CAM) is the latest in a series of global atmosphere models developed at NCAR for the weather and climate research communities. CAM also serves as the atmospheric component of the Community Climate System Model (CCSM).
CAM was configured as follows:
$CAM_ROOT/models/atm/cam/bld/configure -res 128x256 -cam_bld build \
    -cflags='-qlargepage' -fc='mpxlf90_r' \
    -cflags='-qlargepage' -fopt='-qtune=auto -qarch=auto -O3 -strict' \
    -nc_inc="$NETCDF_INCLUDE" -nc_lib="$NETCDF_LIB" -smp -spmd -notest
CAM 3.0 Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 16 | | |
Approximate wallclock run time | 1400 sec | | |
Memory (GB) | 8.63 | 0.539 | 0.523 | 0.780
Instructions per load/store | 2.20 | | |
Algebraic floating point operations | 1.06x10^13 | 6.62x10^11 | 6.238x10^11 | 7.791x10^11
Floating point instructions/all instructions | 0.31 | | |
Approximate floating point performance (Gflops/sec) | 7.54 | 0.471 | |
Approximate floating point percent of peak | 4% | | |
Percent of total run time spent in MPI routines | | 17.2% | 7.4% | 21.6%
Main CAM MPI Routine Timings (sec)
Call | Sum | Average | CV (%) | Minimum | Maximum | % of MPI | % of wall |
---|---|---|---|---|---|---|---|
MPI_Alltoallv | 2.77e+03 | 1.73e+02 | 27.9 | 8.52e+01 | 2.80e+02 | 71.56 | 12.30 |
MPI_Allgatherv | 8.36e+02 | 5.22e+01 | 48.7 | 2.73e+00 | 6.81e+01 | 21.61 | 3.71 |
MPI_Bcast | 7.87e+01 | 4.92e+00 | 26.1 | 1.38e-01 | 5.40e+00 | 2.03 | 0.35 |
MPI_Gatherv | 7.10e+01 | 4.43e+00 | 11.1 | 2.82e+00 | 5.10e+00 | 1.83 | 0.31 |
MPI_Scatterv | 5.70e+01 | 3.56e+00 | 25.1 | 2.00e-01 | 3.80e+00 | 1.47 | 0.25 |
PIORAW Disk Read/Write Bandwidth Test
The NERSC PIORAW benchmark tests read and write bandwidth, using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB and individual file sizes of 2.5 GB.
Filesystem | MPI Tasks | Nodes | Aggregate Read BW (MB/sec) | Read BW per task (MB/sec) | Aggregate Write BW (MB/sec) | Write BW per task (MB/sec) |
---|---|---|---|---|---|---|
/scratch | 32 | 32 | 4401 | 138 | 4114 | 129 |
/scratch | 32 | 16 | 4400 | 138 | 4110 | 129 |
/scratch | 32 | 8 | 3796 | 119 | 3727 | 116 |
/scratch | 32 | 4 | 1874 | 58.6 | 2760 | 86.2 |
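The essence of the per-task measurement is simple: stream a file to disk in fixed-size buffers, time it, and report MB/sec. A single-process sketch follows, with tiny stand-in sizes (the real test writes 2.5 GB per task in 2 MB buffers across many MPI tasks, which this sketch does not attempt to reproduce).

```python
# Time a buffered sequential write and report the resulting bandwidth.
import os, time, tempfile

def write_bandwidth(path, buffer_size, total_size):
    buf = b"\0" * buffer_size
    start = time.perf_counter()
    with open(path, "wb") as fh:
        written = 0
        while written < total_size:
            fh.write(buf)
            written += buffer_size
        fh.flush()
        os.fsync(fh.fileno())            # include the time to reach the disk
    elapsed = time.perf_counter() - start
    return total_size / elapsed / 1e6    # MB/sec

# usage: 4 MB written in 64 KB buffers to a temporary file
path = os.path.join(tempfile.gettempdir(), "piodemo.dat")
bw = write_bandwidth(path, buffer_size=64 * 1024, total_size=4 * 1024 * 1024)
os.remove(path)
```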
MEMRATE
The MEMRATE benchmark is a facsimile of the STREAM/StarSTREAM benchmark. NERSC uses the TRIAD routine to characterize memory bandwidth. The TRIAD is a simple loop of the form:
do i = 1, n
   w(i) = u(i) + scale*v(i)
end do
This measures the ability to "stream" data to and from main memory. The result is quoted in MB/sec per MPI process. The measurement is taken using all 8 CPUs per node concurrently, which puts an extreme stress on the memory subsystem.
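The metric itself can be illustrated with a timed TRIAD loop: three eight-byte arrays are touched per iteration, so bytes moved per pass are counted as 3 x 8 x n. Pure Python is far too slow to stress the memory subsystem the way the Fortran benchmark does, so treat this only as an illustration of how the MB/sec figure is derived.

```python
# Timed TRIAD loop: w(i) = u(i) + scale*v(i), reported as MB/sec.
import time

def triad_mbytes_per_sec(n=200_000, scale=3.0, passes=5):
    u = [1.0] * n
    v = [2.0] * n
    w = [0.0] * n
    start = time.perf_counter()
    for _ in range(passes):
        for i in range(n):               # the TRIAD kernel itself
            w[i] = u[i] + scale * v[i]
    elapsed = time.perf_counter() - start
    return 3 * 8 * n * passes / elapsed / 1e6   # 3 arrays x 8 bytes x n

rate = triad_mbytes_per_sec()
```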
MEMRATE Summary Profile
 | Task Sum | Task Mean | Task Min. | Task Max.
---|---|---|---|---
Number of MPI tasks | 8 | | |
Approximate wallclock run time | 300 sec | | |
Memory (GB) | 10.0 | 1.25 | 1.25 | 1.25
FFTW Full Configuration Test
The purpose of the Full Configuration Test is to demonstrate the ability of the system to efficiently run applications that utilize all compute processors. The FCT is a scalable parallel code based on FFTW. The 3-D array size is 2048x2048x2048.