Bassi Benchmark Code Profiles

NERSC has a suite of codes that has been used to monitor, compare, and verify performance on Bassi. Some characteristics of the codes, measured on Bassi, follow. The measurements quoted below were made using profiled and instrumented runs, so the absolute performance is somewhat lower than for uninstrumented runs.
Serial Benchmarks

NPB 2.3 FT CLASS B
Reference: NASA Ames Research Center

NPB 2.3 MG CLASS B
Reference: NASA Ames Research Center

NPB 2.3 SP CLASS B
Reference: NASA Ames Research Center
Parallel Benchmarks

NPB 2.4 FT CLASS D
Reference: NASA Ames Research Center

A 3-D partial differential equation solution using FFTs. This kernel performs the essence of many "spectral" codes. It is a rigorous test of long-distance communication performance; a sketch of the global transpose that drives that communication follows the timings below.
MPI Routine Timings (sec)
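The long-distance traffic in FT comes from global transposes of the distributed 3-D array between the 1-D FFT stages. Below is a minimal sketch, assuming free-form Fortran with an MPI library, of the MPI_Alltoall exchange on which such a transpose is built; the program name, buffer contents, and block size nloc are illustrative, not taken from the benchmark source.

    program ft_transpose_sketch
      ! Sketch of the all-to-all block exchange underlying the
      ! distributed transpose in a parallel 3-D FFT (not NPB source).
      use mpi
      implicit none
      integer, parameter :: nloc = 1024    ! words per destination task (illustrative)
      integer :: ierr, rank, nprocs
      double complex, allocatable :: sendbuf(:), recvbuf(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      allocate(sendbuf(nloc*nprocs), recvbuf(nloc*nprocs))
      sendbuf = dcmplx(dble(rank), 0.0d0)

      ! every task exchanges one block with every other task; at CLASS D
      ! sizes these blocks are large, which is why FT stresses bisection
      ! bandwidth rather than message latency
      call MPI_Alltoall(sendbuf, nloc, MPI_DOUBLE_COMPLEX, &
                        recvbuf, nloc, MPI_DOUBLE_COMPLEX, &
                        MPI_COMM_WORLD, ierr)

      call MPI_Finalize(ierr)
    end program ft_transpose_sketch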
NPB 2.4 MG CLASS D
Reference: NASA Ames Research Center

A simple multigrid kernel. It requires highly structured long-distance communication and tests both short- and long-distance data communication. The V-cycle multigrid algorithm is used to obtain an approximate solution to the discrete Poisson problem on a grid with periodic boundary conditions; a much-reduced analogue of the V-cycle appears after the timings below.
MPI Routine Timings (sec)
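As an illustration of the V-cycle structure, here is a much-reduced analogue written for the 1-D Poisson problem -u'' = f with zero Dirichlet boundaries and Gauss-Seidel smoothing. The benchmark itself works on a periodic 3-D grid with a different smoother, so every name and size here is illustrative only.

    module vcycle_mod
      implicit none
    contains
      recursive subroutine vcycle(u, f, n)
        ! one V-cycle for -u'' = f on n interior points (n = 2**k - 1),
        ! h = 1/(n+1), zero Dirichlet boundaries (toy analogue of NPB MG)
        integer, intent(in)    :: n
        real(8), intent(inout) :: u(0:n+1)
        real(8), intent(in)    :: f(0:n+1)
        real(8) :: r(0:n+1), fc(0:(n-1)/2+1), ec(0:(n-1)/2+1), h2
        integer :: i, nc

        h2 = (1.0d0/(n+1))**2
        call smooth(u, f, n)                        ! pre-smoothing
        if (n >= 3) then
          nc = (n-1)/2
          do i = 1, n                               ! residual
            r(i) = f(i) - (2.0d0*u(i) - u(i-1) - u(i+1))/h2
          end do
          do i = 1, nc                              ! restrict (full weighting)
            fc(i) = 0.25d0*(r(2*i-1) + 2.0d0*r(2*i) + r(2*i+1))
          end do
          fc(0) = 0.0d0; fc(nc+1) = 0.0d0
          ec = 0.0d0
          call vcycle(ec, fc, nc)                   ! coarse-grid correction
          do i = 1, nc                              ! prolong: coincident points
            u(2*i) = u(2*i) + ec(i)
          end do
          do i = 0, nc                              ! prolong: in-between points
            u(2*i+1) = u(2*i+1) + 0.5d0*(ec(i) + ec(i+1))
          end do
        end if
        call smooth(u, f, n)                        ! post-smoothing
      end subroutine vcycle

      subroutine smooth(u, f, n)
        ! two Gauss-Seidel sweeps
        integer, intent(in)    :: n
        real(8), intent(inout) :: u(0:n+1)
        real(8), intent(in)    :: f(0:n+1)
        real(8) :: h2
        integer :: i, sweep
        h2 = (1.0d0/(n+1))**2
        do sweep = 1, 2
          do i = 1, n
            u(i) = 0.5d0*(u(i-1) + u(i+1) + h2*f(i))
          end do
        end do
      end subroutine smooth
    end module vcycle_mod

    program mg_demo
      use vcycle_mod
      implicit none
      integer, parameter :: n = 127
      real(8) :: u(0:n+1), f(0:n+1)
      integer :: cycle
      u = 0.0d0
      f = 1.0d0                             ! constant forcing
      do cycle = 1, 10
        call vcycle(u, f, n)
      end do
      print *, 'midpoint value:', u((n+1)/2)  ! exact solution is 1/8 at x = 1/2
    end program mg_demo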
NPB 2.4 SP CLASS D
Reference: NASA Ames Research Center

Three sets of uncoupled systems of equations are solved, first in x, then in y, and finally in z. The systems are scalar pentadiagonal. In the multi-partition algorithm used, each processor is responsible for several disjoint sub-blocks of points ("cells") of the grid. The cells are arranged such that for each direction of the line solve phase the cells belonging to a certain processor will be evenly distributed along the direction of solution. This allows each processor to perform useful work throughout a line solve, instead of being forced to wait for the partial solution to a line from another processor before beginning work. Additionally, the information from a cell is not sent to the next processor until all sections of linear equation systems handled in this cell have been solved. Therefore the granularity of communications is kept large and fewer messages are sent. A sketch of a scalar pentadiagonal solve appears after the timings below.
MPI Routine Timings (sec)
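For reference, a single scalar pentadiagonal line solve amounts to banded Gaussian elimination. The routine below is a serial sketch of that operation, not the benchmark's multi-partition implementation; the band names e, c, d, a, b and the absence of pivoting are illustrative assumptions.

    subroutine penta_solve(n, e, c, d, a, b, f, x)
      ! Solve a scalar pentadiagonal system by Gaussian elimination
      ! without pivoting (illustrative sketch; not the NPB SP source).
      ! Row i: e(i)*x(i-2) + c(i)*x(i-1) + d(i)*x(i)
      !        + a(i)*x(i+1) + b(i)*x(i+2) = f(i)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: e(n), c(n), d(n), a(n), b(n), f(n)
      real(8), intent(out)   :: x(n)
      real(8) :: m
      integer :: i

      ! forward elimination of the two subdiagonals
      do i = 1, n-1
        m = c(i+1)/d(i)
        d(i+1) = d(i+1) - m*a(i)
        a(i+1) = a(i+1) - m*b(i)
        f(i+1) = f(i+1) - m*f(i)
        if (i+2 <= n) then
          m = e(i+2)/d(i)
          c(i+2) = c(i+2) - m*a(i)
          d(i+2) = d(i+2) - m*b(i)
          f(i+2) = f(i+2) - m*f(i)
        end if
      end do

      ! back substitution on the remaining upper-triangular bands
      x(n)   = f(n)/d(n)
      x(n-1) = (f(n-1) - a(n-1)*x(n))/d(n-1)
      do i = n-2, 1, -1
        x(i) = (f(i) - a(i)*x(i+1) - b(i)*x(i+2))/d(i)
      end do
    end subroutine penta_solve

In SP's multi-partition scheme, loops like these are broken across the disjoint cells owned by each processor, with partial results handed to the next processor only after every system passing through a cell has been processed.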
GTC (Gyrokinetic Toroidal Code)
Reference: Princeton Plasma Physics Laboratory Theory Department

The Gyrokinetic Toroidal Code (GTC) was developed to study the dominant mechanism for energy transport in fusion devices, namely, plasma microturbulence. GTC solves the gyroaveraged Vlasov-Poisson system of equations using the particle-in-cell (PIC) approach. This method uses particles to sample the distribution function of the plasma system under study. The particles interact with each other only through a self-consistent field described on a grid, so that no binary forces need to be calculated. The equations of motion to be solved for the particles are simple ordinary differential equations, easily solved using a second-order Runge-Kutta algorithm.

The main tasks of the PIC method at each time step are as follows:
1. Scatter: the charge of each particle is distributed among its nearest grid points according to the current position of that particle.
2. Solve: the Poisson equation is solved on the grid to obtain the electrostatic potential at each point.
3. Gather: the force acting on each particle is calculated from the potential at its nearest grid points.
4. Move: the particles are advanced using the equations of motion.
These steps are repeated until the end of the simulation.

GTC has been highly optimized for cache-based superscalar machines such as the IBM SP. The data structure and loop ordering have been arranged for maximum cache reuse, which is the most important method of achieving high performance on this type of processor. In GTC, as in most particle codes, the main bottleneck is the charge deposition, or scatter, operation. The classic scatter algorithm loops over the particles, finds the nearest grid points surrounding each particle's position, and assigns a fraction of the particle's charge to those grid points in proportion to their distance from the particle's position; the charge fractions are accumulated in a grid array (a sketch follows the timings below). The scatter algorithm in GTC is more complex, since it deals with fast gyrating particles whose motion is described by charged rings tracked by their guiding centers. This results in a larger number of operations, since several points are picked on each ring and each of them has its own neighboring grid points.

GTC Summary Profile
GTC MPI Routine Timings (sec)
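For concreteness, the classic scatter step described above looks like the following in one dimension with linear (cloud-in-cell) weighting. This is an illustrative sketch, not GTC source; GTC's gyrokinetic version deposits charge at several points on each particle's ring instead of at a single position.

    subroutine scatter_1d(npart, x, q, dx, ng, rho)
      ! classic 1-D cloud-in-cell charge deposition (illustrative)
      implicit none
      integer, intent(in)    :: npart, ng
      real(8), intent(in)    :: x(npart)     ! particle positions in [0, ng*dx)
      real(8), intent(in)    :: q, dx        ! particle charge, grid spacing
      real(8), intent(inout) :: rho(0:ng)    ! accumulated charge density
      real(8) :: xg, w
      integer :: p, j

      do p = 1, npart
        xg = x(p)/dx
        j  = int(xg)                ! nearest grid point to the left
        w  = xg - dble(j)           ! fractional distance to that point
        ! split the particle's charge between the two surrounding points
        rho(j)   = rho(j)   + q*(1.0d0 - w)
        rho(j+1) = rho(j+1) + q*w
      end do
    end subroutine scatter_1d

The data-dependent indexing of rho is what makes the scatter cache-unfriendly: consecutive particles touch unrelated grid locations unless the particle data is kept well ordered, which is exactly what GTC's cache-reuse-oriented data layout addresses.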
PARATEC

PARATEC (Parallel Total Energy Code) performs ab initio quantum-mechanical total energy calculations using pseudopotentials and a plane-wave basis set. The pseudopotentials are of the standard norm-conserving variety (typically Hamann-Schlüter-Chiang or Troullier-Martins). Forces and stress can be easily calculated with PARATEC and used to relax the atoms into their equilibrium positions. The total energy minimization of the electrons is performed by the self-consistent field (SCF) method using Broyden, Pulay-Kerker, or the newly developed Pulay-Kerker-Thomas-Fermi charge-mixing schemes; a toy illustration of charge mixing follows the timings below. In the SCF method, the electronic minimization is performed by unconstrained conjugate-gradient algorithms on the Mauri-Galli-Ordejón (MGO) functional or the generalized Rayleigh quotient.

PARATEC Summary Profile
Main PARATEC MPI Routine Timings (sec)
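Charge mixing damps the density update between SCF iterations so that the fixed-point cycle converges. The toy program below, with the scalar x standing in for the charge density and cos(x) standing in for the density returned by the electronic solver, shows the simplest (linear) form of mixing; this is an illustrative analogue, and the Broyden and Pulay-Kerker schemes used in PARATEC build the update from the history of previous iterations rather than a fixed alpha.

    program scf_mixing_toy
      ! toy analogue of SCF charge mixing: find x with x = cos(x)
      implicit none
      real(8), parameter :: alpha = 0.5d0   ! mixing parameter (illustrative)
      real(8) :: x_in, x_out
      integer :: iter

      x_in = 0.0d0
      do iter = 1, 100
        x_out = cos(x_in)                   ! "output density" from the solver
        if (abs(x_out - x_in) < 1.0d-10) exit
        ! linear mixing: accept only a fraction of the new density
        x_in = (1.0d0 - alpha)*x_in + alpha*x_out
      end do
      print *, 'converged in', iter, 'iterations; x =', x_in
    end program scf_mixing_toy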
CAM 3.0

The Community Atmosphere Model (CAM) is the latest in a series of global atmosphere models developed at NCAR for the weather and climate research communities. CAM also serves as the atmospheric component of the Community Climate System Model (CCSM). CAM was configured as follows:

    $CAM_ROOT/models/atm/cam/bld/configure -res 128x256 -cam_bld build \
        -cflags='-qlargepage' -fc='mpxlf90_r' \
        -cflags='-qlargepage' -fopt='-qtune=auto -qarch=auto -O3 -strict' \
        -nc_inc="$NETCDF_INCLUDE" -nc_lib="$NETCDF_LIB" -smp -spmd -notest

CAM 3.0 Summary Profile
Main CAM MPI Routine Timings (sec)
PIORAW Disk Read/Write Bandwidth Test

The NERSC PIORAW benchmark tests read and write bandwidth using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB, and individual file sizes of 2.5 GB.
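In outline, each task of such a one-file-per-task test does something like the following. This is a sketch of the write phase only, assuming the standard Bassi parameters quoted above; the program name, file names, and use of Fortran stream I/O are illustrative, not the PIORAW source.

    program pioraw_sketch
      ! one-file-per-task write-bandwidth sketch (illustrative)
      use mpi
      implicit none
      integer, parameter :: bufwords = 262144   ! 2 MB of 8-byte reals
      integer, parameter :: nrec = 1280         ! 1280 * 2 MB = 2.5 GB per file
      integer :: ierr, rank, rec
      real(8), allocatable :: buf(:)
      real(8) :: t0, t1
      character(len=16) :: fname

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      write(fname, '(a,i4.4)') 'pioraw.', rank  ! one file per task
      allocate(buf(bufwords))
      buf = dble(rank)

      t0 = MPI_Wtime()
      open(unit=10, file=fname, form='unformatted', access='stream')
      do rec = 1, nrec
        write(10) buf
      end do
      close(10)            ! flushes Fortran buffers (a real test would also sync)
      t1 = MPI_Wtime()

      print *, 'task', rank, 'write bandwidth:', 2.0d0*nrec/(t1 - t0), 'MB/s'
      call MPI_Finalize(ierr)
    end program pioraw_sketch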
MEMRATE

The MEMRATE benchmark is a facsimile of the STREAM/StarSTREAM benchmark. NERSC uses the TRIAD routine to characterize memory bandwidth. The TRIAD is a simple loop of the form:

    do i = 1, n
       w(i) = u(i) + scale*v(i)
    end do

This measures the ability to "stream" data to and from main memory. The result is quoted in MB/sec per MPI process. The measurement is taken using all 8 CPUs per node concurrently, which puts extreme stress on the memory subsystem.

MEMRATE Summary Profile
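A complete, timed version of the TRIAD measurement might look like the following single-process sketch; the array size, timing method, and reporting are illustrative assumptions, and the actual benchmark runs one such process on each of a node's 8 CPUs concurrently.

    program memrate_sketch
      ! timed TRIAD sketch (illustrative, single process)
      implicit none
      integer, parameter :: n = 20000000    ! large enough to defeat the caches
      real(8), parameter :: scale = 3.0d0
      real(8), allocatable :: u(:), v(:), w(:)
      real(8) :: t0, t1
      integer :: i

      allocate(u(n), v(n), w(n))
      u = 1.0d0
      v = 2.0d0

      call cpu_time(t0)
      do i = 1, n
        w(i) = u(i) + scale*v(i)
      end do
      call cpu_time(t1)

      ! TRIAD moves 24 bytes per iteration: read u and v, write w
      print *, 'TRIAD rate:', 24.0d0*n/(t1 - t0)/1.0d6, 'MB/s'
      print *, w(n)             ! keep the compiler from eliding the loop
    end program memrate_sketch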
FFTW Full Configuration Test

The purpose of the Full Configuration Test (FCT) is to demonstrate the ability of the system to efficiently run applications that utilize all compute processors. The FCT is a scalable parallel code based on FFTW. The 3-D array size is 2048 x 2048 x 2048.
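The FCT itself is an MPI-parallel FFTW code. As a single-process illustration of the underlying operation, a 3-D complex transform through FFTW's legacy Fortran interface looks like the following; the grid size is reduced from the FCT's 2048^3 to something that fits in one process, and the program is a sketch rather than FCT source.

    program fct_sketch
      ! serial 3-D FFT via FFTW's legacy Fortran interface
      ! (single-process illustration; the FCT distributes a 2048^3 grid)
      implicit none
      include 'fftw3.f'
      integer, parameter :: n = 64        ! reduced from 2048 for one process
      integer(8) :: plan
      double complex :: a(n, n, n)

      a = (1.0d0, 0.0d0)
      call dfftw_plan_dft_3d(plan, n, n, n, a, a, FFTW_FORWARD, FFTW_ESTIMATE)
      call dfftw_execute_dft(plan, a, a)
      call dfftw_destroy_plan(plan)

      ! the DC element now holds the sum of all n**3 inputs
      print *, 'a(1,1,1) =', a(1,1,1)
    end program fct_sketch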