Bassi Performance

This page contains performance measurements made on Bassi for a sample of well-known benchmark codes, along with comparisons of performance as a function of compiler and run-time parameters. Comparisons are made to other NERSC machines: Seaborg, an IBM SP Power3 system (375 MHz) that was decommissioned in January 2008; Jacquard, a 2.2 GHz Opteron/InfiniBand cluster; and Franklin, a Cray XT4 with dual-core 2.6 GHz Opteron processors.

Some useful performance measurement tools are available on Bassi.

Bassi performance measurements and dependencies

Benchmark Results

NAS Parallel Benchmarks (NPB)

NPB 2.3 CLASS B Serial

Serial benchmarks are measured on a "packed node": on Bassi, 8 simultaneous instances of the benchmark are executed on a single node. The numbers in this table are based on averages over many ongoing measurements. A sketch of a packed-node job script follows the table.
Benchmark   Bassi Wall secs   Bassi MFlops/sec   Speedup vs. Seaborg   Speedup vs. Jacquard   Notes
FT          90.2              1014               8.4                   1.8
MG          13.5              1446               9.2                   1.8
SP          555               639                9.5                   1.3
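
As a rough illustration (not a NERSC-supplied script), a packed-node serial run can be driven by a LoadLeveler job that reserves one whole node and starts eight copies of the binary in the background; the keyword combination, wall-clock limit, and binary name ft.B are placeholders rather than Bassi-specific settings.

#!/bin/sh
#@ job_type         = parallel
#@ node             = 1
#@ tasks_per_node   = 8
#@ node_usage       = not_shared
#@ wall_clock_limit = 00:30:00
#@ queue
# Start 8 simultaneous instances of the serial NPB binary (packed node),
# then wait for all of them to finish.
for i in 1 2 3 4 5 6 7 8
do
  ./ft.B > ft.B.$i.out 2>&1 &
done
wait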

NPB 2.4 CLASS D Parallel

Parallel benchmarks are run on "packed nodes": on Bassi, 8 MPI tasks are executed on each node. The numbers in this table are based on averages over many ongoing measurements. A build-and-run sketch follows the table.
Benchmark  Tasks  Time (secs)  MFlops/sec/proc  Speedup vs. Seaborg  Notes
FT         64     171          819              8.9                  mpxlf_r -O3 -qtune=auto -qarch=auto
FT         256    46           756              -                    mpxlf_r -O3 -qtune=auto -qarch=auto
MG         64     36.4         1338             8.7                  mpxlf_r -O3 -qtune=auto -qarch=auto
MG         256    9.59         1269             -                    mpxlf_r -O3 -qtune=auto -qarch=auto
SP         64     797          581              9.7                  mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
SP         256    178          649              -                    mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
CG         64     361          158              -                    mpxlf_r -O3 -qtune=auto -qarch=auto
CG         256    94           151              -                    mpxlf_r -O3 -qtune=auto -qarch=auto
LU         64     558          1117             -                    mpxlf_r -O3 -qtune=auto -qarch=auto
LU         256    140          1115             -                    mpxlf_r -O3 -qtune=auto -qarch=auto
BT         64     894          1019             -                    mpxlf_r -O3 -qtune=auto -qarch=auto
BT         256    197          1155             -                    mpxlf_r -O3 -qtune=auto -qarch=auto
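
For reference, one of the 64-task runs above might be built and submitted roughly as follows. The make syntax is the standard NPB 2.4 convention and the LoadLeveler keywords are generic; only the compiler flags are taken from the table, so treat the rest as an assumption and check local documentation.

# Build FT, Class D, for 64 tasks with the flags from the table
# (the compiler line is set in config/make.def of the NPB 2.4 distribution):
#   mpxlf_r -O3 -qtune=auto -qarch=auto
make FT CLASS=D NPROCS=64

# LoadLeveler fragment for a packed run: 8 nodes x 8 tasks/node = 64 MPI tasks
#@ job_type         = parallel
#@ node             = 8
#@ tasks_per_node   = 8
#@ network.MPI      = sn_all,not_shared,us   # user-space HPS protocol; assumed, verify locally
#@ wall_clock_limit = 00:30:00
#@ queue
poe ./bin/ft.D.64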

Memrate

A facsimile of the STREAM/StarSTREAM benchmark. The Single TRIAD measures memory bandwidth from one processor; the Multi TRIAD measures the bandwidth per processor when all 8 processors run simultaneously.
Benchmark      MBytes/sec/proc   Speedup vs. Seaborg   Notes
Single TRIAD   7207              5.1                   xlf90_r -O3 -qhot -qstrict -qarch=auto -qtune=auto -qunroll=yes -qextname -qalias=nopteovrlp -qdebug=REFPARM
                                                       mpcc_r -O3 -qarch=auto -qtune=auto -qunroll=yes -qalias=allptrs -Q
Multi TRIAD    7000              16

MPI Test

64 tasks on 8 nodes.
Benchmark                          Result        Speedup vs. Seaborg   Notes
Max Point-to-Point Latency         4.5 us        6.6                   mpcc_r -o mpitest -O3 -lm
Min Point-to-Point Bandwidth       3073 MB/sec   9.7
Naturally Ordered Ring Bandwidth   1845 MB/sec   13.1
Randomly Ordered Ring Bandwidth    257 MB/sec    6.1

PIORAW Disk Read/Write Bandwidth Test

The NERSC PIORAW benchmark tests read and write bandwidth, using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB and individual file sizes of 2.5 GB.

Filesystem MPI Tasks Nodes Aggregate Read BW (MB/sec) Read BW per task (MB/sec) Aggregate Write BW (MB/sec) Write BW per task (MB/sec)
/scratch 32 32 4401 138 4114 129
/scratch 32 16 4400 138 4110 129
/scratch 32 8 3796 119 3727 116
/scratch 32 4 1874 58.6 2760 86.2

Application Benchmarks

Benchmark  Tasks  Bassi Wall secs  Speedup vs. Seaborg  Speedup vs. Jacquard  Notes
NERSC SSP-NCSb Benchmarks
CAM 3.0 16 1387 4.6 N/A cami_0000-09-01_128x256_L26_c040422
CAM 3.0 16 MPI x 2 OMP 730 4.7 N/A
CAM 3.0 16 MPI x 4 OMP 395 4.8 N/A
GTC 64 162 5.2 1.1
PARATEC 64 591 6.1 2.1
NERSC 5 Benchmarks
MILC M 64 138 7.5 3.3
MILC L 256 1496 6.4 1.7
GTC M 64 1553 5.3 1.2
GTC L 256 1679 5.8 1.3
PARATEC M 64 451 7.3 1.9 686 atom Si cluster
PARATEC L 256 854 8.0 1.9
GAMESS M 64 5837 3.2 0.9 41 atoms (943 basis functions) MP2 Gradient
GAMESS L 384 4683 9.0 N/A
MADBENCH M 64 1094 7.3 2.4 32K pixels, 16 bins
MADBENCH L 256 1277 6.6 1.9
PMEMD M 64 538 3.9 1.1
PMEMD L 256 475 6.4 1.6
CAM 3 M 64 1886 4.2 N/A finite-volume core, Grid D
CAM 3 L 256 527 4.6 N/A

Performance dependence on configuration parameters

This section is still being compiled.

Large Page Memory vs. Small Page Memory

Benchmark Tasks Nodes Large Page Performance Small Page Performance Units Large/Small
NPB MG 2.4 CLASS C 16 2 1380 1000 MFlops/sec/task 1.38
NPB MG 2.3 CLASS B SERIAL 1 1 1457 1057 MFlops/sec/task 1.38
NPB SP 2.4 CLASS C 16 2 712 614 MFlops/sec/task 1.16
NPB FT 2.4 CLASS C 16 2 842 745 MFlops/sec/task 1.13
NPB FT 2.3 CLASS B SERIAL 1 1 1062 1097 MFlops/sec/task 0.97
GTC (NERSC NCSb SSP) 64 8 752 706 MFlops/sec/task 1.07
PARATEC (NERSC NCSb SSP) 64 8 4749 4394 MFlops/sec/task 1.08
CAM 3.0 (NERSC NCSb SSP) 16 2 501 494 MFlops/sec/task 1.01
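
The page-size choice is made at build and submit time. A plausible recipe on an AIX/LoadLeveler system such as Bassi is sketched below: -qlargepage appears in the SP rows of the NPB table above, while the -blpdata binder option, the ldedit command, and the LoadLeveler large_page keyword are standard AIX/LoadLeveler mechanisms rather than settings documented on this page, so verify them against the system documentation.

# Compile and link with large-page data enabled (flags as in the SP rows above,
# plus the AIX binder option -blpdata; treat the exact combination as an assumption)
mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage -blpdata -o sp.C.16 sp.f

# Or mark an already-built executable for large-page data
ldedit -blpdata ./sp.C.16

# Request large-page memory for the job in the LoadLeveler script
#@ large_page = Y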

HPS Bulk Transfer (RDMA)

The POWER5 nodes on the HPS (High Performance Switch) have a "bulk transfer" or RDMA (Remote Direct Memory Access) mode that improves message-passing bandwidth for large messages. It is enabled in the LoadLeveler script with the keyword:

#@bulkxfer=yes

The graph below shows the point-to-point bandwidth as a function of message size with the default settings of MP_EAGER_LIMIT=32K and MP_BULK_MIN_MSG_SIZE=4096. RDMA is never used for a message size of 4 KB or less.
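
The same settings can be spelled out explicitly in the batch script; the lines below simply restate the LoadLeveler keyword and the default IBM Parallel Environment values quoted above (shown in sh syntax).

# Enable RDMA/bulk transfer for the job
#@ bulkxfer = yes

# IBM PE environment variables controlling the crossover points
# (values are the defaults quoted in the text above)
export MP_EAGER_LIMIT=32768          # eager/rendezvous message-size limit (32K)
export MP_BULK_MIN_MSG_SIZE=4096     # smallest message eligible for bulk transfer (4 KB)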

[Figure: HPS point-to-point bandwidth as a function of message size, with and without bulk transfer.]
Benchmark                               Tasks  Nodes  With Bulk Xfer (MB/sec)  Without Bulk Xfer (MB/sec)  With/Without
Point-to-Point Bandwidth (4MB buffer)   -      -      3100                     1700                        1.8

MPI Latency

When running parallel codes that do not explicitly create threads (multiple MPI tasks are OK), set the environment variable MP_SINGLE_THREAD to yes to improve HPS latency; see the example after the table.

Setting                          MPI Point-to-Point Internode Latency
MP_SINGLE_THREAD=yes             4.5 microseconds
unset or MP_SINGLE_THREAD=no     5.1 microseconds
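
For example, in the batch script (sh syntax), before launching the code with poe:

# Safe only when the application itself creates no extra threads
export MP_SINGLE_THREAD=yes
poe ./mpitest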

Performance dependence on compiler options

Presentation from June 2006 NUG meeting (PDF Format).
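
The slides are not reproduced here, but the flag sets exercised elsewhere on this page give a reasonable starting point; all of the options below are standard IBM XL flags already quoted in the tables above, collected here as a sketch rather than a tuned recipe.

# Baseline used for most of the NPB runs above
mpxlf_r -O3 -qarch=auto -qtune=auto

# More aggressive optimization, as used for NPB SP (with large pages)
mpxlf_r -O5 -qarch=auto -qtune=auto -qlargepage

# Flags used for the Memrate (TRIAD) Fortran build
xlf90_r -O3 -qhot -qstrict -qarch=auto -qtune=auto -qunroll=yes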

IBM Publications

