
Bassi Performance

This page contains performance measurements made on Bassi with a sample of well-known benchmark codes, along with performance comparisons as a function of compiler and run-time parameters. Comparisons are made to other NERSC machines: Seaborg, an IBM SP Power3 system (375 MHz) that was decommissioned in January 2008; Jacquard, a 2.2 GHz Opteron/InfiniBand cluster; and Franklin, a Cray XT4 with dual-core 2.6 GHz Opteron processors.

Bassi performance measurements and dependencies

 

Benchmark Results

NAS Parallel Benchmarks (NPB)

NPB 2.3 CLASS B Serial

Serial benchmarks are measured on a "packed node": on Bassi, 8 simultaneous instances of the benchmark are executed on a single node. The numbers in this table are based on averages over many ongoing measurements.

Benchmark   Bassi Wall secs   Bassi MFlops/sec   Speedup vs. Seaborg   Speedup vs. Jacquard   Notes
FT          90.2              1014               8.4                   1.8
MG          13.5              1446               9.2                   1.8
SP          555               639                9.5                   1.3

NPB 2.4 CLASS D Parallel

Parallel benchmarks are run on "packed nodes": on Bassi, 8 tasks are executed on each node. The numbers in this table are based on averages over many ongoing measurements.

Benchmark   Tasks   Time (secs)   MFlops/sec/proc   Speedup vs. Seaborg   Notes
FT          64      171           819               8.9                   mpxlf_r -O3 -qtune=auto -qarch=auto
FT          256     46            756                                     mpxlf_r -O3 -qtune=auto -qarch=auto
MG          64      36.4          1338              8.7                   mpxlf_r -O3 -qtune=auto -qarch=auto
MG          256     9.59          1269                                    mpxlf_r -O3 -qtune=auto -qarch=auto
SP          64      797           581               9.7                   mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
SP          256     178           649                                     mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
CG          64      361           158                                     mpxlf_r -O3 -qtune=auto -qarch=auto
CG          256     94            151                                     mpxlf_r -O3 -qtune=auto -qarch=auto
LU          64      558           1117                                    mpxlf_r -O3 -qtune=auto -qarch=auto
LU          256     140           1115                                    mpxlf_r -O3 -qtune=auto -qarch=auto
BT          64      894           1019                                    mpxlf_r -O3 -qtune=auto -qarch=auto
BT          256     197           1155                                    mpxlf_r -O3 -qtune=auto -qarch=auto

Memrate

A facsimile of the STREAM/StarSTREAM benchmark. The Single TRIAD measures memory bandwidth from one processor; the Multi TRIAD measures the bandwidth per processor from all 8 processors simultaneously. A sketch of the TRIAD kernel follows the table.

Benchmark      MBytes/sec/proc   Speedup vs. Seaborg   Notes
Single TRIAD   7207              5.1                   xlf90_r -O3 -qhot -qstrict -qarch=auto -qtune=auto -qunroll=yes -qextname -qalias=nopteovrlp -qdebug=REFPARM
                                                       mpcc_r -O3 -qarch=auto -qtune=auto -qunroll=yes -qalias=allptrs -Q
Multi TRIAD    7000              16
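
For reference, the TRIAD kernel itself is the simple vector update a(i) = b(i) + scalar*c(i). The following is a minimal C sketch of the Single TRIAD measurement; the array length, repetition count, and timing code are illustrative assumptions, not the actual Memrate source.

#include <stdio.h>
#include <sys/time.h>

#define N      20000000   /* assumed array length; large enough to overflow the caches */
#define NTIMES 10         /* assumed number of repetitions; the best time is kept */

static double a[N], b[N], c[N];

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    const double scalar = 3.0;
    double best = 1.0e30;

    /* Touch the arrays once so the pages are instantiated before timing. */
    for (long i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.5;
    }

    /* TRIAD kernel: a(i) = b(i) + scalar*c(i). */
    for (int k = 0; k < NTIMES; k++) {
        double t = wtime();
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
        t = wtime() - t;
        if (t < best)
            best = t;
    }

    /* Each loop iteration moves three 8-byte words (two loads, one store). */
    double mbytes = 3.0 * sizeof(double) * (double)N / 1.0e6;
    printf("TRIAD bandwidth: %.0f MB/sec\n", mbytes / best);
    return 0;
}

The Multi TRIAD figure is obtained by running one copy of this loop on each of the node's 8 processors at the same time, which exposes contention for the shared memory system.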

MPI Test

The MPI test uses 64 tasks on 8 nodes. A sketch of the point-to-point measurement follows the table.

Benchmark                          Result         Speedup vs. Seaborg   Notes
Max Point-to-Point Latency         4.5 us         6.6                   mpcc_r -o mpitest -O3 -lm
Min Point-to-Point Bandwidth       3073 MB/sec    9.7
Naturally Ordered Ring Bandwidth   1845 MB/sec    13.1
Randomly Ordered Ring Bandwidth    257 MB/sec     6.1
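
As an illustration of how the point-to-point numbers are obtained, the sketch below times a two-rank ping-pong exchange. The 4 MB message size and repetition count are assumptions; the NERSC test additionally measures the ring patterns, and the latency figure comes from the same exchange with very small messages, where overhead rather than bandwidth dominates.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (4 * 1024 * 1024)   /* assumed message size: 4 MB */
#define NREPS  100                 /* assumed number of ping-pong exchanges */

int main(int argc, char **argv)
{
    int rank;
    MPI_Status status;
    char *buf = malloc(NBYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks 0 and 1 bounce one message back and forth; other ranks idle. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        /* Each repetition moves the message in both directions. */
        double mb = 2.0 * NREPS * (double)NBYTES / 1.0e6;
        printf("bandwidth %.0f MB/sec, one-way time %.1f usec\n",
               mb / elapsed, 1.0e6 * elapsed / (2.0 * NREPS));
    }

    free(buf);
    MPI_Finalize();
    return 0;
}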

PIORAW Disk Read/Write Bandwidth Test

The NERSC PIORAW benchmark tests read and write bandwidth using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB, and individual file sizes of 2.5 GB; a sketch of the per-task measurement follows the table.

Filesystem   MPI Tasks   Nodes   Aggregate Read BW (MB/sec)   Read BW per task (MB/sec)   Aggregate Write BW (MB/sec)   Write BW per task (MB/sec)
/scratch     32          32      4401                         138                         4114                          129
/scratch     32          16      4400                         138                         4110                          129
/scratch     32          8       3796                         119                         3727                          116
/scratch     32          4       1874                         58.6                        2760                          86.2
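
The measurement pattern, one file per MPI task with each task timing its own sequential I/O, can be sketched as follows. The file-name pattern and the use of buffered fwrite are illustrative assumptions rather than the PIORAW source; the read test works the same way with fread.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSIZE  (2L * 1024 * 1024)       /* 2 MB buffer, as in the standard test */
#define FILESIZE (2560LL * 1024 * 1024)   /* 2.5 GB per task, as in the standard test */

int main(int argc, char **argv)
{
    int rank;
    char fname[256];
    char *buf = malloc(BUFSIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'x', BUFSIZE);

    /* One file per task; the path pattern is illustrative. */
    sprintf(fname, "/scratch/pioraw_test.%d", rank);
    FILE *fp = fopen(fname, "w");
    if (fp == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Start all tasks together so the aggregate load on the filesystem is realistic. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (long long written = 0; written < FILESIZE; written += BUFSIZE)
        fwrite(buf, 1, BUFSIZE, fp);
    fclose(fp);
    MPI_Barrier(MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - t0;   /* time of the slowest task */

    if (rank == 0)
        printf("write bandwidth: %.0f MB/sec per task\n",
               (double)FILESIZE / 1.0e6 / elapsed);

    free(buf);
    MPI_Finalize();
    return 0;
}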

Application Benchmarks

Benchmark    Tasks            Bassi Wall secs   Speedup vs. Seaborg   Speedup vs. Jacquard   Notes

NERSC SSP-NCSb Benchmarks
CAM 3.0      16               1387              4.6                   N/A                    cami_0000-09-01_128x256_L26_c040422
CAM 3.0      16 MPI x 2 OMP   730               4.7                   N/A
CAM 3.0      16 MPI x 4 OMP   395               4.8                   N/A
GTC          64               162               5.2                   1.1
PARATEC      64               591               6.1                   2.1

NERSC 5 Benchmarks
MILC M       64               138               7.5                   3.3
MILC L       256              1496              6.4                   1.7
GTC M        64               1553              5.3                   1.2
GTC L        256              1679              5.8                   1.3
PARATEC M    64               451               7.3                   1.9                    686-atom Si cluster
PARATEC L    256              854               8.0                   1.9
GAMESS M     64               5837              3.2                   0.9                    41 atoms (943 basis functions), MP2 gradient
GAMESS L     384              4683              9.0                   N/A
MADBENCH M   64               1094              7.3                   2.4                    32K pixels, 16 bins
MADBENCH L   256              1277              6.6                   1.9
PMEMD M      64               538               3.9                   1.1
PMEMD L      256              475               6.4                   1.6
CAM 3 M      64               1886              4.2                   N/A                    finite-volume core, Grid D
CAM 3 L      256              527               4.6                   N/A

 

Performance dependence on configuration parameters

This section is still being compiled.

Large Page Memory vs. Small Page Memory

Benchmark                   Tasks   Nodes   Large Page Performance   Small Page Performance   Units             Large/Small
NPB MG 2.4 CLASS C          16      2       1380                     1000                     MFlops/sec/task   1.38
NPB MG 2.3 CLASS B SERIAL   1       1       1457                     1057                     MFlops/sec/task   1.38
NPB SP 2.4 CLASS C          16      2       712                      614                      MFlops/sec/task   1.16
NPB FT 2.4 CLASS C          16      2       842                      745                      MFlops/sec/task   1.13
NPB FT 2.3 CLASS B SERIAL   1       1       1062                     1097                     MFlops/sec/task   0.97
GTC (NERSC NCSb SSP)        64      8       752                      706                      MFlops/sec/task   1.07
PARATEC (NERSC NCSb SSP)    64      8       4749                     4394                     MFlops/sec/task   1.08
CAM 3.0 (NERSC NCSb SSP)    16      2       501                      494                      MFlops/sec/task   1.01
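
The table above does not show how large pages are requested. On AIX this is typically done either at build time (for example with the -qlargepage compiler option that appears in the NPB notes, or the -blpdata linker option) or at run time through the loader control environment variable; the exact forms below are assumptions to be checked against the IBM documentation for your compiler and AIX release:

setenv LDR_CNTRL LARGE_PAGE_DATA=Y        (csh/tcsh)
export LDR_CNTRL=LARGE_PAGE_DATA=Y        (sh/ksh)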

HPS Bulk Transfer (RDMA)

POWER5 systems with the HPS interconnect have a "bulk transfer," or RDMA (Remote Direct Memory Access), mode that improves message-passing bandwidth for large messages. It is enabled in the LoadLeveler script with the keyword:

#@bulkxfer=yes
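
For context, the keyword goes alongside the usual LoadLeveler job keywords. A minimal batch script might look like the sketch below, where the class name, node counts, wall-clock limit, and executable name are illustrative assumptions:

#@ job_type         = parallel
#@ class            = regular
#@ node             = 8
#@ tasks_per_node   = 8
#@ wall_clock_limit = 00:30:00
#@ bulkxfer         = yes
#@ queue

./a.out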

The graph below shows the point-to-point bandwidth as a function of message size with the default settings of MP_EAGER_LIMIT=32K and MP_BULK_MIN_MSG_SIZE=4096. RDMA will never be used for a message size of 4 KB or less.

[Graph: HPS point-to-point bandwidth as a function of message size; a larger PDF version is available.]

Benchmark                                Tasks   Nodes   With Bulk Xfer (MB/sec)   Without Bulk Xfer (MB/sec)   With/Without
Point-to-Point Bandwidth (4 MB buffer)   -       -       3100                      1700                         1.8

MPI Latency

When running parallel codes that do not explicitly create threads (multiple MPI tasks are fine), set the environment variable MP_SINGLE_THREAD to yes to improve HPS latency, as in the example after the table.

Setting                        MPI Point-to-Point Internode Latency
MP_SINGLE_THREAD=yes           4.5 microseconds
unset or MP_SINGLE_THREAD=no   5.1 microseconds
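
The variable can be set in the batch script or login environment before the executable is launched; which form applies depends on your shell:

setenv MP_SINGLE_THREAD yes        (csh/tcsh)
export MP_SINGLE_THREAD=yes        (sh/ksh/bash)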

 

Performance dependence on compiler options

Presentation from June 2006 NUG meeting (PDF Format).

 

IBM Publications

  • IBM System p5, eServer p5, pSeries, OpenPower and IBM RS/6000 Performance Report, January 11, 2006