Bassi Performance
This page contains performance measurements made on Bassi for a sample of well-known benchmark codes, along with performance comparisons as a function of compiler and run-time parameters. Comparisons are made to other NERSC machines: Seaborg, an IBM SP Power3 system (375 MHz) that was decommissioned in January 2008; Jacquard, a 2.2 GHz Opteron/InfiniBand cluster; and Franklin, a Cray XT4 with dual-core 2.6 GHz Opteron processors.
Bassi performance measurements and dependencies
- Benchmark Results
- Performance dependence on Configuration Parameters
- Performance dependence on Compilation Options
- IBM Publications
Benchmark Results
NAS Parallel Benchmarks (NPB)
NPB 2.3 CLASS B Serial
Serial benchmarks are measured on a "packed node": on Bassi, 8 simultaneous instances of the benchmark are executed on a single node (see the example job script after the table). The numbers in this table are based on averages over many ongoing measurements.
Benchmark | Bassi Wall secs | Bassi MFlops/sec | Speedup vs. Seaborg | Speedup vs. Jacquard | Notes
---|---|---|---|---|---
FT | 90.2 | 1014 | 8.4 | 1.8 |
MG | 13.5 | 1446 | 9.2 | 1.8 |
SP | 555 | 639 | 9.5 | 1.3 |
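For illustration only, a packed-node serial run can be driven by a LoadLeveler job that launches eight copies of the binary in the background on one dedicated node. The sketch below uses generic LoadLeveler keywords; the wall-clock limit and the executable name are placeholders, not values taken from these runs.

    #@ job_type         = serial
    #@ node_usage       = not_shared
    #@ wall_clock_limit = 0:30:00
    #@ queue
    # Start 8 simultaneous instances of the serial benchmark on the node,
    # then wait for all of them to finish before the job exits.
    for i in 1 2 3 4 5 6 7 8; do
        ./mg.B.x > mg.B.out.$i 2>&1 &
    done
    wait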
NPB 2.4 CLASS D Parallel
Parallel benchmarks are run on "packed nodes": on Bassi, 8 tasks are executed on each node (see the example job script after the table). The numbers in this table are based on averages over many ongoing measurements.
Benchmark | Tasks | Wall secs | MFlops/sec/proc | Speedup vs. Seaborg | Notes
---|---|---|---|---|---
FT | 64 | 171 | 819 | 8.9 | mpxlf_r -O3 -qtune=auto -qarch=auto
FT | 256 | 46 | 756 | | mpxlf_r -O3 -qtune=auto -qarch=auto
MG | 64 | 36.4 | 1338 | 8.7 | mpxlf_r -O3 -qtune=auto -qarch=auto
MG | 256 | 9.59 | 1269 | | mpxlf_r -O3 -qtune=auto -qarch=auto
SP | 64 | 797 | 581 | 9.7 | mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
SP | 256 | 178 | 649 | | mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
CG | 64 | 361 | 158 | | mpxlf_r -O3 -qtune=auto -qarch=auto
CG | 256 | 94 | 151 | | mpxlf_r -O3 -qtune=auto -qarch=auto
LU | 64 | 558 | 1117 | | mpxlf_r -O3 -qtune=auto -qarch=auto
LU | 256 | 140 | 1115 | | mpxlf_r -O3 -qtune=auto -qarch=auto
BT | 64 | 894 | 1019 | | mpxlf_r -O3 -qtune=auto -qarch=auto
BT | 256 | 197 | 1155 | | mpxlf_r -O3 -qtune=auto -qarch=auto
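As a sketch only, a packed-node 64-task run uses 8 nodes with 8 tasks per node. The compile flags are the ones listed in the Notes column above; the network, wall-clock-limit, and executable-name settings below are typical values assumed for illustration, not the exact ones used for these runs.

    #@ job_type         = parallel
    #@ node             = 8
    #@ tasks_per_node   = 8
    #@ network.MPI      = sn_all,not_shared,US
    #@ wall_clock_limit = 1:00:00
    #@ queue
    # 8 nodes x 8 tasks/node = 64 MPI tasks ("packed nodes")
    poe ./ft.D.64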
Memrate
A facsimile of the STREAM/StarSTREAM benchmark, which times the TRIAD vector update a(i) = b(i) + q*c(i). The Single TRIAD test measures memory bandwidth from one processor; the Multi TRIAD test measures the bandwidth per processor when all 8 processors run simultaneously.
Benchmark | MBytes/sec/proc | Speedup vs. Seaborg | Notes
---|---|---|---
Single TRIAD | 7207 | 5.1 | xlf90_r -O3 -qhot -qstrict -qarch=auto -qtune=auto -qunroll=yes -qextname -qalias=nopteovrlp -qdebug=REFPARM; mpcc_r -O3 -qarch=auto -qtune=auto -qunroll=yes -qalias=allptrs -Q
Multi TRIAD | 7000 | 16 |
MPI Test
64 tasks on 8 nodes.
Benchmark | Result | Speedup vs. Seaborg | Notes |
---|---|---|---|
Max Point-to-Point Latency | 4.5 us | 6.6 | mpcc_r -o mpitest -O3 -lm |
Min Point-to-Point Bandwidth | 3073 MB/sec | 9.7 | |
Naturally Ordered Ring Bandwidth | 1845 MB/sec | 13.1 | |
Randomly Ordered Ring Bandwidth | 257 MB/sec | 6.1 |
PIORAW Disk Read/Write Bandwidth Test
The NERSC PIORAW benchmark tests read and write bandwidth, using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB and individual file sizes of 2.5 GB.
Filesystem | MPI Tasks | Nodes | Aggregate Read BW (MB/sec) | Read BW per task (MB/sec) | Aggregate Write BW (MB/sec) | Write BW per task (MB/sec) |
---|---|---|---|---|---|---|
/scratch | 32 | 32 | 4401 | 138 | 4114 | 129 |
/scratch | 32 | 16 | 4400 | 138 | 4110 | 129 |
/scratch | 32 | 8 | 3796 | 119 | 3727 | 116 |
/scratch | 32 | 4 | 1874 | 58.6 | 2760 | 86.2 |
Application Benchmarks
Benchmark | Tasks | Bassi Wall secs | Speedup vs. Seaborg | Speedup vs. Jacquard | Notes
---|---|---|---|---|---
NERSC SSP-NCSb Benchmarks | | | | |
CAM 3.0 | 16 | 1387 | 4.6 | N/A | cami_0000-09-01_128x256_L26_c040422
CAM 3.0 | 16 MPI x 2 OMP | 730 | 4.7 | N/A |
CAM 3.0 | 16 MPI x 4 OMP | 395 | 4.8 | N/A |
GTC | 64 | 162 | 5.2 | 1.1 |
PARATEC | 64 | 591 | 6.1 | 2.1 |
NERSC 5 Benchmarks | | | | |
MILC M | 64 | 138 | 7.5 | 3.3 |
MILC L | 256 | 1496 | 6.4 | 1.7 |
GTC M | 64 | 1553 | 5.3 | 1.2 |
GTC L | 256 | 1679 | 5.8 | 1.3 |
PARATEC M | 64 | 451 | 7.3 | 1.9 | 686 atom Si cluster
PARATEC L | 256 | 854 | 8.0 | 1.9 |
GAMESS M | 64 | 5837 | 3.2 | 0.9 | 41 atoms (943 basis functions) MP2 Gradient
GAMESS L | 384 | 4683 | 9.0 | N/A |
MADBENCH M | 64 | 1094 | 7.3 | 2.4 | 32K pixels, 16 bins
MADBENCH L | 256 | 1277 | 6.6 | 1.9 |
PMEMD M | 64 | 538 | 3.9 | 1.1 |
PMEMD L | 256 | 475 | 6.4 | 1.6 |
CAM 3 M | 64 | 1886 | 4.2 | N/A | finite-volume core, Grid D
CAM 3 L | 256 | 527 | 4.6 | N/A |
Performance dependence on configuration parameters
Being compiled.
Large Page Memory vs. Small Page Memory
Benchmark | Tasks | Nodes | Large Page Performance | Small Page Performance | Units | Large/Small |
---|---|---|---|---|---|---|
NPB MG 2.4 CLASS C | 16 | 2 | 1380 | 1000 | MFlops/sec/task | 1.38 |
NPB MG 2.3 CLASS B SERIAL | 1 | 1 | 1457 | 1057 | MFlops/sec/task | 1.38 |
NPB SP 2.4 CLASS C | 16 | 2 | 712 | 614 | MFlops/sec/task | 1.16 |
NPB FT 2.4 CLASS C | 16 | 2 | 842 | 745 | MFlops/sec/task | 1.13 |
NPB FT 2.3 CLASS B SERIAL | 1 | 1 | 1062 | 1097 | MFlops/sec/task | 0.97 |
GTC (NERSC NCSb SSP) | 64 | 8 | 752 | 706 | MFlops/sec/task | 1.07 |
PARATEC (NERSC NCSb SSP) | 64 | 8 | 4749 | 4394 | MFlops/sec/task | 1.08 |
CAM 3.0 (NERSC NCSb SSP) | 16 | 2 | 501 | 494 | MFlops/sec/task | 1.01 |
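A minimal sketch of building and running with large pages follows. Only the -qlargepage compiler flag appears in the tables on this page; the -blpdata link option and the LDR_CNTRL setting are assumptions about the usual AIX large-page procedure, and the executable name is a placeholder.

    # Compile and link with large-page support (compiler flags as in the NPB SP rows above;
    # -blpdata marks the executable's data segment as large-page eligible)
    mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage -blpdata -o sp.C.16 sp.f
    # Request large-page data at run time via the AIX loader control variable
    export LDR_CNTRL=LARGE_PAGE_DATA=Y
    poe ./sp.C.16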
HPS Bulk Transfer (RDMA)
The POWER5 HPS interconnect has a "bulk transfer" or RDMA (Remote Direct Memory Access) mode that improves message-passing bandwidth for large messages. It is enabled in a LoadLeveler script with the keyword:
#@bulkxfer=yes
Point-to-point bandwidth was measured as a function of message size with the default settings of MP_EAGER_LIMIT=32K and MP_BULK_MIN_MSG_SIZE=4096; RDMA will never be used for a message size of 4 KB or less.
Benchmark | Tasks | Nodes | With Bulk Xfer | Without Bulk Xfer | Units | With/Without
---|---|---|---|---|---|---
Point-to-Point Bandwidth (4 MB buffer) | - | - | 3100 | 1700 | MB/sec | 1.8
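A sketch of a job that enables bulk transfer is shown below. The node counts and the executable name are placeholders, and the MP_BULK_MIN_MSG_SIZE setting simply restates the default quoted above; only the bulkxfer keyword and the MP_ variable names come from this page.

    #@ job_type       = parallel
    #@ node           = 8
    #@ tasks_per_node = 8
    #@ bulkxfer       = yes
    #@ queue
    # Messages at or below MP_BULK_MIN_MSG_SIZE bytes are never sent via RDMA
    export MP_BULK_MIN_MSG_SIZE=4096
    poe ./a.out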
MPI Latency
When running parallel codes that do not explicitly create threads (multiple MPI tasks are fine), set the environment variable MP_SINGLE_THREAD to yes to improve HPS latency (see the example after the table).
Setting | MPI Point to Point Internode Latency |
---|---|
MP_SINGLE_THREAD=yes | 4.5 microseconds |
unset or MP_SINGLE_THREAD=no | 5.1 microseconds |
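For example, in the batch script before the poe invocation (shell syntax; mpitest is the executable built with the compile line shown in the MPI Test table above):

    # The code creates no extra threads, so tell IBM PE to skip thread locking;
    # this reduces point-to-point latency from 5.1 us to 4.5 us (table above).
    export MP_SINGLE_THREAD=yes
    poe ./mpitest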
Performance dependence on compiler options
Presentation from June 2006 NUG meeting (PDF Format).
IBM Publications
- IBM System p5, eServer p5, pSeries, OpenPower and IBM RS/6000 Performance Report, January 11, 2006