Bassi Performance
This page contains performance measurements made on Bassi for a sample of well-known benchmark codes, along with performance comparisons as a function of compiler and run-time parameters. Comparisons are made to other NERSC machines: Seaborg, an IBM SP Power3 system (375 MHz) that was decommissioned in January 2008; Jacquard, a 2.2 GHz Opteron/InfiniBand cluster; and Franklin, a Cray XT4 with dual-core 2.6 GHz Opteron processors.
Bassi performance measurements and dependencies
- Benchmark Results
- Performance dependence on Configuration Parameters
- Performance dependence on Compilation Options
- IBM Publications
Benchmark Results
NAS Parallel Benchmarks (NPB)
NPB 2.3 CLASS B Serial
Serial benchmarks are measured on a "packed node": on Bassi, 8 simultaneous instances of the benchmark are executed on a single node (see the example job script after the table). The numbers in this table are based on averages over many ongoing measurements.
Benchmark | Bassi Wall secs | Bassi MFlops/sec | Speedup vs. Seaborg | Speedup vs. Jacquard | Notes
---|---|---|---|---|---
FT | 90.2 | 1014 | 8.4 | 1.8 |
MG | 13.5 | 1446 | 9.2 | 1.8 |
SP | 555 | 639 | 9.5 | 1.3 |
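For illustration only, a packed-node serial run can be driven by a LoadLeveler job that launches eight copies of the binary in the background on one dedicated node. The sketch below uses generic LoadLeveler keywords; the wall-clock limit and the executable name are placeholders, not values taken from these runs.

    #@ job_type         = serial
    #@ node_usage       = not_shared
    #@ wall_clock_limit = 0:30:00
    #@ queue
    # Start 8 simultaneous instances of the serial benchmark on the node,
    # then wait for all of them to finish before the job exits.
    for i in 1 2 3 4 5 6 7 8; do
        ./mg.B.x > mg.B.out.$i 2>&1 &
    done
    wait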
NPB 2.4 CLASS D Parallel
Parallel benchmarks are run on "packed nodes": on Bassi, 8 tasks are executed on each node (see the example job script after the table). The numbers in this table are based on averages over many ongoing measurements.
Benchmark | Tasks | Wall secs | MFlops/sec/proc | Speedup vs. Seaborg | Notes
---|---|---|---|---|---
FT | 64 | 171 | 819 | 8.9 | mpxlf_r -O3 -qtune=auto -qarch=auto
FT | 256 | 46 | 756 | | mpxlf_r -O3 -qtune=auto -qarch=auto
MG | 64 | 36.4 | 1338 | 8.7 | mpxlf_r -O3 -qtune=auto -qarch=auto
MG | 256 | 9.59 | 1269 | | mpxlf_r -O3 -qtune=auto -qarch=auto
SP | 64 | 797 | 581 | 9.7 | mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
SP | 256 | 178 | 649 | | mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage
CG | 64 | 361 | 158 | | mpxlf_r -O3 -qtune=auto -qarch=auto
CG | 256 | 94 | 151 | | mpxlf_r -O3 -qtune=auto -qarch=auto
LU | 64 | 558 | 1117 | | mpxlf_r -O3 -qtune=auto -qarch=auto
LU | 256 | 140 | 1115 | | mpxlf_r -O3 -qtune=auto -qarch=auto
BT | 64 | 894 | 1019 | | mpxlf_r -O3 -qtune=auto -qarch=auto
BT | 256 | 197 | 1155 | | mpxlf_r -O3 -qtune=auto -qarch=auto
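As a sketch only, a packed-node 64-task run uses 8 nodes with 8 tasks per node. The compile flags are the ones listed in the Notes column above; the network, wall-clock-limit, and executable-name settings below are typical values assumed for illustration, not the exact ones used for these runs.

    #@ job_type         = parallel
    #@ node             = 8
    #@ tasks_per_node   = 8
    #@ network.MPI      = sn_all,not_shared,US
    #@ wall_clock_limit = 1:00:00
    #@ queue
    # 8 nodes x 8 tasks/node = 64 MPI tasks ("packed nodes")
    poe ./ft.D.64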
Memrate
A facsimile of the STREAM/StarSTREAM benchmark, which times the TRIAD vector update a(i) = b(i) + q*c(i). The Single TRIAD test measures memory bandwidth from one processor; the Multi TRIAD test measures the bandwidth per processor when all 8 processors run simultaneously.
Benchmark | MBytes/sec/proc | Speedup vs. Seaborg | Notes
---|---|---|---
Single TRIAD | 7207 | 5.1 | xlf90_r -O3 -qhot -qstrict -qarch=auto -qtune=auto -qunroll=yes -qextname -qalias=nopteovrlp -qdebug=REFPARM; mpcc_r -O3 -qarch=auto -qtune=auto -qunroll=yes -qalias=allptrs -Q
Multi TRIAD | 7000 | 16 |
MPI Test
64 tasks on 8 nodes.
Benchmark | Result | Speedup vs. Seaborg | Notes |
---|---|---|---|
Max Point-to-Point Latency | 4.5 us | 6.6 | mpcc_r -o mpitest -O3 -lm |
Min Point-to-Point Bandwidth | 3073 MB/sec | 9.7 | |
Naturally Ordered Ring Bandwidth | 1845 MB/sec | 13.1 | |
Randomly Ordered Ring Bandwidth | 257 MB/sec | 6.1 |
PIORAW Disk Read/Write Bandwidth Test
The NERSC PIORAW benchmark tests read and write bandwidth, using one file per MPI task. The standard Bassi test uses 32 tasks, a buffer size of 2 MB and individual file sizes of 2.5 GB.
Filesystem | MPI Tasks | Nodes | Aggregate Read BW (MB/sec) | Read BW per task (MB/sec) | Aggregate Write BW (MB/sec) | Write BW per task (MB/sec) |
---|---|---|---|---|---|---|
/scratch | 32 | 32 | 4401 | 138 | 4114 | 129 |
/scratch | 32 | 16 | 4400 | 138 | 4110 | 129 |
/scratch | 32 | 8 | 3796 | 119 | 3727 | 116 |
/scratch | 32 | 4 | 1874 | 58.6 | 2760 | 86.2 |
Application Benchmarks
Benchmark | Tasks | Bassi Wall secs | Speedup vs. Seaborg | Speedup vs. Jacquard | Notes
---|---|---|---|---|---
NERSC SSP-NCSb Benchmarks | | | | |
CAM 3.0 | 16 | 1387 | 4.6 | N/A | cami_0000-09-01_128x256_L26_c040422
CAM 3.0 | 16 MPI x 2 OMP | 730 | 4.7 | N/A |
CAM 3.0 | 16 MPI x 4 OMP | 395 | 4.8 | N/A |
GTC | 64 | 162 | 5.2 | 1.1 |
PARATEC | 64 | 591 | 6.1 | 2.1 |
NERSC 5 Benchmarks | | | | |
MILC M | 64 | 138 | 7.5 | 3.3 |
MILC L | 256 | 1496 | 6.4 | 1.7 |
GTC M | 64 | 1553 | 5.3 | 1.2 |
GTC L | 256 | 1679 | 5.8 | 1.3 |
PARATEC M | 64 | 451 | 7.3 | 1.9 | 686 atom Si cluster
PARATEC L | 256 | 854 | 8.0 | 1.9 |
GAMESS M | 64 | 5837 | 3.2 | 0.9 | 41 atoms (943 basis functions) MP2 Gradient
GAMESS L | 384 | 4683 | 9.0 | N/A |
MADBENCH M | 64 | 1094 | 7.3 | 2.4 | 32K pixels, 16 bins
MADBENCH L | 256 | 1277 | 6.6 | 1.9 |
PMEMD M | 64 | 538 | 3.9 | 1.1 |
PMEMD L | 256 | 475 | 6.4 | 1.6 |
CAM 3 M | 64 | 1886 | 4.2 | N/A | finite-volume core, Grid D
CAM 3 L | 256 | 527 | 4.6 | N/A |
Performance dependence on configuration parameters
Being compiled.
Large Page Memory vs. Small Page Memory
Benchmark | Tasks | Nodes | Large Page Performance | Small Page Performance | Units | Large/Small |
---|---|---|---|---|---|---|
NPB MG 2.4 CLASS C | 16 | 2 | 1380 | 1000 | MFlops/sec/task | 1.38 |
NPB MG 2.3 CLASS B SERIAL | 1 | 1 | 1457 | 1057 | MFlops/sec/task | 1.38 |
NPB SP 2.4 CLASS C | 16 | 2 | 712 | 614 | MFlops/sec/task | 1.16 |
NPB FT 2.4 CLASS C | 16 | 2 | 842 | 745 | MFlops/sec/task | 1.13 |
NPB FT 2.3 CLASS B SERIAL | 1 | 1 | 1062 | 1097 | MFlops/sec/task | 0.97 |
GTC (NERSC NCSb SSP) | 64 | 8 | 752 | 706 | MFlops/sec/task | 1.07 |
PARATEC (NERSC NCSb SSP) | 64 | 8 | 4749 | 4394 | MFlops/sec/task | 1.08 |
CAM 3.0 (NERSC NCSb SSP) | 16 | 2 | 501 | 494 | MFlops/sec/task | 1.01 |
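A minimal sketch of building and running with large pages follows. Only the -qlargepage compiler flag appears in the tables on this page; the -blpdata link option and the LDR_CNTRL setting are assumptions about the usual AIX large-page procedure, and the executable name is a placeholder.

    # Compile and link with large-page support (compiler flags as in the NPB SP rows above;
    # -blpdata marks the executable's data segment as large-page eligible)
    mpxlf_r -O5 -qtune=auto -qarch=auto -qlargepage -blpdata -o sp.C.16 sp.f
    # Request large-page data at run time via the AIX loader control variable
    export LDR_CNTRL=LARGE_PAGE_DATA=Y
    poe ./sp.C.16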
HPS Bulk Transfer (RDMA)
The POWER5 HPS interconnect has a "bulk transfer" or RDMA (Remote Direct Memory Access) mode that improves message-passing bandwidth for large messages. It is enabled in a LoadLeveler script with the keyword:
#@bulkxfer=yes
Point-to-point bandwidth was measured as a function of message size with the default settings of MP_EAGER_LIMIT=32K and MP_BULK_MIN_MSG_SIZE=4096; RDMA will never be used for a message size of 4 KB or less.
Benchmark | Tasks | Nodes | With Bulk Xfer | Without Bulk Xfer | Units | With/Without
---|---|---|---|---|---|---
Point-to-Point Bandwidth (4 MB buffer) | - | - | 3100 | 1700 | MB/sec | 1.8
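A sketch of a job that enables bulk transfer is shown below. The node counts and the executable name are placeholders, and the MP_BULK_MIN_MSG_SIZE setting simply restates the default quoted above; only the bulkxfer keyword and the MP_ variable names come from this page.

    #@ job_type       = parallel
    #@ node           = 8
    #@ tasks_per_node = 8
    #@ bulkxfer       = yes
    #@ queue
    # Messages at or below MP_BULK_MIN_MSG_SIZE bytes are never sent via RDMA
    export MP_BULK_MIN_MSG_SIZE=4096
    poe ./a.out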
MPI Latency
When running parallel codes that do not explicitly create threads (multiple MPI tasks are fine), set the environment variable MP_SINGLE_THREAD to yes to improve HPS latency (see the example after the table).
Setting | MPI Point to Point Internode Latency |
---|---|
MP_SINGLE_THREAD=yes | 4.5 microseconds |
unset or MP_SINGLE_THREAD=no | 5.1 microseconds |
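For example, in the batch script before the poe invocation (shell syntax; mpitest is the executable built with the compile line shown in the MPI Test table above):

    # The code creates no extra threads, so tell IBM PE to skip thread locking;
    # this reduces point-to-point latency from 5.1 us to 4.5 us (table above).
    export MP_SINGLE_THREAD=yes
    poe ./mpitest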
Performance dependence on compiler options
Presentation from June 2006 NUG meeting (PDF Format).
IBM Publications
- IBM System p5, eServer p5, pSeries, OpenPower and IBM RS/6000 Performance Report, January 11, 2006