Biowulf at the NIH
NAMD Benchmarks

Benchmark 1: ApoA1 benchmark from the NAMD suite. 500 steps, 92K atoms, 12 Å cutoff + PME every 4 steps. (Jan 2009)

[apoa1 benchmark graph]

All parallel jobs on the Biowulf cluster should run at 70% efficiency or better, to ensure maximum utilization of cluster resources. Based on this set of benchmarks, the apoa1 and similar jobs should be submitted to about 8 p2800, o2200, or o2800 nodes (16 processors), or up to 16 Infiniband nodes (32 processors). Other types of jobs may scale differently; see the Biowulf NAMD page for examples.

To find the most appropriate number of nodes for a specific type of job, it is essential to run one's own benchmarks.

Node types:
  p2800 gige: 2.8 GHz Xeon, Gigabit Ethernet, Intel compiler
  o2200 gige: 2.2 GHz Opteron, Gigabit Ethernet, Intel compiler
  o2800 gige: 2.8 GHz Opteron, Gigabit Ethernet, Intel compiler
  o2800 ib:   2.8 GHz Opteron, Infiniband, Pathscale compiler

Wallclock time in seconds (% efficiency):

# processors   p2800 gige    o2200 gige    o2800 gige    o2800 ib
     1         1970 (100)    1631 (100)    1163 (100)    1125 (100)
     2         1047 (94)      844 (97)      612 (95)      575 (98)
     4          547 (90)      447 (91)      322 (90)      298 (94)
     6          378 (87)      313 (87)      234 (83)      199 (94)
     8          300 (82)      249 (82)      177 (82)      150 (94)
    10          253 (78)      211 (77)      189 (61)      130 (87)
    12          204 (80)      169 (80)      129 (75)      102 (92)
    14          193 (73)      158 (74)      129 (64)       95 (85)
    16          178 (69)      145 (70)      120 (61)       87 (81)
    18          140 (78)      116 (78)       87 (74)       71 (88)
    20          134 (74)      119 (69)       84 (69)       67 (83)
    24          118 (70)      106 (64)       79 (61)       54 (88)
    28          103 (69)       86 (68)       65 (64)       50 (81)
    32           98 (63)       87 (58)       62 (62)       47 (75)
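
The efficiency figures in these tables are the standard parallel-efficiency ratio: single-processor wallclock time divided by (number of processors × N-processor wallclock time). A minimal sketch of the calculation, using a few of the p2800 apoa1 timings above (the function name is only illustrative, not part of NAMD or any Biowulf tool):

def parallel_efficiency(t_serial, t_parallel, nprocs):
    """Percent efficiency: serial wallclock / (nprocs * parallel wallclock) * 100."""
    return 100.0 * t_serial / (nprocs * t_parallel)

# apoa1 on p2800 gige nodes; times in seconds, taken from the table above
t1 = 1970.0
for nprocs, tn in [(2, 1047.0), (8, 300.0), (16, 178.0), (32, 98.0)]:
    print(f"{nprocs:2d} processors: {parallel_efficiency(t1, tn, nprocs):.0f}% efficient")
# -> 94%, 82%, 69%, 63%, in line with the table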

Benchmark 2: Water Sphere simulation, courtesy Jeff Forbes, NIAMS. (August 2006)

[Water sphere benchmark graphs]

Based on these benchmarks, to obtain at least 70% efficiency, this job could be run on about 24 processors (12 nodes) of the p2800, o2200, or o2800 type. The efficiency drops much more slowly on the Infiniband nodes, where the job could use up to 40 or 50 processors (20-25 nodes).

Node types:
  p2800 gige: 2.8 GHz Xeon, Gigabit Ethernet, prebuilt 32-bit binaries
  o2200 gige: 2.2 GHz Opteron, Gigabit Ethernet, prebuilt 64-bit binaries
  o2800 gige: 2.8 GHz Opteron, Gigabit Ethernet, prebuilt 64-bit binaries
  o2800 ib:   2.8 GHz Opteron, Infiniband, Pathscale compilers

Wallclock time in seconds (% efficiency):

# processors   p2800 gige    o2200 gige    o2800 gige    o2800 ib
     1         7011 (100)    5207 (100)    3355 (100)    3117 (100)
     2         3590 (98)     2659 (98)     1754 (96)     1593 (98)
     4         1838 (95)     1377 (95)      924 (91)      816 (96)
     6         1342 (87)     1045 (83)      649 (86)      589 (88)
     8          991 (88)      775 (84)      490 (86)      424 (92)
    10          799 (88)      613 (85)      402 (83)      343 (91)
    12          713 (82)      531 (82)      336 (83)      282 (92)
    14          578 (87)      456 (82)      292 (82)      243 (92)
    16          525 (83)      399 (82)      264 (79)      214 (91)
    20          433 (81)      343 (76)      218 (77)      181 (86)
    24          375 (78)      290 (75)      186 (75)      151 (86)
    28          321 (78)      273 (68)      167 (72)      129 (86)
    32          295 (74)      265 (61)      147 (71)      116 (84)
    36          255 (76)      248 (58)      139 (67)      103 (84)
    40          243 (72)      238 (55)      129 (65)       93 (84)
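
Reading a table like this against the 70% guideline amounts to finding the largest processor count whose efficiency still meets the threshold. A small sketch, assuming the standard efficiency definition above and using the o2200 gige column of the water sphere table:

# Water sphere benchmark, o2200 gige column: (processors, wallclock seconds)
timings = [(1, 5207), (2, 2659), (4, 1377), (6, 1045), (8, 775),
           (10, 613), (12, 531), (14, 456), (16, 399), (20, 343),
           (24, 290), (28, 273), (32, 265), (36, 248), (40, 238)]

t1 = dict(timings)[1]          # single-processor reference time
threshold = 70.0               # recommended minimum efficiency, in percent

best = max(n for n, t in timings if 100.0 * t1 / (n * t) >= threshold)
print(f"Largest processor count at >= {threshold:.0f}% efficiency: {best}")
# -> 24 processors (12 dual-processor nodes), matching the recommendation above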


[Graph: water sphere and water box simulations, courtesy Jeff Forbes, NIAMS]

In the two jobs above, note that the water sphere simulation scales well to about 36 processors. Beyond that, the efficiency falls below 60%, which is generally considered poor. The water box simulation scales to about 16 processors before the efficiency drops below 60%; this is similar to the apoa1 example above. (We now recommend that jobs run at 70% efficiency or better.)


[Efficiency charts]

Equilibrated model of an integral membrane complex, courtesy Nara Dashdorj, LCP, NIDDK.

A total of 346,358 atoms, including water, lipids, and protein with several prosthetic groups. Cutoff 12.0, fullElectFrequency 4, nonbondedFreq 2, stepspercycle 20. In these benchmarks, the job scales to about 60 processors (30 nodes) on the p2800 Xeons, and to about 50 processors (25 nodes) on the o2200 Opterons. This is typical behaviour: with higher processor speeds, communication becomes more of a bottleneck, so the job does not scale as well on the faster processors (as the single-processor times above show, the 2.2 GHz Opterons run NAMD faster than the 2.8 GHz Xeons despite the lower clock speed). (We now recommend that jobs run at 70% efficiency or better.)