The Sustained System Performance (SSP) Benchmark

William T.C. Kramer, John M. Shalf, and Erich Strohmaier
Formal description of SSP

A formal description of the SSP, including detailed formulae, is now available. It is a portion of the soon-to-be-published Ph.D. dissertation: Kramer, W.T.C., 2008, "PERCU: A Holistic Method for Evaluating High End Computing Systems," Department of Electrical Engineering and Computer Science, University of California, Berkeley.

Background

Most recent plans and reports discuss only one of the four distinct purposes for which benchmarks are used. The most obvious purpose is selecting a system from among its competitors, which is the main focus of this paper and is well covered in many workshops and reports. The second use of benchmarks is validating that the selected system actually works as expected once it arrives. This purpose may be even more important than the first, particularly when systems are specified and selected based on performance projections rather than actual runs on the actual hardware. The third use of benchmarks, seldom mentioned, is to assure that the system performs as expected throughout its lifetime [1] (e.g., after upgrades, changes, and regular use). Finally, benchmarks are used to guide system designs, a topic covered in detail in a companion paper from Berkeley's Institute for Performance Studies (BIPS).

HPC procurements require more sophisticated methods of gauging the effectiveness of a system than the "speeds and feeds" typically supplied by hardware vendors or by simple, one-dimensional tests. HPC systems are usually presented in terms of raw hardware characteristics: the interconnect bandwidth, the peak floating-point performance of the processors, the number of processors, the memory bandwidth, and so on. The HEC community is thus left in the situation of sometimes having a good idea of how much work a system can produce if everything is perfect, but little idea of what it will do in the real world.
Without additional information, systems may end up being compared based on their peak performance, even though they are likely to deliver only a fraction of it. Selecting a system with benchmarks and tests is akin to a constrained optimization problem. There is serious concern, however, that a strict optimization approach goes too far, reverting to quantitative scoring rather than informed evaluation. Our experience in both realms is that the qualitative approach produces superior results and is much more likely to lead to revolutionary change rather than slow, evolutionary system improvement. Rigid weighting and scoring of values would be a step backwards in this regard. Qualitative judgment based on quantitative data works best; hence any methods investigated have to be on the "fuzzy" side of Operations Research.

HEC systems are too complex for simple assessment. These systems have components based at least in part on commodity technology. Since HEC systems are most often used to run highly parallel, tightly coupled application codes, the interaction of components plays a disproportionate role in how well the systems meet expected productivity levels. The overriding question for an HEC procurement team is: "What performance will this system deliver to our workload?" A procurement team needs a normalizing metric that tells them how efficiently the architecture can execute their typical workload given the proposed system's peak performance, so that different systems can be compared on an equal footing. The general methodology for approaching this problem was outlined in a 1982 technical report by Ingrid Bucher and Joanne Martin entitled "Methodology for Characterizing a Scientific Workload." These principles are embodied in the construction of the NAS kernel benchmarking program (Bailey & Barton, 1985) and are consistent with the approach taken for the Sustained System Performance (SSP) metric developed by NERSC for its procurements.
The methodology outlined in 1982 by Bucher and Martin is as follows:
As stated, the implementation of this methodology suffers from a number of pragmatic problems:
The NERSC Approach to Procurement Benchmarks

With these issues in mind, we introduce the methodology developed for NERSC procurements to produce a normalizing metric for inter-machine comparisons.
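As a toy illustration of what such a normalizing metric buys (all numbers here are hypothetical, not from any NERSC procurement), comparing bids by the fraction of peak they sustain on the workload can reverse a ranking based on peak alone:

```python
def sustained_fraction(sustained_tflops, peak_tflops):
    """Fraction of peak performance the system sustains on the workload."""
    return sustained_tflops / peak_tflops

# Two hypothetical bids: the system with the higher peak is not
# necessarily the one that delivers more to the real workload.
peak_a, sustained_a = 100.0, 12.0   # sustains 12% of peak
peak_b, sustained_b = 80.0, 16.0    # sustains 20% of peak

print(sustained_fraction(sustained_a, peak_a))  # 0.12
print(sustained_fraction(sustained_b, peak_b))  # 0.2
```

Here System B, with the lower peak, delivers more sustained performance; a peak-only comparison would pick System A.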
The NERSC-5 SSP

The effectiveness of a benchmarking metric for predicting delivered performance is founded on how accurately it models the target workload. A static benchmark suite will eventually fail to provide an accurate means of assessing systems. Several examples, including LINPACK, show that over time kernel benchmarks become less of a discriminating factor: once a simple benchmark gains traction in the community, system designers customize their designs to do well on that benchmark. The Livermore Loops, SPEC, LINPACK, the NAS Parallel Benchmarks, and others all have this issue. It is clear that LINPACK now tracks peak performance in the large majority of cases. Erich Strohmaier showed through statistical methods that within 5-7 years, only three of the eight NPBs were true distinguishers of system performance. Thus it should not be a goal that kernel benchmarks be long lived, except as regression tests to make sure the features that make them work well stay in the design scope. There must be a constant introduction and validation of the "primary" tests that will drive the design features of the future, and a constant retirement of benchmarks that are no longer strong discriminators. Consequently, the SSP metric continues to evolve to stay current with present workloads and future trends.

We now describe the fifth generation of the SSP metric, which is being used for the NERSC-5 procurement. The table below shows the set of applications used in the NERSC-5 SSP. The procurement team looks at all benchmark values and assesses their implications. It also uses kernel benchmarks such as the NAS Parallel Benchmarks, CPUbench, Membench, and IObench. But the SSP provides the best overall expectation of performance and price performance for a proposed solution. It is also attractive to vendors, who prefer composite tests over many discrete tests.
The SSP value is now calculated as the geometric mean of the flop rate of each program. In the past, a weighted harmonic mean and a straight arithmetic mean were used at NERSC, but recent experience indicates that the geometric mean, while giving a somewhat lower value, provides a better balance among the codes used and is a better predictor. It should also be noted that the SSP does not have to be made up only of full applications. SSP can be used with kernels; indeed, the first time SSP was used at NERSC, it was calculated as the average of the NAS Parallel Benchmark rates.

The capability of a system is then represented by the Sustained System Performance (SSP) of the system integrated over a given time period. The SSP number (in Tflop/s-years) indicates the effective average performance of the system on a center's scientific workload at any given point in time. To enable a comparison between systems bid as part of a procurement process, the "capability" of the system is the total area under the SSP curve over a given time period (NERSC uses 3 years). Since the capability of the system can be quantified for the entire workload at any point in time, and indeed throughout any period, it is then possible to assess the price performance, or "value," of the system by dividing the SSP metric by the cost of the system: essentially Tflop/s-years per dollar. This gives an important and straightforward way to determine the system with the best value out of all the systems proposed. Different vendors introduce technology at different times, and it may be to a site's advantage to have current technology installed and then have a predetermined upgrade to newer technology with higher performance; that is, phased improvements of the system in order to obtain the best price performance. To account for different delivery dates and phase schedules, the calculation of the area under the curve uses a common start and end date for all bid systems.
This normalizes for systems that are delivered "late" and also takes into account the staged delivery of systems to the site. A vendor can make up for a later delivery of a system by increasing the total size of the delivered system and/or providing faster technology. Either will compensate for the loss in area under the SSP curve caused by the later delivery. Because of Moore’s Law, this may be an advantage to both parties.
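The two calculations just described, the geometric-mean SSP value and the area under a piecewise-constant SSP curve over a common evaluation window, can be sketched as follows. All rates, dates, and costs here are hypothetical illustrations, not figures from an actual NERSC procurement:

```python
import math

def ssp(rates_tflops):
    """Geometric mean of per-application sustained rates (Tflop/s)."""
    return math.prod(rates_tflops) ** (1.0 / len(rates_tflops))

def capability(phases, window_start, window_end):
    """Area under a piecewise-constant SSP curve, in Tflop/s-years.

    phases: list of (start_year, ssp_tflops); each phase's level holds
    until the next phase begins. The evaluation window [window_start,
    window_end] is common to all bids, so a late delivery simply loses
    area at the front of the window.
    """
    area = 0.0
    for i, (start, level) in enumerate(phases):
        end = phases[i + 1][0] if i + 1 < len(phases) else window_end
        lo, hi = max(start, window_start), min(end, window_end)
        if hi > lo:
            area += level * (hi - lo)
    return area

# Hypothetical per-code rates: geometric mean of 2, 4, 8 is 4 Tflop/s.
print(ssp([2.0, 4.0, 8.0]))            # 4.0

# Bid A: delivered on time at 10 Tflop/s, no upgrade.
# Bid B: six months late at 8 Tflop/s, upgraded to 16 Tflop/s at year 1.5.
bid_a = [(0.0, 10.0)]
bid_b = [(0.5, 8.0), (1.5, 16.0)]
print(capability(bid_a, 0.0, 3.0))     # 30.0 Tflop/s-years
print(capability(bid_b, 0.0, 3.0))     # 8*1.0 + 16*1.5 = 32.0

# Price performance ("value"): Tflop/s-years per dollar,
# here against a hypothetical $40M system cost.
value_b = capability(bid_b, 0.0, 3.0) / 40e6
```

Bid B illustrates the point made above: the later delivery is more than compensated by the faster upgraded technology, so it yields more area under the curve than the on-time bid.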
The NERSC-6 SSP

The NERSC-6 SSP benchmarks were released on September 4, 2008. A brief description appears below. More information will be available soon on the SDSA web pages. To obtain the codes, go to the NERSC-6 RFP page.
The Effective System Performance (ESP) Metric

High performance scientific computer systems are traditionally compared using individual job performance metrics. However, such metrics tend to ignore high-level system issues, such as how effectively a system can schedule and manage a varied workload, how rapidly the system can launch jobs, and how quickly it can recover from a scheduled or unscheduled system outage. The productive work that can be extracted from a computational system depends not only on computational performance but also on the software infrastructure. In particular, resource management functionality (e.g., scheduling, job launch, and checkpoint/restart) has become an increasingly important issue given the difficulty of managing large-scale parallel computers. Therefore, the SSP metric must be adjusted in two ways.

The first adjustment is a metric that takes into account the effective throughput of the job scheduling system. For instance, a job scheduler that must schedule jobs on a torus interconnect (e.g., an XT3 or a BG/L system) may have additional overheads to make room for new jobs, or may not be able to densely pack the torus with work. Similarly, some job launchers have considerable startup overhead for launching parallel jobs that cuts into the effective throughput of the system. So the candidate system's scheduler must be subjected to a simulated workload in order to estimate the efficiency with which it can schedule the resources of the system. NERSC refers to this metric as the ESP (Effective System Performance) [4]. ESP [5] has several characteristics that set it apart from traditional throughput tests. First, it is specifically designed to simulate "a day in the life" of a supercomputer. It has scheduling shifts that mimic the typical supercomputer center, which often changes job priorities between daytime and nighttime.
ESP also requires shifting between multiple jobs running concurrently and a single "full configuration" job, and in fact requires that the full-configuration jobs run with different scheduling parameters. Finally, it attempts to estimate the system management effort involved in running large-scale computers.
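At its core, ESP reports an efficiency: the ratio of the simulated workload's ideal processor-time to the processor-time the system actually consumed running it. A simplified sketch of that calculation (the job mix and timings below are hypothetical, not the actual ESP suite):

```python
def esp_efficiency(jobs, total_procs, observed_elapsed_hours):
    """jobs: (processors, runtime_hours) pairs in the simulated workload.

    Compares the ideal, perfectly packed schedule to what the scheduler
    actually achieved, so scheduling gaps, launch overhead, and drain
    time for the full-configuration job all reduce the score.
    """
    ideal_cpu_hours = sum(p * t for p, t in jobs)
    return ideal_cpu_hours / (total_procs * observed_elapsed_hours)

# Hypothetical mix on a 1024-processor system, including one
# "full configuration" job as the ESP test requires.
jobs = [(256, 2.0), (512, 1.0), (1024, 0.5)]   # 1536 CPU-hours of work

# Perfect packing would finish in 1536 / 1024 = 1.5 hours; scheduling
# gaps and launch overhead stretched the observed run to 2.0 hours.
print(esp_efficiency(jobs, 1024, 2.0))  # 0.75
```

An efficiency near 1.0 means the scheduler keeps the machine densely packed; a torus system that cannot pack jobs tightly, or a launcher with high startup cost, shows up directly as a lower score.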
The second adjustment to SSP is variability, which increasingly causes lost capability in HPC systems [6]. NERSC uses the variability shown by the SSP metric as a prime indicator of how variable the system is. Both the ESP metric and variation are assessed with proposals and during the life of the system. The SSP metric is used to assure that the system still operates as expected after upgrades and throughout its life.

Conclusion

The SSP metric is the most important performance metric of the procurement and contract. Vendors are required to supply a promised aggregate lifetime integral of the SSP metric on the supplied system. On the other hand, the vendor has flexibility in how best to provide the required performance, although at NERSC any major change requires concurrence. This means that the SSP metric has to be measured at regular intervals during the lifetime of the system. Ongoing use of benchmarks ensures that all effects on performance from system upgrades or deterioration, system software and compiler upgrades, and communication library changes are reflected in the actual measured SSP value. Hence, the Sustained System Performance metric, along with quantifying the impact of Effective System Performance and variability, provides an excellent approximation of how well an HEC system will serve the scientific community.

Notes
LBNL-58868
Page last modified: Fri, 12 Sep 2008 17:13:59 GMT
Page URL: http://www.nersc.gov/projects/ssp.php