To help ensure that we are meeting our clients' needs, this year we established a set of ten performance goals pertaining to our systems and service. We developed these goals to set expectations for our own performance, then obtained our clients' endorsement of these goals as meaningful and useful.
We now proactively gauge how well we are doing in meeting these common expectations. We have tried to ensure that the goals reflect our efforts from a client's perspective, as opposed to an internal one. For example, a measurement of system availability needs to reflect the number of hours a machine is available to our clients, not how long it takes to identify a problem and initiate corrective action on our end. Our performance goals cover the following areas:
System Availability Details (measured values, with goals in parentheses)

| Systems | Overall Availability | Scheduled Availability | MTBI (Hours) | MTTR (Hours) |
|---|---|---|---|---|
| Vector | 97.7% (95%) | 99.3% (96%) | 242 (96) | 5.3 (4.0) |
| Parallel | 95.2% (85%) | 98.0% (90%) | 43 (96) | 2.7 (4.0) |
| Storage | 98.9% (95%) | 99.2% (96%) | 85 (96) | 0.7 (4.0) |
| File Servers | 99.9% (96%) | 99.9% (96%) | 1460 (316) | 1.0 (8.0) |
| SAS | 99.9% (97%) | 99.9% (99%) | 973 (490) | 1.0 (4.0) |
The table "System Availability Details" above shows various aspects of our systems' reliability since NERSC moved from Livermore to Berkeley Lab. Measured values are listed first, with goals in parentheses. Scheduled availability refers to the amount of time the systems are expected to be available (accounting for scheduled maintenance and upgrades), while overall availability is based on 24 hours a day, seven days a week. MTBI (mean time between interruptions) is the average interval between system failures, and MTTR (mean time to restoral) is the amount of time between a system failure and the point at which full service is restored to clients. Measured performance exceeded our aggressive goals in most cases; the exceptions were time between interruptions for the parallel and storage systems and time to restoral for the vector systems. We are working to improve our performance in those areas.
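The reliability figures above are related: over a long interval, availability is roughly MTBI / (MTBI + MTTR). A minimal sketch of that cross-check, using the measured values from the table (the results differ slightly from the reported availability percentages, which also account for scheduled downtime and partial outages):

```python
def availability(mtbi_hours: float, mttr_hours: float) -> float:
    """Approximate fraction of time a system is up:
    mean time between interruptions over total cycle time."""
    return mtbi_hours / (mtbi_hours + mttr_hours)

# Measured MTBI and MTTR from the table above (hours).
systems = {
    "Vector": (242.0, 5.3),
    "Parallel": (43.0, 2.7),
    "Storage": (85.0, 0.7),
}

for name, (mtbi, mttr) in systems.items():
    print(f"{name}: {availability(mtbi, mttr):.1%} estimated availability")
```

This is only an estimate from failure statistics, not the reporting method actually used; it illustrates why a short MTBI (as on the parallel systems) drags availability down even when restoral is fast.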
NERSC's service goals are to respond to clients' problems within four working hours and to resolve at least 90 percent of those problems within two working days. Spot checks confirm that NERSC meets the four-hour response goal. Between July 1, 1996, and May 15, 1997, 75.3 percent of all problems were resolved within two days; as the NERSC staff got up to speed, however, we made significant progress toward the 90 percent goal: between March 11 and May 15, 1997, 93.1 percent of all problems were resolved within two days.
Not all problems can be resolved within two days. Reasons for putting a problem on hold include software requests, ongoing coding projects, bugs waiting for a vendor-supplied fix, and a client not responding to a request for input within two days. Problems not resolved within 72 working hours are automatically escalated for more in-depth review to ensure that outstanding problems are addressed.
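The escalation rule above can be sketched as a simple check over open tickets. This is a hypothetical illustration, not NERSC's actual tracking system; the ticket fields and the on-hold exemption are assumptions based on the policy described:

```python
from dataclasses import dataclass

ESCALATION_LIMIT = 72  # working hours, per the policy above

@dataclass
class Ticket:
    ticket_id: int
    working_hours_open: float
    on_hold: bool = False  # e.g., awaiting a vendor fix or client input

def needs_escalation(t: Ticket) -> bool:
    """Flag tickets open past the limit; on-hold tickets are exempt,
    since their delay is outside the support staff's control."""
    return not t.on_hold and t.working_hours_open > ESCALATION_LIMIT

tickets = [
    Ticket(1, 10.0),                # well within the limit
    Ticket(2, 80.0),                # past the limit -> escalate
    Ticket(3, 90.0, on_hold=True),  # waiting on a vendor fix -> exempt
]
escalated = [t.ticket_id for t in tickets if needs_escalation(t)]
print(escalated)  # -> [2]
```

Automating the check this way ensures that no overdue problem silently lingers in the queue.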
NERSC staff periodically review problems and client requests to identify areas needing attention, with the aim of fixing recurring issues before they disrupt service.
Through the Red Carpet Program, led by members of NERSC's Scientific Computing Group, NERSC staff are building individual working relationships with clients at other national labs and universities, tackling such issues as cleaning up nuclear waste, supporting international research in magnetic fusion energy, designing particle accelerators, and understanding the structure of the smallest building blocks of matter.
NERSC staff are holding site meetings with each Grand Challenge team to determine what services and support are needed. Services so far include providing training for new clients, integrating separate physics software packages into a cohesive program, developing new algorithms, and offering programming tips.
NERSC has also created a specialized group to support a particular field of science. The High Energy and Nuclear Physics (HENP) Systems Group works with physicists around the globe to help develop solutions to the formidable computing challenges faced by the next generation of HENP experiments. The group provides access to and assistance with a combination of production systems such as the PDSF, advanced prototype storage systems such as HPSS and DPSS, and research and development projects such as contributing to the HENP Grand Challenge.
To address forefront scientific issues in high-energy and nuclear physics, complex experiments are being carried out by large collaborations to detect and analyze increasingly large numbers of final-state particles and/or events. The High-Energy and Nuclear Physics Support Group, led by Craig Tull, is helping scientists get the science out of massive amounts of data generated by experiments such as STAR (Solenoidal Tracker at RHIC). The challenge for the coming years is to provide cost-effective, high-performance computing capabilities and unprecedented data access which will allow widely distributed collaborations to process and analyze hundreds of terabytes of data per year.
Some of our client service experiments have worked well, while others sent us back to the drawing board. For example, to reduce travel time and costs, we tried using ISDN-based videoconferencing as a training tool. On our end, there were problems in learning how to present material via video, and clients had difficulty scheduling facilities, especially as we tried to scale up the sessions to reach more sites. As a result, we have decided to rely on Web-based technologies and are developing ways to provide reliable Web-based video. This will allow clients to tap our expertise at their desktop and on their own schedule.
NERSC staff members participate in various organizations that set the pace for new technology development. For example, the head of our Systems Group is a member of Silicon Graphics' customer advisory board; members of our Mass Storage Group serve on the HPSS executive and technical committees; and our Future Technologies Group leader serves on the MPI standards committee.
NERSC staff also share their expertise through software releases, articles in technical journals, tutorials and presentations at professional conferences and workshops, and invited talks at universities, laboratories, and high-tech industries.