Comprehensive Scientific Support

Many of the important breakthroughs in computational science are expected to come from large, multidisciplinary, multi-institutional collaborations working with advanced codes and large datasets, such as the SciDAC and INCITE collaborations. These teams are in the best position to take advantage of terascale computers and petascale storage, and NERSC provides its highest level of support to these researchers. This support includes specialized consulting support; special service coordination for queues, throughput, increased limits, etc.; specialized algorithmic support; special software support; visualization support; conference and workshop support; Web server and Web content support for some projects; and CVS servers and support for community code management.

Providing this level of support sometimes requires dogged determination. That was the case when the Center for Extended MHD Modeling, a SciDAC project, reported that one of its plasma simulation codes was hanging intermittently. NERSC consultant Mike Stewart (Figure 1) tested the code on Seaborg in both heavily and lightly loaded scenarios, at various compiler optimization levels, on different combinations of nodes, and on NERSC’s SP test system. He submitted the bug to IBM as a “severity 2” problem, but after a week passed with no response, he asked the local IBM staff to investigate. They put Mike in touch with an IBM AIX developer, at whose request he ran more test cases. Mike even wrote a FORTRAN code to mimic the C test case, and it displayed the same problem.

Using TotalView debugging software, Mike finally pinpointed the location of the bug in an MPI file read routine. With this new information, IBM assigned the problem to an MPI developer, who worked with Mike for several hours by phone to fix the bug. IBM then released a revised MPI runtime library that allowed the plasma code to run without interruption. Without Mike’s persistence, the researchers’ work might have been delayed for months.

Figure 1
NERSC consultant Mike Stewart’s persistence paid off in the timely fix of an obscure software bug.

NERSC staff are not content to wait for users to report problems. Consultant Richard Gerber (Figure 2) has taken extra initiative to strengthen support for SciDAC and other strategic projects by staying in touch with principal investigators and by attending project meetings, where he discusses the state of their research efforts and ways in which NERSC can contribute to their success. From these discussions, Richard has undertaken a number of successful projects to improve code performance and to resolve problems, such as finding a way to install a logistical networking node that meets NERSC security requirements.

One particularly difficult problem involved the SciDAC Accelerator Science and Technology project’s 3D beam modeling code, which was scaling poorly on the expanded Seaborg. Richard’s initial analysis showed that the code’s runtime performance depended strongly on the number of tasks per node, not on the number of nodes used. He then pinpointed the problem as the default threading strategy used by IBM to implement the FORTRAN random number intrinsic function, and he enlisted the help of other NERSC consultants to find a solution. Jonathan Carter and Mike Stewart found two obscure references to this function in IBM documentation that pointed the way to a solution. Richard used this information to conduct some tests, and found that a runtime setting that controls the number of threads used in the intrinsic function can increase code performance by a factor of 2 when running 16 tasks on one node. Using this setting, the beam modeling code is now performing better than ever.

Figure 2
Richard Gerber has made significant contributions to several research projects by improving code performance and helping scientists use NERSC resources effectively.

With the variety of codes that run on NERSC systems, NERSC tries to provide users with flexibility and a variety of options, avoiding a “one size fits all” mentality. For years NERSC and other high performance computing (HPC) sites have been urging IBM to support multiple versions of compilers, since not all codes get the best performance from the same compiler version. When IBM provided scripts to install versions newer than the default compiler, but not older versions, NERSC staff saw an opportunity (Figure 3). David Skinner analyzed these complicated scripts and worked out how to apply the same process to older compilers as well. Majdi Baddourah built extensive test cases and showed ingenuity in resolving the problems these test cases exposed. David Paul integrated the multiple compilers into the system administration structure. Together they made the use of multiple compilers fail-safe for users, and they gave helpful feedback to IBM on improvements in compiler support.

Figure 3
David Skinner, David Paul, and Majdi Baddourah (not shown) helped NERSC users get the best performance out of their codes by implementing support for multiple compilers.

To maintain quality services and a high level of user satisfaction, NERSC conducts an annual user survey. In 2003, 326 users responded, the highest response level in the survey’s six-year history. Overall satisfaction with NERSC was 6.37 on a 7-point satisfaction scale. Areas with the highest user satisfaction (6.5–6.6) were HPSS reliability, timely consulting response, technical advice, HPSS uptime, and local area network. Areas with the lowest user satisfaction (4.7–5.0) were Access Grid classes, visualization software and services, and training.

In answer to the question “What does NERSC do well?” respondents pointed to NERSC’s good hardware management practices, user support and responsive staff, documentation, and job scheduling and throughput. The question “What should NERSC do differently?” prompted concerns about favoring large jobs at the expense of smaller ones, as well as requests for more resources devoted to interactive computing and debugging, more compute power overall, different architectures, mid-range computing support, vector architectures, better documentation, and more training. Complete survey results are available.

During the past year, NERSC instituted several changes based on the results of the 2002 survey.
