
High-End Climate Science: Development of
Modeling and Related Computing Capabilities
5. Issues of Computational Systems
Report to the USGCRP from an ad hoc Working Group on Climate Modeling, December 2000

 


Adequate computational resources are at the core of developing a high-end modeling capability to meet the Nation's needs. Ten years ago, supercomputing was a purchasable commodity, and the particulars of climate-science computation required only incremental adaptation to the computing environment compared with other disciplines. U.S. policy on high-performance computing then moved to focus on distributed-memory architectures using processor elements with a market base broader than scientific applications.[9] In the U.S., development of shared-memory computers with specialized vector processors, which had been the standard for climate science for a number of years, ceased. However, sustained development of the shared-memory vector technology continued in Japan.[10] A major source of tension in the U.S. climate-science enterprise arises from the widespread access to Japanese supercomputers enjoyed by scientists in virtually every country but the U.S.

Many millions of dollars have been spent on the development of applications software to use the distributed memory computers available to U.S. scientists. The success of these developments has been mixed. Certain subsystems of an end-to-end application can be made to run very fast. However, complete end-to-end systems that consider the climate-science problems of data ingest, simulation, assimilation, quality assurance, diagnostics, and push of output products to customers are generally not successful.[11] The uncertain success of U.S. endeavors is sharpened when the competitive aspects of climate-science activities are considered; namely, climate-science centers in many other countries have more complete models, use more observations in their studies, and produce more simulations and assimilated data sets than U.S. counterparts. Further, there is the perception that the best ideas from the diverse U.S. research activities are implemented more readily at non-U.S. centers. Bottom line: the usability of Japanese supercomputers is much higher than that of U.S. computers, and they are therefore, pragmatically, "faster."

The U.S. climate-science community is faced with a difficult, perhaps intractable, problem. With the present national strategic focus on distributed-memory computers, a tremendous expenditure on software is required. This includes not only applications software, but also the systems software needed to make the computer systems run.[12] There is increasing evidence that the ability of climate-science applications to utilize distributed-memory computers is limited.[13] The human resources needed for the software effort are difficult to define, but they are on the order of the resources spent on the scientific effort. Given that the development of high-performance applications software is, at the least, extraordinarily difficult, the U.S. is in the position of needing to spend dollars comparable to those currently spent on scientific development in a high-risk activity. If successful, it will take 3-5 years to develop capabilities comparable to those presently available in other countries.[14] However, long-term competitiveness requires the availability of computational platforms with usability and performance comparable to those of the computers available to other scientists. Maintaining a large software activity that our competitors do not have to maintain is a major fiscal inefficiency.

5.1) Software

Improved software and improved management of software lie at the foundation of the development of a high-end climate-science capability. The central role of software was introduced in Section 4.3. For the sake of organization, the issues of software development are subdivided here into two major groups, with the recognition that the two groups must be managed as a single entity whose primary goal is to deliver a specific suite of products. This software infrastructure is represented in Figure 1, which shows the need both to isolate and layer activities so that they can be addressed in appropriate ways and to integrate the activities toward shared goals.

Figure 1: Schematic of software infrastructure. The purpose of the software infrastructure is to provide better integration and transition among the scientific and computational elements of the climate and weather modeling community. There are two major subsets of the software infrastructure. The first allows scientists at multiple institutions to work together concurrently in a controlled environment. The second allows computational and modeling issues to be brought together in an effective way.

 

5.1.1) Software Infrastructure to Allow more Effective Scientific Collaboration

Currently, climate scientists at major U.S. centers work in both large and small groups organized around specific programmatic goals. The management of software varies among centers as well as among groups within a center. In most cases, individual scientists and their co-workers develop their software with significant independence and little attention to formalized software engineering conventions. In some institutions a group of software engineers exists to unify the development of the software for specific applications; in others, the software is cobbled together on an as-needed basis. Thus, major centers in the U.S. find themselves encumbered with large suites of difficult-to-manage software.

Nationally, the process that has been used to develop the software for major climate-science applications appears ad hoc. The impact of this on the ability of organizations to work together is large and negative. For example, consider the interaction of an individual university collaborator with a core activity at a national modeling center. The collaborator usually takes a version of the model and performs changes and experiments in a local research environment.[15] After some time, the collaborator might have developed an algorithm suitable for incorporation into the core model. Then either the collaborator or center personnel must redo the work of installing and testing the candidate algorithm with the new "current" version of the core model at the center.[16] From the perspective of the university collaborator, there is a different interface with each modeling center. Therefore, to test the impact of, for instance, an algorithm for convective rainfall on weather forecasting, seasonal prediction, and climate data assimilation, the effort required to carry out the collaborations can far exceed the resources spent on the actual scientific research.

The shortage of human resources leaves each modeling center with serious deficiencies in the development of an end-to-end climate-science capability. Therefore, more effective collaborations are critical to focusing adequate intellectual resources on a specific product-oriented problem. While the interaction between an individual researcher and a center, described above, might be fraught with manageable inconvenience, the interaction between centers presents insurmountable obstacles. Subtle algorithmic differences arise that are linked, for instance, to an institution's history, and they increase the cost of the collaboration until it overwhelms the perceived benefit. The ground is littered with well-intentioned collaborative efforts between individuals and between centers that have fallen victim to this overhead cost. Virtually always, developing a given capability anew, internally, appears more cost effective than any benefit gained through collaboration.

A software infrastructure is needed that allows:

  • Concurrent development by multiple scientists at multiple institutions in a controlled [17] environment.
  • Identification of a clear path of migration from discovery-driven research activities to core product-driven activities.
  • Partnering of software, hardware, and climate-science activities to assure optimization among computational resources, scientific quality, and scientific completeness.

It would be naïve to expect that simply defining a set of standards and guidelines for the construction of models would have a large impact on the integration of diverse modeling capabilities. The development of a software infrastructure requires a commitment to develop software management processes. These processes will require adapting the principles of software engineering to scientific development, with the focus on the end-to-end software system that delivers specific products. The success of the software infrastructure will require commitment from both managers and practitioners. It is critical that software development processes be scaled to activities that include multiple principal investigators at multiple institutions. If this is not done, there is little hope of developing the needed high-end climate-science capability.

5.1.2) Software to Allow more Effective Use of Computational Platforms

Traditionally, climate scientists have worked with software specialists on code optimization. The reality of today's high-performance computing environment is that partnerships between software specialists and climate scientists are needed to assure that codes can be run, maintained, and ported. That is, fundamental issues of software design and implementation have to be built in from the beginning, and computational decisions have to be considered on a par with scientific algorithm decisions. If software issues and scientific issues are not treated concurrently, the viability of codes for high-end modeling is in question. There are two major categories of software that need to be developed in order to provide a high-end climate modeling capability: applications software and systems software.

5.1.2.1) Applications Software

Applications software is the code that represents the scientific and statistical algorithms that are run to produce model simulations and assure their accuracy and merit. When vector computers were the workhorse of scientific computing, scientists wrote code for a relatively stable computational environment that had well-defined rules for enhancing code performance. Software specialists from both scientific organizations and computer companies provided analysis of code performance and optimization. In many instances, possible improvements to code performance were isolated and tested by the software specialists and then provided to the scientists, who incorporated the suggestions if they deemed them worthwhile.

The move to commodity-based distributed computing (see Section 5.2) has changed the development of applications software dramatically. First, many of the tenets of traditional vector programming, which are deeply embedded in existing codes, are no longer valid. The change in hardware technology requires going from successful vector software to more complex software for distributed-memory computers, and high performance of this new software is not assured. Second, the architectures at which the applications are targeted are no longer stable; different types of processors are being connected together with a variety of communications strategies.

The challenges facing the development of successful applications software are enormous. There has been significant investment in applications software at many U.S. institutions, and some applications have been successfully implemented. However, as the details of data use, schedule, and validation have been encountered in product-driven applications, a consistently high level of performance has not been realized. There is now substantial evidence that the interprocessor communications requirements of many climate-science applications limit their actual performance on commodity-based machines to well below the theoretical performance specifications.[18] In addition, when considering end-to-end suites of applications software, there are sequential processes and load imbalances that are in conflict with the parallelism of the problem.
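To make the scaling limitation concrete, the short sketch below is a toy model, with entirely hypothetical parameters not taken from any benchmark in this report, of how a fixed serial fraction, physics load imbalance, and growing interprocessor communication overhead cap the speedup an end-to-end application can achieve no matter how many processors are added.

    import math

    # Illustrative toy model (all parameters are hypothetical, not measured).
    def achievable_speedup(p, serial_frac=0.02, imbalance=0.10, comm_cost=0.002):
        serial_time = serial_frac                            # data ingest, I/O, diagnostics
        parallel_time = (1.0 - serial_frac) / p * (1.0 + imbalance)
        comm_time = comm_cost * math.log2(p)                 # halo/boundary exchanges
        return 1.0 / (serial_time + parallel_time + comm_time)

    for p in (8, 64, 256, 1024):
        print(f"{p:5d} processors -> speedup ~{achievable_speedup(p):5.1f}")
    # The curve flattens near 25x: adding processors beyond a few hundred buys
    # little once communication and the serial portions of the work dominate.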

We stand at the point of needing to develop techniques for software specialists and scientists to work together. Therefore, the infrastructure discussed in Section 5.1.1 must allow concurrent development not only by scientists, but also by software specialists. Codes need to be designed so that their computational and scientific aspects are partitioned by well-defined, controlled interfaces. This should provide a more robust interface to the technology and allow adaptation to new technologies while buffering the impact on the scientific algorithms.
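The following sketch illustrates the kind of layering described above. It is a minimal, hypothetical example, not code from any existing modeling framework: the scientific kernel calls only a small, controlled interface, while the machine-specific communication code lives behind that interface and can be replaced without touching the science.

    from abc import ABC, abstractmethod

    class HaloExchange(ABC):
        """Computational layer: how neighboring subdomains share boundary data."""
        @abstractmethod
        def exchange(self, field):
            """Return `field` with its halo (ghost) points brought up to date."""

    class SerialExchange(HaloExchange):
        # Single-processor stand-in; a distributed-memory implementation could
        # replace it without any change to the science code below.
        def exchange(self, field):
            return field

    def advect_one_step(field, wind, dt, comm):
        """Scientific layer: a toy upwind advection update that knows nothing
        about the hardware; it only requires that halos be current."""
        field = comm.exchange(field)
        left = [field[-1]] + field[:-1]                      # periodic left neighbors
        return [f - wind * dt * (f - fl) for f, fl in zip(field, left)]

    # Swapping SerialExchange for a parallel implementation changes only the
    # computational layer.
    state = [0.0, 1.0, 2.0, 1.0, 0.0]
    state = advect_one_step(state, wind=1.0, dt=0.1, comm=SerialExchange())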

5.1.2.2) Systems Software

Systems software refers to that software needed to allow the hardware platform to be used. For this document, systems software will refer to a range of functions such as compilers, operating systems, debuggers, schedulers, and math and statistical libraries. These functions were supplied by vendors or third parties and were a purchasable commodity when Cray vector computers dominated the market. Today, the high-performance computing market is not large enough to provide vendor incentives to develop a robust suite of systems software on the premise that it will attract a specific customer base. Therefore, it is increasingly incumbent upon the applications communities to develop the systems software necessary to make their applications run.

Further complications come from the instability of technology. Increasingly there are interactions of application software with both the hardware and the systems software. Development of strategies to reduce these interactions and to accommodate technological changes is another challenge that must be faced.

5.2) Hardware/Impact of Technology Decisions

As stated above, the high-end computing industry in the U.S. has been transformed by the revolution in information technology over the last decade. Unlike other applications of information technology, the commercial potential of high-end computer, or supercomputer, systems is relatively limited. The vast majority of high-end computing systems are installed in government laboratories or academic institutions, with a smaller number purchased by private industry for research and development activities. With government support under the High Performance Computing and Communications (HPCC) Program, computer manufacturers in the U.S. were encouraged to design new architectures for high-end computers from mass-produced microprocessors. Microprocessor speeds were doubling every 18 months (Moore's Law), and it seemed reasonable that within a few years this aggregate power could be applied to the most difficult computational problems. The new systems were envisioned to ultimately contain many hundreds to thousands of nodes, each containing a single processor or a cluster of a few processors, connected by a high-speed, integrated communications network. This new paradigm for high-end computing was labeled "massively parallel." The push toward massively parallel, distributed-memory computing is still central to the policies of the Federal Information Technology for the Twenty-First Century (IT2) and Accelerated Strategic Computing Initiative (ASCI) programs.
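The arithmetic behind that optimism is straightforward; the short calculation below is illustrative only and is not taken from the report.

    # Back-of-envelope arithmetic: with processor speed doubling every 18 months,
    # a decade of progress multiplies single-processor speed by roughly two
    # orders of magnitude, and aggregating ~1000 such processors multiplies the
    # theoretical peak by another factor of ~1000.
    doublings = 120 / 18                        # months in a decade / doubling period
    print(f"per-processor gain over a decade: ~{2 ** doublings:.0f}x")
    print(f"theoretical peak of a 1000-processor system: ~{1000 * 2 ** doublings:.0f}x")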

While the U.S. aggressively pursued this strategy for high-end computing systems, the Japanese manufacturers NEC and Fujitsu continued the incremental development of parallel vector architectures composed of tens to hundreds of more powerful special-purpose processors that share large regions of high-speed memory. The components of these high-end computing systems are designed specifically for scientific and engineering applications. Their development has been more evolutionary than revolutionary, as the basic concept for these machines was pioneered by Cray in the early 1980s with the introduction of its X-MP series of supercomputers.

A Department of Commerce trade tariff ruling and political considerations essentially preclude the importation of Japanese high-end computing systems into the U.S. The legal test case for this policy was the acquisition by NCAR of a computer system specifically for climate simulations. While it is counterproductive to debate the costs and benefits of a policy that is unlikely to change, it is instructive to examine the differing perspectives held by the climate-science and information-technology communities. Not surprisingly, such an examination reveals a large divergence in objectives that has resulted in a culture gap between the fields. More importantly, unless the communities (and their sponsors) can agree on common goals, the prospect that investments in Information Technology research will benefit climate science is not good, despite the best intentions of well-meaning individuals in both camps.

5.2.1) Different Objectives Result in Different Metrics for Success

The pace of Information Technology (IT) innovation and application has been rapid over the last decade. Because the technology timescales are so short, the emphasis within the IT research community has been to take an idea quickly from concept through proof-of-principle and possibly the prototype phase. Coupled with this approach has been the tendency to extrapolate near-term technological trends into the future to drive the research agenda. Consequently, the metrics used to measure success within the IT community are based on the objective of demonstrating the potential of new technology. For high-end computing, the relevant performance metrics are based on three factors: theoretical peak speed, efficiency, and scalability. Theoretical peak speed is a hardware metric determined completely by the rated speed of the individual processors and the number of them that can be made to work together in a high-end system. Efficiency measures both hardware and software performance and is the fraction of that peak capability that can be tapped by an application on a given number of processors. Scalability is mostly a software metric that measures the increase in throughput achieved by an application as more processors are added to work on a problem. When combined, these three measures do provide an estimate of throughput performance. The tendency, however, has been to consider each of these factors as an independent measure of progress. This philosophy is apparent in the well-known Top 500[19] rankings that are published twice each year. These technology demonstrations have often been limited to a specific suite of applications software that only partially spans the range of applications requiring high-end capability.[20] There is an implicit pre-selection of applications that are likely to perform well on these metrics, and a presumption that technology proven on these applications will be generally useful. Actual performance numbers presented in Figure 2 show the wide discrepancy found in real applications.
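As a hypothetical illustration of how the three metrics combine into the sustained rate that actually determines throughput: the processor counts and per-processor peak rates below are invented, and only the 10% and 33% efficiencies echo values quoted with Figure 2.

    def sustained_gflops(peak_per_proc, n_procs, efficiency, parallel_scaling):
        # Sustained rate = per-processor peak x processors x efficiency x scaling.
        return peak_per_proc * n_procs * efficiency * parallel_scaling

    # A machine built from many commodity processors (hypothetical numbers) ...
    commodity = sustained_gflops(peak_per_proc=1.0, n_procs=512,
                                 efficiency=0.10, parallel_scaling=0.60)
    # ... versus one built from fewer, faster vector processors (hypothetical).
    vector = sustained_gflops(peak_per_proc=8.0, n_procs=32,
                              efficiency=0.33, parallel_scaling=0.90)
    print(f"commodity system: ~{commodity:.0f} sustained GFLOPS")   # ~31
    print(f"vector system:    ~{vector:.0f} sustained GFLOPS")      # ~76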

Figure 2: Measured performance of applications on the 644-processor Cray T3E at the National Energy Research Scientific Computing Center (NERSC). This figure shows the sensitivity of performance to the specific application. The Linpack application is used to determine the performance reported in the TOP500 list. The TOP500 list is frequently cited to justify claims of computer speed, with the implication that this speed is relevant to a complete range of applications. These performance curves show that the Linpack benchmark does not accurately represent a general application environment. The average application at NERSC performs at 11.6% of theoretical peak. Within climate-modeling centers in the U.S., a 10% performance goal is often set. Numbers in excess of 33% are common on the Japanese vector computers, which, coupled with their much higher single-processor speed, leads to a performance-usability gap.

 

Over the same decade, climate researchers have seen their computational needs increase exponentially. While much of this demand has been met by the proliferation of powerful workstation technology, the most complete and sophisticated modeling experiments still require computing resources at the very high end. Accordingly, the relevant computer technology metrics for climate scientists are measures of increases in capability, i.e., the ability to run previously impossible or impractical simulations, and increases in throughput, which is a direct measure of model productivity. Many climate scientists assert that theoretical peak performance, efficiency, and scalability are often used out of context, without explicit recognition that throughput is the critical measure for most applications. Based on throughput metrics, the climate science community has identified both a performance gap and a usability gap between the high-end computing technology developed in the U.S. and that developed in Japan. This performance-usability gap is represented explicitly in Figure 3, which is provided by ECMWF.
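A small worked example of the throughput metric plotted in Figure 3, using an assumed, purely hypothetical timing: a system that completes a 10-day forecast in one hour of wall-clock time delivers 240 forecast days per computational day.

    # Worked example (hypothetical timing) of forecast days per computational day.
    forecast_length_days = 10.0
    wallclock_hours_per_forecast = 1.0          # assumed time to run one forecast
    throughput = forecast_length_days * 24.0 / wallclock_hours_per_forecast
    print(throughput)                           # 240 forecast days per day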

Figure 3: Throughput of the forecast-model component of the Integrated Forecast System (IFS). RAPS stands for Real Applications on Parallel Systems. This figure highlights the performance-usability advantages of the Japanese vector computers relative to the distributed-memory T3E, which uses non-vector Alpha processors. The VPP is from Fujitsu, the SX from NEC, and the T3E from Cray. The number of forecast days per computational day is plotted against the number of processor elements. The resolution is triangular truncation 213 (resolved waves) with 31 vertical levels. Performance comparable to the NEC and Fujitsu can be achieved for the modeling component of the system on the T3E; however, the T3E requires many more processors. Significant additional capability can be obtained on the NEC and Fujitsu by adding, for example, 10 more processors. From the graph, similar capability on the T3E would require more than 600 additional processors. Even if it is possible to obtain this performance, the software investment is significantly higher in the case of the T3E. Similar benchmarks have been run for IBM SP systems and have yet to reach performance levels comparable to any of these systems, which indicates the difficulty of porting codes from one distributed-memory machine to another. Figure provided by the European Centre for Medium-Range Weather Forecasts.

 

5.2.2) Conflict between Climate Science and Information Technology

The NCAR computer system procurement was the seminal event that exposed the rift between two communities that had previously been cooperative and symbiotic. The NCAR benchmarks and acceptance criteria were based on the capability and throughput metrics described above (see Capacity of U.S. Climate Modeling to Support Climate Change Assessment Activities, National Academy Press, 1998). The decision to buy a NEC computer system was met with shock and disbelief by many within the U.S. IT research community.

Although there have been several workshops and meetings, the groups appear to have been talking past each other since that time. On one hand, the IT community charges that the climate modelers have not embraced the potential of the new technology. Their argument continues that the climate modeling benchmarks are based on antiquated vector-friendly algorithms and codes. They further maintain that progress depends on the climate community investing in new algorithms and codes that are scalable and efficient on the new architectures, whose performance the IT specialists foresee increasing by more than three orders of magnitude over the next ten years.

On the other hand, climate scientists maintain that despite ten years of development, the massively parallel designs have yet to become competitive with vector machines in many applications that require the movement of large amounts of information across processors. For example, the well-known NAS Parallel Benchmark suite achieves only 5% of theoretical peak performance on the Cray T3E at NERSC, even though that machine reaches nearly 77% of peak on the LINPACK Top 500 benchmark (see Figure 2). The climate community does, indeed, utilize modern codes, and there has been significant monetary and intellectual investment in the development of codes for parallel computers. While some of the developments have been successful, overall performance on U.S. high-end computing machines has been poor. This is discouraging and does not motivate continued investment in seemingly futile undertakings. This view is reinforced when climate scientists see their international colleagues making greater progress while spending less time and money on software development.

While the Japanese high-end computing manufacturers have been stable and have followed a straightforward, predictable design path, there has been little stability in the U.S. high-end computing market. In the early 1990s, Intel Supercomputer and Thinking Machines dominated the market; neither is still in business. In the middle of the decade, SGI, Cray (subsequently acquired by SGI), and IBM made product offerings that serve as the core of the U.S. installed high-end base today. Nevertheless, the workhorse Cray T3E line, which occupies half of the top twenty places in the current Top 500 list, has been discontinued, and SGI recently sold its Cray division. Further, both Compaq and Sun unveiled plans in the last year to enter the high-end market. This turnover has led to usability problems, as immature operating systems, compilers, libraries, and other software infrastructure components have undergone too few product cycles to be robust in a production-computing environment. IT researchers note that this competition and perceived turmoil are indicative of a healthy IT market and drive progress and innovation in the long term, which is true. It is also true, however, that the uncertainty and instability of the marketplace make it difficult to build the robust production environment needed for climate research over the next several years, when the science and policy communities will demand both more and higher quality information from the modelers. The bottom line is that the development of high-end computing platforms suitable for supporting product-oriented applications has become a research activity, with all the associated uncertainties and risk.

At the basis of the strategic decision to move toward massively parallel computing was the assumption that the incremental development of vector computing platforms had reached the end of the line. This decision was as much economic as technological. The evolutionary development of vector machines in Japan followed a strategy that U.S. vendors (i.e., Cray) had rejected. There are indeed economic questions about the viability of the Japanese vector computers: Fujitsu will no longer market vector machines, leaving NEC as the lone vector-supercomputer vendor. No matter what the ultimate fate of the vector computing business, however, the Japanese vendors have already delivered ample computing platforms to assure that the gap between U.S. and non-U.S. climate-science centers will increase for at least the next five years. The emergence of a new U.S. effort centered around Tera's purchase of Cray from SGI and the formation of Cray Inc. does little to improve the situation in the short term. The long-term effects are beyond our ability to evaluate. There is some optimism that large-cache, fast processors are finding a broader market, which will allow substantial throughput to be achieved without requiring scaling to many hundreds of processors.

5.2.3) Solution is to Re-establish Cooperation Around Common Objectives

If the U.S. is to remain among the intellectual leaders in climate modeling, there needs to be recognition that the concerns of both the climate community and the IT community have merit. The IT vision is clearly long-term, and it is not clear that the high-end climate community can make it through the near-term crisis and maintain an intellectual critical mass. There has to be directed investment to take the proof-of-concept activities of the IT community and develop stable hardware and software environments suitable for supporting product-oriented activities. The technology experts and the science experts need to agree on a common set of both near-term and long-term objectives, and then develop a workable strategy to achieve them. One possible solution is to look back to the 1960s and 1970s, when scientific supercomputing first established itself as a necessary tool for science and engineering. Discipline-oriented centers such as NCAR and GFDL became magnets for mathematicians and computer engineers who were attracted by the challenges of improving the primitive technology and solving real-world problems. As a result, these centers became hotbeds for supercomputing research that had utility beyond the immediate applications. Such centers would complement the multi-discipline, many-user supercomputing centers, as they could tailor their configuration and management to the needs of a smaller and less diverse research community. Proper sponsorship of these centers requires that the current friction between the communities be replaced by a healthy and cooperative alignment of both scientific and technological goals. Strong, innovative, and forward-thinking leadership will be required for this approach to succeed.

5.3) Characteristics of Climate-Science Computing

Given the inconsistent success of executing climate-science algorithms on distributed-memory parallel computers, it is worth noting the parameters that define the computational problem. As with many scientific applications, climate-science problems require fast computers as well as high-capacity, fast-access mass storage systems. Compared with other scientific applications, climate-science data impact the computational problem in both direct and indirect ways. Indirectly, the long history of weather observations has led to the development of complex physical parameterizations that represent subgrid-scale processes. These parameterizations are not executed uniformly across the discrete domain of the models; their execution depends on local conditions. The results of their execution might then need to be communicated to other parts of the geophysical (and computational) domain. In total, the demand to represent localized physical processes in global models introduces difficult load-balance and communications problems that reduce the models' potential to scale to many processors.
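A toy illustration of the load-imbalance problem, with entirely hypothetical costs in the spirit of a conditionally triggered convection scheme: if columns that trigger the parameterization cost ten times as much as columns that do not, the processors that happen to hold many triggered columns set the pace for every time step.

    import random

    def physics_cost(column_humidity, trigger=0.7):
        # Triggered ("convecting") columns cost ten times as much as quiet ones.
        return 10.0 if column_humidity > trigger else 1.0

    random.seed(1)
    n_procs, cols_per_proc = 16, 64
    work = []
    for _ in range(n_procs):
        columns = [random.random() for _ in range(cols_per_proc)]
        work.append(sum(physics_cost(h) for h in columns))

    # All processors wait for the slowest one at each time step.
    mean_work, max_work = sum(work) / n_procs, max(work)
    print(f"mean work per processor:  {mean_work:.0f}")
    print(f"slowest processor's work: {max_work:.0f}")
    print(f"implied idle fraction:    {1.0 - mean_work / max_work:.0%}")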

Data usage also has a number of direct impacts. Climate-science data require high-capacity storage and are often heterogeneous in format. Assimilation of these data brings a whole new level of difficult challenges. Assimilation algorithms are themselves often more computationally intensive than climate models. The assimilation process interrupts simulations for data insertion, which incurs overhead as the process stops and restarts after every few hours of simulation. In its routine use of data assimilation, weather and climate science is unique. Currently, all relevant organizations are having difficulty achieving massive scalability of assimilation systems.
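The sketch below is a deliberately simplified, hypothetical picture of that cycling, not an operational system: the forecast model runs for a few hours, stops so observations can be inserted, and restarts, overhead that is repeated every cycle and works against uninterrupted, massively parallel execution.

    def run_model(state, hours):
        # Stand-in for an expensive forecast-model integration.
        return [x + 0.01 * hours for x in state]

    def assimilate(state, observations):
        # Stand-in for the analysis step, itself often costlier than the model:
        # nudge the model state toward the observations.
        return [0.8 * s + 0.2 * o for s, o in zip(state, observations)]

    state = [280.0, 285.0, 290.0]               # e.g., gridpoint temperatures (K)
    for cycle in range(4):                      # four 6-hour cycles = one day
        state = run_model(state, hours=6)       # simulation segment
        observations = [281.0, 284.5, 290.5]    # observations arriving this cycle
        state = assimilate(state, observations) # interrupt, insert data, restart
    print(state)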

There are other external factors that have a profound impact on the computational environment. First, climate-science products often have to be delivered on schedule in order for their utility to be realized. While this is obvious for weather forecasts, time-critical requirements also arise in climate and chemical assessment activities. Time criticality directly impacts the capability requirements of the computational platform and focuses performance metrics more definitively on throughput rather than processor speed. Second, the impact of weather and climate is felt on regional and human scales. Therefore, the results of global modeling activities are required to provide precise information on scales much smaller than can be directly simulated.

Computational activities to address climate-science computing need to consider all of these aspects of the computational problem. This requires significant attention to software engineering, systems engineering, and systems design, which is outside the scope of the research programs that normally support Earth and computer science.

 


