
SOFTWARE TOOLS:
Scalable Systems Software for
Terascale Computer Centers


The Scalable Systems Software project, coordinated by ORNL, is fundamentally changing the way future high-end systems software is developed, making it more cost-effective, robust, and scalable to multi-teraflop supercomputers.

Stephen Shevlin (foreground) and Tom Dunigan (standing) discuss with Pratul Agarwal an image, shown on the computer monitor, that resulted from a simulation of a protein. The IBM Power4 (Cheetah) supercomputer at CCS was used to perform multi-scale modeling of vibrations in the protein cyclophilin A, which is related to HIV infections.
 

 

System administrators and managers of terascale computer centers are facing a crisis. The nation's premier scientific computing centers all use incompatible, ad hoc sets of systems tools that were not designed to scale to the multi-teraflop systems being installed in supercomputer centers today. One solution would be for each computer center to rewrite its homegrown software to be scalable. But this approach would incur a tremendous duplication of effort and delay the availability of terascale computers for scientific discovery.

The purpose of the Scalable Systems Software project is to provide a much more timely and cost-effective solution by bringing together representatives from the major computer centers and industry and having them collectively define standardized interfaces between system components. At the same time, this group can produce a fully integrated suite of systems software components that can be used by the nation's largest scientific computing centers.

The scalable systems software suite is being designed to support computers that scale to very large physical sizes without requiring that the number of support staff scale along with the machine. This strategy goes beyond just creating a collection of separate scalable components. By defining a software architecture and interfaces between system components, the Scalable Systems Software research is creating an interoperable framework for the components.
 



One of our top award winners associated with software tool development at CCS is Jack Dongarra, who directs UT's Innovative Computing Laboratory. He recently won two R&D 100 Awards, was elected a member of the National Academy of Engineering, and earned a Fernbach Award. He helps compile the TOP500 list of supercomputers, which ranks systems by their measured performance on the Linpack benchmark.
 

An interoperable framework makes it much easier and more cost-effective for supercomputer centers to adapt, update, and maintain the components in order to keep up with new hardware and software. A well-defined interface allows a site to replace or customize individual components as needed. Defining the interfaces between components across the entire system software architecture provides an integrating force between the system components as a whole and improves the long-term usability and manageability of terascale systems at supercomputer centers across the country.
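To illustrate how a well-defined interface lets a site swap one component for another, here is a minimal Python sketch. It is not the project's actual interface definition: the `Scheduler` interface, the two scheduler classes, and the `QueueManager` are hypothetical names invented for this example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class Scheduler(ABC):
    """Hypothetical standardized interface that every scheduler must honor."""

    @abstractmethod
    def next_job(self, queue: List[Dict]) -> Optional[Dict]:
        """Return the job to run next, or None if the queue is empty."""


class FifoScheduler(Scheduler):
    """Runs jobs in submission order."""

    def next_job(self, queue: List[Dict]) -> Optional[Dict]:
        return queue[0] if queue else None


class SmallestFirstScheduler(Scheduler):
    """A site-specific replacement that favors jobs needing the fewest nodes."""

    def next_job(self, queue: List[Dict]) -> Optional[Dict]:
        return min(queue, key=lambda job: job["nodes"]) if queue else None


class QueueManager:
    """Depends only on the Scheduler interface, not on any one implementation."""

    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler
        self.queue: List[Dict] = []

    def submit(self, name: str, nodes: int) -> None:
        self.queue.append({"name": name, "nodes": nodes})

    def dispatch(self) -> Optional[Dict]:
        job = self.scheduler.next_job(self.queue)
        if job is not None:
            self.queue.remove(job)
        return job
```

Because `QueueManager` calls only `next_job`, a center could exchange `FifoScheduler` for `SmallestFirstScheduler`, or for its own policy, without touching the rest of the suite.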

Systems interfaces are being standardized using a process similar to the one that successfully defined the Message Passing Interface (MPI) standard. This process is an open forum of university, lab, and industry representatives who meet regularly to propose and vote on pieces of the standard. The figure at the bottom of this page represents the significant progress to date on producing scalable components and defining standardized interfaces between them. The bold lines represent working interfaces. The light lines represent interfaces in progress. The color of each component indicates which of the project's four multi-lab working groups is responsible for it.

In November 2003 the project made the first release of a complete, integrated set of scalable systems components. The distribution used the popular OSCAR packaging and installation technology. A second release is scheduled for March 2004. This past year the system administrators at Argonne National Laboratory decided to switch their "Chiba City" cluster to use our scalable systems suite exclusively. In January 2004 the suite underwent scale tests on the 5160-processor Titanium cluster at the National Center for Supercomputing Applications. Our research has developed software that provides a communication service between components over multiple protocols, as well as a flexible authentication scheme that secures the overall system. Research continues to harden the working prototypes, improve integration, and increase scalability toward the target of 10,000-processor systems.
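The idea of one communication service that supports several wire protocols and a selectable authentication scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: the framing names (`newline`, `length`), the authentication names (`hmac-sha256`, `none`), and the `pack`/`unpack` functions are hypothetical and do not describe the project's actual communication library.

```python
import hashlib
import hmac
import json

# Hypothetical wire framings: each maps a name to (frame, unframe) functions.
FRAMINGS = {
    # JSON text never contains a raw newline, so one can mark the message end.
    "newline": (
        lambda payload: payload + b"\n",
        lambda data: data[:-1],
    ),
    # A 4-byte big-endian length followed by the payload.
    "length": (
        lambda payload: len(payload).to_bytes(4, "big") + payload,
        lambda data: data[4:4 + int.from_bytes(data[:4], "big")],
    ),
}

# Hypothetical authentication schemes: each computes a tag over the body.
AUTH = {
    "hmac-sha256": lambda key, body: hmac.new(key, body, hashlib.sha256).hexdigest(),
    "none": lambda key, body: "",
}


def pack(message: dict, framing: str, auth: str, key: bytes) -> bytes:
    """Serialize a message, attach an authentication tag, and frame it."""
    body = json.dumps(message, sort_keys=True).encode()
    envelope = {"tag": AUTH[auth](key, body), "body": body.decode()}
    frame, _ = FRAMINGS[framing]
    return frame(json.dumps(envelope).encode())


def unpack(data: bytes, framing: str, auth: str, key: bytes) -> dict:
    """Unframe a message and verify its tag with the receiver's chosen scheme."""
    _, unframe = FRAMINGS[framing]
    envelope = json.loads(unframe(data))
    body = envelope["body"].encode()
    if not hmac.compare_digest(AUTH[auth](key, body), envelope["tag"]):
        raise ValueError("authentication failed")
    return json.loads(body)
```

The receiver, not the sender, names the authentication scheme in `unpack`, so a message cannot downgrade itself to the `none` scheme. Components agree on a framing and scheme by name, which keeps both choices replaceable without changing the message format.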

The coordinator for this project is Al Geist, an ORNL corporate fellow. The participating organizations include seven Department of Energy laboratories, three National Science Foundation supercomputer centers, and five supercomputer vendors. The DOE labs are ORNL, ANL, Ames, Lawrence Berkeley, Los Alamos, Pacific Northwest, and Sandia national laboratories. The NSF sites are NCSA, the Pittsburgh Supercomputing Center, and the San Diego Supercomputer Center. The vendors are IBM, Silicon Graphics, Cray, Hewlett-Packard, and Intel.

What is the impact of this project? The Scalable Systems Software project is a catalyst for fundamentally changing the way future high-end systems software is developed and distributed. It will reduce facility management costs by reducing the need to support homegrown software, making higher-quality systems tools available, and making it possible to get new machines up and running faster and to keep them running. The project will also help scientific applications use machines more effectively by providing scalable job launch, standardized job monitoring and management software, and allocation tools for the cost-effective management and utilization of terascale computer resources.

System components presently under development and their interfaces. Dark lines represent working interfaces.
 


Web site provided by Oak Ridge National Laboratory's Communications and External Relations.