Blue Planet: Extending IBM Power Technology and Virtual Vector Processing
(Excerpt from "Creating Science-Driven Computer Architecture: A New Path to Scientific Leadership")

The dominant architecture for high-end computing today is the IBM Power series, which by itself accounts for 55% of the Top 20 computers in the world. IBM has a large and diverse installed base for the Power line, giving it a tremendous advantage in funding research and development of new functionality. However, that same diversity contributes to the problem that the Power architecture is not specifically tuned to the needs of the scientific market. A key component of this proposal is to enhance the Power architecture for scientific computing while, in parallel, implementing a system by the end of 2005 that has at least twice the sustained performance of the Earth Simulator at roughly half the hardware cost.
The details of this three-pronged parallel approach are described in the following three subsections.

Blue Planet System Architecture

Blue Planet will take the IBM technology due for implementation in the second half of CY 2005 and expand its capability in multiple ways, providing an immediate and highly reliable enhancement to the scientific computational power of the DOE science community. Specifically, the system will be built on Power 5+ CPUs, running at approximately 2 GHz or more, and the Federation switch. At that rate, each CPU is theoretically capable of 8–10 Gflop/s. However, new, previously unplanned functionality, together with special packaging, will allow the system to achieve a much higher sustained percentage of peak on real scientific codes from multiple disciplines. The new features are:
Figure 2. Blue Planet architecture.

The resulting system will be able to sustain 40 to 50 Tflop/s on applications from several disciplines. The Blue Planet system will have 16,384 CPUs, each capable of 8–10 Gflop/s; 2,048 eight-processor SMP nodes; 8,192 switch links; and 2.5 PB of shared parallel disk storage. The amount of memory on the system will be configured to ensure maximum main memory bandwidth. Current memory technology trends indicate that this maximum-bandwidth configuration requires 256 TB of memory, a capacity that is very expensive and not required by the applications (many of the applications scale in memory use by N^2 while they scale in computation by N^3). If maximum memory bandwidth can be maintained with 64 TB or 128 TB of memory, the system will be delivered with the most cost-effective amount.

Blue Planet will have more memory bandwidth, more interconnections, and lower interconnect latency than IBM had previously planned. It will be delivered in FY 2006, possibly in two phases. The need for phasing may come from component availability, since all aspects of the system will be new and production lines will be ramping up for the expected demand. The first phase will be delivered in the second half of CY 2005, with the second phase following within nine months. Blue Planet requires approximately 6 MW of power for the computer and peripherals, and 1,700 to 2,000 tons of cooling. The entire system will fit within 12,000 square feet of computer space and can be housed in the existing LBNL Oakland Scientific Facility without constructing additional space.

The Virtual Vector Architecture

The basic intent of the Virtual Vector Architecture (ViVA) facility is to allow customers and applications to run high-performance parallel/vector-style code on a traditional high-data-bandwidth SMP.
What is described below is currently an unsupported function within IBM, involving compilers, operating systems, Hypervisors, firmware, processor/system design, and productization. On the positive side, the Power 5 processor/system design does have the basic functionality needed to support ViVA. ViVA would further enhance the Blue Planet system and take IBM in a new direction.

Power 5 chips have the ability to synchronize CPUs using a hardware communication link for barrier synchronization. This hardware feature is currently not planned for exploitation because there is no identified requirement within the existing markets. However, the synchronization feature can be used to harness individual CPUs into a "virtual vector" unit. This is the same concept implemented in the Cray X1 MSP CPU, which has four separate 3.2 Gflop/s SSPs that synchronize for vector processing.

Initially, ViVA's goal is not to improve memory bandwidth per se. Rather, it greatly enhances the ability of compilers (and programmers) to exploit fine-grained parallelism automatically, as is done by the compilers on existing vector systems. The result should be to increase the proportion of applications that achieve higher sustained performance, thereby making the system much more cost effective for a wider range of scientific applications. ViVA will be implemented on the Blue Planet system through software that uses the Power 5 architectural features. It will be evaluated and made available to the applications that benefit from it. If the evaluation and use of ViVA show benefit, not only will it enhance application performance of Blue Planet beyond what is described above, but it is conceivable that further vector-like support will be possible in future generations of the IBM Power architecture.
If the ViVA experiment is less successful, Blue Planet will still deliver no less performance than described above.

Designs with Future Enhancements for Scientific Computing

The third part of the effort is the longest term: cooperative IBM-DOE design of future Power CPUs will be initiated. The design cycle of a complex chip like the Power 5 takes five to six years. Year one is typically high-level design and feature selection, followed by two or more years of implementation and then two or more years of testing, prototyping, and assembly design. Building on the experience with the Blue Planet and ViVA experiments, DOE laboratory staff will work cooperatively with IBM to further enhance the Power 6/7 and other future system component designs. Teams of DOE computational and computer specialists will work with IBM hardware and software designers to further improve memory and interconnect bandwidth in the Power series. The goal is to field a petaflop/s (peak) computer capable of 20–25% sustained rates on diverse applications by 2009.

The initial work will be done in a series of "lockdown" meetings with designers, where DOE's scientific application requirements will be analyzed and understood. Design alternatives will be developed and evaluated for their potential to better meet those requirements. The result will be a series of more detailed meetings to review and resolve the design details. DOE applications and representative code kernels will be evaluated by instrumenting the codes and studying their behavior on existing hardware. Special performance profiling tools, some existing only in IBM labs, will be used to gain an improved understanding of the codes. Based on the results of these studies, models of the codes will be developed and run on software simulators for the proposed hardware designs.
The outcome should be to identify, and ideally resolve, performance bottlenecks at the design stage of future-generation Power processors rather than after delivery, as happens in a traditional evaluation. Examples of further candidate improvements that will be evaluated, and in some cases selected for implementation, are:
In addition to the issues involved in the design and development of future Power CPUs and improvements, significant software challenges exist in making Blue Planet operate well at the scale proposed. These challenges will have to be addressed in the same cooperative manner as outlined above for the long term, in order to assure that the implemented solutions are the most effective possible for the applications.

One example: communication software is very sensitive to interference from interrupts and to the asynchronous dispatching of application threads by the individual operating systems on each node, especially in a large-scale system. This interference is especially disruptive for global operations across large numbers of processors, such as MPI collective communications. New solutions and programming models will need to be developed to help synchronize dispatch cycles and eliminate the overhead of collective operations; this may occur through hardware accelerators that attach directly to the switch, in addition to software. Synchronization of individual OS activity across the nodes of the system will need to be made more robust and usable. New programming models (e.g., UPC) will need to take direct advantage of such adapter hardware functions, which provide improved memory access across nodes. The zero-copy transport protocols will have to be made more efficient and robust to eliminate the memory bandwidth bottleneck during transport. Very low latency communication will need processor assists, in the form of fast synchronization lock instructions and support for barrier synchronization, which are being developed or considered.

Likewise, scaling a shared file system to 2.5 PB and 2,048 nodes will require significant redesign and enhancement. The target of 1 GB/s of I/O performance per I/O server will require significant enhancements to zero-copy transport, the distributed locking design (for parallel access), and the disk allocation manager.
Efficient metadata serving will be a significant challenge at such scale. It will be critical to ensure that robust fault tolerance against disk and node failures is built into the system, which requires major enhancements to the high-availability features. The ability to efficiently back up a system with 2.5 PB of disk also needs to be addressed. External access to the cluster file system data at high speed is another critical problem, particularly for Grid applications.

Overall, there is a wide range of areas that must be addressed initially on the Blue Planet system and then improved for the petaflop/s system. These issues cannot be addressed at small scale, so a system the size of Blue Planet is the only way to gain confidence and experience that the improvements will work on the petaflop/s system, regardless of which processor type is used.
Page last modified: Tue, 18 May 2004 21:26:24 GMT
Page URL: http://www.nersc.gov/news/reports/blueplanetmore.php