
Blue Planet: Extending IBM Power Technology and Virtual Vector Processing

(Excerpt from "Creating Science-Driven Computer Architecture: A New Path to Scientific Leadership")

Currently the dominant architecture for high-end computing is the IBM Power series, which by itself accounts for 55% of the Top 20 computers in the world. IBM has a large and diverse installation base for the Power line, giving it a tremendous advantage in funding research and development of new functionality. However, that same diversity means the Power architecture is not specifically tuned to the needs of the scientific market. A key component of this proposal is to enhance the Power architecture for scientific computing while, in parallel, implementing a system by the end of 2005 that has at least twice the sustained performance of the Earth Simulator at roughly half the hardware cost. The details of this approach are:

  1. Blue Planet System: Work with IBM to enhance IBM's current Power 5 plan in order to deploy a system with approximately 150 Tflop/s peak performance that will be able to sustain 40 to 50 Tflop/s on at least several real scientific codes. Each of the 2,048 nodes will consist of eight "single core" CPUs that provide double the memory bandwidth of the standard Power 5 CPUs and have their own dedicated L1, L2, and L3 caches. Each CPU will have a peak performance of roughly 8-10 Gflop/s.

    The Blue Planet system will have 16,384 CPUs, the maximum main memory bandwidth possible, 8,192 switch links, and 2.5 petabytes (PB) of shared, parallel storage in FY 2006. This system will have more memory bandwidth, more interconnections, and lower interconnect latency than IBM had previously planned.

  2. Work with IBM to develop a new capability, the Virtual Vector Architecture (ViVA), which harnesses the eight individual Power 5 CPUs in a node into a single 60-80 Gflop/s vector unit. This is equivalent to what the Cray X1 does using four individual single-streaming processors (SSPs) within one multi-streaming processor (MSP). ViVA will be implemented on the Blue Planet system and has the potential to further improve the performance of codes.

  3. Building on experience with Blue Planet and ViVA, work cooperatively with IBM to further enhance the Power 6/7 and other future processor designs. Teams of DOE computational and computer specialists will work with IBM processor designers toward the goal of further improving memory and interconnect bandwidth in the Power series. The goal will be to field a petaflop/s (peak) computer capable of 20-25% sustained rates on diverse applications by 2009.

The details of the three-pronged parallel approach are described in the following three subsections.

Blue Planet System Architecture

Blue Planet will take the IBM technology due for implementation in the second half of CY 2005 and expand its capability in multiple ways, as an immediate and highly reliable enhancement to the scientific computational power of the DOE science community. Specifically, the system will be built on Power 5+ CPUs, which will run at approximately 2 GHz or more, and the Federation switch. At that rate, each CPU is theoretically capable of 8-10 Gflop/s. However, new, previously unplanned functionality as well as special packaging will allow the system to achieve a much higher sustained percentage of peak on true scientific codes from multiple disciplines. The new features are:

  • A new packaging of eight "single core" modules per node, so that each CPU has its own dedicated L1, L2, and L3 caches (Figure 2). This configuration provides 60-80 Gflop/s nodes using the single-core chips. Each node will have twice as many memory buses (one GX bus per CPU) as IBM's standard offering for 8-way nodes, which will enhance both memory and interconnect performance. Unlike the currently planned Power 4-based eight-way dual-core CPU nodes, these CPUs will run at the maximum clock speed achievable. The nodes will have twice the memory bandwidth to main memory and a three-tier cache system.

  • Each Federation network will be expanded by a factor of 4, from 1,024 links (512 nodes) to 4,096 links (2,048 nodes). This requires adding a third stage to the switch and improving the entire software stack to scale to at least 2,048 nodes. The system will have two IBM Federation networks, so it will have a total of 8,192 switch links (see the link-count arithmetic after this list), yielding a considerable improvement in network bandwidth.

  • At the same time the Federation switch scales by a factor of 4, IBM will combine improvements in the hardware and software to decrease MPI latency. The scaling and latency improvements of the switch define Federation+, a midlife improvement to IBM's switch technology. IBM has made midlife improvements to processors but never before to switch technology.

  • Operating system, compiler, and library technology will take full advantage of the increased scale and performance of the Blue Planet system.
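
As a quick check of the link counts in the second bullet above (a back-of-the-envelope calculation based only on the numbers quoted there, not on additional IBM data), the expansion keeps the same two links per node per network while quadrupling the node count:

    \[
    \frac{1{,}024\ \text{links}}{512\ \text{nodes}} = 2\ \frac{\text{links}}{\text{node}},\qquad
    2{,}048\ \text{nodes}\times 2\ \frac{\text{links}}{\text{node}} = 4{,}096\ \text{links per network},\qquad
    2\ \text{networks}\times 4{,}096 = 8{,}192\ \text{links}.
    \]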

Figure 2. Blue Planet architecture.

The resulting system will be able to sustain 40 to 50 Tflop/s on applications from several disciplines. The Blue Planet system will have 16,384 CPUs, each with 8-10 Gflop/s peak; 2,048 eight-processor SMP nodes; 8,192 switch links; and 2.5 PB of shared parallel disk storage. The amount of memory on the system will be configured to ensure maximum main memory bandwidth. Current memory technology trends indicate that 256 TB of memory would be needed to attain this maximum-bandwidth configuration, a capacity that is very expensive and not required by the applications (many of the applications scale in memory use as N² while they scale in computation as N³). If full memory bandwidth can be maintained with 64 TB or 128 TB of memory, the system will be delivered with the most cost-effective amount.
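
Two back-of-the-envelope checks may help put these figures in context (rough arithmetic based on the numbers quoted above, not on additional IBM data). First, the aggregate peak and the implied sustained fraction:

    \[
    16{,}384\ \text{CPUs}\times 8\text{--}10\ \text{Gflop/s} \approx 131\text{--}164\ \text{Tflop/s peak},\qquad
    \frac{40\text{--}50\ \text{Tflop/s sustained}}{\sim 150\ \text{Tflop/s peak}} \approx 27\text{--}33\%.
    \]

Second, the memory-scaling argument: if a code's memory footprint grows as N² while its operation count grows as N³, then doubling the problem size multiplies the memory required by 4 but the computation by 8, so a machine sized for the computation can carry proportionally less memory:

    \[
    \text{memory}\propto N^{2},\quad \text{flops}\propto N^{3}
    \;\Longrightarrow\;
    N\to 2N:\ \text{memory}\times 4,\ \text{flops}\times 8.
    \]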

Blue Planet will have more memory bandwidth, more interconnections, and lower interconnect latency than was previously planned by IBM. It will be delivered in FY 2006, possibly in two phases. The need for phasing may come from component availability, since all aspects of the system will be new and production lines will be ramping up for the expected demand. The first phase will be delivered in the second half of CY 05, with the second phase following within nine months.

Blue Planet requires approximately 6 MW of power for the computer and peripherals and requires 1,700 to 2,000 tons of cooling. The entire system will fit within 12,000 square feet of computer space and can be housed in the existing LBNL Oakland Scientific Facility without constructing additional space.

The Virtual Vector Architecture

The basic intent of the Virtual Vector Architecture (ViVA) facility is to allow customers and applications to run high-performance parallel/vector-style code on a traditional high-data-bandwidth SMP. What is described below is currently an unsupported function within IBM, involving compilers, operating systems, hypervisors, firmware, processors/systems, and productization. On the positive side, the Power 5 processor/system design does have the basic functionality to support ViVA.

ViVA would further enhance the Blue Planet system and take IBM in a new direction. Power 5 chips have the ability to synchronize the CPUs using a hardware communication link for barrier synchronization. This hardware feature is currently not planned for exploitation because there is not an identified requirement within the existing markets. However, the synchronization feature can be used to harness individual CPUs into a "virtual vector" unit. This is the same concept implemented in the Cray X1 MSP CPU, which has four separate 3.2 Gflop/s SSPs that synchronize for vector processing.

Initially, ViVA's goal is not to improve memory bandwidth per se. Rather, it greatly enhances the ability of compilers (and programmers) to exploit fine-grained parallelism automatically, as compilers on existing vector systems do. The result should be to increase the proportion of applications that achieve higher sustained performance, thereby making the system much more cost effective for a wider range of scientific applications.
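
As a purely illustrative software analogy (not the actual ViVA mechanism, which would rely on the Power 5 hardware barrier and compiler support rather than any threading library), the effect the compiler aims for is strip-mining a vectorizable loop across the eight CPUs of a node and synchronizing them when the loop completes, much as OpenMP does today:

    /* Illustrative analogy only: strip-mine a vectorizable DAXPY-style loop
     * across the eight CPUs of a node.  ViVA itself would do this through
     * compiler-generated code and the Power 5 hardware barrier, not OpenMP,
     * but the decomposition and end-of-loop synchronization are the same idea. */
    #include <stdio.h>
    #include <stdlib.h>

    void daxpy(long n, double a, const double *x, double *y)
    {
        /* Each of the eight "virtual vector pipes" (here, OpenMP threads)
         * handles a contiguous strip of the loop; the implicit barrier at the
         * end of the parallel region plays the role of the hardware sync. */
        #pragma omp parallel for num_threads(8) schedule(static)
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        long n = 1L << 24;
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
        for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        daxpy(n, 3.0, x, y);
        printf("y[0] = %f\n", y[0]);

        free(x); free(y);
        return 0;
    }

The hoped-for gain is that loops a compiler already recognizes as vectorizable could be spread across a node's CPUs automatically, without the programmer restructuring the code.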

ViVA will be implemented on the Blue Planet system through software that uses the Power 5 architectural features. It will be evaluated and made available to the applications that benefit from it. If the evaluation and use of ViVA shows benefit, not only will it enhance application performance of Blue Planet beyond what is described above, but it is conceivable that further vector-like support will be possible in future generations of the IBM Power architecture. If the ViVA experiment is less successful, Blue Planet will still deliver at least the performance described above.

Designs with Future Enhancements for Scientific Computing

The third part of the effort is the longest term. Cooperative IBM-DOE design of future Power CPUs will be initiated. The design cycle of a complex chip like the Power 5 takes five to six years. Year one is typically the high-level design and feature selection, followed by two or more years of implementation and then two or more years of testing, prototyping, and assembly design.

Building on the experience gained with Blue Planet and the ViVA experiment, DOE laboratory staff will work cooperatively with IBM to further enhance the Power 6/7 and other future system component designs. Teams of DOE computational and computer specialists will work with IBM hardware and software designers to further improve memory and interconnect bandwidth in the Power series. The goal will be to field a petaflop/s (peak) computer capable of 20-25% sustained rates on diverse applications by 2009.

The initial work will be done in a series of "lockdown" meetings with designers, where DOE's scientific application requirements will be analyzed and understood. Design alternatives will be developed and evaluated for their potential to better meet the requirements. The result will be a series of more detailed meetings to review and resolve the design details.

DOE applications and representative code kernels will be evaluated by instrumenting and understanding the behavior of the codes on existing hardware. Special performance profiling tools, some existing only in IBM labs, will be used to gain an improved understanding of the codes. Based on the results of these studies, models of the codes will be developed and run on software simulators for the proposed hardware design. The outcome should be to identify and hopefully resolve performance bottlenecks at the design stage of future-generation Power processors rather than after delivery, as a traditional evaluation does.
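
A minimal example of the kind of instrumentation involved, with a simple stand-in loop in place of a real application kernel (the 8 Gflop/s reference value is the per-CPU peak assumed earlier in this document):

    /* Sketch of kernel instrumentation: time a stand-in kernel, count its
     * floating-point operations, and report achieved Gflop/s against an
     * assumed 8 Gflop/s per-CPU peak.  Real studies would add hardware
     * counters and IBM's profiling tools rather than rely on wall-clock
     * estimates alone. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    #define N 10000000L

    static double x[N], y[N];

    int main(void)
    {
        for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (long i = 0; i < N; i++)      /* 2 flops per iteration */
            y[i] = 3.0 * x[i] + y[i];

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gflops = 2.0 * N / secs / 1e9;

        printf("y[0] = %.1f, achieved %.2f Gflop/s (%.0f%% of an assumed 8 Gflop/s peak)\n",
               y[0], gflops, 100.0 * gflops / 8.0);
        return 0;
    }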

Examples of further candidate improvements that will be evaluated and in some cases selected for implementation are:

  • VMX2 external connections
  • I/O decentralization
  • FFT breakthroughs: libraries and possibly hardware acceleration engines
  • Ability for an application to use 100% of bisection bandwidth
  • A stripped-down version of MPI ("MPI lite") that accepts some restrictions in exchange for major improvements in performance
  • Microkernel OS running on the compute nodes, as exists on the T3E
  • Improved daemon control that synchronizes daemon execution so that it has far less impact on applications
  • Hardware collectives (e.g., reduce-all) supported with improved hardware and software (a usage sketch follows this list)
  • Synchronized time-outs of tasks to have far less impact on applications
  • Unified Parallel C
  • Advanced cooling
  • Non-segmented addressing
  • Improved OS for scientific programming (AIX or Linux)
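
For the hardware-collectives item above, the operation in question is the familiar reduce-all; the sketch below (assuming nothing more than a standard MPI installation, nothing IBM-specific) shows the call that such switch-resident hardware would accelerate:

    /* Sketch of the collective that hardware "reduce all" support would
     * accelerate: every rank contributes a local partial result and all
     * ranks receive the global sum.  With hardware collectives, the
     * reduction and broadcast happen in the switch fabric rather than in
     * software trees built from point-to-point messages. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank;   /* stand-in for a per-rank partial result */
        double global = 0.0;

        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }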

In addition to the issues involved in the design and development of future Power CPUs and improvements, significant software challenges exist in making Blue Planet operate well at the scale proposed. These challenges will have to be addressed in the same cooperative manner outlined above for the long term, in order to assure that the implemented solutions are the most effective possible for the applications.

One example of these issues is that the communication software is very sensitive to interference from interrupts and from the asynchronous dispatching of application threads by the individual operating systems on each node, especially in a large-scale system. This interference is especially disruptive for global operations across large numbers of processors, such as MPI collective communications. New solutions and programming models will need to be developed to help synchronize dispatch cycles and eliminate the overhead of collective operations; this may occur through hardware accelerators that attach directly to the switch, in addition to software. Synchronization of individual OS activity across the nodes of the system will need to be made more robust and usable. New programming models (e.g., UPC) will need to take direct advantage of such adapter hardware functions, which provide improved memory access across nodes. The zero-copy transport protocols will have to be made more efficient and robust to eliminate the memory bandwidth bottleneck during transport. Very low latency communication will need processor assist in the form of fast synchronization lock instructions and barrier synchronization, which are being developed or considered.
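
One way to expose this interference is a micro-benchmark that times many repetitions of a collective and reports the spread between the fastest and slowest iterations; unsynchronized daemons and interrupts show up as a long tail. The sketch below assumes only a standard MPI installation:

    /* Micro-benchmark sketch: time repeated MPI_Allreduce calls and report
     * min/avg/max per-iteration time on rank 0.  A large max-to-min ratio
     * is a symptom of the OS interference (daemons, interrupts,
     * unsynchronized dispatch) discussed above. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 1000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double in = 1.0, out, tmin = 1e30, tmax = 0.0, tsum = 0.0;

        for (int i = 0; i < ITERS; i++) {
            MPI_Barrier(MPI_COMM_WORLD);      /* line the ranks up before timing */
            double t0 = MPI_Wtime();
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            double dt = MPI_Wtime() - t0;
            if (dt < tmin) tmin = dt;
            if (dt > tmax) tmax = dt;
            tsum += dt;
        }

        if (rank == 0)
            printf("allreduce usec: min %.1f  avg %.1f  max %.1f\n",
                   tmin * 1e6, tsum / ITERS * 1e6, tmax * 1e6);

        MPI_Finalize();
        return 0;
    }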

Likewise, scaling a shared file system to 2.5 PB and 2,048 nodes will require significant redesign and enhancement. The target of 1 GB/s of I/O performance per I/O server will require significant enhancements to zero-copy transport, the distributed locking design (for parallel access), and the disk allocation manager. Efficient metadata serving will be a significant challenge at such scale. It will be critical to ensure that robust fault tolerance against disk and node failures is built into the system, which requires major enhancements to the high-availability features. The ability to efficiently back up a system with 2.5 PB of disk needs to be addressed. External access to the cluster file system data at high speed is another critical problem that needs to be addressed, particularly for Grid applications.
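
The access pattern those file system enhancements must serve looks, at its simplest, like the sketch below: every process writes its own contiguous slice of one shared file, and the distributed locking, allocation, and zero-copy transport layers determine whether the aggregate rate approaches the 1 GB/s-per-server target. The MPI-IO calls are standard; the path "/scratch/stripe_test" is hypothetical.

    /* Minimal MPI-IO sketch: each rank writes a contiguous 64 MB slice of a
     * shared file at a rank-dependent offset.  Sustaining high aggregate
     * bandwidth for exactly this pattern is what the locking, allocation,
     * and zero-copy transport work described above must deliver. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define SLICE (64L * 1024 * 1024)   /* 64 MB per rank */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(SLICE);
        memset(buf, rank & 0xff, SLICE);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/scratch/stripe_test",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Collective write: rank r owns bytes [r*SLICE, (r+1)*SLICE). */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * SLICE,
                              buf, SLICE, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }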

Overall, there is a wide range of areas that must be addressed initially with the Blue Planet system and then improved for the petaflop/s system. These issues cannot be addressed at small scale, so a system the size of Blue Planet is the only way to gain confidence and experience that the improvements will work on the petaflop/s system, regardless of which processor type is used.

