
Blue Gene/L

(Excerpt from "Creating Science-Driven Computer Architecture: A New Path to Scientific Leadership")

(See also the IBM Blue Gene home page.)

IBM announced a significant new effort in December 1999 to develop a computer system capable of sustaining nearly a petaflop/s on molecular modeling problems critical to life science research. This project, called Blue Gene, has evolved over the last three years to include a family of more general-purpose computer systems being developed by IBM Research in collaboration with DOE laboratories and university groups.

The Blue Gene family (BG/X) of computers is a departure from existing IBM large-scale systems in several ways. First, it is aimed at extreme scalability: the first system in the series, called BG/L, targets 65,536 nodes at full scale, and future systems envision even larger scales. Second, the BG/X series is based on extreme levels of system integration, leveraging system-on-a-chip (SoC) technology, which allows CPUs, memory, and network interfaces to be integrated on a single die. SoC technology enables high performance and low power. It also reduces the number of discrete component types in the system, lowering manufacturing cost and improving overall reliability. Finally, the system uses a new software strategy that places minimal operating system functionality on each compute node, with most operating system functions moved to a separate set of processors specially configured to provide OS and I/O services. Each of these new directions contributes to putting the BG/X family on the path to sustained petaflop/s capabilities at much lower cost than present products. Current indications are that the general approach represented by BG/X has the potential to achieve sustained petaflop/s performance on some applications by the end of 2007 for about $200M. Sustained price/performance for applications on BG/L is likely to be one-half to one-fourth that of other approaches.

The first system in this family, BG/L, has a planned peak performance of 180 Tflop/s and is scheduled for first availability in late 2004. The five-year roadmap for the Blue Gene project has two follow-on machines to BG/L under consideration: BG/P, with a target peak performance of 1,000 Tflop/s (300 Tflop/s sustained) and provisional availability in late 2006/early 2007, and BG/Q, with a target peak performance of 3,000 Tflop/s (1,000 Tflop/s sustained) and availability targeted for late 2007/early 2008. There is also a near-term variant of the BG/L system planned, called BG/D, which incorporates denser memory packaging to permit configurations with more external memory per node. ANL has established a preliminary plan with IBM to co-develop and deploy a full-scale BG/L system at ANL for the DOE Office of Science community in late 2004/early 2005, potentially followed by a BG/P system deployed at LBNL for the production user community in 2006/2007 and a BG/Q system deployed at ANL in late 2007/early 2008.

The Blue Gene project is leveraging experience gained in the DOE-supported joint project among IBM, Columbia University, and BNL to develop and deploy the 10 Tflop/s QCDOC machine. Blue Gene differs from the QCDOC machine in several important ways. BG/L uses a "double" floating-point unit, capable of two fused multiply-adds per cycle, compared to QCDOC's single floating-point unit. The BG/L network is faster and more sophisticated, supporting point-to-point routing (via a 3D torus), whereas the QCDOC network is strictly nearest neighbor. BG/L also has dramatically higher internal memory bandwidth from the embedded DRAM to the CPUs (22 GB/s vs. 8 GB/s), as well as faster network connections and a more advanced communications stack. The QCDOC project and ANL have agreed to work together to directly leverage the relevant experience gained in that project.

While the original application focus of the Blue Gene project was to address protein folding problems, the current design point is truly general purpose, with industry-standard PowerPC instruction sets, standard product-level compilers, support for MPI-based message passing and remote memory access using put/get operations, high-bandwidth access to on-chip memory, and high interconnect bisection bandwidth (Table 2 above). All of these features make the machine architecture much more in line with the directions recently taken by commodity clusters. Indeed, we believe that the BG/X series of machines can be viewed as one potential evolutionary successor to such clusters, in many cases enabling well-written application software developed for commodity clusters to be ported with few changes.
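
As a minimal sketch of what that portability means in practice (an illustrative fragment written for this discussion, not code from the BG/L software stack), a standard MPI nearest-neighbor exchange of the following kind is the sort of cluster code expected to carry over with little change:

    /* Illustrative only: a periodic nearest-neighbor exchange written with
       standard MPI point-to-point calls; compile with an MPI C compiler. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double send, recv = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        send = (double) rank;
        int right = (rank + 1) % size;          /* neighbor one rank up   */
        int left  = (rank - 1 + size) % size;   /* neighbor one rank down */

        /* Send to the right neighbor while receiving from the left one. */
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                     &recv, 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %.0f from rank %d\n", rank, recv, left);
        MPI_Finalize();
        return 0;
    }

On BG/L, point-to-point traffic of this kind would map onto the 3D torus network described below, while the same source compiles unchanged on a commodity cluster.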

Each of BG/L's 64K compute nodes has a relatively slow clock rate (700 MHz), which contributes to both lower cost and low power consumption. This low power consumption enables the systems to be air cooled and permits dense packaging, which together allow 1,024 compute nodes per rack. The current design point for BG/L requires roughly 1 MW of power for the 180 Tflop/s peak system, ~300 tons of cooling, and less than 4,000 sq ft of floor space. These numbers are in many cases an order of magnitude better than competing proposals.

Since all inter-node networking functions are integrated onto the same chip as the processors and embedded DRAM, there is no need for a separate switch and switching fabric, which also improves reliability and lowers cost. The BG/L node has two identical processors. In normal operation, one processor is fully dedicated to message passing and the other is reserved for computational work. Under software control, both can be used for computation, effectively doubling the available computational power. For our estimates, we assume the standard mode of processing and count the performance of only one processor in computing peak performance numbers.

Blue Gene Architecture Overview

BG/L is a scalable system with a maximum size of 65,536 compute nodes; the system is configured as a three-dimensional torus (64 × 32 × 32). Each node is implemented on a single CMOS chip (which contains two processors, caches, 4 MB of embedded DRAM, and multiple network interfaces) plus external memory. This chip is implemented with IBM's system-on-a-chip technology and exploits a number of pre-existing intellectual property (IP) cores. The node chip is small compared to today's commercial microprocessors, with an estimated 11.1 mm-square die when implemented in 0.13 micron technology. The maximum external memory supported is 2 GB per node; however, the current design point for BG/L specifies 256 MB of DDR SDRAM per node. There is the possibility of altering this specification, and that is one of the design alternatives we propose to evaluate during our assessment of the design for DOE SC applications.

The system-level design puts two nodes per compute card, 16 cards per node board, and 16 node boards per 512-node midplane (Figure 1). Two midplanes fit into a rack. Each processor is capable of four floating-point operations per cycle (i.e., two fused multiply-adds per cycle). This yields 2.8 Tflop/s peak performance per rack. A complete BG/L system would be 64 racks.
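
For reference, the peak figures quoted here follow directly from the numbers above:

    700 MHz × 4 flop/cycle   = 2.8 Gflop/s per node (counting one processor),
    2.8 Gflop/s × 1,024 nodes ≈ 2.8 Tflop/s per rack,
    2.8 Tflop/s × 64 racks    ≈ 180 Tflop/s peak for the full system.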

Figure 1. Blue Gene system design.

In addition to each group of 64 compute nodes, there is a dual processor I/O node which performs external I/O and higher-level operating system functions. There is also a host system, separate from the core BG/L complex, which performs certain supervisory, control, monitoring, and maintenance functions. The I/O nodes are configured with extra memory and additional external I/O interfaces. Our current plan is to run a full Linux environment on these I/O nodes to fully exploit open-source operating systems software and related scalable systems software. The more specialized operating system on the compute nodes provides basic services with very low system overhead, and depends on the I/O nodes for external communications and advanced OS functions. Process management and scheduling, for example, would be cooperative tasks involving both the compute nodes and OS services nodes. Development of these OS services can precede the availability of the BG/L hardware by utilizing an off-the-shelf 1,000-node Linux cluster, with each node having a separate set of processes emulating the demands of compute nodes. ANL has previously proposed to DOE the need for a large-scale Linux cluster for scalable systems software develop and that system would be very suitable to support the development of these OS services as well.
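
To make this division of labor concrete, the toy sketch below (our illustration only; the request format and the socketpair transport are invented for the example and are not the BG/L protocol) shows a "compute" process shipping a write request to an "I/O" process, which performs the real system call and returns the result:

    /* Toy function-shipping illustration, not BG/L code: the "compute"
       process never performs the file I/O itself; it forwards the request
       to the "I/O" process and waits for the result. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct io_request { size_t len; char data[128]; };

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
            perror("socketpair");
            return 1;
        }

        pid_t pid = fork();
        if (pid == 0) {
            /* "Compute node" side: build and ship the request. */
            close(sv[0]);
            struct io_request req;
            const char *msg = "hello from the compute side\n";
            req.len = strlen(msg);
            memcpy(req.data, msg, req.len);
            write(sv[1], &req, sizeof req);       /* ship the request   */
            ssize_t result;
            read(sv[1], &result, sizeof result);  /* wait for the reply */
            _exit(result == (ssize_t) req.len ? 0 : 1);
        }

        /* "I/O node" side: receive the request, do the real I/O, reply. */
        close(sv[1]);
        struct io_request req;
        read(sv[0], &req, sizeof req);
        ssize_t result = write(STDOUT_FILENO, req.data, req.len);
        write(sv[0], &result, sizeof result);
        waitpid(pid, NULL, 0);
        return 0;
    }

In the actual system, requests of this kind would travel over the machine's internal networks from the specialized compute-node operating system to the Linux-based I/O nodes.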

BG/L nodes are interconnected via five networks: a 3D torus network for high-performance point-to-point message passing among the compute nodes, a global broadcast/combine network for collective operations, a fast network for global barriers and interrupts, and two Gigabit Ethernet networks, one for machine control and diagnostics and the other for host and fileserver connections.

Collaborative Development with IBM

We have developed with IBM a comprehensive strategy for collaborative development that addresses the critical needs in moving Blue Gene from a research project to one that can provide high-level production support for critical applications. We outline that plan here.

Simulation software framework. A set of architecturally accurate simulators has been developed for the BG/L project to enable porting and tuning of applications prior to the availability of hardware. We plan to deploy these simulators, make them available to our application partners, and work with them to have BG/L-ready codes and libraries well in advance of the systems. We also plan to work closely with IBM to extend these simulators to incorporate changes to the architecture over time, and specifically to track the evolution of the architecture for BG/P and BG/Q. The simulators run on Linux clusters, and we propose to deploy a series of large-scale Linux systems to support their use by the community. These large-scale Linux systems will also enable the development of OS services and I/O frameworks for the BG/L system in advance of hardware.

Systems software development. Argonne has already been working closely with IBM on the development of two critical items of systems software for BG/L: the message passing system based on MPICH/ADI-3 and process management. We plan to put in place a much more comprehensive systems software development effort that will target the OS functions needed to support BG/L codes that today run on Linux clusters. To do this, we will need to develop mechanisms to offload OS functions from the compute nodes to the OS services nodes. These developments will be pursued in conjunction with the broader community and will leverage the SciDAC systems software projects and existing open-source activities. The Argonne/Berkeley collaboration will also provide the community with open access to development testbeds already being deployed at ANL for this purpose.

Applications development. There are several major challenges for effective use of BG/X for scientific applications. Applications have to be parallelized to a degree substantially beyond current levels of control concurrency: most current high-end applications do not scale much beyond 1,000 nodes, while effective use of BG/L will require scaling to 64K nodes, so significant work is needed to improve application scalability. Applications also need to be optimized for the multi-level memory hierarchy of BG/L. BG/L has very high internal memory bandwidth, ~11 GB/s between the on-chip RAM and the register set of each of the two CPUs (22 GB/s aggregate). Given the rather modest 2.8 Gflop/s peak performance of the CPU, this is almost 4 bytes per flop, which is among the highest relative burst memory bandwidths of any existing system (Table 2 above). However, that on-chip memory is relatively small at 4 MB. Off-chip memory bandwidth is still relatively high at 5.5 GB/s, but to achieve high sustained node performance, applications will have to manage this memory effectively. Finally, applications will have to be modified to fit in the relatively small memory footprint of the node (256 MB) and to exploit the distributed OS services model effectively. One promising path to addressing application needs on BG/L is to build highly tuned numerical libraries, such as PETSc (which is used by Jardin's fusion code), that address the memory size and memory hierarchy issues; this approach can shorten the time to get many applications up on BG/L and provide a faster path to BG/P and BG/Q. For many of the application projects we have selected for initial development, we believe these challenges can be met, but doing so will require sustained cooperation among IBM, our laboratory-based computer scientists, algorithm developers, university partners, and applications scientists, with access to significant scalability testbeds, simulators, and early hardware platforms.
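
As an illustration of the kind of restructuring involved (a hypothetical sketch provided here, not code from any BG/L library), a loop nest can be blocked so that its working set fits in the 4 MB on-chip memory; with a block dimension of 384, the three active double-precision tiles occupy about 3.4 MB:

    /* Illustrative blocked matrix multiply: BS is chosen so that the three
       active tiles fit in a 4 MB on-chip memory (3 * 384 * 384 * 8 bytes
       is roughly 3.4 MB). */
    #include <stdio.h>

    #define N  1536   /* problem size, a multiple of BS for simplicity */
    #define BS  384   /* block dimension */

    static double A[N][N], B[N][N], C[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = 1.0;
                B[i][j] = 2.0;
                C[i][j] = 0.0;
            }

        /* Each (ii, kk, jj) tile works on data small enough to stay on chip. */
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double a = A[i][k];
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += a * B[k][j];
                        }

        printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
        return 0;
    }

The same consideration applies to library kernels: routines tuned with tile sizes matched to the 4 MB on-chip memory and the 5.5 GB/s off-chip bandwidth spare individual application teams from redoing this analysis.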

Performance analysis and tuning. As mentioned earlier, determining and enhancing the expected performance of real applications on BG/L and follow-on platforms will be a major activity of this project. We intend to leverage the SciDAC Performance Evaluation Research Center and other projects to develop predictive performance models of our selected applications and to use these models to influence design decisions for both software and future hardware platforms. By focusing on a multi-year development trajectory rather than a single platform, we can feed lessons and insights from previous generations forward into the next systems. Our strategy depends on being able to use BG/L as a testbed for developing applications and insight for BG/P, and BG/L and BG/P for BG/Q. We expect considerable effort will be required both to tune applications on the real hardware to achieve maximal performance and to debug the processes and methods used in simulation and performance estimation. A critical element of the strategy is to move toward a future architecture design methodology that is more quantitatively informed by results from real applications rather than relying on general rules of thumb, as has historically been the case. This approach requires the sustained participation of computer scientists, applications scientists, and computer designers over a period of many years.
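
One simple form such a predictive model can take, using the node figures quoted in the previous subsection, is a per-region bound of the type

    T_region ≈ max( F / 2.8 Gflop/s , B / 5.5 GB/s ),

where F is the floating-point work and B the off-chip memory traffic of a code region; comparing the two terms indicates whether the region is compute-bound or bandwidth-bound on a BG/L node. Models of this kind would then be refined against measurements on the simulators and early hardware.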

Projections forward into BG/P and BG/Q. We believe that the proposed BG/X approach is one of the most viable paths for achieving cost-effective, sustained petaflop/s performance by the end of the decade. A critical factor in the approach is a multi-year commitment to deploy a series of machines in the same family, which permits us to leverage investments made in applications, systems software, analysis and simulation tools, and development methods. If our work with BG/L proves successful, we would anticipate deploying a BG/P-class machine (peak petaflop/s, with a goal of sustaining several hundred teraflop/s) in the production computing infrastructure of the DOE Office of Science, while working toward the development of a machine capable of sustained petaflop/s.

