Los Alamos National Laboratory

Science 1663

Roadrunner—Computing in the Fast Lane

A new hybrid supercomputer that will use a video game chip to propel performance to petaflop/s speeds.

Photo of the programming team that proved Roadrunner could achieve its performance goals.
Left to right: Sriram Swaminarayan, Ben Bergen, John Turner, Mike Lang, Tim Kelley, and Jamaludin Mohd-Yusof represent the programming teams that proved Roadrunner could achieve its performance goals.

Abstract:
The drive for more scientific computing power is running into a brick wall. Old speed-up tricks aren't working, and a new paradigm is needed to sustain the rapid growth in computing power associated with Moore's law. Scientists at Los Alamos are blazing that trail with Roadrunner, a new hybrid supercomputer that will use a video game chip to propel performance to petaflop/s speeds.

For the last two decades, the number of transistors on a computer chip has doubled every 18 months, a phenomenon known as Moore's law. The resulting increase, more than a thousandfold, has driven a parallel increase in computer performance, fueling a technological revolution in which ever-more-sophisticated electronic devices are transforming both economies and cultures.

The computer industry has strong market incentives to keep up with Moore's law. Consumers want ever-greater access to information and multimedia entertainment, from video learning tools to movies and video games.

Scientists, meanwhile, want faster, more-powerful supercomputers to simulate complex physical, biological, and socioeconomic systems with greater realism and predictive power. The world's fastest supercomputer, Blue Gene/L at Livermore National Laboratory, performs at nearly 500 trillion arithmetic operations a second (500 teraflop/s).

Los Alamos scientists are shooting for double that performance, or one petaflop/s, which is 1,000 trillion calculations per second. They plan to do it with the Roadrunner supercomputer scheduled for installation at Los Alamos starting this summer, with full operation targeted for early 2009.

Hitting that target is no small task; the building blocks of supercomputers—chips (microprocessors)—have reached their speed limits, endangering the future of Moore's law for scientific computing.

Screeching to a Halt?

A microprocessor depends on two things for its performance. Its internal clock speed determines how fast it does the number crunching—the logical and arithmetic operations. The number of transistors that make up its logic circuits determines how “smart” it is, that is, how many different types of operations it can perform simultaneously.
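As a rough worked example (the numbers here are illustrative, not those of any particular chip): a microprocessor ticking at 3 gigahertz whose circuits complete 4 arithmetic operations per tick peaks at 3 billion × 4 = 12 billion operations per second, or 12 gigaflop/s. Raise either factor, the clock speed or the operations per tick, and performance rises with it.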

Modern supercomputers have thousands of identical computer nodes, each containing a microprocessor and a separate memory. The nodes are connected to form a cluster and work simultaneously (in parallel) on a single problem.
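For readers who program, the sketch below shows the idea in miniature. It assumes an MPI-style message-passing library, a common choice for such clusters (the article does not name Roadrunner's software): each node sums its own share of a long series, and the partial results are then combined into one answer.

    /* A minimal sketch of cluster parallelism, assuming MPI.
       Each node (process) sums its own slice of the series
       0 + 1 + ... + 999999; MPI_Reduce combines the parts. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I?  */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many nodes?   */

        double local = 0.0;                     /* this node's share */
        for (int i = rank; i < 1000000; i += nprocs)
            local += (double)i;

        double total = 0.0;                     /* combined answer   */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %.0f\n", total);
        MPI_Finalize();
        return 0;
    }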

Graph depicting Moore's law.
Moore’s law, the doubling of transistors on a chip every 18 months (top curve), continues unabated. Three other metrics affecting computer performance (bottom three curves) have leveled off since 2002: the maximum electric power a microprocessor can use, the clock speed, and performance (operations) per clock tick.

Increases in supercomputer performance have come in part from increases in the number of nodes in the cluster but, more important, from increases in microprocessor clock speed. As transistors were reduced in size and placed closer together, it was possible to turn them on and off faster and thus increase the clock speed. Today's transistors in high-end microprocessors have shrunk to 65 nanometers (billionths of a meter) and are running at gigahertz clock speeds (billions of clock ticks per second).

Transistors are still shrinking, and the number per microprocessor is still growing, but performance measured in arithmetic operations per second has flattened out since 2002.

"The biggest reason for the leveling off is the heat dissipation problem,” says Ken Koch, one of the leaders of the Roadrunner project. “With transistors at 65-nanometer sizes, the heating rate would increase 8 times whenever the clock speed was doubled, outstripping our ability to carry the heat away by standard air cooling.” It would be like running a car at high speed with no water in the radiator.

Another huge obstacle to increased performance is the memory barrier. In the not-too-distant past, the time to fetch data from the node memory and load it into the processing units (called the “compute core”) of a microprocessor was comparable to the time that core needed to do the number crunching. Now fetching and loading the data takes roughly 50 times longer than the number crunching itself. The time spent in data retrieval and communications can no longer be ignored.
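To put rough numbers on that gap (the figures are illustrative, built only on the 50-to-1 ratio above): a core that waits for each piece of data before computing on it spends 1 tick crunching and 49 ticks idle, wasting some 98 percent of its potential. Keeping such a core busy requires fetching data well ahead of when it is needed.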

Clearly the old solution for increasing supercomputer performance—miniaturizing circuits and using faster clocks—is breaking down.

Video Games Open a New Path

“We replace our high-performance supercomputers every 4 or 5 years,” says Andy White, longtime leader of supercomputer development at Los Alamos. “They become outdated in terms of speed, and the maintenance costs and failure rates get too high.”

With this turnover rate, you might think the Lab would be a major force in computer development. “That used to be the case,” says White, “but today we're small potatoes. Even a decade ago, when the Department of Energy purchased Blue Mountain for Los Alamos and Blue Pacific for Livermore at a total cost of roughly $200 million, a German bank was spending much more, signing an $8-billion agreement for information technology services with IBM. And the personal computer market has exploded since then, bringing prices for commodity microprocessors way down. Since the late 1990s, because of cost, we've had little choice but to build supercomputers from off-the-shelf rather than custom-made components.”

That's why in 2002, when Los Alamos scientists were planning for their next-generation supercomputer, they looked at the commodity market for a way to make an end run around the speed and memory barriers looming in the future.

What they found was a joint project by Sony Computer Entertainment, Toshiba, and IBM to develop a specialized microprocessor that could revolutionize computer games and consumer electronics, as well as scientific computing.

Photo of Alex and Andrew Turner playing a home video game.
Alex and Andrew Turner playing a game of “MX vs. ATV Untamed” on the Cell-powered Sony PlayStation 3. A similar chip will power Roadrunner.

These corporations invested $400 million over 4 years to produce the Cell Broadband Engine (the “Cell”), a powerhouse microprocessor carrying 240 million transistors. Its first application would be in the Sony PlayStation 3, a top-end video-game console.

Today's video games are like interactive movies, complete with elaborate, computer-generated backgrounds and interacting characters that are more and more realistic.

Illustration of the Cell microprocessor.
The Cell microprocessor contains a PowerPC compute core that oversees all the system operations and a set of eight simple processing elements, known as SPEs, that are optimized for both image processing and the arithmetic operations at the heart of numerical simulations. Each SPE is specialized to work on multiple data items at a time (a process called vector processing, or SIMD), which is very efficient for repetitive mathematical operations on well-defined groups of data.

The Cell was designed with enough computing power to enhance interactivity, allowing video games to be even less scripted. Its eight SPEs get around the speed barrier by working together. They can generate dynamic image sequences in record time, sequences that reflect the game player's intention and even have the correct physics.
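A minimal sketch of what vector (SIMD) processing means in code, using the GCC/Clang vector extension as a stand-in for the Cell's own SPE instructions (which the article does not detail): one instruction operates on four numbers at once.

    /* SIMD sketch: one instruction, multiple data items.
       The GCC/Clang vector extension stands in for the Cell's
       SPE intrinsics; the idea is the same in both. */
    #include <stdio.h>

    typedef float v4sf __attribute__((vector_size(16)));  /* 4 packed floats */

    int main(void)
    {
        v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
        v4sf b = {10.0f, 20.0f, 30.0f, 40.0f};

        v4sf c = a + b;   /* one vector add = four scalar adds at once */

        for (int i = 0; i < 4; i++)
            printf("%g\n", c[i]);
        return 0;
    }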

The Cell gets around the memory barrier as well. It does so by giving each SPE a small, fast local (on-chip) memory plus a memory engine, and by moving data within the Cell over an ultra-high-speed bus. The local memories store exactly the data and instructions needed to perform the next computations, while all eight memory engines act like runners, simultaneously retrieving from off-chip memory the data that will be needed for computations further down the line.

Optimized for maximum computation per watt of electricity, the Cell looked like a good bet for accelerating supercomputing performance. Los Alamos knew, however, that the Cell would need some modifications for petaflop/s scientific computing. IBM was willing to work on the enhancements.

A Hybrid That Raises Skepticism

Detailed illustration showing Roadrunner compute nodes.
Roadrunner is a cluster of approximately 3,250 compute nodes interconnected by an off-the-shelf parallel-computing network. Each compute node consists of two AMD Opteron dual-core microprocessors, with each of the Opteron cores internally attached to one of four enhanced Cell microprocessors. This enhanced Cell does double-precision arithmetic faster and can access more memory than can the original Cell in a PlayStation 3. The entire machine will have almost 13,000 Cells and half as many dual-core Opterons.

Named after the fleet-of-foot New Mexico state bird, the Roadrunner supercomputer is a hybrid, containing not one type of microprocessor but two.

Its main structure is a standard cluster of microprocessors (in this case AMD Opteron dual-core microprocessors). Nothing new here except that each chip has two compute cores instead of one. The hybrid element enters the picture when each Opteron core is internally attached to another type of chip, the enhanced Cell (the PowerXCell 8i), which has been designed specially for Roadrunner. The enhanced Cell can act like a turbocharger, potentially boosting the performance up to 25 times over that of an Opteron compute core alone.

The rub is that achieving a good speedup (from 4 to 10 times) is not automatic. It comes about only if the programmers can get all the Cell and Opteron microprocessors and their memories working together efficiently.
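There is also a hard ceiling, familiar to programmers as Amdahl's law (not named in this article, but implicit in its numbers): only the portion of a code that runs on the Cells gets accelerated, and the rest sets the limit. For example, if half of a code's running time stays on the Opteron, even infinitely fast Cells can speed the whole code up by no more than a factor of 2.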

“The typical challenge in programming a code is to solve the mathematical equations in the fewest arithmetic (data processing) operations. Here the challenge was different—to minimize the flow of data while simultaneously taking full advantage of the computational power of the Cell,” explains Sriram Swaminarayan, a Los Alamos computational physicist. “At first most of us were skeptical it could be done.”

VPIC simulation on Opteron-Cell nodes.
Recent VPIC simulations on Opteron-Cell nodes reveal that magnetic reconnection (energy transfer from magnetic fields to plasma particles) involves “kinking” the current layer and forming rope-like structures.

After signing the original contract with IBM, Los Alamos had just over 12 months, until October 2007, to decide whether it would purchase the Cells and proceed with its investment in this new hybrid computer architecture.

“Before making that decision, we needed to see if the Cell's stringent programming demands could be dealt with successfully on the types of codes that are important to nuclear weapons stockpile stewardship and other national security missions,” comments John Turner, leader of the Roadrunner Algorithms and Applications team.



The Programming Experience

“The Cell is designed to move data in, compute on it, and move it out faster than an ordinary microprocessor but only if the programmer writes code in a specific way,” explains Koch. For the Cell, the programmer must know exactly what's needed to do one computation and then specify that the necessary instructions and data for that one computation are fetched from the Cell's off-chip memory in a single step. They are then stored in the on-chip memories of each of the Cell's eight SPEs. IBM's Peter Hofstee, the Cell's chief architect, describes this process as “a shopping list approach,” likening off-chip memory to Home Depot. You save time if you get all the supplies in one trip, rather than making multiple trips for each piece just when you need it.
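In code, the shopping-list idea looks roughly like the sketch below. The names and sizes are hypothetical; fetch_block() stands in for one bulk transfer (on the real Cell, a DMA operation) from off-chip memory into a small on-chip buffer, after which the computation runs with no further trips to memory.

    /* "Shopping list" sketch: gather all needed data in one bulk
       transfer, then compute entirely out of fast on-chip memory.
       fetch_block() is a hypothetical stand-in for the Cell's DMA. */
    #include <stdio.h>
    #include <string.h>

    #define CHUNK 256   /* items that fit in the small on-chip store */

    static void fetch_block(float *onchip, const float *offchip, int n)
    {
        memcpy(onchip, offchip, n * sizeof *onchip);  /* one trip */
    }

    int main(void)
    {
        static float offchip[CHUNK];          /* slow, distant memory */
        for (int i = 0; i < CHUNK; i++)
            offchip[i] = (float)i;

        float local[CHUNK];                   /* fast local store */
        fetch_block(local, offchip, CHUNK);   /* one shopping trip */

        float sum = 0.0f;                     /* compute, no more fetches */
        for (int i = 0; i < CHUNK; i++)
            sum += local[i] * local[i];
        printf("sum = %g\n", sum);
        return 0;
    }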

Simulation of exploding star using Milagro supernova calculations.
If chosen to run on Roadrunner, supernova calculations using Milagro will be the first to determine the real influence of radiation flow on the light signals from these exploding stars.

The small size of the on-chip memories is an additional challenge. The programmer must divide the computation (the data and instructions) into chunks appropriate for the on-chip memories, then feed the Cell many small chunks in assembly-line fashion; otherwise, the Cell will not deliver a speedup.
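The assembly line the text describes is what programmers call double buffering. The sketch below shows the pattern in ordinary C; memcpy stands in for the asynchronous transfers that, on the real hardware, overlap in time with the computation rather than preceding it.

    /* Double-buffering sketch: while the core computes on one
       chunk, the next chunk is fetched into a second buffer.
       memcpy is a stand-in for an asynchronous DMA transfer. */
    #include <stdio.h>
    #include <string.h>

    #define CHUNK   4
    #define NCHUNKS 3

    static void fetch(float *dst, const float *src)  /* "async" fetch */
    {
        memcpy(dst, src, CHUNK * sizeof *dst);
    }

    int main(void)
    {
        float offchip[CHUNK * NCHUNKS];        /* distant memory */
        for (int i = 0; i < CHUNK * NCHUNKS; i++)
            offchip[i] = (float)i;

        float buf[2][CHUNK];                   /* two on-chip buffers */
        float sum = 0.0f;

        fetch(buf[0], &offchip[0]);            /* prime the pipeline */
        for (int c = 0; c < NCHUNKS; c++) {
            int cur = c & 1;                   /* alternate buffers */
            if (c + 1 < NCHUNKS)               /* start fetching NEXT chunk */
                fetch(buf[!cur], &offchip[(c + 1) * CHUNK]);
            for (int i = 0; i < CHUNK; i++)    /* ...while computing on this one */
                sum += buf[cur][i];
        }
        printf("sum = %g\n", sum);
        return 0;
    }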

Turner organized teams of physicists and computer scientists to restructure codes for a spectrum of important application areas, re-implementing sections as necessary for the new architecture. The teams tested the rewritten codes first on a single Opteron compute core plus one Cell microprocessor and then in parallel on up to 24 such Opteron-Cell pairs.

The major application areas addressed were radiation transport (how radiation deposits energy in and moves through matter), neutron transport (how neutrons move through matter), molecular dynamics (how matter responds at the molecular level to shock waves and other extreme conditions), fluid turbulence, and the behavior of plasmas (ionized gases) in relation to fusion experiments at the National Ignition Facility at Livermore National Laboratory. The corresponding codes represented a range of methods for solving equations on a computer.

In the end, each code achieved a substantial speedup when run on a Cell-accelerated Opteron compute node in comparison with execution on a single Opteron compute core, without the Cell. The VPIC code, which simulates plasmas in magnetic fields, is a prime example. It ran 6 times faster on the Opteron-Cell node than on the Opteron alone. That increase will allow researchers to tackle some scientific grand-challenge problems.

Successfully accelerating the Monte Carlo code called Milagro took many months, several false starts, and modification of 10 to 30 percent of the code. Monte Carlo codes, which simulate radiation transport, are very expensive computationally. As the October decision time drew near, Milagro was also executing 6 times faster with the Cell than without, a crucial achievement for the acceptance of Roadrunner.

Bringing It All Together

But testing the codes on single nodes or small groups of them could provide only part of the performance picture. “The performance of systems such as Roadrunner comes from a complex interplay of system architecture and the workload characteristics of the applications,” says Adolfy Hoisie, who leads system and applications performance activities for the Roadrunner project.

Simulation of Rayleigh-Taylor turbulence.
High-resolution simulations (30 billion cells) of Rayleigh-Taylor turbulence reveal details of the mixing layer between two fluids.

To project how the codes would perform on the full system, with its approximately 3,250 compute nodes containing about 13,000 Cells, computer scientists in the Performance and Architecture Lab (PAL) brought their full performance-modeling methodology, both novel techniques and proven ones, to bear.

PAL's work was instrumental not only in accurately indicating the potential performance advantages of Roadrunner compared with other architectures but also in quantifying and guiding various system design decisions that ensured top performance on the codes of greatest interest to the Lab.

By the time October 2007 rolled around, Los Alamos was confident that the entire range of science problems, from radiation transport to molecular dynamics, would run on Roadrunner at accelerated speeds, from 4 to 9 times faster than on the Opteron cluster alone.

Los Alamos scientists are now confident that Roadrunner will become the world's fastest supercomputer. It will be a tremendous asset to the computer simulations performed at the Laboratory for the nuclear weapons program as well as for scientific grand challenges. Important codes are expected to run at 200 to 500 teraflop/s. Roadrunner will also be the first computer to run the universally recognized code used to test supercomputer performance—LINPACK—at over 1 petaflop/s.

Hybrid Computing—The Wave of the Future

After the late-summer delivery and a check-out period, Roadrunner will be opened to unclassified science applications for 4 to 6 months. A call for proposals within Los Alamos has already been issued. Turner says, “We expect to see proposals in cosmology, antibiotic drug design, HIV vaccine development, astrophysics, ocean or climate modeling, turbulence, and we hope many others.” Afterward, Roadrunner will be moved to the classified computing network and used to improve specific physics models for the nuclear weapons program and for validating the accuracy of answers from earlier, less-detailed nuclear weapons simulations.

Photo of cooling towers at Metropolis Center.
Cooling towers at the Los Alamos Metropolis Center for Modeling and Simulation dissipate the heat generated by the power-hungry supercomputers.

By 2010 the Lab's scientists plan to use Roadrunner to help definitively quantify uncertainties in simulations of nuclear weapons performance as well as to reduce those uncertainties in key areas. This is an important milestone in maintaining confidence in the nation's nuclear weapons stockpile without actual nuclear testing.

In the meantime, Roadrunner is seen as the first in a new wave of hybrid supercomputers that will be used for scientific grand challenges around the world. Turner sums up the Los Alamos outlook this way, “We broke new ground in learning to program Roadrunner, but the people who did it found it fun. Our task now is to transfer the knowledge we gained so that new users of hybrid supercomputers won't have to go down the blind alleys we explored.”

Key words - Supercomputer, Moore’s law, Cell Broadband Engine, Sony PlayStation 3, hybrid architecture, Opteron-Cell node, programming, Monte Carlo codes, VPIC, Milagro, Performance and Architecture Lab, petaflop/s
