
The NERSC Center

High Performance Systems for Large-Scale Science

Meeting the Challenge of Running Larger Jobs

As home to one of the largest supercomputers available for unclassified research — the IBM SP “Seaborg” with 6,080 computing processors — the NERSC Center has moved aggressively to devote a greater share of its processing time to jobs running on 512 or more processors, in accordance with DOE and Office of Management and Budget goals.

Among these jobs was a calculation of an entire year’s worth of simulated data from the Planck satellite, which ran on 6,000 processors in just two hours. This was the first time data had ever been processed on that many of Seaborg’s processors at once.

Figures 3 and 4 show NERSC's progress over the course of the year in raising the proportion of large jobs. Achieving this goal required major efforts from NERSC staff in both systems and applications.

Figure 3. In 2004 NERSC continued to make improvements to Seaborg’s ability to schedule larger job sizes.

Figure 4. Toward the end of 2004, jobs with a processor count of 512 or more peaked around 78% of system utilization — a dramatic improvement from earlier in the year.


From the perspective of the Computational Systems Group, the main issue posed by large jobs is scheduling. “Given the nature of a system like Seaborg, this is a difficult task,” said group leader Jim Craw.

Just as nature abhors a vacuum, Seaborg is programmed to dislike idle nodes. As soon as processors are freed up, the system tries to fill them with the next appropriate jobs in the queue. Left unchecked, this would result in lots of small jobs running, occupying the nodes and rarely freeing up enough processors to run the larger jobs.

Craw’s group created a customized scheduling algorithm for the LoadLeveler workload management system, giving priority to larger jobs and allowing them to work their way to the head of the queue. The system first calculates how many nodes a large job will need, then determines when that many nodes will become available. In the meantime, it keeps all the nodes utilized by assigning smaller jobs that will complete before all of the large job’s nodes become available.
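
This approach is essentially backfill scheduling. The sketch below is a simplified illustration of the idea, not NERSC’s actual LoadLeveler configuration; the job fields, the priority ordering, and the reservation logic are assumptions made for the example.

```python
# Simplified backfill-style scheduler illustrating the idea described above:
# the highest-priority (largest) job reserves a start time, and smaller jobs
# run in the meantime only if they will finish before that reservation.
# Illustrative sketch only, not NERSC's LoadLeveler configuration.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int        # nodes requested
    wallclock: float  # requested run time, in hours

def schedule_step(queue, free_nodes, node_release_times, now):
    """Decide which queued jobs to start right now.

    queue              -- jobs sorted so the highest-priority (largest) is first
    free_nodes         -- nodes idle at this moment
    node_release_times -- one expected release time per busy node
    (assumes the top job fits on the machine once enough nodes drain)
    """
    started = []
    if not queue:
        return started

    top = queue[0]
    if top.nodes <= free_nodes:
        # Enough nodes are already idle: start the large job immediately.
        started.append(queue.pop(0))
        return started

    # Not enough nodes yet: reserve a start time for the large job based on
    # when enough of the running jobs are expected to finish.
    still_needed = top.nodes - free_nodes
    reservation = sorted(node_release_times)[still_needed - 1]

    # Backfill: let smaller jobs use the idle nodes in the meantime, but only
    # if they are guaranteed to finish before the reservation.
    for job in list(queue[1:]):
        fits = job.nodes <= free_nodes
        finishes_in_time = now + job.wallclock <= reservation
        if fits and finishes_in_time:
            started.append(job)
            queue.remove(job)
            free_nodes -= job.nodes
    return started
```

The key constraint is the last one: a small job may jump ahead only if it is guaranteed to finish before the large job’s reserved start time, so backfilling keeps the nodes busy without ever delaying the high-priority job.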

While it represents a challenge to the staff, the need for such prioritizing is a testimony to the success of NERSC as an HPC center of choice. Because there is consistently more demand for computing time than can be allocated, NERSC needs to maintain a very high utilization rate, meaning that as many nodes as possible are kept in use for as many hours as possible.

“It really boils down to a balancing act between small jobs and large jobs,” Craw said. “We want to fill all the small holes as soon as we can, but we also need to have them available for large jobs. To achieve this, over the years we’ve had to do a lot of tuning of the LoadLeveler scheduling software.”

The LoadLeveler customization was just one part of a major upgrade to Seaborg’s system software. The migration from the AIX 5.1 operating system to version 5.2 was a significant undertaking for both NERSC and IBM. NERSC wanted the numerous improvements and fixes available only in AIX 5.2, including current and future enhancements to LoadLeveler, the GPFS file system, the Parallel Environment, and checkpoint/restart capability. However, NERSC was the first site to migrate to AIX 5.2 on a system with Power3 processors, and there were several difficulties that had to be overcome. But in true NERSC character, and with lots of help from IBM, the Systems staff solved all the problems and successfully deployed AIX 5.2.

“September 2004 was a very intensive month,” Craw recalled, “with many NERSC and IBM staff working around the clock to resolve problems encountered. In the end, NERSC’s staff and users provided excellent bug reports and feedback to IBM so they could improve the migration process and make it easier for other sites in the future.”

The Computational Systems Group has also been working to deploy checkpoint/restart functionality on Seaborg. This functionality, which brings all jobs to a stopping point and restarts them from that point when the system is restored, will primarily be used for system administration tasks, such as shortening drain time before system or node maintenance.

Over the past two years, NERSC staff have been working with IBM to resolve a significant number of technical and design issues before checkpoint/restart can be put into production use. Additionally, NERSC staff had to change the job submission mechanism used on Seaborg and develop a custom program that makes all LoadLeveler jobs checkpointable before they are submitted to the batch system.

“These changes are transparent to the users, so they will not notice any difference in how the batch system accepts, queues, and runs their jobs,” said Jay Srinivasan, the project leader. “However, most or all jobs submitted to LoadLeveler will be checkpointable, which will allow us to use checkpoint/restart to migrate jobs from nodes and shorten queue drain times before system maintenance.” Once checkpoint/restart for LoadLeveler proves to be reasonably stable, it is expected to go into production use during the first half of 2005.
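
The details of NERSC’s custom program are not spelled out here, but a submit filter of this kind typically rewrites the job command file before LoadLeveler sees it. The sketch below is a hypothetical illustration: it assumes a filter hook that receives the job command file on standard input, and it uses a LoadLeveler-style "# @ checkpoint = yes" directive as the way to mark a job step checkpointable.

```python
#!/usr/bin/env python
# Hypothetical submit filter that marks every LoadLeveler job step
# checkpointable. Assumes the filter reads the job command file on stdin
# and writes the (possibly modified) file to stdout; the exact directive
# syntax is an assumption for illustration, not NERSC's implementation.

import sys

def make_checkpointable(lines):
    """Insert '# @ checkpoint = yes' before each '# @ queue' directive
    unless the job step already sets a checkpoint keyword."""
    out, step_has_ckpt = [], False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("# @"):
            keyword = stripped[3:].split("=")[0].strip().lower()
            if keyword == "checkpoint":
                step_has_ckpt = True
            elif keyword == "queue":          # '# @ queue' closes a job step
                if not step_has_ckpt:
                    out.append("# @ checkpoint = yes\n")
                step_has_ckpt = False         # reset for the next step
        out.append(line)
    return out

if __name__ == "__main__":
    sys.stdout.writelines(make_checkpointable(sys.stdin.readlines()))
```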

Also being deployed early in 2005 is a new Fibre Channel disk array that provides an additional 12 TB of storage for users’ home directories on Seaborg. Additionally, user home areas were reduced from 13 independent file systems to only three, recouping about 2 TB of unused free space from the separate areas. Together, these two changes allowed user home storage quotas to be increased by 50%. The disks freed up by the new array will be added to existing GPFS scratch and other file systems.

On the applications side, “Running larger jobs is a matter of removing bottlenecks,” said David Skinner of the User Services Group. “There is always some barrier to running at a higher scale.”

On Seaborg, choosing the right parallel I/O strategy for a large job can make a big difference in that job’s performance. “If you have a code that runs on 16 tasks, there are a lot of ways to do I/O that will perform roughly the same,” Skinner said. “But when you scale up to 4,000 tasks, there is a lot of divergence between the different I/O strategies.”
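
The divergence Skinner describes shows up, for example, in the choice between writing one file per task and writing a single shared file through collective MPI-IO calls. The sketch below is only an illustration of the two strategies, written with mpi4py as a stand-in for the MPI-IO interfaces a Fortran or C code on Seaborg would call; the file names and array sizes are placeholders.

```python
# Two common parallel I/O strategies. At 16 tasks they perform similarly;
# at thousands of tasks the file-per-task approach floods the file system
# with files and metadata operations, while a collective write to a single
# shared file lets the MPI-IO layer aggregate requests. Illustrative only.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each task owns a chunk of a (made-up) result array.
local = np.full(1_000_000, rank, dtype=np.float64)

# Strategy 1: one file per task -- simple, but creates as many files
# (and as many metadata operations) as there are tasks.
local.tofile(f"results.{rank:05d}.dat")

# Strategy 2: one shared file written with a collective MPI-IO call,
# each task writing at its own byte offset.
fh = MPI.File.Open(comm, "results.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(rank * local.nbytes, local)   # collective: all tasks join in
fh.Close()
```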

There are two frequently encountered bottlenecks to scaling that come from the computational approach itself. NERSC consultants help remove these bottlenecks by rethinking the computational strategy and rewriting portions of the code.

The first bottleneck is synchronization, in which all of the calculations in a code are programmed to meet up at the same time. As the code scales to more tasks, this becomes more difficult. Skinner likens it to trying to arrange a lunch with various numbers of people. The larger the desired group, the harder it is to get everyone together at the same time and place.

“People think in a synchronous way, about closure,” Skinner said. “But in a code, you often don’t need synchronization. If you remove this constraint, the problems can run unimpeded as long as necessary.”
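
A common way to relax that constraint is to replace global synchronization with point-to-point exchanges that each task can wait on independently. The sketch below is a generic illustration in mpi4py, not taken from any NERSC code; the compute function and neighbor pattern are invented for the example.

```python
# Contrast between a fully synchronized iteration and one where each task
# waits only for the neighbor data it actually needs. Illustrative sketch;
# compute() and the ring-neighbor pattern are placeholders.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

def compute(step):
    return float(rank + step)          # stand-in for real work

for step in range(10):
    value = compute(step)

    # Synchronous style: a global barrier here would force every task to
    # wait for the slowest one each step -- the "lunch date" problem.
    # comm.Barrier()

    # Asynchronous style: exchange data only with the neighbors that need
    # it, and overlap the wait with more useful work.
    send_req = comm.isend(value, dest=right, tag=step)
    recv_req = comm.irecv(source=left, tag=step)
    more_work = compute(step)          # proceeds without a global sync
    incoming = recv_req.wait()         # wait only for the one neighbor
    send_req.wait()
```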

The other bottleneck is load balancing. Dividing a large scientific problem into smaller segments (the more uniform the segments, the better) often helps the job scale, Skinner said. “Domain decomposition is important,” he added.
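
A minimal example of what uniform decomposition means in practice: split the problem so that no task’s share differs from any other’s by more than one unit of work. The domain size and task count below are arbitrary, and real codes typically decompose in two or three dimensions.

```python
# Minimal illustration of uniform domain decomposition: split n cells
# across p tasks so that no task gets more than one extra cell. The
# numbers are arbitrary placeholders.

def decompose(n_cells, n_tasks, rank):
    """Return the (start, stop) range of cells owned by this rank."""
    base, extra = divmod(n_cells, n_tasks)
    # The first 'extra' ranks take one additional cell each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

if __name__ == "__main__":
    n_cells, n_tasks = 1000, 64
    sizes = [stop - start for start, stop in
             (decompose(n_cells, n_tasks, r) for r in range(n_tasks))]
    # Uniform segments: the largest and smallest shares differ by at most
    # one cell, so no task sits idle waiting for an overloaded neighbor.
    print(max(sizes) - min(sizes))     # prints 0 or 1
```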

Storage Upgrades Reduce Costs

Since NERSC relocated to Berkeley Lab almost 10 years ago, one of the Center’s goals has been to consistently introduce new technologies without creating bottlenecks to users’ ongoing work. Through a program of planned upgrades to higher-density tapes, the Mass Storage Group (Figure 5) has not only achieved this goal, but will save $3.7 million over the course of five years. Over that same period, from 2003 to 2007, the amount of data stored will grow from just over 1 petabyte to 11.3 petabytes.

       
Figure 5. The Mass Storage Group (Nancy Meyer, Matthew Andrews, Shreyas Cholia, Damian Hazen, Wayne Hurlbert, and Nancy Johnston). By taking advantage of new technologies, the group saves money while meeting the ballooning need for data storage.

 

As computational science becomes increasingly data intensive, the need for storage has ballooned. When the NERSC Center moved into the Oakland Scientific Facility in 2000, the archival data were stored on 8,000 tape cartridges. Four years later, the silos housed 35,000 cartridges. While this increase is dramatic, the real growth indicator is that the amount of data stored on each cartridge has also risen by a factor of 10.

This planned switch to higher-density cartridges is at the heart of the Mass Storage Group’s investment strategy. The strategy of saving money by adopting the latest technology as soon as it is available flies in the face of conventional wisdom, which holds that by waiting for the technology to become pervasive, the price will drop. Because the storage group deploys newer technology incrementally rather than all at once, the center benefits from the technological improvements but does not get locked into one format.

“This allows us to grow to the next dense technology or build the current dense technology gradually by 10 to 15 drives a year, rather than replacing hundreds of drives all at once,” said Mass Storage Group Lead Nancy Meyer.

“Our rule of thumb is to maintain the archive with 10 percent of the data going to the lower density cartridges and repack 90 percent onto the higher density cartridges. This frees up the lower density cartridges and we can reuse them without purchasing more media. And when data is growing by petabytes per year, this adds up to significant savings.”

NERSC’s investment strategy is already paying off, saving an estimated $135,000 in 2004. By 2007, the annual saving is estimated at $2.3 million. In all, the program is expected to save $3,739,319 over five years.

At the start of the plan, data were stored on a mix of cartridges, each capable of holding either 20 or 60 gigabytes. To keep up with data growth, the group could either stick with the current cartridges — and buy a lot of them — or adopt the newly available 200 GB cartridges. While this switch would also require the center to buy new tape drives to read the denser tapes, the long-term costs showed that higher density meant lower overall costs.

For example, in 2004, data grew by roughly 832 terabytes, or 832,000 gigabytes. Buying enough 60 GB cartridges would have cost $497,952, while buying the higher-density tapes, plus the tape drives needed to read them, cost $135,000 less. In 2005, when the group expects to upgrade to 500 GB cartridges, the estimated savings will be $308,000, even after accounting for new tape drives.
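
The arithmetic behind that comparison can be sketched as follows. The dollar totals are the ones quoted above; the per-cartridge price is implied by them rather than a known procurement figure, and the cartridge counts are simple back-of-the-envelope estimates.

```python
# Back-of-the-envelope view of the 2004 tape cost comparison described
# above. The dollar figures are the ones quoted in the text; the implied
# per-cartridge price is derived from them, not a procurement number.

import math

growth_gb = 832 * 1000                  # 2004 archive growth (~832 TB)

low_carts = math.ceil(growth_gb / 60)   # 60 GB cartridges needed
hi_carts = math.ceil(growth_gb / 200)   # 200 GB cartridges needed

low_cost = 497_952.0                    # quoted cost of the 60 GB option
savings = 135_000.0                     # quoted 2004 savings
hi_cost = low_cost - savings            # higher-density tapes plus drives

print(f"60 GB option:  {low_carts:,} cartridges "
      f"(~${low_cost / low_carts:.0f} per cartridge implied)")
print(f"200 GB option: {hi_carts:,} cartridges plus new drives")
print(f"higher density saves ${savings:,.0f} in 2004 alone")
```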

“It’s important to note that these savings only reflect the cost of the tapes,” Meyer said. “It doesn’t include the money saved by not adding more silos or moving other systems to create more adjacent floor space.”

Other computing centers, according to Meyer, take the more traditional approach. Not only is this more expensive in the long run, but it also results in wholesale rather than incremental upgrades, and these can be more disruptive to users.

NERSC users, Meyer said, don’t see much difference when storage equipment is upgraded. Smaller files, amounting to about 10 percent of the total data, are still stored on 20 GB cartridges. Finding a file on one of these tapes takes about 10 seconds. Locating a file on the denser tape takes about a minute, possibly 90 seconds. “While it takes a little longer to locate files, we think it’s a fair tradeoff for almost $4 million in savings,” Meyer said.

PDSF Boosts Processors, Performance

Researchers using the PDSF cluster managed by NERSC benefited from a system upgrade in 2004, giving them access to more processing power and a higher-speed network connection for accessing archived data. The new hardware, shipped in September, included 48 dual-processor Xeon nodes and 10 dual-processor Opteron nodes, which were added to the existing 550-processor cluster.

The system’s connection to NERSC’s internal network was upgraded from a 2 Gigabit switch to a 10 Gigabit switch. This will improve the transfer rate of data between PDSF and NERSC’s HPSS archival storage.

Finally, some of the PDSF disk drives were also upgraded, adding 12 terabytes of disk capacity for a total of 135 TB.

DaVinci Expands Visualization and Analysis Capabilities

In response to requests from the NERSC community, the Center has expanded its interactive analysis capability by acquiring a new platform called DaVinci. DaVinci will be NERSC’s primary resource for general purpose, interactive data-intensive visualization and analysis work. It provides a large amount of high performance scratch storage, an expandable architecture that supports shared- or distributed-memory parallelization, a 64-bit operating system, and a large amount of memory.

At the end of 2004, DaVinci, an SGI Altix system, consisted of eight 1.4 GHz Itanium2 processors, 48 GB of RAM, 3 TB of attached storage, and a combination of bonded Gigabit Ethernet and 10 Gigabit network interfaces. Significant expansion of the system is expected during 2005, subject to the availability of funding.

DaVinci is well suited for interactive visualization and analysis tasks, which require a different system balance than NERSC’s primary computational resources — a balance biased towards larger memory sizes and higher I/O throughput per processor. The Altix’s single-system-image architecture makes all system resources, including I/O bandwidth and memory, available to any process in the system. When the analysis task is serial in implementation — as is the case with many commercial, legacy, and “quick and dirty” codes — more memory for the task translates directly into increased processing capability. The combination of a large, fast scratch file system and high internal bandwidth makes DaVinci perfect for data-intensive applications.

DaVinci is expected to go into full production in May 2005, when it will be used primarily for interactive, data-intensive visualization and analysis applications. During off-peak hours, it will be available for use by batch applications to help fulfill allocation requests. The interactive vs. batch use policy will be outlined when full production is announced.

DaVinci will be outfitted with the same visualization applications present on the older Escher visualization server, which will be decommissioned shortly after DaVinci goes into full production mode. The visualization tools include an array of libraries and applications for plotting and charting, imaging, command-line and GUI-based interactive development environments, and point-and-click applications.