Purpose: To identify best practices related to power management at HPC centers.

Audience: HPC center managers and key staff responsible for HPC facilities and system software.

Power management has been identified as a key issue for future systems. This workshop will address current practices and issues related to controlling and reducing the power required by facilities and systems. An important question is whether the power challenges can be met by evolving current practices, facilities, and systems, or whether major new efforts must be undertaken now to prepare for the systems expected later in the decade.

This workshop is intended to facilitate collaborative progress on questions such as:

  • Planning and monitoring the various power aspects of HPC facilities.
  • Metrics we (should) collect to improve our understanding.
  • Power-aware RAS activities.
  • Feasibility of powering down ("sleeping") some system components.
  • System software features needed to enable power conservation.
  • Hardware features to expose.
  • Improvements in power distribution and cooling configurations.
  • Power-aware system-wide scheduling techniques and incentives.

This workshop will not address topics such as:

  • Research in "green" hardware systems.
  • Chip design.
  • Power-aware application libraries (e.g., for math or visualization).
  • Research in power-aware compilers.

This event, the fourth in a series of best-practices workshops, brings together HPC center staff and managers to address issues related to the operation of and planning for HPC systems.

Goals:

  • Foster a shared understanding of power management issues in the context of HPC centers.
  • Identify top challenges and open issues.
  • Share best practices and lessons learned.
  • Establish communication paths for managerial and technical staff at multiple sites to continue discussion on these topics.
  • Discuss roles and benefits of HPC center stakeholders.
  • Present findings to DOE and other stakeholders.


 

Agenda

Day 1 (Tuesday, September 28)
7:30–8:15 Breakfast and registration
8:15–8:30 Welcome: Kim Cupps, LLNL, and Yukiko Sekine, DOE SC
8:30–9:00 The Exascale Initiative, Mark Seager, LLNL
9:00–10:15 Overview of planning and activities
     NNSA Facility Planning, Sander Lee, DOE NNSA
     Office of Science Facility Planning, Dan Hitchcock, DOE SC
     Facilities and Plan for the Japanese Earth Simulator II, Ken'ichi Itakura, JAMSTEC
     European Activities, Ladina Gilly, CSCS
     Energy Efficient HPC Working Group Activities, Natalie Bates
     Update on Green 500 Activity, Erich Strohmaier, LBNL
10:15–10:20 Instructions for breakout sessions
10:20–10:45 Break
10:45–12:15 Day 1 breakouts (see complete descriptions below)
     1a: Facilities—Power distribution and cooling configurations from facility to racks
     1b: Facility Metrics—Metering and monitoring the computer center
     1c: Power-aware OS features and scheduling
     1d: Leveraging and encouraging power and cooling innovations in the commodity ecosystem
12:15–1:15 Lunch
1:15–2:45 Day 1 breakouts (cont.)
2:45–3:15 Break
3:15–3:30 Report from the Third Best Practices Workshop, David Skinner, LBNL
3:30–5:30 Day 1 breakout reports and discussion
5:30–6:30 Break before dinner
6:30 Working dinner
     The Challenge of the Barcelona HPC Facility, Sergi Girona, BSC
Day 2 (Wednesday, September 29)
7:30–8:15 Breakfast
8:15–9:30 Plenary panel: Unique Cooling Solutions for Dense HPC Systems (chair: Jim Rogers, ORNL)
     Mike Ellsworth (IBM), Unique Cooling Solutions for Dense HPC Systems
     Alan Goodrum (HP), Leveraging the Commercial Market to Power the Exascale Data Center
     Doug Kelley (Cray), HPC Best Practices: Power and Cooling Solutions
     John Lee (Appro), Unique Cooling Solutions for Dense HPC Systems
     Tim McCann (SGI), Unique Cooling Solutions for Dense HPC Systems

9:30–12:30 Day 2 breakouts (see complete descriptions below; brief break around 10:30)
     2a: Power-related facility and equipment standards, ratings, and certifications
     2b: Alternative energy solutions
     2c: Power-aware system monitoring
     2d: Integrated (power-related) facility planning for system and network upgrades
12:30–1:30 Lunch
1:30–3:30 Day 2 breakout reports and discussion
3:30–3:45 Break
3:45–4:45 Plenary workshop summary and next steps


Breakout Sessions

Day 1 Breakouts

1a: Facilities—Power distribution and cooling configurations from facility to racks. This breakout session will focus on the two main problems faced when operating an energy-efficient data center: delivering power and providing adequate cooling. The objective of this session is to gather participant experiences: what has been done in the past, where we are today, and what we see for the future. Topics relating to power delivery and distribution will include voltage types (AC, DC), methods of power delivery to racks and equipment (e.g., overhead vs. underfloor), and conditioned vs. non-conditioned power. Subtopics will include experience with different PDUs (is a higher-efficiency transformer that reduces power losses worth the higher price?), direct (water- or refrigerant-based) vs. indirect (air) cooling, and the use of hot- or cold-aisle containment (are tower-water and air-side economizers effective, and is there a per-rack kW limit beyond which they no longer are?). This session will also touch on the risks and rewards associated with the costs of converting data centers from their current state.
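As an illustration of the transformer trade-off raised above, the comparison reduces to simple payback arithmetic. The Python sketch below uses purely hypothetical numbers (load, efficiencies, prices), not vendor data:

    # Illustrative payback estimate for a higher-efficiency PDU transformer.
    # All figures are hypothetical, not vendor data.
    load_kw = 300.0         # average IT load through the transformer
    eff_standard = 0.97     # efficiency of a standard transformer
    eff_premium = 0.99      # efficiency of a premium transformer
    extra_cost = 15000.0    # added purchase price of the premium unit ($)
    energy_cost = 0.08      # electricity cost ($/kWh)
    hours_per_year = 8760

    # Losses are the input power drawn beyond what is delivered to the load.
    loss_standard_kw = load_kw / eff_standard - load_kw
    loss_premium_kw = load_kw / eff_premium - load_kw
    savings_per_year = (loss_standard_kw - loss_premium_kw) * hours_per_year * energy_cost

    payback_years = extra_cost / savings_per_year
    print(f"Annual savings: ${savings_per_year:,.0f}; payback: {payback_years:.1f} years")

Under these assumptions the premium unit saves roughly $4,400 per year and pays for itself in about 3.4 years; the answer is sensitive to load and electricity price.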

1b: Facility Metrics—Metering and monitoring the computer center. This breakout session will focus on the information that is being collected in the data center to improve its effectiveness and efficiency. Power Usage Effectiveness (PUE) has become a buzzword in the industry, and although it is simple in concept, it can be difficult to measure accurately. Total facility power is usually a straightforward measurement, but total computing equipment power can be much more difficult to determine accurately. Participants in this session will discuss new and interesting approaches they are employing or developing at their sites, including their experience with various commercial products. The discussion will include the participants' experience with air-side and water-side economizers as well as temperature set points and humidity controls. Instrumentation and graphical displays will be of considerable interest, and the cost trade-offs associated with improving PUE will also be considered. The discussion will include how this technology will facilitate the integration of higher-density racks into the computing center and how real-time data compare to thermodynamic predictive models.
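For reference, PUE is defined as total facility power divided by IT equipment power. A minimal Python sketch of the calculation, with illustrative numbers only:

    # PUE = total facility power / IT equipment power (Green Grid definition).
    # The numbers below are illustrative, not measurements from any site.
    it_equipment_kw = 2000.0     # compute, storage, and network equipment
    cooling_kw = 700.0           # chillers, CRAC/CRAH units, pumps, towers
    power_losses_kw = 150.0      # UPS, PDU, and transformer losses
    lighting_misc_kw = 50.0      # lighting and other building loads

    total_facility_kw = it_equipment_kw + cooling_kw + power_losses_kw + lighting_misc_kw
    pue = total_facility_kw / it_equipment_kw
    print(f"PUE = {pue:.2f}")    # 2900 / 2000 = 1.45

As the session description notes, the hard part in practice is not the division but accurately metering the two quantities.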

1c: Power-aware operating system features and scheduling. This breakout session will focus on both hardware and software issues related to achieving power efficiency. Example issues include, but are not limited to: Advanced Power Management (APM) features available on current and future architectures (frequency scaling, sleep/low-power states, dynamic voltage transitions); available OS interfaces to APM features; OS techniques to leverage APM features (independent of applications); OS interfaces exposed to enable higher-level exploitation of APM features; OS abstraction of underlying APM features; which features, if any, to expose directly to the application; power/performance trade-offs; and power-aware scheduling, along with its benefits and impacts. These issues are largely interdependent and must be considered from the system perspective. In addition, the power efficiency issues and techniques necessary for HPC-class platforms likely differ greatly from commodity approaches developed for PC and enterprise-class platforms. Our goal will be to identify obstacles and opportunities specific to HPC in this emerging area.
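As one concrete example of an OS interface to APM features, Linux exposes per-core frequency scaling through the cpufreq sysfs tree. The sketch below reads and sets that state; the paths are the standard Linux cpufreq locations, but availability varies by kernel and platform, and changing the governor requires root:

    # Read and set per-core DVFS state via the Linux cpufreq sysfs interface.
    from pathlib import Path

    def cpufreq_state(cpu: int) -> dict:
        """Return the current governor and frequency state for one core."""
        base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
        return {
            "governor": (base / "scaling_governor").read_text().strip(),
            "cur_khz": int((base / "scaling_cur_freq").read_text()),
            "min_khz": int((base / "scaling_min_freq").read_text()),
            "max_khz": int((base / "scaling_max_freq").read_text()),
        }

    def set_governor(cpu: int, governor: str) -> None:
        """Switch a core's governor, e.g. to "powersave"; requires root."""
        base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
        (base / "scaling_governor").write_text(governor)

    if __name__ == "__main__":
        print(cpufreq_state(0))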

1d: Leveraging and encouraging power and cooling innovations in the commodity ecosystem. This breakout session will review current trends in power and cooling innovations in the commodity ecosystem and discuss their impacts on DOE HPC facilities planning. Because many ISPs, co-location facilities, and IT operations are becoming power- and cooling-limited, the commodity ecosystem is responding with innovative approaches to rack-level power and cooling that lower overall rack power consumption and lower the PUE. However, many of these approaches have facilities impacts. The most dramatic example is Microsoft's new Chicago facility, where computers and their power/cooling infrastructure are co-designed and delivered in a shipping container.

Day 2 Breakouts

2a: Power-related facility and equipment standards, ratings, and certifications. This breakout session will consider the impact of a number of standards, rating programs, training offerings, and federal requirements (Energy Star for servers, storage, UPS, and data centers; the DOE Data Center Energy Practitioner program; DOE Save Energy Now assessment tools; DOE-ASHRAE Data Center Energy Efficiency Awareness Training; ASHRAE Standards 90.1 and 127; LEED™ criteria for data centers; Green Grid resources; federal requirements such as Executive Orders and the OMB consolidation request; and carbon measurement tools) and project their impact on facilities housing the next generation of systems. For example, energy efficiency standards and federal mandates are becoming more aggressive while power and cooling requirements continue to grow. This could be considered a barrier, or an opportunity for a paradigm shift that could radically alter the way systems are designed and deployed.

2b: Alternative energy solutions. This breakout session will be part presentation, part open discussion, and part brainstorming, focusing on maximizing efficiency while minimizing the environmental impact of HPC centers. How can HPC centers reduce cost and environmental impact by making creative use of local natural resources? Energy efficiency inside the data center is only part of the story. In keeping with the principle of reduce, reuse, recycle, we should be able to take advantage of local resources to increase efficiency at new or existing locations. Are there creative ways to reduce PUE below 1? Is a more meaningful way needed to express and measure the environmental effects of operating HPC centers? We will explore approaches such as sustainable energy sources, use of ambient external air or water temperatures, and reuse of "waste" heat.
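One way to make the "PUE below 1" question concrete is the Green Grid's Energy Reuse Effectiveness (ERE) metric, which credits exported waste heat and, unlike PUE, can legitimately fall below 1.0. A small Python calculation with hypothetical annual figures:

    # ERE = (total facility energy - reused energy) / IT energy (Green Grid).
    # Unlike PUE, ERE can drop below 1.0 when waste heat is exported.
    # Hypothetical annual figures, for illustration only.
    it_mwh = 17500.0        # IT equipment energy per year
    total_mwh = 24500.0     # total facility energy per year
    reused_mwh = 8000.0     # waste heat exported, e.g., to district heating

    pue = total_mwh / it_mwh                   # 1.40; cannot go below 1 by definition
    ere = (total_mwh - reused_mwh) / it_mwh    # 0.94 once reuse is credited
    print(f"PUE = {pue:.2f}, ERE = {ere:.2f}")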

2c: Power-aware system monitoring. This breakout session will consider how system monitoring can provide useful data for making improvements and managing utilization as data centers become more power- and cooling-constrained. We will discuss what data are available and useful, how sites are managing the high volume of data, what data correlation sites are doing today, and potentially useful correlations, along with the challenges that exist in this area. One might correlate environmental data, such as power draw and rack temperatures, with the science applications running on the system. For example, a site might develop an application "power score" that could be used to schedule higher-scoring applications during lower-cost power periods. In addition, with the sophisticated RAS systems available on large HPC systems, it is possible to correlate error data (rates, types) with environmental data and the science applications. Finally, tying this system monitoring data together with facility data could bring additional insights and management techniques.
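As a sketch of the "power score" idea, the metric definition and threshold below are hypothetical, intended only to show how such a score might feed a scheduling decision:

    # Hypothetical application "power score": mean measured draw relative to
    # an idle baseline, used as a hint for scheduling into cheap-power periods.
    from statistics import mean

    def power_score(samples_watts: list[float], idle_watts: float) -> float:
        """Score > 1.0 means the job draws more than the idle baseline."""
        return mean(samples_watts) / idle_watts

    def preferred_window(score: float, threshold: float = 1.5) -> str:
        # Steer power-hungry jobs toward lower-cost (off-peak) power periods.
        return "off-peak" if score >= threshold else "any"

    # Example: rack-level samples for one job against a 40 kW idle baseline.
    samples = [65.0e3, 72.0e3, 68.0e3]   # watts
    score = power_score(samples, idle_watts=40.0e3)
    print(f"{score:.2f} -> {preferred_window(score)}")   # 1.71 -> off-peak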

2d: Integrated (power-related) facility planning for system and network upgrades. This breakout session will focus on the integrated facility planning required to meet the demands of future systems. With increasing demands for power and cooling, upgrade solutions span the entire spectrum of the facility, from system layout to the infrastructure improvements and upgrades that must be carried out in active HPC environments. Participants in this session will discuss new and interesting approaches that they are employing or developing at their sites, including their experience with various solutions in power distribution, free cooling, liquid cooling, networking, and environmental monitoring, all while maintaining the flexibility and expandability of their current sites. The discussion will also include the participants' experience with integrated approaches to meeting future demands while still serving the current demands of existing active systems.


 

Workshop Contacts

Technical Contacts: Kim Cupps, Mary Zosel

Administrative Contacts: Lori McDowell, Valorie McFann


 

Sponsored by the U.S. Department of Energy

Dan Hitchcock and Yukiko Sekine
Facilities Division, Office of Advanced Scientific Computing Research (ASCR), Office of Science, U.S. Department of Energy

Robert Meisner and Sander Lee
Advanced Simulation and Computing, National Nuclear Security Administration, U.S. Department of Energy


