
The organizers are pleased to announce that the Final Report is now available for download. Many thanks to all who attended and made this a very successful workshop.

The Department of Energy has identified the design, implementation, and usability of file systems and archives as key issues for current and future HPC systems. This workshop will address current best practices for the procurement, operation, and usability of file systems and archives. Furthermore, the workshop will address whether system challenges can be met by evolving current practices.

The organizers request a short position paper from each attendee to identify best practices in the area of file systems and archives. Each position paper should address one or more topics related to best practices for the session its author will be attending. The session organizers have identified the questions below as the key topics for each session, and suggest that your paper deal with one or more of them. Frame your discussion to identify what you think are best practices in use at your site related to the session topics. In addition, frame questions within your discussion to help elicit best practices from the other participants. This workshop will focus on breakout sessions in which participants present and evaluate their position papers. At the conclusion of the workshop, participants will reconvene to summarize their findings in a report.

 

Purpose: To identify and share best practices related to the operation of file systems and archives at HPC centers, and to report the findings to DOE and the community.

Attendees: This workshop is a forum for HPC center managers and key technical staff (by invitation only), with representatives from:

  • DOE Headquarters (ASCR, BER, NNSA)
  • DOE Labs (ANL, LANL, LBNL, LLNL, ORNL, PNNL, SNL)
  • NSF (TACC, NCAR, NCSA, SDSC)
  • DoD
  • HEC Committee
  • NASA Ames
  • LRZ
  • CEA/DAM
  • JAXA
  • CSCS
  • Barcelona SC
  • U of Tokyo
  • Juelich SC
  • NICS/UTK
  • NSF
  • PSC
  • AWE(UK)

Goals:

  • Foster a shared understanding of file system issues in the context of HPC centers
  • Identify top challenges and open issues
  • Share best practices and lessons learned
  • Establish communication paths for managerial and technical staff at multiple sites to continue discussion on these topics
  • Discuss roles and benefits of HPC stakeholders
  • Present findings to DOE and other stakeholders


 

Agenda

Day 1 (Monday, September 26)
7:30-8:30 Breakfast and registration
8:30-9:00 Welcome: Jason Hick, LBL, and Yukiko Sekine, DOE SC (Club Room)
9:00-9:30 First presentation, first speaker, speaker affiliation
9:30-10:00 Presentation, Second Speaker, speaker affiliation
10:00-10:30 Morning break
10:30-11:00 Instructions for breakout sessions
11:00-12:00 Breakout sessions:
  • Business of Storage Systems (Club Room)
  • Administration of Storage Systems (Foothill B)
  • Reliability and Availability of Storage Systems (Foothill D)
  • Usability of Storage Systems (Foothill E)

12:00-1:00 Lunch (Foothill G)
1:00-3:00 Breakout sessions continue
3:00-3:30 Afternoon break
3:30-4:00 Business Breakout: collection of thoughts/outbrief to entire group
4:00-4:30 Administration Breakout: collection of thoughts/outbrief to entire group
4:30-5:00 Reliability Breakout: collection of thoughts/outbrief to entire group
5:00-5:30 Usability Breakout: collection of thoughts/outbrief to entire group
Day 2 (Tuesday, September 27)
7:30-8:30 Breakfast
8:30-9:00 Checkpoint and directions to breakout leaders
9:00-10:00 Breakouts continue
10:00-10:30 Morning break
10:30-12:00 Breakouts continue
12:00-1:00 Lunch
1:00-2:00 Breakouts continue
2:00-2:30 Business Breakout: collection of thoughts/outbrief to entire group
2:30-3:00 Administration Breakout: collection of thoughts/outbrief to entire group
3:00-3:30 Afternoon break
3:30-4:00 Reliability Breakout: collection of thoughts/outbrief to entire group
4:00-4:30 Usability Breakout: collection of thoughts/outbrief to entire group
4:30-5:30 Plenary workshop summary and next steps (report)


Breakout Sessions

Position Papers

Here are the position papers received, with an indication of their target session. In some cases more than one session may apply. If you would like your paper targeted to more or different sessions, let us know.
File name, with an X for each target session (Administration, Business, Reliability, Usability)
Administration_CEA_DAM_Deniel.pdf X
Administration_JAXA_Fujita.pdf X
Administration_LANL_Torres.pdf X
Administration_LBNL_Cardo.pdf X
Administration_LBNL_Hurlbert.pdf X
Administration_LLNL_Heer.pdf X
Administration_NICS_Braby.pdf X
All_ANL_Harms.pdf X X
All_NAVY_Combs.pdf X X X X
All_ORNL_Oral.pdf X X X X
Business_CSCS_Ulmer.pdf X
Business_DOD_Kendall.pdf X X
Business_Julich_Graf.pdf X
Business_LBNL_Uselton.pdf X
Business_LLNL_Cups.pdf X
Business_PNNL_Cowley.pdf X
Reliability_DoD_Gebhardt.pdf X
Reliability_LANL_Jebbanema.pdf X
Reliability_LLNL_Gary.pdf X X
Reliability_PNNL_Felix.pdf X X
Usability_ANL_Vishwanath.pdf X
Usability_AWE_Roberts.pdf X X
Usability_LANL_Bent.pdf X
Usability_LANL_Roschke.pdf X
Usability_LLNL_Hedges.pdf X
Usability_NCAR_Gillman.pdf X
Usability_PNNL_Glass.pdf X
Usability_SNL_Klundt.pdf X
Usability_UTokyo_Ishikawa.pdf X
 

The business of storage systems:

Sarp Oral (ORNL) and David Cowley (PNNL)

Club Room

For each layer of the storage hierarchy - file systems, archives, others - address these topics (and add to them):

  1. HPC facilities across DOE deploy and operate systems at unprecedented scale requiring advanced file system technologies. Achieving the requisite level of performance and scalability from file systems remains a significant challenge.
    1. What are your practices used to plan for future system deployments and system evolution over time?
    2. How do you establish requirements such as Bandwidth, Capacity, Metadata Operations/sec, MTTI, etc. for these systems?
  2. It is not uncommon for archival storage deployments to have lifespans approaching multiple decades. Growth of archival data at a number of HPC sites is exponential.
    1. How do you effectively plan for exponential growth rates and archives that will need to serve multiple generations of machines throughout their life within a fixed budget profile?
    2. Are exponential growth rates sustainable? How do you mitigate if not?
  3. There are relatively few alternatives in parallel file system and archival storage system software that meet the requirements of major HPC facilities. Development of this software varies from proprietary closed source to collaborative open source solutions. Each model has benefits and drawbacks in terms of total cost of ownership, ability to evolve the software to meet specific requirements, and long term viability (risk) of the system.
    1. What model do you leverage for your file system or archival system software needs?
    2. What benefits/drawbacks do you see with these models?
  4. Storage system and tape archive technologies vary from high-end custom hardware developed specifically for the HPC environment to commodity storage platforms with extremely broad market saturation.
    1. Where do you leverage custom vs commodity storage hardware within your operational environment?
    2. Do you see opportunities to incorporate more commodity storage technologies within your environment in the future?
    3. What are the barriers to adopting commodity storage technologies and how can they be overcome in the future?
 

The administration of storage systems:

Susan Coghlan (ANL) and Jerry Shoopman (LLNL)

Foothill B

For each layer of the storage hierarchy - file systems, archives, others - address these topics (and add to them):

  1. Change control and configuration management.
    1. What specific configuration management tools, methods, or practices does your center use to validate hardware/software changes and releases to minimize production performance degradation or system downtime?
    2. Can you provide an example of how testing a change/release on a pre-production system provided unique insight into a configuration problem before it was detected by users?
  2. Ongoing system administration.
    1. What specific file system and/or archive metrics does your center measure and monitor on a regular basis, and how have those findings directed which types of operational tasks your center has automated to minimize frequency and impact of production incidents and optimize system performance and end-user experience?
    2. Can you provide an example where self-monitoring and detecting production incidents led to an investigation of root cause to reduce mean-time-to-resolution (MTTR) of future file system or archive outages?
  3. Technology refresh.
    1. What unique approach has your center taken to balance the end-user requirement to increase system availability while providing system architects the opportunity and access to the environment to satisfy the ongoing need to refresh underlying file system and archive components?
    2. How does your center expand capacity of either a file system or archive resource while minimizing user impact?
  4. Security management.
    1. What strategy is your center taking with, for example, OS patching or vulnerability scanning, to satisfy the ongoing and rigorous demands of computer security professionals?
    2. How does your strategy balance the growing need to provide highly-collaborative, distributed user communities with access to multiple levels of data sensitivity?
    3. Assuming your center provides a multi-zoned security architecture with various access control levels and technologies, how would you demonstrate to computer security that it is providing adequate protection and controls?

 

The reliability and availability of storage systems:

Mark Gary (LLNL) and Jim Rogers (ORNL)

Foothill D

For each layer of the storage hierarchy - file systems, archives, others - address these topics (and add to them):

  1. Resilient architectures and fault tolerance.
    1. What specific system architectural or configuration practices, decisions or changes (hardware and software) has your Center made that have demonstrably improved availability, reliability or performance of your file system or archival system?
    2. Where in the environment should redundant hardware be bought/deployed?
  2. Hardware and Software Maintenance.
    1. What is your philosophy or practice for executing system maintenance down times?
    2. How do those practices contribute to improved system availability and reliability outside of planned outages?
  3. Data integrity.
    1. What strategies do you use to ensure data integrity and what parts of the end-to-end compute/store/visualize/archive cycles does it cover?
    2. Do you employ end-to-end checksums within the end-to-end cycle and if so where?
  4. Off-hours Support and Availability.
    1. What mechanisms do you have in place to ensure reliable file system and archive operation during off-hours and in the event of a facility incident such as power loss, chilled water loss, or a fire alarm?
    2. What mechanisms are in place to quiesce storage and to protect it?
 

The usability of storage systems:

Shane Canon (LBL) and John Noe (SNL)

Foothill E

For each layer of the storage hierarchy - file systems, archives, others - address these topics (and add to them):

  1. What are your major usability issues?
  2. What applications/tools have you developed or obtained elsewhere and deployed that have made your storage system more effective and useful for end users?
    1. What tools or methods are available to users for I/O related problem diagnosis? Describe experiences where use of diagnostics has resulted in improved outcomes. Are the available diagnostic capabilities sufficiently robust? Scalable?
    2. Please help us categorize the applications/tools in use.
    3. Which tools are used by end users and which primarily by system administrators?
  3. Discuss recent challenges in providing I/O service to your user community, and what practices/strategies were used to meet them.
    1. What trends in user requirements resulted in the need to address the challenges?
    2. How well are current solutions meeting the demands, or where are they falling short?
    3. Where might experience at other sites be helpful to your challenges?
    4. Which of your practices outlined above would you suggest for a best practices list?
  4. Large data movement - with respect to internal file and storage systems, do sites dedicate specific resources to data movement internally?
    1. What (if any) direction is given to users for moving data around internally?
    2. What tools are available for helping users improve data transfer performance?
  5. At what organizational level are users managing data organization: per user, per code, per project, or some larger unit? Are there common approaches or best practices that have been identified and are being leveraged to aid these efforts?
  6. How does your site manage health monitoring of I/O services, and how is pertinent information transmitted to users? What feedback do users have on the content and timeliness of the information?
  7. What are your user training and documentation practices?

 

Workshop Contacts

Technical Contacts: Jason Hick, Andrew Uselton


 

Sponsored by the U.S. Department of Energy

Dan Hitchcock and Yukiko Sekine
The Office of Advanced Scientific Computing Research (ASCR), Office of Science, U.S. Department of Energy

Robert Meisner and Paul Henning
Advanced Simulation and Computing, National Nuclear Security Administration, U.S. Department of Energy


