Petascale Data Storage Institute
Success Stories
Our popular open source FSIO tools
Our released failure and trace data
Our new Parallel Log Structured File System (PLFS)
Overview
Petascale computing infrastructures for scientific discovery make petascale demands on information storage capacity, performance,
concurrency, reliability, availability, and manageability. The last decade
has shown that parallel file systems can barely keep pace with high performance computing along these dimensions; this
poses a critical challenge when petascale requirements are considered. This proposal describes a Petascale Data Storage
Institute that focuses on the data storage problems found in petascale scientific computing environments with special
attention to community issues such as interoperability, community buy-in, and shared tools. Leveraging its members'
experience with applications and their diverse file and storage systems expertise, the institute enables researchers
to collaborate extensively on developing requirements, standards, algorithms, and development and performance tools.
The resulting mechanisms for petascale storage, and the accompanying results, are made available to the petascale
computing community. The institute holds periodic workshops and develops educational materials on petascale data
storage for science.
Collaborators
The Petascale Data Storage Institute is a collaboration between researchers
at Carnegie Mellon University, National Energy
Research Scientific Computing Center, Pacific Northwest National Laboratory, Oak Ridge National Laboratory, Sandia National
Laboratories, Los Alamos National Laboratory, University of Michigan, and the University of California
Santa Cruz.
LANL PDSI Statement of Work Items
- Project 1: Petascale Data Storage Outreach
- Year 1-2: Participate in HPC I/O and file storage systems curriculum development at participating institute universities, including courses and degrees. This is done through the LANL Institutes program, which consists of:
- Year 1-5: Sponsor SciDAC and HEC FSIO R&D workshops to showcase, update, and coordinate HEC FSIO- and SciDAC-related research and development across industry and universities.
- Year 2-5: Participate in lecturing and university course delivery as part of the HPC I/O and file storage system curriculum. This is accomplished by LANL delivering lectures and projects for university courses in the area of HEC/SciDAC FSIO.
- Lectures
- Lecture delivered to Graduate Storage Class at UCSC Fall 2006
- Lecture delivered to Graduate Storage Class at CMU Spring 2007
- Lecture delivered to Undergraduate Parallel Computing Class at Colorado School of Mines Fall 2007
- Lecture to be delivered to Graduate Reliability Class at UC Berkeley Fall 2007
- Class Projects
- Statistical analysis of LANL Supercomputer Failure, Usage, and Event Operations Data, Colorado School of Mines, Spring 2007
- Parallel Searching using Multiple Google Desktops, Colorado School of Mines, Spring 2007
- Parallel Searching using Multiple Google Desktops, Colorado School of Mines, Spring 2008
- Future: Implementation of a Parallel File Tree Walker, Colorado School of Mines, Spring 2008
- Future: Implementation of a Parallel File Movement Utility, Colorado School of Mines, Spring 2008
- Project 2: Petascale Storage Application Performance Characterization
- Year 1-2: Provide parallel I/O traces of unclassified parallel applications. These traces are scheduled to be available Spring 2008; look for them at our code and data release site.
- Year 1-2: Provide parallel I/O traces of synthetic parallel benchmarks, as well as source for the benchmarks, to enable baselining for parallel I/O trace analysis and replay research. These traces are beginning to appear now, and this area should be well populated by SC07 (Winter 2007-2008); look for them at our code and data release site.
- Year 2-4: Provide unclassified and scrubbed parallel I/O traces of important LANL parallel applications. These traces are not scheduled to show up until late 2009.
- Year 2-4: Provide unclassified derived I/O kernels that faithfully reproduce I/O patterns of both important LANL and unclassified parallel science applications. These kernels are not scheduled to show up until late 2009.
- Year 2-5: Assist in validation of trace analysis, replay, and workload/system simulation by providing access to parallel computational resource and interfacing to real science applications. This will be done collaboratively with LANL, PDSI, and other University and Lab researchers that are working with the parallel traces in the 2009-2010 time frame.
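A first step in the trace analysis work described above is aggregating per-rank I/O behavior from trace records. The record format below (rank, operation, offset, bytes) is an assumption for illustration only; the released traces define their own formats.

```python
# Hypothetical sketch of summarizing a parallel I/O trace. Each record is
# assumed to be "rank op offset nbytes"; real LANL trace formats differ.
from collections import defaultdict

def summarize_trace(lines):
    """Aggregate per-rank byte counts for read and write operations."""
    totals = defaultdict(lambda: {"read": 0, "write": 0})
    for line in lines:
        rank, op, _offset, nbytes = line.split()
        if op in ("read", "write"):
            totals[int(rank)][op] += int(nbytes)
    return dict(totals)
```

Summaries like this give a quick sanity check that a replayed workload moves the same volume of data per rank as the original application did.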
- Project 3: Petascale Storage System Dependability Characterization
- Year 1-2: Collect up to a decade of supercomputer, high performance networking and I/O and file storage system reliability data including machine and environment configuration information, mean time to interrupt, mean time to repair (MTTI/MTTR) and failure cause data.
- See our data and code release site for downloads of 9 years of failure data, millions of usage records, event records, and placement data. The site is updated as we add more data; check back about every quarter.
- Year 1-2: Collect up to a decade of supercomputer, high performance networking, and I/O and file storage system usage data, including job length, size, processor usage, and other usage profile data. Some disk failure data is already available at our data and code release site, and much more will be released by Winter 2007-2008.
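The MTTI/MTTR metrics named above can be computed directly from interrupt timestamps and repair durations. This is a minimal sketch assuming a simple (fail_time, restore_time) event list, not the actual LANL record layout:

```python
def mtti_mttr(events):
    """Compute mean time to interrupt and mean time to repair.

    events: list of (fail_time, restore_time) pairs in hours, sorted by
    fail_time. MTTI is the mean gap between successive interrupts; MTTR
    is the mean duration of the repairs themselves.
    """
    repairs = [restore - fail for fail, restore in events]
    gaps = [events[i + 1][0] - events[i][0] for i in range(len(events) - 1)]
    mtti = sum(gaps) / len(gaps) if gaps else float("nan")
    mttr = sum(repairs) / len(repairs)
    return mtti, mttr
```

In practice the published data distinguishes failure causes and machine configurations, so these means would be computed per system and per cause category.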
- Project 4: Validation of Emerging HPC Storage Standards and APIs
- Year 1-5: Assist in the validation of emerging HPC storage related standards and APIs such as parallel Network File System (pNFS), iSCSI Enhanced
RDMA (iSER), OSD/Active storage, and enhanced POSIX I/O.
- See our POSIX HECEWG I/O API Extensions document work at POSIX High End Extensions Working Group I/O API Documents. Work in this area is proceeding, slowly, through the standards process.
- For more on the POSIX HECEWG Extensions project see POSIX extensions project
- See this site for an overview of pNFS. LANL will be testing pNFS from Summer 2008 through Summer 2009 for parallel NFS access via IP and InfiniBand.
- See the University of Michigan Center for Information Technology Integration (CITI) ASC program site and their related pNFS site for development information.
- Project 5: Exploration of Novel Mechanisms for Emerging Petascale Science Requirements
- Year 2-5: Assist in research to define an API through which scientific applications can attach application-specific metadata to files in the parallel file system and later query that extended metadata. Work has begun with universities on projects in this area.
- A project began in the Summer of 2007 with Milo Polte, a graduate student at CMU, on a multi-dimensional file system capability built on the PVFS file system; more information to come.
- A project will begin in Winter 2007-2008 with UCSC involving their Linking File System (LiFS) and access methods for files; more information to come.
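The kind of API Project 5 explores might look roughly like the following: applications attach key/value metadata to files and later query across them. A real implementation would live inside the parallel file system (e.g. in PVFS or LiFS); this in-memory index is a hypothetical sketch of the interface shape only.

```python
# Hypothetical sketch of an application-metadata API for a parallel file
# system: set_meta/get_meta attach and read per-file key/value pairs, and
# query() searches across files. The dict-backed store stands in for the
# file system's own metadata service.
class MetadataStore:
    def __init__(self):
        self._meta = {}  # path -> {key: value}

    def set_meta(self, path, key, value):
        self._meta.setdefault(path, {})[key] = value

    def get_meta(self, path, key):
        return self._meta.get(path, {}).get(key)

    def query(self, key, predicate):
        """Return paths whose metadata value for key satisfies predicate."""
        return [p for p, kv in self._meta.items()
                if key in kv and predicate(kv[key])]
```

For example, a simulation could tag each checkpoint file with its timestep and later ask the file system for all checkpoints past a given timestep, instead of encoding that information in file names.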
- Project 6: Exploration of Automation for Petascale Storage System Management
- Year 2-5: Participate in the creation of autonomic system designs and management visualization tools for easily finding and viewing exception data in the massive amount of operational data generated by petascale file storage systems. This work will begin in 2008: LANL will begin collecting monitoring data on nodes, switches, storage, etc., and this data will be provided as input to the autonomic design research. Later, LANL will review autonomic designs that arise from the data and collaboration.
- Year 2-5: Assist in validation of autonomic system designs, management visualization tools, and at-scale failure and usage analysis by providing access to parallel computational resources and interfacing with real production computation, networking, and storage systems management personnel. LANL will assist in validating autonomic designs, probably beginning in 2009.
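Finding exception data in a flood of operational records, as Project 6 describes, can be illustrated with a simple statistical outlier filter: flag components whose event counts sit well above the fleet mean. Real autonomic designs would be far more sophisticated; this is only a sketch of the idea.

```python
# Hypothetical sketch of exception finding in operational data: flag
# components whose event counts exceed the fleet mean by more than k
# population standard deviations.
import statistics

def find_exceptions(event_counts, k=2.0):
    """event_counts: dict mapping component name -> event count.

    Returns the components whose counts are statistical outliers."""
    counts = list(event_counts.values())
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return []  # all components behave identically; nothing to flag
    return [c for c, n in event_counts.items() if n > mean + k * stdev]
```

Applied to per-node fault counters from the monitoring data LANL plans to collect, a filter like this would surface the handful of nodes, switches, or disks that deserve an operator's attention.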