Parallel Log Structured File System

LANL Technical Publication Overview - full document LANL LA-UR-08-07314


Initial Goals for the PLFS


The N processes to N files method is excellent for file system bandwidth, but it stresses the file system's metadata handling capability and leaves users with a very large number of files to manage. The N processes to 1 file method for checkpointing HPC applications has many advantages for the user; unfortunately, it has disadvantages for the file system. The major disadvantage of the N to 1 method is the difficulty that concurrent writing of data into a single file presents to modern global parallel file systems. All parallel file systems have a difficult time with this I/O pattern, and there has been much work on this problem in the last decade. The PLFS work is an attempt to take the best properties of N to N I/O and the best properties of N to 1 I/O and make them available, completely transparently (requiring no modifications, re-linking, etc.), to applications that use the N to 1 dump method. The hope is that this work will immensely improve the performance of N to 1 dumps and make it possible for N to 1 small I/O patterns to scale extremely well, again with no application modifications. Additionally, it is hoped that by making PLFS available to the HEC File Systems and I/O research community, this tool could be used for further research to explore new frontiers of thought and to improve the tool or replace it with something far better.
To achieve the above goals, we will exploit the fact that checkpointing is a write-mostly workload: optimization will be done for N to 1 small strided writes, and read convenience and speed may need to be traded for write performance. Additionally, we may have to trade off some peak single-node write performance to enable scalable writes across multiple machines.

Initial PLFS Design


To achieve the goals stated above, FUSE was chosen as the technology that underlies the PLFS.
The N to N checkpointing method benefits from the fact that each process has its own file. This file-per-process scheme is exploited for several benefits: high concurrency with little to no locking or consistency issues in the parallel process setting, and aggressive caching because there are no consistency issues. The high concurrency enables extremely good scaling of write performance, and the aggressive caching enables latency hiding of application I/O. A further benefit is that most N to N applications write their own files serially, or nearly so, which further assists the aggressive caching and minimizes file system seeking. Of course, the drawback to this N to N pattern is the large number of files created, which makes the metadata workload high and makes it difficult to manage the files over time.
The concept behind PLFS is to turn N to 1 strided write operations into serial write operations coming from only one process, or possibly from all the processes on one host/compute node or aggregator node. The example cartoon below shows how each process writes its various arrays of data serially to a file per process while at the same time maintaining an index of where each write would belong if it were writing to a single file. Currently the index files are written serially as well, with the offset and length values required to map the serial file-per-process data to the logical virtual N to 1 strided file. One serial data file and one index file are created per process. This is just one example of how the PLFS concept could be implemented.
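To make the write path concrete, a minimal sketch follows, assuming an illustrative index record of offset and length fields (the record layout, field names, and the logged_write helper are inventions for illustration, not the prototype's actual on-disk format). Each write appends data serially to the per-process data file and appends a matching mapping record to the per-process index file:

    #include <stdint.h>
    #include <unistd.h>

    /* Illustrative index record: one per write() issued by a process.
     * Maps a contiguous run of bytes in the per-process data file to its
     * position in the logical N to 1 strided file. */
    struct index_entry {
        uint64_t logical_offset;  /* offset in the virtual N to 1 file  */
        uint64_t physical_offset; /* offset in this process's data file */
        uint64_t length;          /* number of bytes in this write      */
    };

    /* Append the data serially to the per-process data file and log the
     * mapping to the per-process index file. data_fd and index_fd are
     * assumed to be open file descriptors owned by this process only. */
    static ssize_t logged_write(int data_fd, int index_fd,
                                const void *buf, size_t len,
                                uint64_t logical_offset)
    {
        struct index_entry ent;
        ent.physical_offset = (uint64_t)lseek(data_fd, 0, SEEK_END);
        ent.logical_offset  = logical_offset;
        ent.length          = (uint64_t)len;

        ssize_t n = write(data_fd, buf, len);   /* serial append of the data  */
        if (n > 0)
            write(index_fd, &ent, sizeof(ent)); /* serial append of the index */
        return n;
    }

Because each process owns its data and index files exclusively, no locking or consistency traffic is needed on this path, which is exactly the N to N property being recovered.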
Notice in the diagram below that the process data/index files are grouped into a directory per host. This puts some structure over the large number of files that will be created. Notice also that each directory holding a subset of process data/index files is placed on a different metadata server; this enables scaling of metadata operations within this single logical file. This is, again, just one example of a way to overcome the drawbacks of the N to N workload that converting the N to 1 method into a serial N to N method creates.
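A hedged sketch of how such a layout might be resolved in code follows; the naming convention shown here (data.<pid> and index.<pid> files inside a hostname subdirectory of the container directory) is hypothetical and only illustrates the grouping described above:

    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical naming: <container>/<hostname>/data.<pid> and
     * <container>/<hostname>/index.<pid>. Grouping by host keeps the large
     * number of per-process files organized, and each host subdirectory can
     * be placed on a different metadata server. */
    static void backing_paths(const char *container,
                              char *data_path, char *index_path, size_t n)
    {
        char host[256];
        gethostname(host, sizeof(host));
        snprintf(data_path,  n, "%s/%s/data.%d",  container, host, (int)getpid());
        snprintf(index_path, n, "%s/%s/index.%d", container, host, (int)getpid());
    }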
Additionally, notice the model file. This file is a surrogate for the virtual file that is represented by this directory structure, and it is where extended attributes can be stored for the logical/virtual N to 1 strided file. Also notice that the top-level directory of this virtual N to 1 file is an SUID directory, that is, a directory with the POSIX SUID bit turned on. SUID is currently undefined for directories, so this is the method used to tell the difference between a real file, a directory, and a virtual N to 1 file. That SUID directory entry represents the logical N to 1 file in the namespace of the real file system. Notice that the highest-level directory can contain many entries, and each entry can be a normal file, a directory, or an SUID directory carrying the name by which the user knows the logical file.
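Since POSIX gives the SUID bit no meaning on directories, the marker can be checked with an ordinary stat(); a minimal sketch (error handling reduced to a simple failure return):

    #include <sys/stat.h>

    /* Returns nonzero if 'path' is a container directory representing a
     * logical N to 1 file: a directory with the (otherwise unused) SUID
     * bit set, as opposed to a plain file or a normal directory. */
    static int is_plfs_container(const char *path)
    {
        struct stat st;
        if (stat(path, &st) != 0)
            return 0;
        return S_ISDIR(st.st_mode) && (st.st_mode & S_ISUID);
    }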

Current Prototype Implementation

There is currently a prototype implementation of this PLFS that we call FUSEPLFS. We use the FUSE capability to hide the SUID directory from the user; it appears to the user as a regular file. Write operations on the logical file occur as described in the design section: each process writes serially to a file per process, and each process also writes to an index file. The current prototype implements a subdirectory per host just as described above. Metadata operations like getattr, which fulfills stat() calls, use the index files to determine the length of the file. Other metadata operations like chown(), chmod(), utime(), etc. use the model file to represent the logical file the user sees. An experimental read method is being developed that uses the index files to look up where read requests should be directed.
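For example, the logical file length reported by getattr can be derived by scanning every index file in the container and taking the furthest logical extent written. A hedged sketch, reusing the illustrative index_entry record from the design section above:

    #include <stdint.h>
    #include <stdio.h>

    /* Scan one per-process index file and fold its entries into the running
     * logical file size: the size is the largest logical_offset + length
     * seen across every index file in the container. */
    static uint64_t fold_index_size(const char *index_path, uint64_t size)
    {
        FILE *f = fopen(index_path, "rb");
        if (f == NULL)
            return size;
        struct index_entry ent;
        while (fread(&ent, sizeof(ent), 1, f) == 1) {
            uint64_t end = ent.logical_offset + ent.length;
            if (end > size)
                size = end;
        }
        fclose(f);
        return size;
    }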
Currently the prototype will not tolerate concurrent write and read operations on a single logical file. It also will not handle append operations. Additionally, the current prototype requires that the storage under the FUSEPLFS file system be at least a global file system.
The current PLFS design implies that users make some promises in order to use the facility. The primary promise is that they will never write the same range of the logical file from more than one process. This is little different from other optimizations for concurrent writing to a single file that have been done in middleware and in other file systems.
There are plans to release this PLFS software into the open source community. Currently the code can be given out selectively with appropriate paperwork until the Department of Energy and the Los Alamos National Laboratory approve public release. We are hoping that will be very soon.


Possible Future Work

As was stated earlier, PLFS is a prototype currently. There are certainly many avenues of R&D that can be followed using the PLFS.
The preliminary nature of the results leaves much room for improvement; adding more file systems, cluster systems, scale, and parameter sweeps to the results could further motivate investment in the PLFS. It would also be worth trying more real applications, especially those that use popular but problematic N to 1 small strided methods such as HDF, NetCDF, and MPI-IO collectives.
Currently, the PLFS uses the FUSE direct-io mode to allow for large writes. In the near future, the FUSE package will allow large writes without using the direct-io mode. This may open up the ability for the PLFS to perform even better for smaller strided writes.
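For reference, a per-file version of that mode can be requested from a FUSE open handler; the sketch below only illustrates the direct-io knob and is not the actual FUSEPLFS handler, which does considerably more:

    #define FUSE_USE_VERSION 26
    #include <fuse.h>

    /* Sketch of a FUSE open handler that requests direct I/O for the file,
     * bypassing the kernel page cache so that write sizes are not broken
     * into page-sized chunks on their way to the file system. */
    static int plfs_open(const char *path, struct fuse_file_info *fi)
    {
        fi->direct_io = 1;   /* per-file equivalent of mounting with -o direct_io */
        /* ... open the backing container for 'path' here ... */
        return 0;
    }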
Of course, given the prototype nature of the PLFS, there is much room for hardening the code as well as improving the very experimental read method. Support for concurrent read/write operations and for append could also be added. Additionally, it is quite possible that one data file per host and one index file per host might perform reasonably well and produce even fewer files. Various PLFS layouts could be explored with various workloads.
It is unclear if reconstruction of a real N to 1 strided file in an offline fashion before the file is needed for read is necessary or desired. For some read patterns, the indexed PLFS format might serve read requests adequately. A study as to the appropriateness of rebuilding the real N to 1 strided file, versus building an efficient single index from the distributed index, versus just leaving the file in PLFS format would certainly need to be done to determine how to best handle differing read workloads against PLFS files.
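The middle option, building a single efficient index from the distributed indexes, could amount to little more than concatenating the per-process index records and sorting them by logical offset, after which a read at any offset resolves with a binary search. A hedged sketch, again reusing the illustrative index_entry record and relying on the user promise that written ranges never overlap:

    #include <stdint.h>
    #include <stdlib.h>

    /* Comparison for sorting merged index records by logical offset. */
    static int cmp_logical(const void *a, const void *b)
    {
        const struct index_entry *x = a, *y = b;
        if (x->logical_offset < y->logical_offset) return -1;
        if (x->logical_offset > y->logical_offset) return  1;
        return 0;
    }

    /* Collapse index records gathered from every per-process index file
     * into one array ordered by logical offset. */
    static void build_flat_index(struct index_entry *entries, size_t count)
    {
        qsort(entries, count, sizeof(entries[0]), cmp_logical);
    }

    /* Find the record (if any) whose extent covers 'offset' in the sorted
     * index; returns NULL for a hole in the logical file. */
    static const struct index_entry *
    lookup(const struct index_entry *idx, size_t count, uint64_t offset)
    {
        size_t lo = 0, hi = count;
        while (lo < hi) {                 /* binary search for the last   */
            size_t mid = (lo + hi) / 2;   /* entry starting at or before  */
            if (idx[mid].logical_offset <= offset)   /* 'offset'          */
                lo = mid + 1;
            else
                hi = mid;
        }
        if (lo == 0)
            return NULL;
        const struct index_entry *e = &idx[lo - 1];
        return (offset < e->logical_offset + e->length) ? e : NULL;
    }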
The indexing scheme used by the prototype PLFS is extremely simplistic. Studies of in-memory indexing schemes during writing, offline reconstruction of indexes, honoring read requests during a write operation, index types that take advantage of popular N to 1 strided patterns, different index distribution schemes, and so on are all excellent follow-on work. Additionally, the PLFS concept could be merged with ideas like Gigaplus to help N to N operations scale.
The entire PLFS concept starts to head HPC file system storage towards file formats in the file system. It is quite possible that other file types besides N to 1 strided might be served well by similar thinking to PLFS. Decades ago, file types and access methods were used and were supported within a single file system. The IBM MVS storage systems allowed for many different file types, partitioned data sets, indexed sequential, virtual sequential, and sequential to name a few. Storage for modern HPC systems may benefit from a new parallel/scalable version of file types. There is much research to be done in this area to determine the usefulness of this concept and how such a thing would work with modern supercomputers and future HPC languages and operating environments.


Email Contacts: Gary Grider, James Nunez, John Bent