Analytics: Scientific Data Management

Quick Link: NERSC Tools for Scientific Data Management

This site provides a brief overview of data management technologies commonly used in various scientific communities. The technologies are categorized according to the following areas: scientific data formats; database systems and metadata storage; metadata access; efficient indexing and querying; file transfer; and remote access and distributed data management.
Each technology is briefly summarized and references to the original software or development sites are given. Links to documentation and reference information for software that is available on NERSC platforms, or that is under development and may be deployed in the future, are available on a separate page.

Scientific Data Formats

Scientific data are often based on numerical arrays that are stored in scientific data formats such as NetCDF (Network Common Data Form), HDF5 (Hierarchical Data Format 5), and FITS (Flexible Image Transport System). These formats support efficient mechanisms for accessing and manipulating arrays and allow machine-independent data exchange. They also provide mechanisms for storing metadata such as simulation parameters. HDF5 supports parallel I/O, which allows multiple processors to read and write a file; this is particularly important for large-scale, data-intensive applications. NetCDF itself supports only serial I/O; parallel I/O is provided by Parallel NetCDF.

Currently there is no common scientific data format used across different scientific disciplines. For instance, NetCDF is commonly used in the climate modeling community, FITS is the preferred format for storing astronomy data, and HDF5 is used, for instance, in the combustion and fusion simulation communities. The Visualization Group/Analytics Team is currently working with the Paul Scherrer Institut (PSI) in Switzerland on H5Part, a storage model and API based on HDF5 that simplifies data exchange within the accelerator modeling community. Another standardization effort for HDF5-based data exchange is Fiberbundles.

Database Systems and Metadata Storage

The two most commonly used open-source database systems are PostgreSQL and MySQL. Both are based on so-called relational technology, in which data are stored in tables and accessed via a standardized query language called SQL.
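The HDF5 pattern described above, a numerical array and its descriptive metadata stored together in one machine-independent file, can be sketched with the h5py Python binding. The binding, file name, and attribute names below are illustrative assumptions, not something the page itself prescribes.

```python
# Sketch: storing a numerical array plus metadata in an HDF5 file via h5py.
# h5py and all names here (run.h5, temperature, num_processors) are
# illustrative assumptions for this overview.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "run.h5")

# Write: a 2-D temperature field, with simulation parameters as attributes.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("temperature", data=np.random.rand(64, 64))
    dset.attrs["units"] = "K"            # metadata attached to the dataset
    f.attrs["num_processors"] = 128      # metadata attached to the file
    f.attrs["run_date"] = "2008-08-04"

# Read back: the array and its metadata travel in the same portable file.
with h5py.File(path, "r") as f:
    temps = f["temperature"][:]
    nproc = int(f.attrs["num_processors"])

print(temps.shape, nproc)  # -> (64, 64) 128
```

A parallel application would open the same file with HDF5's parallel (MPI-IO) driver instead; the serial sketch above only shows the self-describing file layout.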
In general, database systems are used for storing transactional data that are frequently updated, such as bank transactions, and they provide mechanisms for fault tolerance and error recovery.
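The relational model just described can be illustrated with Python's built-in sqlite3 module. The table layout and column names are invented for this sketch; PostgreSQL or MySQL would accept essentially the same SQL through their own drivers.

```python
# Sketch: a tiny relational metadata catalog using Python's built-in
# sqlite3 module. The schema (simulations, run_date, nprocs, data_file)
# is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE simulations (
        id        INTEGER PRIMARY KEY,
        run_date  TEXT,
        nprocs    INTEGER,
        data_file TEXT    -- path to the HDF5/NetCDF base data
    )
""")
rows = [
    ("2008-07-30", 64,  "/data/run001.h5"),
    ("2008-08-01", 128, "/data/run002.h5"),
]
conn.executemany(
    "INSERT INTO simulations (run_date, nprocs, data_file) VALUES (?, ?, ?)",
    rows,
)

# SQL query: which runs used more than 100 processors?
big_runs = conn.execute(
    "SELECT data_file FROM simulations WHERE nprocs > 100"
).fetchall()
print(big_runs)  # -> [('/data/run002.h5',)]
```

Note that only the descriptive metadata lives in the database; the base data stay in scientific file formats, and the catalog merely records where to find them.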
In the scientific community, database systems are mostly used for storing metadata, i.e., information that describes the base data. Typical metadata would be a set of simulation parameters, the date the simulation was run, the number of processors used, etc. The base data are typically stored in scientific data formats, since these provide more powerful parallel I/O capabilities, are machine independent, and are often easier for scientists to use.

The High-Energy Physics community has developed ROOT at CERN, an object-oriented database system with integrated visualization technology. ROOT manages some of the largest data sets in the world and has a large user community, mainly in High-Energy Physics.

Metadata Access

Metadata is information that describes the actual data collected by scientific experiments or simulations. Typical examples are simulation parameters and provenance information, such as who generated the data or which programs were used to derive some quantities. Scientific data formats such as HDF5 and NetCDF provide basic mechanisms for storing metadata, but large metadata repositories are typically kept in database systems, since these provide more powerful query and fault-tolerance mechanisms.

The Metadata Catalog Service (MCS), developed by the Globus project, is a stand-alone metadata catalog service that associates application-specific descriptions with data files, tables, or objects. MCS is widely used in various Grid projects. The Storage Resource Broker's (SRB's) MCAT is a metadata repository that provides a mechanism for storing and querying both system-level and domain-dependent metadata through a uniform interface. MCAT is part of SRB and is used in various Grid projects.

Efficient Indexing and Querying

Typical end-user analysis is an iterative process of searching for regions of interest in the data, such as "find all supernova explosions with temperature < 5000 and pressure > 1000". Traditional data analysis tools sequentially process all the data to find the records that match the search criteria, which is often impractical for large amounts of data. FastBit, developed by the Scientific Data Management Research Group at Berkeley Lab, is a powerful bitmap indexing technique for accelerating these types of searches.
HDF5-FastQuery is an API that supports semantic indexing of HDF5 datasets and simplifies executing queries on HDF5 files. It uses the FastBit technology to significantly accelerate accessing and querying large HDF5 files. Datasets can be retrieved using complex compound range queries such as "(energy > 100) && (70 < pressure < 90)". The bitmap index technology retrieves only the data elements that satisfy the query condition and shows significant speedups compared with reading entire datasets.

File Transfer

There are many tools available for transferring files. The choice of the best tool depends on the size of the files, the number of files, and the distance involved in the transfer.
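The compound range query quoted above can be written as a plain scan with NumPy boolean masks. This sketch shows only the selection semantics; FastBit and HDF5-FastQuery answer the same query from precomputed bitmap indexes and so avoid scanning every element, and nothing below uses their actual APIs.

```python
# Sketch: the selection semantics of the compound range query
# "(energy > 100) && (70 < pressure < 90)" as a full NumPy scan.
# FastBit answers this from bitmap indexes instead of scanning.
import numpy as np

rng = np.random.default_rng(0)
energy = rng.uniform(0, 200, size=10_000)    # synthetic example data
pressure = rng.uniform(0, 100, size=10_000)

# Boolean mask: element-wise AND of the range conditions.
mask = (energy > 100) & (pressure > 70) & (pressure < 90)
hits = np.flatnonzero(mask)   # indices of the matching records

# Only these records would then be read back from the file.
print(hits.size, "matching records")
```

An index-based system touches only the matching elements, which is where the speedup over this exhaustive scan comes from.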
Remote Access and Distributed Data Management

According to several user surveys, scientists spend a great deal of time transferring data between various storage systems, for example from supercomputers or clusters to mass storage systems for archival purposes, or from mass storage systems to data analysis farms. The data transfer process is often tedious and time consuming, since different storage systems support different data transfer tools and employ different access policies. Moreover, data transfers may stall due to network failures or limited storage space. Efficient and robust tools for managing data transfers, storage space, and file access, such as the Storage Resource Manager (SRM) developed at Berkeley Lab in collaboration with others, simplify this process. SRM manages transfers of, and access to, large files across different storage systems, both on disk and on tape, through an API that is standardized across storage systems. Important features of SRM are space management and fault tolerance. SRMs are widely used in the High-Energy Physics community for transferring large amounts of data between labs and universities across the world, and SRMs are being standardized by the Grid community. The Storage Resource Broker (SRB), developed at the San Diego Supercomputer Center, provides access to files stored on different storage systems and also provides metadata information. SRB is used in various Grid projects across many application domains.

More Information

For more information about these technologies, contact the NERSC Analytics Team at consult@nersc.gov.
Page last modified: Mon, 04 Aug 2008 23:47:10 GMT
Page URL: http://www.nersc.gov/nusers/analytics/sdm/