Analytics: Scientific Data Management

Quick Link: NERSC Tools for Scientific Data Management

This site provides a brief overview of data management technologies commonly used in various scientific communities. The technologies are categorized according to the following areas: scientific data formats; database systems and metadata storage; metadata access; efficient indexing and querying; file transfer; and remote access and distributed data management.
Each technology is briefly summarized and references to the original software or development sites are given. Links to documentation and reference information for software that is available on NERSC platforms, or that is under development and may be deployed in the future, are available on a separate page.

Scientific Data Formats

Scientific data are often based on numerical arrays that are stored in scientific data formats such as NetCDF (Network Common Data Form), HDF5 (Hierarchical Data Format 5), and FITS (Flexible Image Transport System). These formats support efficient mechanisms for accessing and manipulating arrays and allow machine-independent data exchange. They also provide mechanisms for storing metadata such as simulation parameters. HDF5 supports parallel I/O, which allows multiple processors to read and write a file; this is particularly important for large-scale, data-intensive applications. NetCDF itself supports only serial I/O; parallel I/O is provided by Parallel NetCDF.

Currently there is no common scientific data format used across different scientific disciplines. For instance, NetCDF is commonly used in the climate modeling community, FITS is the preferred format for storing astronomy data, and HDF5 is used, for instance, in the combustion and fusion simulation communities. The Visualization Group/Analytics Team is currently working with the Paul Scherrer Institut (PSI) in Switzerland on H5Part, a storage model and API based on HDF5 that simplifies data exchange within the accelerator modeling community. Another standardization effort for HDF5-based data exchange is Fiberbundles.

Database Systems and Metadata Storage

The two most commonly used open-source database systems are PostgreSQL and MySQL. Both are based on so-called relational technology, in which data are stored in tables and accessed via a standardized query language called SQL.
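The HDF5 pattern described above, a numerical array and its descriptive metadata stored together in one machine-independent file, can be sketched with the h5py Python binding. The binding, file name, and attribute names below are illustrative assumptions, not something the page itself prescribes.

```python
# Sketch: storing a numerical array plus metadata in an HDF5 file via h5py.
# h5py and all names here (run.h5, temperature, num_processors) are
# illustrative assumptions for this overview.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "run.h5")

# Write: a 2-D temperature field, with simulation parameters as attributes.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("temperature", data=np.random.rand(64, 64))
    dset.attrs["units"] = "K"            # metadata attached to the dataset
    f.attrs["num_processors"] = 128      # metadata attached to the file
    f.attrs["run_date"] = "2008-08-04"

# Read back: the array and its metadata travel in the same portable file.
with h5py.File(path, "r") as f:
    temps = f["temperature"][:]
    nproc = int(f.attrs["num_processors"])

print(temps.shape, nproc)  # -> (64, 64) 128
```

A parallel application would open the same file with HDF5's parallel (MPI-IO) driver instead; the serial sketch above only shows the self-describing file layout.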
In general, database systems are used for storing transactional data that are frequently updated, such as bank transactions, and they provide mechanisms for fault tolerance and error recovery.
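The relational model just described can be illustrated with Python's built-in sqlite3 module. The table layout and column names are invented for this sketch; PostgreSQL or MySQL would accept essentially the same SQL through their own drivers.

```python
# Sketch: a tiny relational metadata catalog using Python's built-in
# sqlite3 module. The schema (simulations, run_date, nprocs, data_file)
# is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE simulations (
        id        INTEGER PRIMARY KEY,
        run_date  TEXT,
        nprocs    INTEGER,
        data_file TEXT    -- path to the HDF5/NetCDF base data
    )
""")
rows = [
    ("2008-07-30", 64,  "/data/run001.h5"),
    ("2008-08-01", 128, "/data/run002.h5"),
]
conn.executemany(
    "INSERT INTO simulations (run_date, nprocs, data_file) VALUES (?, ?, ?)",
    rows,
)

# SQL query: which runs used more than 100 processors?
big_runs = conn.execute(
    "SELECT data_file FROM simulations WHERE nprocs > 100"
).fetchall()
print(big_runs)  # -> [('/data/run002.h5',)]
```

Note that only the descriptive metadata lives in the database; the base data stay in scientific file formats, and the catalog merely records where to find them.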
In the scientific community, database systems are mostly used for storing metadata, i.e., information that describes the base data. Typical metadata would be a set of simulation parameters, the date the simulation was run, the number of processors used, etc. The base data are typically stored in scientific data formats, since these provide more powerful parallel I/O capabilities, are machine independent, and are often easier for scientists to use.

The High-Energy Physics community has developed ROOT at CERN, an object-oriented database system with integrated visualization technology. ROOT manages some of the largest data sets in the world and has a large user community, mainly in High-Energy Physics.

Metadata Access

Metadata is information that describes the actual data collected by scientific experiments or simulations. Typical examples are simulation parameters and provenance information, such as who generated the data or which programs were used to derive some quantities. Scientific data formats such as HDF5 and NetCDF provide basic mechanisms for storing metadata, but large metadata repositories are typically kept in database systems, since these provide more powerful query and fault-tolerance mechanisms.

The Metadata Catalog Service (MCS), developed by the Globus project, is a stand-alone metadata catalog service that associates application-specific descriptions with data files, tables, or objects. MCS is widely used in various Grid projects. The Storage Resource Broker's (SRB's) MCAT is a metadata repository that provides a mechanism for storing and querying both system-level and domain-dependent metadata through a uniform interface. MCAT is part of SRB and is used in various Grid projects.

Efficient Indexing and Querying

Typical end-user analysis is an iterative process of searching for regions of interest in the data, such as "find all supernova explosions with temperature < 5000 and pressure > 1000". Traditional data analysis tools sequentially process all the data to find the records that match the search criteria, which is often impractical for large amounts of data. FastBit, developed by the Scientific Data Management Research Group at Berkeley Lab, is a powerful bitmap indexing technique for accelerating these types of searches.
HDF5-FastQuery is an API that supports semantic indexing of HDF5 datasets and simplifies executing queries on HDF5 files. It uses the FastBit technology to significantly accelerate accessing and querying large HDF5 files. Datasets can be retrieved using complex compound range queries such as "(energy > 100) && (70 < pressure < 90)". The bitmap index technology retrieves only the data elements that satisfy the query condition and shows significant speedups compared with reading entire datasets.

File Transfer

There are many tools available for transferring files. The choice of the best tool depends on the size of the files, the number of files, and the distance involved in the transfer.
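The compound range query quoted above can be written as a plain scan with NumPy boolean masks. This sketch shows only the selection semantics; FastBit and HDF5-FastQuery answer the same query from precomputed bitmap indexes and so avoid scanning every element, and nothing below uses their actual APIs.

```python
# Sketch: the selection semantics of the compound range query
# "(energy > 100) && (70 < pressure < 90)" as a full NumPy scan.
# FastBit answers this from bitmap indexes instead of scanning.
import numpy as np

rng = np.random.default_rng(0)
energy = rng.uniform(0, 200, size=10_000)    # synthetic example data
pressure = rng.uniform(0, 100, size=10_000)

# Boolean mask: element-wise AND of the range conditions.
mask = (energy > 100) & (pressure > 70) & (pressure < 90)
hits = np.flatnonzero(mask)   # indices of the matching records

# Only these records would then be read back from the file.
print(hits.size, "matching records")
```

An index-based system touches only the matching elements, which is where the speedup over this exhaustive scan comes from.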
Remote Access and Distributed Data Management

According to several user surveys, scientists spend a great deal of time transferring data between various storage systems, for example from supercomputers or clusters to mass storage systems for archival purposes, or from mass storage systems to data analysis farms. The data transfer process is often tedious and time consuming, since different storage systems support different data transfer tools and employ different access policies. Moreover, data transfers may stall due to network failures or limited storage space. Efficient and robust tools for managing data transfers, storage space, and file access, such as the Storage Resource Manager (SRM) developed at Berkeley Lab in collaboration with others, simplify this process. SRM manages transfers of, and access to, large files across different storage systems, both on disk and on tape, through an API that is standardized across storage systems. Important features of SRM are space management and fault tolerance. SRMs are widely used in the High-Energy Physics community for transferring large amounts of data between labs and universities across the world, and SRMs are being standardized by the Grid community. The Storage Resource Broker (SRB), developed at the San Diego Supercomputer Center, provides access to files stored on different storage systems and also provides metadata information. SRB is used in various Grid projects across many application domains.

More Information

For more information about these technologies, contact the NERSC Analytics Team at consult@nersc.gov.
Page last modified: Mon, 04 Aug 2008 23:47:10 GMT
Page URL: http://www.nersc.gov/nusers/analytics/sdm/