Overview:
The goal of this project is to develop measurement methods for evaluating the reliability and robustness of
grid-computing systems that utilize the emerging Open Grid Forum (OGF) standards and related specifications.
Grid computing systems enable dynamic composition of large numbers of distributed resources to perform highly
compute-intensive tasks. These resources include processors, software components, memory and disk storage, high-speed
data transfer capabilities, and databases. In recent years, there has been rapid growth in the number of industrial
grid systems developed to support applications such as electronic commerce and finance, engineering design, product
development, and scientific research. As demand grows and development of this technology continues, industrial grid
systems that implement standards-based solutions will require methods to measure, analyze, and manage ever-larger
numbers of grid resources in order to ensure system reliability under volatile and uncertain conditions.
Industry Need Addressed: The rapid growth of commercial grid computing depends upon the success of standards currently being developed by the OGF and similar organizations.
While these standards focus on providing a common platform for executing core grid resource management functions,
less attention has been devoted to understanding how standards-based grid systems might behave at a large scale,
or how well they might respond under volatile and uncertain conditions. At large scales, interactions among grid
components can lead to complex, non-linear behaviors that produce unexpected and uncontrolled system-wide effects,
which, in turn, can endanger and severely degrade the effectiveness of an industrial grid system. The ability of grid
systems to remain reliable and robust in the face of such conditions is important; otherwise, significant
productivity losses are likely to occur and long-term technological progress will be endangered. To ensure the
required reliability of large-scale grid systems, measurement methods are needed to support their analysis
and management.
NIST/ITL Role: As industry focuses on establishing basic grid capabilities, NIST/ITL assists the private sector by developing
methods to measure system behaviors and characteristics that impact reliability and robustness in large-scale,
standards-based grid computing systems. These methods measure the ability of grids to provide services to industrial
applications in the face of volatile and uncertain conditions. The methods will also enable understanding of the causes
of complex behavior in grid systems, detection of the onset of undesirable system states, and, ultimately, control that
promotes desirable system states. Specifically, the work provides:
- A simulation framework for modeling different approaches to resource management and control that currently underlie grid computing specifications,
- A set of metrics, scenarios, and methods against which to measure and evaluate the robustness and reliability of the proposed approaches, with sample evaluations,
- Control algorithms that facilitate desirable overall grid system behaviors, and
- Identification of issues and requirements for grid system reliability, developed through the OGF Reliability and Robustness Research Group.
Impact: NIST/ITL, through publication, interaction with industry,
and participation in the standards bodies, provides critical information
to developers of commercial grid computing standards and applications.
Such information should suggest new approaches to improving reliability
and robustness under volatile conditions.