
2007 Abstracts of MCS Reports and Preprints

Preprints
J. Jung and A. Hassanein, "Three-Phase CFD Analytical Modeling of Blood Flow," Preprint ANL/MCS-P1386-0107, January 2007. The behavior of blood cells in disturbed flow regions of arteries has significant relevance for understanding atherogenesis.  However, their distribution with red blood cells (RBCs) and leukocytes is not so well studied and understood.  Our three-phase computational fluid dynamics approach including plasma, RBCs, and leukocytes was used to numerically simulate the local hemodynamics in such a flow regime.  This model has tracked the wall shear stress (WSS), phase distributions, and flow patterns for each phase in a concentrated suspension shear flow of blood.  Unlike other computational approaches, this approach does not require dispersion coefficients as an input.  The non-Newtonian viscosity model was applied to a wide physiological range of hematocrits, including low shear rates.  The migration and segregation of blood cells in disturbed flow regions were computed, and the results compared favorably with available experimental data.  The predicted higher leukocyte concentration was correlated with relatively low WSS near the stenosis having a high WSS.  This behavior was attributed to flow-dependent interactions of the leukocytes with RBCs in pulsatile flow.  This three-phase hemodynamic analysis may have application to vulnerable plaque formation in arteries with in vivo complex flow conditions.
   
R. Latham, R. Ross, and R. Thakur, "Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI One-sided Communication," Preprint ANL/MCS-P1388-0107, January 2007. The ROMIO implementation of the MPI-IO standard provides a portable infrastructure for use on top of any number of different underlying storage targets.  These targets vary widely in their capabilities, and in some cases additional effort is needed within ROMIO to support all MPI-IO semantics.  Two aspects of the interface that can be problematic to implement are MPI-IO atomic mode and the shared file pointer access routines.  Atomic mode requires enforcing strict consistency semantics, while shared file pointer routines require communication and coordination in order to atomically update a shared resource.  For some file systems, native locks may be used to implement these features, but not all file systems have lock support.  In this work, we describe algorithms for implementing efficient mutex locks using MPI-1 and the one-sided capabilities from MPI-2.  We then show how these algorithms may be used to implement both MPI-IO atomic mode and shared file pointer methods for ROMIO without requiring any features from the underlying file system.  We evaluate the performance of these algorithms and show that they can outperform traditional file system lock approaches.  Because of the portable nature of these algorithms, they are likely useful in a variety of situations where distributed locking or coordination is needed in the MPI-2 environment.
   
C. Falzone, A. Chan, E. Lusk, and W. Gropp, "A Portable Method for Finding User Errors in the Usage of MPI Collective Operations," Preprint ANL/MCS-P1389-0107, January 2007. An MPI profiling library is a standard mechanism for intercepting MPI calls by applications.  Profiling libraries are so named because they are commonly used to gather runtime information about performance characteristics.  Here we present a profiling library whose purpose is to detect user errors in the use of MPI's collective operations.  While some errors can be detected locally (by a single process), other errors involving the consistency of arguments passed to MPI collective functions must be tested for in a collective fashion.  While the idea of using such a profiling library does not originate here, we take the idea further than it has been taken before (we detect more errors, including those involving datatype inconsistencies) and present an open-source library that can be used with any MPI implementation.  We describe the tests carried out, provide some details of the implementation, illustrate the usage of the library, and present performance tests.
   
P. Beckman, K. Iskra, K. Yoshii, S. Coghlan, and A. Nataraj, "Benchmarking the Effects of Operating System Interference on Extreme-Scale Parallel Machines," Preprint ANL/MCS-P1390-0107, January 2007. We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications.  Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken.  We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations.  Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small.  We demonstrate that synchronizing the noise can significantly reduce its negative influence.
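As a back-of-the-envelope illustration of why a rare interruption can still dominate at scale (this sketch is not taken from the preprint), assume that each of N processes is independently interrupted with probability p during one collective step. The chance that at least one process is hit is then 1 - (1 - p)^N. The function name and the values of p and N below are illustrative assumptions only.

```python
def prob_any_interrupted(p: float, n: int) -> float:
    """Probability that at least one of n independent processes is interrupted.

    Assumes (for illustration only) that interruptions are independent and
    that every process has the same per-step interruption probability p.
    """
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 1e-5  # hypothetical per-process probability of an interruption
    for n in (1, 1_000, 100_000, 1_000_000):
        print(f"N={n:>9,d}  P(at least one interruption) = "
              f"{prob_any_interrupted(p, n):.4f}")
```

Because a collective operation cannot finish before its slowest participant, under these assumptions a single long interruption becomes close to certain to delay the whole step once N is large enough.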
   
N. Desai, E. Lusk, and R. Bradshaw, "A Composition Environment for MPI Programs," Preprint ANL/MCS-P1391-0207, February 2007. While MPI is the most common mechanism for expressing parallelism, MPI programs are not composable by using current MPI process managers or parallel shells.  We introduce MPISH2, an MPI process manager analogous to serial Unix shells.  It allows the composition of MPI and serial Unix utilities with one another to perform scalable tasks across large numbers of Unix clients.  This paper discusses in detail issues of process management and parallel tool composition.
   
B. Norris, A. Hartono, and W. Gropp, "Annotations for Productivity and Performance Portability," in Petascale Computing: Algorithms and Applications, Chapman & Hall/CRC Press (to appear).  Also Preprint ANL/MCS-P1392-0107, January 2007. In many scientific applications, significant time is spent in tuning codes for a particular high-performance architecture.  Multiple approaches to such tuning exist, ranging from the relatively nonintrusive (e.g., by using compiler options) to extensive code modifications that attempt to exploit specific architecture features.  In most cases, the more intrusive code tuning is not easily reversible and thus can result in inferior performance on a different architecture or, in the worst case, in wholly nonportable code.  Readability is also greatly reduced in such highly optimized codes, resulting in lowered productivity during code maintenance.  We introduce an extensible annotation system that aims to improve both performance and productivity by enabling software developers to insert annotations into their source code that trigger a number of low-level performance optimizations on a specified code fragment.
   
J. Bresnahan, M. Link, R. Kettimuthu, D. Fraser, and I. Foster, "GridFTP Pipelining," Preprint ANL/MCS-P1393-0207, February 2007. GridFTP is an exceptionally fast transfer protocol for large volumes of data.  Implementations of it are widely deployed and used on well-connected Grid environments such as those of the TeraGrid because of its ability to scale to network speeds.  However, when the data is partitioned into many small files instead of few large files, it suffers from lower transfer rates.  The latency between the serialized transfer requests of each file directly detracts from the amount of time data pathways are active, thus lowering achieved throughput.  Further, when a data pathway is inactive, the TCP window closes, and TCP must go through the slow-start algorithm.  The performance penalty can be severe.  This situation is known as the "lots of small files" problem.  In this paper we introduce a solution to this problem.  This solution, called pipelining, allows many transfer requests while a data transfer is in progress.  We present an implementation and performance study of the pipelining solution.
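The latency argument above can be made concrete with a toy timing model. This is a simplified sketch, not the authors' analysis: it assumes that each transfer request costs one round-trip time, that without pipelining the data channel sits idle during every request, and that with pipelining only the first request's latency is exposed. It ignores the TCP slow-start penalty the abstract also mentions. All function and parameter names are hypothetical.

```python
def serialized_time(n_files: int, file_bytes: int,
                    bandwidth_bps: float, rtt_s: float) -> float:
    """Total time when each request waits for the previous transfer to finish."""
    transfer_s = file_bytes * 8 / bandwidth_bps  # time to move one file
    return n_files * (rtt_s + transfer_s)


def pipelined_time(n_files: int, file_bytes: int,
                   bandwidth_bps: float, rtt_s: float) -> float:
    """Total time when requests are issued while earlier transfers are running."""
    transfer_s = file_bytes * 8 / bandwidth_bps
    return rtt_s + n_files * transfer_s  # only the first request's latency shows


if __name__ == "__main__":
    # Illustrative numbers: 10,000 files of 125 kB over 1 Gb/s with 50 ms RTT.
    args = (10_000, 125_000, 1e9, 0.05)
    print(f"serialized: {serialized_time(*args):8.1f} s")
    print(f"pipelined:  {pipelined_time(*args):8.1f} s")
```

In this model the serialized cost grows with the number of files times the round-trip time, which is why many small files are penalized far more than a few large ones.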
   
E. T. Ong, J. W. Larson, B. Norris, R. L. Jacob, M. Tobis, and M. Steder, "Multilingual Interfaces for Parallel Coupling in Multiphysics and Multiscale Systems," Preprint ANL/MCS-P1395-0207, February 2007. Multiphysics and multiscale simulation systems are emerging as a new grand challenge in computational science, largely because of increased computing power provided by the distributed-memory parallel programming model on commodity clusters.  These systems often present a parallel coupling problem in their intercomponent data exchanges.  Another potential problem in these coupled systems is language interoperability between their various constituent codes.  In anticipation of combined parallel coupling/language interoperability challenges, we have created a set of interlanguage bindings for a successful parallel coupling library, the Model Coupling Toolkit (MCT).  We describe the method used for automatically generating the bindings using the Babel language interoperability tool, and illustrate with short examples how MCT can be used from the C++ and Python languages.  We report preliminary performance results for the MCT interpolation benchmark.  We conclude with a discussion of the significance of this work to the rapid prototyping of large parallel coupled systems.
   
M. R. Paul, M. I. Einarsson, P. F. Fischer, and M. C. Cross, "Extensive Chaos in Rayleigh-Bénard Convection," Preprint ANL/MCS-P1396-0207, February 2007. Using large-scale numerical calculations we explore spatiotemporal chaos in Rayleigh-Bénard convection for experimentally relevant conditions.  We calculate the spectrum of Lyapunov exponents and the Lyapunov dimension describing the chaotic dynamics of the convective fluid layer at constant thermal driving over a range of finite system sizes.  Our results reveal that the dynamics of fluid convection is truly chaotic for experimental conditions as illustrated by a positive leading order Lyapunov exponent.  We also find the chaos to be extensive over the range of finite sized systems investigated as indicated by a linear scaling between the Lyapunov dimension of the chaotic attractor and the system size.
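The abstract does not say which definition of the Lyapunov dimension is used. A common choice is the Kaplan-Yorke formula, D = j + (λ_1 + ... + λ_j) / |λ_(j+1)|, where the exponents are sorted in decreasing order and j is the largest index for which the partial sum is non-negative. The sketch below assumes that definition; the function name and the sample spectrum are illustrative, not values from the preprint.

```python
def kaplan_yorke_dimension(exponents):
    """Kaplan-Yorke (Lyapunov) dimension from a list of Lyapunov exponents."""
    lam = sorted(exponents, reverse=True)
    partial = 0.0  # running sum of the largest exponents
    for j, value in enumerate(lam):
        if partial + value < 0.0:
            # The first j exponents sum to a non-negative value; interpolate
            # linearly into the (j+1)-th direction.
            return j + partial / abs(value)
        partial += value
    # The sum never turns negative: the dimension saturates at the full count.
    return float(len(lam))


if __name__ == "__main__":
    spectrum = [0.5, 0.0, -1.0]  # made-up spectrum for illustration
    print(kaplan_yorke_dimension(spectrum))
```

Extensive chaos in the sense of the abstract would then show up as this dimension growing linearly with system size when the calculation is repeated for larger domains.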
   
M. Anitescu, G. Palmiotti, W.-S. Yang, and M. Neda, "Stochastic Finite-Element Approximation of the Parametric Dependence of Eigenvalue Problem Solution," Preprint ANL/MCS-P1397-0307, March 2007. We present a stochastic finite-element approach for characterizing parameter dependence of minimum eigenvalue problems encountered in neutronic calculations.  Our formulation results in solving a nonlinear system of equations that is K times larger than the original problem and has K constraints, where K is the number of terms considered in the perturbative expansion of the solution.  This approach allows us to calculate the behavior of the eigenvalue and the eigenvector in the entire parameter range, as opposed to a narrow region around a nominal value calculated by classical sensitivity analysis.  Initial investigation for a small parameter space indicates that the method has the potential of substantial savings over Monte Carlo calculations that attempt to characterize the behavior of the eigenvector and eigenvalue over the entire parameter space.
   
Z. Insepov, T. Bazhirov, G. Norman, and V. Stegailov, "Computer Simulation of Bubble Formation," Preprint ANL/MCS-P1398-0307, March 2007. Properties of liquid metals (Li, Pb, Na) containing nanoscale cavities were studied by atomistic molecular dynamics (MD).  Two atomistic models of cavity simulation were developed that cover a wide area in the phase diagram with negative pressure.  In the first model, the thermodynamics of cavity formation, stability, and the dynamics of cavity evolution in bulk liquid metals have been studied.  Radial densities, pressures, surface tensions, and work functions of nanoscale cavities of various radii were calculated for liquid Li, Na, and Pb at various temperatures and densities, and at small negative pressures near the liquid-gas spinodal.  The work functions for cavity formation in liquid Li were compared with the available experimental data.  The cavitation rate can further be obtained by using classical nucleation theory (CNT).  The second model is based on the stability study and on the kinetics of cavitation of the stretched liquid metals.  An MD method was used to simulate cavitation in metastable Pb and Li melts and determine the stability limits.  States at temperatures below critical (T < 0.5Tc) and large negative pressures were considered.  The kinetic boundary of liquid phase stability was shown to be different from the spinodal.  The kinetics and dynamics of cavitation were studied.  The pressure dependences of cavitation frequencies were obtained for several temperatures.  The results of MD calculations were compared with estimates based on CNT.
   
J. P. Allain, M. Nieto, M. Hendricks, A. Hassanein, C. Tarrio, S. Grantham, and V. Bakshi, "Energetic and Thermal Sn Interactions and their Effect on EUVL Source Collector Mirror Lifetime at High Temperatures," Preprint ANL/MCS-P1400-0307, March 2007. Exposure of collector mirrors facing the hot, dense pinch plasma in plasma-based EUV light sources remains one of the most critical issues for source component lifetime and commercial feasibility of EUV lithography technology.  Studies at Argonne have focused on understanding the underlying mechanisms that hinder collector mirror performance under Sn exposure and developing methods to mitigate them.  Both Sn ion irradiation and thermal evaporation expose the candidate mirrors tested (i.e., Ru, Rh, and Pd) in the experimental facility known as IMPACT (Interaction of Materials with charged Particles and Components Testing).  Studies have led to an understanding of how Sn energetic ions compared to Sn thermal atoms affect three main surface properties of the collector mirror: 1) surface chemical state, 2) surface structure, and 3) surface morphology.  All these properties are crucial in understanding how collector mirrors will respond to Sn-based EUV source operation, primarily because variation in these properties affects the reflectivity of photons in the EUV spectral range of interest (in-band 13.5 nm).  This paper discusses the first property and its impact on 13.5-nm reflectivity.
   
Z. Insepov, A. Hassanein, T. T. Bazhirov, G. E. Norman, and V. V. Stegailov, "Molecular Dynamics Simulations of Bubble Formation and Cavitation in Liquid Metals," Preprint ANL/MCS-P1402-0307, March 2007. Thermodynamics and kinetics of nanoscale bubble formation in liquid metals such as Li and Pb were studied by molecular dynamics (MD) simulations at pressures typical for magnetic and inertial fusion.  Two different approaches to bubble formation were developed.  In one method, radial densities, pressures, surface tensions, and work functions of the cavities in supercooled liquid lithium were calculated and compared with the surface tension experimental data.  The critical radius of a stable cavity in liquid lithium was found for the first time.  In the second method, the cavities were created in the highly stretched region of the liquid phase diagram, and then the stability boundary and the cavitation rates were calculated in liquid lead.  The pressure dependencies of cavitation frequencies were obtained over the temperature range 700-2700 K in liquid Pb.  The results of MD calculations for cavitation rate were compared with estimates of classical nucleation theory (CNT).
   
D. Sulakhe, A. Rodriguez, M. Wilde, I. Foster, and N. Maltsev, "Interoperability of GADU in using Heterogeneous Grid Resources for Bioinformatics Applications," IEEE Trans. on Information Technology in Biomedicine (to appear).  Also Preprint ANL/MCS-P1403-0307, March 2007. During the past decade, the scientific community has witnessed the rapid accumulation of gene sequence data and data related to physiology and biochemistry of organisms.  Bioinformatics tools used for efficient and computationally intensive analysis of genetic sequences require large-scale computational resources to accommodate the growing data.  Grid computational resources such as the Open Science Grid and TeraGrid have proved useful for scientific discovery.  GADU is a high-throughput computational system developed to automate the steps involved in accessing the Grid resources for running bioinformatics applications.  This paper describes the requirements for building an automated scalable system such as GADU that can run jobs on different Grids.  The paper describes the resource-independent configuration of GADU using the Pegasus-based Virtual Data System that makes high-throughput computational tools interoperable on heterogeneous Grid resources.  The paper also highlights the features implemented to make GADU a gateway to computationally intensive bioinformatics applications on the Grid.  The paper does not go into the details of the problems involved or the lessons learned in using individual Grid resources, as these have already been published in our paper on GNARE; it focuses primarily on the architecture that makes GADU resource independent and interoperable across heterogeneous Grid resources.
   
J. N. Lyness and S. Joe, "Determination of the Rank of an Integration Lattice," Preprint ANL/MCS-P1404-0307, March 2007. The continuing and widespread use of lattice rules for high-dimensional numerical quadrature is driving the development of a rich and detailed theory.  Part of this theory is devoted to computer searches for rules, appropriate to particular situations.  In some applications, one is interested in obtaining the (lattice) rank of a lattice rule Q(Λ) directly from the elements of a generator matrix B (possibly in upper triangular lattice form) of the corresponding dual lattice Λ
   
P. Fischer, J. Lottes, A. Siegel, and G. Palmiotti, "Large Eddy Simulation of Wire-Wrapped Fuel Pins I: Hydrodynamics of a Single Pin," Preprint ANL/MCS-P1405-0307, March 2007. We present large-eddy simulations of flow and heat transfer in a wire-wrapped fuel assembly at subchannel Reynolds numbers of Re_h = 4684-29184.  The domain consists of a single pin in a hexagonally periodic array, corresponding to two interior subchannels.  Periodic boundary conditions are also used in the axial direction over a single wire-wrap period.
   
J. P. Allain, M. Nieto, M. Hendricks, S. S. Harilal, and A. Hassanein, "Debris and Radiation-Induced Damage Effects on EUV Nanolithography Source Collector Mirror Optics Performance," Preprint ANL/MCS-P1406-0407, April 2007. Exposure of collector mirrors facing the hot, dense pinch plasma in plasma-based EUV light sources to debris (fast ions, neutrals, off-band radiation, droplets) remains one of the most critical issues for source component lifetime and commercial feasibility of nanolithography at 13.5 nm.  Typical radiators used at 13.5 nm include Xe and Sn.  Fast particles emerging from the pinch region of the lamp are known to induce serious damage to nearby collector mirrors.  Candidate collector configurations include either multilayer mirrors (MLM) or single-layer mirrors (SLM) used at grazing incidence.  Studies at Argonne have focused on understanding the underlying mechanisms that hinder collector mirror performance at 13.5 nm under fast Sn or Xe exposure.  This is made possible by a new state-of-the-art in situ EUV reflectometry system that measures real-time relative EUV reflectivity variation (at 15-degree incidence and 13.5 nm) during fast particle exposure.  Intense EUV light and off-band radiation are also known to contribute to mirror damage.  For example, off-band radiation can couple to the mirror and induce heating, affecting the mirror's surface properties.  In addition, intense EUV light can partially photoionize background gas (e.g., Ar or He) used for mitigation in the source device.  This can lead to a local weakly ionized plasma that creates a sheath, accelerates charged gas particles to the mirror surface, and induces sputtering.  In this paper we study several aspects of debris and radiation-induced damage to candidate EUVL source collector optics materials.  The first study concerns the use of IMD simulations to study the effect of surface roughness on EUV reflectivity.  The second studies the effect of fast particles on MLM reflectivity at 13.5 nm.  The third studies the effect of multiple energetic sources with thermal Sn on 13.5-nm reflectivity.  These studies focus on conditions that simulate the EUVL source environment in a controlled way.
   
B. Clifford, I. Foster, J.-S. Voeckler, M. Wilde, and Y. Zhao, "Tracking Provenance in a Virtual Data Grid," Concurrency and Computation: Practice and Experience (to appear).  Also Preprint ANL/MCS-P1407-0407, April 2007. The virtual data model allows data sets to be described prior to, and separately from, their physical materialization.  We have implemented this model in a Virtual Data Language (VDL) and associated supporting tools, which provide for both the storage, query, and retrieval of virtual data set descriptions, and the automated, on-demand materialization of virtual data sets.  We use a standardized data provenance challenge exercise to illustrate the powerful queries that can be performed on the data maintained by these tools, which for a single virtual data set can include three elements: the computational procedure(s) that must be executed to materialize the data set, the runtime log(s) produced by the execution of the computation(s), and optional metadata annotation(s) that associate application semantics with data and procedures.
   
S. S. Varghese, S. H. Frankel, and P. F. Fischer, "Modeling Transition to Turbulence in Eccentric Stenotic Flows," Preprint ANL/MCS-P1408-0407, April 2007. Mean flow predictions obtained from a host of turbulence models were found to be in poor agreement with recent direct numerical simulation results for turbulent flow distal to an idealized eccentric stenosis.  Many of the widely used turbulence models, including a large eddy simulation model, were unable to accurately capture the post-stenotic transition to turbulence.  The results suggest that efforts towards developing more accurate turbulence models for low Reynolds number, separated transitional flows are necessary before such models can be used confidently under hemodynamic conditions where turbulence may develop.
   
Z. Insepov, A. Hassanein, J. Norem, and D. R. Swenson, "Advanced Surface Polishing using Gas Cluster Ion Beams," Nuclear Instruments and Methods in Physics, B (to appear).  Also Preprint ANL/MCS-P1409-0407, April 2007. Gas cluster ion beam (GCIB) treatment can be important for mitigating the Q-slope in superconducting cavities.  Existing surface smoothening methods were analyzed, and a new surface polishing method was proposed based on employing extra-large gas cluster ions (X-GCIB).
   
F. Loth, P. F. Fischer, and H. S. Bassiouny, "Blood Flow in End-to-Side Anastomoses," Preprint ANL/MCS-P1410-0407, April 2007. Blood flow in end-to-side autogenous or prosthetic graft anastomoses is of great interest to biomedical researchers because the biomechanical force profile engendered by blood flow disturbances at such geometric transitions is thought to play a significant role in vascular remodeling and graft failure.  Thus, investigators have extensively studied anastomotic blood flow patterns in relationship to graft failure with the objective of enabling design of a more optimal graft anastomotic geometry.  In contrast to arterial bifurcations, surgically created anastomoses can be modified to yield a flow environment that improves graft longevity.  Understanding blood flow patterns at anastomotic junctions is a challenging problem because of the highly varying and complex three-dimensional nature of the geometry that is subjected to pulsatile and, occasionally, turbulent flow.
   
J. Shin, "Introducing Control Flow into Vectorized Code," Preprint ANL/MCS-P1411-0407, April 2007. Single instruction multiple data (SIMD) functional units are ubiquitous in modern microprocessors.  Effective use of these SIMD functional units is essential in achieving the highest possible performance.  Automatic generation of SIMD instructions in the presence of control flow is challenging, however, not only because SIMD code is hard to generate in the presence of arbitrarily complex control flow, but also because the SIMD code executing the instructions in all control paths may be slower than the scalar original, which may bypass a large portion of the code.  One promising technique introduced recently involves inserting branches-on-superword-condition-codes (BOSCCs) to bypass vector instructions.  In this paper, we describe two techniques that improve on the previous approach.  First, BOSCCs are generated in a nested fashion so that even BOSCCs themselves can be bypassed by other BOSCCs.  Second, we generate all vec_any_* instructions to bypass even some predicate-defining instructions.  We implemented these techniques in a vectorizing compiler.  On 14 kernels, the compiler achieves distinct speedups, including 1.99X over the previous technique that generates single-level BOSCCs and vec_any_ne only.
   
J. W. Larson and B. Norris, "Component Specification for Parallel Coupling Infrastructure," Preprint ANL/MCS-P1412-0407, April 2007. Coupled systems comprise multiple interacting subsystems and are an increasingly common computational science application, most notably as multiscale and multiphysics models.  Parallel computing and, in particular, message-passing programming have enabled the development of these models but also present a parallel coupling problem (PCP) in the form of intermodel data dependencies.  Component-based software engineering has been proposed as one means of conquering software complexity in scientific applications; and given the compound nature of coupled models, it is a natural approach to addressing the PCP.  We define a software component specification for solving the PCP, abstracting the elements of the PCP and mapping them onto a set of components from the Common Component Architecture.  We discuss a reference implementation based on the Model Coupling Toolkit.  We demonstrate how these components might be deployed to solve coupling problems in climate modeling.
   
J. W. Larson, "Some Organising Principles for Coupling in Multiphysics and Multiscale Models," Preprint ANL/MCS-P1414-0207, February 2007. Computational science faces new challenges posed by multiphysics and multiscale, or more generally put, coupled  models.  These systems are composites formed from separate subsystem models that interact via data exchanges.  These data dependencies pose a coupling problem, and on distributed-memory computers, a parallel coupling problem.  This paper presents a definition of terms and a set of organising principles for the coupling and parallel coupling problems.  It is meant as a first step towards creating a theory of coupled models.  These principles are then employed in a case study of a coupled climate model and offer remarkable insight into its structure.
   
M. Tobis, M. Steder, R. L. Jacob, R. T. Pierrehumbert, J. W. Larson, and E. T. Ong, "PyMCT and PyCPL: Refactoring the Community Climate System Model," Preprint ANL/MCS-P1415-0207, February 2007. Coupled climate models are multiphysics models comprising multiple separately developed codes that are combined into a single physical system.  This composition of codes is amenable to a scripting solution, and Python is a language that offers many desirable properties for this task.  We have prototyped a version of the Community Climate System Model (CCSM) with coupling infrastructure written in Python.  Our objective was to improve dramatically CCSM's already flexible coupling infrastructure to enable research uses of the model not currently supported.  Here we report the progress in the first steps in this effort: the construction of Python bindings for the Model Coupling Toolkit, a key piece of third-party coupling middleware used in CCSM, and a Python-based CCSM coupler application.  We find that the choice of Python over the original Fortran implementation in the coupler has minimal visible performance impact on the overall coupled system.  We believe our results augur well for the use of Python in the top-level coupling and organization of large parallel multiphysics and multiscale applications.
   
Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, "Swift: Fast, Reliable, Loosely Coupled Parallel Computation," Preprint ANL/MCS-P1416-0507, May 2007. We present Swift, a system that combines a novel scripting language called SwiftScript with a powerful runtime system based on CoG Karajan, Falkon, and Globus to allow for the concise specification, and reliable and efficient execution, of large loosely coupled computations.  Swift adopts and adapts ideas first explored in the GriPhyN virtual data system, improving on that system in many regards.  We describe the SwiftScript language and its use of XDTM to describe the logical structure of complex file system structures.  We also present the Swift runtime system and its use of CoG Karajan, Falkon, and Globus services to dispatch and manage the execution of many tasks in parallel and Grid environments.  We describe application experiences and performance experiments that quantify the cost of Swift operations.
   
R. Latham, W. Gropp, R. Ross, and R. Thakur, "Extending the MPI-2 Generalized Request Interface," Preprint ANL/MCS-P1417-0507, May 2007. The MPI-2 standard added a new feature to MPI called generalized requests.  Generalized requests allow users to add new nonblocking operations to MPI while still making use of many pieces of MPI infrastructure such as request objects and the progress notification routines (MPI_Test, MPI_Wait).  The generalized request design as it stands, however, has deficiencies regarding typical use cases.  This is particularly true in environments that do not support threads or signals, such as some of the leading petascale systems (IBM BG/L and BG/P, Cray XT-3 and XT-4).  This paper examines those shortcomings, proposes extensions to the interface to overcome them, and presents implementation results.
   
R. Thakur and W. Gropp, "Test Suite for Evaluating Performance of MPI Implementations That Support MPI_THREAD_MULTIPLE," Preprint ANL/MCS-P1418-0507, May 2007. MPI implementations that support the highest level of thread safety for user programs, MPI_THREAD_MULTIPLE, are becoming widely available.  Users often expect that different threads can execute independently and that the MPI implementation can provide the necessary level of thread safety with only a small overhead.  The MPI Standard, however, requires only that no MPI call in one thread block MPI calls in other threads; it makes no performance guarantees.  Therefore, some way of measuring an implementation's performance is needed.  In this paper, we propose a number of performance tests that are motivated by typical application scenarios.  These tests cover the overhead of providing the MPI_THREAD_MULTIPLE level of thread safety for user programs, the amount of concurrency in different threads making MPI calls, the ability to overlap communication with computation, and other features.  We present performance results with this test suite on several platforms (Linux cluster, Sun, and IBM SMPs) and MPI implementations (MPICH2, Open MPI, IBM, and Sun).
   
W. D. Gropp and R. Thakur, "Revealing the Performance of MPI RMA Implementations," Preprint ANL/MCS-P1419-0507, May 2007. The MPI remote-memory access (RMA) operations provide a different programming model from the regular MPI-1 point-to-point operations.  This model is particularly appropriate for cases where there are multiple communication events for each synchronization and where the target memory locations are known by the source processes.  In this paper, we describe a benchmark designed to illustrate the performance of RMA with multiple RMA operations for each synchronization, as compared with point-to-point communication.  We measured the performance of this benchmark on several platforms (SGI Altix, Sun Fire, IBM SMP, Linux cluster) and MPI implementations (SGI, Sun, IBM, MPICH2, Open MPI).  We also investigated the effectiveness of the various optimization options specified by the MPI standard.  Our results show that MPI RMA can provide substantially higher performance than point-to-point communication on some platforms, such as SGI Altix and Sun Fire.  The results also show that many opportunities still exist for performance improvements in the implementation of MPI RMA.
   
J. P. Allain, M. Nieto, M. R. Hendricks, P. Plotkin, S. S. Harilal, A. Hassanein, "IMPACT: A Facility for Studying the Interaction of Low-Energy Intense Charged Particle Beams with Dynamic Heterogeneous Surfaces," Preprint ANL/MCS-P1420-0507, May 2007. The Interaction of Materials with Particles and Components Testing (IMPACT) experimental facility is furnished with multiple ion sources and in situ diagnostics to study the modification of surfaces undergoing physical, chemical, and electronic changes during exposure to particle beams. Ion beams with energies in the range of 20 to 5000 eV can bombard samples at flux levels in the range of 10^10 to 10^15 cm^-2 s^-1; parameters such as ion angle of incidence and exposed area are also controllable during the experiment. IMPACT has diagnostics that allow full characterization of the beam, including a Faraday cup, a beam imaging system, and a retarding field energy analyzer. IMPACT is equipped with multiple diagnostics, such as electron (Auger, photoelectron) and ion scattering spectroscopies, that allow different probing depths of the sample to monitor compositional changes in multicomponent or layered targets. A unique real-time erosion diagnostic based on a dual quartz crystal microbalance measures deposition rates smaller than 0.01 nm/s, which can be converted to sputter yields given a particular crystal position and sputtered angular distribution. The monitoring crystal can be rotated and placed in the target position in order to probe the quartz crystal oscillator surface without having to transfer it outside the chamber.
   
P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur, and W. Gropp, "Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand," Preprint ANL/MCS-P1422-0507, May 2007. The Sockets Direct Protocol (SDP) is an industry standard to allow existing TCP/IP sockets-based applications to be executed on high-speed networks such as InfiniBand (IB).  Like many other high-speed networks, IB requires the receiver process to inform the network interface card (NIC), before the data arrives, about buffers in which incoming data has to be placed.  To ensure that the receiver process is ready to receive data, the sender process typically performs flow-control on the data transmission.  Existing designs of SDP flow-control are naive and do not take advantage of several interesting features provided by IB.  Specifically, features such as RDMA are used only for performing zero-copy communication, although RDMA has more capabilities such as sender-side buffer management (where a sender process can manage SDP resources for the sender as well as the receiver).  Similarly, IB also provides hardware flow-control capabilities that have not been studied in previous literature.  In this paper, we utilize these capabilities to improve the SDP flow-control over IB using two designs: RDMA-based flow-control and NIC-assisted RDMA-based flow-control.  We evaluate the designs using micro-benchmarks and real applications.  Our evaluations reveal that these designs can improve the resource usage of SDP and consequently its performance by an order of magnitude in some cases.  Moreover, we can achieve 10-20% improvement for various applications.
   
Z. Insepov, "Surface Erosion and Modification by Energetic Ions," Preprint ANL/MCS-P1423-0507, May 2007. Interactions of Gas Cluster Ion Beams (GCIB) and Highly-Charged Ions (HCI) with solid surfaces are of fundamental and practical interest in such areas as nuclear fuels, TeV accelerators, extreme ultraviolet lithography (EUVL) source devices, HCI-driven SIMS for surface analysis, and protein desorption by HCI impacts.  Mitigation of high-voltage RF breakdowns and Q-slope is a major concern in the development of higher-field RF cavities for next-generation accelerators.  Surface treatment by the GCIB method has recently been proposed as a new way to significantly reduce the surface roughness and the dark current from the RF-cavity surfaces, enabling operation at higher acceleration gradients.
   
R. Thakur and W. Gropp, "Open Issues in MPI Implementation," Preprint ANL/MCS-P1426-0607, June 2007. MPI (the Message Passing Interface) continues to be the dominant programming model for parallel machines of all sizes, from small Linux clusters to the largest parallel supercomputers such as IBM Blue Gene/L and Cray XT3.  Although the MPI standard was released more than 10 years ago and a number of implementations of MPI are available from both vendors and research groups, there are many areas in which MPI implementations still need improvement.  In this paper, we discuss several such areas, including performance, scalability, fault tolerance, support for debugging and verification, topology awareness, collective communication, derived datatypes, and parallel I/O.  We also present results from some experiments with several MPI implementations (MPICH2, Open MPI, Sun, IBM) on a number of platforms (Linux clusters, Sun, and IBM SMPs) that demonstrate the need for performance improvement in one-sided communication and support for multithreaded programs.
   
L. C. McInnes, T. Dahlgren, J. Nieplocha, D. Bernholdt, B. Allan, R. Armstrong, D. Chavarria, W. Elwasif, I. Gorton, J. Kenny, M. Krishan, A. Malony, B. Norris, J. Ray, and S. Shende, "Research Initiatives for Plug-and-Play Scientific Computing," Preprint ANL/MCS-P1428-0607, June 2007. This paper introduces three component technology initiatives that focus on reducing the software development challenges faced by today's computational scientists.  Rapid advances and increasing diversity in high-performance hardware platforms continue to spur the growing complexity of scientific simulations.  The resulting environment presents ever-increasing productivity challenges associated with creating, managing, and applying simulation software to scientific discovery.  As key facets within the SciDAC Center for Technology for Advanced Scientific Component Software (TASCS), these initiatives leverage the component standard for scientific computing under development by the Common Component Architecture (CCA) Forum.  Component technology, which is now widely used in mainstream computing but has only recently begun to make inroads in high-performance computing (HPC), extends the benefits of object-oriented design by providing coding methodologies and supporting infrastructure to improve software's extensibility, maintainability, and reliability.  All three initiatives are based on the premise that, in addition to aiding software development, the component environment can facilitate the deployment of new computational capabilities to benefit the entire lifecycle of scientific simulation software.
   
I. Konkashbaev, P. Fischer, A. Hassanein, and N. V. Mokhov, "Enhancement of Heat Removal Using Concave Liquid Metal Targets for High-Power Accelerators," Preprint ANL/MCS-P1429-0607, June 2007. The need is increasing for development of high-power targets and beam dump areas for the production of intense beams of secondary particles.  The severe constraints arising from a megawatt beam deposited on targets and absorbers call for nontrivial procedures to dilute the beam.  This study describes the development of targets and absorbers and the advantages of using flowing liquid metal in concave channels first proposed by IFMIF to raise the liquid metal boiling point by increasing the pressure in liquid supported by a centrifugal force.  Such flow with a back-wall is subject to Taylor-Couette instability.  The instability can play a positive role of increasing the heat transfer from the hottest region in the target/absorber to the back-wall cooled by water.  Results of theoretical analysis and numerical modeling of both targets and dump areas for the IFMIF, ILC, and RIA facilities are presented.
   
W. Elwasif, B. Norris, B. Allan, and R. Armstrong, "Bocca: A Development Environment for HPC Components," Preprint ANL/MCS-P1430-0607, June 2007. In high-performance scientific software development, the emphasis is often on short time to first solution.  Even when the development of new components mostly reuses existing components or libraries and only small amounts of new code must be created, dealing with component glue code to obtain complete applications is still tedious and error-prone.  Component-based software meant to reduce complexity at the application level increases complexity with the attendant glue code.  To address these needs, we introduce Bocca, the first tool to enable application developers to perform rapid component prototyping while maintaining robust software-engineering practices suitable to HPC environments.  Bocca provides project management and a comprehensive build environment for creating and managing applications composed of Common Component Architecture components.  Of critical importance for HPC applications, Bocca is designed to operate in a language-agnostic way, simultaneously handling components written in any of the common HPC workstation languages:  C, C++, Fortran, Fortran77, Python, and Java.  Bocca automates the tasks related to the component glue code, freeing the user to focus on the scientific aspects of the application.  Bocca embraces the philosophy pioneered by Ruby on Rails for Web applications: Start with something that works and evolve it to the user's purpose.
   
G. Narayanaswamy, P. Balaji, and W. Feng, "An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multicore Environments," Preprint ANL/MCS-P1432-0607, June 2007. This paper analyzes the interactions between the protocol stack (TCP/IP or iWARP over 10-Gigabit Ethernet) and its multicore environment.  Specifically, for host-based protocols such as TCP/IP, we notice that a significant amount of processing is statically assigned to a single core, resulting in an imbalance of load on the different cores of the system and adversely impacting the performance of many applications.  For host-offloaded protocols such as iWARP, on the other hand, the portions of the communication stack that are performed on the host, such as buffering of messages and memory copies, are closely tied with the associated process and hence do not create such load imbalances.  Thus, in this paper, we demonstrate that by intelligently mapping different processes of an application to specific cores, the imbalance created by the TCP/IP protocol stack can be largely countered and application performance significantly improved.  At the same time, since the load is better balanced in host-offloaded protocols such as iWARP, such mapping does not adversely affect performance, thus keeping the mapping generic enough to be used with multiple protocol stacks.
   
S. Pervez, G. Gopalakrishnan, R. M. Kirby, R. Thakur, and W. Gropp, "Formal Methods Applied to HPC Software Design: A Case Study of Locking Based on MPI One-Sided Communication," Preprint ANL/MCS-P1433-0607, June 2007. There is a growing need to address the complexity of modeling and verifying the numerous concurrent protocols involved in high-performance computing (HPC) software design.  Finite-state modeling and formal verification (FV) using model-checking technology have made impressive inroads into debugging concurrent protocols.  However, there remains a dearth of research in applying model-checking methods to HPC software design.  This situation can be attributed to the lack of awareness by the HPC community about the potential of FV in the HPC arena, as well as lack of awareness by the FV community of the challenges that HPC application domains pose.  In this paper, we demonstrate the utility of finite-state modeling and model checking in detecting race conditions and deadlocks in concurrent protocols that arise while developing HPC software.  In particular, we detail a case study that develops a distributed byte-range locking algorithm using MPI's one-sided communication.  Our model-checking effort detected a race condition that can cause deadlock, of which the authors of the algorithm were unaware.  We describe two designs to fix the deadlock problem, and we present their formal analysis using model checking and their performance analysis on a 128-node cluster.  Our objective is to give practitioners, especially those developing parallel programs using libraries, the opportunity to accurately assess the costs and benefits of finite-state modeling and model checking.
   
K. Abhishek, S. Leyffer, and J. T. Linderoth, "Modeling without Categorical Variables: A Mixed-Integer Nonlinear Program for the Optimization of Thermal Insulation Systems," Preprint ANL/MCS-P1434-0607, June 2007. Optimal design applications are often modeled by using categorical variables to express discrete design decisions, such as material types.  A disadvantage of using categorical variables is the lack of continuous relaxations, which precludes the use of modern integer programming techniques.  We show how to express categorical variables with standard integer modeling techniques, and we illustrate this approach on a load-bearing thermal insulation system.  The system consists of a number of insulators of different materials and intercepts that minimize the heat flow from a hot surface to a cold surface.  Our new model allows us to employ black-box modeling languages and solvers and illustrates the interplay between integer and nonlinear modeling techniques.  We present numerical experience that illustrates the advantage of the standard integer model.
   
M. Min, P. F. Fischer, and Y.-C. Chae, "Spectral Element Discontinuous Galerkin Simulations for Wake Potential Calculations: NEKCEM," Preprint ANL/MCS-P1435-0607, June 2007. In this paper we present high-order spectral element discontinuous Galerkin simulations for wake field and wake potential calculations.  Numerical discretizations are based on body-conforming hexagonal meshes on Gauss-Lobatto-Legendre grids.  We demonstrate wake potential profiles for cylindrically symmetric cavity structures in 3D including the cases for linear and quadratic transitions between two cross sections.  Wake potential calculations are carried out on 2D surfaces for various bunch sizes.
   
M. Min, Y.-H. Chin, P. F. Fischer, Y.-C. Chae, and K.-J. Kim, "Fourier Spectral Simulations for Wake Fields in Conducting Cavities," Preprint ANL/MCS-P1436-0607, June 2007. We investigate Fourier spectral time-domain simulations applied to wake field calculations in two-dimensional cylindrical structures.  The scheme involves second-order explicit leap-frogging in time and the Fourier spectral approximation in space, which is obtained by simply replacing the spatial differentiation operator of the Yee scheme with the Fourier differentiation operator on non-staggered grids.  This is a first step toward investigating high-order computational techniques based on the Fourier spectral method, which is relatively simple to implement, and toward enhancing performance in comparison with the conventional lower-order method.
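As a rough illustration of the Fourier differentiation operator mentioned in this abstract, the sketch below differentiates a periodic function sampled on a uniform, non-staggered grid by multiplying its discrete Fourier coefficients by i times the wavenumber.  It is a one-dimensional toy under stated assumptions, not the authors' solver: the function names `dft` and `fourier_derivative` are hypothetical, and a plain O(n^2) transform is used so that the example needs only the Python standard library.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (hypothetical helper)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        total = sum(
            values[j] * cmath.exp(sign * 2j * math.pi * j * k / n)
            for j in range(n)
        )
        # The inverse transform carries the 1/n normalization.
        out.append(total / n if inverse else total)
    return out


def fourier_derivative(samples, length):
    """Differentiate periodic samples on a uniform grid of the given period."""
    n = len(samples)
    coeffs = dft(samples)
    scaled = []
    for k in range(n):
        # Map the DFT index to a signed wavenumber index.
        m = k if k < n / 2 else k - n
        # Zero the Nyquist mode for even n so the derivative stays real.
        if n % 2 == 0 and k == n // 2:
            m = 0
        scaled.append(1j * (2.0 * math.pi * m / length) * coeffs[k])
    return [value.real for value in dft(scaled, inverse=True)]


if __name__ == "__main__":
    n = 16
    period = 2.0 * math.pi
    grid = [period * j / n for j in range(n)]
    approx = fourier_derivative([math.sin(x) for x in grid], period)
    error = max(abs(a - math.cos(x)) for a, x in zip(approx, grid))
    print("max error:", error)
```

For a smooth periodic function such as sin(x), the error is at the level of floating-point roundoff even on a coarse grid, which is the appeal of spectral differentiation over low-order finite differences.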
   
J. J. Moré, T. S. Munson, and J. Sarich, "Optimization in SciDAC Applications," Preprint ANL/MCS-P1437-0707, July 2007. We present a brief overview of optimization tools that are being developed for SciDAC applications.  We emphasize derivative-free and gradient-based methods since these tools make minimal demands on the user and the application.  We discuss the performance of these tools and point out developments that have led to significant improvements in performance.  A parameter estimation problem that arises in nuclear fission is used to illustrate the challenges that arise as we attack nonlinear, noisy, computationally intensive optimization applications.
   
T. Peterka, R. L. Kooima, D. J. Sandin, A. Johnson, J. Leigh, T. A. DeFanti, "Advances in the Dynallax Solid-State Dynamic Parallax Barrier Autostereoscopic Visualization Display System," Preprint ANL/MCS-P1438-0707, July 2007. A solid-state dynamic parallax barrier autostereoscopic display mitigates some of the restrictions present in static barrier systems, such as fixed view-distance range, slow response to head movements, and fixed stereo operating mode.  By dynamically varying barrier parameters in real time, viewers may move closer to the display and move faster laterally than with a static barrier system, and the display can switch between 3D and 2D modes by disabling the barrier on a per-pixel basis.  Moreover, Dynallax can output four independent eye channels when two viewers are present, and both head-tracked viewers receive an independent pair of left-eye and right-eye perspective views based on their position in 3D space.  The display device is constructed by using a dual-stacked LCD monitor where a dynamic barrier is rendered on the front display and a modulated virtual environment composed of two or four channels is rendered on the rear display.  Dynallax was recently demonstrated in a small-scale head-tracked prototype system.  This paper summarizes the concepts presented earlier, extends the discussion of various topics, and presents recent improvements to the system.
   
J. L. Träff, W. Gropp, and R. Thakur, "Self-Consistent MPI Performance Requirements," Preprint ANL/MCS-P1439-0707, July 2007. The MPI Standard does not make any performance guarantees, but users expect (and like) MPI implementations to deliver good performance.  A common-sense expectation of performance is that an MPI function should perform no worse than a combination of other MPI functions that can implement the same functionality.  In this paper, we formulate some performance requirements and conditions that good MPI implementations can be expected to fulfill by relating aspects of the MPI standard to each other.  Such a performance formulation could be used by benchmarks and tools, such as SKaMPI and Perfbase, to automatically verify whether a given MPI implementation fulfills basic performance requirements.  We present examples where some of these requirements are not satisfied, demonstrating that there remains room for improvement in MPI implementations.
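The kind of automatic check this abstract envisions can be sketched as follows.  Each rule states that one MPI function should take no longer than a combination of other functions that provides the same functionality; for example, MPI_Allreduce can be expressed as MPI_Reduce followed by MPI_Bcast.  The rule list, the function names `check_requirement` and `find_violations`, and the timing values are illustrative assumptions; they are not taken from the paper, SKaMPI, or Perfbase.

```python
# Illustrative self-consistency rules: (function, functions that emulate it).
# These follow well-known functional equivalences and are assumptions here.
REQUIREMENTS = [
    ("MPI_Allreduce", ("MPI_Reduce", "MPI_Bcast")),
    ("MPI_Allgather", ("MPI_Gather", "MPI_Bcast")),
    ("MPI_Bcast", ("MPI_Scatter", "MPI_Allgather")),
]


def check_requirement(timings, function, parts, tolerance=0.0):
    """Return True if `function` is no slower than the sum of `parts`."""
    bound = sum(timings[name] for name in parts)
    return timings[function] <= bound * (1.0 + tolerance)


def find_violations(timings, requirements=REQUIREMENTS, tolerance=0.0):
    """List every rule that the measured timings fail to satisfy."""
    return [
        (function, parts)
        for function, parts in requirements
        if not check_requirement(timings, function, parts, tolerance)
    ]


if __name__ == "__main__":
    # Made-up timings in microseconds for one message size and process count.
    measured = {
        "MPI_Allreduce": 120.0,
        "MPI_Reduce": 40.0,
        "MPI_Bcast": 35.0,
        "MPI_Allgather": 60.0,
        "MPI_Gather": 30.0,
        "MPI_Scatter": 20.0,
    }
    for function, parts in find_violations(measured):
        print(function, "is slower than", " + ".join(parts))
```

With these made-up numbers only the MPI_Allreduce rule is reported, since 120 exceeds 40 + 35.  A real tool would repeat the check across message sizes and process counts and would allow a tolerance for measurement noise.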
   
S. Pervez, G. Gopalakrishnan, R. M. Kirby, R. Palmer, R. Thakur, and W. Gropp, "Practical Model-Checking Method for Verifying Correctness of MPI Programs," Preprint ANL/MCS-P1440-0707, July 2007. Formal program verification often requires creating a model of the program and running it through a model-checking tool.  However, this model-creation step is itself error prone, tedious, and difficult for someone not familiar with formal verification.  In this paper, we describe a tool for verifying correctness of MPI programs that does not require the creation of a model and instead works directly on the MPI program.  Our tool uses the MPI profiling interface, PMPI, to trap MPI calls and hand over control of the MPI function execution to a scheduler.  The scheduler verifies correctness of the program by executing all "relevant" interleavings of the program.  The scheduler records an initial trace and replays its interleaving variants by using dynamic partial-order reduction.  We describe the design and implementation of the tool and compare it with our previous work based on model checking.
   
J. Bresnahan, R. Kettimuthu, M. Link, I. Foster, "Harnessing Multicore Processors for High-Speed Secure Transfer," Preprint ANL/MCS-P1442-0707, July 2007. A growing need for ultra-high-speed data transfers has motivated continued improvements in the transmission speeds of the physical network layer.  As researchers develop protocols and software to operate over such networks, they often fail to account for security.  The processing power required to encrypt or sign packets of data can significantly decrease transfer rates, and thus security is often sacrificed for throughput.  Emerging multicore processors provide a higher ratio of CPUs to network interfaces and can, in principle, be used to accelerate encrypted transfers by applying multiple processing and network resources to a single transfer.  We discuss the attributes that network protocols and software must have to exploit such systems.  In particular, we study how these attributes may be applied in the GridFTP code distributed with the Globus Toolkit.  GridFTP is a well-accepted and robust protocol for high-speed data transfer.  It has been shown to scale to near-network speeds.  While GridFTP can provide encrypted and protected data transfers, it historically suffers transfer performance penalties when these features are enabled.  We present configurations to the Globus GridFTP server that can achieve fully encrypted high-speed data transfers.
   
K. Keahey, T. Freeman, J. Lauret, and D. Olson, "Virtual Workspaces for Scientific Applications," Preprint ANL/MCS-P1443-0707, July 2007. One of the primary obstacles users face in Grid computing is that Grids are typically composed of many diverse resources, while applications require a very specific, customized environment to run in.  Many applications are dependency-rich and complex, making it hard to run them on anything but a dedicated platform.  Worse, even if the applications do run there, the results they produce may not be consistent across different runs.  As part of the Center for Enabling Distributed Petascale Science (CEDPS) project, we have been developing the Workspace service, which allows authorized Grid clients to dynamically provision environments in the Grid.  Virtual machines provide an excellent implementation of a portable environment, as they allow users to configure an environment and then deploy it on a variety of platforms.  This paper describes a proof of concept of this strategy developed for the High-Energy Physics STAR application.  We are currently building on this work to enable production STAR runs in virtual machines.
   
L. Yang, C. Liu, J. M. Schopf, and I. Foster, "Anomaly Detection and Diagnosis in Grid Environments," Preprint ANL/MCS-P1444-0707, July 2007. Identifying and diagnosing anomalies in application behavior is critical to delivering reliable application-level performance.  In this paper we introduce a strategy to detect anomalies and diagnose the possible reasons behind them.  Our approach extends the traditional window-based strategy by using signal-processing techniques to filter out recurring, background fluctuations in resource behavior.  In addition, we have developed a diagnosis technique that uses standard monitoring data to determine where related changes in behavior occur at the times of the anomalies.  We evaluate our anomaly detection and diagnosis technique by applying it in three contexts and inserting anomalies into the system at random intervals.  The experimental results show that our strategy detects up to 96% of anomalies while reducing the false positive rate by up to 90% compared to the traditional window average strategy.  In addition, our strategy can diagnose the reason for the anomaly approximately 75% of the time.
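For readers unfamiliar with the baseline this abstract compares against, one common form of a window-average detector can be sketched as below: a sample is flagged when it deviates from the mean of the preceding window by more than a chosen number of standard deviations.  This shows only an assumed baseline; the abstract does not give the exact rule, and the signal-processing filter and the diagnosis step it describes are not reproduced.  The function name `window_anomalies` and the parameters `window` and `threshold` are hypothetical.

```python
from statistics import mean, pstdev


def window_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the prior window."""
    flagged = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        mu = mean(recent)
        sigma = pstdev(recent)
        if sigma == 0:
            # A perfectly flat window: any change at all counts as a deviation.
            deviates = samples[i] != mu
        else:
            deviates = abs(samples[i] - mu) > threshold * sigma
        if deviates:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up measurements (for example, transfer times in seconds).
    series = [10, 11, 10, 11, 10, 11, 50, 10, 11]
    print(window_anomalies(series))
```

A detector of this kind reacts to any recurring background fluctuation as if it were an anomaly, which is the source of false positives that the filtering approach in the abstract is meant to reduce.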
   
A. Baranovski, S. Bharathi, J. Bresnahan, A. Chervenak, I. Foster, D. Fraser, T. Freeman, D. Gunter, K. Jackson, K. Keahey, C. Kesselman, D. E. Konerding, N. Leroy, M. Link, M. Livny, N. Miller, R. Miller, G. Oleynik, L. Pearlman, J. M. Schopf, R. Schuler, and B. Tierney, "Enabling Distributed Petascale Science," Preprint ANL/MCS-P1445-0707, July 2007. Petascale science is an end-to-end endeavor, involving not only the creation of massive datasets at supercomputers or experimental facilities, but the subsequent analysis of that data by a user community that may be distributed across many laboratories and universities.  The new SciDAC Center for Enabling Distributed Petascale Science (CEDPS) is developing tools to support this end-to-end process.  These tools include data placement services for the reliable, high-performance, secure, and policy-driven placement of data within a distributed science environment; tools and techniques for the construction, operation, and provisioning of scalable science services; and tools for the detection and diagnosis of failures in end-to-end data placement and distributed application hosting configurations.  In each area, we build on a strong base of existing technology and have made useful progress in the first year of the project.  For example, we have recently achieved order-of-magnitude improvements in transfer times (for lots of small files) and implemented asynchronous data staging capabilities; demonstrated dynamic deployment of complex application stacks for the STAR experiment; and designed and deployed end-to-end troubleshooting services.  We look forward to working with SciDAC application and technology projects to realize the promise of petascale science.  More details can be found at www.cedps.net.
   
N. Desai, E. Lusk, A. Cherry, and T. Voran, "The Computer as Software Component: A Mechanism for Developing and Testing Resource Management Software," Preprint ANL/MCS-P1447-0707, July 2007. Ultrascale system software is difficult to develop, debug, and test.  In this paper, we present an architecture that encapsulates system hardware inside a component architecture used for execution and simulation.  This approach yields a number of novel benefits, including dramatically improved debug and testing capabilities.
   
P. Balaji, W. Feng, S. Bhagvat, D. K. Panda, R. Thakur, and W. Gropp, "Analyzing the Impact of Supporting Out-of-Order Communication on In-order Performance with iWARP," Preprint ANL/MCS-P1448-0707, July 2007. Due to the growing need to tolerate network faults and congestion in high-end computing systems, supporting multiple network communication paths is becoming increasingly important.  However, multi-path communication comes with the disadvantage of having to deal with the out-of-order arrival of packets (because packets may traverse different paths).  While modern networking stacks such as the Internet Wide-Area RDMA Protocol (iWARP) over 10-Gigabit Ethernet (10GE) support multi-path communication, they do not handle out-of-order packets, primarily owing to the overhead that doing so adds to in-order communication.  Specifically, in iWARP, supporting out-of-order packets requires every packet to carry additional information, causing significant overhead on packets that arrive in order.  Thus, in this paper, we analyze the trade-offs in designing a feature-complete iWARP stack, i.e., one that provides support for out-of-order arriving packets, and thus, multi-path systems, while focusing on the performance of in-order communication.  We propose three feature-complete designs of iWARP and analyze the pros and cons of each of these designs using performance experiments based on several micro-benchmarks as well as an iso-surface visual rendering application.  Our analysis reveals that the iWARP design providing the best overall performance depends on the particular characteristics of the upper layers and that different designs are optimal based on the metric of interest.
   
D. R. Dechow, B. Norris, J. Amundson, "The Common Component Architecture for Particle Accelerator Simulations," Preprint ANL/MCS-P1449-0807, August 2007. Synergia2 is a beam dynamics modeling and simulation application for high-energy accelerators such as the Tevatron at Fermilab and the International Linear Collider, which is now under planning and development.  Synergia2 is a hybrid, multilanguage software package composed of two separate accelerator physics packages (Synergia and MaryLie/Impact) and one high-performance computer science package (PETSc).  We describe our approach to producing a set of beam dynamics-specific software components based on the Common Component Architecture specification.  Among other topics, we describe particular experiences with the following tasks: using Python steering to guide the creation of interfaces and to prototype components; working with legacy Fortran codes; and an example component-based, beam dynamics simulation.
   
J. N. Lyness, "Numerical Evaluation of a Fixed-Amplitude Variable-Phase Integral," Preprint ANL/MCS-P1450-0807, June 2007. We treat the evaluation of a fixed-amplitude variable-phase integral of the form ∫_a^b exp[ikG(x)] dx, where G'(x) ≥ 0 and G has moderate differentiability in the integration interval.  In particular, we treat in detail the case in which G'(a) = G'(b) = 0, but G''(a)G''(b) < 0.  For this, we re-derive a standard asymptotic expansion in inverse half-integer powers of k.  However, this derivation provides straightforward expressions for the coefficients in terms of derivatives of G at the end points.  Thus, it can be used to evaluate the integrals in cases where k is large.  We indicate the generalizations to the theory required to cover cases where the oscillator function G has higher-order zeros at either or both end points, but this is not treated in detail.  In the simpler case in which G'(a)G'(b) > 0, this approach recovers a special case of a recent result due to Iserles and Nørsett.
   
S. Park, "Weights and Acceptance Ratios in Generalized Ensemble Simulations," Preprint ANL/MCS-P1451-0807, August 2007. This paper addresses issues related to weights and acceptance ratios in generalized ensemble simulations (GES), while comparing two algorithms of GES: serial (e.g., simulated tempering) and parallel (e.g., parallel tempering or replica exchange).  We derive a cumulant approximation formula for optimal weights in the serial GES and discuss its effectiveness in practical applications.  We compare the acceptance ratios of the serial and parallel GES and prove that, provided optimal weights are used, the serial GES has higher acceptance ratios than does the parallel GES.  The duality between forward and reverse transitions is at the heart of the derivations throughout the paper.
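The two acceptance rules being compared can be written down concretely with the standard Metropolis criterion.  The sketch below assumes temperature as the ensemble parameter and a joint distribution proportional to exp(-beta_i E + g_i) for the serial case, where g_i is the weight of ensemble i; this sign convention, the function names, and the numerical values are assumptions made for illustration and may differ from the notation in the paper.

```python
import math


def serial_acceptance(beta_i, beta_j, energy, g_i, g_j):
    """Metropolis probability of moving one replica from ensemble i to j.

    Assumes the joint distribution is proportional to exp(-beta * E + g).
    """
    log_ratio = -(beta_j - beta_i) * energy + (g_j - g_i)
    # exp(min(0, x)) equals min(1, exp(x)) and avoids overflow for large x.
    return math.exp(min(0.0, log_ratio))


def parallel_acceptance(beta_i, beta_j, energy_i, energy_j):
    """Metropolis probability of swapping the replicas at beta_i and beta_j."""
    log_ratio = (beta_i - beta_j) * (energy_i - energy_j)
    return math.exp(min(0.0, log_ratio))


if __name__ == "__main__":
    # Made-up inverse temperatures, energies, and weights.
    print(serial_acceptance(0.5, 1.0, energy=2.0, g_i=0.0, g_j=0.8))
    print(parallel_acceptance(1.0, 0.5, energy_i=2.0, energy_j=3.0))
```

In the serial rule the weights g_i enter the acceptance probability directly, which is why their choice matters; the parallel rule needs no weights because the swap leaves the set of occupied ensembles unchanged.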
   
P. Balaji, W. Feng, J. Archuleta, H. Lin, R. Kettimuthu, R. Thakur, and X. Ma, "ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing," Preprint ANL/MCS-P1452-0807, August 2007. BLAST is a widely used software toolkit for genomic sequence search.  mpiBLAST is a freely available, open-source parallelization of BLAST that uses database segmentation to allow different worker processors to search (in parallel) unique segments of the database.  After searching, the workers write their output to a filesystem.  While mpiBLAST has been shown to achieve high performance in clusters with fast local filesystems, its I/O processing remains a concern for scalability, especially in systems having limited I/O capabilities such as those using distributed filesystems spread across a wide-area network.  Thus, we present ParaMEDIC, an environment that decouples computation and I/O in distributed environments for applications such as mpiBLAST and dramatically reduces I/O overhead through metadata processing.  Specifically, for mpiBLAST, ParaMEDIC partitions worker processes into compute and I/O workers.  Compute workers, instead of directly writing output to the distributed filesystem, convert their output to metadata and send it to I/O workers.  I/O workers, which physically reside closer to the actual storage, then process this metadata to re-create the actual output and write it to the filesystem.  This approach allows ParaMEDIC to cut down on the I/O time, thus accelerating mpiBLAST by as much as 25-fold in some cases.
   
K. Iskra, J. W. Romein, K. Yoshii, and P. Beckman, "ZOID: I/O-Forwarding Infrastructure for Petascale Architectures," Preprint ANL/MCS-P1453-0807, August 2007. The ZeptoOS project is developing an open-source alternative to the proprietary software stacks available on contemporary massively parallel architectures.  The aim is to enable computer science research on these architectures, enhance community collaboration, and foster innovation.  In this paper, we introduce a component of ZeptoOS called ZOID---an I/O-forwarding infrastructure for architectures such as IBM Blue Gene that decouple file and socket I/O from the compute nodes, shipping those functions to dedicated I/O nodes.  Through the use of optimized network protocols and data paths, as well as a multithreaded daemon running on I/O nodes, ZOID provides greater performance than does the stock infrastructure.  We present a set of benchmark results that highlight the improvements.  Our infrastructure also offers vastly improved flexibility, allowing users to forward data using custom-designed application interfaces, through an easy-to-use plug-in mechanism.  This capability is used for real-time telescope data transfers, extensively discussed in the paper.  Plug-in-specific threads implement prefetching of data obtained over sockets from an input cluster and merge results from individual compute nodes before sending them out, significantly reducing required network bandwidth.  This approach allows a ZOID version of the application to handle a larger number of subbands per I/O node, or even to bypass the input cluster altogether, plugging the input from remote receiver stations directly into the I/O nodes.  Using the resources more efficiently can result in considerable savings.
   
A. A. Rodriguez, T. Bompada, M. Syed, P. K. Shah, and N. Maltsev, "Evolutionary Analysis of Enzymes Using Chisel," Preprint ANL/MCS-P1454-0807, August 2007. Availability of large volumes of genomic and enzymatic data for taxonomically and phenotypically diverse organisms allows for exploration of the adaptive mechanisms that led to diversification of enzymatic functions.  We present Chisel, a computational framework and a pipeline for an automated, high-resolution analysis of evolutionary variations of enzymes.  Chisel allows automatic as well as interactive identification and characterization of enzymatic sequences.  Such knowledge can be used for comparative genomics, microbial diagnostics, metabolic engineering, drug design, and analysis of metagenomes.  Chisel is a comprehensive resource that contains 8,575 clusters and subsequent computational models specific for 939 distinct enzymatic functions and, when data is sufficient, their taxonomic variations.  Applications of Chisel to the identification of enzymatic sequences in newly sequenced genomes, analysis of organism-specific metabolic networks, "binning" of metagenomes, and other biological problems are presented.  We also provide a thorough analysis of Chisel performance with similar resources and manual annotations on the Shewanella oneidensis MR-1 genome.  Chisel is available for interactive use at http://compbio.mcs.anl.gov/CHISEL.  The website also provides a user manual, clusters, and function-specific computational models.  Additional data can be found at http://compbio.mcs.anl.gov/CHISEL/htmls/refs.html.
   
J. Bresnahan, M. Link, G. Khanna, Z. Imani, R. Kettimuthu, and I. Foster, "Globus GridFTP: What's New in 2007," Preprint ANL/MCS-P1458-0907, September 2007. GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks.  It is based on the Internet FTP protocol, and it defines extensions for high-performance operation and security.  The Globus implementation of GridFTP provides a software suite optimized for the gamut of data access issues---from bulk file transfer to the details of getting data out of complex storage systems within sites.  We summarize some recent developments in Globus GridFTP.
   
C. Zhang, M. G. Knepley, D. A. Yuen, and Y. Shi, "Two New Approaches in Solving the Nonlinear Shallow Water Equations for Tsunamis," Preprint ANL/MCS-P1459-0907, September 2007. One key component of tsunami research is numerical simulation of tsunamis, which helps us to better understand the fundamental physics and phenomena and leads to better mitigation decisions.  However, writing the simulation program itself imposes a large burden on the user.  In this survey, we review some of the basic ideas behind the numerical simulation of tsunamis, and introduce two new approaches to construct the simulation using powerful, general-purpose software kits, PETSc and FEPG.  PETSc and FEPG support various discretization methods such as finite-difference, finite-element and finite-volume, and provide a stable solution to the numerical problem.  Our application uses the nonlinear shallow-water equations in Cartesian coordinates as the governing equations of tsunami wave propagation.
   
J. N. Brooks and J. P. Allain, "Particle Deposition and Optical Response of ITER Motional Stark Effect Diagnostic First Mirrors," Preprint ANL/MCS-P1460-1007, October 2007. Particle deposition/erosion can affect mirrors used in plasma diagnostics, and this is a major concern for future fusion reactors.  This subject is analyzed for the first and second mirrors of the proposed Motional Stark Effect edge plasma current diagnostic for ITER.  Particle fluxes to the diagnostic module aperture are given by edge-plasma/impurity-transport solutions for convective plasma flow for full power fusion conditions.  The MC-Mirror code with input of TRIM-SP results is used to compute in-module direct, reflected, and sputtered particle transport.  Particles analyzed are D-T and He atoms/ions from the plasma, and Fe, Be, and W from first wall sputtering and/or in-module sputtering.  Many of the results are encouraging for optical diagnostic use in ITER, and possibly for post-ITER high duty factor reactors.  The LLNL-4B module design analyzed works well in minimizing particle flux to the mirrors, with a factor of ~200-400 reduction in aperture-to-first-mirror flux.  Sputtering erosion/degradation of Mo or Rh coated mirrors by incident D, T, and He is negligible.  IMD optical effects code analysis shows probably tolerable changes in light reflection and polarization due to mirror beryllium deposition.  Tungsten flux to the mirrors is very low.  Based on available but limited data, however, there is major concern about the effect of the predicted helium flux on mirror optical properties.