Final Report: Integrating Numerical Models and Monitoring Data

EPA Grant Number: R829402C002
Subproject: this is subproject number 002, established and managed by the Center Director under grant R829402
(EPA does not fund or establish subprojects; EPA awards and manages the overall grant for this center).

Center: Center for Integrating Statistical and Environmental Science
Center Director: Stein, Michael
Title: Integrating Numerical Models and Monitoring Data
Investigators: Stein, Michael; Amit, Yali; Beletsky, Dmitry; Chen, Li; Kotamarthi, V. Rao; Lesht, Barry; Nakamura, Noboru; Schwab, David; Stroud, Jonathan; Zhang, Zepu
Institution: Argonne National Laboratory, National Oceanic and Atmospheric Administration, University of Chicago, University of Michigan, University of Pennsylvania
EPA Project Officer: Smith, Bernice
Project Period: March 12, 2002 through March 11, 2007
RFA: Environmental Statistics Center (2001)
Research Category: Ecological Indicators/Assessment/Restoration, Environmental Statistics

Description:

Objective:

This project addressed statistical and scientific issues arising in the use of monitoring data to evaluate and improve numerical models for the physical environment and to combine monitoring data and numerical models to obtain improved pollution maps and forecasts. A total of eleven subprojects were supported over the course of this project, but in summarizing the project’s accomplishments, it is more helpful to divide the work into three broader categories: statistical modeling and inference, numerical model evaluation, and combining numerical model output and measurements.

The methods we developed were applied to numerous datasets, including air pollutants such as ozone, carbon monoxide and sulfates, total column ozone as measured by ozonesondes and satellites, wind speeds, precipitation and various quantities in Lake Michigan including sediment levels, chlorophyll levels and advections. The main numerical model we used was CMAQ, EPA’s main air quality model, but we also considered numerical models for hydrodynamics and sediment transport. Our work encompassed a broad range of temporal (every ten minutes to annual) and spatial (urban area to global) scales.

This project supported a total of seven graduate students and two postdocs. The graduate students, year of graduation and their present positions are: Leah Welty, 2003, Assistant Professor, Department of Preventive Medicine, Northwestern University; Ethan Anderes, 2005, NSF Mathematical Sciences Postdoctoral Research Fellow, Department of Statistics, University of California, Berkeley (will become Assistant Professor, Department of Statistics, University of California, Davis, fall 2008); Mikyoung Jun, 2005, Assistant Professor, Department of Statistics, Texas A & M University; Hae Kyoung Im, 2005, private industry; Xiaofeng Shao, 2006, Assistant Professor, Department of Statistics, University of Illinois Urbana-Champaign; Chae Young Lim, 2007, Assistant Professor, Department of Statistics, Michigan State University; Darongsae Kwon, degree expected in 2009. The two postdocs were Li Chen, presently a lecturer in the Statistics Group of the Department of Statistics in the University of Bristol, UK and Zepu Zhang, presently an Associate Specialist II in the Department of Civil and Environmental Engineering, University of California, Berkeley.

Our EPA collaborators were Jason Ching, Alice Gilliland, Robin Dennis and Peter Finkelstein.

Summary/Accomplishments (Outputs/Outcomes):

The development of statistical models and methods for spatial-temporal processes was central to much of this work. A large part of this work was directly related to the role of such statistical models in studying numerical models, but another large part was directed at statistical models for spatial-temporal processes more generally. The development of appropriate spatial-temporal statistical models for environmental processes is fundamental to the advancement of environmetrics. Furthermore, we contend it is only by understanding the statistical properties of environmental processes that it will be possible to evaluate properly the ability of numerical models to describe environmental processes accurately.

  1. Statistical modeling and inference
    We developed new ways of thinking about space-time covariance functions, new classes of models and their properties. We expect this work to have a broad impact on the future theoretical development of space-time statistical models, the models actually used in environmental applications and the statistical methods employed to assess the adequacy of such models. This work formed a building block for much of the rest of the work at CISES, for which space-time statistical models were frequently needed.

    Some of the important developments in this work include a better understanding of the relationship between space-time covariance functions and properties of the corresponding process, including reversibility in time (sometimes called full symmetry, Gneiting (2002)), Markov properties in time and local behavior in space and time simultaneously (Stein, 2005b). In considering space-time processes on global scales, we have necessarily addressed the statistical modeling of space-time processes on the sphere and developed a number of approaches for obtaining reasonably parsimonious models that allow for spatial dependence to depend on latitude, which is a prominent feature of most atmospheric processes when considered on a global scale (Stein, 2007b,c, Jun and Stein, 2007a,b).

    Another major focus of this work was on the development of modeling strategies, computational methods and diagnostics for monitoring data collected at regular time intervals over a long period of time. Bringing together ideas from multiple time series and spatial statistics was key to this work. In particular, in Stein (2005c), we developed an approach to modeling monitoring data that is spectral in the time domain and spatial in the space domain and yields flexible and interpretable models that can be applied relatively easily to the analysis of regular monitoring data. We view this approach as an important alternative to elaborate parametric modeling of space-time covariance functions that has been the focus of most recent statistical research (including our own) in this area. A key component of this work is the development of fast and effective spectral and time domain approximations to the likelihood function for regular monitoring data. We developed spectral and time domain diagnostics for assessing the goodness of fit of models for space-time covariance functions and demonstrated their effectiveness at illuminating misfits of models that cannot be seen through standard diagnostics from spatial statistics.
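As a rough illustration of a spectral-domain likelihood approximation for regularly spaced data, the sketch below uses the classical Whittle approximation for a single AR(1) time series. It is not the space-time approximation of Stein (2005c); the AR(1) model, the function names and the grid search over the autoregressive parameter are all assumptions chosen to keep the example short.

```python
import numpy as np

def simulate_ar1(n, phi, rng):
    """Simulate a stationary AR(1) series with unit innovation variance."""
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0] / np.sqrt(1 - phi**2)   # draw from the stationary distribution
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

def whittle_profile_loglik(x, phi):
    """Whittle log-likelihood for AR(1), with the innovation variance profiled out.

    Returns the profile log-likelihood (up to a constant) and the
    corresponding estimate of the innovation variance.
    """
    n = len(x)
    j = np.arange(1, (n - 1) // 2 + 1)            # positive Fourier frequencies
    omega = 2 * np.pi * j / n
    periodogram = np.abs(np.fft.fft(x - x.mean())[j])**2 / (2 * np.pi * n)
    # AR(1) spectral density is sigma2 * g / (2 * pi)
    g = 1.0 / np.abs(1 - phi * np.exp(-1j * omega))**2
    sigma2 = 2 * np.pi * np.mean(periodogram / g)
    return -(len(j) * np.log(sigma2) + np.sum(np.log(g))), sigma2

rng = np.random.default_rng(0)
x = simulate_ar1(2000, 0.6, rng)

# Maximize the profile Whittle likelihood over a grid of phi values.
grid = np.linspace(-0.95, 0.95, 191)
values = [whittle_profile_loglik(x, p)[0] for p in grid]
phi_hat = float(grid[int(np.argmax(values))])
sigma2_hat = whittle_profile_loglik(x, phi_hat)[1]
print(phi_hat, sigma2_hat)
```

Each likelihood evaluation costs one fast Fourier transform plus a sum over frequencies, which is what makes approximations of this kind attractive for long monitoring records.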

    When observation locations of stationary spatial processes do not fall in some regular pattern, it is standard practice to use parametric forms for the covariance function to guarantee that the resulting estimated covariance function is positive definite, which is a necessary requirement for any valid covariance function. An alternative approach is to model the covariance function in the spectral domain, which is more suited to nonparametric approaches, since the spectral density of a valid covariance function merely needs to be nonnegative and with finite total mass. However, irregularly sited observations do not easily lend themselves to spectral methods and, although there have been past attempts to address this problem, the methods have serious statistical failings. Im, Stein and Zhu (2007) developed a likelihood-based approach to spectral density estimation for isotropic Gaussian spatial processes that avoids parametric specification of the covariance function but permits the statistical efficiency of using the likelihood. Simulations and an application to rainfall data demonstrated the clear superiority of this approach over past approaches to nonparametric estimation of the spectral density.

    A fundamental problem in statistical analysis of spatial and spatial-temporal processes is to take account of uncertainties in models or in model parameters on subsequent predictions. It is standard practice in geostatistics to ignore this source of uncertainty, but with the increase in computational power and the development of Markov chain Monte Carlo methods, it is becoming more common to use Bayesian approaches to prediction, which provide a natural framework for accounting for this source of uncertainty. However, for larger datasets, the computations required for a Bayesian approach can be formidable and, furthermore, one has to take great care when selecting priors for spatial processes (Berger, de Oliveira and Sanso, 2001). Parametric bootstrapping (Putter and Young, 2001) could also be used to account for this uncertainty, but again, requires a potentially heavy computational burden. Our approach is a broad generalization of Satterthwaite approximations and bears some relation to work by David Harville and his collaborators on predictive inference for mixed models (see, e.g., Harville and Carriquiry, 1992). We are using asymptotic expansions and simulation studies to examine these various approximations, each of which requires only slightly more computational effort than needed to find maximum likelihood estimates of unknown parameters. This work represents the thesis work of doctoral candidate Darongsae Kwon and it will continue without CISES funding until she completes her dissertation.

    Satellite-based datasets are generally far too large to allow one to do all of the calculations one might wish to for statistical purposes. For example, even a single calculation of the likelihood function (and any Bayesian analysis requires very many such calculations) for a Gaussian process model with ungridded spatial data is generally undoable for more than about 10,000 observations. Cressie and Johannesson (2008) showed how models based on finite series expansions for covariance functions make exact computations possible even for very large spatial datasets. Following on this idea, Stein (2007b) considered models for spatial variation of total column ozone on a global scale using truncated expansions in spherical harmonics and applied these models to TOMS measurements of total column ozone. Stein (2007b) showed these expansions are unable to describe the local behavior of the ozone process accurately, which Stein (2007c) addressed by adding a compactly supported covariance function to the model. This modification yielded a much better fit to the local behavior of the process while still allowing exact calculations, but there was still evidence of model misfit, so further work is necessary to address the considerable challenge of finding a practically usable model that accurately describes both the local and large-scale statistical behavior of total column ozone. Jun and Stein (2007b) described an alternative modeling approach to global data that makes use of differential operators applied to homogeneous processes on the sphere introduced in Jun and Stein (2007a). The models were applied to gridded satellite data, which makes it possible to do exact likelihood calculations without any special structure to the covariance functions.

    Environmental processes frequently exhibit spatial nonstationarity. In an influential paper, Sampson and Guttorp (1992) supposed that a spatial process is stationary and isotropic after a smooth deformation of space and developed an approach to estimating this deformation when one has many replications of a spatial process with the same nonstationary covariance structure. However, when one does not have replicates, their approach is not applicable. Anderes and Stein (2008) described a completely implemented procedure for estimating a deformation of an isotropic Gaussian process based on a single realization of the process observed on a dense grid. A simulation study on processes of various smoothness and a broad range of deformations demonstrated the method’s effectiveness.

    Nonstationarity in time is another frequent component of environmental spatial-temporal processes. A particularly important type of temporal nonstationarity is periodic behavior due to, e.g., diurnal or seasonal cycles. For example, total column ozone shows clear seasonal patterns in its space-time covariances, which spurred us to consider periodically correlated models for space-time processes on the sphere. Stein (2007a) described such models for Level-3 (gridded) TOMS data at a single latitude and found that they captured the seasonal structure of the correlations reasonably well.

    In the theoretical study of problems in spatial statistics, it is often natural to consider fixed-domain asymptotics, in which the number of observations in a fixed and bounded region tends to infinity. We published two papers taking this approach to problems in estimating the properties of spatial processes. Loh and Stein (2008) derived the asymptotic properties of bootstrap estimates of variance for spatial variograms, a paper that contains the first theoretical results we are aware of for bootstrapping under fixed-domain asymptotics. Lim and Stein (2008) obtained a number of fixed-domain asymptotic results on the behavior of spatial periodograms and cross-periodograms for spatial processes observed on a lattice, including central limit theorems for smoothed spatial cross-periodograms. Cross-periodograms are a natural tool for studying multivariate processes, which are a critical issue in statistical models for multiple air pollutants. For example, one should expect strong dependencies among NOx, volatile organics and ozone. This work was motivated by statistical analysis of CMAQ output, for which the data are naturally on a large, regular grid, making spatial periodograms a natural tool.

  3. Model evaluation
    Jun and Stein (2004) developed numerical and graphical summaries allowing comparisons between CMAQ output and monitoring data that enable one to assess CMAQ’s ability to capture the dynamic patterns in air pollution and applied these methods to daily sulfate levels. We found that although the CMAQ output does explain some of the space-time dependencies in the observations, there is still substantial space-time structure in the residual variation. We expect our methods to be broadly applicable to the assessment of numerical models for air pollution.

    Several of our works addressed the statistical issues that arise when considering CMAQ output at multiple resolutions. In particular, we have studied how to compare output at different resolutions and how to compare these results to observations in ways that allow us to see the strengths and weaknesses of changing the resolution of CMAQ runs.

    Chen, Stein and Zubrow (2008) showed that higher resolution CMAQ model output does not necessarily provide better prediction with smaller MSE than lower resolution output. In particular, the higher resolution runs mainly do worse at urban monitoring sites where it might be thought the extra resolution would be most useful. However, by smoothing high resolution output, one improves the agreement with the data and generally obtains better fits than with lower resolution runs. This shows that the local variations in high resolution runs are essentially noise and do not track local variations in actual pollution levels, at least on an hour-to-hour basis. In addition, we used analysis of variance techniques to decompose the model errors into diurnal effects, day effects, site effects and their interactions to better understand the statistical characteristics of CMAQ model output with different spatial resolutions. We found that CMAQ does a good job with purely temporal variation in ozone levels but has trouble with spatial effects and space-time interactions.
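The analysis-of-variance decomposition mentioned above can be sketched for a balanced array of errors indexed by day, hour and site. This is only a main-effects version on synthetic data, with all interactions lumped into one remainder term; the array layout, the function name and the simulated errors are assumptions for illustration, not the decomposition or data used by Chen, Stein and Zubrow (2008).

```python
import numpy as np

def main_effects_decomposition(err):
    """Split err[d, h, s] (day, hour, site) into sums of squares.

    Returns sums of squares for day, hour (diurnal) and site main effects,
    plus a remainder that collects all interactions and noise.
    """
    n_day, n_hour, n_site = err.shape
    grand = err.mean()
    day = err.mean(axis=(1, 2)) - grand
    hour = err.mean(axis=(0, 2)) - grand
    site = err.mean(axis=(0, 1)) - grand
    fitted = (grand + day[:, None, None]
              + hour[None, :, None] + site[None, None, :])
    resid = err - fitted
    return {
        "day": n_hour * n_site * np.sum(day**2),
        "hour": n_day * n_site * np.sum(hour**2),
        "site": n_day * n_hour * np.sum(site**2),
        "interactions": np.sum(resid**2),
        "total": np.sum((err - grand)**2),
    }

# Synthetic errors: a diurnal cycle, site offsets and noise (all made up).
rng = np.random.default_rng(3)
n_day, n_hour, n_site = 30, 24, 10
diurnal = np.sin(2 * np.pi * np.arange(n_hour) / 24)
site_offset = rng.normal(0.0, 2.0, n_site)
err = (diurnal[None, :, None] + site_offset[None, None, :]
       + rng.normal(0.0, 0.5, (n_day, n_hour, n_site)))

ss = main_effects_decomposition(err)
for name, value in ss.items():
    print(f"{name:>12s}: {value:10.1f}")
```

Because the design is balanced, the four component sums of squares add up to the total, so each one can be read as a share of the overall error variation.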

    An issue that arises when comparing model output at different resolutions is how to map the low-resolution output to the high-resolution output, which is important when one wants to assess what is lost about small-scale variations when using lower-resolution output. Shao, Stein and Ching (2005) demonstrated both theoretically and numerically that a simple bilinear interpolation scheme of low-resolution output gives noticeably better agreement with high-resolution output than assuming that pollution levels are constant within a grid cell at the lower resolution. In particular, the bilinear interpolation removes a periodicity in the spatial variogram of differences between the model outputs at different resolutions caused by the coarser gridding.
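The two ways of mapping coarse output onto a fine grid can be compared on a toy example. The sketch below builds a smooth synthetic field, forms a coarse version by block averaging, and maps it back to the fine grid both as piecewise constants and by bilinear interpolation between coarse cell centers. The synthetic field, the refinement factor of 4 and the helper names are assumptions; no CMAQ output is involved.

```python
import numpy as np

def block_average(fine, f):
    """Average f-by-f blocks of a fine grid to get the coarse grid."""
    n, m = fine.shape
    return fine.reshape(n // f, f, m // f, f).mean(axis=(1, 3))

def piecewise_constant(coarse, f):
    """Assign each coarse value to every fine cell inside it."""
    return np.repeat(np.repeat(coarse, f, axis=0), f, axis=1)

def bilinear(coarse, f):
    """Bilinear interpolation between coarse cell centers.

    Fine cells outside the outermost coarse centers take the nearest
    coarse value (np.interp holds the end values constant).
    """
    nc, mc = coarse.shape
    yc = (np.arange(nc) + 0.5) * f - 0.5   # coarse centers in fine-index units
    xc = (np.arange(mc) + 0.5) * f - 0.5
    yf = np.arange(nc * f)
    xf = np.arange(mc * f)
    rows = np.array([np.interp(xf, xc, r) for r in coarse])       # along x
    return np.array([np.interp(yf, yc, c) for c in rows.T]).T     # along y

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

n, f = 64, 4
y, x = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
fine = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y) + 0.5 * x
coarse = block_average(fine, f)

err_const = rmse(piecewise_constant(coarse, f), fine)
err_bilin = rmse(bilinear(coarse, f), fine)
print(err_const, err_bilin)
```

The piecewise-constant map has jumps at every coarse cell boundary, which is the kind of grid-induced pattern that shows up as a periodicity in the variogram of differences.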

    To provide a more quantitative evaluation of the differences in model output at different resolutions, Shao and Stein (2006) developed a bootstrapping approach for conditional simulation to address the sub-grid variability issue for a multiresolution CMAQ run in the Atlanta area, using carbon monoxide as a test case. Given a limited run of low and high resolution outputs, this paper describes how to simulate an ensemble of high resolution runs conditional on an extended low-resolution run. The simulated high resolution output has approximately the same conditional distribution as the original high resolution output given its low resolution version. To give one important example of the advantages of this bootstrapping approach compared with previous work on representing sub-grid variability, it gives more sensible inferences about the conditional distribution of extreme values of high resolution output given low resolution output, which, since air quality standards are often written in terms of extreme levels, is of critical concern.

    Using specialized CMAQ runs done at EPA, Lim, Stein, Ching and Tang (2008) studied the statistical descriptions of “model errors”; that is, errors that occur in deterministic numerical models even when the initial and boundary conditions are specified without error. A good statistical description of these errors is a critical component to proper data assimilation that is often ignored, in part, because it is so difficult to do. Our tool for getting at this question was to run versions of CMAQ at different resolutions using, to the extent feasible, identical initial and boundary conditions and then to use the discrepancies between the high and low resolution runs as a surrogate for the statistical characteristics of the “model error.” We analyzed these discrepancies for multiple pollutants using a number of different approaches that allowed us to look at how the discrepancies depend on spatial location, time of day, species of pollutant and time since initial conditions were set. One important finding was that even after an hour of running time from identical initial conditions, the outputs from the runs at different resolutions can be quite different, which suggests that even if one had high-frequency air quality observations, it would be important to account for model error in a data assimilation scheme.

  5. Combining numerical model output and measurements
    Although comparing model output and measurements is important, as we suggested in our proposal and as has become increasingly apparent subsequently, efforts to combine numerical model output and measurements are becoming perhaps even more important to the environmental modeling community. Data assimilation to produce better forecasts or maps may be the most obvious example of such an approach, but there are other ways in which model output and measurements can be combined to aid in understanding and assessing the state of the environment. Im, Stein and Kotamarthi (2005) developed single and multiple tracer versions of the CMAQ model for inverse modeling applications. An important application of this methodology is to assessing and correcting ammonia emissions, which are particularly hard to measure directly given the dominant role of agricultural sources for ammonia. The tracer models were used to derive source-receptor relationships rather than to match observed values exactly. An essential feature of our approach was to develop tracer models that can be run quickly under many emissions scenarios. To this end, we implemented a single tracer version of the CMAQ model at CISES. This model used ammonia as the only trace gas, with no other reactive or inert tracers in the model. The model as a result does not compute any gas phase or aerosol phase chemistry. This single trace gas in the model was removed from the atmosphere by dry and wet deposition, the largest sinks for ammonia in the atmosphere. The model runs much faster than full CMAQ runs, thus giving us an opportunity to make seasonal and year long runs under various emissions scenarios.

    A second approach we developed to assess source-receptor relationships was to run a single multiple-tracer version of CMAQ. Specifically, we implemented a 100-tracer version of the CMAQ model, with 100 “colored” ammonia-like tracers in the model, with each color indicating ammonia emitted from one of 100 predefined regions. All the 100 tracers behave similarly to ammonia in the single tracer version of the model and undergo wet and dry deposition. The wet and dry deposited masses of each of these hundred ammonia-like tracers were separately tracked in the model for developing the source-receptor correlations. Based on a single run of this model, one can obtain source-receptor relationships for a wide range of emission scenarios.

    Using both of these approaches, Im, Stein and Kotamarthi (2005) developed various statistical approaches to recovering adjusted ammonia emissions by comparing modeled and observed ammonia depositions and relating these results to the source-receptor relationships gained from the tracer runs. The method works in principle, but it is not clear that the meteorological (particularly precipitation) and emissions fields are known well enough as of now to make the method reliable. At the least, the approach can suggest areas in which emissions might be badly estimated, directing possible future efforts to improve emissions estimates.
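To make the idea of adjusting emissions from source-receptor relationships concrete, the sketch below treats deposition at each receptor as a linear combination of regional emissions and estimates per-region scaling factors by ordinary least squares. The linearity, the least-squares estimator, the five regions and every number are assumptions for illustration; the statistical approaches in Im, Stein and Kotamarthi (2005) are not specified here and may differ substantially.

```python
import numpy as np

rng = np.random.default_rng(1)
n_receptors, n_regions = 40, 5

# Hypothetical source-receptor matrix: A[i, j] is the deposition at receptor i
# per unit of emissions from region j, as a "colored" tracer run could provide.
A = rng.uniform(0.0, 1.0, size=(n_receptors, n_regions))

prior_emissions = np.array([10.0, 5.0, 8.0, 3.0, 6.0])   # made-up inventory values
true_scaling = np.array([1.0, 1.5, 0.7, 1.2, 1.0])       # unknown in practice

# Assumed linear relationship between emissions and deposition.
modeled = A @ prior_emissions
observed = (A @ (true_scaling * prior_emissions)
            + rng.normal(0.0, 0.01, n_receptors))         # small observation noise

# Least-squares estimate of the scaling factor for each region.
design = A * prior_emissions        # column j multiplied by region j's emissions
scaling_hat, *_ = np.linalg.lstsq(design, observed, rcond=None)
adjusted_emissions = scaling_hat * prior_emissions

print("largest modeled-observed gap:", np.max(np.abs(modeled - observed)))
print("estimated scaling factors:", np.round(scaling_hat, 3))
```

In this toy setting the matrix is known exactly and the noise is tiny; as the text notes, uncertainty in the meteorological and emissions fields is what limits the reliability of such estimates in practice.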

    Zubrow, Chen and Kotamarthi (2008) implemented an Ensemble Adjustment Kalman Filter (EnKF) method for data assimilation in a simplified single tracer version of CMAQ. Carbon monoxide (CO) was simulated as the single tracer in the model, with chemical loss of CO calculated with offline OH fields computed from a full version of the CMAQ model. For both computational and conceptual reasons, we considered this simplified CMAQ as a more appropriate test case for data assimilation with an air quality model than the full CMAQ. The Data Assimilation Research Testbed (DART), developed at NCAR (see, for example, Khare and Anderson (2006)), provides the needed software for the implementation of EnKF and we adapted the CMAQ model and CO measurement data sets to create an environment to develop an optimal methodology for using the EnKF technique. We ran CMAQ in ensemble adjustment Kalman filter mode to assimilate observations from the Air Quality System (AQS) network of monitors. Alexis Zubrow traveled to NCAR to work with the developers of DART to integrate the DART-CMAQ interface into future releases. Our study showed that the proposed method provides an excellent approach for assimilating data, even with large errors and low measurement precision.

    Another system we have investigated in depth is suspended sediment in Lake Michigan. Stroud, et al. (2008a,b) developed spatially dependent statistical models for observation error and physical model error, and devised computational methods for implementing an ensemble Kalman filter with large datasets. One of the challenges this work addressed is the nonlinear relationship between satellite observations and the physical quantity of interest (sediment levels), which is an essential part of any data assimilation scheme. As part of this work, we developed a fully operational ensemble Kalman filter algorithm for combining a 2-dimensional sediment transport model with SeaWiFS satellite images. The method allows for very high-dimensional data (10,000 or more observations at a time), nonlinear observational relationships, and provides real-time forecasts of unobserved sediment levels. We also implemented an ensemble Kalman smoothing algorithm for producing hindcasts of sediment levels based on the full sequence of images. Finally, we made major advances in developing computational methods for on-line parameter and bias estimation in high-dimensional models (Stroud and Bengtsson, 2007). During a large storm event in Spring 1998, we showed that these data assimilation methods improved sediment forecasts by 20-40% over standard approaches.
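For readers unfamiliar with the technique, the analysis step of a basic ensemble Kalman filter can be sketched as follows. This is the textbook perturbed-observation form with a user-supplied (possibly nonlinear) observation function, applied to a three-variable toy state; it is not the algorithm of Stroud, et al. (2008a,b), and the function name `enkf_update` and all numbers are assumptions.

```python
import numpy as np

def enkf_update(X, y, h, R, rng):
    """One perturbed-observation EnKF analysis step.

    X   : (n_state, n_ens) forecast ensemble
    y   : (n_obs,) observation vector
    h   : function mapping a state vector to an observation vector
    R   : (n_obs, n_obs) observation error covariance
    """
    n_ens = X.shape[1]
    HX = np.column_stack([h(X[:, j]) for j in range(n_ens)])
    A = X - X.mean(axis=1, keepdims=True)          # state anomalies
    B = HX - HX.mean(axis=1, keepdims=True)        # observation-space anomalies
    P_xy = A @ B.T / (n_ens - 1)
    P_yy = B @ B.T / (n_ens - 1) + R
    K = P_xy @ np.linalg.inv(P_yy)                 # ensemble Kalman gain
    # Each member assimilates its own noisy copy of the observation.
    noise = rng.multivariate_normal(np.zeros(len(y)), R, size=n_ens).T
    return X + K @ (y[:, None] + noise - HX)

rng = np.random.default_rng(2)
n_state, n_ens = 3, 200
X_prior = rng.standard_normal((n_state, n_ens))
# Make the second component correlated with the first, so that observing the
# first component also informs the second.
X_prior[1] = 0.8 * X_prior[0] + 0.6 * X_prior[1]

y = np.array([2.0])                   # observation of the first component only
R = np.array([[0.01]])
X_post = enkf_update(X_prior, y, lambda x: x[:1], R, rng)

prior_mean, post_mean = X_prior.mean(axis=1), X_post.mean(axis=1)
print(prior_mean, post_mean)
```

Because the gain is built from ensemble covariances, unobserved components that co-vary with observed ones are updated as well, which is how such a filter can produce forecasts of unobserved quantities.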

    A perhaps more satisfactory approach to assimilating data in this problem is to carry out assimilation on the hydrodynamics directly rather than on the sediment transport, which is largely determined by the hydrodynamics. Using advection data collected at 11 sites in Lake Michigan on an hourly basis during 1998, Zhang, Beletsky, Schwab and Stein (2007) developed a data assimilation scheme based on the numerical model of Beletsky and Schwab (2001). The big challenge here was to carry out the assimilation in a way that respects the essentially incompressible nature of water and behaves reasonably along the coast. The first problem was solved by carrying out the assimilation on a stream function representation of the advections rather than the advections themselves and the second was solved by adding artificial observations of the stream function along the coast to constrain the coastal behavior of the assimilations. This work also developed new measures for assessing the accuracy of predictions of vector-valued responses such as advections. The assimilation methods produced improved predictions of advections at monitoring sites used in the assimilation and, to a lesser but still noticeable extent, at sites not used in the assimilation.
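The reason a stream function representation respects incompressibility can be seen numerically. In the sketch below, a two-dimensional flow is defined from a stream function psi by u = d(psi)/dy and v = -d(psi)/dx; any increment added to psi leaves the flow divergence-free, whereas increments added directly to u and v generally do not. The rectangular grid, the random increments and the helper names are assumptions, and the coastal constraint described above is not implemented.

```python
import numpy as np

def flow_from_stream_function(psi, dx, dy):
    """Velocity components from a stream function (axis 0 is y, axis 1 is x)."""
    u = np.gradient(psi, dy, axis=0)      # u =  d(psi)/dy
    v = -np.gradient(psi, dx, axis=1)     # v = -d(psi)/dx
    return u, v

def divergence(u, v, dx, dy):
    """Discrete divergence du/dx + dv/dy using the same difference scheme."""
    return np.gradient(u, dx, axis=1) + np.gradient(v, dy, axis=0)

n = 50
dx = dy = 1.0 / (n - 1)
y, x = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
psi = np.sin(np.pi * x) * np.sin(np.pi * y)   # zero on the boundary of the square

rng = np.random.default_rng(4)

# (a) Update the stream function, then derive the flow.
psi_updated = psi + 0.1 * rng.standard_normal(psi.shape)
u_a, v_a = flow_from_stream_function(psi_updated, dx, dy)
div_stream = np.max(np.abs(divergence(u_a, v_a, dx, dy)))

# (b) Update the velocity components directly.
u0, v0 = flow_from_stream_function(psi, dx, dy)
u_b = u0 + 0.1 * rng.standard_normal(psi.shape)
v_b = v0 + 0.1 * rng.standard_normal(psi.shape)
div_direct = np.max(np.abs(divergence(u_b, v_b, dx, dy)))

print(div_stream, div_direct)
```

The discrete divergence in case (a) vanishes up to rounding error because the finite-difference operators along the two axes commute, mirroring the equality of mixed partial derivatives in the continuous case.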

    In work that followed up on the thesis project of one of our postdocs, Zhang and Switzer (2005) developed an event-based model for precipitation that is continuous in both space and time, using Boolean random fields to model rain patches. A method for fitting the model to hourly rain gauge data was described and applied to data collected from eight rain gauges in Alabama over a 13 year period.

    Welty and Stein (2004) developed methods for analyzing and understanding spatial data for which the vertical and horizontal variations are fundamentally different, which will be the case for many environmental processes. Welty, et al. (2004) invented a mathematical model for correcting for quenching bias in fluorescence measurements of chlorophyll in water, which is critical to making the measurement technique viable during daylight hours. The method was applied to data that were collected in Lake Michigan by using a towed fluorometer during the Episodic Events Great Lakes Experiment (EEGLE) Program in 1998-2000. These results should be of particular interest to EPA scientists, because EPA recently added a towed sensor package to the suite of instruments used as part of its Great Lakes monitoring program (personal communication G. Warren, USEPA/GLNPO) and in its nearshore sampling research activities (personal communication J. Kelly, USEPA/NHEERL).

    Technical details

    Our work does not separate naturally into achievements versus technical details. However, for some of our more mathematically or computationally ambitious works, we include some technical details here. Stein (2005b) contained a number of fundamental theoretical results for spatial-temporal processes. An elegant result given in this paper is an expression for the general form of the autocovariance function of a stationary, Gaussian spatial-temporal process that is Markov in time. The paper included an extensive theoretical investigation of the smoothness of spatial-temporal autocovariance functions away from the origin. Specifically, Stein (2005b) demonstrated that many examples of nonseparable autocovariance functions described in the literature share a generally undesirable feature with separable autocovariance functions in terms of their behavior away from the origin, and thus may be inadequate in situations in which data are frequent in both space and time. A class of space-time autocovariance functions that allows different degrees of smoothness in space and in time yet is infinitely differentiable away from the origin was obtained, although the functions were described in terms of their spectral densities and the autocovariance functions can only be obtained in closed form in a few special cases.

    Another general problem in specifying spatial-temporal autocovariance functions addressed in Stein (2005b) was to obtain interpretable and usable forms for irreversible processes. Stein (2005b) described two approaches to this problem. The first was based on a simple idea of “translating” a reversible process (e.g., considering a process Z(x − tv, t) for a space-time location (x, t) and v a fixed vector, where Z is reversible). Stein (2005c) extended this idea to processes on a sphere by replacing the translation by a rotation. This simple idea can lead to much better descriptions of many environmental processes than reversible models. The second approach to obtaining irreversible processes was to include cross-terms between spatial and temporal frequencies in the spectral representation of the autocovariance function. Jun and Stein (2007a) obtained a way to extend this idea to spherical spatial domains by applying differential operators to reversible processes, leading to flexible but fairly simple models for irreversible processes for which one can obtain explicit expressions for the spatial-temporal covariance functions.
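The effect of the translation can be checked directly. The short derivation below assumes Z is stationary with a reversible (fully symmetric) autocovariance; the symbols K_0, K, Y, h and u are notation introduced here for illustration and are not taken from Stein (2005b).

```latex
\[
K_0(h,u) = \operatorname{Cov}\{Z(x,t),\, Z(x+h,\,t+u)\}, \qquad K_0(h,u) = K_0(h,-u),
\]
\[
Y(x,t) = Z(x - tv,\, t)
\;\Longrightarrow\;
K(h,u) = \operatorname{Cov}\{Y(x,t),\, Y(x+h,\,t+u)\} = K_0(h - uv,\, u),
\]
\[
K(h,-u) = K_0(h + uv,\, -u) = K_0(h + uv,\, u) \neq K_0(h - uv,\, u)
\quad \text{in general when } v \neq 0 .
\]
```

So the translated process Y remains stationary but is no longer reversible in time, and the fixed vector v plays the role of a drift velocity.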

    Stein (2005c) modeled stationary spatial-temporal processes using four scalar-valued functions: one describing purely spatial covariances, one describing purely temporal covariances, and two describing dependencies between temporal variations at different sites, one for coherences and the other for phase relationships, using the language of spectral representations for multiple time series. This framework is particularly appropriate, both conceptually and computationally, for processes observed regularly at fixed monitoring sites. In particular, it is well-suited to both time and frequency domain approaches to approximating the likelihood using parametric or partially nonparametric approaches to modeling the autocovariance function.

    Stein (2007b,c) investigated a class of models for spatial processes on the sphere known as axially symmetric (Jones 1963). These models can be described in terms of expansions in spherical harmonics with random coefficients, for which a simple restriction on the correlation structure of these coefficients guarantees axial symmetry. Stein (2007c) added a compactly supported covariance function to the model to better capture local behavior of spatial processes. By combining formulas for updating matrix inverses and determinants when adding a low-rank matrix to a matrix whose inverse and determinant are known with methods for computing Cholesky decompositions for sparse matrices, Stein (2007c) showed how to calculate likelihood functions exactly for very large spatial datasets.
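The combination of low-rank update formulas described above can be sketched for a covariance of the form S + U K U^T, where S comes from a compactly supported covariance and U holds a small number of basis functions. The sketch uses the Woodbury identity and the matrix determinant lemma and checks the result against a direct calculation. The one-dimensional setting, the triangular covariance with a nugget, the cosine basis and the dense Cholesky factorization (standing in for a sparse one) are all assumptions for illustration, not the setup of Stein (2007c).

```python
import numpy as np

def loglik_direct(z, S, U, K):
    """Gaussian log-likelihood (mean zero) using the full n x n covariance."""
    Sigma = S + U @ K @ U.T
    logdet = np.linalg.slogdet(Sigma)[1]
    quad = z @ np.linalg.solve(Sigma, z)
    return -0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)

def loglik_lowrank(z, S, U, K):
    """Same log-likelihood via the Woodbury identity and determinant lemma."""
    L = np.linalg.cholesky(S)          # dense stand-in for a sparse Cholesky

    def solve_S(b):
        return np.linalg.solve(L.T, np.linalg.solve(L, b))

    Sinv_z = solve_S(z)
    Sinv_U = solve_S(U)
    M = np.linalg.inv(K) + U.T @ Sinv_U            # small r x r matrix
    logdet = (2 * np.sum(np.log(np.diag(L)))
              + np.linalg.slogdet(K)[1] + np.linalg.slogdet(M)[1])
    w = U.T @ Sinv_z
    quad = z @ Sinv_z - w @ np.linalg.solve(M, w)
    return -0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)

n, r = 200, 5
t = np.arange(n)

# Compactly supported (triangular) covariance plus a small nugget: banded S.
lag = np.abs(t[:, None] - t[None, :])
S = np.maximum(0.0, 1.0 - lag / 3.0) + 0.1 * np.eye(n)

# Low-rank part: r cosine basis functions with decreasing variances.
U = np.column_stack([np.cos(2 * np.pi * k * t / n) for k in range(r)])
K = np.diag([4.0, 2.0, 1.0, 0.5, 0.25])

rng = np.random.default_rng(5)
z = rng.standard_normal(n)

ll_direct = loglik_direct(z, S, U, K)
ll_lowrank = loglik_lowrank(z, S, U, K)
print(ll_direct, ll_lowrank)
```

Only S is ever factorized and only an r x r system is solved for the low-rank part, so with a sparse factorization of S the cost grows far more slowly with n than a dense factorization of the full covariance.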

    Anderes and Stein (2008) used ideas from complex analysis and differential geometry to demonstrate that a spatial deformation in two dimensions can be (essentially) characterized by the local properties of the deformation. Anderes and Stein (2008) gave an approach to estimating these local properties based on local likelihood methods. These estimates were then converted into an estimated deformation by drawing on a broad array of mathematical and computational ideas, including quasiconformal maps, harmonic functions and numerical methods for solving partial differential equations.

    Im, Stein and Zhu (2007) used a semiparametric approach to modeling the spectral density of an isotropic process. Specifically, we approximated the spectral density using a B-spline at lower frequencies and by an algebraically decaying function at higher frequencies. We then used likelihoods to estimate the locations of the knots, the coefficients of the B-splines and the parameters of the algebraically decaying tail. This work solved a number of challenging computational issues related to calculating the covariance function based on the spline representation and the numerical maximization of the likelihood. In particular, computing the covariances required considerable use of asymptotic properties of special functions and infinite precision arithmetic to obtain accurate numerical results.

    As part of the ongoing thesis work of Darongsae Kwon, Kwon and Stein have been developing simple and computationally efficient methods for obtaining accurate predictive inferences for Gaussian processes with unknown parameters in the covariance function. We are presently exploring the use of Edgeworth expansions, obtained in part using Mathematica, for comparing various approaches to this problem, as well as applying the methods to large space-time datasets to assess their effectiveness.

    Loh and Stein (2008) and Lim and Stein (2008) obtained asymptotic results under fixed-domain asymptotics, in which one considers taking an increasing number of observations in a fixed and bounded region. Loh and Stein (2008) proved that for processes in one dimension that behave locally like Brownian motion, block of block bootstrapping (Künsch 1989) yields relatively consistent variance estimates of the variogram at short lags. Loh and Stein (2008) obtained a similar result for processes that behave locally like integrated Brownian motion, but only if one considers the second-order variogram. Lim and Stein (2008) extended ideas in Stein (1995) to obtain fixed-domain asymptotic results for spatial cross-periodograms. Analogous to the results in Loh and Stein (2008), the results only hold if one first differences the spatial data sufficiently before calculating the periodogram. Central limit theorems for smoothed periodograms were obtained by finding asymptotic bounds on the higher order moments of the periodograms.

Conclusions:

Contributions to understanding of environmental problems

Levels of air and water pollution vary in space and time. It follows that stochastic models for processes varying in space and time should be central to the successful application of statistical methods to environmental processes. However, until the last decade, there was a noticeable lack of research on spatial-temporal processes in the statistical literature, compared not only to the immense time series literature but also to the rather substantial literature on purely spatial processes. Lack of suitable data cannot be the reason, as extensive monitoring networks of air quality have existed for a few decades and meteorological networks for much longer. Lack of computing power and suitable algorithms has certainly been part of the problem, and much of the recent work in spatial-temporal statistics has made use of computationally intensive methods such as Markov chain Monte Carlo for Bayesian hierarchical models and ensemble Kalman filters for data assimilation. But, in our opinion, much of this work is based on statistical models chosen for their computational convenience and/or simplicity of model formulation and not because the models used have been demonstrated to provide an accurate description of the processes under study. In our work, we endeavored to ask hard questions about what kinds of properties space-time stochastic models for dynamic processes should possess and to develop methods for studying space-time data that will allow one to assess the suitability of various stochastic models. To give a few examples, we studied various approaches to modeling space-time asymmetries (ways in which processes look different going forward versus backwards in time), developed models for processes on a global scale, which has led inevitably to studying processes on spheres whose behavior varies with latitude, and developed models that include seasonality in their description of space-time variability. These modeling approaches provide a toolbox that investigators can use now as well as serving as a foundation for further research into the statistical modeling of environmental processes. The various diagnostic approaches we developed to study space-time data complement this work as they provide simple and interpretable ways of evaluating both our own and others’ spatial-temporal statistical models.

Although we did not let computational considerations be the main driver of our model choices, we paid close attention to computational issues. In particular, wherever possible, our models for space-time covariances (a way of modeling statistical dependencies in space-time processes) yield explicit expressions for the covariances, making practical application of the models much easier than for models in which obtaining the covariances requires a numerical integration. In addition, we considered various ways to exploit specific structures in our models and in the data to speed up computations. For example, we made creative use of spectral methods (a set of data analytic methods common in time series analysis) for analyzing space-time data, both as a way of understanding and describing the processes and as a way of speeding up computations, sometimes greatly.

Our work on model evaluation focused on developing extensions and modifications of simple statistical methods that will allow modelers to examine the space-time structure of their model output and how it compares to observed data. For example, looking at spatial and spatial-temporal variograms (a standard tool in spatial statistics) after taking differences of consecutive observations in time at each observation site (or at each pixel for computer model output) provides a good way of using familiar tools from spatial statistics to investigate aspects of the spatial-temporal dependence that may be difficult to discern looking at the variogram of the undifferenced data directly. As another example, we found that the classical statistical tool of analysis of variance can be very helpful in decomposing the variation in model output and their differences from observations into spatial components, temporal components and their interactions. This kind of analysis allows us to understand what aspects of the observed process the model can and cannot capture. Doing this analysis simultaneously for multiple models (or a common model run under differing conditions) can provide considerable insight into how the models are performing.
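
The two diagnostics described above can be sketched in a few lines. This is an illustrative sketch only, assuming data arranged as a complete sites-by-times array with no missing values; the function names, the array layout and the distance bins are assumptions of the example and are not taken from the papers.

```python
import numpy as np

def spacetime_anova(z):
    """Two-way decomposition of a (n_sites, n_times) array.

    Returns sums of squares for the spatial (site) effect, the temporal
    effect, their interaction, and the total about the grand mean.
    """
    n_sites, n_times = z.shape
    grand = z.mean()
    site_eff = z.mean(axis=1, keepdims=True) - grand   # one value per site
    time_eff = z.mean(axis=0, keepdims=True) - grand   # one value per time
    inter = z - grand - site_eff - time_eff            # what is left over
    return {
        "spatial": n_times * float(np.sum(site_eff ** 2)),
        "temporal": n_sites * float(np.sum(time_eff ** 2)),
        "interaction": float(np.sum(inter ** 2)),
        "total": float(np.sum((z - grand) ** 2)),
    }

def differenced_variogram(z, coords, bins):
    """Empirical spatial variogram of first differences in time.

    z: (n_sites, n_times) values; coords: (n_sites, 2) site locations;
    bins: increasing distance bin edges. Returns one value per bin
    (NaN where a bin contains no site pairs).
    """
    d = np.diff(z, axis=1)                        # difference consecutive times
    i, j = np.triu_indices(z.shape[0], k=1)       # all distinct site pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    # Half the mean squared difference between sites, averaged over time.
    half_sq = 0.5 * np.mean((d[i] - d[j]) ** 2, axis=1)
    which = np.digitize(dist, bins)
    return np.array([
        half_sq[which == b].mean() if np.any(which == b) else np.nan
        for b in range(1, len(bins))
    ])
```

Applying both functions to observations and to model output at the same sites and times (or to the differences between the two) gives directly comparable summaries; a process that is purely a site effect plus a common time trend has zero interaction and a zero differenced variogram.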

Several papers of ours considered using computer models in nonstandard, innovative ways. For example, we developed specially modified versions of CMAQ (EPA’s main air quality model) that were meant to trace source-receptor relationships and that can in principle be used to correct for errors in emissions inventories. In another innovative use of CMAQ, by specially controlling the boundary and initial conditions in CMAQ (the actual CMAQ runs were done by an EPA contractor according to our specifications), we showed how it is possible to study the impacts of various aspects of the model on outputs in a way that is practically impossible to study by comparing model output to observations. The possibility of using computer models to do what are, in effect, novel controlled experiments, is one that we believe deserves much greater attention and we hope that our work in this area will open up new lines of research.

Data assimilation as a way of combining observations with computer models has been standard practice in weather forecasting and oceanography for some time now. There is a growing realization among many researchers that air quality and other environmental models should be adopting this approach to assist in producing better pollution forecasts and better retrospective maps of pollution levels (i.e., a product comparable to the NCEP/NCAR Reanalyses Project), as well as to better understand the strengths and weaknesses of models and how both models and monitoring networks can be improved. Our work in data assimilation focused on some difficult statistical problems that arise in environmental applications. One is how to carry out assimilations that respect strong physical constraints in the process. We studied this problem in the context of respecting the (near) incompressibility of water when modeling water flows in Lake Michigan, but one can imagine similar problems occurring due to, for example, chemical balance constraints in air quality models. Another problem we addressed is how to handle strongly nonlinear relationships between observations and the physical quantity of interest. We expect such nonlinear relationships to be quite common with remote sensing data, and hence the proper treatment of these relationships will be an increasingly common issue for the analysis of environmental data, especially for datasets with the kind of wide spatial coverage that is particularly valuable for data assimilation.
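
For readers unfamiliar with ensemble-based assimilation, the following is a generic textbook-style sketch of a single stochastic ensemble Kalman filter analysis step in which the observation operator may be nonlinear. It is not the scheme used in our Lake Michigan or CMAQ work, and it does not enforce physical constraints; the function name, arguments and the scalar observation-error variance are assumptions of the example.

```python
import numpy as np

def enkf_update(ensemble, y_obs, h, obs_var, rng):
    """One stochastic (perturbed-observation) EnKF analysis step.

    ensemble: (n_members, n_state) prior ensemble.
    y_obs:    (m,) observed values.
    h:        function mapping a state vector to predicted observations (m,);
              it may be nonlinear, in which case the update is approximate.
    obs_var:  observation-error variance (assumed equal and independent).
    rng:      numpy random Generator used to perturb the observations.
    """
    X = np.asarray(ensemble, dtype=float)
    Y = np.array([h(x) for x in X])               # predicted observations, (N, m)
    n_members = X.shape[0]

    Xa = X - X.mean(axis=0)                       # state anomalies
    Ya = Y - Y.mean(axis=0)                       # predicted-observation anomalies

    # Sample cross-covariance and innovation covariance from the ensemble.
    Pxy = Xa.T @ Ya / (n_members - 1)
    Pyy = Ya.T @ Ya / (n_members - 1) + obs_var * np.eye(Y.shape[1])

    # Kalman gain K = Pxy Pyy^{-1}, computed with a solve (Pyy is symmetric).
    K = np.linalg.solve(Pyy, Pxy.T).T

    # Each member is nudged toward its own noisy copy of the observations.
    perturbed = y_obs + rng.normal(0.0, np.sqrt(obs_var), size=Y.shape)
    return X + (perturbed - Y) @ K.T
```

Because the gain is built from ensemble covariances between the state and h(state), the update is only linear in the innovations; strongly nonlinear observation operators are exactly the case in which this approximation can break down.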

Presentations

Michael Stein gave invited talks around the world on his CISES research. International talks at conferences included the Hunter Lecture at the International Environmetrics Society meeting in Beijing, a Medallion Lecture of the Institute of Mathematical Statistics in Toronto, as well as talks at conferences in Austria, Japan and Korea. He also gave invited talks at Nanjing University in China and ETH in Zurich, Switzerland. He gave numerous invited talks in the USA, most notably the Rustagi Lecture at Ohio State University and the KCS Pillai Lecture at Purdue. Other institutions at which he gave invited talks include Stanford, University of Wisconsin-Madison, North Carolina State University, UCLA, SAMSI and NCAR, among others.

Jon Stroud gave invited talks on his CISES work at a number of meetings and departments: International Conference on Statistics, Combinatorics and Related Areas in Portland, Maine (October 2003); University of California at Santa Cruz (April 2004); National Center for Atmospheric Research (May 2004); Meeting of the International Environmetrics Society in Portland, Maine (July 2004); Joint Statistical Meetings in Toronto (August 2004); Department of Biostatistics at Johns Hopkins University (April 2005); the SAMSI Mini-Workshop, Research Triangle Park, NC (June 2005); and the SAMSI/IMAGe Summer School, NCAR, Boulder, CO (June 2005).

Our scientific programmer, Alexis Zubrow, and the various graduate students and postdocs gave numerous contributed talks and posters at a wide range of statistical, geophysical and environmental meetings. Mikyoung Jun received an honorable mention for the student paper competition of the American Statistical Association’s Section on Statistics and the Environment.

References:

Berger JO, de Oliveira V, Sanso B. Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association 2001;96:1361–1374.

Cressie N, Johannesson G. Fixed rank kriging for very large spatial datasets. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 2008;70:209–226.

Gneiting T. Nonseparable, stationary covariance functions for space-time data. Journal of the American Statistical Association 2002;97:590–600.

Harville DA, Carriquiry AL. Classical and Bayesian prediction as applied to an unbalanced mixed linear model. Biometrics 1992;48:987–1003.

Jones AH. Stochastic processes on a sphere. Annals of Mathematical Statistics 1963;34:213–217.

Khare SP, Anderson J. An examination of ensemble filter based adaptive observation methodologies. Tellus A 2006;58:179–195.

Künsch HR. The jackknife and the bootstrap for general stationary observations. Annals of Statistics 1989;17:1217–1241.

Putter H, Young GA. On the effect of covariance function estimation on the accuracy of kriging predictors. Bernoulli 2001;7:421–438.

Sampson P, Guttorp P. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association 1992;87:108–119.

Stein ML. Fixed-domain asymptotics for spatial periodograms. Journal of the American Statistical Association 1995;90:1277–1288.


Journal Articles on this Report: 16 Displayed

Journal Article: Anderes EB, Stein ML. Estimating deformations of isotropic Gaussian random fields on the plane. Annals of Statistics 2008;36(2):719-741.
  • Sub Project: R829402C002 (Final)
  • Document Sources: Project Euclid Abstract; arXiv PDF

Journal Article: Im HK, Stein ML, Kotamarthi VR. A new approach to scenario analysis using simplified chemical transport models. Journal of Geophysical Research 2005;110(D24205), doi:10.1029/2005JD006417.
  • Sub Project: R829402C002 (2006); R829402C002 (Final)
  • Document Sources: AGU Abstract

Journal Article: Im HK, Stein ML, Zhu Z. Semiparametric estimation of spectral density with irregular observations. Journal of the American Statistical Association 2007;102(478):726-735.
  • Sub Project: R829402C002 (Final)
  • Document Sources: Ingenta Connect Abstract; University of Chicago PDF

Journal Article: Jun M, Stein ML. Statistical comparison of observed and CMAQ modeled daily sulfate levels. Atmospheric Environment 2004;38(27):4427-4436.
  • Sub Project: R829402C002 (2004); R829402C002 (Final)
  • Document Sources: Science Direct Full Text; Science Direct Abstract; Science Direct PDF

Journal Article: Jun M, Stein ML. An approach to producing space-time covariance functions on spheres. Technometrics 2007;49(4):468-479.
  • Sub Project: R829402C002 (Final)
  • Document Sources: Ingenta Connect Abstract

Journal Article: Loh JM, Stein ML. Spatial bootstrap with increasing observations in a fixed domain. Statistica Sinica 2008;18(2):667-688.
  • Sub Project: R829402C002 (Final)
  • Document Sources: Statistica Sinica Abstract; University of Chicago PDF

Journal Article: Shao X, Stein ML. Statistical conditional simulation of a multiresolution numerical air quality model. Journal of Geophysical Research 2006;111(D15211), doi:10.1029/2005JD007037.
  • Sub Project: R829402 (2006); R829402C002 (2006); R829402C002 (Final)
  • Document Sources: AGU Abstract

Journal Article: Shao X, Stein M, Ching J. Statistical comparisons of methods for interpolating the output of a numerical air quality model. Journal of Statistical Planning and Inference 2007;137(7):2277-2293.
  • Sub Project: R829402C002 (Final)
  • Document Sources: Science Direct Full Text; Science Direct Abstract; Science Direct PDF

Journal Article: Stein ML. Space-time covariance functions. Journal of the American Statistical Association 2005;100(469):310-321.
  • Sub Project: R829402C002 (2002); R829402C002 (2004); R829402C002 (Final)
  • Document Sources: Ingenta Connect Abstract

Journal Article: Stein ML. Statistical methods for regular monitoring data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67(5):667-687.
  • Sub Project: R829402C002 (2004); R829402C002 (2006); R829402C002 (Final)
  • Document Sources: Blackwell Synergy Abstract

Journal Article: Stein ML. Seasonal variations in the spatial-temporal dependence of total column ozone. Environmetrics 2007;18(1):71-86.
  • Sub Project: R829402C002 (2006); R829402C002 (Final)
  • Document Sources: InterScience Abstract; University of Chicago PDF

Journal Article: Stein ML. Spatial variation of total column ozone on a global scale. Annals of Applied Statistics 2007;1(1):191-210.
  • Sub Project: R829402C002 (Final)
  • Document Sources: Project Euclid Abstract; arXiv PDF

Journal Article: Stein ML. A modeling approach for large spatial datasets. Journal of the Korean Statistical Society 2008;37(1):3-10.
  • Sub Project: R829402C002 (Final)
  • Document Sources: Science Direct Abstract; Science Direct PDF

Journal Article: Zhang Z, Beletsky D, Schwab DJ, Stein ML. Assimilation of current measurements into a circulation model of Lake Michigan. Water Resources Research 2007;43(W11407), doi:10.1029/2006WR005818.
  • Sub Project: R829402C002 (Final)
  • Document Sources: AGU Abstract; University of Chicago PDF

Journal Article: Zhang Z, Switzer P. Stochastic space-time regional rainfall modeling adapted to historical rain gauge data. Water Resources Research 2007;43(W03441), doi:10.1029/2005WR004654.
  • Sub Project: R829402C002 (Final)
  • Document Sources: AGU Abstract; University of Chicago PDF

Journal Article: Zubrow A, Chen L, Kotamarthi VR. EAKF-CMAQ: Introduction and evaluation of a data assimilation for CMAQ based on the ensemble adjustment Kalman filter. Journal of Geophysical Research 2008;113(D09302), doi:10.1029/2007JD009267.
  • Sub Project: R829402C002 (Final)
  • Document Sources: AGU Full Text; AGU Abstract
Supplemental Keywords:

    Ecosystem Protection/Environmental Exposure & Risk, Economic, Social, & Behavioral Science Research Program, Air, Geographic Area, Scientific Discipline, Health, RFA, PHYSICAL ASPECTS, Ecosystem/Assessment/Indicators, Engineering, Chemistry, & Physics, Risk Assessments, Environmental Statistics, Great Lakes, Applied Math & Statistics, Health Risk Assessment, Physical Processes, Ecological Risk Assessment, Environmental Engineering, EPA Region, particulate matter, Ecological Effects - Environmental Exposure & Risk, Ecosystem Protection, Monitoring/Modeling, Environmental Monitoring, risk assessment, trend monitoring, ozone, chemical transport models, particulate, stochastic models, statistical methodology, air quality, computer models, ecological risk, ecosystem health, environmental indicators, chemical transport, health risk analysis, human health risk, monitoring, statistical models, particulates, statistical methods, watersheds, Region 5, air pollution, sediment transport, stratospheric ozone, emissions monitoring, data models, exposure, water, chemical transport modeling, ecological models, ecological effects, ecological health, human exposure

    Progress and Final Reports:
    2002 Progress Report
    2004 Progress Report
    2006 Progress Report
    Original Abstract


    Main Center Abstract and Reports:
    R829402    Center for Integrating Statistical and Environmental Science

    Subprojects under this Center:
    R829402C001 Detection of a Recovery in Stratospheric and Total Ozone
    R829402C002 Integrating Numerical Models and Monitoring Data
    R829402C003 Air Quality and Reported Asthma Incidence in Illinois
    R829402C004 Quasi-Experimental Evidence on How Airborne Particulates Affect Human Health
    R829402C005 Model Choice Stochasticity, and Ecological Complexity
    R829402C006 Statistical Approaches to Detection and Downscaling of Climate Variability and Change


    The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Conclusions drawn by the principal investigators have not been reviewed by the Agency.

