Hamill, T. M., 2007: Comments on "Calibrated Surface Temperature Forecasts from the Canadian Ensemble Prediction System Using Bayesian Model Averaging." Mon. Wea. Rev., 135, 12, 4226-4230.


Introduction

Wilson et al. (2007, hereafter W07) recently described the application of the Bayesian model averaging (BMA; Raftery et al. 2005, hereafter R05) calibration technique to surface temperature forecasts using the Canadian ensemble prediction system. The BMA technique as applied in W07 produced an adjusted probabilistic forecast from an ensemble through a two-step procedure. The first step was the correction of biases of individual members through regression analyses. The second step was the fitting of a Gaussian kernel around each bias-corrected member of the ensemble. The amount of weight applied to each member's kernel and the width of the kernel(s) were set through an expectation maximization (EM) algorithm (Dempster et al. 1977). The final probability density function (pdf) was a sum of the weighted kernels.
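The second step described above can be illustrated with a minimal sketch. It is not W07's or R05's code: it assumes the members have already been bias-corrected, that a single kernel variance is shared by all members, and that the EM iteration starts from equal weights. The function names `bma_em` and `bma_pdf`, the iteration limit, and the convergence tolerance are hypothetical choices made for illustration only.

```python
import numpy as np


def bma_em(forecasts, obs, n_iter=500, tol=1e-8):
    """Estimate BMA member weights and a shared kernel sigma by EM.

    forecasts : (n_cases, n_members) bias-corrected member forecasts
    obs       : (n_cases,) verifying observations
    Assumption: one Gaussian kernel variance common to all members.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n_cases, n_members = forecasts.shape

    sq_err = (obs[:, None] - forecasts) ** 2       # squared member errors
    weights = np.full(n_members, 1.0 / n_members)  # start from equal weights
    sigma2 = sq_err.mean()                         # initial kernel variance
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: probability that each member's kernel "explains" each case.
        dens = np.exp(-0.5 * sq_err / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
        weighted = weights * dens
        mix = np.maximum(weighted.sum(axis=1, keepdims=True), 1e-300)
        z = weighted / mix

        # Log-likelihood of the training data under the current parameters.
        ll = np.log(mix).sum()

        # M-step: update the weights and the shared variance.
        weights = z.mean(axis=0)
        sigma2 = (z * sq_err).sum() / n_cases

        if ll - prev_ll < tol:                     # EM never decreases ll
            break
        prev_ll = ll

    return weights, np.sqrt(sigma2)


def bma_pdf(x, member_forecasts, weights, sigma):
    """Evaluate the BMA pdf (weighted sum of Gaussian kernels) at points x."""
    x = np.asarray(x, dtype=float)
    z = (x[..., None] - np.asarray(member_forecasts, dtype=float)) / sigma
    kernels = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return (weights * kernels).sum(axis=-1)
```

With a training array of shape (days, members), `bma_em` returns one weight per member plus the kernel standard deviation, and `bma_pdf` evaluates the resulting forecast density for one day's bias-corrected members. Note that the number of fitted parameters grows with the ensemble size while the training sample does not.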

W07 reported (their Fig. 2) that at any given instant, a majority of the ensemble members were typically assigned zero weight, while a few select members received the majority of the weight. Which members received large weights varied from one day to the next. These results were counterintuitive. Why effectively discard the information from so many ensemble members? Why should one member have positive weight one day and none the next?

This comment on W07 will show that BMA in which the EM algorithm is permitted to adjust the weight of each member individually is not an appropriate application of the technique when the sample size is small; specifically, the radically unequal weights of W07 exemplify an "overfitting" (Wilks 2006a, p. 207) to the training data. A symptom of overfitting is an improved fitted relationship to the training data but a worsened relationship with independent data. This may happen when the statistician attempts to fit a large number of parameters using a relatively small training sample. In W07, the EM algorithm was required to set the weights of 16 individual ensemble members and a kernel standard deviation with between 25 and 80 days of data.

To illustrate the problem of overfitting in W07's methodology, a reforecast dataset was used. This was composed of more than two decades of daily ensemble forecasts with perturbed initial conditions, all from a single forecast model. This large dataset permitted a comparison of BMA properties based on small and large training samples. This reforecast dataset used a T62, circa 1998 version of the National Centers for Environmental Prediction (NCEP) Global Forecast System. A 15-member forecast, consisting of a control and seven bred pairs (Toth and Kalnay 1997), was integrated to 15 days lead for every day from 1979 to the present. For more details on this reforecast dataset, please see Hamill et al. (2006). The verification data were from the NCEP–National Center for Atmospheric Research reanalysis (Kalnay et al. 1996).