First, let’s give Gergis, Karoly and coauthors some props for conceding that there was a problem with their article and trying to fix it. Think of the things that they didn’t do. They didn’t arrange for a realclimate hit piece, sneering at the critics and saying Nyah, nyah,
what about the hockey stick that Oerlemans derived from glacier retreat since 1600?… How about Osborn and Briffa’s results which were robust even when you removed any three of the records?
Karoly recognized that the invocation of other Hockey Sticks was irrelevant to the specific criticism of his paper and did not bother with the realclimate juvenilia that has done so much to erode the public reputation of climate scientists. Good for him.
Nor did he simply deny the obvious, as Mann, Gavin Schmidt and so many others have done with something as simple as Mann’s use of the contaminated portion of Tiljander sediments according to “objective criteria”. The upside-down Tiljander controversy lingers on, tarnishing the reputation of the community that seems unequal to the challenge of a point that a high school student can understand.
Nor did they assert the errors didn’t “matter” and challenge the critics to produce their own results (while simultaneously withholding data.) Karoly properly recognized that the re-calculation obligations rested with the proponents, not the critics.
I do not believe that they “independently” discovered their error or that they properly acknowledged Climate Audit in their public statements or even in Karoly’s email. But even though Karoly’s email was half-hearted, he was courteous enough to notify me of events. Good for him. I suspect that some people on the Team would have opposed even this.
The Screening Irony
The irony in Gergis’ situation is that they tried to avoid an erroneous statistical procedure that is well-known under a variety of names in other fields (I used the term Screening Fallacy), but which is not merely condoned, but embraced, by the climate science community. In the last few days, readers have drawn attention to relevant articles discussing closely related statistical errors under terms like “selecting on the dependent variable” or “double dipping – the use of the same data set for selection and selective analysis”.
I’ll review a few of these articles and then return to Gergis. Shub Niggurath listed a number of interesting articles at Bishop Hill here.
Kriegeskorte et al 2009 (Nature Neuroscience), in an article entitled “Circular analysis in systems neuroscience: the dangers of double dipping”, discuss the same issue, commenting as follows:
In particular, “double dipping” – the use of the same data set for selection and selective analysis – will give distorted descriptive statistics and invalid statistical inference whenever the results statistics are not inherently independent of the selection criteria under the null hypothesis.
Nonindependent selective analysis is incorrect and should not be acceptable in neuroscientific publications….
If circularity consistently caused only slight distortions, one could argue that it is a statistical quibble. However, the distortions can be very large (Example 1, below) or smaller, but significant (Example 2); and they can affect the qualitative results of significance tests…
Distortions arising from selection tend to make results look more consistent with the selection criteria, which often reflect the hypothesis being tested. Circularity therefore is the error that beautifies results – rendering them more attractive to authors, reviewers, and editors, and thus more competitive for publication. These implicit incentives may create a preference for circular practices, as long as the community condones them.
A similar article by Kriegeskorte here entitled “Everything you never wanted to know about circular analysis, but were afraid to ask” uses similar language:
An analysis is circular (or nonindependent) if it is based on data that were selected for showing the effect of interest or a related effect
Vul and Kanwisher, in an article here entitled “Begging the Question: The Non-Independence Error in fMRI Data Analysis”, make similar observations, including:
In general, plotting non-independent data is misleading, because the selection criteria conflate any effects that may be present in the data from those effects that could be produced by selecting noise with particular characteristics….
Public broadcast of tainted experiments jeopardizes the reputation of cognitive neuroscience. Acceptance of spurious results wastes researchers’ time and government funds while people chase unsubstantiated claims. Publication of faulty methods spreads the error to new scientists.
Reader fred berple reports a related discussion in political science here: “How the Cases You Choose Affect the Answers You Get: Selection Bias in Comparative Politics”. Geddes observes:
Most graduate students learn in the statistics courses forced upon them that selection on the dependent variable is forbidden, but few remember why, or what the implications of violating this taboo are for their own work.
John Quiggin, a seemingly unlikely ally in criticism of methods used by Gergis and Karoly, has written a number of blog posts that are critical of studies that selected on the dependent variable.
Screening and Hockey Sticks
Both I and other bloggers (see links surveyed here) have observed that the common “community” practice of screening proxies for the “most temperature sensitive” or equivalent imparts a bias towards Hockey Sticks. This bias has commonly been demonstrated by producing a Stick from red noise.
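For readers who want to see the effect for themselves, here is a minimal sketch of the red noise demonstration. It is not the code from any of the linked posts. The AR(1) coefficient, the number of series, the 70-year calibration window, the linear "instrumental" target and the correlation cutoff are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def red_noise(n_series, n_years, rho, rng):
    """AR(1) 'red noise' series, one row per pseudo-proxy (no signal at all)."""
    eps = rng.standard_normal((n_series, n_years))
    x = np.zeros((n_series, n_years))
    x[:, 0] = eps[:, 0]
    for t in range(1, n_years):
        x[:, t] = rho * x[:, t - 1] + eps[:, t]
    return x

def screened_composite(proxies, target, calib, r_min):
    """Keep series whose calibration-period correlation with the target
    exceeds r_min, then average the survivors over the full period."""
    p = proxies[:, calib]
    pc = p - p.mean(axis=1, keepdims=True)
    tc = target - target.mean()
    r = (pc @ tc) / (np.linalg.norm(pc, axis=1) * np.linalg.norm(tc))
    keep = r > r_min
    return proxies[keep].mean(axis=0), keep

rng = np.random.default_rng(0)
n_series, n_years, n_calib = 1000, 600, 70   # assumed sizes, illustration only
proxies = red_noise(n_series, n_years, rho=0.9, rng=rng)
calib = slice(n_years - n_calib, n_years)    # last 70 "years"
target = np.linspace(0.0, 1.0, n_calib)      # hypothetical trending temperature

composite, keep = screened_composite(proxies, target, calib, r_min=0.3)
shaft = slice(0, n_years - n_calib)

# "Blade": mean of the final decade minus the mean of the pre-calibration shaft.
blade = composite[-10:].mean() - composite[shaft].mean()
unscreened = proxies.mean(axis=0)
unscreened_blade = unscreened[-10:].mean() - unscreened[shaft].mean()

print("series passing screening:", int(keep.sum()), "of", n_series)
print("blade, screened composite:", round(float(blade), 3))
print("blade, unscreened mean:   ", round(float(unscreened_blade), 3))
```

The screened composite is flat in the shaft, where the noise cancels, and turns up in the calibration period, because only series that happened to rise there were kept. The unscreened mean of the same noise shows no such blade.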
In the terminology of the above articles, screening a data set according to temperature correlations and then using the subset for temperature reconstruction quite clearly qualifies as Kriegeskorte’s “double dipping” – the use of the same data set for selection and selective analysis. Proxies are screened depending on correlation to temperature (either locally or teleconnected) and then the subset is used to reconstruct temperature. It’s hard to think of a clearer example than paleoclimate practice.
As Kriegeskorte observed, this double use “will give distorted descriptive statistics and invalid statistical inference whenever the results statistics are not inherently independent of the selection criteria under the null hypothesis.” This is an almost identical line of reasoning to many Climate Audit posts.
Gergis et al, at least on its face, attempted to mitigate this problem by screening on detrended data:
For predictor selection, both proxy climate and instrumental data were linearly detrended over the 1921–1990 period to avoid inflating the correlation coefficient due to the presence of the global warming signal present in the observed temperature record. Only records that were significantly (p<0.05) correlated with the detrended instrumental target over the 1921–1990 period were selected for analysis.
This is hardly ideal statistical practice, but it avoids the most grotesque form of the error. However, as it turned out, they didn’t implement this procedure, instead falling back into the common (but erroneous) Screening Fallacy.
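The difference between the described procedure and the usual one can be sketched in a few lines. This is an illustration under stated assumptions, not Gergis et al's code. The two series below are hypothetical: they share a linear trend but have independent noise, and the noise level and 70-point window are invented for the example. The sketch compares correlations only and does not attempt the p<0.05 significance test.

```python
import numpy as np

def detrend(y):
    """Remove a least-squares straight line from y."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    return y - (slope * t + intercept)

def corr(a, b):
    """Pearson correlation of two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(1)
n = 70                                   # e.g. annual values over 1921-1990
trend = np.linspace(0.0, 1.0, n)         # shared trend, assumed for illustration

# Hypothetical "instrumental" and "proxy" series: same trend, independent noise.
target = trend + 0.15 * rng.standard_normal(n)
proxy = trend + 0.15 * rng.standard_normal(n)

raw_r = corr(proxy, target)
detrended_r = corr(detrend(proxy), detrend(target))

print("correlation on raw series:      ", round(raw_r, 3))
print("correlation on detrended series:", round(detrended_r, 3))
```

The raw correlation is high only because both series trend upward. Once the trend is removed, the year-to-year variations are unrelated, so a screen on detrended correlations would not pass this "proxy" while a screen on raw correlations would.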
The first line of defence – from, for example, comments from Jim Bouldin and Nick Stokes – has been to argue that there’s nothing wrong with using the same data set for selection and selective analysis and that Gergis’ attempted precautions were unnecessary. I have no doubt that, had Gergis never bothered with statistical precaution and simply done a standard (but erroneous) double dip/selection on the dependent variable, no “community” reviewer would have raised the slightest objection. If anything, their instinct is to insist on an erroneous procedure, as we’ve seen in opening defences.
Looking ahead, the easiest way for Gergis et al to paper over their present embarrassment will be to argue (1) that the error was only in the description of their methodology and (2) that using detrended correlations was, on reflection, not mandatory. This tactic could be implemented by making only the following changes:
For predictor selection, both proxy climate and instrumental data were linearly detrended over the 1921–1990 period to avoid inflating the correlation coefficient due to the presence of the global warming signal present in the observed temperature record. Only records that were significantly (p<0.05) correlated with the detrended instrumental target over the 1921–1990 period were selected for analysis.
Had they done this in the first place, if it had later come to my attention, I would have objected that they were committing a screening fallacy (as I had originally done), but no one on the Team or in the community would have cared. Nor would IPCC.
So my guess is that they’ll resubmit on these lines and just tough it out. If the community is unoffended by upside-down Mann or Gleick’s forgery, then they won’t be offended by Gergis and Karoly “using the same data for selection and selective analysis”.
Postscript: As Kriegeskorte observed, the specific impact of an erroneous method on a practical data set is hard to predict. In our case, it does not mean that a given reconstruction is necessarily an “artifact” of red noise, since a biased procedure will produce a Stick from an actual Stick signal. (If the “signal” is a Stick, the biased procedure will typically enhance the Stick.) The problem is that a biased method can produce a Stick from red noise as well, and therefore not much significance can be placed on a Stick obtained from a flawed method.
If the “true” signal is a Stick, then it should emerge without resorting to flawed methodology. In practical situations with inconsistent proxies, biased methods will typically place heavy weights on a few series (bristlecones in a notorious example) and the validity of the reconstruction then depends on whether these few individual proxies have a unique and even magical ability to measure worldwide temperature – a debate that obviously continues.