
Gremlins in the work of Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske

[Image: 1977 AMC Gremlin X]

Remember that “gremlins” paper by environmental economist Richard Tol? The one that had almost as many errors as data points? The one where, each time a correction was issued, more problems would spring up? (I’d say “hydra-like” but I’d rather not mix my mythical-beast metaphors.)

Well, we’ve got another one. This time, nothing to do with the environment or economics; rather, it’s from some familiar names in social psychology.

Nick Brown tells the story:

For an assortment of reasons, I [Brown] found myself reading this article one day: This Old Stereotype: The Pervasiveness and Persistence of the Elderly Stereotype by Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005). . . .

This paper was just riddled with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn’t even “statistically significant” (thus, under the rules, was unpublishable).

But that wasn’t the worst of it. It turns out that some of the numbers reported in that paper just couldn’t have been correct. It’s possible that the authors were doing some calculations wrong, for example by incorrectly rounding intermediate quantities. Rounding error doesn’t sound like such a big deal, but it can supply a useful set of “degrees of freedom” to allow researchers to get the results they want, out of data that aren’t readily cooperating.
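
To see how much room rounding can give, here is a toy R calculation with made-up numbers (not the data from the Cuddy et al. paper): the same simulated paired differences yield a noticeably different t statistic depending on whether the mean and standard deviation are rounded before the final step.

# Hypothetical illustration with simulated data, not the paper's numbers:
# rounding intermediate quantities changes the computed paired t statistic.
set.seed(1)
d <- rnorm(30, mean = 0.35, sd = 1)    # simulated paired differences
n <- length(d)
t_exact   <- mean(d) / (sd(d) / sqrt(n))                      # full precision
t_rounded <- round(mean(d), 1) / (round(sd(d), 1) / sqrt(n))  # rounded intermediates
c(exact = t_exact, rounded = t_rounded)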

Here’s how Brown puts it:

To summarise, either:
/a/ Both of the t statistics, both of the p values, and one of the dfs in the sentence about paired comparisons is wrong;
or
/b/ “only” the t statistics and p values in that sentence are wrong, and the means on which they are based are wrong.

And yet, the sentence about paired comparisons is pretty much the only evidence for the authors’ purported effect. Try removing that sentence from the Results section and see if you’re impressed by their findings, especially if you know that the means that went into the first ANOVA are possibly wrong too.

OK, everybody makes mistakes. These people are psychologists, not statisticians, so maybe we shouldn’t fault them for making some errors in calculation, working as they were in a pre-Markdown era.

The way that this falls into “gremlins” territory is how the mistakes fit together: The claims in this paper are part of an open-ended theory that can explain just about any result, any interaction in any direction. Publication’s all about finding something statistically significant and wrapping it in a story. So if it’s not one thing that’s significant, it’s something else.

And that’s why the authors’ claim that fixing the errors “does not change the conclusion of the paper” is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

As with Richard Tol’s notorious paper, the gremlins feed upon themselves, as each revelation of error reveals the rot beneath the original analysis, and when the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.

We’ve encountered all three of these authors before.

Amy Cuddy is a co-author and principal promoter of the so-called power pose, and she notoriously reacted to an unsuccessful outside replication of that study by going into deep denial. The power pose papers were based on “p less than .05” comparisons constructed from analyses with many forking paths, including various miscalculations which brought some p-values below that magic cutoff.

Michael Norton is a coauthor of that horrible air-rage paper that got so much press a few months ago, and even appeared on NPR. It was in a discussion thread on that air-rage paper that the problems of the Cuddy, Norton, and Fiske paper came out. Norton also is on record recommending that you buy bullfight tickets for that “dream vacation in Spain.” (When I mocked Norton and his coauthor for sending people to bullfights, a commenter mocked me right back by recommending “a ticket to a factory farm slaughterhouse” instead. I had to admit that this would be an even worse vacation destination!)

And, as an extra bonus, when I just googled Michael Norton, I came across this radio show in which Norton plugs “tech giant Peter Diamandis,” who’s famous in these parts for promulgating one of the worst graphs we’ve ever seen. These people are all connected. I keep expecting to come across Ed Wegman or Marc Hauser.

Finally, Susan Fiske seems to have been doing her very best to wreck the reputation of the prestigious Proceedings of the National Academy of Sciences (PPNAS) by publishing papers on himmicanes, power pose, and “People search for meaning when they approach a new decade in chronological age.” In googling Fiske, I was amused to come across this press release entitled, “Scientists Seen as Competent But Not Trusted by Americans.”

A whole fleet of gremlins

This is really bad. We have interlocking research teams making fundamental statistical errors over and over again, publishing bad work in well-respected journals, promoting bad work in the news media. Really the best thing you can say about this work is that maybe it’s harmless: no relevant policymaker will take the claims about himmicanes seriously, no airline executive or transportation regulator would be foolish enough to believe the claims from those air rage regressions, and, hey, even if power pose doesn’t work, it’s not hurting anybody, right? On the other hand, those of us who really do care about social psychology are concerned about the resources and attention that are devoted to this sort of cargo-cult science. And, as a statistician, I feel disgust, at a purely aesthetic level, at these fundamental errors of inference. Wrapping it all up are the attitudes of certainty and defensiveness exhibited by the authors and editors of these papers, never wanting to admit that they could be wrong and continuing to promote and promote and promote their mistakes.

A whole fleet of gremlins, indeed. In some ways, Richard Tol is more impressive in that he can do it all on his own, whereas these psychology researchers work in teams. But the end result is the same. Error piled upon error piled upon error, piled upon refusal to admit that their conclusions could be completely mistaken.

P.S. Look. I’m not saying these are bad people. I’m guessing that from their point of view, they’re doing science, they have good theories, their data support their theories, and “p less than .05” is just a silly rule they have to follow, a bit of paperwork that needs to be stamped on their findings to get them published. Sure, maybe they cut corners here or there, or make some mistakes, but those are all technicalities—at least, that’s how I’m guessing they’re thinking. For Cuddy, Norton, and Fiske to step back and think that maybe almost everything they’ve been doing for years is all a mistake . . . that’s a big jump to take. Indeed, they’ll probably never take it. All the incentives fall in the other direction. So that’s the real point of this post: the incentives. Forget about these three particular professionals, and consider the larger problem, which is that errors get published and promoted and hyped and Gladwell’d and Freakonomics’d and NPR’d, whereas when Nick Brown and his colleagues do the grubby work of checking the details, you barely hear about it. That bugs me, hence this post.

P.P.S. Putting this in perspective, this is about the mildest bit of scientific misconduct out there. No suppression of data on side effects from dangerous drugs, no million-dollar payoffs, no $228,364.83 in missing funds, no dangerous policy implications, no mistreatment of cancer patients, no monkeys harmed by any of these experiments. It’s just bad statistics and bad science, simple as that. Really the worst thing about it is the way in which respected institutions such as the Association for Psychological Science, National Academy of Sciences, and National Public Radio have been sucked into this mess.

“Positive Results Are Better for Your Career”

Brad Stiritz writes:

I thought you might enjoy reading the following Der Spiegel interview with Peter Wilmshurst. Talk about fighting the good fight! He took the path of greatest resistance, and he beat what I presume are pretty stiff odds.

Then the company representatives asked me to leave some of the patients out of the data analysis. Without these patients, the study result would have been positive.

I guess Serpico-esque stories like this are probably outlier stories, particularly when they have happier endings than what Frank Serpico experienced. Or for that matter, Boris Kolesnikov.

Wow—that’s pretty scary! I had cardiac catheterization once!

The Spiegel interview begins with a bang:

SPIEGEL: In your early years as a researcher, a pharmaceutical company offered you a bribe equivalent to two years of your salary: They wanted to prevent you from publishing negative study results. Were you disappointed that you weren’t worth more?

Peter Wilmshurst: (laughs) I was just a bit surprised to be offered any money, really. I was a very junior researcher and doctor, only 33 years old, so I didn’t know that sort of thing happened. I didn’t know that you could be offered money to conceal data.

SPIEGEL: How exactly did they offer it to you? They probably didn’t say: “Here’s a bribe for you.”

Wilmshurst: No, of course not! Initially we were talking about the results that I’d obtained: That the drug that I had been testing for them did not work and had dangerous side effects. Then the company representatives asked me to leave some of the patients out of the data analysis. Without these patients, the study result would have been positive. When I said I couldn’t do that, they asked me not to publish the data. And to compensate me for the work I had done in vain, they said, they would offer me this amount of money.

I recommend you read the whole thing.

P.S. Full disclosure: Some of my research is funded by Novartis.

“Merciless Indian savages”

Americans (used to) love world government


Sociologist David Weakliem writes:

It appears that an overwhelming majority of Americans who have an opinion on the subject think that Britain should remain in the European Union. But how many would support the United States joining an organization like the EU? My guess is very few. But back in 1946, the Gallup Poll asked “Do you think the United Nations organization should be strengthened to make it a world government with power to control the armed forces of all nations, including the United States?” 54% said yes, and only 24% no, with the rest undecided. The question was asked again in 1946 and 1947, with similar results. In 1951, the margin was smaller, at 49-36%. In 1953 and 1955, there were narrow margins against the idea. That was the last time the question, or anything like it, was asked. Of course, opposition probably would have increased if anyone had seriously tried to implement a plan like this, but for a while many Americans were willing to at least contemplate the idea.

Wow. 54%. Really? I did a Google search and indeed that’s what the poll said. Here’s George Gallup writing in the Pittsburgh Press on Christmas Eve, 1947:

[Screenshot: George Gallup’s column, Pittsburgh Press, December 24, 1947]

Even more striking was the 49% support as late as 1951, at which point I assume any illusions about our Soviet allies had dissipated.

Weakliem does have a good point, though, when he writes that “opposition probably would have increased if anyone had seriously tried to implement a plan like this.” Supporting world government is one thing; supporting any particular version of it is another.

Anyway, this poll finding seems worth sharing amid all the Brexit discussion, also a good item for July 4th.

On deck this week

Mon: Americans (used to) love world government

Tues: “Positive Results Are Better for Your Career”

Wed: “I would like to share some sad stories from economics related to these issues”

Thurs: Happiness formulas

Fri: “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]."

Sat: Causal and predictive inference in policy research

Sun: Over at the sister blog, they’re overinterpreting forecasts

Data science as the application of theoretical knowledge

Patrick Atwater writes:

Insights that “much of what’s hard looks easy” and it’s about “getting the damn data” highlight important points that much of the tech-ey industry dominating definitions overlook in the excitement about production ML recommendation systems and the like.

Working to build from that grounded perspective, I penned together a quick piece digging into what really defines data science and I think the applied nature of the work that you hint at holds an important key. In many ways, the confused all-things-to-all-people nature echos the fractal nature of a field like “management” which devolves into poetic aphorisms and intellectual-lite books elucidating best practices while at the same time pulling from more formal academic disciplines (civil engineering, environmental science, and chemistry for instance in water management).

Consider an applied data science example. A friend at a water utility I work with built a billing calculator using R shiny. That required something like a half day of analytical work and then a couple weeks to get the UI/UX looking right and the servers playing nicely. Note that’s an analyst doing the work rather than a software engineer which I think speaks to the interdisciplinary nature of data science and the oft cited CS / Statistics / Domain expertise venn diagram.

I don’t really have anything to say about this—the language is too far from mine—but I thought I’d share it with you.

Too good to be true: when overwhelming mathematics fails to convince

Gordon Danning points me to this news article by Lisa Zyga, “Why too much evidence can be a bad thing,” reporting on a paper by Lachlan Gunn and others. Their conclusions mostly seem reasonable, if a bit exaggerated. For example, I can’t believe this:

The researchers demonstrated the paradox in the case of a modern-day police line-up, in which witnesses try to identify the suspect out of a line-up of several people. The researchers showed that, as the group of unanimously agreeing witnesses increases, the chance of them being correct decreases until it is no better than a random guess.

This doesn’t make sense. I have a feeling their conclusion is leaning heavily on some independence assumption in their model.

I clicked through to see the paper, and I don’t see any actual data on police lineups. So I see no reason to trust them on that. The math is interesting, though, and I’ll agree there’s some relevance to real problems. I’m just disturbed by everyone’s willingness to assume the particular mathematical results apply to particular real scenarios.
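
To see how a result like that can come straight out of the model’s assumptions rather than out of any lineup data, here is a toy Bayesian calculation in R. The numbers are assumptions I made up for illustration: a small prior probability that the identification process is systematically “broken” (say, a biased lineup) so that every witness picks the same person regardless of guilt, and independent witnesses otherwise.

# Toy illustration of the unanimity paradox; all numbers are made up.
p_broken  <- 0.01   # prior probability the process is systematically biased
p_correct <- 0.80   # per-witness accuracy when the process works, witnesses independent

# P(identification is correct | n unanimous witnesses)
unanimity_posterior <- function(n) {
  like_ok     <- (1 - p_broken) * p_correct^n  # all n independently pick the right person
  like_broken <- p_broken * 1                  # a broken process yields unanimity regardless
  like_ok / (like_ok + like_broken)
}

sapply(c(1, 3, 5, 10, 20), unanimity_posterior)  # declines as unanimity grows

The decline here is driven entirely by the assumed failure mode and the independence assumption, which is the point: without data on how often lineups actually go wrong, the math alone doesn’t tell you much about real police lineups.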

“Simple, Scalable and Accurate Posterior Interval Estimation”

Cheng Li, Sanvesh Srivastava, and David Dunson write:

We propose a new scalable algorithm for posterior interval estimation. Our algorithm first runs Markov chain Monte Carlo or any alternative posterior sampling algorithm in parallel for each subset posterior, with the subset posteriors proportional to the prior multiplied by the subset likelihood raised to the full data sample size divided by the subset sample size. To obtain an accurate estimate of a posterior quantile for any one-dimensional functional of interest, we simply calculate the quantile estimates in parallel for each subset posterior and then average these estimates.

Wow—does this really work? This could be awesome. Definitely worth trying out in Stan.
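
For concreteness, here is a minimal sketch of the quantile-averaging idea in R, using a conjugate normal toy model so that each “subset posterior” can be sampled directly; in a real application each subset would be fit with Stan or another MCMC algorithm, with the subset likelihood raised to the power (full sample size) / (subset sample size). This is my reading of the recipe, not the authors’ code.

# Toy sketch of posterior-interval estimation by averaging subset quantiles.
# Normal model with known sd and flat prior, so subset posteriors have closed form.
set.seed(123)
n_full <- 10000
sigma  <- 2
y      <- rnorm(n_full, mean = 1.5, sd = sigma)

k       <- 10                                    # number of subsets
subsets <- split(y, rep(1:k, length.out = n_full))

post_quantile <- function(y_sub, prob, n_draws = 5000) {
  n_sub <- length(y_sub)
  power <- n_full / n_sub
  # Powered subset likelihood with flat prior:
  # mean | y_sub ~ N(mean(y_sub), sigma^2 / (power * n_sub))
  draws <- rnorm(n_draws, mean(y_sub), sigma / sqrt(power * n_sub))
  quantile(draws, prob)
}

# Average the subset quantile estimates to approximate full-posterior quantiles
sapply(c(0.025, 0.5, 0.975), function(p)
  mean(sapply(subsets, post_quantile, prob = p)))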

Informative priors for treatment effects

Biostatistician Garnett McMillan writes:

A PI recently completed a randomized trial where the experimental treatment showed a large, but not quite statistically significant (p=0.08) improvement over placebo. The investigators wanted to know how many additional subjects would be needed to achieve significance. This is a common question, which is very hard to answer for non-statistical audiences. Basically, I said we would need to conduct a new study.

I took the opportunity to demonstrate a Bayesian analysis of these data using skeptical and enthusiastic priors on the treatment effect. I also showed how the posterior is conditional on accumulated data, and naturally lends itself to sequential analysis with additional recruitment. The investigators, not surprisingly, loved the Bayesian analysis because it gave them ‘hope’ that the experimental treatment might really help their patients.

Here is the problem: The investigators want to report BOTH the standard frequentist analysis AND the Bayesian analysis. In their mind the two analyses are simply two sides of the same coin. I have never seen this (outside of statistics journals), and have a hard time explaining how one reconciles statistical results where the definition of probability is so different. Do you have any help for me in explaining this problem to non-statisticians? Any useful metaphors or analogies?

My reply: I think it’s fine to consider the classical analysis as a special case of the Bayesian analysis under a uniform prior distribution. So in that way all the analyses can be presented on the same scale.

But I think what’s really important here is to think seriously about plausible effect sizes. It is not in general a good idea to take a noisy point estimate and use it as a prior. For example, suppose the study so far gives an estimated odds ratio of 2.0, with a (classical) 95% interval of (0.9, 4.4). I would not recommend a prior centered around 2. Indeed, the same sort of problem—or even worse—comes from taking previous published results as a prior. Published results are typically statistically significant and thus can grossly overestimate effect sizes.

Indeed, my usual problem with the classical estimate, or its Bayesian interpretation, is with the uniform prior, which includes all sorts of unrealistically large treatment effects. Real treatment effects are usually small. So I’m guessing that with a realistic prior, estimates will be pulled toward zero.
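
To make the “pulled toward zero” point concrete, here is a small R calculation using the numbers above (odds ratio 2.0, 95% interval 0.9 to 4.4) and a normal approximation on the log-odds scale. The normal(0, 0.35) prior is my own stand-in for “real treatment effects are usually small,” not anything from the study in question.

# Normal-approximation shrinkage on the log odds ratio scale; prior is illustrative.
est <- log(2.0)
se  <- (log(4.4) - log(0.9)) / (2 * 1.96)   # back out the standard error from the 95% interval

prior_mean <- 0      # centered at "no effect"
prior_sd   <- 0.35   # assumes most plausible odds ratios fall roughly between 0.5 and 2

post_prec <- 1 / se^2 + 1 / prior_sd^2
post_mean <- (est / se^2 + prior_mean / prior_sd^2) / post_prec
post_sd   <- sqrt(1 / post_prec)

exp(post_mean)                               # point estimate, pulled from 2.0 toward 1.0
exp(post_mean + c(-1, 1) * 1.96 * post_sd)   # approximate 95% posterior interval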

On the other hand, I don’t see the need for requiring 95% confidence. We have to make our decisions in the meantime.

Regarding the question of whether the treatment helps the patients: I can’t say more without context, but in many settings we can suppose that the treatment is helping some people and hurting others, so I think it makes sense to consider these tradeoffs.

Horrible attack in Turkey

I don’t have anything to say about this, nor I think did I blog on the attacks in Florida or Paris or all the terrible things going on in the Middle East every day. It’s not my area of expertise and I don’t have anything particular to add.

I’m only posting this note here because we’ve had a bunch of recent posts related to Brexit news, and during the next several months we’ll continue to be interrupting our regular fare of himmicanes, power pose, and Stan with posts on the U.S. election. (The blog is already full through mid-Nov but polling and election news will certainly be causing us to bump some scheduled material.) It seems kinda weird to sometimes be posting topically but then to completely ignore things like the Turkey attack.

We learn from anomalies. And, the sad thing is, these terror events are no longer anomalous; they’re commonplace. Speaking as a social scientist, the question to study is no longer why they happen, but rather when and how they occur, how people and organizations react to them, and so forth.

Again, I have no special insight here; it just felt odd not to react at all to something which, considered in isolation, is such a striking and scary thing. You can take it as representative of all the striking and scary things happening in the world that we just let pass by us.

Why experimental economics might well be doing better than social psychology when it comes to replication

There’s a new paper, “Evaluating replicability of laboratory experiments in economics,” by Colin Camerer, Anna Dreber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Johan Almenberg, Adam Altmejd, Taizan Chan, Emma Heikensten, Felix Holzmeister, Taisuke Imai, Siri Isaksson, Gideon Nave, Thomas Pfeiffer, Michael Razen, Hang Wu, which three different people sent to me, including one of the authors of the paper, a journalist, and also Dale Lehman, who wrote:

This particular study appears to find considerable reproducibility and I think it would be valuable for you to comment on it. I have not reviewed it myself, but I suspect it has been done reasonably—my guess is that there are some good reasons why experimental economics studies might be more readily reproducible (perhaps I should say replicable) than studies in psych, business, etc. The experiments are generally better conceived so as to have fewer intervening factors—e.g., random assignment with varying levels of financial rewards for performing some types of tasks. I believe these types of experiments differ in some fundamental ways from experiments about “power poses” or other such things. It may also be that economists are more careful about experimental setup than psychologists.

One further issue that deserves some attention is the difference between reproducing economic experimental results and replicating economic observational studies. I believe the status of the latter is likely to be very poor—and nearly impossible to investigate given how hard it is to get access to the data. The ability to reproduce results from the experimental studies casts no light on the likelihood of being able to replicate other types of economics studies.

The paper also came up here on the blog, where I wrote that I would not be surprised if experimental economics has a higher rate of replication than social psychology. I don’t know enough about the field of economics to make the comparison with any confidence, but as I said in my post on psychology replication, I feel that many social psychologists are destroying their chances by purposely creating interventions that are minor and at times literally imperceptible. Economists perhaps are more willing to do real interventions.

Another thing is that economists, compared to psychologists, seem more attuned to the challenges of generalizing from survey or lab to real-world behavior. Indeed, many times economists have challenged well-publicized findings in social psychology by arguing that people won’t behave these ways in the real world with real money at stake. So, just to start with, economists unlike psychologists seem aware of the generalization problem.

That said, I think it’s hard to interpret any of these overall percentages because it all depends on what you include in your basket of studies. Stroop will replicate, ovulation and voting won’t, and there’s a big spectrum in between.

Broken broken windows policy?


A journalist pointed me to this recent report from the New York City Department of Investigation, which begins:

Between 2010 and 2015, the New York City Police Department (NYPD) issued 1,839,414 “quality-of-life” summonses for offenses such as public urination, disorderly conduct, drinking alcohol in public, and possession of small amounts of marijuana. . . . NYPD has claimed for two decades that quality-of-life enforcement is also a key tool in the reduction of felony crime.

Here’s what they find:

OIG-NYPD’s analysis has found no empirical evidence demonstrating a clear and direct link between an increase in summons and misdemeanor arrest activity and a related drop in felony crime. Between 2010 and 2015, quality-of-life enforcement rates, and in particular quality-of-life summons rates, have dramatically declined, but there has been no commensurate increase in felony crime. While the stagnant or declining felony crime rates observed in this six-year time frame may be attributable to NYPD’s other disorder reduction strategies or other factors, OIG-NYPD finds no evidence to suggest that crime control can be directly attributed to quality-of-life summonses and misdemeanor arrests.

I took a quick look, and they make a reasonable case that there’s no evidence from 2010-2015 that so-called quality-of-life policing has any effect in reducing serious crime, hence there’s a good case for doing less of this sort of harassment of citizens on the street, given that, as they say in the report: “Issuing summonses and making misdemeanor arrests are not cost free. The cost is paid in police time, in an increase in the number of people brought into the criminal justice system and, at times, in a fraying of the relationship between the police and the communities they serve.”

But I don’t know how relevant this is to claims about the effectiveness of quality-of-life or “broken windows” policing in the past. The argument was that quality-of-life policing was necessary in the 1970s/80s/90s because the law was not widely respected. Now that behaviors, attitudes, and expectations have changed, perhaps an intense level of quality-of-life policing no longer has the effect it had earlier. So it’s possible that the ramping-up of those sorts of police actions was a good idea in the 1990s, and that the ramping-down is a good idea now.

P.S. I don’t know who wrote the report in question. The only names I see are Mark Peters, Commissioner, and Philip Eure, Inspector General for the NYPD. I find it difficult to interact with a document with no listed author. At the end of the report, it says, “Please contact us at: Office of the Inspector General for the New York City Police Department” and then gives an address and some phone numbers and emails. That’s fine, and I understand that it’s the whole Office of the Inspector General for the NYPD that takes responsibility for the report, but there’s still an author, right?

P.P.S. Here’s a news article by Nick Pinto with some background.

P.P.P.S. Lots of informed discussion here from Peter Moskos.

Short course on Bayesian data analysis and Stan 18-20 July in NYC!


Jonah Gabry, Vince Dorie, and I are giving a 3-day short course in two weeks.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan, which is 2.10.) If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.
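
If you want a quick sanity check before class (this is just a suggestion, not part of the official setup instructions), something like the following should compile and sample a trivial model, confirming that your C++ toolchain and RStan installation are working:

# Install from CRAN and fit a trivial model as a smoke test.
install.packages("rstan", dependencies = TRUE)
library(rstan)

fit <- stan(
  model_code = "parameters { real theta; } model { theta ~ normal(0, 1); }",
  iter = 1000, chains = 2
)
print(fit)   # should show summaries for theta and lp__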

Class structure and example topics for the three days:

Monday, July 18: Introduction to Bayes and Stan
Morning:
Intro to Bayes
Intro to Stan
The statistical crisis in science
Afternoon:
Stan by example
Components of a Stan program
Little data: how traditional statistical ideas remain relevant in a big data world

Tuesday, July 19: Computation, Monte Carlo and Applied Modeling
Morning:
Computation with Monte Carlo Methods
Debugging in Stan
Generalizing from sample to population
Afternoon:
Multilevel regression and generalized linear models
Computation and Inference in Stan
Why we don’t (usually) have to worry about multiple comparisons

Wednesday, July 20: Advanced Stan and Big Data
Morning:
Vectors, matrices, and transformations
Mixture models and complex data structures in Stan
Hierarchical modeling and prior information
Afternoon:
Bayesian computation for big data
Advanced Stan programming
Open problems in Bayesian data analysis

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.

Should this paper in Psychological Science be retracted? The data do not conclusively demonstrate the claim, nor do they provide strong evidence in favor. The data are, however, consistent with the claim (as well as being consistent with no effect)

Retractions or corrections of published papers are rare. We routinely encounter articles with fatal flaws, but it is so rare that such articles are retracted that it’s news when it happens.

Retractions sometimes happen at the request of the author (as in the link above, or in my own two retracted/corrected articles) and other times are achieved only with great difficulty if at all, in the context of scandals involving alleged scientific misconduct (Hauser), plagiarism (Wegman), fabrication (Lacour, Stapel), and plain old sloppiness (Reinhart and Rogoff, maybe Tol falls into this category as well).

And one thing that’s frustrating is that, even when the evidence is overwhelming that a published claim is just plain wrong, authors will fight and fight and refuse to admit even an inadvertent mistake (see the story on pages 51-52 here).

These cases are easy calls from the ethical perspective, whatever political difficulties might arise in trying to actually elicit a reaction in the face of opposition.

Should this paper be retracted?

Now I want to talk about a different example. It’s a published paper not involving any scientific misconduct, not even any p-hacking that I notice, but the statistical analysis is flawed, to the extent that I do not think the data offer any strong support for the researchers’ hypothesis.

Should this paper be retracted/corrected? I see three arguments:

1. Yes. The paper was published as an empirical study that offers strong support for a certain hypothesis. The study offers no such strong support, hence the paper should be flagged so that future researchers do not take it as evidence for something it’s not.

2. No. Although the data are consistent with the researchers’ hypothesis being false, they are also consistent with the researchers’ hypothesis being true. We can’t demonstrate convincingly that the hypothesis is wrong, either, so the paper should stand.

3. No. In practice, retraction and even correction are very strong signals, and these researchers should not be punished for an innocent mistake. It’s hard enough to get actual villains to retract their papers, so why pick on these guys.

Argument 3 has some appeal but I’ll set it aside; for the purpose of this post I will suppose that retractions and corrections should be decided based on scientific merit rather than on a comparative principle.

I’ll also set aside the reasonable argument that, if a fatal statistical error is enough of a reason for retraction, then half the content of Psychological Science would be retracted each issue.

Instead I want to focus on the question: To defend against retraction, is it enough to point out that your data are consistent with your theory, even if the evidence is not nearly as strong as was claimed in the published paper?

A study of individual talent and team performance

OK, now for the story, which I learned about through this email from Jeremy Koster:

I [Koster] was reading this article in Scientific American, which led me to the original research article in Psychological Science (a paper that includes a couple of researchers from the management department at Columbia, incidentally).

After looking at Figure 2 for a little while, I [Koster] thought, “Hmm, that’s weird, what soccer teams are comprised entirely of elite players?”

[Figure 2 from Swaab et al.]

Which led me to their descriptive statistics. Their x-axis ranges to 100%, but the means and SD’s are only 7% and 16%, respectively:

[Descriptive statistics table from Swaab et al.]

They don’t plot the data or report the range, but given that distribution, I’d be surprised if they had many teams comprising 50% elite players.

And yet, their results hinge on the downward turn that their quadratic curve takes at these high values of the predictor. They write, “However, Study 2 also revealed a significant quadratic effect of top talent: Top talent benefited performance only up to a point, after which the marginal benefit of talent decreased and turned negative (Table 2, Model 2; Fig. 2).”

If you’re looking to write a post about the perils of out-of-sample predictions, this would seem to be a fun candidate . . .

For convenience, I’ve displayed the above curve in the range 0 to 50% so you can see that, based on the fitted model, there’s no evidence of any decline in performance:

[The fitted curve displayed over the range 0 to 50%]

So, in case you were thinking of getting both Messi and Cristiano Ronaldo on your team: Don’t worry. It looks like your team’s performance will improve.
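
To illustrate Koster’s point about out-of-sample prediction (with simulated data, since the paper’s data aren’t reproduced here), here’s how a fitted quadratic can put its downturn well beyond the range of the observed predictor, so that the apparent decline is pure extrapolation:

# Simulated example: performance increases with talent over the observed range (0 to 0.5),
# but the fitted quadratic's downturn lands where there are no data.
set.seed(42)
talent <- runif(200, 0, 0.5)                                        # observed teams: at most 50% "top talent"
perf   <- 0.4 + 1.2 * talent - 0.8 * talent^2 + rnorm(200, 0, 0.1)  # mildly concave, increasing on (0, 0.5)
fit    <- lm(perf ~ talent + I(talent^2))

grid <- data.frame(talent = seq(0, 1, by = 0.01))
plot(grid$talent, predict(fit, grid), type = "l",
     xlab = "Proportion of top talent", ylab = "Predicted performance")
rug(talent)   # the data stop near 0.5; the "decline" is extrapolation beyond them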

Following the links

The news article is by a psychology professor named Cindi May and is titled, “The Surprising Problem of Too Much Talent: A new finding from sports could have implications in business and elsewhere.” The research article is by Roderick Swaab, Michael Schaerer, Eric Anicich, Richard Ronay and Adam Galinsky and is titled, “The Too-Much-Talent Effect: Team Interdependence Determines When More Talent Is Too Much or Not Enough.”

May writes:

Swaab and colleagues compared the amount of individual talent on teams with the teams’ success, and they find striking examples of more talent hurting the team. The researchers looked at three sports: basketball, soccer, and baseball. In each sport, they calculated both the percentage of top talent on each team and the teams’ success over several years. . . .

For both basketball and soccer, they found that top talent did in fact predict team success, but only up to a point. Furthermore, there was not simply a point of diminishing returns with respect to top talent, there was in fact a cost. Basketball and soccer teams with the greatest proportion of elite athletes performed worse than those with more moderate proportions of top level players.

Now that the finding’s been established, it’s story time:

Why is too much talent a bad thing? Think teamwork. In many endeavors, success requires collaborative, cooperative work towards a goal that is beyond the capability of any one individual. . . . When a team roster is flooded with individual talent, pursuit of personal star status may prevent the attainment of team goals. The basketball player chasing a point record, for example, may cost the team by taking risky shots instead of passing to a teammate who is open and ready to score.

Two related findings by Swaab and colleagues indicate that there is in fact a tradeoff between top talent and teamwork. First, Swaab and colleagues found that the percentage of top talent on a team affects intrateam coordination. . . . The second revealing finding is that extreme levels of top talent did not have the same negative effect in baseball, which experts have argued involves much less interdependent play. In the baseball study, increasing numbers of stars on a team never hindered overall performance. . . .

The lessons here extend beyond the ball field to any group or endeavor that must balance competitive and collaborative efforts, including corporate teams, financial research groups, and brainstorming exercises. Indeed, the impact of too much talent is even evident in other animals: When hen colonies have too many dominant, high-producing chickens, conflict and hen mortality rise while egg production drops.

This is all well and good (except the bit about the hen colonies; that seems pretty much irrelevant to me, but then again I’m not an egg farmer so what do I know?), but it all hinges on the general validity of the claims made in the research paper. Without the data, it’s just storytelling. And I can tell as good a story as anyone. OK, not really. Stephen King’s got me beat. Hell, Jonathan Franzen’s got me beat. Salman Rushdie on a good day’s got me beat. John Updike or Donald Westlake could probably still out-story me, even though they’re both dead. But I can tell stories just as well as the ovulation-and-voting people, or the fat-arms-and-political-attitudes people, or whatsisname who looked at beauty and sex ratio, etc. Stories are cheap. Convincing statistical evidence, that’s what’s hard to find.

So . . . I was going to look into this. After all, I’m a busy guy, I have lots to do and thus a desperate need to procrastinate. So if some perfect stranger emails me asking me to look into a paper I’ve never heard of on a topic that only mildly interests me (yes, I’m a sports fan but, still, this isn’t the most exciting hypothesis in the world), then, sure, I’m up for it. After all, if the options are blogging or real work, I’ll choose blogging any day of the week.

I contacted one of the authors who’s at Columbia and he reminded me that this paper had been discussed online by Leif Nelson and Uri Simonsohn. And then I remembered that I’d read that post by Nelson and Simonsohn and commented on it myself a year ago.

Swaab et al. responded to Nelson and Simonsohn with a short note, and here are their key graphs:

[Key graphs from Swaab et al.’s response to Nelson and Simonsohn]

I think we can all agree on three things:

1. There’s not a lot of data in the “top talent” range as measured by the authors. Thus, to the extent there is a “top talent effect,” it is affecting very few teams.

2. The data are consistent with there being declining performance for the most talented teams.

3. The data are also consistent with there being no decline in performance for the most talented teams. Or, to put it another way, if these were the quantitative results that had been published (when using the measure that they used in the main text of their paper, they found no statistically significant decline at all; they were only able to find such a decline by changing to a different measure that had only been in the supplementary version of their original paper), I can’t imagine the paper would’ve been published.

The authors also present results for baseball, which they argue should not show a “too-much-talent effect”:

[Graphs of the baseball results from Swaab et al.’s response]

This looks pretty convincing, but I think the argument falls apart when you look at it too closely. Sure, these linear patterns look pretty good. But, again, these graphs are also consistent with a flat pattern at the high end—just draw a threshold close enough to the right edge of either graph and you’ll find no statistically significant pattern beyond the threshold.

In discussing these results, Swaab et al. write, “The finding that the effect of top talent becomes flat (null) at some point is an important finding: Even under the assumption of diminishing marginal returns, the cost-benefit ratio of adding more talent can decline as hiring top talent is often more expensive than hiring average talent.”

Sure, that’s fine, but recall that a key part of their paper was that their empirical findings contradicted naive intuition. In fact, the guesses they reported from naive subjects did show declining marginal return on talent. Look at this, from their Study 1 on “Lay Beliefs About the Relationship Between Top Talent and Performance”:

[Figure from Swaab et al.’s Study 1 on lay beliefs about talent and performance]

These lay beliefs seem completely consistent with the empirical data, especially considering that Swaab et al. defined “top talent” in a way so that there are very few if any teams in the 90%-100% range of talent.

What to do?

OK, so should the paper be retracted? What do you think?

The authors summarize the reanalysis with the remark:

The results of the new test . . . suggest that the strongest version of our arguments—that more talent can even lead to worse performance—may not be as robust as we initially thought. . . .

Sure, that’s one way of putting it. But “not as robust as we initially thought” could also be rephrased as “The statistical evidence is not as strong as we claimed” or “The data are consistent with no decline in performance” or, even more bluntly, “The effect we claimed to find may not actually exist.”

Or, perhaps this:

We surveyed ordinary people who thought that there would be diminishing returns of top talent on team performance. We, however, boldly proclaimed that, at some point, increasing the level of top talent would decrease team performance. Then we and others looked carefully at the data, and we found that the data are more consistent with ordinary people’s common-sense intuition than with our bold, counterintuitive hypothesis. We made a risky hypothesis and it turned out not to be supported by the data.

That’s how things go when you make a risky hypothesis—you often get it wrong. That’s why they call it a risk!

How is Brexit different than Texit, Quexit, or Scotxit?

Here’s a news item:

Emboldened by Brexit, U.S. secessionists in Texas are keen to adopt the campaign tactics used to sway the British vote for leaving the European Union and are demanding “Texit” comes next. . . . “The Texas Nationalist Movement is formally calling on the Texas governor to support a similar vote for Texans,” the group said on Friday. . . . The group, which claims about a quarter million supporters, failed earlier this year to place a vote on secession on the November ballot but aims to relaunch its campaign for the next election cycle in 2018, buoyed by the British vote . . .

And, of course, Quebec and Scotland have been talking for awhile about leaving Canada and the United Kingdom, respectively.

There’s a big difference between Brexit, on one hand, and Texit or Quexit or Scotxit on the other, and this has to do with the democratic structure, or lack thereof, of the larger political unit.

Suppose the conservative voters of Texas decide that they don’t want to be part of the liberal-dominated U.S. government. They’re sick of Obamacare, environmental regulations, the $15 minimum wage, unisex bathrooms, and a foreign policy that sends U.S. troops all over the world on ill-defined missions. Fine. But then they have to realize that, by leaving the country, they’ll make the rest of the U.S. more liberal. Texiters are gaining freedom to operate within Texas but losing influence within the larger United States.

Similarly, Quexit would leave the rest of Canada without Quebec’s representation. So, if Quebec were to go on its own in one direction, one would expect the rest of Canada to drift slowly the other way, in a sort of equal-and-opposite, conservation-of-momentum way.

And Scotxit would give that northern country self-government but at the cost of their influence within Great Britain. After the past few elections, Scots might feel this is a tradeoff worth making—especially if you throw E.U. membership into the bargain—but it clearly is a tradeoff.

Brexit is different because the E.U. is not democratic (except for the powerless European parliament). Britain is not the Texas of Europe, so the analogy is not perfect, but the point is that the individual British voter has little to no influence in Brussels. Or, to put it another way, this influence is so indirect that it is hard to see. It’s not like Texas’s 38 electoral votes, 36 members of the House of Representatives, and two Senators. So, even if British voters are more conservative (in some sense) than the average European, it’s not clear that the departure of the U.K. will allow the rest of the E.U. to shift to the left—not in the same way that Texas would shift the rest of the U.S. to the left, or that Scotxit would shift the rest of Britain to the right.

In a simple parliamentary or majority vote system, there’s a rough balance of influence, and if you take some voters away from one side, it will increase the relative power of the other. Thus, Texit or Scotxit is, to first order, zero-sum with regard to political power. (Not zero-sum with regard to ultimate outcomes—that depends on all sorts of things that might happen—but zero-sum in that you’re getting local power but giving up the corresponding number of votes at the national level.) Brexit, not so much: British voters are gaining power within their country and it’s not so clear what power they’re losing within the E.U. This is not to say that Brexit is a good idea—what do I know about that?—but just that the political calculation is different because of the non-democratic nature of the larger structure.

In that way, the appropriate category for Brexit is not Texit or Quexit or Scotxit, but the decision to join or withdraw from a treaty agreement. This point is obvious—“Brexit” is, after all, a recommendation to withdraw from a treaty—but I feel like this point has been missed in much of the discussion of the topic.

P.S. I don’t know enough about Quebec and Scotland to comment on their cases, but when we discuss Texit moving U.S. politics to the left, this is not completely speculative. Last time Texas exited the United States, the national government enacted various left-wing ideas including the Homestead Act, the Morrill Act (land-grant colleges), and of course emancipation of the slaves.

On deck this week

Mon: How is Brexit different than Texit, Quexit, or Scotxit?

Tues: Should this paper in Psychological Science be retracted? The data do not conclusively demonstrate the claim, nor do they provide strong evidence in favor. The data are, however, consistent with the claim (as well as being consistent with no effect)

Wed: Individual and aggregate patterns in the Equality of Opportunity research project

Thurs: Why experimental economics studies might well do better than psychology when it comes to replication

Fri: Informative priors for treatment effects

Sat: Too good to be true: when overwhelming mathematics fails to convince

Sun: Data science as the application of theoretical knowledge

When are people gonna realize their studies are dead on arrival?

[Stock illustration: “dog bites man”]

A comment at Thomas Lumley’s blog pointed me to this discussion by Terry Burnham with an interesting story of some flashy psychology research that failed to replicate.

Here’s Burnham:

[In his popular book, psychologist Daniel] Kahneman discussed an intriguing finding that people score higher on a test if the questions are hard to read. The particular test used in the study is the CRT or cognitive reflection task invented by Shane Frederick of Yale. The CRT itself is interesting, but what Professor Kahneman wrote was amazing to me [Burnham],

90% of the students who saw the CRT in normal font made at least one mistake in the test, but the proportion dropped to 35% when the font was barely legible. You read this correctly: performance was better with the bad font.

I [Burnham] thought this was so cool. The idea is simple, powerful, and easy to grasp. An oyster makes a pearl by reacting to the irritation of a grain of sand. Body builders become huge by lifting more weight. Can we kick our brains into a higher gear, by making the problem harder?

This is a great start (except for the odd bit about referring to Kahneman as “Professor”).

As in many of these psychology studies, the direct subject of the research is somewhat important, the implications are huge, and the general idea is at first counterintuitive but then completely plausible.

In retrospect, the claimed effect size is ridiculously large, but (a) we don’t usually focus on effect size, and (b) a huge effect size often seems to be taken as a sort of indirect evidence: Sure, the true effect can’t be that large, but how could there be so much smoke if there were no fire at all?

Burnham continues with a quote from notorious social science hype machine Malcolm Gladwell, but I’ll skip that in order to spare the delicate sensibilities of our readership here.

Let’s now rejoin Burnham. Again, it’s a wonderful story:

As I [Burnham] read Professor Kahneman’s description, I looked at the clock and realized I was teaching a class in about an hour, and the class topic for the day was related to this study. I immediately created two versions of the CRT and had my students take the test – half with an easy to read presentation and half with a hard to read version.

[Burnham’s easy-to-read and hard-to-read versions of the CRT]

Within 3 hours of reading about the idea in Professor Kahneman’s book, I had my own data in the form of the scores from 20 students. Unlike the study described by Professor Kahneman, however, my students did not perform any better statistically with the hard-to-read version. I emailed Shane Frederick at Yale with my story and data, and he responded that he was doing further research on the topic.

This is pretty clean, and it’s a story we’ve heard before (hence the image at the top of this post). Non-preregistered study #1 reports a statistically significant difference in a statistically uncontrolled setting; attempted replication #2 finds no effect. In this case the replication was only N = 20, so even with the time-reversal heuristic, we still might tend to trust the original published claim.

The story continues:

Roughly 3 years later, Andrew Meyer, Shane Frederick, and 8 other authors (including me [Burnham]) have published a paper that argues the hard-to-read presentation does not lead to higher performance.

The original paper reached its conclusions based on the test scores of 40 people. In our paper, we analyze a total of over 7,000 people by looking at the original study and 16 additional studies. Our summary:

Easy-to-read average score: 1.43/3 (17 studies, 3,657 people)
Hard-to-read average score: 1.42/3 (17 studies, 3,710 people)

Malcolm Gladwell wrote, “Do you know the easiest way to raise people’s scores on the test? Make it just a little bit harder.”

The data suggest that Malcolm Gladwell’s statement is false. Here is the key figure from our paper with my [Burnham’s] annotations in red:

[Key figure from Meyer, Frederick, et al., with Burnham’s annotations]

What happened?

After the plane crashes, we go to the black box to see the decision errors that led to the catastrophe.

So what happened with that original study? Here’s the description:

Main Study

Forty-one Princeton University undergraduates at the student campus center volunteered to complete a questionnaire that contained six syllogistic reasoning problems. The experimenter approached participants individually or in small groups but ensured that they completed the questionnaire without the help of other participants. The syllogisms were selected on the basis of accuracy base rates established in prior research (Johnson-Laird & Bara, 1984; Zielinski, Goodwin, & Halford, 2006). Two were easy (answered correctly by 85% of respondents), two were moderately difficult (50% correct response rate), and two were very difficult (20% correct response rate). The easy and very difficult items were omitted from further analyses because the ceiling and floor effects obscured the effects of fluency on processing depth. Shallow heuristic processing enabled participants to answer the easy items correctly, whereas systematic reasoning was insufficient to guarantee accuracy on the difficult questions. Participants were randomly assigned to read the questionnaire printed in either an easy-to-read (fluent) or a difficult-to-read (disfluent) font, the same fonts that were used in Experiment 1.

Finally, participants indicated how happy or sad they felt on a 7-point scale (1 = very sad; 4 = neither happy nor sad; 7 = very happy). This is a standard method for measuring transient mood states (e.g., Forgas, 1995).

Results and Discussion

As expected, participants in the disfluent condition answered a greater proportion of the questions correctly (M = 64%) than did participants in the fluent condition (M = 43%), t(39) = 2.01, p < .05, η2 = .09. This fluency manipulation had no impact on participants' reported mood state (Mfluent = 4.50 vs. Mdisfluent = 4.29), t < 1, η2 < .01; mood was not correlated with performance, r(39) = .18, p = .25; and including participants' mood as a covariate did not diminish the impact of fluency on performance, t(39) = 2.15, p < .05, η2 = .11. The performance boost associated with disfluent processing is therefore unlikely to be explained by differences in incidental mood states.

Let’s tick off the boxes:

– Small sample size and variable measurements ensure that any statistically significant difference will be huge, thus providing Kahneman- and Gladwell-bait.

– Data processing choices were made after the data were seen. In this case, two-thirds of the data were discarded because they did not fit the story. Sure, they have an explanation based on ceiling and floor effects—but what if they had found something? They would easily have been able to explain it in the context of their theory.

– Another variable (mood scale) was available. If the difference had shown up as statistically significant only after controlling for mood scale, or if there had been a statistically significant difference on mood scale alone, any of these things could’ve been reported as successful demonstrations of the theory.

[Image: young eastern grey kangaroo]

What is my message here? Is it that researchers should be required to preregister their hypotheses? No. I can’t in good conscience make that recommendation given that I almost never preregister my own analyses.

Rather, my message is that this noisy, N = 41, between-person study never had a chance. The researchers presumably thought they were doing solid science, but actually they’re trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

To put it another way, those researchers might well have thought that at best they were doing solid science and at worst they were buying a lottery ticket, in that, even if their study was speculative and noisy, it was still giving them a shot at a discovery.

But, no, they weren’t even buying a lottery ticket. When you do this sort of noisy uncontrolled study and you “win” (that is, find a statistically significant comparison), you actually are very likely to be losing (high type M error, high type S error rate).
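
Here’s what I mean, in the form of a small R calculation in the spirit of type M / type S (retrodesign) computations. The “true effect” and standard error below are made-up numbers chosen to mimic a small, noisy between-person study, not estimates from the Alter et al. data:

# Type S (wrong sign) and type M (exaggeration) rates among "significant" results,
# for an assumed true effect and standard error. Assumes true_effect > 0.
retrodesign <- function(true_effect, se, alpha = 0.05, n_sims = 1e5) {
  z_crit <- qnorm(1 - alpha / 2)
  power  <- pnorm(true_effect / se - z_crit) + pnorm(-true_effect / se - z_crit)
  est    <- rnorm(n_sims, true_effect, se)        # simulated replications
  signif <- abs(est) > z_crit * se
  type_s <- mean(est[signif] < 0)                 # share of significant results with the wrong sign
  type_m <- mean(abs(est[signif])) / true_effect  # average exaggeration factor
  c(power = power, type_S = type_s, type_M = type_m)
}

retrodesign(true_effect = 2, se = 8)   # a small true difference measured very noisily

With numbers like these, the rare “significant” result overstates the true effect several-fold and has a non-trivial chance of pointing in the wrong direction, which is what I mean by the study being dead on arrival.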

That’s what’s so sad about all this. Not that the original researchers failed—all of us fail all the time—but that they never really had a chance.

On the plus side, our understanding of statistics has increased so much in the past several years—no joke—that now we realize this problem, while in the past even a leading psychologist such as Kahneman and a leading journalist such as Gladwell were unaware of the problem.

P.S. It seems that I got some of the details wrong here.

Andrew Meyer supplies the correction:

That’s actually the black box from the wrong flight. Alter et al.’s paper included four tests of the underlying conjecture. All four experiments shared the same basic features: a noisy dependent variable, a subtle manipulation, a small sample size, and an impressive result. The data Gelman describes are from Alter et al.’s fourth study, whose dependent variable was the number of correct judgments of syllogistic validity. Meyer & Frederick et al. attempted to replicate Alter’s first study, whose dependent variable was the number correct on a 3-item math test (Frederick’s CRT). A post-hoc examination of Alter et al.’s first study reveals some subtler issues. Quoting footnote 2 from Meyer & Frederick et al.:

It is often implied and sometimes claimed that disfluent fonts improve performance on the bat-and-ball problem. In fact, there was no such effect even in the original study, as shown in row 1 of Table 1. The entire effect reported in Alter et al. (2007) was driven by just one of the three CRT items: “widgets,” which was answered correctly by 16 of 20 participants in the disfluent font condition, but only 4 of 20 participants in the control condition. The 20% solution rate in the control condition is poorer than every other population except Georgia Southern. It is also significantly below the 50% solution rate observed in a sample of 300 Princeton students (data available from Shane Frederick upon request). This implicates sampling variation as the reason for the original result. If participants in the control condition had solved the widgets item at the same rate as Princeton students in other samples, the original experiment would have had a p value of 0.36, and none of the studies in Table 1 would exist.

Alter et al.’s fourth study actually has been successfully replicated once (Rotello & Heit, 2009). But it has failed to replicate on at least four other occasions (Exell & Stupple, 2011; Morsanyi & Handley, 2012; Thompson et al., 2013; Trippas, Handley & Verde, 2014).

We should also note that, since Meyer & Frederick et al. came out, we have become aware of one additional “successful” test of font fluency effects on CRT scores: an experiment in a classroom setting at Dartmouth. So the current count, including the 17 CRT studies mentioned in Meyer & Frederick et al. and the 2 that were not mentioned (the small one by Burnham and the moderately substantial one at Dartmouth), comes to 2 observations of a statistically significant disfluent-font benefit on CRT scores and 17 failures to observe that effect.

Euro 2016 update

Big news out of Europe, everyone’s talking about soccer.

Leo Egidi updated his model and now has predictions for the Round of 16:

[Image: Leo Egidi’s predicted outcomes for the Round of 16]

Here’s Leo’s report, and here’s his zipfile with data and Stan code.

The report contains some ugly histograms showing the predictive distributions of goals to be scored in each game. The R histogram function FAILS with discrete data because it puts the bin boundaries at 0, 1, 2, etc. Or, in this case, 0, .5, 1, 1.5, etc., which is even worse because now the y-axis is hard to interpret as the frequencies all got multiplied by 2. When data are integers, you want the boundaries at -.5, .5, 1.5, 2.5, etc. Or use barplot(). Really, though, you want scatterplots because the teams are playing against each other. You’ll want heatmaps, actually: scatterplots don’t work so well with discrete data.
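For what it’s worth, here is a minimal sketch of the fix in R, using simulated goal counts in place of Leo’s actual predictive draws (the Poisson rate is made up):

# Simulated goal counts standing in for a predictive distribution
set.seed(123)
goals <- rpois(4000, lambda = 1.4)   # made-up rate, not Leo's model

# Default hist() picks "pretty" boundaries (0, 1, 2, ... or 0, .5, 1, ...),
# which is misleading for integer counts
hist(goals, main = "Default breaks")

# Put the bin boundaries halfway between the integers instead
hist(goals, breaks = seq(-0.5, max(goals) + 0.5, by = 1),
     main = "Breaks at -0.5, 0.5, 1.5, ...")

# Or skip hist() entirely and tabulate
barplot(table(goals), main = "barplot(table(goals))")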

What they’re saying about “blended learning”: “Perhaps the most reasonable explanation is that no one watched the video or did the textbook reading . . .”

[Image: figure from the Stockwell et al. Cell commentary]

Someone writes in:

I was wondering if you had a chance to see the commentary by the Stockwells on blended learning strategies that was recently published in Cell and which also received quite a nice write up by Columbia. It’s also currently featured on Columbia’s webpage.

In fact, I was a student in Prof. Stockwell’s Biochemistry class last year, and a participant in this study, which was why I was so surprised that it ended up in Cell and received the attention that it did.

I was part of the textbook group, for which he assigned over 30 pages of dense textbook reading (which would probably have taken multiple hours to fully digest, and was 2-3 times more than what he’d assign for a typical class), so I’m sure the video was much more tailored to the material he covered in class and ultimately quizzed everyone on. Moreover, in his interview Stockwell claims that he’ll “use video lectures and assign them in advance” rather than relying exclusively on a textbook, which makes it all the more surprising that in their commentary they write:

We also compared the exam scores of students in the textbook versus video preparation groups but found no statistically significant difference in this relatively modest sample size, despite the trend toward higher scores in the group that received the video assignment.

Perhaps the most reasonable explanation is that no one watched the video or did the textbook reading for a class that wasn’t going to be covered on any of the exams? What’s even more confusing to me is that they admit the sample sizes of the textbook/video groups were “modest,” but are readily able to draw conclusions about which of the 4 arms provides the most effective model for learning, when each arm had half as many participants as these two larger groups! I’m not sure if that’s just confirmation bias, or if the results are truly significant. I’m also not sure whether the figure in the paper is mislabeled, since Groups 2 and 3 in panel A are different from what’s used in panel D (see above).

Do you have any thoughts on the statistical power of such a study?

I know the above seems a little bit like I have an axe to grind, but it seemed to me like the conclusions of this experiment were quite reaching, especially for such a short study with so few participants, and I was wondering what someone else with more expertise on experimental design than I have thought.

I had not heard about this study and don’t really have the time to look at it, but I’m posting it here in case any of you have any comments.

As to why Cell chose to publish it: This seems clear enough. Everybody knows that teaching is important and that it’s hard to get students to learn. We try lots of teaching strategies, but there are not many controlled trials of teaching methods, so when such a study does come along, and when it gives positive results with that magic “p less than .05,” then, yeah, I’m not surprised it gets published in a top journal.

Brexit polling: What went wrong?

Commenter numeric writes:

Since you were shilling for yougov the other day you might want to talk about their big miss on Brexit (off by 6% from their eve-of-election poll—remain up 2 on their last poll and leave up by 4 as of this posting).

Fair enough: Had Yougov done well, I could have used them as an example of the success of MRP, and of political polling more generally, so I should take the hit when they fail. It looks like Yougov was off by about 4 percentage points (or 8 percentage points if you want to measure things by vote differential). It will be interesting to see how nonuniform this difference was across demographic groups.

The difference between survey and election outcome can be broken down into five terms:

1. Survey respondents not being a representative sample of potential voters (for whatever reason, Remain voters being more reachable or more likely to respond to the poll, compared to Leave voters);

2. Survey responses being a poor measure of voting intentions (people saying Remain or Undecided even though it was likely they’d vote to leave);

3. Shift in attitudes during the last day;

4. Unpredicted patterns of voter turnout, with more voting than expected in areas and groups that were supporting Leave, and lower-than-expected turnout among Remain supporters;

5. And, of course, sampling variability. Here’s Yougov’s rolling average estimate from a couple days before the election:

[Image: YouGov rolling-average estimate from the final days before the vote]

Added in response to comments: And here’s their final result, “YouGov on the day poll: Remain 52%, Leave 48%”:

[Image: YouGov on-the-day poll: Remain 52%, Leave 48%]

We’ll take this final 52-48 poll as Yougov’s estimate.

Each one of the above five explanations seems to be reasonable to consider as part of the story. Remember, we’re not trying to determine which of 1, 2, 3, 4, or 5 is “the” explanation; rather, we’re assuming that all five of these are happening. (Indeed, some of these could be happening but in the opposite direction; for example, it’s possible that the polls oversampled Remain voters (a minus sign on item 1 above) but that this non-representativeness was more than overbalanced by a big shift in attitudes during the last day (a big plus sign on item 3).)

The other thing is that item 5, sampling variability, does not stand on its own. Given the amount of polling on this issue (even within Yougov itself, as indicated by the graph above), sampling variability is an issue to the extent that items 1-4 above are problems. If there were no problems with representativeness, measurement, changes in attitudes, and turnout predictions, then the total sample size of all these polls would be enough that they’d predict the election outcome almost perfectly. But given all these other sources of uncertainty and variation, you need to worry about sampling variability too, to the extent that you’re using the latest poll to estimate the latest trends.
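To see why item 5 on its own can’t carry the explanation, here is a back-of-the-envelope calculation; the pooled sample size is a made-up round number, since I have not added up the actual polls:

# If sampling variability were the only problem, pooling the campaign's polls
# would shrink the margin of error to a fraction of a point.
n_pooled <- 50000   # hypothetical total respondents across many polls
p_hat <- 0.52       # YouGov's final Remain share
se <- sqrt(p_hat * (1 - p_hat) / n_pooled)
round(100 * c(se = se, moe_95 = 1.96 * se), 2)
# A margin of error of a few tenths of a percentage point can't explain a
# 4-point miss, so the action has to be in items 1-4.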

OK, with that as background, what does Yougov say? I went to their website and found this article posted a few hours ago:

Unexpectedly high turnout in Leave areas pushed the campaign to victory

Unfortunately YouGov was four points out in its final poll last night, but we should not be surprised that the referendum was close – we have shown it close all along. In over half our polls since the start of the year we showed Brexit in the lead or tied. . . .

As we wrote in the Times newspaper three days ago: “This campaign is not a “done deal”. The way the financial and betting markets have reacted you would think Remain had already won – yesterday’s one day rally in the pound was the biggest for seven years, and the odds of Brexit on Betfair hit 5-1. But it’s hard to justify those odds using the actual data…. The evidence suggests that we are in the final stages of a genuinely close and dynamic race.”

Just to check, what did Yougov say about this all before the election? Here’s their post from the other day, which I got by following the links from my post linked above:

Our current headline estimate of the result of the referendum is that Leave will win 51 per cent of the vote. This is close enough that we cannot be very confident of the election result: the model puts a 95% chance of a result between 48 and 53, although this only captures some forms of uncertainty.

The following three paragraphs are new, in response to comments, and replace one paragraph I had before:

OK, let’s do a quick calculation. Take their final estimate that Remain will win with 52% of the vote and give it a 95% interval with width 6 percentage points (a bit wider than the 5-percentage-point width reported above, but given that big swing, presumably we should increase the uncertainty a bit). So the interval is [49%, 55%], and if we want to call this a normal distribution with mean 52% and standard deviation 1.5%, then the probability of Remain under this model would be pnorm(52, 50, 1.5) = .91, that is, 10-1 odds in favor. So, when Yougov said the other day that “it’s hard to justify those [Betfair] odds” of 5-1, it appears that they (Yougov) would’ve been happy to give 10-1 odds.

But these odds are very sensitive to the point estimate (for example, pnorm(51.5, 50, 1.5) = .84, which gives you those 5-1 odds), to the forecast uncertainty (for example, pnorm(52, 50, 2.5) = .79), and to any smoothing you might do (for example, take a moving average of the final few days and you get something not far from 50/50).
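Here are those three calculations side by side, in case you want to play with the numbers yourself:

# Probability that Remain wins (vote share above 50%) under a normal forecast,
# written in the same pnorm(estimate, 50, sd) form as above
to_odds <- function(p) p / (1 - p)

p_base  <- pnorm(52,   50, 1.5)   # point estimate 52, sd 1.5
p_shift <- pnorm(51.5, 50, 1.5)   # point estimate half a point lower
p_wide  <- pnorm(52,   50, 2.5)   # same estimate, wider uncertainty

round(rbind(prob = c(p_base, p_shift, p_wide),
            odds = to_odds(c(p_base, p_shift, p_wide))), 2)
# roughly .91, .84, and .79, matching the numbers in the text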

In short, betting odds in this setting are highly sensitive to small changes in the model, and when the betting odds stay stable (as I think they were during the final period of Brexit), this suggests they contain a large element of convention or arbitrary mutual agreement.

The “out” here seems to be that last part of Yougov’s statement from the other day: “although this only captures some forms of uncertainty.”

It’s hard to know how to think about other forms of uncertainty, and I think that one way that people handle this in practice is to present 95% intervals and treat them as something more like 50% intervals.

Think about it. If you want to take the 95% interval as a Bayesian predictive interval—and Yougov does use Bayesian inference—then you’d be concluding that the odds are 40-1 against the outcome falling below the lower endpoint of the interval, that is, 40-1 that Remain would get more than 48% of the vote. That’s pretty strong. But that would not be an appropriate conclusion to draw, not if you remember that this interval “only captures some forms of uncertainty.” So you can mentally adjust the interval, either by making it wider to account for these other sources of uncertainty, or by mentally lowering its probability coverage. I argue that in practice people do the latter: they take 95% intervals as statements of uncertainty, without really believing the 95% part.
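One way to quantify that mental adjustment: reading a nominal 95% interval as if it were a 50% interval is roughly the same as inflating the implied standard deviation by a factor of about three.

# Ratio of the half-widths of a 95% and a 50% normal interval
qnorm(0.975) / qnorm(0.75)   # 1.96 / 0.674, roughly 2.9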

OK, fine, but if that’s right, then why did the betting markets appear to take Yougov’s uncertainties literally, with those 5-1 odds? There I’m guessing the problem was . . . other polls. Yougov was saying 51% for Leave, or maybe 52% for Remain, but other polls were showing large leads for Remain. If all the polls had looked like Yougov’s, and had bettors been rational about accounting for nonsampling error, we might have seen something like 3-1 or 2-1 odds in favor of Remain, which would’ve been more reasonable (in a prospective sense, given Yougov’s pre-election polling results and our general knowledge that nonsampling error can be a big deal).

Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, and I recently wrote a paper estimating the level of nonsampling error in U.S. election polls, and here’s what we found:

It is well known among both researchers and practitioners that election polls suffer from a variety of sampling and non-sampling errors, often collectively referred to as total survey error. However, reported margins of error typically only capture sampling variability, and in particular, generally ignore errors in defining the target population (e.g., errors due to uncertainty in who will vote). Here we empirically analyze 4,221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing to the actual election outcomes, we find that average survey error as measured by root mean squared error (RMSE) is approximately 3.5%, corresponding to a 95% confidence interval of ±7%—twice the width of most reported intervals.

Got it? Take that Yougov pre-election 95% interval of [.48, .53], double its width, and you get something like [.46, .56], which more appropriately captures your uncertainty.
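In code, the widening is just this little helper; the doubling factor comes from the finding above that actual error runs about twice the nominal margin:

# Double the width of a reported interval around its midpoint to account
# for total survey error
widen <- function(lo, hi, factor = 2) {
  mid  <- (lo + hi) / 2
  half <- factor * (hi - lo) / 2
  c(lower = mid - half, upper = mid + half)
}
widen(0.48, 0.53)   # about [.46, .56], as in the text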

That all sounds just fine. But . . . why didn’t I say this before the vote? So now the question is not, “Yougov: what went wrong?” or “UK bettors: what went wrong?” but, rather, “Gelman: what went wrong?”

That’s a question I should be able to answer! I think the most accurate response is that, like everyone else, I was focusing on the point estimate rather than the uncertainty. And, to the extent I was focusing on the uncertainty I was implicitly taking reported 95% intervals and treating them like 50% intervals. And, finally, I was probably showing too much deference to the betting line.

But I didn’t put this all together and note the inconsistency between the wide uncertainty intervals from the polls (after doing the right thing and widening the intervals to account for nonsampling errors) and the betting odds. In writing about the pre-election polls, I focused on the point estimate and didn’t focus in on the anomaly.

I should get some credit for attempting to untangle these threads now, but not as much as I’d deserve if I’d written this all two days ago. Credit to Yougov, then, for publicly questioning the 5-1 betting odds, before the voting began.

OK, now back to Yougov’s retrospective:

YouGov, like most other online pollsters, has said consistently it was a closer race than many others believed and so it has proved. While the betting markets assumed that Remain would prevail, throughout the campaign our research showed significantly larger levels of Euroscepticism than many other polling organisations. . . .

Early in the campaign, an analysis of the “true” state of public opinion claimed support for Leave was somewhere between phone and online methodologies but a little closer to phone. We disputed this at the time as we were sure our online samples were getting a much more representative sample of public opinion.

Fair enough. They’re gonna take the hit for being wrong, so they might as well grab what credit they can for being less wrong than many other pollsters. Remember, there still are people out there saying that you can’t trust online polls.

And now Yougov gets to the meat of the question:

We do not hide from the fact that YouGov’s final poll miscalculated the result by four points. This seems in a large part due to turnout – something that we have said all along would be crucial to the outcome of such a finely balanced race. Our turnout model was based, in part, on whether respondents had voted at the last general election and a turnout level above that of general elections upset the model, particularly in the North.

So they go with explanation 4 above: unexpected patterns of turnout.

They frame this as a North/South divide—which I guess is what you can learn from the data—but I’m wondering if it’s more of a simple Leave/Remain divide, with Leave voters being, on balance, more enthusiastic, hence turning out to vote at a higher-than-expected rate.

Related to this is explanation 3, changes in opinion. After all, that Yougov report also says, “three of YouGov’s final six polls of the campaign showing ‘Leave’ with the edge ranging from a 4% Remain lead to an 8% Leave lead.” And if you look at the graph reproduced above, and take a simple average, you’ll see a win for Leave. So the only way to call the polls as a lead for Remain (as Yougov did, in advance of the election) was to weight the more recent polls higher, that is to account for trends in opinion. It makes sense to account for trends, but once you do that, you have to accept the possibility of additional changes after the polling is done.

And, just to be clear: Yougov’s estimates using MRP were not bad at all. But this did not stop Yougov from reporting, as a final result, that mistaken 52-48 pro-Remain poll on the eve of the vote.

To get another perspective on what went wrong with the polling, I went to the webpage of Nikos Askitas, whose work I’d “shilled” on the sister blog the other day. Askitas had used a tally based on Google search queries—a method that he reported had worked for recent referenda in Ireland and Greece—and reported just before the election a slight lead for Remain, very close to the Yougov poll, as a matter of fact. Really kind of amazing it was so close, but I don’t know what adjustments he did to the data to get there; it might well be that he was to some extent anchoring his estimates to the polls. (He did not preregister his data-processing rules before the campaign began.)

Anyway, Askitas was another pundit to get things wrong. Here’s what he wrote in the aftermath:

Two days ago, observing the rate at which the Brexit side was recovering from the murder of Jo Cox, I was writing that “as of 16:15 hrs on Tuesday afternoon the leave searches caught up by half a percentage point going from 47% to 47.5%. If the trend continues they will be at 53% by Thursday morning”. This was simply regressing the leave searches on the hours passed. When I then saw the first slowdown I thought that it might become 51% or 52%, but recovering most of the pre-murder momentum was still possible, with only one obstacle in its way: time. When the rate of recovery of the leave searches slowed down in the evening of the 22nd of June and did not move upwards in the early morning of the 23rd, I had to call the presumed trend complete: if your instrument does not pick up measurement variation then you declare the process you are observing finished. Leave was at 48%.

What explains the difference? Maybe the trend I was seeing early on was indeed still mostly there and there was simply no time for it to be recorded in search? Maybe the rain damaged the Remainers, as is widely believed? Maybe the poor turnout in Wales? Maybe our tool does not have the resolution it needs for such a close call? Or maybe, as I was saying elsewhere, “I am confident to mostly have identified the referendum relevant searches and I can see that many -but not all- of the top searches are indeed related to voting intent”.

Askitas seems to be focusing more on items 2 and 3 (measurement issues and opinion changes) and not so much on item 1 (non-representativeness of searchers) and item 4 (turnout). Again, let me emphasize that all four items interact.

Askitas also gives his take on the political outcome:

The principle of parliamentary sovereignty implies that referendum results are not legally binding and that action occurs at the discretion of the parliament alone. Consequently a leave vote is not identical with leaving. As I was writing elsewhere voting leave is hence cheap talk and hence the rational thing to do: you can air any and all grievances with the status quo and it is your vote if you have any kind of ax to grind (and most people do). Why wouldn’t you want to do so? The politicians can still sort it out afterwards. These politicians are now going to have to change their and our ways. Pro European forces in the UK, in Brussels and other European capitals must realize that scaremongering is not enough to stir people towards Europe. We saw that more than half of the Britons prefer a highly uncertain path than the certainty of staying, a sad evaluation of the European path. Pro Europeans need to paint a positive picture of staying instead of ugly pictures of leaving and most importantly they need to sculpt it in 3D reality one European citizen at a time.

P.S. I could’ve just as well titled this, “Brexit prediction markets: What went wrong?” But it seems pretty clear that the prediction markets were following the polls.

P.P.S. Full disclosure: YouGov gives some financial support to the Stan project. (I’d put this in my previous post on Yougov but I suppose the commenter is right that I should add this disclaimer to every post that mentions the pollster. But does this mean I also need to disclose our Google support every time I mention googling something? And must I disclose my consulting for Microsoft every time I mention Clippy? I think I’ll put together a single page listing outside support and then I can use a generic disclaimer for all my posts.)

P.P.P.S. Ben Lauderdale sent me a note arguing that Yougov didn’t do so bad at all:

I worked with Doug Rivers on the MRP estimates you discussed in your post today. I want to make an important point of clarification: none of the YouGov UK polling releases *except* the one you linked to a few days back used the MRP model. All the others were 1 or 2 day samples adjusted with raking and techniques like that. The MRP estimates never showed Remain ahead, although they got down to Leave 50.1 the day before the referendum (which I tweeted). The last run I did the morning of the referendum with the final overnight data had Leave at 50.6, versus a result of Leave 51.9.

Doug and I are going to post a more detailed post-mortem on the estimates when we recover from being up all night, but fundamentally they were a success: both in terms of getting close to the right result in a very close vote, and also in predicting the local authority level results very well. Whether our communications were successful is another matter, but it was a very busy week in the run up to the referendum, and we did try very hard to be clear about the ways we could be wrong in that article!

P.P.P.P.S. And Yair writes:

I like the discussion about turnout and Leave voters being more enthusiastic. My experience has been that it’s very difficult to separate turnout from support changes. I bet if you look at nearly any stable subgroup (defined by geography and/or demographics), you’ll tend to see the two moving together.

Another piece here, which I might have missed in the discussion, is differential non-response due to the Jo Cox murder. Admittedly I didn’t follow too closely, but it seems like all the news coverage in recent days was about that. Certainly plausible that this led to some level of Leave non-response, contributing to the polling trend line dipping towards Remain in recent days. I don’t think the original post mentioned fitting the MRP with Party ID (or is it called something else in the UK?), but I’m remembering the main graphs from the swing voter Xbox paper being pretty compelling on this point.

Last — even if the topline is off, in my view there’s still a lot of value in getting the subgroups right. I know I was informally looking at the YouGov map compared to the results map last night. Maybe would be good to see a scatterplot or something. I know everyone cares about the topline more than anything, but to me (and others, I hope) the subgroups are important, both for understanding the election and for understanding where the polls were off.