Nature | Comment

Policy: NIH plans to enhance reproducibility

Francis S. Collins and Lawrence A. Tabak discuss initiatives that the US National Institutes of Health is exploring to restore the self-correcting nature of preclinical research.


Illustration: Chris Ryan/Nature

A growing chorus of concern, from scientists and laypeople, contends that the complex system for ensuring the reproducibility of biomedical research is failing and is in need of restructuring [1, 2]. As leaders of the US National Institutes of Health (NIH), we share this concern and here explore some of the significant interventions that we are planning.

Science has long been regarded as 'self-correcting', given that it is founded on the replication of earlier work. Over the long term, that principle remains true. In the shorter term, however, the checks and balances that once ensured scientific fidelity have been hobbled. This has compromised the ability of today's researchers to reproduce others' findings.

Let's be clear: with rare exceptions, we have no evidence to suggest that irreproducibility is caused by scientific misconduct. In 2011, the Office of Research Integrity of the US Department of Health and Human Services pursued only 12 such cases [3]. Even if this represents only a fraction of the actual problem, fraudulent papers are vastly outnumbered by the hundreds of thousands published each year in good faith.

Instead, a complex array of other factors seems to have contributed to the lack of reproducibility. Factors include poor training of researchers in experimental design; increased emphasis on making provocative statements rather than presenting technical details; and publications that do not report basic elements of experimental design [4]. Crucial experimental design elements that are all too frequently ignored include blinding, randomization, replication, sample-size calculation and the effect of sex differences. And some scientists reputedly use a 'secret sauce' to make their experiments work — and withhold details from publication or describe them only vaguely to retain a competitive edge [5]. What hope is there that other scientists will be able to build on such work to further biomedical progress?
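
As a purely illustrative aside (not part of the authors' text): the sample-size element mentioned above can be made concrete with a short calculation. The Python sketch below, which assumes scipy is available, uses the standard normal approximation for a two-sided, two-sample comparison of means; the function name and default settings are our own assumptions.

```python
import math
from scipy.stats import norm


def samples_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided, two-sample comparison of means,
    using the normal approximation n ~ 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized effect size (Cohen's d)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = norm.ppf(power)           # quantile corresponding to the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # A 'medium' standardized effect (d = 0.5) needs roughly 63 animals per group
    # at 80% power; small exploratory cohorts cannot detect it reliably.
    for d in (0.5, 0.8, 1.2):
        print(f"d = {d}: n per group ~ {samples_per_group(d)}")
```

Exact calculations based on the non-central t distribution give slightly larger numbers, but the order of magnitude is the point.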

Exacerbating this situation are the policies and attitudes of funding agencies, academic centres and scientific publishers. Funding agencies often uncritically encourage the overvaluation of research published in high-profile journals. Some academic centres also provide incentives for publications in such journals, including promotion and tenure, and in extreme circumstances, cash rewards [6].

Then there is the problem of what is not published. There are few venues for researchers to publish negative data or papers that point out scientific flaws in previously published work. Further compounding the problem is the difficulty of accessing unpublished data — and the failure of funding agencies to establish or enforce policies that insist on data access.

Preclinical problems

Reproducibility is potentially a problem in all scientific disciplines. However, human clinical trials seem to be less at risk because they are already governed by various regulations that stipulate rigorous design and independent oversight — including randomization, blinding, power estimates, pre-registration of outcome measures in standardized, public databases such as ClinicalTrials.gov and oversight by institutional review boards and data safety monitoring boards. Furthermore, the clinical trials community has taken important steps towards adopting standard reporting elements [7].

Preclinical research, especially work that uses animal models [1], seems to be the area that is currently most susceptible to reproducibility issues. Many of these failures have simple and practical explanations: different animal strains, different lab environments or subtle changes in protocol. Some irreproducible reports are probably the result of coincidental findings that happen to reach statistical significance, coupled with publication bias. Another pitfall is overinterpretation of creative 'hypothesis-generating' experiments, which are designed to uncover new avenues of inquiry rather than to provide definitive proof for any single question. Still, there remains a troubling frequency of published reports that claim a significant result, but fail to be reproducible.
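
To make the point about chance findings and publication bias concrete, here is a minimal simulation sketch (ours, not the authors'; it assumes numpy and scipy): when studies are small and only 'significant' results reach print, a large share of the published positives can be spurious. The study sizes, effect size and prevalence of true effects below are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)


def spurious_share(n_studies=5000, n_per_group=8, true_effect=0.8,
                   prop_true=0.1, alpha=0.05):
    """Simulate a literature in which only p < alpha results are published and
    return the fraction of published 'positives' with no underlying effect."""
    published_true = published_false = 0
    for _ in range(n_studies):
        real = rng.random() < prop_true                 # does a genuine effect exist?
        shift = true_effect if real else 0.0
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(shift, 1.0, n_per_group)
        if ttest_ind(treated, control).pvalue < alpha:  # only 'hits' get published
            if real:
                published_true += 1
            else:
                published_false += 1
    return published_false / (published_true + published_false)


if __name__ == "__main__":
    print(f"share of published positive findings that are spurious: {spurious_share():.0%}")
```

With these particular assumptions, over half of the published positives are spurious, even though every individual test was run honestly at the conventional 5% threshold.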

Proposed NIH actions

As a funding agency, the NIH is deeply concerned about this problem. Because poor training is probably responsible for at least some of the challenges, the NIH is developing a training module on enhancing reproducibility and transparency of research findings, with an emphasis on good experimental design. This will be incorporated into the mandatory training on responsible conduct of research for NIH intramural postdoctoral fellows later this year. Informed by this pilot, final materials will be posted on the NIH website by the end of this year for broad dissemination, adoption or adaptation, on the basis of local institutional needs.

“Efforts by the NIH alone will not be sufficient to effect real change in this unhealthy environment.”

Several of the NIH's institutes and centres are also testing the use of a checklist to ensure a more systematic evaluation of grant applications. Reviewers are reminded to check, for example, that appropriate experimental design features have been addressed, such as an analytical plan, plans for randomization, blinding and so on. A pilot was launched last year that we plan to complete by the end of this year to assess the value of assigning at least one reviewer on each panel the specific task of evaluating the 'scientific premise' of the application: the key publications on which the application is based (which may or may not come from the applicant's own research efforts). This question will be particularly important when a potentially costly human clinical trial is proposed, based on animal-model results. If the antecedent work is questionable and the trial is particularly important, key preclinical studies may first need to be validated independently.
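
As a hypothetical illustration only, the checklist idea can be thought of as a simple structured review aid. In the Python sketch below, the item names, class and function are our assumptions, not the NIH's actual review criteria or tooling.

```python
from dataclasses import dataclass, field

# Illustrative rigour items a reviewer might be asked to confirm (assumed, not official).
REQUIRED_ELEMENTS = (
    "scientific premise assessed",
    "analytical plan described",
    "randomization described",
    "blinding described",
    "sample-size justification provided",
)


@dataclass
class Application:
    title: str
    elements_addressed: set[str] = field(default_factory=set)


def missing_elements(app: Application) -> list[str]:
    """Return checklist items that the application has not addressed."""
    return [item for item in REQUIRED_ELEMENTS if item not in app.elements_addressed]


if __name__ == "__main__":
    app = Application(
        title="Candidate compound in a mouse model (hypothetical)",
        elements_addressed={"analytical plan described", "blinding described"},
    )
    for item in missing_elements(app):
        print("reviewer flag:", item)
```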

Informed by feedback from these pilots, the NIH leadership will decide by the fourth quarter of this year which approaches to adopt agency-wide, which should remain specific to institutes and centres, and which to abandon.

The NIH is also exploring ways to provide greater transparency of the data that underlie published manuscripts. As part of our Big Data initiative, the NIH has requested applications to develop a Data Discovery Index (DDI) to allow investigators to locate and access unpublished, primary data (see go.nature.com/rjjfoj). Should an investigator use these data in new work, the owner of the data set could be cited, thereby creating a new metric of scientific contribution unrelated to journal publication, such as citation or download counts for the primary data set. If sufficiently meritorious applications to develop the DDI are received, a funding award of up to three years in duration will be made by September 2014. Finally, in mid-December, the NIH launched an online forum called PubMed Commons (see go.nature.com/8m4pfp) for open discourse about published articles. Authors can join to rate articles or contribute comments, and the system will be evaluated and refined over the coming months. More than 2,000 authors have joined to date, contributing more than 700 comments.
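
Purely as a sketch of the data-citation idea (the record fields, class names and in-memory index below are our assumptions, not the DDI's actual design), crediting reuse of a primary data set might be modelled like this:

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    identifier: str                      # persistent ID assigned at deposit (hypothetical)
    owner: str                           # investigator credited when the data are reused
    description: str
    downloads: int = 0
    citing_works: list[str] = field(default_factory=list)


class DataDiscoveryIndex:
    """Toy in-memory stand-in for the index described above."""

    def __init__(self) -> None:
        self._records: dict[str, DatasetRecord] = {}

    def deposit(self, record: DatasetRecord) -> None:
        self._records[record.identifier] = record

    def cite(self, identifier: str, citing_work: str) -> None:
        """Record that a new piece of work reused the data set."""
        self._records[identifier].citing_works.append(citing_work)

    def reuse_metric(self, identifier: str) -> int:
        """A contribution metric independent of journal publication."""
        rec = self._records[identifier]
        return rec.downloads + len(rec.citing_works)


if __name__ == "__main__":
    ddi = DataDiscoveryIndex()
    ddi.deposit(DatasetRecord("ds-0001", "Lab A", "unpublished primary imaging data"))
    ddi.cite("ds-0001", "doi:10.1000/example")
    print("reuse metric for ds-0001:", ddi.reuse_metric("ds-0001"))
```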

Community responsibility

Clearly, reproducibility is not a problem that the NIH can tackle alone. Consequently, we are reaching out broadly to the research community, scientific publishers, universities, industry, professional organizations, patient-advocacy groups and other stakeholders to take the steps necessary to reset the self-corrective process of scientific inquiry. Journals should be encouraged to devote more space to research conducted in an exemplary manner that reports negative findings, and should make room for papers that correct earlier work.

We are pleased to see that some of the leading journals have begun to change their review practices. For example, Nature Publishing Group, the publisher of this journal, announced the following in May 2013 [8]: restrictions on the length of methods sections have been abolished to ensure the reporting of key methodological details; authors complete a checklist to help editors and reviewers verify that critical experimental design features have been incorporated into the report; and editors scrutinize the statistical treatment of reported studies more thoroughly, with the help of statisticians. Furthermore, authors are encouraged to provide more raw data to accompany their papers online.

Similar requirements have been implemented by the journals of the American Association for the Advancement of Science — Science Translational Medicine in 2013 and Science earlier this month [9] — based in part on the efforts of the NIH's National Institute of Neurological Disorders and Stroke to increase the transparency of how work is conducted [10].

Perhaps the most vexed issue is the academic incentive system. It currently over-emphasizes publishing in high-profile journals. No doubt worsened by current budgetary woes, this encourages rapid submission of research findings to the detriment of careful replication. To address this, the NIH is contemplating modifying the format of its 'biographical sketch' form, which grant applicants are required to complete, to emphasize the significance of advances resulting from work in which the applicant participated, and to delineate the part played by the applicant. Other organizations, such as the Howard Hughes Medical Institute, have used this format and found it more revealing of actual contributions to science than the traditional list of unannotated publications. The NIH is also considering providing greater stability for investigators at certain discrete career stages, using grant mechanisms that allow more flexibility and a longer period of support than the current average of approximately four years per project.

In addition, the NIH is examining ways to anonymize the peer-review process to reduce the effect of unconscious bias (see go.nature.com/g5xr3c). Currently, the identities and accomplishments of everyone participating in a grant application are known to the reviewers. The committee examining this issue will report its recommendations within 18 months.

Efforts by the NIH alone will not be sufficient to effect real change in this unhealthy environment. University promotion and tenure committees must resist the temptation to use arbitrary surrogates, such as the number of publications in journals with high impact factors, when evaluating an investigator's scientific contributions and future potential.

The recent evidence showing the irreproducibility of significant numbers of biomedical-research publications demands immediate and substantive action. The NIH is firmly committed to making systematic changes that should reduce the frequency and severity of this problem — but success will come only with the full engagement of the entire biomedical-research enterprise.

Nature 505, 612–613 (2014)
doi:10.1038/505612a

References

  1. Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712–713 (2011).

  2. The Economist 'Trouble at the Lab' (19 October 2013); available at http://go.nature.com/dstij3

  3. US Department of Health and Human Services Office of Research Integrity Annual Report 2011 (US HHS, 2011); available at http://go.nature.com/t7ykcv

  4. Carp, J. NeuroImage 63, 289–300 (2012).

  5. Vasilevsky, N. A. et al. PeerJ 1, e148 (2013).

  6. Franzoni, C., Scellato, G. & Stephan, P. Science 333, 702–703 (2011).

  7. Moher, D., Jones, A. & Lepage, L. for the CONSORT Group. J. Am. Med. Assoc. 285, 1992–1995 (2001).

  8. Nature 496, 398 (2013).

  9. McNutt, M. Science 343, 229 (2014).

  10. Landis, S. C. et al. Nature 490, 187–191 (2012).

Author information

Affiliations

  1. Francis S. Collins is director and Lawrence A. Tabak is principal deputy director of the US National Institutes of Health, Bethesda, Maryland, USA.



Comments

15 comments

  1. Andrea OBrien
    As a former researcher, I believe that this is an important development and am involved in launching a new interdisciplinary journal entirely dedicated to method development, called MethodsX (http://www.journals.elsevier.com/methodsx/), which should address some of these issues.
  2. Sepehr Ehsani
    It may be worth mentioning that we should not expect biomedical results to be entirely reproducible at present. Such an expectation may stem from a tendency to view biology with the same lens as physics and chemistry, which deal with questions at a more elementary scale and are, therefore, able to gain deeper insights on the subject at hand. Because biology, on the other hand, deals with cells that incorporate innumerable physical and chemical systems, the level of understanding cannot, by definition, reach that of physics or chemistry. Hence even the current levels of reproducibility in biomedical research can in essence be surprising. Unless we continue to raise and address more fundamental questions about the workings of the cell, a plethora of unknown variables can lead two seemingly identical experiments to yield different results. Sepehr Ehsani (Whitehead Institute / MIT CSAIL, Cambridge, MA)
  3. Discussant T
    This is great progress! These standards need to be applied to mental health/psychotherapy research, as well. Consumers are currently spending tremendous amounts of time and money, and risking adverse effects, on psychotherapies that are not supported by well-designed research or carefully evaluated data, and may be costly, useless, or harmful.
  4. Margit Burmeister
    Most cases are not outright fraud, but not lack of training either. That's why the article, and NIGMS's RFI on how to improve training, is not going to fix anything. Every PI asks students to show the BEST blot and publishes it as typical - and most PIs know that this is not how it should be, but it is the standard of the field. The biggest issue and problem is: what is rewarded. Publish or perish means: if you don't publish, you don't get tenure. But if you publish something that is wrong, and then later publish a different result, you have two papers. The problem is that there are no negative repercussions from publishing something wrong. When I tell people in the lab to do a replication, I am doing them a disservice - it would be better for their career to publish without replication, and then a replication or non-replication as a separate paper. Now, if the replication doesn't confirm the first finding, they have no paper - and that is harmful. An NIH grant is criticized as "mediocre productivity" if there are only 2 papers per year. It is common for senior faculty to have co-authored 500+ publications - they are not checking all of these papers thoroughly. It used to be, in the old days, that if you published something wrong, your career was done. Now, if you publish something wrong, you get a second paper to correct it. What has to change to increase reproducibility is to drastically (5-10 fold) lower the expected number of papers per PhD thesis, per grant and per PI before tenure, and to dramatically increase the negative repercussions from publishing irreproducible or false data, or incomplete methods. The current view is that if something can't be reproduced, it's usually the problem of the lab trying to reproduce it, and if it is "winner's curse", it's just the luck of the draw. A comment section for each paper on PubMed would be a good idea, where people could note what they could or couldn't reproduce and what was wrong or right in the paper, and it would help tenure and careers if one could add these comments.
  5. Dmytro Demydenko
    We might need an International Code of Science Conduct (name can be different).
  6. Abhay Sharma
    Collins and Tabak (Nature, 27 Jan. 2014) highlight that there is no evidence to suggest that irreproducibility is caused by scientific misconduct. But what counts as scientific misconduct can be hard to define. Take, for example, biased reporting. Over-selection and over-reporting of false-positive results are plaguing the published literature at an alarming rate (Nature 485, 149; 2012). In current practice, such reporting is considered an honest error not amounting to misconduct (Nature 485, 137; 2012). However, since intention is the core of misconduct, one may very well argue that reporting results with a systematic positive bias should also be placed under the ambit of misconduct. The scientific community and policy-makers need to consider this tough option in the overall interest of science.
  7. Helene Hill
    In my comment on the Nature article 'Research ethics: 3 ways to blow the whistle' last November, I pointed out that there were 22 attempts to replicate results published in 2 articles in the journal Radiation Research. I brought this to the attention of the editor, who brushed me off. I think that most journals are afraid to retract, especially when it comes to reproducibility problems. It should be up to the funding agencies, the institutions and the journals themselves to make it clear that failure to reproduce must be reported. The obligation to do so should be part of the oath signed by Principal Investigators when submitting their reports and grant applications. Authors submitting papers should acknowledge their obligation to do so if it should occur. And journals should recognize their obligation to purify the literature. There is more information about this on my website, www.helenezhill.com; see especially the expert witness report by Michael Robbins under the Qui tam tab -- the experts.
  8. C. A.
    I commend the NIH leadership for finally starting to address this very deep crisis of confidence in science. I think, however, that Drs. Collins and Tabak are being naive (or overly optimistic) to think that this is a matter of training. True, many researchers, both PhD and MD, lack basic knowledge of statistics - this could certainly be improved. But scientists who “massage” their data to coax out a “publishable” result (i.e., most scientists) are very well aware of what they’re doing. And they have no choice but to behave that way, or they’ll be unable to sustain (or launch) a research career, for the very reasons that Drs. Collins and Tabak nicely articulate in the article. As I read through this, I was hoping for an announcement that the NIH will institute (or look into instituting) mandatory pre-registration of all research studies. This, in my opinion, is the most crucial piece in reforming science. But alas, this doesn’t seem to be in the cards. Such pre-registration will put an end to exploratory studies masquerading as hypothesis-driven, confirmatory studies. There’s nothing wrong with exploratory research, as long as the reader recognizes it as such. Having studies pre-registered would bring much-needed transparency to science, as it has done (albeit not perfectly) for clinical trials.
  9. Andrew Ekstrom
    As a statistician and chemist, I took classes called Design of Experiments 1 and Design of Experiments 2. All the issues raised in this editorial are the reasons why scientists NEED to take more stats classes, especially courses in Design of Experiments and Data Analysis. When I read an article, no matter who published it, I look at the experimental design and analysis sections first. If the authors didn't do a good job of designing or properly analyzing their work, I skip the paper. I figure if they put garbage into the analysis or do a poor job of analysis, you can't get good results. On occasion, a garbage analysis will yield the correct results. That does not mean the author used good technique. Last week, I gave a presentation to the Detroit ACS group. The first 5 slides were all about how the methods most chemists use for experiments are improper. I told them that t-tests are only good for a single comparison and not very powerful. If they ever see an article that uses more than one t-test, the author will have issues with family-wise error rates. This means their conclusions are not as good as they seem. The same thing occurs when you use a simple ANOVA. I went on to show them how a 12-sample Plackett-Burman design is far more powerful for testing 6 different factors than a t-test ever could be. We also discussed how Design of Experiments methods - Plackett-Burman, fractional factorial, central composite, optimal response-surface designs, etc. - change several testing levels at the same time!!!! And by changing multiple things simultaneously, you get smaller, better designs that give you more robust results. The big idea of the talk was, "Everything you learned about designing an experiment is wrong. Here is what you should do." All of my references at the end of the presentation were statistics textbooks written by statisticians. All of the authors of the textbooks were either chemists at some point in their careers or worked with chemists often. Even with all the overwhelming evidence that t-tests were bad and these methods were made for chemists, often by chemists, they still felt as though these methods would not work for THEIR research. The cure for "bad science" is simple: take 2 classes in Design of Experiments and 1 class in regression analysis from a statistics or industrial engineering department. A mere 3 classes can change the world!
  10. Mark Livingstone
  11. Peter Gaskin
    Readers should not forget that many preclinical studies which support the Clinical Trial Applications and Marketing Authorisations for pharmaceuticals and biopharmaceuticals are designed rigorously and are conducted under Good Laboratory Practice (GLP) guidelines. These independently audited studies, designed to international guidelines, are tightly controlled, and GLP requires a high level of documentation (study plans, lot numbers, SOPs, validated computer systems, etc.) to enable reproducibility. Not all preclinical studies are alike.
  12. Mr. Gunn
  13. Tomi Mattila
    Not being involved in research (med student), I don't know how far-reaching the consequences of this are. Does this mostly concern bleeding-edge research on something very esoteric? Or am I in danger of filling my head with fairy tales when reading a review article on something in immunology :)? Either way, good that something is being done about it.
  14. Michael Lerman
    "Improving data" was always part of science: Mendel, before deriving his laws, removed some experiments that glaringly contradicted them. But now we are dealing with an epidemic. Michael Lerman, Ph.D., M.D.
  15. Casey Ydenberg
