47. Systems to Rate the Strength Of Scientific Evidence

Evidence Report/Technology Assessment

Number 47


Prepared for:
Agency for Healthcare Research and Quality
U.S. Department of Health and Human Services
2101 East Jefferson Street
Rockville, MD 20852

http://www.ahrq.gov/


Contract No. 290-97-0011


Prepared by:
Research Triangle Institute-University of North Carolina
Evidence-based Practice Center
Research Triangle Park, North Carolina


Suzanne West, Ph.D., M.P.H.
Valerie King, M.D., M.P.H.
Timothy S. Carey, M.D., M.P.H.
Kathleen N. Lohr, M.D.
Nikki McKoy, B.S.
Sonya F. Sutton, B.S.P.H.
Linda Lux, M.P.A.


AHRQ Publication No. 02-E016

April 2002

This document is in the public domain and may be used and reprinted without permission except those copyrighted materials noted for which further reproduction is prohibited without the specific permission of copyright holders.

Suggested Citation:

West S, King V, Carey TS, et al. Systems to Rate the Strength of Scientific Evidence. Evidence Report/Technology Assessment No. 47 (Prepared by the Research Triangle Institute-University of North Carolina Evidence-based Practice Center under Contract No. 290-97-0011). AHRQ Publication No. 02-E016. Rockville, MD: Agency for Healthcare Research and Quality. April 2002.

Preface

The Agency for Healthcare Research and Quality (AHRQ), formerly the Agency for Health Care Policy and Research (AHCPR), through its Evidence-Based Practice Centers (EPCs), sponsors the development of evidence reports and technology assessments to assist public- and private-sector organizations in their efforts to improve the quality of health care in the United States. The reports and assessments provide organizations with comprehensive, science-based information on common, costly medical conditions and new health care technologies. The EPCs systematically review the relevant scientific literature on topics assigned to them by AHRQ and conduct additional analyses when appropriate prior to developing their reports and assessments.

To bring the broadest range of experts into the development of evidence reports and health technology assessments, AHRQ encourages the EPCs to form partnerships and enter into collaborations with other medical and research organizations. The EPCs work with these partner organizations to ensure that the evidence reports and technology assessments they produce will become building blocks for health care quality improvement projects throughout the Nation. The reports undergo peer review prior to their release.

AHRQ expects that the EPC evidence reports and technology assessments will inform individual health plans, providers, and purchasers as well as the health care system as a whole by providing important information to help improve health care quality.

We welcome written comments on this evidence report. They may be sent to: Director, Center for Practice and Technology Assessment, Agency for Healthcare Research and Quality, 6010 Executive Blvd., Suite 300, Rockville, MD 20852.


John M. Eisenberg, M.D.
Director
Agency for Healthcare Research and Quality

Robert Graham, M.D.
Director, Center for Practice and Technology Assessment
Agency for Healthcare Research and Quality


The authors of this report are responsible for its content. Statements in the report should not be construed as endorsement by the Agency for Healthcare Research and Quality or the U.S. Department of Health and Human Services of a particular drug, device, test, treatment, or other clinical service.

Acknowledgments

This study was supported by Contract 290-97-0011 from the Agency for Healthcare Research and Quality (AHRQ) (Task No. 7). We acknowledge the continuing support of Jacqueline Besteman, JD, MA, the AHRQ Task Order Officer for this project.

The investigators deeply appreciate the considerable support, commitment, and contributions from Research Triangle Institute staff Sheila White and Loraine Monroe.

In addition, we would like to extend our appreciation to the members of our Technical Expert Advisory Group (TEAG), who served as vital resources throughout our process. They are Lisa Bero, PhD, Co-Director of the San Francisco Cochrane Center, University of California at San Francisco, San Francisco, Calif.; Alan Garber, MD, PhD, Professor of Economics and Medicine, Stanford University, Palo Alto, Calif.; Steven Goodman, MD, MHS, PhD, Associate Professor, School of Medicine, Department of Oncology, Division of Biostatistics, Johns Hopkins University, Baltimore, Md.; Jeremy Grimshaw, MD, PhD, Health Services Research Unit, University of Aberdeen, Scotland; Alejandro Jadad, MD, DPhil, Director of the program in eHealth innovation, University Health Network, Faculty of Medicine, University of Toronto, Toronto, Canada; Joseph Lau, MD, Director, AHRQ Evidence-based Practice Center, New England Medical Center, Boston, Mass.; David Moher, MSc, Director, Thomas C. Chalmers Center for Systematic Reviews, Children's Hospital of Eastern Ontario Research Institute, Ontario, Canada; Cynthia Mulrow, MD, MSc, Founding Director of the San Antonio Evidence-based Practice Center, San Antonio, Texas, and Associate Editor, Annals of Internal Medicine; Andrew Oxman, MD, MSc, Director, Health Services Research Unit, National Institute of Public Health, Oslo, Norway; and Paul Shekelle, MD, MPH, PhD, Director, AHRQ Evidence-based Practice Center, RAND-Southern California, Santa Monica, Calif.

We owe our thanks as well to our external peer reviewers, who provided constructive feedback and insightful suggestions for improvement of our report. Peer reviewers were Alfred O. Berg, MD, MPH, Chairman, U.S. Preventive Services Task Force, and Professor and Chair, Department of Family Medicine, University of Washington, Seattle, Wash.; Deborah Shatin, PhD, Senior Researcher, United Health Group, Minnetonka, Minn.; Edward Perrin, PhD, University of Washington, Seattle, Wash.; Marie Michnich, DrPH, American College of Cardiology, Bethesda, Md.; Steven M. Teutsch, MD, MPH, Senior Director, Outcomes Research and Management, Merck & Co., Inc., West Point, Pa.; Thomas Croghan, MD, Eli Lilly, Indianapolis, Ind.; John W. Feightner, MD, MSc, FCFP, Chairman, Canadian Task Force on Preventive Health Care and St. Joseph's Health Centre for Health Care, London, Ontario, Canada; Steve Lascher, DVM, MPH, Clinical Epidemiologist and Research Manager in Scientific Policy and Education, American College of Physicians-American Society of Internal Medicine, Philadelphia, Pa.; Stephen H. Woolf, MD, MPH, Medical College of Virginia, Richmond, Va.; and Vincenza Snow, MD, Senior Medical Associate, American College of Physicians-American Society of Internal Medicine, Philadelphia, Pa. In addition, we would like to extend our thanks to the seven anonymous reviewers designated by AHRQ.

Finally, we are indebted as well to several senior members of the faculty at the University of North Carolina at Chapel Hill: Harry Guess, MD, PhD, of the Departments of Epidemiology and Statistics, and Vice President of Epidemiology at Merck Research Laboratories, Blue Bell, Pa.; Charles Poole, MPH, ScD, of the Department of Epidemiology; David Savitz, PhD, Chair, Department of Epidemiology; and Kenneth F. Schulz, PhD, MBA, School of Medicine and Vice President of Quantitative Methods, Family Health International, Research Triangle Park, N.C.

Structured Abstract

Objectives.

Health care decisions are increasingly being made on research-based evidence, rather than on expert opinion or clinical experience alone. This report examines systematic approaches to assessing the strength of scientific evidence. Such systems allow evaluation of either individual articles or entire bodies of research on a particular subject, for use in making evidence-based health-care decisions. Identification of methods to assess health care research results is a task that Congress directed the Agency for Healthcare Research and Quality to undertake as part of the Healthcare Research and Quality Act of 1999.

Search Strategy.

The authors built on an earlier project concerning evaluating evidence for systematic reviews. They expanded this work by conducting a MEDLINE search (covering the years 1995 to mid-2000) for relevant articles published in English on either rating the quality of individual research studies or on grading a body of scientific evidence. Information from other Evidence-based Practice Centers (EPCs) and other groups involved in evidence-based medicine (such as the Cochrane Collaboration Methods Group) was used to supplement these sources.

Selection of Studies.

The initial MEDLINE search for systems for assessing study quality identified 704 articles, while the search on strength of evidence identified 679 papers. Each abstract was assessed by two reviewers to determine eligibility. An additional 219 publications were identified from other sources. The first 100 abstracts in each group were used to develop a coding system for categorizing the publications.

Data Collection and Analysis.

From the 1,602 titles and abstracts reviewed for the report, 109 were retained for further analysis. In addition, the authors examined 12 reports from various AHRQ-supported EPCs. To account for differences in study designs -- systematic reviews and meta-analyses, randomized controlled trials (RCTs), observational studies, and diagnostic studies -- the authors developed four Study Quality Grids whose columns denote evaluation domains of interest and whose rows are the individual systems, checklists, scales, or instruments. Taken together, the grids form "evidence tables" that document the characteristics (strengths and weaknesses) of these different systems.

Main Results.

The authors separately analyzed systems found in the literature and those in use by the EPCs. Four non-EPC checklists for use with systematic reviews or meta-analyses addressed at least six of the seven domains required to be considered high-performing. For analysis of RCTs, the authors concluded that eight systems represent acceptable approaches that could be used without major modifications. Six high-performing systems were identified to evaluate observational studies. Five non-EPC checklists adequately dealt with studies of diagnostic tests. For assessment of the strength of a body of evidence, seven systems fully addressed the quality, quantity, and consistency of the evidence.

Conclusions.

Overall, the authors identified 19 generic systems that fully address their key quality domains for a particular type of study. The authors also identified seven systems that address all three domains for grading the strength of a body of evidence. In addition, they recommended future research areas to bridge gaps where information or empirical documentation is needed. The authors hope that these systems will prove useful to those developing clinical practice guidelines or other health-related policy advice.

Summary

Introduction

Health care decisions are increasingly being made on research-based evidence rather than on expert opinion or clinical experience alone. Systematic reviews represent a rigorous method of compiling scientific evidence to answer questions regarding health care issues of treatment, diagnosis, or preventive services. Traditional opinion-based narrative reviews and systematic reviews differ in several ways. Systematic reviews (and evidence-based technology assessments) attempt to minimize bias by the comprehensiveness and reproducibility of the search for and selection of articles for review. They also typically assess the methodologic quality of the included studies -- i.e., how well the study was designed, conducted, and analyzed -- and evaluate the overall strength of that body of evidence. Thus, systematic reviews and technology assessments increasingly form the basis for making individual and policy-level health care decisions.

Throughout the 1990s and into the 21st century, the Agency for Healthcare Research and Quality (AHRQ) has been the foremost federal agency providing research support and policy guidance in health services research. In this role, it gives particular emphasis to quality of care, clinical practice guidelines, and evidence-based practice, for instance through its Evidence-based Practice Center (EPC) program. Through this program and a group of 12 EPCs in North America, AHRQ seeks to advance the field's understanding of how best to ensure that reviews of the clinical or related literature are scientifically and clinically robust.

The Healthcare Research and Quality Act of 1999, Part B, Title IX, Section 911(a) mandates that AHRQ, in collaboration with experts from the public and private sectors, identify methods or systems to assess health care research results, particularly "methods or systems to rate the strength of the scientific evidence underlying health care practice, recommendations in the research literature, and technology assessments." AHRQ also is directed to make such methods or systems widely available.

AHRQ commissioned the Research Triangle Institute-University of North Carolina EPC to undertake a study to produce the required report, drawing on earlier work from the RTI-UNC EPC in this area. 1 The study also advances AHRQ's mission to support research that will improve the outcomes and quality of health care through research and dissemination of research results to all interested parties in the public and private sectors both in the United States and elsewhere.

The overarching goals of this project were to describe systems to rate the strength of scientific evidence, including evaluating the quality of individual articles that make up a body of evidence on a specific scientific question in health care, and to provide some guidance as to "best practices" in this field today. Critical to this discussion is the definition of quality. "Methodologic quality" has been defined as "the extent to which all aspects of a study's design and conduct can be shown to protect against systematic bias, nonsystematic bias, and inferential error." 1, p. 472 For purposes of this study, we hold quality to be the extent to which a study's design, conduct, and analysis have minimized selection, measurement, and confounding biases, with our assessment of study quality systems reflecting this definition.

We do acknowledge that quality varies depending on the instrument used for its measurement. In a study using 25 different scales to assess the quality of 17 trials comparing low molecular weight heparin with standard heparin to prevent post-operative thrombosis, Juni and colleagues reported that studies considered to be of high quality using one scale were deemed low quality on another scale. 2 Consequently, when using study quality as an inclusion criterion for meta-analyses, summary relative risks for thrombosis depended on which scale was used to assess quality. The end result is that variable quality in efficacy or effectiveness studies may lead to conflicting results that affect analysts' or decisionmakers' confidence about findings from systematic reviews or technology assessments.
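
The dependence of a pooled estimate on the choice of quality scale can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the trial names, relative risks, standard errors, and the high-quality flags under two scales ("A" and "B") are invented, and fixed-effect inverse-variance pooling is used only as a simple stand-in. It is not claimed to be the method Juni and colleagues used.

```python
import math

# Hypothetical trials (all values invented for illustration):
# (name, relative risk, standard error of log RR,
#  rated high quality on scale A?, rated high quality on scale B?)
trials = [
    ("T1", 0.60, 0.20, True, False),
    ("T2", 0.90, 0.15, True, True),
    ("T3", 1.10, 0.25, False, True),
    ("T4", 0.70, 0.30, True, False),
]


def pooled_rr(selected):
    """Fixed-effect (inverse-variance) pooled relative risk."""
    weights = [1.0 / se ** 2 for (_, _, se, _, _) in selected]
    log_rrs = [math.log(rr) for (_, rr, _, _, _) in selected]
    pooled_log = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
    return math.exp(pooled_log)


# Use "high quality" as the inclusion criterion, once per scale.
rr_scale_a = pooled_rr([t for t in trials if t[3]])
rr_scale_b = pooled_rr([t for t in trials if t[4]])

print(f"Pooled RR, scale A inclusion: {rr_scale_a:.2f}")
print(f"Pooled RR, scale B inclusion: {rr_scale_b:.2f}")
```

With these made-up inputs, the two inclusion rules select different subsets of trials and therefore yield different summary relative risks, which is the kind of discrepancy described above.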

The remainder of this summary briefly describes the methods used to accomplish these goals and provides the results of our analysis of relevant systems and instruments identified through literature searches and other sources. We present a selected set of systems that we believe are ones that clinicians, policymakers, and researchers can use with reasonable confidence for these purposes, giving particular attention to systematic reviews, randomized controlled trials (RCTs), observational studies, and studies of diagnostic tests. Finally, we discuss the limitations of this work and of evaluating the strength of the practice evidence for systematic reviews and technology assessments and offer suggestions for future research. We do not examine issues related to clinical practice guideline development or assigning grades or ratings to formal guideline recommendations.

Methods

To identify published research related to rating the quality of studies and the overall strength of evidence, we conducted two extensive literature searches and sought further information from existing bibliographies, members of a technical expert panel, and other sources. We then developed and completed descriptive tables -- hereafter "grids" -- that enabled us to compare and characterize existing systems. These grids focus on important domains and elements that we concluded any acceptable instrument for these purposes ought to cover. These elements reflect steps in research design, conduct, or analysis that have been shown through empirical work to protect against bias or other problems in such investigations or that are long-accepted practices in epidemiology and related research fields. We assessed systems against domains and assigned scores of fully met (Yes), partially met (Partial), or not met (No).
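
As a rough picture of what such a grid looks like, the following Python sketch represents rows as rating systems and columns as domains, each scored Yes, Partial, or No. The system name, the three domains shown, and the `tally` helper are hypothetical and are not taken from the report's actual grids.

```python
from enum import Enum


class Score(Enum):
    YES = "Yes"          # domain fully met
    PARTIAL = "Partial"  # domain partially met
    NO = "No"            # domain not met


# Hypothetical grid: rows are rating systems, columns are evaluation domains.
grid = {
    "Hypothetical checklist X": {
        "Study question": Score.YES,
        "Search strategy": Score.PARTIAL,
        "Funding or sponsorship": Score.NO,
    },
}


def tally(row):
    """Count how many domains received each score for one system."""
    counts = {score: 0 for score in Score}
    for score in row.values():
        counts[score] += 1
    return counts


counts = tally(grid["Hypothetical checklist X"])
print({score.value: n for score, n in counts.items()})
```

Keeping the three score levels distinct, rather than collapsing them to a number, preserves the difference between a domain that is partly covered and one that is absent.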

Then, drawing on the results of our analysis, we identified existing quality rating scales or checklists that in our view can be used in the production of systematic evidence reviews and technology assessments and laid out the reasons for highlighting these specific instruments. An earlier version of the entire report was subjected to extensive external peer review by experts in the field and AHRQ staff, and we revised that draft as part of the steps to produce this report.

Results

Data Collection

We reviewed the titles and abstracts for a total of 1,602 publications for this project. From this set, we retained 109 sources that dealt with systems (i.e., scales, checklists, or other types of instruments or guidance documents) pertinent to rating the quality of individual systematic reviews, RCTs, observational studies, or investigations of diagnostic tests, or with systems for grading the strength of bodies of evidence. In addition, we reviewed 12 reports from various AHRQ-supported EPCs. In all, we considered 121 systems as the basis for this report.

Specifically, we assessed 20 systems relating to systematic reviews, 49 systems for RCTs, 19 for observational studies, and 18 for diagnostic test studies. For final evaluative purposes, we focused on scales and checklists. In addition, we reviewed 40 systems that addressed grading the strength of a body of evidence (34 systems identified from our searches and prior research and 6 from various EPCs). The number of systems reviewed totals more than 121 because several were reviewed for more than one grid.
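
The arithmetic behind that last statement can be checked in a few lines of Python. The counts are those given in the text; reading the surplus as "extra grid entries" assumes that each of the 121 systems appears in at least one grid, which the text implies but does not state outright.

```python
# Counts reported in the text for the four study-quality grids.
quality_grid_counts = {
    "systematic reviews": 20,
    "RCTs": 49,
    "observational studies": 19,
    "diagnostic test studies": 18,
}
strength_of_evidence_systems = 40  # 34 non-EPC + 6 EPC
distinct_systems = 121             # 109 retained sources + 12 EPC reports

# Total grid entries, counting a system once per grid in which it appears.
grid_entries = sum(quality_grid_counts.values()) + strength_of_evidence_systems

# Entries beyond one per system, i.e. repeat appearances across grids.
extra_entries = grid_entries - distinct_systems

print(grid_entries, extra_entries)
```

The sum of grid entries (146) exceeds the 121 systems considered, and the difference of 25 reflects systems that were reviewed in more than one grid.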



Box A

Systematic Reviews
  • Study question*
  • Search strategy*
  • Inclusion and exclusion criteria*
  • Interventions
  • Outcomes
  • Data extraction*
  • Study quality and validity*
  • Data synthesis and analysis*
  • Results
  • Discussion
  • Funding or sponsorship*
Randomized Clinical Trials
  • Study question
  • Study population*
  • Randomization*
  • Blinding*
  • Interventions*
  • Outcomes*
  • Statistical analysis*
  • Results
  • Discussion
  • Funding or sponsorship*
    (Key domains, italicized in the original, are marked here with an asterisk.)

Systems for Rating the Quality of Individual Articles

Important Evaluation Domains and Elements

For evaluating systems related to rating the quality of individual articles, we defined important domains and elements for four types of studies. Boxes A and B list the domains and elements used in this work, highlighting (in italics) those domains we regarded as critical for a scale or checklist to cover before we could identify a given system as likely to be acceptable for use today.

Systematic Reviews

Of the 20 systems concerned with systematic reviews or meta-analyses, we categorized one as a scale 3 and 10 as checklists. 4-14 The remainder are considered guidance documents. 15-23

To arrive at a set of high-performing scales or checklists pertaining to systematic reviews, we took account of seven key domains (see Box A): study question, search strategy, inclusion and exclusion criteria, data abstraction, study quality and validity, data synthesis and analysis, and funding or sponsorship. One checklist fully addressed all seven domains. 7 A second checklist also addressed all seven domains but merited only a "Partial" score for study question and study quality. 8 Two additional checklists 6,12 and the one scale 3 addressed six of the seven domains. These latter two checklists excluded funding; the scale omitted data abstraction and had a Partial score for search strategy.
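
The coverage pattern just described can be written down as a short Python sketch. It records only what the paragraph states (which domains each system omitted and which were scored Partial) and, following the paragraph's usage, counts a Partial score as "addressed". The dictionary layout and function names are illustrative assumptions, and Partial scores that the text does not mention are simply left unrecorded.

```python
KEY_DOMAINS = (
    "study question",
    "search strategy",
    "inclusion and exclusion criteria",
    "data abstraction",
    "study quality and validity",
    "data synthesis and analysis",
    "funding or sponsorship",
)

# For each system: (domains omitted, domains scored Partial), as stated in the text.
systems = {
    "checklist (ref. 7)": (set(), set()),
    "checklist (ref. 8)": (set(), {"study question", "study quality and validity"}),
    "checklist (ref. 6)": ({"funding or sponsorship"}, set()),
    "checklist (ref. 12)": ({"funding or sponsorship"}, set()),
    "the one scale": ({"data abstraction"}, {"search strategy"}),
}


def domains_addressed(omitted):
    """Number of key domains addressed (fully or partially)."""
    return len(KEY_DOMAINS) - len(omitted)


def summarize(entries):
    """Map each system to (domains addressed, sorted Partial domains)."""
    return {
        name: (domains_addressed(omitted), sorted(partial))
        for name, (omitted, partial) in entries.items()
    }


summary = summarize(systems)
for name, (n_addressed, partial) in summary.items():
    print(f"{name}: {n_addressed} of {len(KEY_DOMAINS)} addressed; Partial: {partial}")
```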



Box B

Observational Studies
  • Study question
  • Study population
  • Comparability of subjects*
  • Exposure or intervention*
  • Outcome measurement*
  • Statistical analysis*
  • Results
  • Discussion
  • Funding or sponsorship*
Diagnostic Test Studies
  • Study population*
  • Adequate description of test*
  • Appropriate reference standard*
  • Blinded comparison of test and reference*
  • Avoidance of verification bias*

    (Key domains, italicized in the original, are marked here with an asterisk.)

Randomized Clinical Trials

In evaluating systems concerned with RCTs, we reviewed 20 scales, 18,24-42 11 checklists, 12-14,43-50 one component evaluation, 51 and seven guidance documents. 1,11,52-57 In addition, we reviewed 10 rating systems used by AHRQ's EPCs. 58-68

We designated a set of high-performing scales or checklists pertaining to RCTs by assessing their coverage of the following seven domains (see Box A): study population, randomization, blinding, interventions, outcomes, statistical analysis, and funding or sponsorship. We concluded that eight systems for RCTs represent acceptable approaches that could be used today without major modifications. 14,18,24,26,36,38,40,45

Two systems fully addressed all seven domains 24,45 and six addressed all but the funding domain. 14,18,26,36,38,40 Two were rigorously developed, 38,40 but the significance of this factor has yet to be tested.

Of the 10 EPC rating systems, most included randomization, blinding, and statistical analysis, 58-61,63-68 and five EPCs covered study population, interventions, outcomes, and results as well. 60,61,63,65,66

Users wishing to adopt a system for rating the quality of RCTs will need to do so on the basis of the topic under study, whether a scale or checklist is desired, and apparent ease of use.

Observational Studies

Seventeen non-EPC systems concerned observational studies. Of these, we categorized four as scales 31,32,40,69 and eight as checklists. 12-14,45,47,49,50,70 We classified the remaining five as guidance documents. 1,71-74 Two EPCs used quality rating systems for evaluating observational studies; these systems were identical to those used for RCTs.

To arrive at a set of high-performing scales or checklists pertaining to observational studies, we considered the following five key domains: comparability of subjects, exposure or intervention, outcome measurement, statistical analysis, and funding or sponsorship. As before, we concluded that systems that cover these domains represent acceptable approaches for assessing the quality of observational studies.

Of the 12 scales and checklists we reviewed, all included comparability of subjects either fully or in part. Only one included funding or sponsorship and the other four domains we considered critical for observational studies. 45 Five systems fully included all four domains other than funding or sponsorship. 14,32,40,47,50

Two EPCs evaluated observational studies using a modification of their RCT quality system. 60,64 Both addressed the empirically derived domain comparability of subjects, in addition to outcomes, statistical analysis, and results.

In choosing among the six high-performing systems for assessing study quality, one will have to evaluate which system is most appropriate for the task being undertaken, how long it takes to complete each instrument, and its ease of use. We were unable to evaluate these three instrument properties in the project.

Studies of Diagnostic Tests

Of the 15 non-EPC systems we identified for assessing the quality of diagnostic studies, six are checklists. 12,14,49,75-78 Five domains are key for making judgments about the quality of diagnostic test reports: study population, adequate description of the test, appropriate reference standard, blinded comparison of test and reference, and avoidance of verification bias. Three checklists met all these criteria. 49,77,78 Two others did not address test description, but this omission is easily remedied should users wish to put these systems into practice. 12,14 The oldest system appears to be too incomplete for wide use. 75,76

With one exception, the three EPCs that evaluated the quality of diagnostic test studies included all five domains either fully or in part. 59,68,79,80 The one EPC that omitted an adequate test description probably included this information apart from its quality rating measures. 79



Box C

Quality: the aggregate of quality ratings for individual studies, predicated on the extent to which bias was minimized.
Quantity: magnitude of effect, numbers of studies, and sample size or power.
Consistency: for any given topic, the extent to which similar findings are reported using similar and different study designs.

Systems for Grading the Strength of a Body of Evidence

We reviewed 40 systems that addressed grading the strength of a body of evidence: 34 from sources other than AHRQ EPCs and 6 from the EPCs. Our evaluation criteria involved three domains -- quality, quantity, and consistency (Box C) -- that are well-established variables for characterizing how confidently we can conclude that a body of knowledge provides information on which clinicians or policymakers can act.

The 34 non-EPC systems incorporated quality, quantity, and consistency to varying degrees. Seven systems fully addressed the quality, quantity, and consistency domains. 11,81-86 Nine others incorporated the three domains at least in part. 12,14,39,70,87-91

Of the six EPC grading systems, only one incorporated quality, quantity, and consistency. 93 Four others included quality and quantity either fully or partially. 59,60,67,68 The one remaining EPC system included quantity; study quality is measured as part of its literature review process, but this domain appears not to be directly incorporated into the grading system. 66

Discussion

Identification of Systems

We identified 1,602 articles, reports, and other materials from our literature searches, web searches, referrals from our technical expert advisory group, suggestions from independent peer reviewers of an earlier version of this report, and a previous project conducted by the RTI-UNC EPC. In the end, our formal literature searches were the least productive source of systems for this report. Of the more than 120 systems we eventually reviewed that dealt with either quality of individual articles or strength of bodies of evidence, the searches per se generated a total of 30 systems that we could review, describe, and evaluate. Many articles from the searches related to study quality were essentially reports of primary studies or reviews that discussed "the quality of the data"; few addressed evaluating study quality itself.

Our literature search was most problematic for identifying systems to grade the strength of a body of evidence. Medical Subject Headings (MeSH) terms were not very sensitive for identifying such systems or instruments. We attribute this phenomenon to the lag in development of MeSH terms specific for the evidence-based medicine field.

For those involved in evidence-based practice and research, we caution that they may not find it productive simply to search for quality rating or evidence grading schemes through standard (systematic) literature searches. This is one reason that we are comfortable with identifying a set of instruments or systems that meet reasonably rigorous standards for use in rating study quality and grading bodies of evidence. Little is to be gained by directing teams seeking to produce systematic reviews or technology assessments (or indeed clinical practice guidelines) to initiate wholly new literature searches in these areas.

At the moment, we cannot provide concrete suggestions for efficient search strategies on this topic. Some advances must await expanded options for coding the peer-reviewed literature. Meanwhile, investigators wishing to build on our efforts might well consider tactics involving citation analysis and extensive contact with researchers and guideline developers to identify the rating systems they are presently using. In this regard, the efforts of at least some AHRQ-supported EPCs will be instructive.

Factors Important in Developing and Using Rating Systems

Distinctions Among Types of Studies, Evaluation Criteria, and Systems

We decided early on that comparing and contrasting study quality systems without differentiating among study types was likely to be less revealing or productive than assessing quality for systematic reviews, RCTs, observational studies, and studies of diagnostic tests independently. In the worst case, in fact, combining all such systems into a single evaluation framework risked nontrivial confusion and misleading conclusions, and we were not willing to take the chance that users of this report would conclude that "a single system" would suit all purposes. That is clearly not the case.

We defined quality based on certain critical domains, which comprised one or more elements. Some were based directly on empirical results that show that bias can arise when certain design elements are not met; we considered these factors as critical elements for the evaluation. Other domains or elements were based on best practices in the design and conduct of research studies. They are widely accepted methodologic standards, and investigators (especially for RCTs and observational studies) would probably be regarded as remiss if they did not observe them. Our evaluation of study quality systems was done, therefore, against rigorous criteria.

Finally, we contrasted systems on descriptive factors such as whether the system was a scale, checklist, or guidance document, how rigorously it was developed, whether instructions were provided for its use, and similar factors. This approach enabled us to home in on scales and checklists as the more likely methods for rating articles that might be adopted more or less as is.

Numbers of Quality Rating Systems

We identified at least three times as many scales and checklists for rating the quality of RCTs as for other types of studies. Ongoing methodological work addressing the quality of observational and diagnostic test studies will likely affect both the number and the sophistication of these systems. Thus, our findings and conclusions with respect to these latter types of studies may need to be readdressed once results from more methodological studies in these areas are available.

Challenges of Rating Observational Studies

An observational study by its very nature "observes" what happens to individuals. Thus, to prevent selection bias, the comparison groups in an observational study are supposed to be as similar as possible except for the factors under study. For investigators to derive a valid result from their observational studies, they must achieve this comparability between study groups (and, for some types of prospective studies, maintain it by minimizing differential attrition). Because of the difficulty in ensuring adequate comparability between study groups in an observational study -- either when the project is being designed or upon review after the work has been published -- we raise the question of whether nonmethodologically trained researchers can identify when potential selection bias or other biases more common with observational studies have occurred.

Instrument Length

Older systems for rating individual articles tended to be most inclusive for the quality domains we chose to assess. 24,45 However, these systems also tended to be very long and potentially cumbersome to complete. Shorter instruments have the obvious advantage of brevity, and some data suggest that they will provide sufficient information on study quality. Simply asking about three domains (randomization, blinding, and withdrawals) apparently can differentiate between higher- and lower-quality RCTs that evaluate drug efficacy. 34
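As a rough illustration of how brief such a short instrument can be, the following Python sketch records yes/no answers for the three domains named above and counts how many are met. The class name, field names, and one-point-per-domain tally are hypothetical choices made for this example; they are not the scoring rules of the cited instrument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShortRctChecklist:
    """Hypothetical three-domain checklist for a single RCT report.

    Field names and the simple tally are assumptions for illustration,
    not the scoring rules of any published instrument.
    """

    randomization_described: bool
    blinding_described: bool
    withdrawals_described: bool

    def domains_met(self) -> int:
        """Return how many of the three domains the report satisfies (0-3)."""
        return sum(
            (
                self.randomization_described,
                self.blinding_described,
                self.withdrawals_described,
            )
        )


# Example: a report that describes randomization and withdrawals
# but says nothing about blinding.
report = ShortRctChecklist(
    randomization_described=True,
    blinding_described=False,
    withdrawals_described=True,
)
tally = report.domains_met()
```

A tally of this kind is only a summary of what the article reports; whether it tracks true study quality is the empirical question raised in the surrounding text.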

The movement from longer, more inclusive instruments to shorter ones is a pattern observed throughout the health services research world for at least 25 years, particularly in areas relating to the assessment of health status and health-related quality of life. Thus, this model is not surprising in the field of evidence-based practice and measurement. However, the lesson to be drawn from efforts to derive shorter, but equivalently reliable and valid, instruments from longer ones (with proven reliability and validity) is that substantial empirical work is needed to ensure that the shorter forms operate as intended. More generally, we are not convinced that shorter instruments per se will always be better, unless demonstrated in future empirical studies.

Reporting Guidelines

Reporting guidelines such as the CONSORT, QUOROM, and forthcoming STARD statements are not to be used for assessing the quality of RCTs, systematic reviews, or studies of diagnostic tests, respectively. However, the statements can be expected to lead to better reporting and two downstream benefits. First, the unavoidable tension (when assessing study quality) between the actual study design, conduct, and analysis and the reporting of these traits may diminish. Second, if researchers consider these guidelines at the outset of their work, they are likely to have better designed studies that will be easier to understand when the work is published.

Conflicting Findings When Bodies of Evidence Contain Different Types of Studies

A significant challenge arises in evaluating a body of knowledge comprising observational and RCT data. A contemporary case in point is the association between hormone replacement therapy (HRT) and cardiovascular risk. Several observational studies but only one large and two small RCTs have examined the association between HRT and secondary prevention of cardiovascular disease for older women with preexisting heart disease. In terms of quantity, the number of studies and participants is high for the observational studies and modest for the RCTs. Results are fairly consistent across the observational studies and across the RCTs, but between the two types of studies the results conflict. Observational studies show a treatment benefit, but the three RCTs showed no evidence that hormone therapy was beneficial for women with established cardiovascular disease.

Most experts would agree that RCTs minimize an important potential bias in observational studies, namely selection bias. However, experts also prefer more studies with larger aggregate samples and/or with samples that address more diverse patient populations and practice settings -- often the hallmark of observational studies. The inherent tension between these factors is clear. The lesson we draw is that a system for grading the strength of evidence, in and of itself and no matter how good it is, may not completely resolve the tension. Users, practitioners, and policymakers may need to consider these issues in light of the broader clinical or policy questions they are trying to solve.

Selecting Systems for Use Today: A "Best Practices" Orientation

Overall, many systems covered most of the domains that we considered generally informative for assessing study quality. From this set, we identified 19 generic systems that fully address our key quality domains (with the exception of funding or sponsorship for several systems). 3,6-8,12,14,18,24,26,32,36,38,40,45,47,49,50,77,78 Three systems were used for both RCTs and observational studies. 14,40,45

In our judgment, those who plan to incorporate study quality into a systematic review, evidence report, or technology assessment can use one or more of these 19 systems as a starting point, being sure to take into account the types of study designs occurring in the articles under review. Other considerations for selecting or developing study quality systems include the key methodological issues specific to the topic under study, the available time for completing the review (some systems seem rather complex to complete), and whether the preference is for a scale or a checklist. We caution that systems used to rate the quality of both RCTs and observational studies -- what we refer to as "one size fits all" quality assessments -- may prove to be difficult to use and, in the end, may measure study quality less precisely than desired.

We identified seven systems that fully addressed all three domains for grading the strength of a body of evidence. The earliest system was published in 1994; 81 the remaining systems were published in 1999 11 and 2000, 82-86 indicating that this is a rapidly evolving field.

Systems for grading the strength of a body of evidence are much less uniform than those for rating study quality. This variability complicates the job of selecting one or more systems that might be put into use today. Two properties of these systems stand out. Consistency has only recently become an integral part of the systems we reviewed in this area. We see this as a useful advance. Also continuing is the use of a study design hierarchy to define study quality as an element of grading overall strength of evidence. However, reliance on such a hierarchy without consideration of the domains discussed throughout this report is increasingly seen as unacceptable. As with the quality rating systems, selecting among the evidence grading systems will depend on the reason for measuring evidence strength, the type of studies that are being summarized, and the structure of the review panel. Some systems appear to be rather cumbersome to use and may require substantial staff, time, and financial resources.

Although several EPCs used methods that met our criteria at least in part, these were topic-specific applications (or modifications) of generic parent instruments. The same is generally true of efforts to grade the overall strength of evidence. We refer readers interested in systems deliberately focused on a specific clinical condition or technology to the citations given in the main report.

Recommendations for Future Research

Despite our being able to identify various rating and grading systems that can more or less be taken off the shelf for use today, we found many areas in which information or empirical documentation was lacking. We recommend that future research be directed to the topics listed below, because until these research gaps are bridged, those wishing to produce authoritative systematic reviews or technology assessments will be somewhat hindered in this phase of their work. Specifically, we highlight the need for work on:

  • Identifying and resolving quality rating issues pertaining to observational studies;
  • Evaluating inter-rater reliability of both quality rating and strength-of-evidence grading systems;
  • Comparing the quality ratings from different systems applied to articles on a single clinical or technology topic;
  • Similarly, comparing strength-of-evidence grades from different systems applied to a single body of evidence on a given topic;
  • Determining what factors truly make a difference in final quality scores for individual articles (and by extension a difference in how quality is judged for bodies of evidence as a whole);
  • Testing shorter forms in terms of reliability, reproducibility, and validity;
  • Testing applications of these approaches for "less traditional" bodies of evidence (i.e., beyond preventive services, diagnostic tests, and therapies) -- for instance, for systematic reviews of disease risk factors, screening tests (as contrasted with tests also used for diagnosis), and counseling interventions;
  • Assessing whether the study quality grids that we developed are useful for discriminating among studies of varying quality and, if so, refining and testing the systems further using typical instrument development techniques (including testing the study quality grids against the instruments we considered to be "high quality"); and
  • Comparing and contrasting approaches to rating quality and grading evidence strength in the United States and abroad, because of the substantial attention being given to this work outside this country; such work would identify what advances are taking place in the international community and help determine where these are relevant to the U.S. scene.
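
For the inter-rater reliability item in the list above, one commonly used statistic for two raters assigning categorical ratings is Cohen's kappa, which corrects observed agreement for the agreement expected by chance. The report does not prescribe a particular statistic; the Python sketch below is offered only as an illustration, and the function name and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same articles.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of articles.")

    n = len(rater_a)

    # Proportion of articles on which the two raters gave the same rating.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions,
    # summed over the rating categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("Kappa is undefined when chance agreement equals 1.")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical quality ratings of four articles by two reviewers.
reviewer_1 = ["good", "good", "poor", "poor"]
reviewer_2 = ["good", "poor", "poor", "poor"]
kappa = cohen_kappa(reviewer_1, reviewer_2)  # 0.5 for these ratings
```

The same calculation could be applied to strength-of-evidence grades assigned by two panels to one body of evidence, provided both use the same grade categories.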

Conclusion

We summarized more than 100 sources of information on systems for assessing study quality and strength of evidence for systematic reviews and technology assessments. After applying evaluative criteria based on key domains to these systems, we identified 19 study quality and seven strength-of-evidence grading systems that those conducting systematic reviews and technology assessments can use as starting points. In making this information available to the Congress and then disseminating it more widely, AHRQ can meet the congressional expectations set forth in the Healthcare Research and Quality Act of 1999 and outlined at the outset of the report. The broader agenda to be met is for those producing systematic reviews and technology assessments to apply these rating and grading schemes in ways that can be made transparent for groups developing clinical practice guidelines and other health-related policy advice. We have also offered a rich agenda for future research in this area, noting that the Congress can enable pursuit of this body of research through AHRQ and its EPC program. We are confident that the work and recommendations contained in this report will move the evidence-based practice field ahead in ways that will bring benefit to the entire health care system and the people it serves.


Copyright and Disclaimer