U.S. Department of Health and Human Services www.hhs.gov
Agency for Healthcare Research and Quality www.ahrq.gov

U.S. Preventive Services Task Force (USPSTF)
Procedure Manual


Section 5: Methods for Arriving at a Recommendation

The preceding two sections have dealt with the processes of question definition and evidence review, processes which are primarily the work of the EPC. This section begins the description of the specific work of the Task Force in examining and judging the cumulative evidence presented to it, and making recommendations. The steps in this process, as described in this section, include assessing the evidence at the Key Question level and across an entire Analytic Framework, assessing both the certainty of the evidence about, and the magnitude of, the harms and benefits of the service, estimating the magnitude of the net benefit for the service, and the certainty of that estimation, and finally arriving at a recommendation grade for that service in the relevant population.

5.1 Assessing Evidence at the Key Question/Linkage Level

In considering the information provided by the body of evidence across a linkage in the analytic framework (e.g., for a key question), all 6 critical appraisal questions (go to section 4.2.1) must be considered. The evidence concerning a key question is often categorized by its strongest research design, and then the level of evidence is classified into one of 3 categories: "convincing", "adequate", and "inadequate."

In making this determination, the Task Force considers the evidence described by the EPC in its review. It considers the "aggregate internal validity" of all studies across the key question. This judgment is not a simple summation of the grading of all of the studies in a body of evidence, but often reflects the best research concerning an issue. The general issue is the extent to which at least some studies meet the criteria for internal validity (Appendix VII). Likewise, aggregate external validity refers to the extent to which the best studies are generalizable to primary care populations, situations, and providers (Appendix VIII).

Coherence is used (in addition to consistency) to indicate that a body of evidence "makes sense," in that it fits together to present an understandable picture of the situation. Coherence in this context includes the concordance between populations, interventions, and outcomes in the studies reviewed. Several studies of an issue may find different results (and thus be inconsistent), but the results may still be understandable (and thus coherent) in terms of the populations they studied or the interventions they used.

5.2 Assessing Certainty of Evidence for the Entire Preventive Service: Evidence Synthesis

As in assessing the evidence across key questions, the Task Force, in discussion with the EPC/topic team leader, also plays the primary role in synthesizing evidence for the entire preventive service.

Assessing evidence at the level of the entire preventive service requires a complex synthesis of all evidence across the entire analytic framework. The question is not simply the level of evidence across each key question/Linkage, but also how these bodies of evidence fit together to provide an accurate estimate of the expected magnitude of benefits, harms, and net benefit (i.e., benefits minus harms) that would be realized from widespread implementation of the preventive service.

The Task Force considers this synthesis of the information provided by the entire body of evidence to be the "certainty" of the overall evidence. The certainty may also be thought of as the width of the "conceptual confidence interval" (CCI) given by the evidence to estimate the magnitude of benefits, harms, and net benefits. This CCI is not a quantitative calculation, but rather a judgment based on the 6 critical appraisal questions given earlier, and how the evidence fits together to complete the linkage from the left side of the analytic framework (population) to the right side (health outcomes). A wide CCI can come from a lack of evidence about one or more key questions; from studies of the wrong study design; from studies of the right design but of poor internal or external validity; from too small and/or too few studies; from inconsistent/incoherent studies; or other aspects of the studies that cloud the interpretation of the magnitude of benefits, harms, and net benefits. When the CCI is wide, then the magnitude cannot be estimated with any confidence, and the entire body of evidence is categorized as having "low certainty."

When the evidence satisfies most of the criteria for each of the 6 critical appraisal questions, and fits together "well enough" to make the connections across the analytic framework, then the CCI is considered to be narrower. In this case, we have a better (although not precise) estimate of the magnitude of benefits, harms, and net benefits. This type of body of evidence is categorized as "moderate certainty."

When the evidence satisfies criteria for each of the 6 critical appraisal criteria across the analytic framework, and the evidence fits together well, then the CCI is narrow—we have a more precise estimate of benefits, harms, and net benefits. In this case, the body of evidence is categorized as "high certainty."

The general definitions of the 3 levels of overall evidence are given in Table 2.

The Task Force is careful to separate the concepts of "certainty" of evidence and "magnitude of benefits, harms, and net benefit". For example, the Task Force may have high certainty of the overall evidence and still determine that there is small (or even zero) magnitude of benefits. Or it may have moderate certainty and determine that there is substantial magnitude of net benefits. The TF first assesses the certainty of the evidence, then the magnitude of benefits, harms, and net benefit. These are used together in the Recommendation Grid to determine the TF recommendation letter.

5.3 Dealing with Conflicts among RCTs

When RCTs of the same question appear to contradict each other, the Task Force first clinically appraises the studies, considering whether the studies truly contradict each other. Evidence on the clinical issue from other sources may be useful in this assessment. The Task Force then critically appraises the studies to determine whether the quality of the trials helps to explain any differences. If neither clinical nor epidemiologic reasoning sheds light on the differences between the trials, the Task Force at times must admit that it doesn't know why the studies contradict each other. Quantitative synthesis of trials is most useful for aggregating the results of RCTs when the results are generally consistent.

5.4 Assessing Magnitude of Net Benefit

As noted earlier, the Task Force decided that it is important to keep separate the certainty afforded by the evidence from the magnitude of effect (i.e., benefits and harms) of the preventive service. To specify the magnitude of the effect of a preventive service, the Task Force separately assesses the magnitude of benefits and harms, and then combines these into a "net benefit" assessment. The Task Force has adopted a four-tiered grading system for the net benefit rating: substantial, moderate, small, and zero/negative. Thus, "substantial" net benefit indicates that benefits substantially outweigh harms, whereas "zero/negative" net benefit indicates that harms equal or even outweigh benefits. This assessment is conducted by the Task Force, in discussion with the EPC and AHRQ team members.

The Task Force defines net benefit as the magnitude of the benefits of the service minus the magnitude of the harms. The Task Force gives equal attention to both benefits and harms since it is well aware that preventive interventions may result in harms as either a direct consequence of the service or for other "downstream" reasons.

Because of lack of evidence, especially evidence using a single, suitable metric, the assessment of "net benefits" is inherently subjective. Thus, the Task Force has not developed specific criteria to judge net benefit.

The Task Force attempts to quantify the magnitude of benefits and harms that would result from implementing the preventive service in the general primary care population. One way of doing so is by using such metrics as "number needed to treat" (NNT, the number of people that would need to be treated for some defined period of time to prevent one adverse health event) or "number needed to screen" (NNS, the number of people that would need to be screened for some defined period of time to prevent one adverse health event). One can also derive a similar "number needed to harm" (NNH, the number of people needed to treat or screen for a defined time to cause one adverse health event). The Task Force does not have a single NNT, NNS, or NNH that it considers to be a threshold for drawing a conclusion about the magnitude of net benefit, due to the often substantial uncertainty in the evidence used to make the estimates.
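The arithmetic behind these metrics can be sketched briefly. The following is a minimal illustration, not a Task Force method: it assumes the conventional definition of each "number needed" as the reciprocal of an absolute risk difference over a defined period, and all risk values and function names are hypothetical. The same reciprocal applies to NNS when the risk difference compares screened with unscreened groups.

```python
def absolute_risk_reduction(control_risk: float, treated_risk: float) -> float:
    """Difference in event risk between the untreated and treated groups."""
    return control_risk - treated_risk


def number_needed(risk_difference: float) -> float:
    """Reciprocal of an absolute risk difference (used for NNT, NNS, or NNH)."""
    if risk_difference <= 0:
        raise ValueError("risk difference must be positive")
    return 1.0 / risk_difference


# Hypothetical 10-year risks, chosen only for illustration (not from any study).
control_risk = 0.020        # risk of the adverse health event without the service
treated_risk = 0.015        # risk of the same event with the service
harm_risk_increase = 0.001  # added risk of a harm caused by the service

arr = absolute_risk_reduction(control_risk, treated_risk)
nnt = number_needed(arr)                # about 200 people treated per event prevented
nnh = number_needed(harm_risk_increase) # about 1000 people treated per harm caused
relative_risk_reduction = arr / control_risk  # about 0.25

print(f"ARR={arr:.4f}  NNT={nnt:.0f}  NNH={nnh:.0f}  RRR={relative_risk_reduction:.2f}")
```

In this hypothetical case a 25% relative risk reduction corresponds to an absolute reduction of only 0.5 percentage points, which is why the absolute figures (and the NNT and NNH derived from them) are the ones compared.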

The Task Force does have a general way of thinking about the concept of net benefit. Net benefit, as used by the Task Force, is substantial in those situations in which either:

  1. A large proportion of the total burden of suffering from the target condition (minus the additional burden caused by the preventive service) would be relieved from society by implementing the preventive service, even if the target condition is rare (e.g., screening for PKU).
  2. A large amount of the burden of suffering would be relieved from society (minus the amount of the additional burden caused by the preventive service) by implementing the preventive service (e.g., counseling for smoking cessation).

Note that in both of these situations, a population can be defined that has a substantial burden of suffering from the target condition, even if rare, and there is a prevention strategy that reduces that burden by a substantial amount. Net benefit, however, would only be substantial if harms of the intervention are zero or small (as in the examples cited here). Thus, both the magnitude of harms and the magnitude of benefits are critical factors in determining net benefits.

5.4.1 Assessing Magnitude of Benefits

In situations where the certainty of evidence is high or moderate, the Task Force considers all of the admissible evidence to determine the magnitude of benefit that would be expected from implementing the preventive service in a defined population. Its preferred approach for doing this is the Outcomes Table. In this table, the topic team uses the evidence to estimate the number of people in a hypothetical population who would benefit in specific ways from implementation of the preventive service, over a given time horizon (often 5-10 years). Specific health benefits might include such things as lives extended, cardiovascular events avoided, visual impairment avoided, lung cancers avoided, or alcohol complications avoided. In some situations, the table can be completed easily by simply transferring information from a large, well-conducted RCT of a representative population. Most commonly, however, some cells in the Table are not so easily completed and require calculations based on assumptions—a situation that intrinsically adds uncertainty. Thus the different numbers in an Outcomes Table have different levels of certainty and must be interpreted carefully.

Note that the numbers in an Outcomes Table are meant to shed light on the amount of the burden of suffering from the condition (within a stated population) that can be expected to be prevented by the intervention in question. The magnitude of benefit cannot be greater than the total burden of suffering.

For screening interventions, the benefit may be further limited by such issues as the following:

  1. The prevalence of the target condition.
  2. For heterogeneous conditions, the prevalence of that subtype of the condition that would cause important health problems.
  3. The sensitivity of the screening test (i.e., the degree to which the test will detect that subtype of the condition that would potentially cause health problems; rarely 100%).
  4. The effectiveness of early treatment (compared with later treatment) of the subtype of the condition that would cause health problems. (This quantity is rarely 100%).

The Outcomes Table can show such considerations as these, demonstrating how many people are likely to receive benefit—and in what ways—from implementation of the preventive service. In situations of limited or absent direct evidence, this type of logic is useful to the Task Force in placing an upper bound on the magnitude of benefit. In other situations, the Task Force may logically be able to judge the lower bounds of the benefit.
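The upper-bound logic for screening can be illustrated with a short calculation. This is a sketch under stated assumptions, not the Task Force's procedure: it treats the four limiting factors listed above as independent proportions that multiply, and the function name, parameter names, and all input values are hypothetical.

```python
def max_people_benefiting(
    population: int,
    prevalence: float,
    important_subtype_fraction: float,
    sensitivity: float,
    early_treatment_effectiveness: float,
) -> float:
    """Upper bound on people helped by screening.

    Assumes (for illustration only) that the limiting factors are
    independent proportions, so each one can only shrink the total.
    """
    factors = (
        prevalence,
        important_subtype_fraction,
        sensitivity,
        early_treatment_effectiveness,
    )
    if not all(0.0 <= f <= 1.0 for f in factors):
        raise ValueError("all factors must be proportions between 0 and 1")
    bound = float(population)
    for f in factors:
        bound *= f
    return bound


# Hypothetical inputs: 10,000 people screened, 2% prevalence, half of cases
# are the clinically important subtype, the test detects 80% of those, and
# early treatment averts the outcome in 25% of detected cases.
upper_bound = max_people_benefiting(
    population=10_000,
    prevalence=0.02,
    important_subtype_fraction=0.5,
    sensitivity=0.8,
    early_treatment_effectiveness=0.25,
)
print(f"At most about {upper_bound:.0f} of 10,000 screened people could benefit")
```

With these made-up inputs, 200 prevalent cases shrink to roughly 20 people who could benefit, which shows how quickly the ceiling on benefit falls as each factor is applied.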

5.4.2 Assessing Magnitude of Harms

The Task Force starts with the assumption that harmless interventions are rare. For screening interventions, the Task Force looks for harms of screening and also harms of early treatment. Harms of screening may include such things as psychological harm from labeling and the harms of work-ups to confirm the presence of the condition. The harms of treatment may include the actual physical effects of early treatment as well as the effects of "over-treatment." These harms of treatment may accrue to patients whose conditions might never have come to clinical attention or for whom the harms of treatment initiated prior to routine clinical detection were different or occurred earlier and/or over a longer period of time. In other words, these are harms of treatment which would not have occurred in the absence of screening. Although harms of counseling are frequently small, harms may include psychological harms from labeling or harms of treatment.

Although there is often less evidence about potential harms than about potential benefits, the Task Force may draw general conclusions from such evidence as the expected yield of screening in terms of false positive test results. If the prevalence of the condition is low and the specificity of the test is less than 100%, then there will be some false positive tests (i.e., the positive predictive value may be low). If the work-up is invasive, then the Task Force can infer that there will be at least some harms from many people going through an invasive work-up for no possible benefit (i.e., people who had a false positive screening test).
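The relationship between low prevalence, imperfect specificity, and false positives can be made concrete with a small calculation. The sketch below uses the standard definitions of sensitivity, specificity, and positive predictive value; the function name and all input values are hypothetical and chosen only for illustration.

```python
def screening_yield(
    population: int,
    prevalence: float,
    sensitivity: float,
    specificity: float,
) -> tuple[float, float, float]:
    """Return (true positives, false positives, positive predictive value)."""
    with_condition = population * prevalence
    without_condition = population - with_condition
    true_positives = with_condition * sensitivity
    false_positives = without_condition * (1.0 - specificity)
    total_positives = true_positives + false_positives
    # PPV is undefined when no one tests positive; report 0.0 in that case.
    ppv = true_positives / total_positives if total_positives > 0 else 0.0
    return true_positives, false_positives, ppv


# Hypothetical test: 1% prevalence, 90% sensitivity, 95% specificity.
true_pos, false_pos, ppv = screening_yield(
    population=10_000, prevalence=0.01, sensitivity=0.90, specificity=0.95
)
print(f"true positives={true_pos:.0f}, false positives={false_pos:.0f}, PPV={ppv:.1%}")
```

Under these assumed values, roughly 495 people without the condition test positive against about 90 true positives, so only about 15% of positive results are correct. If the confirmatory work-up is invasive, those 495 work-ups represent harm with no possible benefit.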

Similarly, if over-treatment is common, and if the treatment has some adverse effects, the Task Force may infer that screening causes at least some harms, even in the absence of a study dedicated to defining harms. This approach does not require an exact estimate of the magnitude of harms but rather a determination that the harms are unlikely to be less than what is known about the number of false positives, the invasiveness of the work-up, and the expected amount of over-treatment. These "lower bounds" of harms can be shown in an Outcomes Table. Care should be taken to call attention to the estimate's lack of precision.

In another situation, the Task Force may determine that a study gives an upper bound of benefit (or harm), rather than a lower bound. For example, the Task Force might consider the estimate of benefit to be an upper bound if it came from a study of an intervention conducted by highly trained physicians using specialized equipment for people at very high risk.

The Task Force also considers the time and effort required by both patients and the health care system (opportunity costs) to implement the preventive care service. If the time and effort are judged to be clinically important these factors are also considered in the "harms" category. The Task Force usually has general rather than precise estimates of opportunity costs.

Although opportunity costs may be considered in the Task Force's letter grades, financial costs are not. The Task Force understands, however, that many of its audiences are interested in issues of financial cost. In situations where there is likely to be some degree of health benefit, the Task Force searches for information about costs and cost-effectiveness and provides a summary of this information under "Other Considerations" in its recommendation statement.

5.4.3 Assessing Magnitude of Net Benefit

Once the Task Force has estimated the magnitude of benefits and harms, it faces the further challenge of synthesizing these assessments into an estimate of the magnitude of net benefit. Weighing the balance of benefits and harms can be challenging since they are often measured in different metrics. Benefits are often quantified in terms of lives extended or illness events averted. Harms may be measured in different metrics, such as false positive screening tests or adverse effects of treatment.

As noted above, the Outcomes Table is a critical tool in the Task Force's approach to determining the magnitude of net benefit. The estimates in an Outcomes Table may not have a great deal of precision but are useful in giving a general idea of the magnitude of the benefits and harms. Both estimates from direct evidence and also estimates based on explicit assumptions should be included, in order to provide likely upper and lower bounds of the magnitude of specific benefits and harms. A Decision Analysis is another approach to provide information about magnitude of benefits and harms based on best estimates from direct evidence and from explicit assumptions. A Decision Analytic model would typically describe benefits to a population over a lifetime horizon rather than the five or ten years represented in an Outcomes Table.

It is common for direct evidence to be inadequate to complete one or more critical cells in the Outcomes Table. This may be due either to a lack of direct evidence or to gaps in the direct evidence that is available. Common gaps in direct evidence include such factors as lack of evidence about all populations (including risk groups) of interest; lack of availability of the exact interventions (or the experts administering them) used in the large RCTs; or insufficient follow-up to determine long-term effects of interventions. Thus, the Task Force needs to use indirect evidence to calculate upper or lower bounds of benefits and harms. The conceptual confidence interval (CCI), discussed above (section 5.2), places upper and lower limits on the estimated net benefit. This range is bounded by the best-case and worst-case scenario estimates based on available evidence. The interval is not meant to have a statistical interpretation. The Task Force, however, recognizes the danger in this approach, and considers such bounds with appropriate skepticism, using them only with great care.
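The best-case and worst-case bounding described here can be sketched as simple interval arithmetic. This is an illustration only and rests on a strong simplifying assumption that the text does not make: that benefits and harms have been expressed in one common unit (here, hypothetical "events per 10,000 people"). The `Bounds` type, the function name, and all numbers are invented for the example, and the resulting range carries no statistical interpretation.

```python
from typing import NamedTuple


class Bounds(NamedTuple):
    """Lower and upper estimates, in a common (assumed) unit."""
    low: float
    high: float


def net_benefit_bounds(benefit: Bounds, harm: Bounds) -> Bounds:
    """Worst case pairs the smallest benefit with the largest harm;
    best case pairs the largest benefit with the smallest harm."""
    if benefit.low > benefit.high or harm.low > harm.high:
        raise ValueError("each bound must satisfy low <= high")
    worst_case = benefit.low - harm.high
    best_case = benefit.high - harm.low
    return Bounds(low=worst_case, high=best_case)


# Hypothetical estimates per 10,000 people over the table's time horizon.
benefit = Bounds(low=5.0, high=20.0)  # adverse events prevented
harm = Bounds(low=2.0, high=8.0)      # adverse events caused

net = net_benefit_bounds(benefit, harm)
print(f"net benefit ranges from {net.low} to {net.high}")
```

In this made-up case the range runs from -3 to 18 and therefore spans zero, which corresponds to the situation of a wide CCI in which the direction of net benefit cannot be stated with confidence.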

After assembled data on expected outcomes, whether from an Outcomes Table or from a Decision Analytic Model, are presented, the Task Force must still weigh benefits and harms (usually very different types of health effects in different metrics) to arrive at net benefits. Clearly, value judgments are involved in this balancing of effects. In making its determination of net benefit, the Task Force strives to consider what it believes are the general values of most people. When the Task Force perceives that preferences among individuals vary greatly, and that these variations are sufficient to change the balance of benefits and harms, it will often suggest shared decision making to incorporate the individual's perspective into the decision.

The Task Force has standardized the Outcomes Table to the extent possible. There will invariably be some variation, depending on the topic. The standard Outcomes Table format is given in Appendix IX.

5.5 Translating Evidence into USPSTF Recommendations

Clarity and comprehensibility are critical for recommendations' widespread use. The Task Force and AHRQ are aware that the recommendations and the letter grades used to define them may be misunderstood, and are therefore taking pains to clarify and refine them. AHRQ has conducted multiple focus groups of clinicians (over the period from 2004 to 2006) to solicit feedback about the readability and usability of the Task Force recommendations. Themes that emerged included requests for: simplified, succinct recommendations and an easier-to-use format (bold face type, bulleted sections and boxes to highlight key information); recommendations of other professional organizations to easily compare to the USPSTF recommendation; and Web sites and references for additional information on the topic.

5.6 Principles for Making Recommendations

Task Force recommendations are coded to reflect both the certainty of the evidence and the magnitude of effect (i.e., net benefit as discussed in section 5.4.3). To be as explicit as possible about its approach to making recommendations, the Task Force developed a set of principles for making recommendations. These principles, listed below, describe in detail the factors that the Task Force does and does not take into consideration in making recommendations, and to whom the recommendations apply.

  1. Recommendations are evidence-based: there must be scientific evidence that persons who receive the preventive service experience better health outcomes than those who do not, and that the benefits are large enough to outweigh the harms.
    • The supporting evidence can be compiled from data regarding specific linkages in the analytic framework, but in the end the complete causal chain from intervention to outcome must be supported by acceptable evidence.
    • Inferences about supporting evidence can include generalizations from one population to another when there are acceptable grounds to assume the evidence is applicable to both. A screening test can also be considered effective if evidence supports the value of treatment for early stage disease.
    • Recommendations are not based largely on opinion, such as expert opinion or subjective perceptions based on clinical experience. Subjective judgments do enter into the evaluation of evidence and the weighing of benefits and harms.
    • The scientific rationale for the recommendations and the methods used to review and judge the evidence are stated explicitly along with the recommendations.
    • Recommendations describe services that should or should not routinely be offered based on scientific evidence, although it is recognized that in clinical practice and public policy concerns other than scientific evidence (e.g., feasibility, public expectations) may take precedence.
  2. The outcomes that matter most in weighing the evidence and making recommendations are health benefits and harms.
    • In assessing health benefits, outcomes that patients can feel or care about (e.g., visual acuity, pain, survival) receive more weight than intermediate/surrogate outcomes.
    • In judging the magnitude of benefit, absolute reductions in risk matter more than relative risk reductions.
    • Effectiveness is considered as valuable as, if not more valuable than, efficacy. The ability of patients, providers, and the health care system to perform or maintain interventions over time is considered. Interventions may not be recommended at the population level because of concerns about compliance (adherence), but may be advocated for patients and providers who are willing and able to perform the intervention.
    • The direct and indirect harms of preventive services must also be considered, ensuring that they do not outweigh the benefits to the individual and/or population. Because of the ethical imperative to do no harm, especially when caring for asymptomatic persons, in selected circumstances the quality of evidence for harms need not be as strong as that for benefits. Both physical and psychological harms are considered.
    • Judgments about tradeoffs between benefits and harms are generally made at the population level, and involve subjective estimations by the Task Force of the average utilities of the population. For interventions that involve tradeoffs that are highly sensitive to patient utilities (i.e., interventions for which the relationship between benefits and harms is influenced heavily by personal preferences), the Task Force may abandon population-based recommendations and advocate shared decision-making at the individual level.
    • Consideration of benefits and harms should not be limited to the perspective of individuals but should also consider population effects (e.g., population attributable risk, decreased exposure to infectious diseases, herd immunity).
  3. The economic costs (direct and indirect) of preventive services, both to individuals and to society, warrant consideration in making recommendations but are not the first priority.
    • Although the USPSTF does not consider economic costs in making recommendations, it realizes that these costs are important in the decision to implement preventive services. Thus, in situations where there is likely to be some effectiveness of the service, the TF searches for evidence of the costs and cost-effectiveness of implementation, presenting this information separately from its recommendation.
  4. Recommendations are not modified to accommodate concerns about insurance coverage of preventive services, medicolegal liability, or legislation, but users of the recommendations may need to do so.
  5. Recommendations apply only to asymptomatic persons or to those with unrecognized signs or symptoms of the target condition for which the preventive service is intended. They also apply only to preventive services delivered in, or referable from the clinical setting.
    • Persons living in the United States are the target population, although it is understood that the evidence reviews and recommendations may be useful in other countries. Recommendations are not intended for populations with markedly different disease patterns and health care services (e.g., developing countries).
    • The clinical settings to which the recommendations apply are typically primary care ambulatory practices but can also include offices and clinics of specialists, hospitals, emergency departments, public health departments, urgent care facilities, student health centers, worksites, family planning clinics, nursing homes, and home care.
    • The evidence for preventive services delivered outside the traditional clinical context (e.g., non-clinic based programs at schools, worksites, shopping centers) is often the same, but the recommendations are not primarily intended for this setting.
    • Recommendations apply only to asymptomatic persons or to those with unrecognized signs or symptoms of the target condition for which the preventive service is intended. They also apply only to those preventive services for which at least one component, e.g., identification or referral, may be delivered in the primary care clinical setting.

5.7 Grades

The Task Force also adopted a set of grades to apply to the evidence. For graded recommendations, appended rationale statements and statements about clinical considerations allow readers to clearly understand the Task Force's judgment about the certainty of the evidence, the net benefit of implementation, and the overall recommendation about the use of each preventive service.

The Task Force includes a grade that indicates when evidence is insufficient to make any recommendation. These grade 'I' topics are accompanied by the same type of rationale and clinical considerations, but are considered "statements" rather than "recommendations."

The grades may be best understood in the grid in Table 3.
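A grid of this kind can be read as a two-key lookup from certainty and magnitude of net benefit to a letter grade. Table 3 is not reproduced in this section, so the cell assignments in the sketch below are an assumption based on the grid as it is commonly published, and should be checked against Table 3 itself. The category names follow sections 5.2 and 5.4; the function and variable names are hypothetical.

```python
from typing import Optional

# ASSUMPTION: cell values follow the commonly published USPSTF grid;
# they are not taken from this section and should be verified against Table 3.
GRADE_GRID = {
    "high": {
        "substantial": "A",
        "moderate": "B",
        "small": "C",
        "zero/negative": "D",
    },
    "moderate": {
        "substantial": "B",
        "moderate": "B",
        "small": "C",
        "zero/negative": "D",
    },
}


def recommendation_grade(certainty: str, net_benefit: Optional[str]) -> str:
    """Look up a letter grade from certainty and magnitude of net benefit."""
    if certainty == "low":
        # With low certainty the magnitude cannot be estimated, so the
        # result is an 'I' statement regardless of net_benefit.
        return "I"
    if certainty not in GRADE_GRID:
        raise ValueError(f"unknown certainty level: {certainty!r}")
    row = GRADE_GRID[certainty]
    if net_benefit not in row:
        raise ValueError(f"unknown net benefit level: {net_benefit!r}")
    return row[net_benefit]


print(recommendation_grade("high", "substantial"))
print(recommendation_grade("moderate", "small"))
print(recommendation_grade("low", None))
```

Keeping certainty and magnitude as separate keys mirrors the separation of the two concepts described in section 5.2: the same magnitude can lead to different grades depending on how certain the evidence is.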

The Task Force also adopted a plan for appending to the recommendation grade an explicit rationale statement, giving the Task Force's assessment of the overall certainty of the evidence and the magnitude of net benefit. After the rationale statement, the Task Force adds a statement about Clinical Intervention, providing more specific guidance to clinicians.

5.8 Wording of Recommendation or Conclusion Statements

The Task Force also adopted standardized language for the grades given in Figure 5, as shown below:

A: The USPSTF recommends X service for Y population.

B: The USPSTF recommends X service for Y population.

C: The USPSTF recommends against routinely (providing) X service for Y population. There may be considerations that support (providing) the service in an individual patient.

D: The USPSTF recommends against X service for Y population.

I: The USPSTF concludes that the current evidence is insufficient to assess the balance of benefits and harms of X service in Y population.

Table 4 provides detailed definitions of the grades, with suggestions for clinical practice.

