Section 5: Methods for Arriving at a Recommendation
The preceding two sections have dealt with the processes of question definition
and evidence review, processes which are primarily the work of the EPC. This section
begins the description of the specific work of the Task Force in examining and
judging the cumulative evidence presented to it, and making recommendations. The
steps in this process, as described in this section, include assessing the evidence
at the Key Question level and across an entire Analytic Framework, assessing both
the certainty of the evidence about, and the magnitude of, the harms and benefits
of the service, estimating the magnitude of the net benefit for the service, and
the certainty of that estimation, and finally arriving at a recommendation grade
for that service in the relevant population.
5.1 Assessing Evidence at the Key Question/Linkage Level
In considering the information provided by the body of evidence across a linkage
in the analytic framework (e.g., for a key question), all 6 critical appraisal
questions (see section 4.2.1) must be considered. The evidence concerning a key question
is often categorized by its strongest research design, and then the level of
evidence is classified into one of 3 categories: "convincing,"
"adequate", and "inadequate."
In making this determination, the Task Force considers the evidence described by
the EPC in its review. It considers the "aggregate internal validity" of
all studies across the key question. This judgment is not a simple summation of the
grading of all of the studies in a body of evidence, but often reflects the best
research concerning an issue. The general issue is the extent to which at least
some studies meet the criteria for internal validity (Appendix VII). Likewise,
aggregate external validity refers to the extent to which the best studies are
generalizable to primary care populations, situations, and providers (Appendix VIII).
Coherence is used (in addition to consistency) to indicate that a body of
evidence "makes sense," in that it fits together to present an
understandable picture of the situation. Coherence in this context includes the
concordance between populations, interventions, and outcomes in the studies
reviewed. Several studies of an issue may find different results (and thus be
inconsistent), but the results may still be understandable (and thus coherent)
in terms of the populations they studied or the interventions they used.
5.2 Assessing Certainty of Evidence for the Entire Preventive Service: Evidence
Synthesis
As in assessing the evidence across key questions, the Task Force, in discussion with
the EPC/topic team leader, also plays the primary role in synthesizing evidence for
the entire preventive service.
Assessing evidence at the level of the entire preventive service requires a
complex synthesis of all evidence across the entire analytic framework. The question
is not simply the level of evidence across each key question/linkage, but also how
these bodies of evidence fit together to provide an accurate estimate of the expected
magnitude of benefits, harms, and net benefit (i.e., benefits minus harms) that would
be realized from widespread implementation of the preventive service.
The Task Force considers this synthesis of the information provided by the entire
body of evidence to be the "certainty" of the overall evidence. The
certainty may also be thought of as the width of the "conceptual confidence
interval" (CCI) given by the evidence to estimate the magnitude of benefits,
harms, and net benefits. This CCI is not a quantitative calculation, but rather a
judgment based on the 6 critical appraisal questions given earlier, and how the
evidence fits together to complete the linkage from the left side of the analytic
framework (population) to the right side (health outcomes). A wide CCI can come
from a lack of evidence about one or more key questions; from studies of the
wrong study design; from studies of the right design but of poor internal or
external validity; from too small and/or too few studies; from inconsistent/incoherent
studies; or other aspects of the studies that cloud the interpretation of the
magnitude of benefits, harms, and net benefits. When the CCI is wide, then the
magnitude cannot be estimated with any confidence, and the entire body of evidence
is categorized as having "low certainty."
When the evidence satisfies most of the criteria for all 6 critical appraisal questions,
and fits together "well enough" to make the connections across the
analytic framework, then the CCI is considered to be narrower. In this case, we have
a better (although not precise) estimate of the magnitude of benefits, harms, and
net benefits. This type of body of evidence is categorized as "moderate
certainty."
When the evidence satisfies the criteria for each of the 6 critical appraisal questions
across the analytic framework, and the evidence fits together well, then the CCI is
narrow—we have a more precise estimate of benefits, harms, and net benefits.
In this case, the body of evidence is categorized as "high certainty."
The general definitions of the 3 levels of overall evidence are given in Table 2.
The Task Force is careful to separate the concepts of "certainty"
of evidence and "magnitude of benefits, harms, and net benefit."
For example, the Task Force may have high certainty of the overall evidence and
still determine that there is small (or even zero) magnitude of benefits. Or it
may have moderate certainty and determine that there is substantial magnitude of
net benefits. The Task Force first assesses the certainty of the evidence, then the magnitude
of benefits, harms, and net benefit. These are used together in the Recommendation
Grid to determine the Task Force recommendation letter.
5.3 Dealing with Conflicts among RCTs
When RCTs of the same question appear to contradict each other, the Task Force first
clinically appraises the studies, considering whether the studies truly contradict each
other. Evidence on the clinical issue from other sources may be useful in this assessment.
The Task Force then critically appraises the studies to determine whether the quality of
the trials helps to explain any differences. If neither clinical nor epidemiologic
reasoning sheds light on the differences between the trials, the Task Force at times must
admit that it doesn't know why the studies contradict each other. Quantitative synthesis
of trials is most useful for aggregating the results of RCTs when the results are generally
consistent.
5.4 Assessing Magnitude of Net Benefit
As noted earlier, the Task Force decided that it is important to keep separate the
certainty afforded by the evidence from the magnitude of effect (i.e., benefits and harms)
of the preventive service. To specify the magnitude of the effect of a preventive service,
the Task Force separately assesses the magnitude of benefits and harms, and then combines
these into a "net benefit" assessment. The Task Force has adopted a four-tiered
grading system for the net benefit rating: substantial, moderate, small, and zero/negative.
Thus, "substantial" net benefit indicates that benefits substantially outweigh
harms, whereas "zero/negative" net benefit indicates that harms equal or even
outweigh benefits. This assessment is conducted by the Task Force, in discussion with the
EPC and AHRQ team members.
The Task Force defines net benefit as the magnitude of the benefits of the service
minus the magnitude of the harms. The Task Force gives equal attention to both benefits
and harms since it is well aware that preventive interventions may result in harms as
either a direct consequence of the service or for other "downstream" reasons.
Because of lack of evidence, especially evidence using a single, suitable metric, the
assessment of "net benefits" is inherently subjective. Thus, the Task Force has
not developed specific criteria to judge net benefit.
The Task Force attempts to quantify the magnitude of benefits and harms that would
result from implementing the preventive service in the general primary care population.
One way of doing so is by using such metrics as "number needed to treat" (NNT,
the number of people that would need to be treated for some defined period of time to
prevent one adverse health event) or "number needed to screen" (NNS, the number
of people that would need to be screened for some defined period of time to prevent one
adverse health event). One can also derive a similar "number needed to harm"
(NNH, the number of people needed to treat or screen for a defined time to cause one adverse
health event). The Task Force does not have a single NNT, NNS, or NNH that it considers
to be a threshold for drawing a conclusion about the magnitude of net benefit, due to the
often substantial uncertainty in the evidence used to make the estimates.
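The arithmetic behind these metrics can be illustrated with a minimal sketch. It assumes the usual definition of each number as the reciprocal of the absolute difference in event risk between people who do and do not receive the service over the same period. The function name and all risk values below are hypothetical and are not drawn from any Task Force review.

```python
import math


def number_needed(risk_without: float, risk_with: float) -> float:
    """Number of people who must receive the service, over the study period,
    for one event to be prevented (NNT/NNS) or caused (NNH).

    Computed as 1 / |absolute risk difference|.
    """
    absolute_difference = abs(risk_without - risk_with)
    if absolute_difference == 0:
        return math.inf  # no difference in risk: no finite number applies
    return 1.0 / absolute_difference


# Hypothetical 10-year risks, chosen only for illustration.
nns = number_needed(risk_without=0.010, risk_with=0.008)  # events prevented
nnh = number_needed(risk_without=0.000, risk_with=0.001)  # harms caused
print(f"NNS = {nns:.0f}, NNH = {nnh:.0f}")
```

Because each input risk is itself an estimate, the resulting numbers inherit that uncertainty, which is consistent with the Task Force not using a fixed threshold.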
The Task Force does have a general way of thinking about the concept of net benefit.
Net benefit, as used by the Task Force, is substantial in those situations in which either:
- A large proportion of the total burden of suffering from the target condition
(minus the additional burden caused by the preventive service) would be relieved
from society by implementing the preventive service, even if the target condition is
rare (e.g., screening for PKU).
- A large amount of the burden of suffering would be relieved from society (minus the
amount of the additional burden caused by the preventive service) by implementing the
preventive service (e.g., counseling for smoking cessation).
Note that in both of these situations, a population can be defined that has a
substantial burden of suffering from the target condition, even if rare, and there is
a prevention strategy that reduces that burden by a substantial amount. Net benefit,
however, would only be substantial if harms of the intervention are zero or small (as in
the examples cited here). Thus, both the magnitude of harms and the magnitude of benefits
are critical factors in determining net benefits.
5.4.1 Assessing Magnitude of Benefits
In situations where the certainty of evidence is high or moderate, the Task Force
considers all of the admissible evidence to determine the magnitude of benefit that would
be expected from implementing the preventive service in a defined population. Its preferred
approach for doing this is the Outcomes Table. In this table, the topic team uses the evidence
to estimate the number of people in a hypothetical population who would benefit in specific
ways from implementation of the preventive service, over a given time horizon (often 5-10
years). Specific health benefits might include such things as lives extended, cardiovascular
events avoided, visual impairment avoided, lung cancers avoided, or alcohol complications
avoided. In some situations, the table can be completed easily by simply transferring
information from a large, well-conducted RCT of a representative population. Most commonly,
however, some cells in the Table are not so easily completed and require calculations based
on assumptions—a situation that intrinsically adds uncertainty. Thus the different
numbers in an Outcomes Table have different levels of certainty and must be
interpreted carefully.
Note that the numbers in an Outcomes Table are meant to shed light on the amount of the
burden of suffering from the condition (within a stated population) that can be expected to
be prevented by the intervention in question. The magnitude of benefit cannot be greater
than the total burden of suffering.
For screening interventions, the benefit may be further limited by such issues as
the following:
- The prevalence of the target condition.
- For heterogeneous conditions, the prevalence of that subtype of the condition that
would cause important health problems.
- The sensitivity of the screening test (i.e., the degree to which the test will detect
that subtype of the condition that would potentially cause health problems; rarely 100%).
- The effectiveness of early treatment (compared with later treatment) of the subtype of
the condition that would cause health problems. (This quantity is rarely 100%.)
The Outcomes Table can show such considerations as these, demonstrating how many
people are likely to receive benefit—and in what ways—from implementation of
the preventive service. In situations of limited or absent direct evidence, this type of
logic is useful to the Task Force in placing an upper bound on the magnitude of benefit.
In other situations, the Task Force may logically be able to judge the lower bounds of
the benefit.
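The limiting factors listed above can be combined into a rough upper bound on the number of people who could benefit from screening. The sketch below is only one way to express that logic. It assumes the factors act multiplicatively and independently, which is a simplification, and every name and value in it is hypothetical.

```python
def max_people_benefiting(population: int,
                          prevalence: float,
                          important_subtype_fraction: float,
                          sensitivity: float,
                          early_treatment_effectiveness: float) -> float:
    """Rough upper bound on the number of screened people who could benefit.

    Each factor after `population` is a proportion between 0 and 1:
    prevalence of the condition, share of cases in the subtype that would
    cause important health problems, test sensitivity for that subtype, and
    the added effectiveness of early (versus later) treatment.
    """
    factors = {
        "prevalence": prevalence,
        "important_subtype_fraction": important_subtype_fraction,
        "sensitivity": sensitivity,
        "early_treatment_effectiveness": early_treatment_effectiveness,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    bound = float(population)
    for value in factors.values():
        bound *= value  # each factor can only shrink the bound
    return bound


# Hypothetical inputs for a population of 10,000 people.
bound = max_people_benefiting(10_000, 0.02, 0.5, 0.8, 0.25)
print(f"At most about {bound:.0f} people could benefit")
```

Since no factor exceeds 1, the result can never exceed the number of people who have the condition, in keeping with the statement that benefit cannot be greater than the total burden of suffering.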
5.4.2 Assessing Magnitude of Harms
The Task Force starts with the assumption that harmless interventions are rare. For
screening interventions, the Task Force looks for harms of screening and also harms of
early treatment. Harms of screening may include such things as psychological harm from
labeling and the harms of work-ups to confirm the presence of the condition. The harms
of treatment may include the actual physical effects of early treatment as well as the
effects of "over-treatment." These harms of treatment may accrue to patients
whose conditions might never have come to clinical attention or for whom the harms of
treatment initiated prior to routine clinical detection were different or occurred earlier
and/or over a longer period of time. In other words, these are harms of treatment which
would not have occurred in the absence of screening. Although harms of counseling are
frequently small, harms may include psychological harms from labeling or harms of treatment.
Although there is often less evidence about potential harms than about potential benefits,
the Task Force may draw general conclusions from such evidence as the expected yield of
screening in terms of false positive test results. If the prevalence of the condition is
low and the specificity of the test is less than 100%, then there will be some false
positive tests (i.e., the positive predictive value may be low). If the work-up is invasive,
then the Task Force can infer that there will be at least some harms from many people
going through an invasive work-up for no possible benefit (i.e., people who had a
false positive screening test).
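The relationship between prevalence, specificity, and false positive results can be made concrete with a short calculation. The sketch below uses the standard definitions of sensitivity, specificity, and positive predictive value. The function name and the input values are hypothetical and serve only to illustrate the reasoning.

```python
def screening_yield(population: int,
                    prevalence: float,
                    sensitivity: float,
                    specificity: float) -> dict:
    """Expected true positives, false positives, and positive predictive
    value (PPV) from screening a population once."""
    with_condition = population * prevalence
    without_condition = population - with_condition

    true_positives = with_condition * sensitivity
    false_positives = without_condition * (1.0 - specificity)

    all_positives = true_positives + false_positives
    ppv = true_positives / all_positives if all_positives > 0 else float("nan")

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "ppv": ppv,
    }


# Hypothetical test: 1% prevalence, 90% sensitivity, 95% specificity.
result = screening_yield(10_000, prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(f"False positives: {result['false_positives']:.0f}, PPV: {result['ppv']:.1%}")
```

With these illustrative inputs, false positives outnumber true positives several times over, so most people sent for a work-up would undergo it without any possibility of benefit.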
Similarly, if over-treatment is common, and if the treatment has some adverse effects,
the Task Force may infer that screening causes at least some harms, even in the absence
of a study dedicated to defining harms. This approach does not require an exact estimate
of the magnitude of harms but rather a determination that the harms are unlikely to be
less than what is known about the number of false positives, the invasiveness of the
work-up, and the expected amount of over-treatment. These "lower bounds" of
harms can be shown in an Outcomes Table. Care should be taken to call attention to the
estimate's lack of precision.
In another situation, the Task Force may determine that a study gives an upper bound
of benefit (or harm), rather than a lower bound. For example, the Task Force might
consider the estimate of benefit to be an upper bound if it came from a study of an
intervention conducted by highly trained physicians using specialized equipment for
people at very high risk.
The Task Force also considers the time and effort required by both patients and the
health care system (opportunity costs) to implement the preventive care service. If the
time and effort are judged to be clinically important, these factors are also considered
in the "harms" category. The Task Force usually has general rather than
precise estimates of opportunity costs.
Although opportunity costs may be considered in the Task Force's letter grades,
financial costs are not. The Task Force understands, however, that many of its audiences
are interested in issues of financial cost. In situations where there is likely to be
some degree of health benefit, the Task Force searches for information about costs
and cost-effectiveness and provides a summary of this information under "Other
Considerations" in its recommendation statement.
5.4.3 Assessing Magnitude of Net Benefit
Once the Task Force has estimated the magnitude of benefits and harms, it faces the
further challenge of synthesizing these assessments into an estimate of the magnitude
of net benefit. Weighing the balance of benefits and harms can be challenging since
they are often measured in different metrics. Benefits are often quantified in terms of
lives extended or illness events averted. Harms may be measured in different metrics,
such as false positive screening tests or adverse effects of treatment.
As noted above, the Outcomes Table is a critical tool in the Task Force's approach
to determining the magnitude of net benefit. The estimates in an outcomes table may not
have a great deal of precision but are useful in giving a general idea of the magnitude
of the benefits and harms. Both estimates from direct evidence and also estimates based
on explicit assumptions should be included, in order to provide likely upper and lower
bounds of the magnitude of specific benefits and harms. A Decision Analysis is another
approach to provide information about magnitude of benefits and harms based on best
estimates from direct evidence and from explicit assumptions. A Decision Analytic model
would typically describe benefits to a population over a lifetime horizon rather than
the five or ten years represented in an Outcomes Table.
It is common for direct evidence to be inadequate to complete one or more critical
cells in the Outcomes Table. This may be due either to a lack of direct evidence or to
gaps in the direct evidence that is available. Common gaps in direct evidence include
such factors as lack of evidence about all populations (including risk groups) of interest;
lack of availability of the exact interventions (or the experts administering them)
used in the large RCTs; or insufficient follow-up to determine long-term effects of
interventions. Thus, the Task Force needs to use indirect evidence to calculate upper
or lower bounds of benefits and harms. The conceptual confidence interval (CCI), discussed
above (section 5.2), places upper and lower limits on the estimated net benefit. This
range is bounded by the best-case and worst-case scenario estimates based on available
evidence. The interval is not meant to have a statistical interpretation. The Task Force,
however, recognizes the danger in this approach, and considers such bounds with
appropriate skepticism, using them only with great care.
After assembled data on expected outcomes, whether from an Outcomes Table or from
a Decision Analytic Model, are presented, the Task Force must still weigh benefits and
harms (usually very different types of health effects in different metrics) to arrive
at net benefits. Clearly, value judgments are involved in this balancing of effects. In
making its determination of net benefit, the Task Force strives to consider what it
believes are the general values of most people. When the Task Force perceives that
preferences among individuals vary greatly, and that these variations are sufficient
to change the balance of benefits and harms, it will often suggest shared decision
making to incorporate the individual's perspective into the decision.
The Task Force has standardized the Outcomes Table to the extent possible. There will
invariably be some variation, depending on the topic. The standard Outcomes Table format
is given in Appendix IX.
5.5 Translating Evidence into USPSTF Recommendations
Clarity and comprehensibility are critical for recommendations' widespread use. The
Task Force and AHRQ are aware that the recommendations and the letter grades used to
define them may be misunderstood, and are therefore taking pains to clarify and refine
them. AHRQ has conducted multiple focus groups of clinicians (from 2004 to 2006)
to solicit feedback about the readability and usability of the Task
Force recommendations. Themes that emerged included requests for: simplified, succinct
recommendations and an easier-to-use format (bold face type, bulleted sections and boxes
to highlight key information); recommendations of other professional organizations to
easily compare to the USPSTF recommendation; and Web sites and references for additional
information on the topic.
5.6 Principles for Making Recommendations
Task Force recommendations are coded to reflect both the certainty of the evidence
and the magnitude of effect (i.e., net benefit as discussed in section 5.4.3). To
be as explicit as possible about its approach to making recommendations, the Task Force
developed a set of principles for making recommendations. These principles, listed below,
describe in detail the factors that the Task Force does and does not take into consideration
in making recommendations, and to whom the recommendations apply.
- Recommendations are evidence-based: there must be scientific evidence that persons
who receive the preventive service experience better health outcomes than those who do
not, and that the benefits are large enough to outweigh the harms.
- The supporting evidence can be compiled from data regarding specific linkages
in the analytic framework, but in the end the complete causal chain from intervention
to outcome must be supported by acceptable evidence.
- Inferences about supporting evidence can include generalizations from one
population to another when there are acceptable grounds to assume the evidence is
applicable to both. A screening test can also be considered effective if evidence
supports the value of treatment for early stage disease.
- Recommendations are not based largely on opinion, such as expert opinion or
subjective perceptions based on clinical experience. Subjective judgments do enter
into the evaluation of evidence and the weighing of benefits and harms.
- The scientific rationale for the recommendations and the methods used to review
and judge the evidence are stated explicitly along with the recommendations.
- Recommendations describe services that should or should not routinely be offered
based on scientific evidence, although it is recognized that in clinical practice and
public policy concerns other than scientific evidence (e.g., feasibility, public
expectations) may take precedence.
- The outcomes that matter most in weighing the evidence and making recommendations
are health benefits and harms.
- In assessing health benefits, outcomes that patients can feel or care
about (e.g., visual acuity, pain, survival) receive more weight than
intermediate/surrogate outcomes.
- In judging the magnitude of benefit, absolute reductions in risk matter
more than relative risk reductions.
- Effectiveness is considered as valuable as, if not more valuable than, efficacy.
The ability of patients, providers, and the health care system to perform or
maintain interventions over time is considered. Interventions may not be recommended
at the population level because of concerns about compliance (adherence), but may
be advocated for patients and providers who are willing and able to perform the
intervention.
- The direct and indirect harms of preventive services must also be considered,
ensuring that they do not outweigh the benefits to the individual and/or population.
Because of the ethical imperative to do no harm, especially when caring for
asymptomatic persons, in selected circumstances the quality of evidence for harms
need not be as strong as that for benefits. Both physical and psychological harms
are considered.
- Judgments about tradeoffs between benefits and harms are generally made at
the population level, and involve subjective estimations by the Task Force of the
average utilities of the population. For interventions that involve tradeoffs that
are highly sensitive to patient utilities, interventions for which the relationship
between benefits and harms is influenced heavily by personal preferences, the Task
Force may abandon population-based recommendations and advocate shared decision-making
at the individual level.
- Consideration of benefits and harms should not be limited to the perspective
of individuals but should also consider population effects (e.g., population
attributable risk, decreased exposure to infectious diseases, herd immunity).
- The economic costs (direct and indirect) of preventive services, both to
individuals and to society, warrant consideration in making recommendations but
are not the first priority.
- Although the USPSTF does not consider economic costs in making recommendations,
it realizes that these costs are important in the decision to implement preventive
services. Thus, in situations where there is likely to be some effectiveness of the
service, the Task Force searches for evidence of the costs and cost-effectiveness of
implementation, presenting this information separately from its recommendation.
- Recommendations are not modified to accommodate concerns about insurance coverage
of preventive services, medicolegal liability, or legislation, but users of the
recommendations may need to do so.
- Recommendations apply only to asymptomatic persons or to those with unrecognized
signs or symptoms of the target condition for which the preventive service is intended.
They also apply only to preventive services delivered in, or referable from, the
clinical setting.
- Persons living in the United States are the target population, although it is
understood that the evidence reviews and recommendations may be useful in other
countries. Recommendations are not intended for populations with markedly different
disease patterns and health care services (e.g., developing countries).
- The clinical settings to which the recommendations apply are typically primary
care ambulatory practices but can also include offices and clinics of specialists,
hospitals, emergency departments, public health departments, urgent care facilities,
student health centers, worksites, family planning clinics, nursing homes, and home
care.
- The evidence for preventive services delivered outside the traditional clinical
context (e.g., non-clinic based programs at schools, worksites, shopping centers) is
often the same, but the recommendations are not primarily intended for this setting.
- Recommendations apply only to asymptomatic persons or to those with unrecognized
signs or symptoms of the target condition for which the preventive service is intended.
They also apply only to those preventive services for which at least one component,
e.g., identification or referral, may be delivered in the primary care clinical
setting.
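One of the principles above states that absolute reductions in risk matter more than relative risk reductions. A brief sketch may help show why. It applies the same relative risk to two populations with different baseline risks. The function name and all numbers are hypothetical.

```python
from typing import Tuple


def risk_reductions(baseline_risk: float, relative_risk: float) -> Tuple[float, float]:
    """Return (absolute risk reduction, relative risk reduction) for a
    service that multiplies the baseline risk by `relative_risk`."""
    risk_with_service = baseline_risk * relative_risk
    absolute_reduction = baseline_risk - risk_with_service
    relative_reduction = 1.0 - relative_risk
    return absolute_reduction, relative_reduction


# Same hypothetical relative risk (0.75) in a high-risk and a low-risk group.
for label, baseline in [("high-risk", 0.20), ("low-risk", 0.002)]:
    arr, rrr = risk_reductions(baseline, relative_risk=0.75)
    print(f"{label}: relative reduction {rrr:.0%}, absolute reduction {arr:.4f}")
```

Both groups show the same relative reduction, yet the absolute reduction in the low-risk group is 100 times smaller, so far fewer people there would actually benefit.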
5.7 Grades
The Task Force also adopted a set of grades to apply to the evidence. For graded
recommendations, appended rationale statements and statements about clinical
considerations allow readers to clearly understand the Task Force's judgment about the
certainty of the evidence, the net benefit of implementation, and the overall
recommendation about the use of each preventive service.
The Task Force includes a grade that indicates when evidence is insufficient to make
any recommendation. These grade 'I' topics are accompanied by the same type of rationale
and clinical considerations, but are considered "statements" rather than
"recommendations."
The grades may be best understood in the
grid in Table 3.
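The way certainty and magnitude of net benefit combine into a letter grade can be sketched as a simple lookup. Table 3 is not reproduced in this section, so the mapping below is an assumption that follows the grid as commonly published by the USPSTF. Table 3 remains the authoritative source, and the function and variable names are hypothetical.

```python
# ASSUMPTION: this mapping follows the grid as commonly published by the
# USPSTF; it is not copied from Table 3, which remains authoritative.
NET_BENEFIT_LEVELS = ("substantial", "moderate", "small", "zero/negative")

GRADE_GRID = {
    "high": {"substantial": "A", "moderate": "B", "small": "C", "zero/negative": "D"},
    "moderate": {"substantial": "B", "moderate": "B", "small": "C", "zero/negative": "D"},
}


def recommendation_grade(certainty: str, net_benefit: str) -> str:
    """Look up the letter grade for a certainty level and net benefit level."""
    certainty = certainty.strip().lower()
    net_benefit = net_benefit.strip().lower()

    if certainty == "low":
        # With low certainty the magnitude cannot be estimated with any
        # confidence, so the evidence is treated as insufficient.
        return "I"
    if certainty not in GRADE_GRID:
        raise ValueError(f"unknown certainty level: {certainty!r}")
    if net_benefit not in NET_BENEFIT_LEVELS:
        raise ValueError(f"unknown net benefit level: {net_benefit!r}")
    return GRADE_GRID[certainty][net_benefit]


print(recommendation_grade("high", "substantial"))
print(recommendation_grade("moderate", "small"))
print(recommendation_grade("low", "moderate"))
```

The lookup keeps the two judgments separate until the final step, mirroring the order in which the Task Force assesses certainty first and magnitude second.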
The Task Force also adopted a plan for appending to the recommendation grade an
explicit rationale statement, giving the Task Force's assessment of the overall
certainty of the evidence and the magnitude of net benefit. After the rationale
statement, the Task Force adds a statement about Clinical Intervention, providing
more specific guidance to clinicians.
5.8 Wording of Recommendation or Conclusion Statements
The Task Force also adopted standardized language for the grades given in Figure 5,
as shown below:
A: The USPSTF recommends X service for Y population.
B: The USPSTF recommends X service for Y population.
C: The USPSTF recommends against routinely (providing) X service for Y population.
There may be considerations that support (providing) the service in an individual
patient.
D: The USPSTF recommends against X service for Y population.
I: The USPSTF concludes that the current evidence is insufficient to assess the
balance of benefits and harms of X service in Y population.
Table 4 provides detailed definitions of the grades, with suggestions for clinical practice.