Breast Cancer Screening, Summary of Evidence

Appendix

Analytic Framework

Because of the availability of population-based, randomized trials, mammography has the most direct type of evidence of any cancer screening program.98 Nevertheless, mammography has been controversial since it was first proposed in the 1960s. To understand why, it is helpful to consider the assumptions underlying the steps in the causal chain from screening test to health outcomes. In the analytic framework (Appendix Figure 1), this evidence is shown by the overarching arc connecting screening with the outcomes, reduced morbidity and mortality.

Mammography is aimed at early detection of invasive cancer, which is treated by major surgery (mastectomy or tumorectomy). This differs from screening for colorectal cancer and cervical cancer, which is aimed at detecting and removing precancerous lesions to prevent invasive cancer and to preserve the involved organ (colon or uterine cervix). This is one reason why, although it may be reasonable to endorse one cancer screening test (Papanicolaou smear) based on observational, indirect evidence, it may also be reasonable to require experimental evidence before endorsing another (mammography or prostate cancer screening).

It is important to note that the mammography trials do not necessarily provide the highest level of evidence about the efficacy of early treatment. While there is no doubt that screening results in earlier diagnosis of invasive breast cancer, the efficacy of earlier treatment of invasive cancer has not been established independently of the trials.99 That is, there is no direct evidence from trials of surgical therapy (versus watchful waiting) that earlier treatment of invasive cancer reduces mortality. The mammography trials do not attempt to link specific treatments, such as radical mastectomy or adjuvant radiation, to improved outcomes.

The reliance on a theory of treatment rather than on evidence about the efficacy of treatment increases the burden of proof placed on the trials of mammography. It also distinguishes cancer screening from other screening services considered by the USPSTF, such as chlamydia, depression, or osteoporosis screening, for which randomized, placebo-controlled trials of treatment have been done.

The threshold for sufficient evidence about efficacy also depends on the balance of benefits and harms. Because mammography technology, the timing and type of information provided to patients, and treatment approaches have changed over time, the adverse consequences of screening in current practice might be very different from those in the trials. Other sources of data must be used to estimate these consequences.

Identification and Selection of Articles

We identified controlled trials and meta-analyses by searching the Cochrane Controlled Trials Registry (all dates) and by searching MEDLINE® for recent publications (January 1994 to December 2001). Other sources were a PREMEDLINE search (December 2001 through February 2002); the reference lists of previous reviews, commentaries, and meta-analyses;5,8,27,32,50,53,55,56,60,87,100-103 the results of a broader search conducted for the systematic evidence review on which this article is based;46 and suggestions from experts.

In the electronic searches, the terms breast neoplasms and breast cancer were combined with the terms mammography and mass screening and with terms for controlled or randomized trials to yield 954 citations. Titles and abstracts were reviewed to identify publications that were randomized, controlled trials of breast cancer screening and had a relevant clinical outcome (advanced breast cancer, breast cancer mortality, or all-cause mortality). In all, the searches identified 146 controlled trials, of which 132 were excluded at the title and abstract phase because they concerned promoting screening rather than the efficacy of mammography (Appendix Figure 2).

Four of the remaining 12 trials were excluded. Two were randomized trials of screening with mammography that have not yet presented outcomes of mortality or advanced breast cancer.104,105 The third was a controlled trial that reported a reduction in breast cancer mortality but was not randomized.106,107 The fourth, the Malmö Prevention Study, was apparently a randomized trial of a variety of preventive interventions, including mammography.108 It reported significantly fewer deaths from cancer among women younger than 40 years of age at study entry but provided no information about the mammography protocol, referring readers to another randomized trial, the Malmö Mammographic Screening Program, for further information. We believe that the two trials were in fact separate and that the results of the Malmö Mammographic Screening Program probably do not include results for the 8,000 women who participated in the Malmö Prevention Study.

The remaining eight randomized trials of mammography were conducted between 1963 and 1994. Four of these were Swedish studies: the Malmö, Stockholm, Gothenburg, and Swedish Two-County (Kopparberg and Ostergotland) studies. The remaining studies were the Edinburgh study, the New York Health Insurance Plan (HIP) study, and the two Canadian National Breast Screening Studies (CNBSS-1 and CNBSS-2). Using the electronic searches and other sources, we retrieved the full text of 157 publications about these trials (these are listed in the bibliography accompanying the full systematic evidence review46). We also identified 10 previous systematic reviews of the trials. Seven of these concerned breast cancer mortality, and three addressed test performance.36,37,45 The searches identified three nonrandomized, controlled trials109-111 that are not included in the meta-analysis but are discussed in the larger report.46 Two randomized trials of breast self-examination were identified and reviewed.

Two of the authors abstracted information about each randomized, controlled trial. We compiled an appendix consisting of detailed information about the patient population, design, potential flaws, missing information, and analysis conducted in each trial. For the primary end point of breast cancer mortality, we abstracted results for each reported length of followup. Whenever possible, we abstracted data separately for participants by decade of age.

The randomized trials of screening provide little information about morbidity or the adverse effects of screening or treatment. A systematic review of adverse effects was beyond the scope of our review. On the basis of our review of titles and abstracts, we obtained and reviewed the full text of recent articles reporting the frequency of false-positive results on screening mammography in the community and of surveys of women's reactions to positive results on screening tests.

Assessment of Study Quality: General Approach

We used predefined criteria developed by the USPSTF to assess the internal validity of each study (Appendix Table 1).9 Two authors rated each study as "good," "fair," or "poor," resolving disagreements by discussion among the authors after review of the data and of comments by 12 peer reviewers of earlier drafts of the report. We tried to apply the same standards to the mammography trials as we have applied to other prevention topics. We based our quality ratings on the entire set of publications from a trial rather than on individual articles.

The USPSTF criteria were designed to be adaptable to the circumstances of different clinical questions. Like other current systems to assess the quality of trials, the criteria are based as much as possible on empirical evidence of bias in relation to study characteristics. However, although the body of such evidence is growing, it does not permit a high degree of certainty about the importance of specific quality criteria in judging the mammography trials. This is because nearly all empirical evidence of the impact of bias on effect size examined drug treatment or other therapies, rather than screening.112,113 Generalization of these findings to large, population-based trials of screening is not straightforward. In recognition of this fact, cancer screening literature from the 1970s emphasizes that design standards for conventional trials of treatment should not always be applied to cancer screening trials.114

The quality of reporting of trials limits precision in critical appraisal.115 This is a particular issue in the mammography screening trials, many of which were conducted in the 1960s and 1970s and whose methods were poorly described. Although some reviewers have promoted extensive query of trial authors to fill in gaps in published articles, the reliability of such data, as well as the appropriate interpretation of query data that contradict what has been published in multiauthored, peer-reviewed papers, is uncertain. Moreover, authors are often unable to provide clarifying information.116

Assessment of Study Quality: Application of Specific Criteria

All of the trials clearly defined interventions and co-interventions (clinical breast examination [CBE] and breast self-examination [BSE]), all considered mortality outcomes, and all used intention-to-screen analysis. For this reason, the following criteria received particular emphasis in judging the quality of the mammography trials:

  1. Initial assembly of comparable groups.
  2. Maintenance of comparable groups and minimization of differential or overall loss to followup.
  3. Use of outcome measurements that were equal, reliable, and valid.

As described below, we used a systematic approach to assess the flaws of the trials in each of these areas.

Initial Assembly of Comparable Groups

In the mammography trials, randomization was done individually or by clusters. Randomization of individuals is preferable because it is less likely to result in baseline differences among compared groups. In individually randomized trials, we classified allocation concealment as adequate, inadequate, or poorly described, according to the criteria used by Schulz and colleagues.115 In a cluster-randomized trial, it is impossible to conceal the assignment of individual patients, and the importance of concealing the allocation of clusters is unclear. Accordingly, we placed more importance on concealment in individually randomized trials.

We rated the way in which each trial compared participants in the screened and control groups. To obtain the highest rating in this category, a trial had to obtain baseline data on possible covariates before randomization, and the distribution of these covariates had to be similar in screening and control groups. In a large, individually randomized trial, baseline differences in sociodemographic variables would suggest that randomization failed, especially if there were opportunities for subversion (that is, if allocation was not concealed).

This standard applies only if baseline data can be reliably collected in all patients in both groups. In several of the mammography screening trials, participants in the usual care group were followed passively, and there was no opportunity to collect baseline data from all of them. The decision not to contact each individual in the control group has logistic advantages and probably reduced contamination, but it limits comparison between the screened and control groups. Moreover, when clusters are used, some baseline differences in the compared groups are almost inevitable.

We evaluated whether the method of identifying clusters (for example, geographic areas, month or year of birth) was likely to result in bias and whether measures such as matching were used to reduce it. If bias in assigning clusters to intervention or control groups seemed likely, we considered this a major flaw that was enough to invalidate the findings and rated the study as "poor." However, in contrast to individually randomized trials, we did not take small differences in the mean age of compared groups to be an indication that randomization failed to distribute more important confounders equally among the groups.

Several of the trials measured mortality rates from causes other than breast cancer to establish the comparability of the mammography and control groups. We recorded this information when it was available. Although comparable total mortality supports balanced randomization, it does not assure it. However, if there were dramatic differences in death from other causes, we considered it to be evidence that randomization failed.

Maintenance of Comparable Groups and Minimization of Differential or Overall Loss to Followup

Exclusions after randomization are considered to be a serious flaw in the execution of randomized trials, although empirical evidence of this bias is inconsistent.112,113 Postrandomization exclusions were poorly described in several of the mammography trials and could have resulted in bias if the exclusions resulted in different levels of risk for death from breast cancer between the groups. In most of the mammography trials, however, exclusion of participants after randomization was an expected consequence of the protocol; some exclusion criteria, such as previous mastectomy, could not be applied to all participants before randomization because participants were not individually contacted. We examined the number of, reasons for, and methods for exclusion of participants after randomization. We based our rating on whether the methods used to ascertain patients were objective and consistent, not on the numbers of exclusions in the compared groups. Since ascertainment of clinical variables that might result in exclusion of a participant will be greater among intervention participants and is an expected consequence of the study design, we did not consider unequal numbers of excluded participants in the treatment and control groups after randomization to be definitive evidence of bias.

Use of Outcome Measurements That Were Equal, Reliable, and Valid (Including Masking of Outcome Assessment)

Over the duration of most of the trials, death from breast cancer (the primary end point) occurred in 2 to 9 per 1,000 participants. The relatively low number of events means that misclassification or biased exclusion of a few deaths could change the direction and statistical significance of the trial results. For this reason, selection of cases for cause-of-death review using broad criteria, use of reliable sources of information to ascertain vital status (death certificates, medical records, autopsies, registries), and use of independent, blinded review of the cause of death are important measures to prevent bias. We considered blinded review of deaths a requirement for a quality rating of fair or better.
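As a purely hypothetical illustration of this sensitivity (the counts below are invented, not taken from any trial), shifting a handful of deaths between groups noticeably changes the estimated relative risk:

```python
# Hypothetical trial arms of 25,000 women each, with breast cancer death rates
# inside the 2 to 9 per 1,000 range noted above.
n = 25_000
deaths_screened, deaths_control = 105, 130

rr = (deaths_screened / n) / (deaths_control / n)
print(f"RR = {rr:.2f}")          # about 0.81, an apparent 19% reduction

# Misclassifying or excluding just 5 deaths in each group shifts the estimate.
rr_shifted = ((deaths_screened + 5) / n) / ((deaths_control - 5) / n)
print(f"RR = {rr_shifted:.2f}")  # about 0.88
```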

Approach to Multiple Analyses

The mammography trials have been criticized for decades,99,117-119 and the trialists have responded by conducting additional analyses intended to address these criticisms. In our assessment of quality, we took into account the results of these supplemental analyses. For example, the cluster-randomized trials have been criticized because they analyzed results using statistical methods appropriate only to individually randomized trials. However, an independent reanalysis using the correct statistical method found that the results were unchanged.48 The Canadian trialists addressed criticisms that women who had palpable nodes might have been enrolled preferentially in the mammography group120 by reanalyzing their data and showing that the exclusion of these participants did not affect the results.22

Data Synthesis

Four of the trials compared mammography alone with usual care, and four compared mammography plus CBE with usual care. Because it is uncertain whether CBE is effective, and in consultation with USPSTF members, we decided to treat these trials as qualitatively homogeneous. The statistical homogeneity of the trials was also assessed by using the standard chi-square test; the P value was greater than 0.1, consistent with homogeneity of the effect sizes estimated by the studies.
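The computation itself is not reported here; as a minimal sketch of the standard chi-square test of homogeneity (Cochran's Q) applied to hypothetical per-trial log relative risks and standard errors (placeholders, not the trial results), one might write:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical per-trial log relative risks and standard errors (placeholders).
log_rr = np.array([-0.30, -0.22, -0.10, -0.35, -0.15, -0.25, -0.05])
se     = np.array([0.12, 0.15, 0.20, 0.10, 0.18, 0.14, 0.22])

w = 1.0 / se**2                            # inverse-variance weights
pooled = np.sum(w * log_rr) / np.sum(w)    # fixed-effect pooled log relative risk
Q = np.sum(w * (log_rr - pooled) ** 2)     # Cochran's Q statistic
p_value = chi2.sf(Q, df=len(log_rr) - 1)   # P > 0.1 is consistent with homogeneity
print(f"Q = {Q:.2f}, P = {p_value:.2f}")
```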

We conducted two meta-analyses to address two key questions posed by the USPSTF:

  1. Does mammography reduce breast cancer mortality rates among women over a broad range of ages when compared with usual care?
  2. If so, does mammography reduce breast cancer mortality rates among women 40 to 49 years of age when compared with usual care?

In the first analysis, we included all data from the seven fair-quality trials, treating the two Canadian studies as one trial in participants 40 to 59 years of age. In the second analysis, we included the six fair-quality trials that reported results for women younger than 50 years of age.

We conducted each meta-analysis in two parts. First, using WinBUGS software, we constructed a two-level Bayesian random-effects model to estimate the effect size from multiple data points for each study and to derive a pooled estimate of relative risk reduction and credible interval for a given length of followup.11 The purpose of this analysis was to use repeated measures of the effect over time to estimate the relationship between length of followup and effect size. Appendix Table 2 shows the data we used in this analysis. Second, we pooled the most recent results of each trial to calculate the absolute and relative risk reduction, using the results of the first analysis to estimate the mean length of observation. Risks were modeled on the logit scale.
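As a sketch of how such repeated-measures data can be organized (the values shown are invented placeholders, not those in Appendix Table 2), each row pairs a trial with one reported followup length and the relative risk observed at that followup:

```python
import numpy as np
import pandas as pd

# Hypothetical layout only: repeated rows per trial supply the repeated measures
# used to relate length of followup to effect size.
followup_data = pd.DataFrame({
    "trial":          ["HIP", "HIP", "Malmo", "Malmo", "Stockholm"],
    "followup_years": [5.0, 10.0, 9.0, 12.0, 8.0],
    "rr":             [0.75, 0.77, 0.95, 0.81, 0.80],
})
followup_data["log_rr"] = np.log(followup_data["rr"])  # effect modeled on the log scale
```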

To model the relationship between length of followup and relative risk, a two-level hierarchical model was used. The first level was the result of a trial at a given average or median followup time, $x_{ij}$, where $i$ indexes the trial and $j$ indexes the data point within a trial. The second level was the trial itself. The model allows for within-trial and between-trial variability. Specifically, the model was:

$$
\begin{aligned}
a^* &\sim \mathrm{Normal}(\cdot,\cdot) \\
b^* &\sim \mathrm{Normal}(\cdot,\cdot) \\
a_i &\sim \mathrm{Normal}(a^*,\, \sigma_a^2) \\
b_i &\sim \mathrm{Normal}(b^*,\, \sigma_b^2) \\
m_{ij} &= a_i + b_i x_{ij} + \tau z_{ij} \\
\tau &\sim \mathrm{Gamma}(\cdot,\cdot) \\
z_{ij} &\sim \mathrm{Normal}(0, 1) \\
\log RR_{ij} &\sim \mathrm{Normal}(m_{ij},\, \sigma^2)
\end{aligned}
$$

A global regression curve was estimated as $\log RR = a^* + b^* x$. The random effect was $\tau z_{ij}$. The model to estimate summary risk was:

$$
\begin{aligned}
\#\,\text{deaths}_{\text{control},i} &\sim \mathrm{Binomial}(p_{\text{control},i},\, n_{\text{control},i}) \\
\#\,\text{deaths}_{\text{intervention},i} &\sim \mathrm{Binomial}(p_{\text{intervention},i},\, n_{\text{intervention},i}) \\
\mathrm{logit}(p_{\text{control},i}) &= a + \tau z_i \\
\mathrm{logit}(p_{\text{intervention},i}) &= a + b + \tau z_i \\
a &\sim \mathrm{Normal}(\cdot,\cdot) \\
b &\sim \mathrm{Normal}(\cdot,\cdot) \\
\tau &\sim \mathrm{Gamma}(\cdot,\cdot)
\end{aligned}
$$

Absolute risk difference was calculated as $p_{\text{control},i} - p_{\text{intervention},i}$. Relative risk was calculated as $\exp(b)$.

The models were estimated by using a Bayesian data analytic framework.121 The data were analyzed by using WinBUGS,11 which uses Gibbs sampling to simulate posterior probability distributions. Noninformative (proper) prior probability distributions were used: $\mathrm{Normal}(0, 10^6)$ and $\mathrm{Gamma}(0.001, 0.001)$. Five separate Markov chains with overdispersed initial values were used to generate draws from posterior distributions. Point estimates (mean) and 95 percent credible intervals (2.5 and 97.5 percentiles) were derived from the subsequent 5 × 10,000 draws after reasonable convergence of the five chains was attained. The code to model the data in WinBUGS is available from the authors on request.
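The original WinBUGS code is available only from the authors; as an illustration, the following sketch re-expresses the summary-risk model in PyMC, a general-purpose Bayesian library, under the priors described above. The counts are placeholders rather than trial data, and PyMC's default sampler (NUTS) is used in place of Gibbs sampling.

```python
import numpy as np
import pymc as pm

# Placeholder counts for seven trials (NOT the trial data): deaths and women per group.
deaths_control = np.array([130, 120, 118, 66, 45, 105, 88])
n_control      = np.array([25000, 21000, 20000, 15000, 13000, 28000, 19000])
deaths_screen  = np.array([105, 100, 102, 60, 38, 90, 80])
n_screen       = np.array([25000, 21000, 20000, 15000, 13000, 28000, 19000])
k = len(n_control)

with pm.Model() as summary_model:
    # Diffuse priors analogous to Normal(0, 10^6) and Gamma(0.001, 0.001).
    a   = pm.Normal("a", mu=0.0, sigma=1000.0)
    b   = pm.Normal("b", mu=0.0, sigma=1000.0)
    tau = pm.Gamma("tau", alpha=0.001, beta=0.001)
    z   = pm.Normal("z", mu=0.0, sigma=1.0, shape=k)    # trial-level random effect

    # Risks modeled on the logit scale, as described in the text.
    p_control = pm.Deterministic("p_control", pm.math.invlogit(a + tau * z))
    p_screen  = pm.Deterministic("p_screen",  pm.math.invlogit(a + b + tau * z))

    pm.Binomial("d_control", n=n_control, p=p_control, observed=deaths_control)
    pm.Binomial("d_screen",  n=n_screen,  p=p_screen,  observed=deaths_screen)

    # Relative risk as exp(b) and per-trial absolute risk difference, as in the text.
    rr  = pm.Deterministic("rr", pm.math.exp(b))
    ard = pm.Deterministic("ard", p_control - p_screen)

    # Five chains of 10,000 draws each.
    trace = pm.sample(draws=10000, chains=5)

# 95 percent credible interval from the 2.5 and 97.5 percentiles of the posterior draws.
rr_draws = trace.posterior["rr"].values.ravel()
print(rr_draws.mean(), np.percentile(rr_draws, [2.5, 97.5]))
```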
