What Works Clearinghouse


WWC Procedures and Standards Handbook
Version 2.0 – December 2008

Appendix D – Benjamini-Hochberg Correction of the Statistical Significance of Effects Estimated with Multiple Comparisons

  1. Multiple Outcome Measures with Single Comparison Group
  2. Single Outcome Measure with Multiple Comparison Groups
  3. Multiple Outcome Measures with Multiple Comparison Groups

In addition to clustering, another factor that may inflate Type I error and the statistical significance of findings occurs when study authors perform multiple hypothesis tests simultaneously. The traditional approach to addressing the problem is the Bonferroni method, which lowers the critical p-value for individual comparisons by a factor of 1/m, with m being the total number of comparisons made. The Bonferroni method, however, has been shown to be unnecessarily stringent for many practical situations; therefore, the WWC has adopted a more recently developed method to correct for multiple comparisons or multiplicity—the Benjamini-Hochberg (BH) method (Benjamini & Hochberg, 1995). The BH method adjusts for multiple comparisons by controlling false discovery rate (FDR) instead of family-wise error rate (FWER). It is less conservative than the traditional Bonferroni method, yet it still provides adequate protection against Type I error in a wide range of applications. Since its conception in the 1990s, there has been growing evidence showing that the FDR-based BH method may be the best solution to the multiple comparisons problem in many practical situations (Williams, Jones, & Tukey, 1999).
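To make the contrast between the two methods concrete, the following minimal Python sketch applies both rules to the six p-values that appear in Table D1 of this appendix, treating them as part of a family of eight comparisons. The function names are illustrative assumptions rather than WWC software, and the sketch leaves out the WWC-specific details covered in the sections that follow.

```python
def bonferroni_count(p_values, m, alpha=0.05):
    """Count p-values at or below the single Bonferroni threshold alpha / m."""
    return sum(p <= alpha / m for p in p_values)


def bh_count(p_values, m, alpha=0.05):
    """Count findings kept by the BH rule: the largest rank i with p_(i) <= i * alpha / m."""
    k = 0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= i * alpha / m:
            k = i
    return k


p_values = [0.002, 0.009, 0.011, 0.014, 0.034, 0.041]
print(bonferroni_count(p_values, m=8))  # 1
print(bh_count(p_values, m=8))          # 4
```

With these values, Bonferroni retains only the smallest p-value, whereas the BH rule retains four findings, which illustrates why BH is described as less conservative.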

As is the case with clustering correction, the WWC applies the BH correction only to statistically significant findings, because nonsignificant findings will remain nonsignificant after correction. For findings based on analyses in which the unit of analysis was properly aligned with the unit of assignment, we use the p-values reported in the study for the BH correction. If the exact p-values were not available, but the ESs could be computed, we would convert the ESs to t-statistics and then obtain the corresponding p-values.15 For findings based on mismatched analyses, we first correct the author-reported p-values for clustering and then use the clustering-corrected p-values for the BH correction.
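The last step of that conversion, going from a t-statistic to a two-tailed p-value (the Excel call TDIST(t, df, 2) in footnote 15), can be sketched in Python using only the standard library. This is a rough numerical illustration under stated assumptions: the function names and the Simpson's-rule integration are choices made here, not WWC procedure, and in practice a statistics library would normally be used. The conversion from an ES to a t-statistic is not shown because its formula is not given in this appendix.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for a t-statistic, integrating the density on [0, |t|].

    Uses Simpson's rule, so `steps` must be even.
    """
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * t_pdf(j * h, df)
    central_area = total * h / 3  # approximates P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_area)


# Hypothetical finding: t = 2.0 from a properly aligned analysis with N = 62,
# so df = N - 2 = 60 as described in footnote 15.
print(two_tailed_p(2.0, 60))
```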

Although the BH correction procedure just described was originally developed under the assumption of independent test statistics (Benjamini & Hochberg, 1995), Benjamini and Yekutieli (2001) point out that it also applies to situations in which the test statistics have positive dependency, and that the condition for positive dependency is general enough to cover many problems of practical interest. For other forms of dependency, a modification of the original BH procedure could be made, which, however, is “very often not needed, and yields too conservative a procedure” (p. 1183).16 Therefore, the WWC has chosen to use the original BH procedure rather than its more conservative modified version as the default approach to correcting for multiple comparisons.
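To show how much more conservative the modified procedure is, the sketch below compares the rank-specific thresholds of the original BH procedure with those of the modified version described in footnote 16, in which a is divided by the sum of the inverses of the ranks. The function names are hypothetical, and treating m as the total number of comparisons in the family is an assumption based on the footnote.

```python
def harmonic_sum(m):
    """Sum of 1/j for j = 1..m, the divisor used by the modified procedure."""
    return sum(1.0 / j for j in range(1, m + 1))


def bh_thresholds(m, alpha=0.05):
    """Original BH thresholds i * alpha / m for ranks i = 1..m."""
    return [i * alpha / m for i in range(1, m + 1)]


def modified_bh_thresholds(m, alpha=0.05):
    """Thresholds with alpha replaced by alpha / harmonic_sum(m)."""
    adjusted_alpha = alpha / harmonic_sum(m)
    return [i * adjusted_alpha / m for i in range(1, m + 1)]


for original, modified in zip(bh_thresholds(8), modified_bh_thresholds(8)):
    print(f"{original:.5f}  {modified:.5f}")
```

For any family with more than one comparison, every modified threshold is smaller than its original counterpart, so the modified version can only declare the same or fewer findings significant.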

In the remainder of this section, we describe the specific procedures for applying the BH correction in three types of situations: studies that tested multiple outcome measures in the same outcome domain with a single comparison group, studies that tested a given outcome measure with multiple comparison groups, and studies that tested multiple outcome measures in the same outcome domain with multiple comparison groups.

A. Benjamini-Hochberg Correction of the Statistical Significance of Effects on Multiple Outcome Measures within the Same Outcome Domain Tested with a Single Comparison Group

The most straightforward situation that may require the BH correction occurs when the study authors assessed an intervention’s effect on multiple outcome measures within the same outcome domain using a single comparison group. For such studies, the review team needs to check first whether the study authors’ analyses already took into account multiple comparisons (for example, through a proper multivariate analysis). If so, obviously no further correction is necessary. If the authors did not address the multiple comparison problem in their analyses, then the review team will need to correct the statistical significance of the authors’ findings using the BH method. For studies that examined measures in multiple outcome domains, the BH correction will be applied to the set of findings within the same domain rather than across different domains. Assuming that the BH correction is needed, the review team will apply the BH correction to multiple findings within a given outcome domain tested with a single comparison group as follows:

Rank order statistically significant findings within the domain in ascending order of the p-values, such that p1≤ p2 ≤ p3 ≤ … ≤ pm, with m being the number of significant findings within the domain.

For each p-value (pi), compute:

pi’ = (i × a) / M

where i is the rank for pi, with i = 1, 2, … m; M is the total number of findings within the domain reported by the WWC; and a is the target level of statistical significance.

Note that the M in the denominator may be less than the number of outcomes that the study authors actually examined in their study for two reasons: (1) the authors may not have reported findings from the complete set of comparisons that they had made, and (2) certain outcomes assessed by the study authors may be deemed irrelevant to the WWC’s review. The target level of statistical significance, a, in the numerator allows us to identify findings that are significant at this level after correction for multiple comparisons. The WWC’s default value of a is 0.05, although other values of a could also be specified. If, for instance, a is set at 0.01 instead of 0.05, then the results of the BH correction would indicate which individual findings are statistically significant at the 0.01 level instead of the 0.05 level after taking multiple comparisons into account.

Identify the largest i—denoted by k—that satisfies the condition: pi ≤ pi’. This establishes the cutoff point, and allows us to conclude that all findings with p-values smaller than or equal to pk are statistically significant, and findings with p-values greater than pk are not significant at the prespecified level of significance (a = 0.05 by default) after correction for multiple comparisons.
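The three steps above can be sketched as a short Python function. This is a minimal illustration, not WWC software: the function name, argument names, and return shape are assumptions made here, and the example p-values are hypothetical. The input is the list of statistically significant p-values (the m findings), while the denominator is the total number of findings M in the domain.

```python
def bh_correction(significant_p_values, total_findings, alpha=0.05):
    """Apply the BH steps to the significant p-values in one domain.

    Returns (k, cutoff, flags): k is the largest rank i with
    p_i <= i * alpha / total_findings, cutoff is p_k (None when k == 0),
    and flags marks each input p-value as significant after correction.
    """
    ranked = sorted(significant_p_values)  # step 1: ascending order
    k = 0
    for i, p in enumerate(ranked, start=1):
        if p <= i * alpha / total_findings:  # step 2: compare with i * a / M
            k = i  # step 3: keep the largest rank that passes
    cutoff = ranked[k - 1] if k > 0 else None
    flags = [cutoff is not None and p <= cutoff for p in significant_p_values]
    return k, cutoff, flags


# Hypothetical domain: 3 significant findings out of M = 4 reported findings.
print(bh_correction([0.03, 0.013, 0.02], total_findings=4))
# (3, 0.03, [True, True, True])
```

In this hypothetical example the smallest p-value (0.013) exceeds its own threshold of 0.0125, yet it remains significant because the cutoff is set by the largest rank that meets the condition.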

Note that, unlike the clustering correction, which produces a new p-value for each corrected finding, the BH correction does not generate a new p-value for each finding; it indicates only whether the finding is significant at the prespecified level of statistical significance after the correction.

As an illustration, suppose a researcher compared the performance of the intervention group and the comparison group on eight measures in a given outcome domain, and reported six statistically significant effects and two nonsignificant effects based on properly aligned analyses. To correct the significance of the findings for multiple comparisons, we would first rank order the p-values of the six author-reported significant findings in the first column of Table D1, and list the p-value ranks in the second column. We then compute pi’ = i × a/M, with M = 8 and a = 0.05, and record the values in the third column. Next, we identify k, the largest i that meets the condition pi ≤ pi’. In this example, k = 4, and pk = 0.014. Thus, we can claim that the four findings associated with a p-value of 0.014 or smaller are statistically significant at the 0.05 level after correction for multiple comparisons. The other two findings, although reported as being statistically significant, are no longer significant after the correction.

Table D1. An Illustration of Applying the Benjamini-Hochberg Correction for Multiple Comparisons

Author-reported or clustering-corrected p-value (pi) | P-value rank (i) | pi’ = i × 0.05/8 | pi ≤ pi’? | Statistical significance after BH correction (a = .05)
0.002 | 1 | 0.006 | Yes | significant
0.009 | 2 | 0.013 | Yes | significant
0.011 | 3 | 0.019 | Yes | significant
0.014 | 4 | 0.025 | Yes | significant
0.034 | 5 | 0.031 | No | n.s.
0.041 | 6 | 0.038 | No | n.s.
Note: n.s.= not statistically significant.
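The rows of Table D1 can be regenerated with a few lines of Python, which also makes it easy to see the effect of choosing a different target level a. The helper name and the row layout are assumptions made for illustration. The thresholds are printed to five decimal places; the table shows the same values rounded to three.

```python
def bh_table(p_values, total_findings, alpha=0.05):
    """Build rows of (p, rank, threshold, meets_condition, significant)."""
    ranked = sorted(p_values)
    thresholds = [i * alpha / total_findings for i in range(1, len(ranked) + 1)]
    passing = [i for i, (p, th) in enumerate(zip(ranked, thresholds), start=1) if p <= th]
    k = max(passing, default=0)  # largest rank meeting the condition
    return [
        (p, i, th, p <= th, i <= k)
        for i, (p, th) in enumerate(zip(ranked, thresholds), start=1)
    ]


table_d1_p_values = [0.002, 0.009, 0.011, 0.014, 0.034, 0.041]

for p, rank, threshold, meets, significant in bh_table(table_d1_p_values, 8):
    label = "significant" if significant else "n.s."
    print(f"{p:.3f}  {rank}  {threshold:.5f}  {'Yes' if meets else 'No'}  {label}")

# Same findings with the target level set to 0.01 instead of 0.05.
stricter = bh_table(table_d1_p_values, 8, alpha=0.01)
print(sum(row[4] for row in stricter), "findings significant at a = 0.01")
```

With a = 0.01 the largest threshold is 6 × 0.01/8 = 0.0075 and even the smallest p-value (0.002) exceeds its own threshold of 0.00125, so none of the six findings would remain significant at that level.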

B. Benjamini-Hochberg Correction of the Statistical Significance of Effects on a Given Outcome Tested with Multiple Comparison Groups

The discussion in the previous section pertains to the multiple comparisons problem when the study authors tested multiple outcomes within the same domain with a single comparison group. Another type of multiple comparisons problem occurs when the study authors tested an intervention’s effect on a given outcome by comparing the intervention group with multiple comparison groups. The WWC’s recommendation for handling such studies is as follows:

  1. In consultation with the PI and the study authors if needed, the review team selects a single comparison group that best represents the “business as usual” condition or that is considered most relevant to the WWC’s review. Only findings based on comparisons between the intervention group and this particular comparison group would be included in the WWC’s review. Findings involving the other comparison groups would be ignored, and the multiplicity due to one intervention group being compared with multiple comparison groups would also be ignored.

  2. If the PI and the review team believe that it is appropriate to combine the multiple comparison groups, and if adequate data are available for deriving the means and SDs of the combined group, the team may present the findings based on comparisons of the intervention group and the combined comparison group instead of findings based on comparisons of the intervention group and each individual comparison group. The kind of multiplicity due to one intervention group being compared with multiple comparison groups would no longer be an issue in this approach.

    The PI and the review team may judge the appropriateness of combining multiple comparison groups by considering whether there was enough common ground among the different comparison groups to warrant such a combination and, particularly, whether the study authors themselves conducted combined analyses or indicated the appropriateness, or the lack thereof, of combined analyses. When the study authors did not conduct or suggest combined analyses, it is advisable for the review team to check with the study authors before combining the data from different comparison groups.

  3. If the PI and the review team believe that neither of these two options is appropriate for a particular study, and that findings from comparisons of the intervention group and each individual comparison group should be presented, they need to make sure that the findings presented in the WWC’s intervention report are corrected for multiplicity due to multiple comparison groups if necessary. The review team needs to check the study report or check with the study authors to determine whether the comparisons of the multiple groups were based on a proper statistical test that already took multiplicity into account (for example, Dunnett’s test [Dunnett, 1955], the Bonferroni method [Bonferroni, 1935], Scheffé’s test [Scheffé, 1953], or Tukey’s HSD test [Tukey, 1949]). If so, then there would be no need for further corrections. It is also advisable for the team to check with the study authors regarding the appropriateness of correcting its findings for multiplicity due to multiple comparison groups, as the authors might have theoretical or empirical concerns about considering the findings from comparisons of the intervention group and a given comparison group without consideration of other comparisons made within the same study. If the team decides that multiplicity correction is necessary, it will apply such correction using the BH method in the same way as it would apply the method to findings on multiple outcomes within the same domain tested with a single comparison group as described in the previous section.
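For the second option above, deriving the mean and SD of a combined comparison group from per-group summary statistics can be sketched as follows. This is standard algebra for pooling groups and is offered only as an illustration; this appendix does not specify the formula the WWC uses, and the function name and input layout are assumptions. The SDs are assumed to be sample SDs computed with an n − 1 denominator.

```python
import math


def combine_groups(groups):
    """Combine (n, mean, sd) summaries into one (n, mean, sd) summary.

    The combined variance adds the within-group sums of squares to the
    between-group sums of squares around the combined mean.
    """
    total_n = sum(n for n, _, _ in groups)
    combined_mean = sum(n * mean for n, mean, _ in groups) / total_n
    within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    between = sum(n * (mean - combined_mean) ** 2 for n, mean, _ in groups)
    combined_sd = math.sqrt((within + between) / (total_n - 1))
    return total_n, combined_mean, combined_sd


# Hypothetical comparison groups C1 and C2, given as (n, mean, sd).
print(combine_groups([(40, 51.0, 9.5), (35, 48.0, 10.2)]))
```

The result matches what would be obtained by computing the mean and sample SD directly from the pooled raw scores of all comparison groups.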

C. Benjamini-Hochberg Correction of the Statistical Significance of Effects on Multiple Outcome Measures in the Same Outcome Domain Tested with Multiple Comparison Groups

A more complicated multiple comparison problem arises when a study tested an intervention’s effect on multiple outcome measures in a given domain with multiple comparison groups. The multiplicity problem thus may originate from two sources. Assuming that both types of multiplicity need to be corrected, the review team will apply the BH correction in accordance with the following three scenarios:

Scenario 1: The study authors’ findings did not take into account either type of multiplicity.

In this case, the BH correction will be based on the total number of comparisons made. For example, if a study compared one intervention group with two comparison groups on five outcomes in the same domain without taking multiplicity into account, then the BH correction would be applied to the 10 individual findings based on a total of 10 comparisons.

Scenario 2: The study authors’ findings took into account the multiplicity due to multiple comparison groups but not the multiplicity due to multiple outcomes.

In some studies, the authors may have performed a proper multiple comparison test (for example, Dunnett’s test) on each individual outcome that took into account the multiplicity due to multiple comparison groups. For such studies, the WWC will need to correct only the findings for the multiplicity due to multiple outcomes. Specifically, separate BH corrections will be made to the findings based on comparisons involving different comparison groups. With two comparison groups, for instance, the review team would apply the BH correction to the two sets of findings separately—one set of findings (one finding for each outcome) for each comparison group.

Scenario 3: The study authors’ findings took into account the multiplicity due to multiple outcomes, but not the multiplicity due to multiple comparison groups.

Although this scenario may be relatively rare, it is possible that the study authors performed a proper multivariate test (for example, MANOVA or MANCOVA) to compare the intervention group with a given comparison group that took into account the multiplicity due to multiple outcomes and performed separate multivariate tests for different comparison groups. For such studies, the review team will need to correct only the findings for multiplicity due to multiple comparison groups. Specifically, separate BH corrections will be made to the findings on different outcomes. With five outcomes and two comparison groups, for instance, the review team will apply the BH correction to the five sets of findings separately—one set of findings (one finding for each comparison group) for each outcome measure.

The decision rules for the three scenarios described are summarized in Table D2.

Table D2. Decision Rules for Correcting the Significance Levels of Findings from Studies That had a Multiple Comparison Problem due to Multiple Outcomes in a Given Domain and/or Multiple Comparison Groups, by Scenario

Authors’ Analyses | Benjamini-Hochberg Correction

1. Did not correct for multiplicity from any source
   - BH correction to all 10 individual findings

2. Corrected for multiplicity due to multiple comparison groups only
   - BH correction to the 5 findings based on T vs. C1 comparisons
   - BH correction to the 5 findings based on T vs. C2 comparisons

3. Corrected for multiplicity due to multiple outcomes only
   - BH correction to the 2 findings based on T vs. C1 and T vs. C2 comparisons on O1
   - BH correction to the 2 findings based on T vs. C1 and T vs. C2 comparisons on O2
   - BH correction to the 2 findings based on T vs. C1 and T vs. C2 comparisons on O3
   - BH correction to the 2 findings based on T vs. C1 and T vs. C2 comparisons on O4
   - BH correction to the 2 findings based on T vs. C1 and T vs. C2 comparisons on O5

Note. T: treatment (intervention) group; C1 and C2: comparison groups 1 and 2; O1, O2, O3, O4, and O5: five outcome measures within a given outcome domain.
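The decision rules in Table D2 amount to choosing how the findings are grouped into families before the BH correction is applied to each family separately. The sketch below shows that grouping for the example of two comparison groups and five outcomes; the function name, the tuple layout, and the scenario codes are assumptions made for illustration.

```python
from collections import defaultdict


def bh_families(findings, scenario):
    """Group (comparison_group, outcome) findings into BH families.

    scenario 1: authors corrected for nothing, so all findings form one family.
    scenario 2: authors corrected for comparison groups, so one family per group.
    scenario 3: authors corrected for outcomes, so one family per outcome.
    """
    if scenario == 1:
        key = lambda finding: "all"
    elif scenario == 2:
        key = lambda finding: finding[0]
    elif scenario == 3:
        key = lambda finding: finding[1]
    else:
        raise ValueError("scenario must be 1, 2, or 3")
    families = defaultdict(list)
    for finding in findings:
        families[key(finding)].append(finding)
    return dict(families)


findings = [(c, o) for c in ("C1", "C2") for o in ("O1", "O2", "O3", "O4", "O5")]

for scenario in (1, 2, 3):
    sizes = {name: len(members) for name, members in bh_families(findings, scenario).items()}
    print(scenario, sizes)
```

Each family returned here would then be corrected on its own, using the number of findings in that family as the denominator.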

On a final note, although the BH corrections are applied in different ways to the individual study findings in different scenarios, such differences do not affect the way in which the intervention rating is determined. In all three scenarios in the previous example, the 10 findings would be presented in a single outcome domain, and the characterization of the intervention’s effects for this domain in this study would be based on the corrected statistical significance of each individual finding as well as the magnitude and statistical significance of the average effect size across the 10 individual findings within the domain.

15 The p-values corresponding to the t-statistics can either be looked up in a t-distribution table, or computed using the t-distribution function in Excel: p = TDIST(t, df, 2), where df is the degrees of freedom, or the total sample size minus 2 for findings from properly aligned analyses.
16 The modified version of the BH procedure replaces a with a divided by the sum of the inverses of the p-value ranks across the m comparisons, that is, a / (1 + 1/2 + … + 1/m).
