
What Works Clearinghouse


WWC Procedures and Standards Handbook
Version 2.0 – December 2008

Appendix B - Effect Size Computations

  A. Student-Level Analyses
    1. Continuous Outcomes — ES as Standardized Mean Difference (Hedges’s g)
    2. Continuous — ES Computation Based on Results from Student-Level t-Tests or ANOVA
    3. Continuous — ES Computation Based on Results from Student-Level ANCOVA
    4. Continuous — Difference-in-Differences Approach
    5. Dichotomous Outcomes
  B. Cluster-Level Analyses
    1. Computing Student-Level ESs for Studies with Cluster-Level Analyses
    2. Handling Studies with Cluster-Level Analyses if Student-Level ESs Cannot Be Computed
    3. ES Based on Results from HLM Analyses in Studies with Cluster-Level Assignment

Different types of effect size (ES) indices have been developed for different types of outcome measures, given their distinct statistical properties. The purpose of this appendix is to provide the rationale for the specific computations conducted by the WWC, as well as their underlying assumptions.

A. Student-Level Analyses

1. Continuous Outcomes — ES as Standardized Mean Difference (Hedges’s g)

For continuous outcomes, the WWC has adopted the most commonly used ES index—the standardized mean difference, which is defined as the difference between the mean outcome of the intervention group and the mean outcome of the comparison group divided by the pooled within-group standard deviation (SD) on that outcome measure. Given that the WWC generally focuses on student-level findings, the default SD used in ES computation is the student-level SD.

The basic formula for computing standardized mean difference is as follows:

Standardized mean difference = (X1 – X2) / Spooled

where X1 and X2 are the means of the outcome for the intervention group and the comparison group, respectively, and Spooled is the pooled within-group SD of the outcome at the student level. Formulaically,

Spooled = sqrt{[(n1-1)S1^2+(n2-1)S2^2]/(n1+n2-2)}

Standardized mean difference (g) = (X1 – X2)/sqrt{[(n1-1)S1^2+(n2-1)S2^2]/(n1+n2-2)}

where n1 and n2 are the student sample sizes, and S1 and S2 are the student-level SDs for the intervention group and the comparison group, respectively.

The ES index thus computed is referred to as Hedges’s g.10 This index, however, has been shown to be upwardly biased when the sample size is small. Therefore, we apply a simple correction for this bias developed by Hedges (1981), which produces an unbiased ES estimate by multiplying Hedges’s g by a factor of (1 - 3/[4N - 9]), with N being the total sample size. Unless otherwise noted, Hedges’s g corrected for small-sample bias is the default ES measure for continuous outcomes used in the WWC’s review.
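
To make the default computation concrete, the following minimal Python sketch implements the formulas above with the small-sample correction. It is an illustration only, not code from the WWC; the function name and arguments are ours.

    import math

    def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
        """Small-sample-corrected Hedges's g for a continuous outcome."""
        # Pooled within-group SD at the student level
        s_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                             / (n_t + n_c - 2))
        g = (mean_t - mean_c) / s_pooled  # uncorrected Hedges's g
        # Hedges (1981) small-sample correction: 1 - 3/(4N - 9)
        return g * (1 - 3 / (4 * (n_t + n_c) - 9))

For example, hedges_g(105.0, 100.0, 15.0, 14.0, 60, 60) returns approximately 0.34.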

In certain situations, however, the WWC may present study findings using ES measures other than Hedges’s g. If, for instance, the SD of the intervention group differs substantially from that of the comparison group, the principal investigators (PIs) and review teams may choose to use the SD of the comparison group instead of the pooled within-group SD as the denominator of the standardized mean difference and compute the ES as Glass’s Δ instead of Hedges’s g. The justification for doing so is that when the intervention and comparison groups have unequal variances, as occurs when the variance of the outcome is affected by the intervention, the comparison group variance is likely to be a better estimate of the population variance than is the pooled within-group variance (Cooper, 1998; Lipsey & Wilson, 2001). The WWC may also use Glass’s Δ, or other ES measures used by the study authors, to present study findings if there is not enough information available for computing Hedges’s g. These deviations from the default will be clearly documented in the WWC’s review process.

The sections to follow focus on the WWC’s default approach to computing student-level ESs for continuous outcomes. We describe procedures for computing Hedges’s g based on results from different types of statistical analysis most commonly encountered in the WWC reviews.


2. Continuous — ES Computation Based on Results from Student-Level t-Tests or ANOVA


For randomized controlled trials, study authors may assess an intervention’s effects based on student-level t-tests or analyses of variance (ANOVA) without adjustment for pretest or other covariates, assuming group equivalence on pre-intervention measures achieved through random assignment. If the study authors report posttest means, SDs, and sample sizes for both the intervention group and the comparison group, the computation of ESs is straightforward using the standard formula for Hedges’s g.

Where the study authors did not report the posttest mean, SD, or sample size for each study group, the WWC computes Hedges’s g based on t-test or ANOVA F-test results, if they were reported along with sample sizes for both the intervention group (n1) and the comparison group (n2). For ESs based on t-test results,

Hedges’s g = t * sqrt[(n1+n2)/n1n2]

For ESs based on ANOVA F-test results,

Hedges’s g = sqrt[F(n1+n2)/n1n2]
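
A sketch of these two conversions in Python follows; the helper names are ours, and the sign handling for the F-based formula is an assumption, since F alone carries no direction.

    import math

    def g_from_t(t, n1, n2):
        """Hedges's g from a student-level t-statistic."""
        return t * math.sqrt((n1 + n2) / (n1 * n2))

    def g_from_anova_f(f, n1, n2, sign=1):
        """Hedges's g from a one-way ANOVA F-statistic. F carries no sign,
        so set sign = +1 or -1 from the direction of the group means."""
        return sign * math.sqrt(f * (n1 + n2) / (n1 * n2))

The small-sample correction described earlier would then be applied to the result.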


3. Continuous — ES Computation Based on Results from Student-Level ANCOVA

Analysis of covariance (ANCOVA) is a commonly used analytic method for quasi-experimental designs. It assesses the effects of an intervention while controlling for important covariates, particularly pretests, which might confound the effects of the intervention. ANCOVA is also used to analyze data from randomized controlled trials so that greater statistical precision of parameter estimates can be achieved through covariate adjustment.

For study findings based on student-level ANCOVA, the WWC computes Hedges’s g as the covariate-adjusted mean difference divided by the unadjusted pooled within-group SD. The use of the adjusted mean difference as the numerator of the ES ensures that the ES estimate is adjusted for covariate differences between the intervention and comparison groups that might otherwise bias the result. The use of the unadjusted pooled within-group SD as the denominator of the ES allows comparisons of ES estimates across studies by using a common metric to standardize group mean differences—that is, the population SD as estimated by the unadjusted pooled within-group SD.

Specifically, when sample sizes, adjusted means, and unadjusted SDs of the posttest from an ANCOVA are available for both the intervention and the comparison groups, the WWC computes Hedges’s g as follows:

Hedges’s g = (X1’ – X2’)/sqrt{[(n1-1)S1^2+(n2-1)S2^2]/(n1+n2-2)}

where X1’ and X2’ are adjusted posttest means, n1 and n2 are the student sample sizes, and S1 and S2 are the student-level unadjusted posttest SDs for the intervention group and the comparison group, respectively.

A final note about ANCOVA-based ES computation is that Hedges’s g cannot be computed based on the F-statistic from an ANCOVA. Unlike the F-statistic from an ANOVA, which is based on unadjusted within-group variance, the F-statistic from an ANCOVA is based on covariate-adjusted within-group variance. Hedges’s g, however, requires the use of unadjusted within-group SD. Therefore, we cannot compute Hedges’s g with the F-statistic from an ANCOVA in the same way as we can compute it with the F-statistic from an ANOVA. If the pretest-posttest correlation is known, however, we can derive Hedges’s g from the ANCOVA F-statistic as follows:

Hedges’s g = sqrt[F(n1+n2)(1-r^2)/n1n2]

where r is the pretest-posttest correlation, and n1 and n2 are the sample sizes for the intervention group and the comparison group, respectively.
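
As an illustrative sketch of this conversion (our own helper, following the formula above; the sign is assumed to come from the direction of the adjusted means, which the F-statistic alone does not supply):

    import math

    def g_from_ancova_f(f, r, n1, n2, sign=1):
        """Hedges's g from an ANCOVA F-statistic when the pretest-posttest
        correlation r is known. The (1 - r**2) term backs out the covariate
        adjustment of the within-group variance."""
        return sign * math.sqrt(f * (n1 + n2) * (1 - r**2) / (n1 * n2))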


4. Continuous — Difference-in-Differences Approach

It is not uncommon for study authors to report unadjusted group means on both the pretest and the posttest, but not adjusted group means or adjusted group mean differences on the posttest. Absent information on the correlation between the pretest and the posttest, as is typically the case, the WWC’s default approach is to compute the numerator of the ES—the adjusted mean difference—as the difference between the pretest-posttest mean difference for the intervention group and the pretest-posttest mean difference for the comparison group. Specifically,

g = [(X1 – X1-pre) – (X2 – X2-pre)]/sqrt{[(n1-1)S1^2+(n2-1)S2^2]/(n1+n2-2)}

where X1 and X2 are unadjusted posttest means, X1-pre and X2-pre are unadjusted pretest means, n1 and n2 are the student sample sizes, and S1 and S2 are the student-level unadjusted posttest SDs for the intervention group and the comparison group, respectively.

This “difference-in-differences” approach to estimating an intervention’s effects while taking into account group differences on the pretest is not necessarily optimal, as it is likely to either overestimate or underestimate the adjusted group mean difference, depending on which group performed better on the pretest.11 Moreover, this approach does not provide a means for adjusting the statistical significance of the adjusted mean difference to reflect the covariance between the pretest and the posttest. Nevertheless, it yields a reasonable estimate of the adjusted group mean difference, which is equivalent to what would have been obtained from a commonly used alternative to the covariate adjustment-based approach to testing an intervention’s effect—the analysis of gain scores.

Another limitation of the “difference-in-differences” approach is that it assumes that the pretest and the posttest are the same test. Otherwise, the means on the two tests might not be comparable, and hence it might not be appropriate to compute the pretest-posttest difference for each group. When different pretests and posttests were used and only unadjusted means on the pretest and posttest were reported, the PIs will need to consult with the WWC Statistical, Technical, and Analysis Team to determine whether it is reasonable to use the difference-in-differences approach to compute the ESs.

The difference-in-differences approach presented earlier also assumes that the pretest-posttest correlation is unknown. In some areas of educational research, however, empirical data on the relationships between pretest and posttest may be available. If such data are dependable, the WWC PIs and the review team in a given topic area may choose to use the empirical relationship to estimate the adjusted group mean difference that is unavailable from the study report or study authors, rather than using the default difference-in-differences approach. The advantage of doing so is that if, indeed, the empirical relationship between pretest and posttest is dependable, the covariate-adjusted estimates of the intervention’s effects will be less biased than those based on the difference-in-differences (gain score) approach. If the PIs and review teams choose to compute ESs using an empirical pretest-posttest relationship, they will need to provide an explicit justification for their choice as well as evidence on the credibility of the empirical relationship. Computationally, if the pretest and posttest have a correlation of r, then

g = [(X1 – X2) – r(X1-pre – X2-pre)]/sqrt{[(n1-1)S1^2+(n2-1)S2^2]/(n1+n2-2)}
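
Both the default gain-score numerator and the correlation-adjusted numerator can be expressed in one sketch (an illustration under the definitions above, not WWC code): with r = 1.0 the function below reduces to the difference-in-differences estimate, and with an empirical r it implements the formula just given.

    import math

    def g_pretest_adjusted(post_t, post_c, pre_t, pre_c,
                           sd_t, sd_c, n_t, n_c, r=1.0):
        """Numerator: (X1 - X2) - r * (X1_pre - X2_pre); the default r = 1.0
        gives the difference-in-differences (gain score) estimate."""
        s_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                             / (n_t + n_c - 2))
        return ((post_t - post_c) - r * (pre_t - pre_c)) / s_pooled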


5. Dichotomous Outcomes

Although not as common as continuous outcomes, dichotomous outcomes are sometimes used in studies of educational interventions. Examples include dropping out versus staying in school, grade promotion versus retention, and passing versus failing a test. Group mean differences, in this case, appear as differences in proportions or differences in the probability of the occurrence of an event. The ES measure of choice for dichotomous outcomes is the odds ratio, which has many statistical and practical advantages over alternative ES measures such as the difference between two probabilities, the ratio of two probabilities, and the phi coefficient (Fleiss, 1994; Lipsey & Wilson, 2001).

The measure of odds ratio builds on the notion of odds. For a given study group, the odds for the occurrence of an event are defined as follows:

Odds = p/(1-p)

where p is the probability of the occurrence of an event within the group. The odds ratio (OR) is simply the ratio between the odds for the two groups compared:

OR = Odds1/Odds2 = [p1(1-p2)]/[p2(1-p1)]

where p1 and p2 are the probabilities of the occurrence of an event for the intervention group and the comparison group, respectively.

As is the case with ES computation for continuous outcomes, the WWC computes ESs for dichotomous outcomes based on student-level data in preference to aggregate-level data for studies that have a multilevel data structure. The probabilities (p1 and p2) used in calculating the odds ratio represent the proportions of students demonstrating a certain outcome among all students across teachers/classrooms or schools in each study condition; these are likely to differ from probabilities based on aggregate-level data (for example, means of school-specific probabilities) unless the classrooms or schools in the sample were of similar sizes.

Following conventional practice, the WWC transforms the odds ratio to the logged odds ratio (LOR), that is, the natural log of the odds ratio, to simplify statistical analyses:

LOR = ln(OR)

The logged odds ratio has a convenient distributional form: it is approximately normal with a mean of 0 and an SD of pi/sqrt(3), or about 1.81.

The logged odds ratio can also be expressed as the difference between the logged odds, or logits, for the two groups compared:

LOR = ln(Odds1) – ln(Odds2)

which shows more clearly the connection between the logged odds ratio index and the standardized mean difference index (Hedges’s g) for ESs. To make the logged odds ratio comparable to the standardized mean difference and thus facilitate the synthesis of research findings based on different types of outcomes, researchers have proposed a variety of methods for “standardizing” the logged odds ratio. Based on a Monte Carlo simulation study of seven different types of ES indices for dichotomous outcomes, Sánchez-Meca, Marín-Martínez, and Chacón-Moscoso (2003) concluded that the ES index proposed by Cox (1970) is the least biased estimator of the population standardized mean difference, assuming an underlying normal distribution of the outcome. The WWC, therefore, has adopted the Cox index as the default ES measure for dichotomous outcomes. The computation of the Cox index is straightforward:

LOR_Cox = LOR/1.65

The preceding index yields ES values very similar to the values of Hedges’s g that one would obtain if group means, SDs, and sample sizes were available—assuming that the dichotomous outcome measure is based on an underlying normal distribution. Although the assumption may not always hold, as Sánchez-Meca and his colleagues (2003) note, primary studies in the social and behavioral sciences routinely apply parametric statistical tests that imply normality. Therefore, the assumption of an underlying normal distribution is a reasonable conventional default.
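
For illustration, a minimal sketch of the Cox index computed directly from the two proportions (our own helper, following the formulas above):

    import math

    def cox_index(p1, p2):
        """Cox index: logged odds ratio divided by 1.65."""
        log_odds_ratio = math.log((p1 * (1 - p2)) / (p2 * (1 - p1)))
        return log_odds_ratio / 1.65

For example, with p1 = 0.75 and p2 = 0.60 the odds ratio is 2.0, so cox_index(0.75, 0.60) returns ln(2)/1.65, or about 0.42.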


B. Cluster-Level Analyses

All the ES computation methods described earlier are based on student-level analyses, which are appropriate analytic approaches for studies with student-level assignment. The case is more complicated, however, for studies with assignment at the cluster level (for example, assignment of teachers, classrooms, or schools to conditions), in which data may have been analyzed at the student or the cluster level or through multilevel analyses. Although there has been a consensus in the field that multilevel analysis should be used to analyze clustered data (for example, Bloom, Bos, & Lee, 1999; Donner & Klar, 2000; Flay & Collins, 2005; Murray, 1998; Snijders & Bosker, 1999), student-level analyses and cluster-level analyses of such data still frequently appear in the research literature despite their problems.

The main problem with student-level analyses in studies with cluster-level assignment is that they violate the assumption of independence of observations underlying traditional hypothesis tests, resulting in underestimated standard errors and inflated statistical significance (see Appendix D for details about how to correct for this bias). The estimate of the group mean difference in such analyses, however, is unbiased and can therefore be used appropriately to compute the student-level ES using the methods explained in the previous sections.

For studies with cluster-level assignment, analyses at the cluster level, or aggregated analyses, are also problematic. Beyond the loss of power and increased Type II error, potential problems with aggregated analysis include shift of meaning and the ecological fallacy (that is, relationships between aggregated variables cannot be used to make assertions about relationships between individual-level variables), among others (Aitkin & Longford, 1986; Snijders & Bosker, 1999). Such analyses also pose special challenges to ES computation during WWC reviews. In the remainder of this section, we discuss these challenges and describe the WWC’s approach to handling them.


1. Computing Student-Level ESs for Studies with Cluster-Level Analyses

For studies that reported findings from only cluster-level analyses, it might be tempting to compute ESs using cluster-level means and SDs. This, however, is not appropriate for the purpose of the WWC reviews for at least two reasons. First, because cluster-level SDs are typically much smaller than student-level SDs,12 ESs based on cluster-level SDs will be much larger than and, therefore, incomparable with student-level ESs that are the focus of WWC reviews. Second, the criterion for “substantively important” effects in the WWC Intervention Rating Scheme (ES of at least 0.25) was established specifically for student-level ESs and does not apply to cluster-level ESs. Moreover, there is not enough knowledge in the field as yet for judging the magnitude of cluster-level effects. A criterion of “substantively important” effects for cluster-level ESs, therefore, cannot be developed for intervention rating purposes. An intervention rating of potentially positive effects based on a cluster-level ES of 0.25 or greater (that is, the criterion for student-level ESs) would be misleading.

In order to compute student-level ESs, we need the student-level means and SDs for the findings. This information, however, is often not reported in studies with cluster-level analyses. If the study authors cannot provide student-level means, the review team may use cluster-level means (that is, the mean of cluster means) to compute the group mean difference for the numerator of student-level ESs if (1) the clusters were of equal or similar sizes, (2) the cluster means were similar across clusters, or (3) it is reasonable to assume that cluster size was unrelated to cluster means. If any of these conditions holds, group means based on cluster-level data would be similar to group means based on student-level data and, hence, could be used for computing student-level ESs. If none of these conditions holds, however, the review team would have to obtain group means based on student-level data in order to compute the student-level ESs.

Although it is possible to compute the numerator (that is, the group mean difference) for student-level ESs based on cluster-level findings for most studies, it is generally much less feasible to compute the denominator (that is, pooled SD) for student-level ESs based on cluster-level data. If the student-level SDs are not available, we could compute them based on the cluster-level SDs and the actual intra-class correlation (ICC) (student-level SD = [cluster-level SD]/sqrt[ICC]). Unfortunately, the actual ICCs for the data observed are rarely provided in study reports. Without knowledge about the actual ICC, one might consider using a default ICC, which, however, is not appropriate, because the resulting ES estimate would be highly sensitive to the value of the default ICC and might be seriously biased even if the difference between the default ICC and the actual ICC is not large.

Another reason that the formula for deriving student-level SDs (student-level SD = [cluster-level SD]/sqrt[ICC]) is unlikely to be useful is that the cluster-level SD required for the computation was often not reported either. Note that the cluster-level SD associated with the ICC is not exactly the same as the observed SD of cluster means often reported in studies with cluster-level analyses, because the latter reflects not only the true cluster-level variance but also part of the random variance within clusters (Raudenbush & Liu, 2000; Snijders & Bosker, 1999).
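
In the rare case where both the actual ICC and the true cluster-level SD are known, the conversion is a one-liner; the sketch below (ours, with made-up numbers) is offered only to fix the algebra, given the caveats just described.

    import math

    def student_sd_from_cluster_sd(cluster_sd, icc):
        """student-level SD = cluster-level SD / sqrt(ICC). Requires the true
        cluster-level SD (not the observed SD of cluster means) and the
        actual ICC, which are rarely both reported."""
        return cluster_sd / math.sqrt(icc)

    # Hypothetical example: a true cluster-level SD of 4.0 and an actual
    # ICC of 0.20 imply a student-level SD of 4.0 / sqrt(0.20), about 8.94.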

It is clear from this discussion that, in most cases, requesting student-level data, particularly student-level SDs, from the study authors is the only way to compute student-level ESs for studies reporting only cluster-level findings. If the study authors cannot provide the student-level data needed, we will not be able to compute the student-level ESs. Nevertheless, such studies are not automatically excluded from the WWC reviews; they could still contribute to intervention ratings, as explained in the next section.


2. Handling Studies with Cluster-Level Analyses if Student-Level ESs Cannot Be Computed

A study’s contribution to the effectiveness rating of an intervention depends mainly on three factors: (1) the quality of the study design, (2) the statistical significance of the findings, and (3) the effect size(s). For studies that report only cluster-level findings, the quality of their designs is not affected by whether student-level ESs could be computed. Such studies could still meet WWC evidence standards with or without reservations and be included in intervention reports even if student-level ESs were not available.

Although cluster-level ESs cannot be used in intervention ratings, the statistical significance of cluster-level findings could contribute to intervention ratings. Cluster-level analyses tend to be underpowered; hence, estimates of the statistical significance of findings from such analyses tend to be conservative. Therefore, significant findings from cluster-level analyses would remain significant had the data been analyzed using appropriate multilevel models, and they should be taken into account in intervention ratings. The size of the effects based on cluster-level analyses, however, could not be considered in determining “substantively important” effects in intervention ratings for the reasons described earlier. In WWC’s intervention reports, cluster-level ESs are excluded from the computation of domain average ESs and improvement indices, both of which are based exclusively on student-level findings.


3. ES Based on Results from HLM Analyses in Studies with Cluster-Level Assignment

As explained in the previous section, multilevel analysis is generally considered the preferred method for analyzing data from studies with cluster-level assignment. With recent methodological advances, multilevel analysis has gained increased popularity in education and other social science fields. More and more researchers have begun to employ the hierarchical linear modeling (HLM) method to analyze data of a nested nature (for example, students nested within classes and classes nested within schools) (Raudenbush & Bryk, 2002).13 Similar to student-level ANCOVA, HLM can adjust for important covariates such as the pretest when estimating an intervention’s effect. Unlike student-level ANCOVA, which assumes independence of observations, HLM explicitly takes into account the dependence among members within the same higher-level unit (for example, the dependence among students within the same class). Therefore, the parameter estimates, particularly the standard errors, generated from HLM are less biased than those generated from ANCOVA when the data have a multilevel structure.

Hedges’s g for intervention effects estimated from HLM analyses is defined in a similar way to that based on student-level ANCOVA: the adjusted group mean difference divided by the unadjusted pooled within-group SD. Specifically,

g = γ/sqrt{[(n1-1)S1^2+(n2-1)S2^2]/(n1+n2-2)}

where γ is the HLM coefficient for the intervention’s effect, which represents the group mean difference adjusted for both level-1 and level-2 covariates, if any; n1 and n2 are the student sample sizes, and S1 and S2 are the posttest student-level SDs for the intervention group and the comparison group, respectively.14

One thing to note about the denominator of Hedges’s g based on HLM results is that the level-1 variance, also called “within-group variance,” estimated from a typical two-level HLM analysis is not the same as the conventional unadjusted pooled within-group variance that should be used in ES computation. The within-group variance from an HLM model that incorporates level-1 covariates has been adjusted for these covariates. Even if the within-group variance is based on an HLM model that does not contain any covariates (that is, a fully unconditional model), it is still not appropriate for ES computation, because it does not include the variance between level-2 units within each study condition that is part of the unadjusted pooled within-group variance. Therefore, the level-1 within-group variance estimated from an HLM analysis tends to be smaller than the conventional unadjusted pooled within-group variance, and it would thus lead to an overestimate of the ES if used in the denominator of the ES.
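Putting the pieces together, here is a minimal sketch of the HLM-based computation (illustrative only; gamma and the unadjusted student-level SDs are assumed to be available from the study):

    import math

    def g_from_hlm(gamma, sd_t, sd_c, n_t, n_c):
        """Hedges's g from an HLM treatment coefficient (gamma), standardized
        by the unadjusted pooled within-group student-level SD rather than
        the HLM level-1 variance, which is covariate-adjusted and omits the
        between-cluster component (see the caution above)."""
        s_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                             / (n_t + n_c - 2))
        return gamma / s_pooled
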

The ES computations explained here pertain to individual findings within a given outcome domain examined in a given study. If the study authors assessed the intervention’s effects on multiple outcome measures within a given domain, the WWC computes a domain average ES as a simple average of the ESs across all individual findings within the domain.

10 The Hedges’s g index differs from Cohen’s d in that Hedges’s g uses the square root of the degrees of freedom (sqrt[N - k] for k groups) in the denominator of the pooled within-group SD (Spooled), whereas Cohen’s d uses the square root of the sample size (sqrt[N]) to compute Spooled (Rosenthal, 1994; Rosnow, Rosenthal, & Rubin, 2000).
11 If the intervention group had a higher average pretest score than the comparison group, the difference-in-differences approach is likely to underestimate the adjusted group mean difference. If the opposite occurs, it is likely to overestimate the adjusted group mean difference.
12 Cluster-level SD = (student-level SD)*sqrt(ICC).
13 Multilevel analysis can also be conducted using other approaches, such as the SAS PROC MIXED procedure. Although the various approaches to multilevel analysis may differ in their technical details, they are all based on similar ideas and underlying assumptions.
14 The level-2 coefficients are adjusted for the level-1 covariates under the condition that the level-1 covariates are either uncentered or grand-mean centered, which are the most common centering options in an HLM analysis (Raudenbush & Bryk, 2002). The level-2 coefficients are not adjusted for the level-1 covariates if the level-1 covariates are group-mean centered. For simplicity, the discussion here is based on a two-level framework (that is, students nested within clusters). The idea could easily be extended to a three-level model (for example, students nested within teachers who were in turn nested within schools).
