Chapter 4: Evaluation of Health Care Efficiency Measures
In this section we present criteria for evaluating health care efficiency measures, and discuss
to what degree existing measures meet these criteria. Our original intention had been to rate
each identified measure on the evaluation criteria, but this proved neither feasible nor
meaningful because the available evidence is so sparse.
Therefore, we present our evaluation
criteria, and then discuss in more general terms the strengths and limitations of available measures
in terms of these criteria. We conclude with a discussion of potential next steps.
We suggest that measures of health care efficiency be evaluated using the same framework as
measures of quality:
- Important—is the measure assessing an aspect of efficiency that is important to
providers, payers, and policymakers? Has the measure been applied at the level of
interest to those planning to use the measure? Is there an opportunity for improvement?
Is the measure under the control of the provider or health system?
- Scientifically sound—can the measure be assessed reliably and reproducibly? Does the
measure appear to capture the concept of interest? Is there evidence of construct or
predictive validity?
- Feasible—are the data necessary to construct this measure available? Are the cost and
burden of measurement reasonable?
- Actionable—are the results interpretable? Can the intended audience use the information
to make decisions or take action?
The ideal set of measures would cover all of the major aspects of efficiency identified in the
typology of efficiency measures presented above; would have evidence that they can be
measured reliably by different analysts using the same methods, that higher scores are observed
in providers that are judged by other means to be more efficient than providers receiving lower
scores, and that higher scores are observed for providers after they have successfully
implemented changes designed to improve efficiency; and could be calculated using existing
data.
This ideal set does not exist, and therefore the selection of measures will involve tradeoffs
between these desirable criteria (important, valid, feasible, actionable).
Important
Although the “importance” of measures abstracted from peer-reviewed literature is difficult
to assess, it seems that a majority of efficiency measures published in the peer-reviewed
literature have not been adopted by providers, payers, and policymakers.
One aspect of
efficiency that is important to stakeholders is the relative efficiency of various providers, health
plans, or other units of the health system. Many of the articles reviewed did not explicitly report
comparisons of the efficiency of the providers or other units of analysis studied.
Only 31 of 158
articles reported such a comparison. The other 127 articles reported efficiency at a grouped
level, often studying the effect of one or more factors on group efficiency.
For example, an
article might compare the relative efficiency of non-profit versus for-profit hospitals.
This type of analysis could potentially be used to answer another question of importance to
stakeholders—how can efficiency be improved? Although many articles studied factors that
were found to influence efficiency, it was unclear if any findings of factors associated with
improved efficiency were strong enough to influence policy.
At the same time, the utility of
existing efficiency measures for policy has been questioned, most explicitly by Newhouse.20
The vendor-developed measures that are most commonly used differ substantially from measures
reported in the peer-reviewed literature, suggesting that stakeholders found the measures
developed in the academic world inadequate for answering the questions most important to them.
We note, however, that many of the vendor-developed measures are based on methods originally
developed in the academic world (e.g., Adjusted Clinical Groups). The measures developed in
the academic world are more complex to implement than vendor-developed measures.
These
measures often present and test sophisticated statistical or mathematical approaches for
constructing a multi-input, multi-output efficiency frontier, but focus relatively little on the
specification of inputs and outputs, often using whatever variables are readily available in
existing data sources.
In contrast, the vendor-developed measures often include a more complex
specification of the outputs used, such as episodes of care. It is not clear that one approach is
necessarily superior to the other. A critical question in evaluating importance of a measure is
whether it satisfies the intended use.
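To make the distinction concrete, a frontier-style score can be sketched in miniature for the single-input, single-output case: each provider's output-to-input ratio is compared with the best observed ratio. Real DEA models handle multiple inputs and outputs via linear programming; the providers, units, and figures here are purely hypothetical.

```python
# Minimal sketch of a frontier-style efficiency score with one input and
# one output per provider. Full DEA handles multiple inputs and outputs;
# all data below are hypothetical.

def ratio_efficiency(providers):
    """Score each provider's output/input ratio against the best observed ratio."""
    ratios = {name: out / inp for name, (inp, out) in providers.items()}
    best = max(ratios.values())
    return {name: r / best for name, r in ratios.items()}

# Hypothetical units: input = staffed bed-days, output = discharges.
providers = {
    "Hospital A": (1000, 500),   # 0.50 discharges per bed-day
    "Hospital B": (1200, 480),   # 0.40
    "Hospital C": (800, 480),    # 0.60 -> defines the frontier
}
scores = ratio_efficiency(providers)
```

A provider on the frontier scores 1.0; the others are scored relative to it. The sketch also shows why input and output specification matters: change what counts as the output and the frontier can move.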
The vendor-developed measures seem to reflect areas of importance to payers, purchasers,
and providers based on how they have been used. The measures have been used by payers and
purchasers to profile providers for inclusion in their networks.
In addition, a number of these
measures are currently under consideration for various pay-for-performance initiatives. These
measures assess efficiency both at the organizational level (e.g., hospitals or medical groups) and
at the individual physician level.
They offer both a global perspective on the drivers of total
costs and resource utilization and drilled-down specifics for individual clinical areas and
providers. In this respect, efficiency measures commonly used by health plans and purchasers
respond to the perceived needs in the market.
One area of importance that is poorly reflected by existing measures is social efficiency.
Despite widespread acceptance that the allocation of resources in the current health care system
is very inefficient, there appear to be no accepted measures of efficiency in this important area.
Scientifically Sound
Very little research on the reliability and validity of efficiency measures has been published
to date. This includes measures developed by vendors as well as those published in the peer-reviewed
literature.
Of the 158 peer-reviewed articles found containing efficiency measures,
only three reported any evidence of the validity of the measures and one reported evidence of
reliability. It was slightly more common for articles to test the specifications of SFA or other
regression models or DEA models using sensitivity analyses; 59 of 137 measures using DEA,
SFA, or other regression-based approaches reported the results of sensitivity analyses.
Vendors
typically supply tools (e.g., methods for aggregating claims to construct episodes of care or
methods for aggregating the costs of care for a population) from which measures can be
constructed; thus, the assessment of scientific soundness requires an evaluation of the application
as well as the underlying tools.
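As a rough illustration of the episode-construction step such tools perform, the toy sketch below chains a patient's claims for the same condition into one episode whenever consecutive claims fall within a fixed "clean period." The field names, the 30-day gap, and the claims themselves are illustrative assumptions, not any vendor's actual algorithm.

```python
# Toy sketch of grouping claims into episodes of care. Claims for the same
# patient and condition are merged into one episode while each claim falls
# within a clean period (gap) of the previous one. The 30-day gap and all
# data are illustrative assumptions, not a vendor's method.

GAP_DAYS = 30

def group_episodes(claims):
    """claims: list of (patient, condition, day, cost); day is an integer offset."""
    episodes = []
    for patient, condition, day, cost in sorted(claims, key=lambda c: (c[0], c[1], c[2])):
        last = episodes[-1] if episodes else None
        if (last and last["patient"] == patient and last["condition"] == condition
                and day - last["end"] <= GAP_DAYS):
            last["end"] = day          # extend the open episode
            last["cost"] += cost
        else:                          # gap exceeded or new patient/condition
            episodes.append({"patient": patient, "condition": condition,
                             "start": day, "end": day, "cost": cost})
    return episodes

claims = [
    ("p1", "CAD", 0, 100.0),
    ("p1", "CAD", 10, 50.0),   # within 30 days -> same episode
    ("p1", "CAD", 90, 75.0),   # gap > 30 days -> new episode
    ("p2", "CAD", 5, 200.0),
]
episodes = group_episodes(claims)
```

Even in this toy version, the episode count and episode costs depend directly on the gap parameter, which is one reason the application, not just the underlying tool, must be evaluated.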
Several studies have examined some of the measurement properties of vendor-developed
measures, but the amount of evidence available is still limited at this time. Thomas, Grazier, and
Ward58 tested the consistency of 6 groupers (some episode-based and some population-based) for
measuring the efficiency of primary care physicians.
They found “moderate to high” agreement
between physician efficiency rankings using the various measures (weighted kappa = .51 to .73).
Thomas and Ward59 tested the sensitivity of measures of specialist physician efficiency to
episode attribution methodology and cost outlier methodology.
Thomas60 also tested the effect
of risk adjustment on an ETG-based efficiency measure. He found that episode risk scores were
generally unrelated to costs and concluded that risk adjustment of ETG-based efficiency
measures may be unnecessary.
MedPAC61 compared episode-based measures and population-based
measures for area-level analyses and found that they can produce different results. For
example, Miami was found to have lower average per-episode costs for coronary artery disease
episodes than Minneapolis but higher average per-capita costs due to lower episode volume.
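The arithmetic behind this apparent paradox is easy to reproduce: with hypothetical figures (not MedPAC's actual data), an area can post a lower cost per episode yet a higher cost per capita simply because it generates more episodes per resident.

```python
# Worked illustration of how an area can have lower per-episode costs but
# higher per-capita costs when episode volume is higher. All figures are
# hypothetical, not MedPAC's actual data.

def per_episode_and_per_capita(total_cost, episodes, population):
    return total_cost / episodes, total_cost / population

# Area X: cheaper episodes, but more of them per resident.
x_per_episode, x_per_capita = per_episode_and_per_capita(8_000_000, 1000, 100_000)
# Area Y: costlier episodes, but fewer of them per resident.
y_per_episode, y_per_capita = per_episode_and_per_capita(5_000_000, 500, 100_000)
```

Area X looks more efficient on the episode-based measure and less efficient on the population-based measure, so the two measure types can legitimately rank the same areas differently.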
The lack of testing of the scientific soundness of efficiency measures reflects in part the
pressure to develop tools that can be used quickly and implemented with relative ease. One
major measurement problem in efficiency measures is the difficulty in observing the full range of
outputs a hospital, physician, or other unit produces.
As described in the results section, many
measures capture the quantity of health care delivered, but very few are able to capture the
quality or outcomes of this care. Most measures are not able to capture the full range of
quantities of interest. As we would expect, most measures are based on quantities that are
readily observable in existing datasets: hospital days, discharges, physician hours, etc.
In some
cases it is questionable whether these variables actually “proxy” for the real quantities of interest.
For example, in some studies the number of beds is used as a proxy for
capital, with no further evidence presented on the correlation between the two.
A second area that concerns validity is the specification of the econometric models
underlying the measures. The literature shows a wide variation here, with some articles
estimating just one single model, and others estimating a whole range of models using various
combinations of inputs, outputs, and methods.
At a minimum, authors have made some very
basic assumptions about the existence and nature of a random component to outputs. It has been
shown that efficiency ratings can be very sensitive to the model chosen.62 When there are
conflicting results under different models, it is often not obvious which model and results are
preferable.
A third area of potential assessment is the reliability and validity of efficiency measures
when implemented in different administrative data sets. This becomes particularly challenging
when data sets are aggregated or when data from different entities (e.g., health plans, hospitals)
are compared for evaluative purposes.
Data sets from multiple insurers may need to be
aggregated for the purposes of developing larger samples of patients. Some of the key
challenges include: the effect of benefit design differences, the impact of different methods of
paying physicians, use of local codes, differential use of carve-out or contracted providers, missing
data, and so on.
Administrative/billing data are the most common source of information for
constructing efficiency measures but users should be aware of the threats to validity when
comparing different entities.
A fourth area is whether the measures take into account and adjust for both case mix (i.e., the
nature and volume of the types of patients being treated) and risks (i.e., severity of illness of the
patients), such as other co-morbidities.
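One common form of such adjustment, used here only as an illustrative sketch, is an observed-to-expected (O/E) cost ratio, in which each patient's cost is benchmarked against the average cost for patients in the same risk category; the categories and costs below are hypothetical.

```python
# Sketch of a simple observed-to-expected (O/E) cost ratio as one common
# way to adjust an efficiency measure for case mix. Risk categories and
# all dollar figures are hypothetical.

def oe_ratio(patients, expected_by_risk):
    """patients: list of (risk_category, observed_cost) for one provider."""
    observed = sum(cost for _, cost in patients)
    expected = sum(expected_by_risk[risk] for risk, _ in patients)
    return observed / expected

expected_by_risk = {"low": 1000.0, "high": 5000.0}
panel = [("low", 900.0), ("low", 1100.0), ("high", 4500.0)]
ratio = oe_ratio(panel, expected_by_risk)   # < 1 means cheaper than expected
```

A ratio below 1 suggests the provider's panel costs less than its case mix would predict; the validity of the comparison, of course, rests entirely on how well the risk categories capture severity.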
A final area revolves around the implicit assumptions about the comparability of the outputs
measured, particularly with regard to quality of care. While most users of efficiency measures
are likely to use separate methods for evaluating quality, the methodological work to link these
two constructs has not been done.
In the absence of explicit approaches to measuring quality, the
efficiency measures assume that the quality of the output is equivalent. In most cases this
assumption is likely not valid.
Feasible
Since most of the efficiency measures abstracted in the literature review are based on existing
public-use data sources, they could feasibly be reconstructed. Most articles appeared to specify
the best possible measure given the limitations of existing public-use data, rather than collect or
compile data sets to construct the best possible measure.
That is, the measures in the peer-reviewed
literature generally seemed primarily shaped by feasibility and secondarily by
scientific soundness.
All of the efficiency measures identified through the grey literature also rely on existing data
(e.g., insurance claims). Most of the efficiency measures identified through the grey literature
have been developed by vendors with feasibility of use by their clients in mind.
However, most
vendor-developed measures are proprietary, and therefore may impose cost barriers during
implementation. In fact, one of the stakeholders interviewed specifically cited the cost of
purchasing a vendor-developed product as one of the primary reasons their organization
created its own efficiency measure.
Existing public-use data sets available for research use may pose several difficulties for the
specification of scientifically sound, important efficiency measures, however. For example, it
may be difficult to assign responsibility for measures to specific providers based on claims, or it
may be difficult to group claims into episodes or other units.
MedPAC has tested the feasibility of using episode-based efficiency measures in the
Medicare program. They tested MEG-based and ETG-based measures using 100% Medicare claims
files for 6 geographic areas.
They found that most Medicare claims could be assigned to
episodes, most episodes could be assigned to physicians, and outlier physicians could be
identified, although each of these processes is sensitive to the criteria used.
The percentage of claims that
can be assigned to episodes and the percentage of episodes that can be assigned to physicians
were consistent between the 2 measures.
Actionable
Stakeholders are using efficiency measures for a variety of applications including internal
quality improvement, pay-for-performance, public reporting, and construction of product lines
that include differential copayments (tiering) for different providers.
Each of these applications
requires that the results of the measures be transmitted in a way that facilitates both
understanding and appropriate action on the part of the target audience (actionability).
However,
relatively little research has been done to understand the ability of different audiences to interpret
and use the information. Two examples are provided here based on interviews with
stakeholders.
- Flexible pricing—measures should be flexible to allow plans or groups to add their own
pricing information if the measure was originally constructed using standardized prices.
In many cases, standardized prices are used instead of the actual prices paid. This
approach eliminates differences in prices paid by different providers, which providers
often argue are not under their control. Insurers or provider groups may also favor
standardized pricing so that they do not reveal the prices they have negotiated with
suppliers. However, some users may wish to apply actual prices for certain applications
and desire this flexibility.
- Clinical relevance—measures need to provide actionable information to guide
improvements in clinical practice. Measures cannot be a “black box” of statistics that
lack transparency.
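The flexible-pricing point can be illustrated with a minimal sketch: if a measure prices claims off a fee schedule, swapping the standardized schedule for a plan's actual negotiated prices re-expresses the same utilization in actual dollars. The service codes and prices below are hypothetical.

```python
# Sketch of the flexible-pricing idea: the same claims can be priced off a
# standardized fee schedule or a plan's actual negotiated prices simply by
# swapping the schedule. Service codes and prices are hypothetical.

def total_cost(claims, fee_schedule):
    """claims: list of (service_code, units); price each unit off the schedule."""
    return sum(units * fee_schedule[code] for code, units in claims)

claims = [("office_visit", 3), ("mri", 1)]
standardized = {"office_visit": 100.0, "mri": 800.0}
negotiated = {"office_visit": 120.0, "mri": 650.0}

std_cost = total_cost(claims, standardized)    # 1100.0
actual_cost = total_cost(claims, negotiated)   # 1010.0
```

Standardized pricing isolates differences in utilization from differences in negotiated prices; re-running the same claims with actual prices recovers the dollar impact when an application needs it.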
Application of Efficiency Measures
Table 10 presents a matrix framework for evaluation of efficiency measures based on their
applications and their importance, scientific soundness, and feasibility. The columns are ordered
to reflect the hierarchy of decisionmaking about measures:
- Important—if it is not important, why go any further?
- Scientifically sound—if it is important but not sound, then one cannot have confidence in
the data.
- Feasible—if it is important and scientifically sound, is it feasible to implement this
measure?
- Actionable—if it is important, scientifically sound, and feasible, can the target audiences
understand and act on the information provided?
Reflecting this hierarchy, these four domains are listed from left to right in the columns of
the evaluation framework presented in Table 10.
Some applications of measures have a stronger requirement for the availability of rigorous
information in these four domains than others because of a greater possibility of unintended
consequences.
The rows of Table 10 are ordered to reflect the increasing need for rigor across
all four domains. When using a measure for provider network selection or tiered copayments in
a health plan, it is more important to ensure that the measure is scientifically sound, actionable,
etc., due to the potential effects on provider payment, patient choice, and other potential
unintended consequences.
In contrast, using a measure for internal review and improvement or
research has less potential for unintended consequences and thus carries less stringent
requirements for information on measure properties while measures are still being evaluated. As
measures are tested in these applications, further information on their properties will be available
that can be used to assess their appropriateness in other applications.
For example, if a new
measure is developed that assesses physician efficiency, it should first be used for research and
possibly internal review and improvement while information on its scientific soundness is
collected.
Before it is used for public reporting, pay-for-performance, or other applications, its
importance and scientific soundness should be well-established, and feasibility and actionability
become increasingly important.
None of the health care efficiency measures we identified met our criteria for use in public
reporting, tiered network design, or pay-for-performance, since no identified measure has
published evidence of sufficient scientific soundness to make it acceptable to all or even most
stakeholders.
To supplement the published evidence, we explicitly requested during the peer
review process that reviewers indicate which measures were acceptable for current use. The
responses we received ranged from the view that all current measures are acceptable for
internal use but none for public use, to the view that some vendor-developed measures are
acceptable for tiered network design, to frank skepticism that any of the measures are
useful.
We therefore conclude that for many of the uses proposed for efficiency measures, such
as public reporting, tiered network design, and pay-for-performance, there is insufficient
published evidence and stakeholder consensus for any existing measure.
We contrast this with the
field of quality measurement, where at least a handful of measures have broad international
acceptance among stakeholders as useful measures of quality, including
their use for public reporting and pay-for-performance.
In terms of advancing the field of efficiency measures, measurement scientists would prefer
that steps be taken to improve these metrics in the laboratory before implementing them in
operational uses. Purchasers and health plans are already using vendor-developed products for a
variety of applications and believe that these measures will improve with use.
Although this
report will likely not change the current tension between these different stakeholders, we believe
that a substantial contribution to the field could be made by investing adequate resources in
testing vendor-developed measures, exploring whether academically developed measures could
be made feasible and actionable for real world applications, and funding the development of new
measures and measurement approaches in this area.
Such work might best be done with multi-stakeholder
advisory groups that can help guide measurement teams to find an appropriate
balance between scientific rigor and practical utility.
Table 10: Application of efficiency measures.