GPRA concepts. For purposes
of annual performance reporting, GPRA seeks information about
program outputs and outcomes. The immediately observable products
of program activity, such as publications or graduates, are termed
outputs. Intermediate and longer-term results for
which the program is designed, such as producing knowledge or
enabling improved health or national security, are referred to
as outcomes.
Outputs and outcomes should be distinguished from
inputs, such as researchers' knowledge and time, use of equipment,
instruments, facilities, and supplies.[1]
The OMB guidance for implementation of GPRA also
discusses program impacts. These are the total long-run direct
or indirect effects or consequences of the program. These effects
may be intended or unintended, and may be positive or negative.
Although information about impacts is useful for understanding
the eventual effects of government programs, it is not required
under the GPRA legislation.
In principle, it should be easy to distinguish among
indicators for inputs, outputs, intermediate outcomes, and end
outcomes. In practice, these four concepts represent a continuum
for which indicators can blend one into another. A simplified
description of the process of training in science and engineering
illustrates the point. High school graduates represent a final
output from secondary schools and an input to colleges and universities
(as well as an input to employers who hire high school graduates
directly). Baccalaureate recipients represent a final output
from colleges and universities when they move directly into science
and engineering employment. Baccalaureate recipients represent
an intermediate output when they move on to graduate training.
Individuals who complete doctoral or postdoctoral programs represent
outputs of those programs and inputs that renew the scientific work
force's human resource base. Meanwhile, maintenance of a top
quality science and engineering work force, appropriately employed,
represents an outcome that enables continued conduct of world-class
science, training of the next generation of scientists and engineers,
and deployment of scientific expertise throughout the many sectors
of the economy to assure development and application of new knowledge
and techniques. And all of these, in combination with other factors,
enable attainment of the over-arching national goals of
improved health, environment, prosperity, national security, and
quality of life.
Whether a particular product or result is an output
or intermediate or end outcome depends upon the mission and goals
of the agency that produced it and upon whether it is viewed
from the perspective of that agency's individual plan or from
the perspective of over-arching national goals. Program-specific
information about goals, institutional setting, and overall context
is needed before definitions of outputs, intermediate outcomes,
and end outcomes can be tailored to program reporting and before
indicators can be identified. Appendix C provides an example
from the Department of Agriculture that illustrates how overall
agency goals and specific program goals can be used to derive
performance indicators.
Pre-existing measures.
Because pre-existing measures of research results were developed
primarily for other purposes, they have not yet been adapted for
use in reporting at the agency level. Pre-existing measures capture
only a subset of the spectrum of research outputs and outcomes.
Unfortunately, they do not map neatly onto GPRA concepts.
Although there are many measures of potential applicability to
the science enterprise, most track inputs or levels of research
activity. Some could be used as a starting point for examination
of output. A few could be considered to capture selected aspects
of outcomes. Thus far, comprehensive efforts to determine impacts
appear to be rare. Consequently, pre-existing measures can serve
only as a starting point for agency thinking about how to design
the most effective assessment methods. Some well-known pre-existing
measures are discussed below.
Publication counts. Publication
counts have been used in non-GPRA contexts as measures of the
quantity of knowledge produced by a research program. Publication
itself is a tangible indicator of the transfer of research findings
to the public domain, and publication in a peer reviewed journal
is an indicator of a positive scientific evaluation of the information.
Although publication counts provide useful information when combined
with a larger, richer set of indicators and analyses, their use
alone or without sufficient information about other aspects of
performance and the circumstances of the research can produce
an incomplete, if not inaccurate, picture. For example, differences
in publication rates between scientific disciplines may reflect
differences in propensity to publish, in the definition of the
smallest publishable unit, and in patterns of collaboration rather
than differences in productivity. Also, the mere introduction
of the counting of publications as a performance indicator, depending
on how it is done, can influence publication patterns or publication
rates--setting up incentives that focus on the production of more
articles, rather than on the discovery of new knowledge.
Patent counts. Counts
of patents, new devices, computer programs, and other inventions
do not say much about whether a program is conducting world-class
science at the frontier of knowledge, but some mission agencies
may use them to gain insight about connections between their program
and the agency mission. If such counts are used in the assessment
of a fundamental science program, they should be used in
combination with other sources of information as part
of a richer, more detailed assessment. Further, any use made
of such counts should be undertaken only with full awareness of
their limitations. In particular, patent counts and other indicators
of inventive activity tend to be low for basic research programs.
The year-to-year statistical instability of small counts
suggests that including measures of inventive
activity among a few summary indicators in a short program report
would be a risky strategy for a fundamental science program; however,
it should be possible to handle the problem of high variability
in small numbers by using, for example, a rolling three-year average.
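To make the smoothing concrete, here is a minimal sketch in Python
of a trailing three-year average applied to annual patent counts;
the function name and the counts themselves are invented for illustration.

```python
# Minimal sketch: smoothing noisy annual patent counts with a trailing
# three-year average. All figures are hypothetical illustration data.

def rolling_three_year_average(counts):
    """Trailing three-year average for each year with at least three
    years of history; the first two years are omitted."""
    return [sum(counts[i - 2:i + 1]) / 3 for i in range(2, len(counts))]

# Hypothetical annual patent counts for a small basic-research program:
annual_counts = [1, 4, 0, 2, 5, 1]
print(rolling_three_year_average(annual_counts))
# [1.67, 2.0, 2.33, 2.67] (rounded): the smoothed series fluctuates far
# less from year to year than the raw counts do.
```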
Citation counts. In the
program evaluation literature, citation counts are sometimes described
as an unobtrusive form of wide-scale scientific review. As with
publication counts, their use and interpretation should be undertaken
with certain caveats: in a few cases, high numbers of citations
may indicate a negative evaluation (e.g., the disputed cold fusion
results); possible citation "clubs" or "spirals"
do not say much about the underlying science; citation rates vary
among fields; in many fields, experimental work tends to be cited
more frequently than theoretical work, and occasional methods papers
achieve extremely high levels of perfunctory citation, with the
consequence that citation counts, in general, may under-value
advances in understanding and over-value sheer experimental activity;
and, in fields for which the writing of a book is a major publication
outlet (e.g., some social sciences), citation counts are an unfair
assessment of value, since the Science Citation Index, on which such
counts are based, includes only references from journals.
Since prior experience with program evaluation suggests
that retrospective scientific review and citation counts seem
to provide complementary perspectives, evaluators generally advocate
using the two together for detailed program evaluation. For example,
a scientific review panel's judgments may be sharpened when it
is required to evaluate and respond to literature-based data regarding
the program being evaluated. A perception seems to be emerging
that citation counts could be usefully combined
with other descriptive information in summary reports of overall
performance.
Contributions to other goals.
A program may also contribute to other Federal goals, and such
contributions are relevant aspects of program performance whether
or not they are listed among specific program objectives. Such
contributions can be included in program reports (and can be added
to program objectives). Measures have been attempted for other
aspects of research programs such as the development of human
resources and physical infrastructure, the building of cross-disciplinary
and cross-sectoral partnerships, or the numbers of undergraduates
involved in a research program or in informal science education
activities.
Output indicators for some activities might be available
from published reports. Others could be collected from principal
investigators at the completion of research projects and aggregated
at the agency level. If data are collected from individual investigators
and program managers, it should be made clear that such data will
be aggregated and reported at the agency level. It should also
be made clear that not all projects or programs need to contribute
to all of an agency's goals. This makes good management sense,
and communicating the point should help assure researchers that
they have the flexibility that they need for creative work.
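As one way of picturing the aggregation step described above, the
sketch below (Python, with invented indicator names and figures) rolls
end-of-project reports up into a single agency-level summary, so that
no individual investigator's figures appear separately.

```python
# Hypothetical sketch: aggregating project-level indicator reports to
# agency-level totals. Field names and numbers are invented.
from collections import Counter

# End-of-project reports submitted by principal investigators:
project_reports = [
    {"publications": 4, "students_trained": 2, "partnerships": 1},
    {"publications": 1, "students_trained": 0, "partnerships": 0},
    {"publications": 7, "students_trained": 3, "partnerships": 2},
]

# Sum each indicator across projects; only the totals are reported.
agency_totals = Counter()
for report in project_reports:
    agency_totals.update(report)

print(dict(agency_totals))
# {'publications': 12, 'students_trained': 5, 'partnerships': 3}
```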
Some experimental efforts at the National Science
Foundation to develop new sets of indicators are reported in Appendix C.
Rate-of-return and other measures developed by
economists. Economists have developed
a number of techniques intended to estimate the benefits of, or
returns to, research. These generally involve efforts to link,
directly or indirectly, the knowledge produced by research to
the benefits eventually produced by use of the knowledge in practical
applications. Basic approaches include efforts to (1) compute
the benefits associated with the results of a research program
or aggregation of programs, (2) compare the benefits of the research
to the costs of conducting the research by constructing a benefit-cost
ratio, and (3) compare benefits to costs by computing the implicit
rate of return.
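As a purely illustrative sketch of these three computations, the
following Python fragment applies them to invented cash flows. The
discount rate, the cost and benefit streams, and the bisection routine
for the implicit rate of return are all assumptions made for the
example, not methods prescribed here.

```python
# Hypothetical illustration of (1) discounted benefits, (2) a
# benefit-cost ratio, and (3) an implicit rate of return.

def npv(rate, flows):
    """Net present value of annual flows at a given discount rate."""
    return sum(f / (1 + rate) ** t for t, f in enumerate(flows))

def implicit_rate_of_return(flows, lo=0.0, hi=1.0, tol=1e-6):
    """Discount rate at which the NPV of the net flows is zero, found
    by bisection (assumes NPV changes sign exactly once on [lo, hi])."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, flows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Invented program: $10M of research cost up front, benefits over five years.
costs = [10.0, 0, 0, 0, 0, 0]
benefits = [0, 1.0, 3.0, 5.0, 6.0, 6.0]
rate = 0.05  # assumed social discount rate

print(f"(1) discounted benefits: ${npv(rate, benefits):.1f}M")
print(f"(2) benefit-cost ratio:  {npv(rate, benefits) / npv(rate, costs):.2f}")
net_flows = [b - c for b, c in zip(benefits, costs)]
print(f"(3) implicit rate of return: {implicit_rate_of_return(net_flows):.1%}")
```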
The findings of the "Assessment Process"
(the Process is described in Appendix A) and of other sources
(e.g., American Enterprise Institute et al. 1994) indicate
that existing economic methods and data are sufficient to measure
only a subset of important dimensions of the outcomes and impacts
of fundamental science. Sufficiency varies among Federal programs--economic
methods are perhaps best suited to assessing programs in some
mission agencies and least suited to assessing programs not directly
aimed at specific applications. When methods and data permit,
economic techniques can be used to communicate the size and significance
of the benefits of research. Two examples of the computation
of economic benefits of research are given in Appendix C. One
discusses the estimation of the cost savings flowing from biomedical
research at the National Institutes of Health; the other, the
economic impacts of research in metrology at the National Institute
of Standards and Technology.
Economists have also developed substantive information
about the determinants of the level and pattern of investments
in research and the adoption and diffusion of new products and
processes. However, the ever-changing pattern of innovative activities
involves complexities that are not well understood and that the
limitations of available data preclude studying. In
particular, what economists cannot now do is estimate (1) the
benefit compared to the cost "at the margin"
regarding the start of one more research program in comparison
to something else, or (2) the benefit compared to the cost "at
the margin" for an additional research program in
one field or application as compared to another.[2] Since economists
require information about benefits and costs "at the
margin" to make decisions about resource allocation,
many suggest that existing economic methods and data do not provide
useful criteria for allocating resources among potential areas
for future research. Of course, to the extent that existing data
permit computations of benefits or returns "on average"
(rather than "at the margin"), economic methods can
be used to gain retrospective insight about past performance.
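A hypothetical bit of arithmetic may make the average-versus-margin
distinction concrete; every figure below is invented for illustration.

```python
# Invented cumulative costs and benefits for a portfolio of programs ($M):
costs = [10, 20, 30, 40]
benefits = [40, 70, 90, 100]

# "On average": total benefit over total cost for the whole portfolio.
print(f"average benefit-cost ratio:  {benefits[-1] / costs[-1]:.1f}")  # 2.5

# "At the margin": extra benefit bought by the last increment of cost.
marginal = (benefits[-1] - benefits[-2]) / (costs[-1] - costs[-2])
print(f"marginal benefit-cost ratio: {marginal:.1f}")                  # 1.0

# The portfolio looks excellent on average, yet the next dollar "at the
# margin" buys far less; allocation decisions turn on the marginal figure.
```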
Future benefits. Decision
makers and policy makers sometimes seek information about what
can be expected in the future if investments are made in one line
of research or another. There are no measures (in the conventional
sense of the word) of what the future benefits of
research will be, at least in part because the future pattern
and course of research impacts cannot be known. The "Assessment
Process" did not attempt to develop measures of future
benefits. Nor did it attempt to develop methods for setting priorities
for future spending.[3]
Other approaches. Performance
reports need not and should not rely on quantitative measurement
alone. Annual performance reports might, for example, document
progress toward enabling goals over a rolling historical period
of, say, the last twenty years; they might present examples of
outstanding or more typical research accomplishments; or they
might build descriptive case studies of how the accretion of knowledge
through research eventually leads to long-run applications that
contribute to over-arching national goals.
Merit review with peer evaluations.
The insufficiency of measures per se is one reason why merit
review with peer evaluations of past performance provides important
information for retrospective performance assessment. The focus
of such assessments for responding to GPRA would be at the program
level. Since agencies are just now developing their approaches
for assessment under GPRA, it is not yet clear how the expert
assessments would be structured. Individual assessment panels
might focus on key agency programs or groups of related agency
programs (covering each such program or group of related programs
every five years or so).
It should be recalled that a program under GPRA is an activity or project listed in the Federal budget; however, GPRA gives agencies the option to aggregate or disaggregate activities for GPRA reporting, as long as doing so does not omit or minimize the significance of any major function or operation of the agency. In practice, the definition of a program for reporting under GPRA seems to be evolving to include a major function or operation of an agency or a major mission-directed goal that cuts across agency components or organizations.
The credibility and effectiveness of scientific review
for retrospective assessment depend critically on how the review
is organized and on who participates. A review panel
clearly must have competencies that are a good match to the program
content; it must have reviewers who are respected and objective
(for example, not likely to be influenced by concerns that the
panel's conclusions will influence future funding for their own
work). Only then will a review be credible.
An assessment panel should consider whether research
performance has been at the frontiers of scientific knowledge.
In addition, program managers may seek expert assessment of the
program's contributions to other enabling goals--for example,
contributions to maintaining a high quality scientific work force
appropriately employed or to ensuring that facilities and instrumentation
are maintained to support work at the cutting edge.
Program managers may also seek expert insight about
whether the program has made contributions to the knowledge base
for specific mission goals as well as over-arching national goals.
Intended users of the results of the program could provide information
about the relevance or importance of the program's results. Their
perspectives could be tapped by including them in the review panels.
For "mission agencies," intended users might be those
expected to apply the results of the science program (e.g., industry,
agriculture, or users within the agency). For "non-mission
agencies" such as the National Science Foundation, users
might be researchers in areas for which the program's work is
claimed to have impact. For agencies that support general knowledge
development and scientific training, it might be appropriate to
include stakeholders for the general pools of knowledge and talent
to which the agency contributes.
To assure objective judgments from expert panel members,
input should in principle be sought from researchers who were
not among those supported by the program or involved in selecting
projects funded by the program.
An example of the use of assessment panels at the
National Institute of Standards and Technology is given in Appendix C.
International standing.
Maintaining leadership across the frontiers of scientific knowledge
is a critical element in our investment strategy for science.
As noted above, for an individual agency, the evaluation criterion
is whether the agency's research is conducted at the frontiers
of scientific knowledge. For evaluation from an NSTC or national
perspective, information is needed about the international standing
of the United States. The findings of the "Assessment Process"
indicate that, although some data and methods exist for international
comparisons of a nation's research activity and some aspects of
overall research output, the methods for international comparison
are still in their infancy. Further work is needed to develop
cost-effective strategies for assessing American standing on the
world stage. An inter-agency group, such as the Committee on
Fundamental Science, might consider how this can best be accomplished.
We stress that leadership evaluation does not entail simplistic
numerical ranking of national programs. Our national interest
in leadership rests on having our research
and educational programs perform at the cutting edge--sometimes
in competition, but often in explicit collaboration, with scientists
from other nations.