Enhancing Peer Review at NIH - Improving Scoring & Transparency Scale

Improving Scoring & Transparency Scale

Background

Beginning with the summer 2009 review cycle/September Council, all applications will receive preliminary scores on each of the review criteria. Applications that are discussed in the review meetings will also be given an overall impact score by committee members. These scores will be averaged and then multiplied by 10 to obtain the priority score, which will be between 10 and 90. Applications that are not discussed at the meeting will not be given an overall impact score or a priority score, but the applicant as well as NIH staff will have the preliminary scores on each of the review criteria as additional feedback with their summary statement.

Scoring System

The new scoring system will be effective for all applications for research grants and cooperative agreements that are submitted for funding consideration for fiscal year 2010 (FY2010) and thereafter. The first standing due date for FY2010 is January 25, 2009; the new scoring system will be used for applications submitted in response to Parent Announcements and Program Announcements, including PARs and PASs. An important aspect of the implementation is that the new scoring system be used consistently for all applications considered in a given fiscal year. Therefore, some RFAs and PARs for funding consideration in FY2010 have due dates before January 25, 2009, and responses to those will be evaluated using the new scoring system. Likewise, some RFAs and PARs for FY2009 have due dates after January 25, 2009, and responses to those will be evaluated using the present scoring system.

Frequently Asked Questions

Why is NIH changing the grant application scoring system? The most reliable, consistent rating system is one that reflects reviewers’ abilities to discriminate among grant applications. In current practice, each scored application is assigned a single, overall score that reflects the consideration of all review criteria. Individual reviewers mark scores (i.e., 1.0 to 5.0, on a 41-point scale) to two significant figures (e.g., 2.2), and the reviewers’ individual scores are averaged and then multiplied by 100 to yield a single overall priority score for each scored application (e.g., 253).
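
As a rough illustration of the current calculation described above, the following Python sketch averages individual reviewer scores and multiplies by 100. The function name `legacy_priority_score`, the sample scores, and the rounding to a whole number are assumptions made for this example; the text states only that scores are averaged and multiplied by 100.

```python
from statistics import mean


def legacy_priority_score(reviewer_scores):
    """Average scores on the 1.0-5.0 scale and multiply by 100.

    Rounding to a whole number is an assumption for illustration,
    chosen to match the form of the example score (253) in the text.
    """
    for score in reviewer_scores:
        if not 1.0 <= score <= 5.0:
            raise ValueError(f"score {score} is outside the 1.0-5.0 scale")
    return round(mean(reviewer_scores) * 100)


# Hypothetical scores from three reviewers.
print(legacy_priority_score([2.2, 2.5, 2.9]))  # prints 253
```

The three-digit result suggests finer precision than the underlying ratings support, which is the concern raised in the next paragraph.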

Research has shown that rater reliability drops or fails to increase with a rating scale extended beyond 9 points. The NIH’s current scale of 41 points (i.e., 1.0 to 5.0) for initial scoring far exceeds that recommended by the psychometric literature. Further, once the scores are averaged and multiplied by 100, the resulting priority score appears to have more precision than it actually has. Based upon the psychometric literature, the recommendations from the 2008 NIH peer review self-study, and the 2003 and 1997 studies of the NIH scoring system, a change in the scoring system is warranted.

What is the new scoring system? The new scoring system uses a 9-point rating scale. Although a 7-point scale was initially planned, a 9-point scale provides sufficient range to allow reviewers to make additional distinctions among applications. A reasonable case could be made for smaller or larger ranges (e.g., 7- or 10-point scales), but two factors led to primary consideration of a 9-point scale. First, measurement experts from a previous peer review evaluation considered nine an acceptable number of discriminations that reviewers are likely to be able to make reliably. Second, the NIH has some prior experience with what was essentially a 9-point scale: 1.0 to 5.0 with only 0.5 increments allowed. This 1 to 5 scale in 0.5 increments produced a reasonable distribution of scores.
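
The point counts quoted for each scale can be checked by enumerating the allowed ratings. This is a minimal sketch; the helper name `scale_points` is hypothetical, and `Decimal` is used only to avoid floating-point drift when stepping by 0.1.

```python
from decimal import Decimal


def scale_points(low, high, step):
    """List every rating from low to high (inclusive) in increments of step."""
    low, high, step = Decimal(low), Decimal(high), Decimal(step)
    points = []
    value = low
    while value <= high:
        points.append(value)
        value += step
    return points


print(len(scale_points("1.0", "5.0", "0.1")))  # 41 points: the current scale
print(len(scale_points("1.0", "5.0", "0.5")))  # 9 points: the earlier 0.5-increment scale
print(len(scale_points("1", "9", "1")))        # 9 points: the new scale
```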

In addition to providing an overall rating of the grant application using this scale, the assigned reviewers and discussants will also assign a score to each of the primary review criteria (significance, investigator, innovation, approach, environment) using the same rating scale. These criterion-specific ratings will give both investigators and NIH staff a clearer sense of the relative strengths and weaknesses in the application (e.g., is the idea good but the design flawed, or is the design strong but the idea insignificant?).

What are the advantages of this 9-point scoring system? With fewer rating options (9 vs. 41), reviewers should be able to more reliably assign a rating to an application. To improve the reliable and consistent use of this 9-point rating scale, the rating options have verbal anchors, called descriptors, associated with them. These same anchors can be applied to both the overall score and each review criterion. Although there will always be differences among reviewers in their judgments of the merit of a grant application, routine use of these descriptors to guide ratings should result in more consistent and reliable ratings of applications across meetings and across the NIH.

Should the overall rating be based on the rating of the specific criteria? The specific review criteria are to inform the overall score but need not be used in a formulaic calculation. Each reviewer’s overall score is to reflect the reviewer’s evaluation of the scientific merit of the project in its entirety. Reviewers should weigh the importance of each specific criterion to the project being proposed, and provide an overall rating based on the importance of each criterion and the quality of the application as a whole. For example, more weight may be placed on approach and innovation for a basic science study than for a clinical trials study, in which public health significance may be more important than innovation. Investigator and environment may be more important in a study that uses advanced methodologies and technologies that only a few investigators and facilities are capable of performing than in a study that uses common methodologies that most investigators are capable of performing.

What will the new scores look like? The average of the reviewers’ scores will be multiplied by 10 to yield the priority score, which will range from 10 to 90. These priority scores will be percentiled against an appropriate base and reported to the nearest whole number.
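
The calculation just described can be sketched in a few lines of Python. The function name `priority_score` and the sample ratings are hypothetical, and rounding the result to a whole number is an assumption for illustration. Percentiling is not shown because the base and method are not specified here.

```python
from statistics import mean


def priority_score(overall_scores):
    """Average overall scores on the 1-9 scale and multiply by 10.

    Rounding to a whole number is an assumption for illustration.
    """
    for score in overall_scores:
        if score not in range(1, 10):
            raise ValueError(f"score {score} is not a whole number from 1 to 9")
    return round(mean(overall_scores) * 10)


# Hypothetical overall scores from three committee members.
print(priority_score([2, 3, 3]))  # prints 27
print(priority_score([1, 1, 1]))  # prints 10, the lowest possible value
print(priority_score([9, 9, 9]))  # prints 90, the highest possible value
```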

Won’t this new scoring system result in more applications with identical scores? How will tie scores be considered in making funding decisions? There will indeed be more numeric ties. Tie scores indicate that applications cannot be reliably distinguished from each other based on scientific merit alone and that other important factors will be considered in making funding decisions (e.g., mission relevance and portfolio balance).
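
To make the idea of numeric ties concrete, the sketch below groups applications that share a priority score. The function name `group_ties`, the application identifiers, and the scores are all hypothetical; nothing here models how funding decisions are actually made.

```python
from collections import defaultdict


def group_ties(priority_scores):
    """Return {score: [application ids]} for scores shared by two or more applications."""
    by_score = defaultdict(list)
    for app_id, score in priority_scores.items():
        by_score[score].append(app_id)
    return {
        score: sorted(ids)
        for score, ids in sorted(by_score.items())
        if len(ids) > 1
    }


# Hypothetical applications and priority scores.
scores = {"A": 20, "B": 27, "C": 20, "D": 31}
print(group_ties(scores))  # prints {20: ['A', 'C']}
```

Applications A and C are tied on score alone, so factors such as mission relevance and portfolio balance would have to distinguish them.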

This page was last reviewed on December 3, 2008