U.S. History Interrater Reliability

2001 U.S. History Assessment Item-by-Item Rater Reliability

A subsample of the U.S. history responses for each constructed-response item is scored by a second scorer to obtain statistics on interrater reliability. In general, items administered only to the national main sample receive 25 percent second scoring. This reliability information is also used by the scoring supervisor to monitor the capabilities of individual raters and to maintain uniformity of scoring across raters. Reliability reports are generated on demand by the scoring supervisor, trainer, scoring director, or item development subject area coordinator, and printed copies are reviewed daily by lead scoring staff. In addition to the immediate feedback provided by the online reliability reports, each scoring supervisor can review the actual responses scored by a scorer with the backreading tool. In this way, the scoring supervisor can monitor each scorer closely and correct scoring difficulties almost immediately.
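As a rough illustration of the statistic being reported, the Python sketch below computes the percentage of exact agreement between first and second readings for a single item. The exact_agreement helper and the sample scores are hypothetical and are not drawn from NAEP's scoring systems.

```python
def exact_agreement(first_scores, second_scores):
    """Percent of double-scored responses on which the first and second
    readings match exactly (hypothetical helper, for illustration only)."""
    if not first_scores or len(first_scores) != len(second_scores):
        raise ValueError("expected two non-empty, equal-length score lists")
    matches = sum(1 for a, b in zip(first_scores, second_scores) if a == b)
    return 100.0 * matches / len(first_scores)


# Hypothetical scores for the subsample of one item's responses
# that received a second reading.
first = [1, 2, 2, 3, 1, 0, 2, 3]
second = [1, 2, 3, 3, 1, 0, 2, 2]
print(f"Exact agreement: {exact_agreement(first, second):.1f}%")  # 75.0%
```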

Interrater reliability ranges, by assessment year, U.S. history national main assessments: 1994 and 2001
(Entries are the number of unique items whose interrater agreement falls in each range.)

Assessment          Unique items   60-69%   70-79%   80-89%   Above 90%
2001 U.S. history             47        2       16       16          13
1994 U.S. history             79        0        1       33          45
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress using an interrater reliability tool. This display tool functions in either of two modes:

  • to display all first readings versus all second readings; or

  • to display all of an individual scorer's readings that were also scored by another scorer, versus the scores assigned by those other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and scores awarded during second readings in columns (for mode one), or with the individual scorer's scores in rows and the other raters' scores in columns (for mode two). In this format, instances of exact agreement fall along the diagonal of the matrix. For completeness, each cell of the matrix contains the number and percentage of cases of agreement (or disagreement). The display also shows the total number of second readings and the overall percentage of agreement on the item. Because the interrater reliability reports are cumulative, a printed copy of each item's reliability report is made periodically and compared with previously generated reports. Scoring staff members save printed copies of all final reliability reports and archive them with the training sets.
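As a rough illustration only, the Python sketch below shows one way such an agreement matrix could be assembled from paired first and second readings, with exact agreement on the diagonal and a summary line giving the total number of second readings and the overall percentage of agreement. The reliability_matrix function and the sample scores are hypothetical and are not drawn from the actual display tool.

```python
from collections import Counter


def reliability_matrix(first_scores, second_scores, score_levels):
    """Render an agreement matrix: first readings in rows, second readings
    in columns. Each cell shows the count and percentage of double-scored
    responses; exact agreement falls on the diagonal. (Hypothetical sketch.)"""
    pairs = Counter(zip(first_scores, second_scores))
    total = len(first_scores)
    exact = sum(pairs[(s, s)] for s in score_levels)

    rows = ["1st\\2nd " + "".join(f"{str(c):>14}" for c in score_levels)]
    for r in score_levels:
        cells = "".join(
            f"{pairs[(r, c)]:>5} ({100 * pairs[(r, c)] / total:5.1f}%)"
            for c in score_levels
        )
        rows.append(f"{str(r):>7} {cells}")
    rows.append(
        f"Second readings: {total}   Overall agreement: {100 * exact / total:.1f}%"
    )
    return "\n".join(rows)


# Hypothetical paired readings for one constructed-response item (score points 0-3)
first = [1, 2, 2, 3, 1, 0, 2, 3]
second = [1, 2, 3, 3, 1, 0, 2, 2]
print(reliability_matrix(first, second, score_levels=[0, 1, 2, 3]))
```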

Last updated 17 June 2008 (MH)
