The U.S. Census Bureau

Quality of Very Large Databases

William E. Winkler

KEY WORDS: record linkage, editing, imputation, data mining

ABSTRACT

Analyses and data mining of large computer files are affected by the quality of the information in the files. For large population registers and for files that are created by merging two or more files, duplicate entries must be identified. Duplicate identification can depend on record linkage software that can handle name, address, and date-of-birth data containing many typographical errors. Quantitative and qualitative data must be edited to ensure that mutually contradictory or missing items are corrected automatically and quickly. This paper describes computational methods and software that are suitable for groups of files in which individual files contain between 1 million and 4 billion records.
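As a rough illustration of the duplicate-identification idea described above, the following minimal sketch groups records by date of birth (a simple form of blocking) and flags pairs within a group whose names are nearly identical despite typographical errors. It is not the software the paper describes: the record layout, the field names `id`, `name`, and `dob`, the function names, the 0.85 threshold, and the sample records are all assumptions made for illustration, and Python's general-purpose `difflib.SequenceMatcher` stands in for a purpose-built string comparator.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score (a stand-in for a real name comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(records, threshold=0.85):
    """Yield (id_a, id_b, score) for same-DOB records whose names are similar.

    The threshold is an illustrative assumption, not a recommended value.
    """
    # Blocking: only records that share a date of birth are compared,
    # which avoids comparing every record with every other record.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["dob"]].append(rec)

    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = similarity(a["name"], b["name"])
            if score >= threshold:
                yield a["id"], b["id"], score


# Hypothetical sample data, invented for this sketch.
records = [
    {"id": 1, "name": "Jonathan Smith", "dob": "1970-03-15"},
    {"id": 2, "name": "Jonathon Smith", "dob": "1970-03-15"},  # typo variant
    {"id": 3, "name": "Mary Jones", "dob": "1970-03-15"},
    {"id": 4, "name": "Jonathan Smith", "dob": "1982-11-02"},  # different block
]

pairs = list(candidate_duplicates(records))
```

Blocking on an exact date of birth keeps the number of comparisons small, but it also misses duplicates whose date of birth itself contains an error (and, in this sample, never compares record 4 with record 1), so a practical system would typically make several passes with different blocking keys.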

CITATION:

Source: U.S. Census Bureau, Statistical Research Division

Created: July 25, 2001
Last revised: July 25, 2001