The U.S. Census Bureau

Approximate String Comparator Search Strategies for Very Large Administrative Lists

William E. Winkler

KEY WORDS: search mechanisms, approximate string comparison, computer matching

ABSTRACT

Rather than collect data from a variety of surveys, it is often more efficient to merge information from administrative lists. Person files might be matched using name and date of birth as the primary identifying information. Entities with a commonly occurring name pose obvious difficulties: a name such as John Smith may occur 30,000+ times (about 1.5 times for each date of birth). If each field contains 5% typographical errors, then fast character-by-character searches can miss 20% of the true matches among records that do not occur commonly, where name plus date of birth might be unique. This paper describes some existing solutions and current research directions.
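
The arithmetic behind the missed-match figure can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: it assumes errors are independent across fields and that a record is compared on four fields (the field count is hypothetical and chosen only to show how a 5% per-field error rate compounds to roughly 20%). The function names `exact_miss_rate` and `approx_equal`, and the 0.75 threshold, are also hypothetical. Python's standard-library `difflib.SequenceMatcher` stands in for an approximate string comparator; it is not the comparator studied in the paper.

```python
from difflib import SequenceMatcher


def exact_miss_rate(error_rate: float, n_fields: int) -> float:
    """Share of true matches lost when every field must agree exactly.

    Assumes (hypothetically) independent errors at the same rate per field.
    """
    return 1.0 - (1.0 - error_rate) ** n_fields


def approx_equal(a: str, b: str, threshold: float = 0.75) -> bool:
    """Treat two strings as agreeing if their similarity ratio is high enough.

    SequenceMatcher is a generic stand-in comparator; the threshold is an
    illustrative assumption, not a recommended setting.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


if __name__ == "__main__":
    # Four fields at 5% error each: a bit under one fifth of true matches lost.
    print(f"{exact_miss_rate(0.05, 4):.3f}")

    # A transposition defeats character-by-character comparison
    # but not the approximate comparator.
    print("Smith" == "Smiht", approx_equal("Smith", "Smiht"))
    print(approx_equal("Smith", "Jones"))
```

The point of the sketch is only that exact comparison fails on any single typographical error, so the loss compounds with the number of fields compared, while a tolerant comparator can still recognize near-identical values.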

CITATION:

Source: U.S. Census Bureau, Statistical Research Division

Created: March 21, 2005
Last revised: March 21, 2005