
Statistical Engineering Division
Seminar Series

Record Linkage and Machine Learning

William E. Winkler
U.S. Census Bureau
Lecture Room A, Administration Building
July 15, 2003, 10:30am

Although the terminology differs, there is considerable overlap between record linkage (data cleaning) methods based on the Fellegi-Sunter model and the Bayesian networks used in machine learning. Both rest on formal probabilistic models that can be shown to be equivalent in many practical situations. When the identifying fields contain no missing data and training data are available, both approaches can estimate the parameters of interest efficiently. When missing data are present, the EM algorithm can be used for parameter estimation: in Bayesian networks when training data are available, and in record linkage when they are not (unsupervised learning). This talk describes some current methods for approximate string comparison that account for typographical errors between strings, hidden Markov models for adaptive name and address parsing, methods of semi-supervised learning, fast indexing and retrieval methods for comparing records from files with hundreds of millions of records, and methods of creating information and data structures during linkage processes.
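
As a rough illustration of the unsupervised case mentioned in the abstract, the sketch below fits a two-class latent model to binary field-agreement vectors with EM and then scores a record pair with a Fellegi-Sunter style log-likelihood-ratio weight. It is not taken from the talk: the conditional-independence assumption across fields, the starting values, and the names em_fellegi_sunter and match_weight are assumptions made for this example only.

```python
import math


def em_fellegi_sunter(patterns, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Fit match / non-match agreement probabilities by EM (no training labels).

    patterns: list of tuples of 0/1, one entry per compared field
              (1 = the two records agree on that field).
    Returns (p, m, u): estimated proportion of matching pairs, and the
    per-field agreement probabilities among matches (m) and non-matches (u).
    Assumes fields agree independently given the match status.
    """
    k = len(patterns[0])
    n = len(patterns)
    m = [m_init] * k
    u = [u_init] * k
    eps = 1e-6  # keeps probabilities away from 0 and 1

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        post = []
        for g in patterns:
            lm, lu = p, 1.0 - p
            for j, agree in enumerate(g):
                lm *= m[j] if agree else 1.0 - m[j]
                lu *= u[j] if agree else 1.0 - u[j]
            post.append(lm / (lm + lu))

        # M-step: re-estimate parameters from the posterior weights.
        total_m = max(sum(post), eps)
        total_u = max(n - sum(post), eps)
        p = clamp(sum(post) / n)
        for j in range(k):
            m[j] = clamp(sum(w * g[j] for w, g in zip(post, patterns)) / total_m)
            u[j] = clamp(sum((1.0 - w) * g[j] for w, g in zip(post, patterns)) / total_u)
    return p, m, u


def match_weight(pattern, m, u):
    """Sum of per-field log2 likelihood ratios for one agreement pattern."""
    weight = 0.0
    for j, agree in enumerate(pattern):
        if agree:
            weight += math.log2(m[j] / u[j])
        else:
            weight += math.log2((1.0 - m[j]) / (1.0 - u[j]))
    return weight
```

Pairs with a high weight would be treated as likely links and pairs with a low weight as likely non-links; the cutoffs between those regions are a separate decision that this sketch does not address.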

NIST Contact: Walter Liggett, x-2851.

Date created: 2/28/2003
Last updated: 7/7/2003