Statistical Engineering Division
Seminar Series
Record Linkage and Machine Learning
William E. Winkler
U.S. Census Bureau
Lecture Room A, Administration Building
July 15, 2003, 10:30am
Although terminology differs, there is considerable overlap between
record linkage (data cleaning) methods based on the Fellegi-Sunter
model and Bayesian networks used in machine learning. Both are
based on a formal probabilistic model that can be shown to be
equivalent in many practical situations. When no missing data are
present in identifying fields and training data are available, then
both can efficiently estimate parameters of interest. When missing
data are present, the EM algorithm can be used for parameter
estimation in Bayesian networks when there are training
data and in record linkage when there are no training data
(unsupervised learning). This talk describes some of the current
methods of approximate string comparison to account for typographical
errors between strings, hidden Markov models for adaptive name and
address parsing, methods of semi-supervised learning, fast indexing
and retrieval methods for comparing records from files with hundreds
of millions of records, and methods of creating information and data
structures during linkage processes.
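
As a rough illustration of the unsupervised case mentioned above, the sketch below fits a two-class Fellegi-Sunter model with the EM algorithm on binary agreement patterns. It is not taken from the talk. It assumes conditional independence of the field comparisons given match status (a common simplification, not stated in the abstract), and the function names, starting values, and toy data are all hypothetical.

```python
import math


def _clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid log(0) and 0/0."""
    return min(max(x, eps), 1.0 - eps)


def em_fellegi_sunter(patterns, n_iter=200, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate p = P(match), m[j] = P(agree on field j | match) and
    u[j] = P(agree on field j | nonmatch) without any training labels.

    patterns: list of equal-length tuples of 0/1 agreement indicators,
    one tuple per compared record pair (hypothetical input format).
    """
    k = len(patterns[0])
    m = [m_init] * k
    u = [u_init] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for j, agree in enumerate(gamma):
                pm *= m[j] if agree else 1.0 - m[j]
                pu *= u[j] if agree else 1.0 - u[j]
            g.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the expected class memberships.
        total_m = max(sum(g), 1e-12)
        total_u = max(len(g) - sum(g), 1e-12)
        p = _clamp(sum(g) / len(g))
        for j in range(k):
            agree_m = sum(gi for gi, gamma in zip(g, patterns) if gamma[j])
            agree_u = sum(1.0 - gi for gi, gamma in zip(g, patterns) if gamma[j])
            m[j] = _clamp(agree_m / total_m)
            u[j] = _clamp(agree_u / total_u)
    return p, m, u


def match_weight(gamma, m, u):
    """Log2 likelihood ratio for one agreement pattern (higher favors a match)."""
    w = 0.0
    for j, agree in enumerate(gamma):
        if agree:
            w += math.log2(m[j] / u[j])
        else:
            w += math.log2((1.0 - m[j]) / (1.0 - u[j]))
    return w


# Toy agreement patterns on three fields (invented for illustration only).
toy_patterns = (
    [(1, 1, 1)] * 80 + [(1, 1, 0)] * 10 + [(1, 0, 1)] * 10
    + [(0, 0, 0)] * 700 + [(1, 0, 0)] * 80 + [(0, 1, 0)] * 80
    + [(0, 0, 1)] * 40
)

if __name__ == "__main__":
    p_hat, m_hat, u_hat = em_fellegi_sunter(toy_patterns)
    print("estimated match proportion:", round(p_hat, 3))
    print("weight for full agreement:", round(match_weight((1, 1, 1), m_hat, u_hat), 2))
```

In this sketch, pairs would be ranked by `match_weight` and compared against upper and lower cutoffs to designate links, possible links, and nonlinks. Choosing those cutoffs is outside the scope of the example.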
NIST Contact:
Walter Liggett, x-2851.