Modeling and Quality of Masked Microdata
William E. Winkler
KEY WORDS: Data Mining, Likelihood, Loglinear, Multivariate
ABSTRACT
Statistical organizations collect data via survey forms and other methods. The resulting microdata are valuable for modeling and analysis. To produce a public-use file, the organizations mask the data in a manner that may prevent re-identification of data associated with individual entities. The public-use microdata may allow one or two sets of analyses that approximately reproduce analyses that could be performed on the original microdata. This paper describes a general method of modeling data that is related to methods of creating the aggregates needed as sufficient statistics in general classes of models (Moore and Lee 1998, DuMouchel et al. 2000, Owen 2003). If the aggregates can be approximately reproduced, then the masked microdata may allow one or more analyses that correspond to analyses on the original, non-public microdata. The masked microdata will typically not be suitable for general analyses.
CITATION: