U.S. Census Bureau

Modeling and Quality of Masked Microdata

William E. Winkler

KEY WORDS: Data Mining, Likelihood, Loglinear, Multivariate


Statistical organizations collect data via survey forms and other methods. The resulting microdata are valuable for modeling and analysis. To produce a public-use file, an organization masks the data in a manner that may prevent re-identification of data associated with individual entities. The public-use microdata may allow one or two sets of analyses that approximately reproduce analyses that could be performed on the original microdata. This paper describes a general method of creating models of data. The method is related to methods of creating the aggregates of data that are needed as sufficient statistics in general classes of models (Moore and Lee 1998; DuMouchel et al. 2000; Owen 2003). If the aggregates can be approximately reproduced, then the masked microdata may allow one or more analyses that correspond to analyses on the original, non-public microdata. The masked microdata will typically not be suitable for general analyses.
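
The role of aggregates can be illustrated with a small sketch. It is not the method of the paper: the masking step (shuffling one column across records), the function names margin and permute_column, and the toy data are all assumptions made for illustration. For a loglinear model on categorical variables, the sufficient statistics are marginal counts, so an analysis is reproduced only if the masked file preserves the margins that the analysis uses.

```python
from collections import Counter
import random

def margin(records, idx):
    """Count records by the values in the positions listed in idx."""
    return Counter(tuple(r[i] for i in idx) for r in records)

def permute_column(records, col, seed=0):
    """Toy masking (hypothetical): shuffle one column across all records."""
    rng = random.Random(seed)
    values = [r[col] for r in records]
    rng.shuffle(values)
    return [r[:col] + (v,) + r[col + 1:] for r, v in zip(records, values)]

# Toy original microdata with fields (A, B, C); C is set equal to A,
# so the A-C association is as strong as possible.
original = [(a, b, a) for a in (0, 1) for b in (0, 1) for _ in range(25)]
masked = permute_column(original, col=2)

# Margins used by a model containing only the A-B interaction and the
# C main effect are unchanged by the shuffle.
ab_preserved = margin(original, (0, 1)) == margin(masked, (0, 1))
c_preserved = margin(original, (2,)) == margin(masked, (2,))

# The A-C margin is not controlled by the shuffle, so analyses that
# depend on the A-C association are generally not reproduced.
ac_preserved = margin(original, (0, 2)) == margin(masked, (0, 2))

print(ab_preserved, c_preserved, ac_preserved)
```

In this sketch the masked file supports the analysis whose aggregates were preserved and no others, which is the sense in which masked microdata may serve one or two analyses without being suitable for general use.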


Source: U.S. Census Bureau, Statistical Research Division

Created: January 13, 2006
Last revised: January 13, 2006