U.S. Census Bureau

The U.S. Census Bureau's Plans for the Census 2000
Public Use Microdata Sample Files



December 20, 2001
  1. INTRODUCTION

    The U.S. Census Bureau will provide two sets of Public Use Microdata Sample (PUMS) files: a 1 percent national characteristics file and 5 percent state files. These files will provide the greatest possible detail, while protecting the confidential nature of the data. For Puerto Rico, 1 percent and 5 percent files also will be created.1

    Because of the rapid advances in computer technology and the increased accessibility of census data to the user community, the Census Bureau has had to adopt more stringent measures to protect the confidentiality of public use microdata through disclosure-limitation techniques. At the same time, the Census Bureau recognizes the needs of data users for greater characteristic detail and greater geographic specificity. Hence, two sets of files will be produced: one that provides a fuller range of detailed characteristics (the 1 percent national characteristics file) and one that provides greater geographic detail but less characteristic detail (the 5 percent state files).

    This paper describes the confidentiality protection, the content of the two file types, and the approximate release dates.

  2. CONFIDENTIALITY

    Confidentiality will be protected by the use of the following processes: data swapping, top-coding of selected variables, geographic population thresholds, age perturbation for large households, and reduced detail on some categorical variables.

    Data swapping is a method of disclosure limitation designed to protect confidentiality in tables of frequency data (the number or percent of the population with certain characteristics). Data swapping is done by editing the source data or exchanging records for a sample of cases. Swapping is applied to individual records and, therefore, also protects microdata.

    Top-coding is a method of disclosure limitation in which all cases in or above a certain percentage of the distribution are placed into a single category.

    Geographic population thresholds prohibit the disclosure of data for individuals or households for geographic units with population counts below a specified level (see descriptions of Public Use Microdata Areas (PUMAs) and super-PUMAs in Section 3).

    Age perturbation, that is, modifying the age of household members, will be required for large households (households containing ten people or more) due to concerns about confidentiality.

    Detail for categorical variables will be collapsed if the categories do not meet a specified national minimum population threshold.

  3. FILE TYPES

    1. National Characteristics 1 Percent PUMS File

      The national characteristics file will provide the maximum amount of social, economic, and housing information available. The goal of this file is to match, as closely as possible, the amount of detail that was in the 1990 PUMS files (and, in some cases, to provide more detail). No national minimum population threshold for the identification of variable categories is planned, with the exceptions of race and Hispanic origin. Limits on certain variables, deemed necessary to protect confidentiality, are covered under "Additional Specifications for the PUMS Files" below.

      To maintain the level of detail described above, however, the minimum geographic population threshold must be raised above 100,000 (the PUMA minimum). A new geographical entity is being created--the super-PUMA. Super-PUMAs have a minimum population of 400,000 and are composed of a PUMA or PUMAs delineated on the companion state-level PUMA file.2 Each state will be identified, and any state with a population of 800,000 or greater can be subdivided into two or more super-PUMAs.
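
      The 800,000 figure follows from the 400,000 minimum: a state must hold at least twice the minimum before it can be split into two super-PUMAs. The sketch below works through that arithmetic in Python. The function name is hypothetical, and the result is only an upper bound based on total population; actual super-PUMA delineation depends on how the underlying PUMAs are grouped, which this paper does not specify.

```python
SUPER_PUMA_MIN = 400_000  # minimum super-PUMA population stated in the text


def max_super_pumas(state_population: int) -> int:
    """Upper bound on the number of super-PUMAs a state could contain.

    Each super-PUMA must hold at least SUPER_PUMA_MIN people. Every state
    is identified, so the result is never below 1.
    """
    return max(1, state_population // SUPER_PUMA_MIN)


# A state just under 800,000 cannot be split; one at 800,000 can.
print(max_super_pumas(799_999))  # 1
print(max_super_pumas(800_000))  # 2
```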

    2. State-Level 5 Percent PUMS Files

      State-level 5 percent PUMS files will provide information for PUMAs that will represent many metropolitan areas, cities, and more populous counties, as well as groups of less populous counties. In order to protect confidentiality, characteristic information for these smaller areas will be less detailed than in the national 1 percent file.

      1. Population Thresholds for PUMAs

        Each geographic unit in the 5 percent files--PUMAs--must meet a minimum population threshold of 100,000. The minimum PUMA threshold will be held at 100,000 people by increasing the degree of variable collapsing to an appropriate level to maintain confidentiality. There are two main arguments favoring this approach.

        First, from a user's standpoint, raising the minimum population threshold for PUMAs above 100,000 would greatly restrict a wide variety of local-level geographic analyses, such as studies of nonmetropolitan, metropolitan, and intrametropolitan areas, conducted by public agencies, academic researchers, and others in the private sector.

        Second, the 100,000 minimum population threshold--the threshold set for both the 1980 and 1990 PUMS files--permits historical comparability. Users interested in time-series analysis were clearly displeased at the possibility of an increase in the threshold for Census 2000. Those users noted the difficulty in comparing the results from different decades if the PUMA threshold was raised. Additionally, the Census Bureau's use of 250,000 as the minimum threshold for PUMAs in 1970 was criticized by users--an important reason for the decision to lower the minimum threshold to 100,000 people for the 1980 PUMS files and to maintain it in the 1990 PUMS files.

      2. Minimum Population Threshold for Categorical Variables

        To maintain confidentiality, while retaining as much characteristic detail as possible, a minimum threshold of 10,000 in the national population will be set for the identification of groups within categorical variables in the state-level PUMS files. At the PUMS Users Conference held in Alexandria, Virginia, on May 22, 2000, some users suggested a minimum population threshold of 25,000 in response to concerns about confidentiality. The Census Bureau subsequently determined that a minimum threshold of 10,000 would maintain the confidentiality of responses, while providing greater detail to the user.

      3. Post-processing

        The state-level files will require significant post-processing. Instead of identifying variable categories based upon pretabulation assumptions about the composition of the population, this approach develops variable collapsing requirements after the microdata samples have been drawn. Each variable will be analyzed, and only those values that do not meet the 10,000 minimum national population threshold will be collapsed into more general categories.
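
        The collapsing step described above can be sketched as follows. This is a minimal illustration, not the Census Bureau's procedure: the category names, the counts, and the mapping from each detailed category to a more general one are all hypothetical, and the sketch assumes national population counts per category are already available.

```python
NATIONAL_MIN = 10_000  # national threshold stated for the state-level files


def collapse_categories(national_counts, parent_of, threshold=NATIONAL_MIN):
    """Map each detailed category to the code it will carry on the file.

    A category whose national count meets the threshold keeps its own code;
    any other category is replaced by its more general parent category.
    """
    return {
        category: (category if count >= threshold else parent_of[category])
        for category, count in national_counts.items()
    }


# Hypothetical counts and hierarchy, for illustration only.
counts = {"detail_a": 25_000, "detail_b": 4_200, "detail_c": 10_000}
parents = {"detail_a": "general_x", "detail_b": "general_x", "detail_c": "general_x"}

print(collapse_categories(counts, parents))
```

        Only detail_b falls below 10,000 here, so it alone is folded into general_x; the other two categories remain identified.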

        Post-processing will improve the PUMS products by offering a more precise means of ensuring confidentiality. However, this procedure will increase the processing and analytic work load and delay the release of the 5 percent PUMS products to the public by approximately six months.

    3. Additional Specifications for the PUMS Files

      Additional PUMS file specifications are included for the following variables in the national characteristics and state-level files.

      1. Dollar Amounts

        Dollar amounts will be rounded before all summations, ratio calculations, or presentations of amounts. The dollar amounts will be represented, including negative amounts, as follows:

        Dollar Amounts:
        No income: $0
        $1-$7: $4
        $8-$999: round to the nearest $10
        $1,000-$49,999: round to the nearest $100
        $50,000 or more: round to the nearest $1,000

        This rule will be applied to income types, utility costs, mortgage costs, rent, condominium fees, hazard insurance costs, and mobile home fees.
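
        The graduated rounding scale can be expressed as a short function. The sketch below is illustrative only and makes two assumptions the paper does not state: negative amounts are rounded by magnitude and keep their sign, and an amount exactly halfway between two rounding points rounds up. The function name is hypothetical, and amounts are taken to be whole dollars.

```python
def round_dollar_amount(amount: int) -> int:
    """Round a whole-dollar amount on the graduated scale given in the text."""
    sign = -1 if amount < 0 else 1
    value = abs(amount)  # assumption: negatives are rounded by magnitude

    if value == 0:
        return 0
    if value <= 7:
        return sign * 4
    if value <= 999:
        unit = 10
    elif value <= 49_999:
        unit = 100
    else:
        unit = 1_000

    # Assumption: ties round up (e.g. 15 becomes 20).
    return sign * ((value + unit // 2) // unit) * unit


print(round_dollar_amount(3))       # 4
print(round_dollar_amount(1_234))   # 1200
print(round_dollar_amount(50_500))  # 51000
```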

        Implementing income top-coding: An individual's income will be rounded on a graduated scale and independently top-coded by variable type. The value inserted for observations at and above the top-code will be the state mean of all cases at and above the top-code minimum value. Incomes will then be summed across household members to obtain household totals, without any additional top-coding. The bottom-code for all income types that can have negative dollar values will be set at -$10,000; that is, no loss larger than $10,000 will be shown.
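
        A minimal sketch of the top-coding and bottom-coding steps follows. It is not the Census Bureau's implementation: the function names, the sample values, and the 100,000 top-code are hypothetical (this paper does not list top-code values), the mean is unweighted, and the rounding step that precedes top-coding is omitted for brevity.

```python
from statistics import mean

BOTTOM_CODE = -10_000  # largest loss retained, per the text


def top_code_with_mean(values, top_code_min):
    """Replace each value at or above top_code_min with the mean of all such values.

    `values` is assumed to hold one income type for the people in one state.
    """
    top_cases = [v for v in values if v >= top_code_min]
    if not top_cases:
        return list(values)
    replacement = mean(top_cases)
    return [replacement if v >= top_code_min else v for v in values]


def bottom_code(values, floor=BOTTOM_CODE):
    """Raise any value below the floor up to the floor."""
    return [max(v, floor) for v in values]


# Hypothetical values for one income type in one state.
wages = [20_000, 150_000, 250_000, 90_000]
coded = top_code_with_mean(wages, top_code_min=100_000)
print(coded)  # the two values at or above 100,000 both become their mean, 200,000
```

        Because each income type is top-coded independently, a household total is simply the sum of the already-coded person-level values, with no further top-coding applied to the sum.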

        Housing-related dollar amount variables: Property taxes will be categorized in a similar way to 1990, with the exception of the higher tax categories. The categories shown below will be used for the 1 percent file. The categories for the 5 percent file may have to be collapsed in order to protect confidentiality.

        Property tax ranges:
        Not applicable
        None
        $50 increments from $1 to $999
        $100 increments from $1,000 to $4,999
        $500 increments from $5,000 to $5,999
        $1,000 increments from $6,000 to $9,999
        $10,000 or more3

        All other housing-related dollar amounts will be treated similarly to income (see above). That is, the variables will use the same rounding scale as for income, and each top-coded case will receive the state mean of the top-coded cases for the respective variable. For the items that are aggregated to create selected monthly owner costs (SMOC) and gross rent, each item will be rounded independently and top-coded before summing to the SMOC or gross rent total. No further rounding will be performed on the aggregated amount.

      2. Race and Hispanic Origin Data

        Data on race will include "yes/no" variables for the five Office of Management and Budget (OMB) races4 and Some other race on both the 1 percent and the 5 percent files. This will allow data users to construct the 63 possible race combinations shown on the redistricting data file.
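
        The count of 63 follows from the six yes/no variables: each non-empty subset of the six categories is one combination, giving 2^6 - 1 = 63. The short sketch below enumerates them. The function name is hypothetical; the category labels are taken from this paper.

```python
from itertools import combinations

# The five OMB races plus Some other race, as listed in this paper.
RACES = [
    "White",
    "Black or African American",
    "American Indian and Alaska Native",
    "Asian",
    "Native Hawaiian and Other Pacific Islander",
    "Some other race",
]


def race_combinations(races=RACES):
    """Return every non-empty combination of the given race categories."""
    return [
        combo
        for size in range(1, len(races) + 1)
        for combo in combinations(races, size)
    ]


all_combos = race_combinations()
print(len(all_combos))  # 63
```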

        In addition, in both the 1 percent and the 5 percent files, we will show all combinations of the 15 race categories shown on the census questionnaire, specific American Indian and Alaska Native tribes alone, and detailed Asian and Native Hawaiian and Other Pacific Islander groups alone that meet the relevant thresholds.5 In the 1 percent file, we are planning a national minimum population threshold of 8,000 for the identification of categories in the race and Hispanic origin variables; in the 5 percent files, we are planning a national minimum population threshold of 10,000 for the identification of categories in these variables. For example, the racial category "Black or African American and Filipino" will be shown on both files because there are more than 10,000 people in the United States who reported this combination on Census 2000.

      3. Age Detail

        For both the state-level and national characteristics files, single-year age categories will be provided through age 89. There is one nationwide top-code (age 90), and each person at or above it receives the mean age of individuals 90 years and over in that person's state.

      4. Ancestry Variables

        The Census Bureau codes up to two responses for the ancestry question. For the state-level files, if the combined total national population from both of these responses for an ancestry group is 10,000 or greater, that group will be identified by itself in both the first response and second response variables, even if the total for the category in either or both of the individual ancestry variables does not meet the 10,000 threshold.6
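
        The combined-response rule can be checked with a one-line test. The sketch below uses the Alsatian figures given in footnote 6 of this paper; the function name is hypothetical.

```python
ANCESTRY_MIN = 10_000  # national threshold for the state-level files


def ancestry_identified(first_response_count, second_response_count,
                        threshold=ANCESTRY_MIN):
    """Return True if an ancestry group is identified separately.

    The test is applied to the combined national total of both responses,
    not to either response variable on its own.
    """
    return first_response_count + second_response_count >= threshold


# Figures from footnote 6: neither count alone reaches 10,000,
# but the combined total of 16,420 does.
print(ancestry_identified(9_638, 6_782))  # True
```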

      5. Industry and Occupation

        Two sets of codes for each occupation and industry will be provided: (1) the census code and the Standard Occupational Classification (SOC)-based code for occupation and (2) the census code and the North American Industry Classification System (NAICS)-based code for industry.

      6. Continuous Variables

        Continuous variables are treated the same on both files. Additional specifications for departure time (when a person usually left for work in the week before their census form was filled out) and year of entry into the U.S. are described below.

        Departure time will be categorized as follows:

        12 midnight - 2:59 a.m. in 30-minute increments
        3 a.m. - 4:59 a.m. in 10-minute increments
        5 a.m. - 10:59 a.m. in 5-minute increments
        11 a.m. - 11:59 p.m. in 10-minute increments
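
        The departure-time categories above can be sketched as a lookup that maps a reported time to its interval. This is an illustration under an assumption the paper does not state: intervals are taken to start at the beginning of each time band and to run for the full increment (for example, 7:40 to 7:44 in the 5-minute band). The names below are hypothetical.

```python
# (band start, band end, increment), all in minutes since midnight.
BANDS = [
    (0, 180, 30),     # 12 midnight - 2:59 a.m.
    (180, 300, 10),   # 3 a.m. - 4:59 a.m.
    (300, 660, 5),    # 5 a.m. - 10:59 a.m.
    (660, 1440, 10),  # 11 a.m. - 11:59 p.m.
]


def departure_category(hour: int, minute: int) -> str:
    """Return the interval containing the given time, as 'HH:MM-HH:MM' (24-hour)."""
    t = hour * 60 + minute
    if not 0 <= t < 1440:
        raise ValueError("time must fall within a single day")
    for start, end, step in BANDS:
        if start <= t < end:
            low = start + ((t - start) // step) * step
            high = low + step - 1
            return f"{low // 60:02d}:{low % 60:02d}-{high // 60:02d}:{high % 60:02d}"
    raise AssertionError("unreachable: BANDS cover the whole day")


print(departure_category(7, 43))  # 07:40-07:44
print(departure_category(1, 15))  # 01:00-01:29
```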

        Year of entry into the country will have a bottom-code of 1910.

  4. TIMETABLES FOR PUMS FILES

    The 1 percent national characteristics file will be the first file released to the public. It is planned for release in 2002. The 5 percent state-level files, requiring more time for post-processing, will be released to the public in 2003.




    Prepared by: Paul J. Mackun
    Population Division
    U.S. Census Bureau


1 For two Island Areas (Guam and the U.S. Virgin Islands), 10 percent PUMS files will be created.

2 The super-PUMAs will be identified in the 5 percent files, as well.

3 Each state receives the mean value of all cases in that state at and above the national top-code value.

4 The five OMB races include White, Black or African American, American Indian and Alaska Native, Asian, and Native Hawaiian and Other Pacific Islander.

5 The following 15 race categories appear on the form: White, Black or African American, American Indian or Alaska Native, Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese, Other Asian, Native Hawaiian, Guamanian or Chamorro, Samoan, Other Pacific Islander, and Some other race.

6 For example, if there are 9,638 individuals who identify themselves as Alsatian in the "ancestry, first response" variable and 6,782 individuals who identify themselves as Alsatian in the "ancestry, second response" variable, Alsatian will appear as a separate category in both variables--first response and second response--because the total number of Alsatian responses nationwide, 16,420, surpasses the 10,000 national minimum population threshold.