Agency for Healthcare Research and Quality www.ahrq.gov

2. Methods and Data (continued)

2.4 Using the Algorithm to Provide an Improved Race/Ethnicity Variable

Upon combining the Hispanic and Asian/Pacific Islander naming algorithms and verifying the combined algorithm's success on the CAHPS data, we created the NEWRACE variable for the entire Medicare population found in the EDB. The first step was to obtain from CMS all 41.7 million records of active beneficiaries in the 10 segments of the unloaded EDB from mid-2003. After uploading these records, we ran the algorithm on them to create NEWRACE for each living beneficiary in the EDB.

Table 2.3 demonstrates the differences in the EDBRACE and NEWRACE variables for the entire population of active beneficiaries listed in the EDB. The number and percentage of Hispanic and A/PI beneficiaries increased, while they decreased for the White and Other race/ethnicity categories. The number and percent of Black beneficiaries also decreased slightly.

Table 2.3 Comparison of the distribution of race/ethnicity according to EDBRACE and NEWRACE for the entire EDB

EDBRACE is the original EDB race variable; NEWRACE is the new EDB race variable.

Race/ethnicity                          EDBRACE frequency   EDBRACE percent   NEWRACE frequency   NEWRACE percent
White                                   35,141,623          84.2              33,424,922          80.1
Black                                   4,014,799           9.6               3,933,634           9.4
Hispanic                                913,069             2.2               2,912,244           7.0
Asian/Pacific Islander (A/PI)           593,456             1.4               854,182             2.0
American Indian/Alaska Native (AI/AN)   137,989             0.3               136,498             0.3
Other                                   838,744             2.0               394,375             0.9
Unknown                                 101,095             0.2               85,254              0.2
Missing                                 1,631               0.0               1,297               0.0
Total                                   41,742,406          100.0             41,742,406          100.0

Source: EDBRACE is from Medicare EDB from mid-2003; and NEWRACE is the result of having run the combined surname algorithm on race/ethnicity in the Medicare EDB from mid-2003.

Table 2.4 shows that, overall, 1,998,909 beneficiaries (see footnote 7) listed in the EDB had their race/ethnicity recoded to Hispanic as a result of using the combined improved naming algorithm. Most of these beneficiaries were originally classified in the EDB as White (83.5 percent), followed by Other/Unknown (11.1 percent) and Black (3.8 percent). Very few were originally coded as Asian/Pacific Islander (1.5 percent) or American Indian/Alaska Native (less than 0.05 percent). Overall, more female beneficiaries (1,068,033) than male beneficiaries (930,875) were recoded to Hispanic. This pattern holds true for beneficiaries originally coded as White, Black, and Asian/Pacific Islander. The largest number of "new" Hispanic beneficiaries was created in the 65-to-74-year-old age group. This is true for both genders and for every original EDB race/ethnicity code except American Indian/Alaska Native, where the under-65 group was the largest. Not surprisingly, the 85-and-older age group had the fewest beneficiaries with their race/ethnicity recoded. This undoubtedly reflects the overall age distribution of Medicare beneficiaries.

Table 2.4 Distribution of "new" Hispanic beneficiaries (NEWRACE) according to their EDBRACE, gender, and age group

Columns give the original EDBRACE category. Cells show number (percent of row total). A/PI = Asian/Pacific Islander; AI/AN = American Indian/Alaska Native.

Gender and age group  White              Black          A/PI           AI/AN       Other or unknown  Total
Total                 1,669,047 (83.5)   76,837 (3.8)   30,090 (1.5)   995 (0.0)   221,940 (11.1)    1,998,909 (100.0)
 Male                 767,952 (82.5)     36,070 (3.9)   12,499 (1.3)   520 (0.1)   113,834 (12.2)    930,875 (100.0)
  Under 65            170,155 (77.9)     10,650 (4.9)   1,789 (0.8)    287 (0.1)   35,501 (16.3)     218,382 (100.0)
  65-74               406,797 (84.0)     17,447 (3.6)   5,978 (1.2)    132 (0.0)   53,924 (11.1)     484,278 (100.0)
  75-84               142,310 (84.7)     5,467 (3.3)    3,873 (2.3)    92 (0.1)    16,303 (9.7)      168,045 (100.0)
  85 and older        48,690 (80.9)      2,506 (4.2)    859 (1.4)      9 (0.0)     8,106 (13.5)      60,170 (100.0)
 Female               901,095 (84.4)     40,767 (3.8)   17,591 (1.6)   475 (0.0)   108,105 (10.1)    1,068,033 (100.0)
  Under 65            144,235 (80.4)     8,947 (5.0)    1,539 (0.9)    223 (0.1)   24,461 (13.6)     179,405 (100.0)
  65-74               468,252 (85.7)     19,395 (3.5)   9,122 (1.7)    151 (0.0)   49,458 (9.1)      546,378 (100.0)
  75-84               193,255 (85.4)     7,540 (3.3)    5,651 (2.5)    83 (0.0)    19,835 (8.8)      226,364 (100.0)
  85 and older        95,353 (82.3)      4,885 (4.2)    1,276 (1.1)    18 (0.0)    14,351 (12.4)     115,883 (100.0)

Source: EDBRACE is from Medicare EDB from mid-2003; and NEWRACE is the result of having run the combined surname algorithm on race/ethnicity in the Medicare EDB from mid-2003.

As can be seen from Table 2.5, 290,748 beneficiaries (see footnote 8) were recoded to Asian/Pacific Islander as a result of using the combined improved naming algorithm. Unlike the new Hispanic beneficiaries, whose race/ethnicity was most often originally coded in the EDB as White, the majority of the new Asian/Pacific Islander beneficiaries (82.0 percent) were originally coded as Other/Unknown in the EDB. In addition, 16.4 percent were originally coded in the EDB as White, 1.5 percent as Black, and 0.2 percent as American Indian/Alaska Native. Note that we did not recode any beneficiaries to Asian/Pacific Islander who were originally coded as Hispanic in the EDB.

Table 2.5 Distribution of "new" Asian/Pacific Islander beneficiaries (NEWRACE) according to their EDBRACE, gender, and age group

Columns give the original EDBRACE category. Cells show number (percent of row total). AI/AN = American Indian/Alaska Native.

Gender and age group  White            Black          AI/AN       Other or unknown  Total
Total                 47,654 (16.4)    4,328 (1.5)    496 (0.2)   238,270 (82.0)    290,748 (100.0)
 Male                 15,594 (11.6)    1,519 (1.1)    230 (0.2)   117,661 (87.2)    135,004 (100.0)
  Under 65            2,392 (18.8)     473 (3.7)      49 (0.4)    9,809 (77.1)      12,723 (100.0)
  65-74               7,858 (9.0)      770 (0.9)      114 (0.1)   78,366 (90.0)     87,108 (100.0)
  75-84               4,157 (15.6)     226 (0.8)      60 (0.2)    22,241 (83.3)     26,684 (100.0)
  85 and older        1,187 (14.0)     50 (0.6)       7 (0.1)     7,245 (85.3)      8,489 (100.0)
 Female               32,060 (20.6)    2,809 (1.8)    266 (0.2)   120,609 (77.4)    155,744 (100.0)
  Under 65            4,263 (36.0)     596 (5.0)      40 (0.3)    6,947 (58.6)      11,846 (100.0)
  65-74               16,607 (18.2)    1,529 (1.7)    142 (0.2)   72,726 (79.9)     91,004 (100.0)
  75-84               8,274 (22.3)     503 (1.4)      71 (0.2)    28,267 (76.2)     37,115 (100.0)
  85 and older        2,916 (18.5)     181 (1.1)      13 (0.1)    12,669 (80.3)     15,779 (100.0)

Source: EDBRACE is from Medicare EDB from mid-2003; and NEWRACE is the result of having run the combined surname algorithm on race/ethnicity in the Medicare EDB from mid-2003.

With respect to gender and age, the Asian/Pacific Islander recodes were very similar to the Hispanic recodes. Within each original EDB race/ethnicity category, and within every age group except under 65, more females than males were recoded to Asian/Pacific Islander. Overall, 155,744 females were recoded compared to 135,004 males. As with Hispanic beneficiaries, beneficiaries 65 to 74 years of age were recoded most often, while those 85 and older were recoded least often.

Overall, the combined improved naming algorithm recoded the race/ethnicity of 2,290,027 Medicare beneficiaries. Females and those 65 to 74 years of age were most often recoded to a new race/ethnicity when we used the combined improved naming algorithm on the full 10 segments of the unloaded EDB. Most new Hispanic beneficiaries were originally coded as White, whereas new Asian/Pacific Islander beneficiaries were most often originally coded as Other/Unknown.
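The counts in Tables 2.3 through 2.5 can be cross-checked against one another. The following is a minimal Python sketch of that arithmetic, not part of the original analysis. All figures come from the tables and from footnotes 7 and 8; the variable names are illustrative. It assumes that the only beneficiaries who left the Asian/Pacific Islander category are those recoded to Hispanic, and that no one left the Hispanic category.

```python
# Counts copied from Tables 2.3, 2.4, and 2.5 and from footnotes 7 and 8.
edb_hispanic, new_hispanic = 913_069, 2_912_244   # Table 2.3
edb_api, new_api = 593_456, 854_182               # Table 2.3

recoded_to_hispanic = 1_998_909   # Table 2.4, Total row
missing_to_hispanic = 266         # footnote 7
recoded_to_api = 290_748          # Table 2.5, Total row
missing_to_api = 68               # footnote 8
api_to_hispanic = 30_090          # Table 2.4, A/PI column

# Original Hispanic count plus everyone recoded to Hispanic.
hispanic_check = edb_hispanic + recoded_to_hispanic + missing_to_hispanic

# Original A/PI count plus new A/PI recodes, minus A/PI recoded to Hispanic.
api_check = edb_api + recoded_to_api + missing_to_api - api_to_hispanic

print("Hispanic difference:", hispanic_check - new_hispanic)
print("A/PI difference:", api_check - new_api)
```

Under these assumptions, both differences come out to zero, so the NEWRACE frequencies in Table 2.3 agree with the recode totals in Tables 2.4 and 2.5.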


2.5 Geocoding Beneficiary Addresses to Link SES Data from the Census to the Beneficiaries in the EDB

Geocoding refers to the process of assigning a code number to each Medicare beneficiary's address that allows it to be linked to the U.S. Census data that describe characteristics of the beneficiary's place of residence. The primary reason to geocode the addresses of Medicare beneficiaries in the EDB is to enable the association of geographic-based U.S. Census measures of socioeconomic status (SES) with the beneficiaries, because the EDB currently contains no such measures. While U.S. Census SES measures are not individual-level measures, they can be aggregated to specified geographic units, such as the census block, block group, tract, county, or state, that are associated with every beneficiary. We wanted to geocode beneficiary addresses so we could use the socioeconomic characteristics of their neighborhood (block group) to impute their SES. Examples of the SES characteristics from the Census that we chose to associate with Medicare beneficiaries were the median household income, the percentage of the population unemployed, the median value of owner-occupied homes, and the percentage of the population below the federally defined poverty level. Such characteristics can be used individually to examine the effects of SES or be combined in some way to more fully represent the concept of SES. As was discussed earlier, one of the objectives of this project was to create a multi-component measure of SES. The details of Census geography and related data elements are described more fully in the U.S. Census Bureau's Geographic Area Reference Manual located online at http://www.census.gov/geo/www/garm.html.


2.5.1 Address Cleaning

In order to link the beneficiaries in the EDB to the Census information available for their residential areas, the two sets of records must share a common identifier. The U.S. Census data are identified by Federal Information Processing Standard (FIPS) codes that can identify areas as small as blocks and block groups for the SES data in which we were interested. The beneficiary's residential area, on the other hand, is identified by an address. We needed some mechanism for efficiently translating the addresses in the EDB to FIPS codes that corresponded to those in the Census. We obtained a computer database product from GeoLytics Incorporated of East Brunswick, New Jersey (GeoCode program 2003, Version 1.02) that was promoted by the manufacturer as being able to correctly assign FIPS codes, down to the level of Census blocks, to addresses that were read into it.
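To illustrate how a block-level FIPS code also identifies the block group, the following Python sketch splits a block identifier into its parts. It assumes the 15-digit layout used for Census 2000 blocks (2-digit state, 3-digit county, 6-digit tract, 4-digit block, with the first block digit giving the block group). The function name and the sample identifier are hypothetical, and the exact format of the GeoCode program's output is not specified in this report.

```python
def split_block_fips(block_id: str) -> dict:
    """Split a 15-digit census block identifier into its FIPS components.

    Assumed layout: state (2) + county (3) + tract (6) + block (4).
    """
    if len(block_id) != 15 or not block_id.isdigit():
        raise ValueError("expected a 15-digit block identifier")
    return {
        "state": block_id[:2],
        "county": block_id[2:5],
        "tract": block_id[5:11],
        "block": block_id[11:15],
        # The block group is the first digit of the block number, so the
        # block group identifier is the first 12 digits of the block ID.
        "block_group_id": block_id[:12],
    }


# Hypothetical identifier, used only to show the slicing.
parts = split_block_fips("340230001001000")
print(parts["block_group_id"])
```

Truncating to the block group in this way is what allows a block-level match to be linked to block-group-level Census SES measures.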

Address information on Medicare beneficiaries is stored in the EDB in six address fields, each with a length of 22 characters. These address fields are generic, and labeled ADDRESS1, ADDRESS2, etc., and thus there is the potential for great variation in the type and order of information contained within the address fields. Upon examination, it appeared that the six fields were simply filled from left to right with whatever information had been collected about the beneficiary's address. The one exception was the beneficiary's zip code, which was always stored in the RESZIP field. However, the GeoLytics GeoCode program product requires that the beneficiaries' address input files be formatted in the following way:

STREET, CITY, STATE ZIP

The GeoCode program requires that STREET contains the street number and street name, separated by a space, with street name followed by a comma; then city followed by a comma, and then the two-letter state postal abbreviation code, a space, and the five digit zip code. It was a challenge and extremely time-consuming to extract, validate, and format these four pieces of information from the EDB address fields so they could be used as input for the GeoCode program. To meet this challenge, we developed the following procedures to apply to the EDB records:

  1. Identify, for each beneficiary, what information is contained in each EDB address field.
  2. Extract the necessary information from the address fields, and create separate street, city, state, and zip code variables.
  3. Verify that street, city, and state variables contain the information they are supposed to, check that the information is in the correct format, and, if not, put it in the correct format.
  4. Output a text file (an ASCII text file, *.txt) in the proper format required as input for the GeoCode program.
  5. Run the GeoCode program.
    1. Input the address text file.
    2. Output.
      1. A text file summarizing the results of the address matching program.
      2. A database file (*.dbf) containing block IDs, error and accuracy codes, and other information related to the matched addresses.
  6. Import the database file (*.dbf) into SAS, which transforms the *.dbf file to a *.sas7bdat file.
  7. Merge the full transformed address file back onto the EDB records. This step adds a US Census-based geographic identifier (a string of FIPS codes) to each person-level beneficiary record.

This process was used to geocode the 10 separate segments of the unloaded EDB. The final step in the process allows the EDB to be linked to Census data files using the block group FIPS code that is common to both.
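Step 4 of the procedure above, writing addresses in the STREET, CITY, STATE ZIP layout, can be sketched as follows. This is an illustrative Python version, not the SAS program used in the project; the function names and the sample record are hypothetical.

```python
def format_geocode_line(street: str, city: str, state: str, zip5: str) -> str:
    """Return one address in the 'STREET, CITY, STATE ZIP' input layout."""
    return f"{street.strip()}, {city.strip()}, {state.strip().upper()} {zip5.strip()}"


def build_input_text(records) -> str:
    """Join formatted addresses, one per line, for an ASCII input file."""
    return "\n".join(
        format_geocode_line(r["STREET"], r["CITY"], r["STATE"], r["ZIP"])
        for r in records
    )


# Hypothetical cleaned record.
sample = [{"STREET": "123 MAIN ST", "CITY": "RALEIGH", "STATE": "nc", "ZIP": "27601"}]
print(build_input_text(sample))
```

The resulting text could then be written to a *.txt file and supplied to the address-matching program.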

Time and resources did not permit us to identify and perform all of the necessary address preparation and verification activities manually on all 41 million-plus beneficiaries in the EDB. Instead, we used a random sample of addresses to identify incorrect patterns present in the beneficiaries' addresses in the EDB. Specifically, to identify the various patterns exhibited in the EDB address fields, we took a smaller batch of EDB records: those corresponding to the 830,728 beneficiaries who responded to the CAHPS surveys that we used earlier to develop the algorithm to improve on the EDB race/ethnicity coding. We developed SAS programs to extract, reformat, and validate the address information we needed, and then tested the performance of the GeoCode CD program. The following are the steps we performed to get the addresses from the EDB into good enough shape to run through the GeoCode program.

Identify and extract the information in each address field. EDB address fields could potentially follow many different patterns, and some did contain a good deal of superfluous or invalid information. Fortunately, the majority of records did follow a standard pattern:

  1. ADDRESS1 contained the beneficiary's street address, both the street number and the street name. In some cases, this field also contained a direction (e.g., "East 1st Street," or "E 1st Street," or "1st Street E") and/or an apartment number (see footnote 9).
  2. ADDRESS2 contained either the beneficiary's city and state of residence or the beneficiary's apartment number.
  3. ADDRESS3, in cases where the ADDRESS2 field contained the apartment number or the like, contained the beneficiary's city and state of residence.
  4. The last field with non-missing data typically contained the city and state of residence. So, in most cases, address fields 4, 5, and 6 were blank; a lesser number of cases had a blank for address field 3 as well.

The SAS program we wrote set the variable STREET equal to the EDB address field that should contain the street address (typically ADDRESS1). It also extracted separate CITY and STATE variables from the EDB address field that contained the city and state.

The RESZIP field in the EDB data contains the 9-digit Zip code. The SAS program dropped the last four digits of the EDB RESZIP variable, and created a new variable with the 5-digit Zip code (ZIP).
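The extraction just described can be sketched in Python as follows. This is a simplified illustration, not the project's SAS code. It assumes the standard pattern (street address in the first non-blank field, city and state in the last non-blank field) and further assumes that the city/state field ends with a two-letter state code, which the report does not state explicitly. The function name and sample values are hypothetical.

```python
import re

# Assumed shape of the city/state field: city text, then a two-letter code.
CITY_STATE = re.compile(r"^(.*\S)\s+([A-Za-z]{2})$")


def extract_address_parts(fields, reszip):
    """Derive STREET, CITY, STATE, and ZIP from six address fields and RESZIP.

    Returns None when the record does not follow the standard pattern.
    """
    filled = [f.strip() for f in fields if f and f.strip()]
    if len(filled) < 2:
        return None
    match = CITY_STATE.match(filled[-1])   # last non-blank field
    if match is None:
        return None
    return {
        "STREET": filled[0],               # typically ADDRESS1
        "CITY": match.group(1).rstrip(","),
        "STATE": match.group(2).upper(),
        "ZIP": reszip.strip()[:5],         # drop the last four digits
    }


# Hypothetical record with an apartment number in ADDRESS2.
rec = extract_address_parts(
    ["123 MAIN ST", "APT 4", "RALEIGH NC", "", "", ""], "276011234"
)
print(rec)
```

Records that return None here would need the kind of pattern-specific handling that the report describes for non-standard address fields.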

Verify the values and formats of STREET, CITY, and STATE. The first part of this step was completed prior to running addresses through the GeoCode program search engine. To verify that STREET and STATE contain the correct data, the SAS program checked for two things:

  1. That the string of characters contained in the new variable, STREET, actually started with a number. This does not provide 100 percent verification, as it is possible for the string of characters contained in the variable STREET to start with a number, but not be an actual street address. However, this step does help ensure that STREET contains a street address.
  2. That the string we identified as the state of residence (the new variable, STATE) was a valid two letter state postal abbreviation.
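The two checks above can be sketched in Python as follows. This is illustrative only; the project used SAS. The set of valid abbreviations shown here (50 states, the District of Columbia, and Puerto Rico) is an assumption, since the report does not list the codes it accepted.

```python
import re

# Assumed list of accepted two-letter postal abbreviations.
VALID_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV "
    "WI WY DC PR".split()
)


def street_starts_with_number(street: str) -> bool:
    """True when the first non-blank character of STREET is a digit 0-9."""
    return re.match(r"\s*[0-9]", street or "") is not None


def is_valid_state(state: str) -> bool:
    """True when STATE is one of the accepted two-letter abbreviations."""
    return (state or "").strip().upper() in VALID_STATES


print(street_starts_with_number("123 MAIN ST"), is_valid_state("nc"))
```

As the text notes, the leading-digit test is only a screen: a value such as "4 SEASONS NURSING HOME" would pass it without being a street address.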

At this point, the STATE and ZIP variables were considered finalized. The remainder of the SAS algorithm focused on cleaning the STREET variable and ensuring that it was in the proper format. Before cleaning STREET, we dropped any cases where the GeoCode program would be unable to make a match and for which we could not obtain a match simply by reformatting the data. We dropped addresses where:

  1. The street address was missing.
  2. The beneficiary's state was invalid (as indicated by an invalid two-letter state postal abbreviation, which was often a foreign country), or the beneficiary lived in Puerto Rico (see footnote 10).
  3. The beneficiary's address was a rural route, an RFD, a P.O. Box, or a Box number.

For the remaining cases, CITY appeared to be relatively clean, and we did not attempt to reformat or validate that particular variable subsequent to dropping the cases listed above. Approximately 12.5 percent of the EDB records were dropped by this point, leaving us with about 87.5 percent of the records to which we applied further cleaning algorithms.
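The three drop rules listed above can be sketched as a single Python function. This is a hedged illustration, not the project's SAS code. The regular expression for rural routes and box numbers is an assumption about how such addresses are written, and the small `valid` set is a stand-in for a full list of state abbreviations.

```python
import re

# Assumed spellings of rural route, RFD, P.O. Box, and Box number addresses.
NON_STREET = re.compile(
    r"^\s*(?:RR|RURAL\s+(?:ROUTE|RT)|RFD|P\.?\s*O\.?\s*BOX|BOX)(?![A-Z])",
    re.IGNORECASE,
)


def drop_reason(street, state, valid_states):
    """Return why a record would be dropped, or None if it is kept."""
    if not street or not street.strip():
        return "missing street"
    code = (state or "").strip().upper()
    if code == "PR":
        return "Puerto Rico"
    if code not in valid_states:
        return "invalid state"
    if NON_STREET.match(street):
        return "rural route or box"
    return None


valid = {"NC", "NY"}   # stand-in for the full list of abbreviations
for street, state in [("PO BOX 12", "NC"), ("123 BOX ELDER RD", "NC")]:
    print(street, "->", drop_reason(street, state, valid))
```

Anchoring the pattern at the start of the field keeps ordinary street names that merely contain the word "BOX" from being dropped.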

At this point, we began an iterative process of running small samples of the Medicare CAHPS survey addresses through the GeoCode address-matching process, identifying format-related problems in the street address field, and developing SAS code to repair the problems. Based on this testing process, we developed a series of six "fixes" (see footnote 11), all of which were targeted to reformat specific anomalies that occurred regularly in the street address field. These fixes made repairs related to three basic elements of a street address that caused the address-matching program to fail to find a match for what was in fact a valid address:

  1. Street address fields sometimes contained apartment, suite, lot, or unit numbers. While these are valid for mailing, the GeoCode program will return an error (i.e., "street not found") on an address containing one of these numbers. The first "fix" applied to the EDB address removed the apartment number (or analogue) from the STREET field. This fix cleared the path for the subsequent five fixes that were applied to the STREET field.
  2. In cases where the street NAME was actually a number (e.g., 25th Street, 1st Avenue, etc.), the GeoCode program failed to find a valid match for the street if the suffix was missing from the numbered street. The suffix was almost always missing in the EDB address fields. We tested the suffix problem manually, and found that the simple addition of a suffix could, in many cases, turn a null match into an exact match. Numerical street names appear in a variety of patterns in the STREET variable, and four out of the five remaining fixes were designed to detect these patterns and make the appropriate changes.
  3. In some records, the street address contained what appeared to be a double street number - one 2- or 3-digit number, followed by a space, then another 2- or 3-digit number. We discovered that in some places, particularly Queens, NY, the space needs to be replaced by a dash. In other places, however, it is unclear if the double number with a space is valid, or if the space should be deleted. In those cases, the double number was left as is.
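The first two kinds of repair can be sketched in Python as follows. This is a simplified illustration of the idea, not a reconstruction of the six SAS fixes. The unit designators, the street-type list, and the single numbered-street pattern are all assumptions, and the Queens dash repair is not covered.

```python
import re

# Assumed trailing unit designators (apartment, unit, suite, lot, '#').
UNIT = re.compile(
    r"\s+(?:APT|APARTMENT|UNIT|STE|SUITE|LOT|#)\.?\s*(?:\d+[A-Z]?|[A-Z]\d*)\s*$",
    re.IGNORECASE,
)
# Assumed pattern: house number, optional direction, bare number, street type.
NUMBERED = re.compile(
    r"^(\d+\s+(?:[NSEW]\s+)?)(\d+)"
    r"(\s+(?:ST|STREET|AVE|AVENUE|RD|ROAD|BLVD|DR|PL|CT)\b.*)$",
    re.IGNORECASE,
)


def ordinal(n: int) -> str:
    """Return n with its ordinal suffix, e.g. 1ST, 22ND, 13TH."""
    if 10 <= n % 100 <= 20:
        suffix = "TH"
    else:
        suffix = {1: "ST", 2: "ND", 3: "RD"}.get(n % 10, "TH")
    return f"{n}{suffix}"


def fix_street(street: str) -> str:
    """Remove a trailing unit number, then add a missing ordinal suffix."""
    street = UNIT.sub("", street.strip())
    match = NUMBERED.match(street)
    if match:
        street = match.group(1) + ordinal(int(match.group(2))) + match.group(3)
    return street


print(fix_street("123 E 25 ST APT 4B"))
```

Removing the unit number first mirrors the order described above, in which the apartment fix cleared the way for the numbered-street fixes.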

For each fix, the SAS program outputs a text file listing, for each "fixed" record, the Medicare beneficiary's HIC number, the observation number, the address in its original, "pre-fixed" format, the pattern of the new format, and the actual "fixed" address. This allowed us to check that the fix actually did what we expected it to, and it provides a record of the difference between the old addresses and the new addresses.

Output corrected addresses. The SAS program uses the PUT statement in conjunction with the FILE statement to output a single ASCII text file (*.txt) of addresses in the STREET, CITY, STATE ZIP format. This file contains all of the addresses that have been cleaned (100 percent of the records that were run through the fixes, or about 87.5 percent of the total number of beneficiary records). During testing we started with a CAHPS-matched EDB file with 830,728 records, which was reduced to 760,961 after the SAS program was run.


7. This excludes 266 beneficiaries who were originally coded as missing in the EDB but are now coded as Hispanic. Beneficiaries who were already coded as Hispanic in the EDB are also not included in this total.

8. This excludes 68 beneficiaries who were originally coded as missing in the EDB but are now coded as A/PI. Beneficiaries who were already coded as A/PI in the EDB are also not included in this total.

9. There are also several analogues to apartment number that appear in address fields, including suite number, lot number (in the case of mobile home parks), unit number, etc.

10. The GeoCode program does not match addresses in Puerto Rico.

11. The "fixes" were numbered according to the order in which they were developed. However, the order in which they were applied in the SAS programs does not follow this numbering. Some fixes developed later (Fix 5, for example) had to be applied before earlier fixes.

