U.S. Department of Health and Human Services www.hhs.gov
Agency for Healthcare Research and Quality www.ahrq.gov

2. Methods and Data (continued)

2.5.2 Running the GeoCode Program

In testing the GeoCode program, we found that its performance was erratic, and the technical support staff at GeoLytics were unable to explain the variation. The primary problem was a lookup error, "Failed to open data member" (eFOM), which was returned for between 2 and 6 percent of the addresses we tested. On examination, we could find no syntax errors that would have prevented these records from being coded. However, when we ran the addresses that received the eFOM error code through the GeoCode CD program a second time by themselves, they matched at a 100 percent success rate.
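The two-pass workaround described above can be sketched in Python. This is only an illustration of the control flow, not the procedure as it was actually run: GeoCode CD is a desktop program, and `geocode_batch` is a hypothetical stand-in for one run of it that returns a result code for each input address.

```python
from typing import Callable, Dict, List, Tuple

RETRY_CODE = "eFOM"  # "Failed to open data member" (Table 2.7)


def geocode_with_retry(
    addresses: List[str],
    geocode_batch: Callable[[List[str]], Dict[str, str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Run one full pass, then rerun only the eFOM records by themselves.

    geocode_batch is HYPOTHETICAL: it represents a single run of the
    geocoder and returns {address: result_code}. Results are keyed by
    address, so duplicate input addresses share one result.
    """
    results = geocode_batch(addresses)

    # Collect the records that came back with the lookup error.
    retry = [a for a in addresses if results.get(a) == RETRY_CODE]
    if retry:
        # Second iteration: the failed records are run on their own.
        results.update(geocode_batch(retry))

    # Anything still flagged after the second pass is reported separately.
    still_failed = [a for a in retry if results.get(a) == RETRY_CODE]
    return results, still_failed
```

Keeping the second pass restricted to the eFOM records mirrors the observation that those records matched when run by themselves.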

The GeoLytics GeoCode CD program lets the user choose among options that trade completeness of address coverage against processing speed. To obtain maximum coverage, and thereby match as many addresses as possible, we ran the program with the following options turned on:

  1. Allow phonetic match of state name.
    – The geocoder phonetically matches the full state name in an address (but not an abbreviation).
  2. Allow place-based ZIP code match.
    – If a street is not found in a ZIP, the geocoder scans other ZIP codes associated with the place (typically a city or a town) for a match.
  3. Allow phonetic match of street name.
    – The geocoder uses a phonetic match for street names (e.g., an input address with the street name "Maine St." is considered a match with Main St. in the database).
  4. Disregard parity for address match.
    – Normally, the geocoder matches even/odd addresses with even/odd address ranges. This option disregards this practice.
  5. Allow closest address match.
    – The geocoder finds the closest address range to match the house number (rather than an exact one).
  6. Allow fuzzy street type match.
    – The geocoder will match addresses with the same street name, even if the street types are different (e.g., Greenwood Drive is considered a match with Greenwood Road).
  7. Geocode no matter what.
    – If it cannot find an exact match, the geocoder assigns the address the census coordinates of the center of its ZIP code (ZIP centroid; see footnote 12) or the center of its state (state centroid).

The GeoCode program outputs two files as it runs: a text file (*.txt) summarizing the geocoder performance, the accuracy codes, and the error codes, and a database file (*.dbf) containing the fields selected by the user. For each database file, we selected the following fields (see footnote 13):

Field      Description
SEQNO      Sequential Number
ADDRESS    Input Address
ACCURACY   Accuracy and Error Codes
BLOCK      Matched Block Code
PLACE      Place FIPS Code
MCD        MCD (Minor Civil Division) Code
STATE      State FIPS Code
ZIP        ZIP Code for 2003
PLACENAME  Matched Place Name
AreaKey    Block Group Code

The sequential number field contains a number between 1 and n, where n is the total number of records processed by the program. The input address is the address in STREET, CITY, STATE ZIP format constructed and output by the address-cleaning SAS program. Accuracy and error codes are explained below. The matched block code is a string of 15 digits that indicates, in order, an individual's state (2-digit FIPS code), county (3-digit FIPS code), census tract (6-digit FIPS code), and block (4-digit FIPS code, whose first digit indicates the block group). The full string is a unique block-level identifier, so all persons living in the same block have the same matched block code. Place indicates the city or town FIPS code, and MCD indicates the Minor Civil Division code. The area key is the first 12 digits of the matched block code and is a unique block group-level identifier.
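As an illustration of the layout just described, the following Python sketch splits a 15-digit matched block code into its parts and derives the 12-digit area key. The function and field names are our own, and the example code value used in the usage note is made up for illustration.

```python
from typing import NamedTuple


class BlockCode(NamedTuple):
    state: str        # 2-digit state FIPS code
    county: str       # 3-digit county FIPS code
    tract: str        # 6-digit census tract code
    block: str        # 4-digit block code
    block_group: str  # first digit of the block code
    area_key: str     # first 12 digits: block group-level identifier


def parse_block_code(code: str) -> BlockCode:
    """Split a 15-digit matched block code into its components."""
    if len(code) != 15 or not code.isdigit():
        raise ValueError(f"expected 15 digits, got {code!r}")
    return BlockCode(
        state=code[0:2],
        county=code[2:5],
        tract=code[5:11],
        block=code[11:15],
        block_group=code[11],
        area_key=code[:12],
    )
```

For example, `parse_block_code("371830501001003")` yields state `37`, county `183`, tract `050100`, block `1003`, block group `1`, and area key `371830501001`. Keeping the codes as strings preserves leading zeros, which would be lost if the field were read as a number.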


2.5.3 Summary of GeoCode program accuracy codes

Failure details. The geocoding process can fail for a number of reasons, including setup or programmatic errors, a missing database entry, or an invalid input address. Failures fall under two general categories: syntax/lookup errors and programmatic/setup errors. Failed GeoCode results are indicated by error codes, which are summarized in Tables 2.6 and 2.7.

Table 2.6 GeoCode program syntax and lookup errors

Error Code  Error Message
eIHN        Missing or invalid house number*
eISt        Missing or invalid street name*
eITy        Missing or invalid street type
eINa        Missing or invalid city name
eISN        Missing or invalid state name/abbrev*
eIZI        Missing or invalid ZIP code*
eIAd        Incomplete or malformed address*
eUAF        Unknown address format
eMiA        Missing address
eNZI        Failed to lookup ZIP code
eANF        Address not found
eSNF        Street not found

*Errors encountered while geocoding EDB addresses.

Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.

Table 2.7 GeoCode program programmatic and setup errors

Error Code  Error Message
eGNO        GeoCode has not been opened
eFOD        Failed to open database
eFOF        Failed to open data file NAME
eFOM        Failed to open data member NAME*
eMiF        Missing file NAME
eGOF        General open failure, file NAME
eFA1        Failed to allocate memory
eNAS        No address data for state NAME*
eNSZ        No data for state-zip NAME
eSSO        String size overflow
eOKI        Output file kind invalid NAME
eOF1        Output failure NAME
eOLI        Output field list invalid NAME

*Errors encountered while geocoding EDB addresses.

Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.

Success details. The GeoCode program also indicates how successful it has been in matching addresses to FIPS codes. In addition to indicating accurate or exact matches, it indicates what kinds of "adjustments" it made to match the address to a place with a FIPS code. Successful match details are presented in Table 2.8. Some successful results generate accuracy codes indicating that the geocoder could code the address only by using some of the fallback matching options described above. It is worth noting that GeoCode CD may use more than one of these fallback matching options to match a particular address.

Table 2.8 GeoCode program accuracy codes and messages

Accuracy Code  Accuracy Message
aNP1           Place not found*
aNPa           Address match with no parity*
aCAd           Closest address match*
aFTy           Fuzzy street type match*
aPhM           Phonetic match*
aNMa           No match found
aNMP           No match performed
aPBZ           Place-based ZIP match*
aSpC           Spelling corrected*
aStC           State centroid used*
aSEn           Street end used*
aZIC           ZIP centroid used*
aInD           Inaccurate direction*

*Accuracy options encountered while geocoding EDB addresses.

Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.
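Because a single address can carry more than one accuracy code, it can help to decode the ACCURACY field programmatically. The Python sketch below uses the codes and messages from Table 2.8. It rests on an assumption that this report does not confirm: that multiple codes in the ACCURACY field are separated by whitespace. The helper names are our own.

```python
from typing import List

# Accuracy codes and messages as listed in Table 2.8.
ACCURACY_MESSAGES = {
    "aNP1": "Place not found",
    "aNPa": "Address match with no parity",
    "aCAd": "Closest address match",
    "aFTy": "Fuzzy street type match",
    "aPhM": "Phonetic match",
    "aNMa": "No match found",
    "aNMP": "No match performed",
    "aPBZ": "Place-based ZIP match",
    "aSpC": "Spelling corrected",
    "aStC": "State centroid used",
    "aSEn": "Street end used",
    "aZIC": "ZIP centroid used",
    "aInD": "Inaccurate direction",
}

# Codes meaning the coordinates are an area center, not the address itself.
CENTROID_CODES = {"aStC", "aZIC"}


def split_codes(accuracy_field: str) -> List[str]:
    """ASSUMPTION: codes in the ACCURACY field are whitespace-separated."""
    return accuracy_field.split()


def describe(accuracy_field: str) -> List[str]:
    """Translate each code in the field into its Table 2.8 message."""
    return [
        ACCURACY_MESSAGES.get(code, f"unrecognized code {code}")
        for code in split_codes(accuracy_field)
    ]


def is_centroid_match(accuracy_field: str) -> bool:
    """True if the address was placed at a ZIP or state centroid."""
    return any(code in CENTROID_CODES for code in split_codes(accuracy_field))
```

Flagging centroid matches separately is useful because such records are counted as successfully geocoded even though they are not matched at the block group level.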

Test results using the GeoCode program on the CAHPS sample addresses. Table 2.9 below summarizes the error and accuracy results from the CAHPS sample test file. It indicates that 8.4 percent of the 830,728 CAHPS sample addresses taken from the EDB were dropped because the GeoCode program could not code them, very often because they contained a box number instead of a street address. Of the remaining 760,961 addresses (91.6 percent of the original total), all but 0.4 percent were successfully geocoded. Overall, the process we followed in this test matched 91.2 percent of the EDB addresses to Census block group-level FIPS codes.

Table 2.9 Summary of GeoCode error and accuracy results for the CAHPS test file

CAHPS/EDB Test File

Results                                                   Number      Percent
Original number of records                                830,728     100.0
Number of records dropped (uncodeable)                    69,767      8.4
Addresses processed                                       760,961     91.6
...Successfully geocoded (first iteration)                719,220     94.5
...Successfully geocoded eFOM records (second iteration)  38,322      5.0
...Total failed                                           3,419       0.4
GeoCode success rate                                      757,542     99.6
Percent total test file records matched                               91.2
Success details*
Accurate Match                                            477,746     62.8
Place Not Found                                           77,273      10.2
Address match with no parity                              5,931       0.8
Closest address match                                     37,984      5.0
Fuzzy street type match                                   86,701      11.4
Phonetic match                                            37,847      5.0
Place-based ZIP match                                     16,519      2.2
Spelling corrected                                        0           0.0
State centroid used                                       905         0.1
Street end used                                           3,871       0.5
ZIP centroid used                                         63,031      8.3
Inaccurate direction                                      20,525      2.7
Failure details
Failed due to syntax error                                3,418       0.4
...Missing or invalid house number                        3,367       0.4
...Missing or invalid state name/abbreviation             0           0.0
...Missing or invalid ZIP code                            47          0.0
...Incomplete or malformed address                        4           0.0
Failed due to lookup error                                38,323      5.0
...Failed to open data member (eFOM)                      38,322      5.0
...No address data for state                              1           0.0

*Note: Success detail categories reflect distribution of accuracy codes. These codes are NOT mutually exclusive. Some addresses can have up to four accuracy codes associated with them.

Source: Result of running GeoCode CD program 2003 Version 1.02 on addresses from Medicare EDB from mid-2003 for respondents to the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002.
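The summary rows of Table 2.9 use two different denominators: the original number of records for the dropped, processed, and overall matched percentages, and the number of addresses processed for the geocoding success and failure percentages. The following Python sketch recomputes those rows from the four counts in the table. The function name and dictionary keys are our own.

```python
def summarize(original, dropped, first_pass, second_pass):
    """Recompute the summary rows of a GeoCode results table."""
    processed = original - dropped
    geocoded = first_pass + second_pass
    failed = processed - geocoded

    def pct(part, whole):
        return round(100 * part / whole, 1)

    return {
        "processed": processed,
        "geocoded": geocoded,
        "failed": failed,
        # Denominator: all original records.
        "pct_dropped": pct(dropped, original),
        "pct_processed": pct(processed, original),
        "pct_matched_of_original": pct(geocoded, original),
        # Denominator: addresses actually run through the geocoder.
        "pct_geocoded_of_processed": pct(geocoded, processed),
        "pct_failed_of_processed": pct(failed, processed),
    }


# Counts taken from Table 2.9 (CAHPS/EDB test file).
cahps = summarize(
    original=830_728, dropped=69_767, first_pass=719_220, second_pass=38_322
)
```

With the Table 2.9 counts, `cahps` reproduces the published figures: 760,961 processed, 757,542 geocoded, 3,419 failed, a 99.6 percent success rate among processed addresses, and 91.2 percent of all test file records matched.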


2.5.4 Application of the GeoCode Program Processing to the Full EDB

We obtained the 10 segments of the full unloaded EDB from CMS in mid-2003. Because each segment contained more than four million beneficiary records, we processed each segment separately: we first extracted the addresses and other necessary identification variables from the EDB, then corrected the addresses using the SAS programs we developed, and finally ran them through the GeoCode program. The program took from 16 to 36 hours to process and match the records in each segment. As in the test on the CAHPS sample addresses, it was necessary to rerun the addresses that returned an eFOM error on the first iteration, and virtually all of them were successfully matched on the second iteration.

Run EDB segments through the GeoCode program. The results of the GeoCode program processing are summarized in Table 2.10 for all 10 segments of the unloaded EDB combined. The results were very similar across the 10 segments. Overall, 87.5 percent of the 41,742,407 addresses of Medicare beneficiaries were processed through the GeoCode program. Of the addresses processed, 99.2 percent (36,223,053) were successfully matched to a FIPS code that included the block group, which corresponds to 86.8 percent of all EDB addresses. As Table 2.10 shows, 61 percent of the matches were exact matches to the input addresses.

Import GeoCode output files and merge with EDB records. We used PROC IMPORT in SAS 8.2 to convert the database (*.dbf) files produced by the GeoCode program into SAS data files (*.sas7bdat). Using the ADDRESS field that we prepared as input to the GeoCode program as the key common to the EDB and the GeoCode output, we merged the output files (which contain Census-based geographic identifiers, including the AreaKey string that identifies block groups) onto the EDB records.
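The merge step can be illustrated with a short Python sketch. The actual work was done in SAS, so this is not the code that was used. It assumes both inputs have already been read into lists of dictionaries. The field names ADDRESS, AreaKey, and ACCURACY come from the output fields listed earlier, while `BENE_ID` in the usage note is a hypothetical identifier.

```python
from typing import Dict, List, Optional


def merge_on_address(
    edb_records: List[Dict[str, str]],
    geocode_rows: List[Dict[str, str]],
) -> List[Dict[str, Optional[str]]]:
    """Attach geocoder output to EDB records using ADDRESS as the key.

    Every EDB record is kept (a left join). Records whose address has no
    geocoder output get None for the geographic fields.
    """
    # Index the geocoder output by the cleaned input address.
    by_address = {row["ADDRESS"]: row for row in geocode_rows}

    merged = []
    for record in edb_records:
        geo = by_address.get(record["ADDRESS"], {})
        out: Dict[str, Optional[str]] = dict(record)
        out["AreaKey"] = geo.get("AreaKey")
        out["ACCURACY"] = geo.get("ACCURACY")
        merged.append(out)
    return merged
```

For example, merging `[{"BENE_ID": "A", "ADDRESS": "1 MAIN ST, X, NC 27000"}]` with a geocoder row for the same address copies that row's AreaKey onto beneficiary A. Because the key is the address string, beneficiaries who share an identical cleaned address receive the same geographic identifiers.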


2.5.5 Results of Geo-coding the Sample of 1.96 Million Medicare Beneficiaries

The sample of 1.96 million Medicare fee-for-service beneficiaries is a subset of the beneficiaries geocoded from the mid-2003 EDB. The geocoding results for the sample are presented in Table 2.11. The table indicates that 81 percent (1,588,607 of 1,960,121) of the addresses for sample members were successfully geocoded, but this figure includes matches to a ZIP code or state centroid, which were used when there was no other way to match the input address to a Census-listed address. We also reran through the GeoCode CD program the unmatched addresses from the mid-2003 EDB, as well as addresses that had changed since mid-2003, in the hope of geocoding sample members more completely and correctly.

We know from analyses performed in sub-task one of this task order that most of the state centroid matches (4,090) are not true matches at all: the GeoCode CD program forced them to the state centroid because the addresses are foreign. The same may be true of some of the ZIP centroid matches (159,217). Based on our validation of address block group matching against the Census, however, we are confident that the true match rate at the block group level for the sample is most likely at least 75 percent.

Table 2.10 Summary of GeoCode error and accuracy codes for the 10 segments of the EDB combined

Results                                                   Sums        Percent
Original number of records                                41,742,407  100.0
Number of records dropped (uncodeable)                    5,223,766   12.5
Addresses processed                                       36,518,641  87.5
...Successfully geocoded (first iteration)                35,108,329  96.1
...Successfully geocoded eFOM records (second iteration)  1,114,724   3.1
...Total failed                                           295,588     0.8
GeoCode success rate                                      36,223,053  99.2
Percent total EDB records matched                                     86.8
Success details*
Accurate Match                                            20,028,633  61.0
Place Not Found                                           3,216,868   9.8
Address match with no parity                              281,554     0.9
Closest address match                                     1,821,893   5.5
Fuzzy street type match                                   3,919,792   11.9
Phonetic match                                            1,752,858   5.3
Place-based ZIP match                                     799,836     2.4
Spelling corrected                                        10          0.0
State centroid used                                       47,252      0.1
Street end used                                           181,270     0.6
ZIP centroid used                                         2,972,274   9.0
Inaccurate direction                                      1,027,377   3.1
Failure details
Failed due to syntax error                                262,176     0.8
...Missing or invalid house number                        175,561     0.5
...Missing or invalid state name/abbreviation             4           0.0
...Missing or invalid ZIP code                            86,335      0.3
...Incomplete or malformed address                        276         0.0
Failed due to lookup error                                1,022,267   3.4
...Failed to open data member (eFOM)                      1,018,483   3.4
...No address data for state                              3,784       0.0

*Note: Success detail categories reflect distribution of accuracy codes. These codes are NOT mutually exclusive. Some addresses can have up to four accuracy codes associated with them.

Source: Result of running GeoCode CD program 2003 Version 1.02 on addresses from Medicare EDB from mid-2003 for respondents to the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002.

Table 2.11 Success with Geocoding of the Medicare Beneficiaries Included in the RTI Sample of 1.96 Million

Sample                  Number
Total Sample            1,960,121
Successfully geocoded   1,588,607
GeoCoding Success Rate  81.0%

Success Details
Exact Match             920,390
Other Accuracy Code     504,910
Zip Centroid            159,217
State Centroid          4,090

Source: Result for sample of 1.96 million of running GeoCode CD program 2003, Version 1.02 on addresses from Medicare EDB from mid-2003.
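The figures in Table 2.11 can be cross-checked with a few lines of Python. The sketch below confirms that the four success detail categories sum to the number successfully geocoded, recomputes the success rate, and shows what share of the successful results are centroid assignments. The variable names are our own.

```python
# Counts taken from Table 2.11.
sample_total = 1_960_121
success_details = {
    "Exact Match": 920_390,
    "Other Accuracy Code": 504_910,
    "Zip Centroid": 159_217,
    "State Centroid": 4_090,
}

# The four categories partition the successfully geocoded records.
geocoded = sum(success_details.values())

# Success rate as reported in the table, in percent.
success_rate = round(100 * geocoded / sample_total, 1)

# Share of "successful" results that are centroid assignments
# rather than matches to a specific address, in percent.
centroid_count = success_details["Zip Centroid"] + success_details["State Centroid"]
centroid_share = round(100 * centroid_count / geocoded, 1)
```

Here `geocoded` equals 1,588,607 and `success_rate` equals 81.0, in agreement with the table. This arithmetic does not replace the validation against the Census described above; it only shows how much of the reported success rate depends on centroid assignments.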


12. The centroid of a 5-digit ZIP code area is the balance point of the polygon formed by its boundaries. The centroid is calculated based on the coordinate extremes of the polygon.
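One plausible reading of "calculated based on the coordinate extremes" is the midpoint of the polygon's bounding box. The Python sketch below shows that reading only. The exact calculation used by the GeoCode CD program is not documented here, and the function name is our own.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def extremes_center(vertices: List[Point]) -> Point:
    """Midpoint of the bounding box of a polygon's vertices.

    ASSUMPTION: this interprets "coordinate extremes" as the minimum and
    maximum x and y values. It is not a true area-weighted centroid.
    """
    if not vertices:
        raise ValueError("polygon has no vertices")
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
```

For a rectangle with corners (0, 0), (4, 0), (4, 2), and (0, 2), this returns (2.0, 1.0), which coincides with the true balance point. For irregular shapes the two can differ, and the bounding-box midpoint can even fall outside the polygon.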

13. One field we did not include, the MATCH field, contains the full address that the GeoCode search engine determined to be the closest match to the input address. We had intended to include this field, but during the testing phase we discovered problems with the MATCH field that caused major difficulties when transforming the *.dbf files into SAS files.

