2. Methods and Data (continued)
2.5.2 Running the GeoCode Program
In testing the GeoCode program, we found that its performance was erratic. The primary problem was a lookup error, "failure to open data member" (eFOM), which was returned for between two and six percent of the addresses we tested. On examination, we could find no syntax errors that would have prevented these records from being coded, and the technical support staff at GeoLytics could not explain why the errors occurred. However, when we ran the addresses that received the eFOM error code through the GeoCode CD program a second time by themselves, they were matched at a 100 percent success rate.
The GeoLytics GeoCode CD program lets the user choose among options that trade off completeness of address coverage against processing speed. To obtain maximum coverage, and thereby match as many addresses as possible, we ran the GeoCode CD program with the following options turned on:
- Allow phonetic match of state name.
– The geocoder phonetically matches the full state name in an address (but not an abbreviation).
- Allow place-based ZIP code match.
– If a street is not found in a ZIP, the geocoder scans other ZIP codes associated with the place (typically a city or a town) for a match.
- Allow phonetic match of street name.
– The geocoder uses a phonetic match for street names (e.g., an input address with the street name "Maine St." is considered a match with Main St. in the database).
- Disregard parity for address match.
– Normally, the geocoder matches even/odd addresses with even/odd address ranges. This option disregards this practice.
- Allow closest address match.
– The geocoder finds the closest address range to match the house number (rather than an exact one).
- Allow fuzzy street type match.
– The geocoder will match addresses with the same street name, even if the street types are different (e.g., Greenwood Drive is considered a match with Greenwood Road).
- Geocode no matter what.
– If it cannot find an exact match, the geocoder will assign to the address the census coordinates associated with the center of a ZIP code (ZIP centroid; see footnote 12), or the center of a state (state centroid).
The GeoCode program outputs two files as it runs: a text file (*.txt) summarizing the geocoder performance, the accuracy codes, and the error codes; and a database file (*.dbf) containing the fields selected by the user. For each database file, we selected the following fields (see footnote 13):
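To make the parity and closest-address options concrete, here is a minimal Python sketch of the general idea. It is an illustration only and not GeoLytics's implementation: the function names, the `(low, high)` range representation, and the distance rule are all assumptions made for this example.

```python
def house_number_in_range(house_number, low, high, require_parity=True):
    """Return True if house_number falls inside the address range [low, high].

    With require_parity=True the house number must also share the even/odd
    parity of the low end of the range; with require_parity=False parity is
    ignored (the "disregard parity" behaviour). Hypothetical helper.
    """
    if not (low <= house_number <= high):
        return False
    if require_parity and house_number % 2 != low % 2:
        return False
    return True


def closest_range(house_number, ranges):
    """Pick the (low, high) range nearest to house_number.

    The gap is 0 when the number lies inside a range, otherwise the distance
    to the nearer end of that range. Hypothetical helper.
    """
    def gap(address_range):
        low, high = address_range
        if low <= house_number <= high:
            return 0
        return min(abs(house_number - low), abs(house_number - high))

    return min(ranges, key=gap)
```

With an even-numbered range of 100 to 198, house number 101 is rejected under the default parity rule and accepted once parity is disregarded.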
Field | Description
SEQNO | Sequential Number
ADDRESS | Input Address
ACCURACY | Accuracy and Error Codes
BLOCK | Matched Block Code
PLACE | Place FIPS Code
MCD | MCD (Minor Civil Division) Code
STATE | State FIPS Code
ZIP | ZIP Code for 2003
PLACENAME | Matched Place Name
AreaKey | Block Group Code
The sequential number field contains a number between 1 and n, where n is the total number of records processed by the program. The input address is the address in STREET, CITY, STATE ZIP format constructed and output by the address-cleaning SAS program. Accuracy and error codes are explained below. The matched block code is a string of fifteen digits that gives, in order, the state (2-digit FIPS code), county (3-digit FIPS code), census tract (6-digit code), and block (4-digit code, whose first digit identifies the block group) in which an individual lives. The full string constitutes a unique block-level identifier, so all persons living in the same block have the same matched block code. Place indicates the city or town FIPS code, and MCD indicates the Minor Civil Division code. The area key is the first twelve digits of the matched block code and constitutes a unique block group-level identifier.
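The layout of the fifteen-digit matched block code can be sketched in Python as follows. The function and field names are hypothetical, and the sample value in the usage note is made up for illustration rather than taken from the EDB.

```python
from typing import NamedTuple


class BlockCode(NamedTuple):
    state: str        # 2-digit state FIPS code
    county: str       # 3-digit county FIPS code
    tract: str        # 6-digit census tract code
    block: str        # 4-digit block code
    block_group: str  # first digit of the block code
    area_key: str     # first 12 digits: block group-level identifier


def parse_block_code(code: str) -> BlockCode:
    """Split a 15-digit matched block code into its components."""
    if len(code) != 15 or not code.isdigit():
        raise ValueError(f"expected 15 digits, got {code!r}")
    return BlockCode(
        state=code[0:2],
        county=code[2:5],
        tract=code[5:11],
        block=code[11:15],
        block_group=code[11],
        area_key=code[:12],
    )
```

For the made-up code `371830501001023`, this yields state `37`, county `183`, tract `050100`, block `1023`, block group `1`, and area key `371830501001`.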
2.5.3 Summary of GeoCode program accuracy codes
Failure details. The geocoding process can fail for a number of reasons, including setup or programmatic errors, a missing database entry, or an invalid input address. Failures fall under two general categories: syntax/lookup errors and programmatic/setup errors. Failed GeoCode results are indicated by error codes, which are summarized in Tables 2.6 and 2.7.
Table 2.6 GeoCode program syntax and lookup errors
Error Code | Error Message
eIHN | Missing or invalid house number*
eISt | Missing or invalid street name*
eITy | Missing or invalid street type
eINa | Missing or invalid city name
eISN | Missing or invalid state name/abbrev*
eIZI | Missing or invalid ZIP code*
eIAd | Incomplete or malformed address*
eUAF | Unknown address format
eMiA | Missing address
eNZI | Failed to lookup ZIP code
eANF | Address not found
eSNF | Street not found
*Errors encountered while geocoding EDB addresses.
Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.
Table 2.7 GeoCode program programmatic and setup errors
Error Code | Error Message
eGNO | GeoCode has not been opened
eFOD | Failed to open database
eFOF | Failed to open data file NAME
eFOM | Failed to open data member NAME*
eMiF | Missing file NAME
eGOF | General open failure, file NAME
eFA1 | Failed to allocate memory
eNAS | No address data for state NAME*
eNSZ | No data for state-zip NAME
eSSO | String size overflow
eOKI | Output file kind invalid NAME
eOF1 | Output failure NAME
eOLI | Output field list invalid NAME
*Errors encountered while geocoding EDB addresses.
Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.
Success details. The GeoCode program also indicates how successful it has been in matching addresses to FIPS codes. In addition to indicating accurate or exact matches, it indicates what kinds of "adjustments" it made to match the address to a place with a FIPS code. Successful match details are presented in Table 2.8. Some successful results generate accuracy codes indicating that the geocoder could code the address only by using some of the fallback matching options described above. It is worth noting that GeoCode CD may employ more than one of these fallback matching options to find a match for a particular address.
Table 2.8 GeoCode program accuracy codes and messages
Accuracy | Accuracy Message
aNP1 | Place not found*
aNPa | Address match with no parity*
aCAd | Closest address match*
aFTy | Fuzzy street type match*
aPhM | Phonetic match*
aNMa | No match found
aNMP | No match performed
aPBZ | Place-based ZIP match*
aSpC | Spelling corrected*
aStC | State centroid used*
aSEn | Street end used*
aZIC | ZIP centroid used*
aInD | Inaccurate direction*
*Accuracy options encountered while geocoding EDB addresses.
Source: GeoLytics Incorporated of East Brunswick, New Jersey—GeoCode CD program 2003, Version 1.02.
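Because GeoCode CD may apply more than one fallback option to a single address, tallies of accuracy codes can sum to more than the number of addresses. The Python sketch below shows one way to tally codes from the ACCURACY field. It assumes, purely for illustration, that multiple codes in the field are separated by whitespace; the actual layout of the field is not described in this section. It relies only on the pattern visible in Tables 2.6 through 2.8, where error codes begin with "e" and accuracy codes begin with "a".

```python
from collections import Counter


def split_codes(accuracy_field):
    """Split an ACCURACY value into individual codes.

    Assumption for this sketch: codes are whitespace-separated.
    """
    return accuracy_field.split()


def has_error_code(accuracy_field):
    """True if any code in the field is an error code (prefix 'e')."""
    return any(code.startswith("e") for code in split_codes(accuracy_field))


def tally_codes(accuracy_fields):
    """Count how often each code occurs across all records.

    A record with several codes contributes to several counts, so the
    totals are not mutually exclusive.
    """
    counts = Counter()
    for field in accuracy_fields:
        counts.update(split_codes(field))
    return counts
```

For example, a record carrying both `aFTy` and `aPhM` adds one to each of those two counts.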
Test results using the GeoCode program on the CAHPS sample addresses. Table 2.9 below summarizes the error and accuracy results from the CAHPS sample test file. It indicates that 8.4 percent of the 830,728 CAHPS sample addresses taken from the EDB were dropped because the GeoCode program could not code them, very often because they had a box number instead of a street address. Of the remaining 760,961 addresses (91.6 percent of the original total), all but 0.4 percent were successfully geocoded. Overall, the process we followed in this test successfully matched 91.2 percent of the EDB addresses to Census block group level FIPS codes.
Table 2.9 Summary of GeoCode error and accuracy results for the CAHPS test file
CAHPS/EDB Test File
Results | Number | Percent
Original number of records | 830,728 | 100.0
Number of records dropped (uncodeable) | 69,767 | 8.4
Addresses processed | 760,961 | 91.6
...Successfully geocoded (first iteration) | 719,220 | 94.5
...Successfully geocoded eFOM records (second iteration) | 38,322 | 5.0
...Total failed | 3,419 | 0.4
GeoCode success rate | 757,542 | 99.6
Percent total test file records matched | | 91.2
Success details* | |
Accurate Match | 477,746 | 62.8
Place Not Found | 77,273 | 10.2
Address match with no parity | 5,931 | 0.8
Closest address match | 37,984 | 5.0
Fuzzy street type match | 86,701 | 11.4
Phonetic match | 37,847 | 5.0
Place-based ZIP match | 16,519 | 2.2
Spelling corrected | 0 | 0.0
State centroid used | 905 | 0.1
Street end used | 3,871 | 0.5
ZIP centroid used | 63,031 | 8.3
Inaccurate direction | 20,525 | 2.7
Failure details | |
Failed due to syntax error | 3,418 | 0.4
...Missing or invalid house number | 3,367 | 0.4
...Missing or invalid state name/abbreviation | 0 | 0.0
...Missing or invalid ZIP code | 47 | 0.0
...Incomplete or malformed address | 4 | 0.0
Failed due to lookup error | 38,323 | 5.0
...Failed to open data member (eFOM) | 38,322 | 5.0
...No address data for state | 1 | 0.0
*Note: Success detail categories reflect distribution of accuracy codes. These codes are NOT mutually exclusive. Some addresses can have up to four accuracy codes associated with them.
Source: Result of running GeoCode CD program 2003 Version 1.02 on addresses from Medicare EDB from mid-2003 for respondents to the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002.
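The headline percentages in Table 2.9 follow directly from the counts. The short Python check below reproduces them; the variable names are ours, and the counts are copied from the table.

```python
# Counts taken from Table 2.9
original = 830_728      # original number of records
dropped = 69_767        # records dropped as uncodeable
first_pass = 719_220    # geocoded on the first iteration
second_pass = 38_322    # eFOM records geocoded on the second iteration

processed = original - dropped        # 760,961 addresses processed
geocoded = first_pass + second_pass   # 757,542 successfully geocoded
failed = processed - geocoded         # 3,419 total failed


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)


print(pct(dropped, original))     # share dropped: 8.4
print(pct(geocoded, processed))   # GeoCode success rate: 99.6
print(pct(geocoded, original))    # share of all test records matched: 91.2
```

Note that the two rates use different denominators: the 99.6 percent success rate is relative to addresses processed, while the 91.2 percent figure is relative to all records in the test file.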
2.5.4 Application of the GeoCode Program Processing to the Full EDB
We obtained the 10 segments of the full unloaded EDB from CMS in mid-2003. Because each segment of the EDB contained more than four million beneficiary records, we processed each segment separately: we first extracted the addresses and other necessary identification variables from the EDB, then corrected the addresses using the SAS programs we developed, and finally ran them through the GeoCode program. The program took from 16 to 36 hours to process and match the more than four million records in each segment. As described above for the test on the CAHPS sample addresses, it was necessary to rerun the addresses that failed with an eFOM error on the first iteration, and virtually all of them were successfully matched on the second iteration through the GeoCode program.
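The two-iteration procedure can be sketched in Python as follows. This is a schematic only: GeoCode CD is a stand-alone program, so `geocode_batch` is a hypothetical stand-in for one run of the program on a file of addresses, assumed here to return a mapping from each input address to its accuracy/error code string.

```python
def rerun_efom(addresses, geocode_batch):
    """Geocode addresses, then rerun by themselves those that returned eFOM.

    geocode_batch is a hypothetical callable: list of addresses ->
    dict mapping each address to its accuracy/error code string.
    Returns the final results and the list of addresses that were rerun.
    """
    # First iteration: the whole batch
    results = dict(geocode_batch(addresses))

    # Collect the addresses that failed with the eFOM lookup error
    retry = [addr for addr in addresses if "eFOM" in results[addr]]

    # Second iteration: only the eFOM failures, run on their own
    if retry:
        results.update(geocode_batch(retry))

    return results, retry
```

Keeping the list of rerun addresses makes it possible to report first-iteration and second-iteration successes separately, as Tables 2.9 and 2.10 do.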
Run EDB segments through the GeoCode program. The results of the GeoCode program processing are summarized in Table 2.10 for all 10 segments of the unloaded EDB combined. The results were very similar across the 10 segments. Overall, 87.5 percent of the 41,742,407 addresses of Medicare beneficiaries were processed through the GeoCode program. Of the addresses processed, 99.2 percent (36,223,053) were successfully matched to a FIPS code that included the block group, which corresponds to 86.8 percent of all EDB addresses. As Table 2.10 shows, 61 percent of the matches made were exact matches to the input addresses.
Import GeoCode output files and merge with EDB records. We used PROC IMPORT in SAS 8.2 to transform the database (*.dbf) files produced by the GeoCode program into SAS data files (*.sas7bdat). Using the ADDRESS field that we prepared from the EDB as input to the GeoCode program as the common key (common to the EDB and the GeoCode output), we merged the output files (containing Census-based geographic identifiers, including the AreaKey number string that identifies block groups) onto the EDB records.
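The merge step was carried out in SAS; the following Python sketch only illustrates the logic of joining on the ADDRESS key. The `BENE_ID` field and the function name are hypothetical, while ADDRESS, BLOCK, and AreaKey are the output fields listed in Section 2.5.2.

```python
def merge_geocodes(edb_records, geocode_rows):
    """Attach BLOCK and AreaKey from geocoder output to EDB records.

    Both inputs are lists of dicts that share an ADDRESS key. Records whose
    address has no geocoder output receive None for the geographic fields.
    """
    # Index the geocoder output by the common ADDRESS key
    by_address = {row["ADDRESS"]: row for row in geocode_rows}

    merged = []
    for record in edb_records:
        geo = by_address.get(record["ADDRESS"], {})
        merged.append({
            **record,
            "BLOCK": geo.get("BLOCK"),
            "AreaKey": geo.get("AreaKey"),
        })
    return merged
```

Because the key is the address rather than a person identifier, beneficiaries who share an address receive the same geographic identifiers.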
2.5.5 Results of Geo-coding the Sample of 1.96 Million Medicare Beneficiaries
The sample of 1.96 million Medicare fee-for-service beneficiaries is a subset of the beneficiaries geocoded from the mid-2003 EDB. The results of the geocoding for the 1.96 million are presented in Table 2.11. While the table indicates that 81 percent (1,588,607 out of 1,960,121) of the addresses for the sample members were successfully geocoded, this figure allows the use of ZIP code and state centroids when there was no other way to match the input address to a Census-listed address. We also reran the unmatched addresses from the mid-2003 EDB, as well as addresses that had changed since mid-2003, through the GeoCode CD program in the hope of geocoding sample members more completely and correctly.
We know from analyses performed in subtask one of this task order that most of the state centroid matches (4,090) are not true matches at all; the GeoCode CD program forced foreign addresses to a state centroid. The same may be true of some of the ZIP centroid matches (159,217). Based on our validation of address block group matching against the Census, however, we are confident that the true match rate at the block group level for the sample is most likely at least 75 percent.
Table 2.10 Summary of GeoCode error and accuracy codes for the 10 segments of the EDB combined
Results | Sums | Percent
Original number of records | 41,742,407 | 100.0
Number of records dropped (uncodeable) | 5,223,766 | 12.5
Addresses processed | 36,518,641 | 87.5
...Successfully geocoded (first iteration) | 35,108,329 | 96.1
...Successfully geocoded eFOM records (second iteration) | 1,114,724 | 3.1
...Total failed | 295,588 | 0.8
GeoCode success rate | 36,223,053 | 99.2
Percent total EDB records matched | | 86.8
Success details* | |
Accurate Match | 20,028,633 | 61.0
Place Not Found | 3,216,868 | 9.8
Address match with no parity | 281,554 | 0.9
Closest address match | 1,821,893 | 5.5
Fuzzy street type match | 3,919,792 | 11.9
Phonetic match | 1,752,858 | 5.3
Place-based ZIP match | 799,836 | 2.4
Spelling corrected | 10 | 0.0
State centroid used | 47,252 | 0.1
Street end used | 181,270 | 0.6
ZIP centroid used | 2,972,274 | 9.0
Inaccurate direction | 1,027,377 | 3.1
Failure details | |
Failed due to syntax error | 262,176 | 0.8
...Missing or invalid house number | 175,561 | 0.5
...Missing or invalid state name/abbreviation | 4 | 0.0
...Missing or invalid ZIP code | 86,335 | 0.3
...Incomplete or malformed address | 276 | 0.0
Failed due to lookup error | 1,022,267 | 3.4
...Failed to open data member (eFOM) | 1,018,483 | 3.4
...No address data for state | 3,784 | 0.0
*Note: Success detail categories reflect distribution of accuracy codes. These codes are NOT mutually exclusive. Some addresses can have up to four accuracy codes associated with them.
Source: Result of running GeoCode CD program 2003 Version 1.02 on addresses from Medicare EDB from mid-2003 for respondents to the Medicare CAHPS fee-for-service, managed care enrollee, and disenrollee surveys for 2000-2002.
Table 2.11 Success with Geocoding of the Medicare Beneficiaries Included in the RTI Sample of 1.96 Million
Sample | Number
Total Sample | 1,960,121
Successfully geocoded | 1,588,607
GeoCoding Success Rate | 81.0%
Success Details |
Exact Match | 920,390
Other Accuracy Code | 504,910
Zip Centroid | 159,217
State Centroid | 4,090
Source: Result for sample of 1.96 million of running GeoCode CD program 2003, Version 1.02 on addresses from Medicare EDB from mid-2003.
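As a rough arithmetic check on Table 2.11, the Python snippet below recomputes the overall success rate and shows how the rate changes when centroid matches are excluded. The variable names are ours; the two "excluding" rates are simple bounds derived from the table counts, not results reported by the study.

```python
# Counts taken from Table 2.11
total_sample = 1_960_121
exact = 920_390
other_accuracy = 504_910
zip_centroid = 159_217
state_centroid = 4_090

geocoded = exact + other_accuracy + zip_centroid + state_centroid


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Rate as reported, counting every centroid match as a success
overall_rate = pct(geocoded, total_sample)

# Rate if state centroid matches are not counted
rate_without_state = pct(geocoded - state_centroid, total_sample)

# Rate if neither ZIP nor state centroid matches are counted
rate_without_centroids = pct(exact + other_accuracy, total_sample)

print(geocoded, overall_rate, rate_without_state, rate_without_centroids)
```

The four detail categories sum to the 1,588,607 successfully geocoded addresses shown in the table.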
12. The centroid of a 5-digit ZIP code area is the balance point of the polygon formed by its boundaries. The centroid is calculated based on the coordinate extremes of the polygon.
13. One field we did not include, the MATCH field, contained the full address that the GeoCode search engine determined to be the closest match to the input address. We had intended to include this field, but during the testing phase we discovered problems with the MATCH field that caused major difficulties when we tried to transform the *.dbf files into SAS files.