Background
Many data sets contain narrative text for industry and occupation. These
include vital records systems, cancer registries, workers' compensation
systems, and healthcare records. Manually assigning industry and occupation
(I&O) codes can be expensive, time-consuming, and inconsistent.
Furthermore, because some industry and occupation titles are so rare,
or include infrequently used synonyms, even experienced coders have great
difficulty in reaching agreement.
To decrease the number of cases a manual coder must review and to create
national consistency, NIOSH led the development of the Standardized Occupation
and Industry Coding (SOIC) software. The development of SOIC was a collaborative
effort that included the National Association for Public Health Statistics
and Information Systems, the National Center for Health Statistics (NCHS),
the Bureau of Labor Statistics (BLS), the National Center for Chronic
Disease Prevention and Health Promotion, and the Bureau of the Census
(BOC).
The Software
SOIC codes occupation and industry narratives according to the 1990 BOC
Alphabetical Index of Industries and Occupations supplemented
with special codes for non-paid workers, non-workers, and the military
as defined in the NCHS Instruction Manual, Part 19. The current version
(SOIC 1.5) and its documentation can be downloaded from this website free
of charge. Minimum system requirements include:
- 90 MHz Pentium processor with 32 MB of RAM
- Windows® 98, NT, ME, or 2000
- Minimum 30 MB of free disk space
The SOIC system client was written using the Microsoft Visual Basic programming
language and the Microsoft Access database management system. SOIC data
tables and data files are stored as Access tables and files. SOIC offers
several data access features: the main window can be used for data entry;
text or ASCII files can be imported and exported; and files in Microsoft
Access, dBase, or FoxPro formats can be opened directly into the software.
Microsoft conventions for Windows applications were used wherever possible.
The software has an easy-to-use interface modeled on the U.S. standard
death certificate (the data entry screen) and an extensive system menu
with options for opening and saving files, editing or finding records,
and coding a single record or an entire file.
The Coding Process
To assign industry and occupation codes, the software uses a stepwise
series of increasingly complex coding modules. Narrative information is
processed through each module until an industry or occupation code is
assigned or the narrative is determined to be uncodable.
![SOIC coding process flow diagram.](images/help/decision_tree_gimp.jpg)
Auto-Spell: corrects some misspellings and expands fused
words, acronyms, and abbreviations.
Lookup Tables: assigns codes based on exact matches
to various I&O narrative combinations.
- Paired-phrase matching: commonly occurring I&O narratives.
- Company matching: a limited list of state-specific industry names.
- Idiom matching: misleading industry narratives.
Knowledge Base: assigns codes based on static, hand-written
coding rules that emulate the logic a manual coder would typically
apply (e.g., “fuzzy” matching on word fragments). The knowledge base
contains 2,055 rules: 848 industry rules and 1,207 occupation rules.
Word-to-Code: predicts codes based on word patterns
observed in data used to develop the software.
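The stepwise flow described above can be sketched in a few lines of Python. This is only an illustration of the cascade idea, not the SOIC implementation (which was written in Visual Basic): every table, rule, weight, threshold, and code value below (`ABBREVIATIONS`, `LOOKUP_TABLE`, `FRAGMENT_RULES`, `WORD_WEIGHTS`, `MIN_SCORE`, `OCC-A`, and so on) is a hypothetical placeholder, and simple substring matching stands in for the fuzzy rules of the knowledge base.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# All tables and codes below are hypothetical placeholders for illustration;
# they are not SOIC data and not real 1990 BOC codes.
ABBREVIATIONS = {"rn": "registered nurse", "hosp": "hospital"}
LOOKUP_TABLE = {"registered nurse": "OCC-A"}
FRAGMENT_RULES = [("driv", "OCC-B")]  # crude stand-in for fuzzy fragment rules
WORD_WEIGHTS = {"welding": {"OCC-C": 0.8}, "helper": {"OCC-D": 0.3}}
MIN_SCORE = 0.5  # assumed cutoff for accepting a word-pattern prediction


def auto_spell(text: str) -> str:
    """Lowercase the narrative and expand a few known abbreviations."""
    return " ".join(ABBREVIATIONS.get(w, w) for w in text.lower().split())


def lookup_module(text: str) -> Optional[str]:
    """Assign a code only on an exact match of the whole narrative."""
    return LOOKUP_TABLE.get(text)


def knowledge_base_module(text: str) -> Optional[str]:
    """Apply hand-written rules; here, the first matching word fragment wins."""
    for fragment, code in FRAGMENT_RULES:
        if fragment in text:
            return code
    return None


def word_to_code_module(text: str) -> Optional[str]:
    """Sum per-word weights and accept the top code if it clears the cutoff."""
    votes: Counter = Counter()
    for word in text.split():
        for code, weight in WORD_WEIGHTS.get(word, {}).items():
            votes[code] += weight
    if not votes:
        return None
    code, score = votes.most_common(1)[0]
    return code if score >= MIN_SCORE else None


# Modules run in order of increasing complexity, as in the text above.
MODULES: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("lookup", lookup_module),
    ("knowledge_base", knowledge_base_module),
    ("word_to_code", word_to_code_module),
]


def code_narrative(text: str) -> Tuple[Optional[str], str]:
    """Return (code, module name), or (None, 'uncodable') if no module fires."""
    cleaned = auto_spell(text)
    for name, module in MODULES:
        code = module(cleaned)
        if code is not None:
            return code, name
    return None, "uncodable"
```

With these toy tables, `code_narrative("RN")` is resolved by the lookup step after abbreviation expansion, while a narrative that no module recognizes falls through and is reported as uncodable. Ordering the modules from exact matching to statistical prediction means the cheaper, more precise steps get the first chance to assign a code.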
Coding Results
NIOSH compared SOIC-assigned codes with an expert’s manually assigned
codes for 48,067 cases from a death-certificate-based surveillance system.
The number of software-assigned codes that matched the expert’s codes
is shown below. In this test there was no adjudication of the results;
that is, the mismatched cases were not reviewed to determine if the SOIC
autocoder or the manual coder was actually correct. These results are
provided as an illustration. Coding results will vary and depend upon
overall data quality. The software does not perform well on narratives
with company names and other ambiguous information.
Number of SOIC-assigned codes that matched manually assigned codes

| Codes matched | Cases (of 48,067) |
| --- | --- |
| Industry codes | 36,376 (76%) |
| Occupation codes | 36,207 (75%) |
| Both occupation and industry codes | 30,389 (63%) |
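The percentages reported above are each match count divided by the 48,067 cases in the comparison. As a quick check, the arithmetic can be reproduced as follows; the variable and function names are illustrative only.

```python
# Counts taken from the comparison described above.
TOTAL_CASES = 48_067
MATCHED = {
    "industry": 36_376,
    "occupation": 36_207,
    "both": 30_389,
}


def match_rate(matched: int, total: int) -> float:
    """Percentage of cases where the software code equals the manual code."""
    return 100.0 * matched / total


# Round to whole percentages, as the results are reported.
rates = {name: round(match_rate(n, TOTAL_CASES)) for name, n in MATCHED.items()}
print(rates)
```

Because the mismatches were not adjudicated, these figures measure agreement with one manual coder rather than accuracy.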
Software Version
The current SOIC software version available for download is v. 1.5 and
is based on the 1990 BOC industry and occupation coding scheme. The software
is provided as a resource tool for injury and illness researchers where
uniform coding of industry and occupation is beneficial to prevention
efforts. No further revisions will be made to this version. User support
is limited. Assistance may be requested by contacting the NIOSH SOIC
group.
The BOC developed a 2000 industry and occupation coding scheme. Currently,
NIOSH is not planning to create a version of SOIC that incorporates these
codes.
Page last updated: July 10, 2007
Page last reviewed: May 13, 2008
Content Source: National Institute for Occupational Safety and Health (NIOSH)