Chapter 4 Processing of Data
Once the data have been collected or acquired from an external
source, some processing is usually necessary to make the data ready for conversion
into information products.
This chapter contains standards for securing the data during processing (Section 4.1), checking data for potential errors (Section 4.2), dealing with missing data (Section 4.3), and adding information to the data (Section 4.4). This chapter also contains standards for monitoring and evaluating data operations, including nonresponse analysis (Section 4.5), and for documenting the data processing operations (Section 4.6).
4.1 Data Protection
Standard 4.1: Safeguards must be taken throughout data processing
to protect the data from disclosure, theft, or loss.
Key Terms: confidentiality,
information security, storage
Guideline 4.1.1: Confidentiality Procedures
Implement the confidentiality procedures given in the BTS Confidentiality Procedures Manual sections on “Physical Security Procedures” and “Security of Information Systems” to protect the data from unauthorized disclosure or release during data production, use, storage, transmittal, and disposition (e.g., completed data collection forms, electronic files, and hard copy printouts).
Guideline 4.1.2: Security of Information Systems
Follow the information system security procedures in the BTS Confidentiality Procedures Manual, and periodically monitor and update them. Ensure that:
- Data files, networks, servers, and desktop PCs are secure from malicious software, unauthorized access, or theft.
- Access to confidential data is controlled so that only authorized staff can read and/or write to the data. The project manager responsible for the data should periodically review staff access rights to guard against unauthorized release or alteration.
Guideline 4.1.3: Data Storage
Develop and implement routine data backups. Secure backup data from unauthorized access
or release.
Related Information
Bureau of Transportation Statistics. 2004. Confidentiality Procedures Manual. Washington, DC.
Federal Committee on Statistical Methodology. 1994. Report on Statistical Disclosure Limitation Methodology, Statistical Policy Working Paper 22. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/spwp22.html as of November 15, 2004.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.4 (Data Protection [during data collection]) and Section 6.5 (Data Protection [during information dissemination]). Washington, DC. July 14.
Approval Date: April 20, 2005
4.2 Data Editing
Standard 4.2: As part of standard data processing, mitigate
errors by checking and editing both data BTS collects and data it acquires from
external sources.
Key Terms: edit, imputation, outliers, skip pattern
Guideline 4.2.1: Types of Edits
At a minimum, the editing process must include checking for the items below, with appropriate editing if errors are detected.
- Omission or duplication of records/units,
- Data that fall outside a pre-specified range or, for categorical data, data that are not equal to specified categories,
- Data that contradict other data within an individual record/unit,
- Data inconsistent with past data or with data from outside sources,
- Missing data that can be directly filled from other portions of the same record or through follow-up with the data provider,
- Incorrect flow through prescribed skip patterns, and
- Selections in excess of the allowable number, such as multiple selections for a “mark one” data item.
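The check types above can be sketched as simple validation functions. The record fields, mode categories, and weight range below are hypothetical illustrations, not BTS-prescribed items.

```python
def check_record(record, seen_ids, valid_modes, weight_range=(0.0, 100000.0)):
    """Run basic edit checks on one record; return a list of failed-edit labels.
    Field names, categories, and ranges are illustrative only."""
    failures = []
    # Duplicate-record check.
    if record["id"] in seen_ids:
        failures.append("duplicate_record")
    seen_ids.add(record["id"])
    # Range check for a numeric item.
    low, high = weight_range
    if not (low <= record["shipment_weight"] <= high):
        failures.append("weight_out_of_range")
    # Category check for a categorical item.
    if record["mode"] not in valid_modes:
        failures.append("invalid_mode_code")
    # Internal-consistency check: loaded miles cannot exceed total miles.
    if record["loaded_miles"] > record["total_miles"]:
        failures.append("loaded_exceeds_total_miles")
    return failures

seen = set()
rec = {"id": 1, "shipment_weight": 120.0, "mode": "rail",
       "loaded_miles": 500, "total_miles": 450}
print(check_record(rec, seen, valid_modes={"truck", "rail", "water", "air"}))
# ['loaded_exceeds_total_miles']
```

Each label corresponds to one of the edit types listed above; in practice the labels would feed the edit-disposition codes of Guideline 4.2.3.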
Guideline 4.2.2: Editing Process
In a data editing system:
- Develop editing rules in advance of any data processing. Rules may be modified during data processing (Section 4.5.1).
- Minimize manual intervention, since it can result in inconsistent application of the edit rules and may introduce human error.
- Set the acceptable data ranges for outlier checks at levels broad enough that legitimate special effects, trend shifts, or industry changes are not erroneously removed.
Guideline 4.2.3: Edit Resolution
Several actions are possible when a data value fails an edit check. Recommended procedures are:
- Verify with the original source or respondent and correct as appropriate, or
- Change the data value to the most likely value based upon other information collected, or impute a substitute value (Guideline 4.3.4). For administrative or regulatory data, any changed value needs the data provider’s acceptance; notify the source if a change is made to data provided by an external source.
- Replace the failed value with a missing value indicator (Guideline 4.4.2), or
- Accept the data value as reported, providing reasons for overriding the edit.
Related Information
Federal Committee on Statistical Methodology. 1990. Data Editing in Federal Statistical Agencies, Statistical Policy Working Paper 18. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/wp18.html as of November 15, 2004.
__________. 1996. Data Editing Workshop and Exposition, Statistical Policy Working Paper 25. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/wp25a.html as of November 15, 2004.
__________. 2001. Measuring and Reporting Sources of Error in Surveys, Statistical Policy Working Paper 31, Section 7.2.3 (Editing Errors). Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/01papers/spwp31_final.pdf as of November 15, 2004.
Hawkins, D.M. 1980. Identification of Outliers. New York: Chapman and Hall.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.1 (Data Editing). Washington, DC. July 14.
Approval Date: April 20, 2005
4.3 Missing Data
Standard 4.3: Unit and item nonresponse must be appropriately
measured, adjusted for, and reported. Response rates must be computed using
standard formulas to measure the proportion of the eligible respondents
represented by the responding units.
Key Terms: bias, eligible unit, imputation, item, item nonresponse,
multivariate analysis, nonresponse bias, overall unit nonresponse, probability
of selection, response rates, sample substitution, unit, unit nonresponse,
weight
Guideline 4.3.1: Basis for Rates
Calculate unit and item response rates based either on the probability of selection (for household or personal data collections) or on the unit’s measure of size (for industry or establishment data collections).
- Base proportions of the total industry on a measure of size available for all eligible units (e.g., annual operating revenue, total employment).
- For sample surveys, use the inverse of the probability of selection (base weights) in response rate calculations. For 100 percent (universe) data collections, the base weight for each unit is one.
- For sample designs using unequal probabilities, such as stratified designs with optimal allocation, report weighted missing data rates along with unweighted missing data rates.
- If sample substitutions were made, calculate response rates without the substituted cases.
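The distinction between unweighted and base-weighted missing data rates can be sketched as follows; the sample and weights are hypothetical.

```python
def missing_data_rates(responded, base_weights):
    """Unweighted and base-weighted missing data (nonresponse) rates.
    `responded` is a list of booleans; `base_weights` are inverse
    selection probabilities (all 1.0 for a universe collection)."""
    n = len(responded)
    unweighted = sum(1 for r in responded if not r) / n
    total_w = sum(base_weights)
    weighted = sum(w for r, w in zip(responded, base_weights) if not r) / total_w
    return unweighted, weighted

# Hypothetical unequal-probability sample: two certainty units (weight 1)
# and three units sampled at rate 1-in-10 (weight 10).
resp = [True, True, False, True, False]
w = [1.0, 1.0, 10.0, 10.0, 10.0]
print(missing_data_rates(resp, w))  # (0.4, 0.625)
```

Here 2 of 5 units are missing (unweighted rate 0.4), but they carry 20 of the 32 units of weight (weighted rate 0.625) — which is why both rates should be reported for unequal-probability designs.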
Guideline 4.3.2: Unit Response Rates
Calculate unit response rates (RRU) as the ratio of the number of completed data collection cases (CC) to the number of in-scope sample cases (AAPOR 2000). Several categories of cases make up the total number of in-scope cases:
CC = number of completed cases;
R = number of cases that refused to provide any data;
O = number of eligible units not responding for reasons other than refusal;
NC = number of noncontacted units known to be eligible;
U = number of units of unknown eligibility; and
e = estimated proportion of units of unknown eligibility that are eligible.
The unit response rate (OMB 2005) represents a composite of these components:
RRU = CC / (CC + R + O + NC + e(U))
- The numerator includes all cases that have submitted sufficient information to be considered complete responses for the data collection period.
- Complete cases may contain some missing data items. Data collection staff and principal data users should jointly determine the criteria for considering a case to be complete.
- The denominator includes all original survey units identified as eligible, including units with pending responses and no data received, new eligible units added to the data collection effort, and an estimate of the number of eligible units among the units of unknown eligibility. The denominator does not include units deemed out-of-business, out-of-scope, or duplicates.
- An unweighted version of the unit response rate can be used for tracking and analyzing data collection operations.
- A simple way to estimate e is to compute the weighted proportion of eligible units among the cases whose eligibility is known and assume the same proportion applies to the U cases.
- If a data collection has special circumstances that justify a formula other than the one above, such as longitudinal or partial response considerations, a more appropriate formula can be used if accompanied by a full explanation of the calculation method.
- When a data collection has multiple stages, calculate the overall unit response rate (RRO) as the product of two or more unit-level response rates.
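Using the components defined above, the AAPOR-style composite rate and the multi-stage overall rate can be sketched in a few lines; the counts are illustrative.

```python
def unit_response_rate(CC, R, O, NC, U, e):
    """Unit response rate (RRU): completed cases over all known-eligible
    cases plus the estimated eligible share (e) of the U cases of
    unknown eligibility."""
    return CC / (CC + R + O + NC + e * U)

def overall_unit_response_rate(stage_rates):
    """Overall unit response rate (RRO) for a multi-stage collection:
    the product of the unit-level response rates at each stage."""
    result = 1.0
    for r in stage_rates:
        result *= r
    return result

# Hypothetical collection: 800 completes, 50 refusals, 30 other
# nonrespondents, 70 noncontacts, 100 unknown-eligibility cases of
# which an estimated half are eligible.
rr = unit_response_rate(CC=800, R=50, O=30, NC=70, U=100, e=0.5)
print(round(rr, 3))                                        # 0.8
print(round(overall_unit_response_rate([0.9, 0.8]), 3))    # 0.72
```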
Guideline 4.3.3: Item Response Rates
Calculate item response rates (RRI) as the ratio of the number of respondents for whom an in-scope response was obtained (CCx for item x) to the number of respondents who were asked to provide information for that item. The number asked to provide information for an item is the number of unit-level respondents (CC) minus the number of respondents with a valid skip for item x (Vx). When an abbreviated questionnaire is used to convert refusals, the eliminated questions are treated as item nonresponse.
- Calculate the total item response rate (RRTx) for a specific item x as the product of the overall unit response rate (RRO) and the item response rate for item x (RRIx).
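The item-level calculations can be sketched directly from these definitions; the counts shown are hypothetical.

```python
def item_response_rate(CC_x, CC, V_x):
    """Item response rate RRIx: in-scope responses obtained for item x
    (CC_x) over the unit respondents actually asked the item
    (CC minus valid skips V_x)."""
    return CC_x / (CC - V_x)

def total_item_response_rate(RR_O, RR_I_x):
    """Total item response rate RRTx: overall unit response rate (RRO)
    times the item response rate for item x."""
    return RR_O * RR_I_x

# Hypothetical item: 600 unit respondents, 100 valid skips, 450 in-scope
# answers to the item; overall unit response rate of 0.8.
rri = item_response_rate(CC_x=450, CC=600, V_x=100)
print(round(rri, 3))                              # 0.9
print(round(total_item_response_rate(0.8, rri), 3))  # 0.72
```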
Guideline 4.3.4: Imputation
Base decisions on whether to adjust data, adjust weights, or impute for missing data on how the data will be used and on an assessment of the bias likely to result from the missing data.
- To avoid biased estimates, include imputed data in any reported totals.
- When used, imputation procedures should be internally consistent, be based on theoretical and empirical considerations, be appropriate for the analysis, and make use of the most relevant data available.
- Since most data sets are subject to analysis by users to detect relationships between variables, implement imputation methods that preserve multivariate relationships.
- To ensure data integrity, re-edit the data after imputation.
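One simple imputation method consistent with these points is mean imputation within classes, with every imputed value flagged. This is an illustrative sketch, not a BTS-prescribed method; the class variable and item names are hypothetical.

```python
from collections import defaultdict

def impute_class_means(records, key, item):
    """Replace missing `item` values with the mean of reported values in
    the record's imputation class (`key`), and flag each imputed value
    so it can be identified (Guideline 4.4.2) and re-edited."""
    sums, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        if rec[item] is not None:
            sums[rec[key]] += rec[item]
            counts[rec[key]] += 1
    for rec in records:
        rec[item + "_imputed"] = rec[item] is None
        if rec[item] is None:
            rec[item] = sums[rec[key]] / counts[rec[key]]
    return records

data = [{"mode": "truck", "tons": 10.0},
        {"mode": "truck", "tons": 20.0},
        {"mode": "truck", "tons": None}]
print(impute_class_means(data, "mode", "tons")[2]["tons"])  # 15.0
```

Because the imputed records keep a flag, reported totals can include the imputed values while analysts can still exclude or examine them.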
Guideline 4.3.5: Weight Adjustments
For data collections involving sampling, adjust weights for unit nonresponse unless unit imputation is warranted. Adjust weights for missing units within classes of sub-populations to reduce bias.
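A standard weighting-class nonresponse adjustment can be sketched as follows: within each class, respondent base weights are inflated by the ratio of total class weight to responding class weight. The class labels and weights are hypothetical.

```python
from collections import defaultdict

def nonresponse_adjust(records):
    """Within each weighting class (`cls`), multiply respondent base
    weights by total class weight over responding class weight, so
    respondents also represent their class's nonrespondents."""
    total_w, resp_w = defaultdict(float), defaultdict(float)
    for rec in records:
        total_w[rec["cls"]] += rec["weight"]
        if rec["responded"]:
            resp_w[rec["cls"]] += rec["weight"]
    for rec in records:
        factor = total_w[rec["cls"]] / resp_w[rec["cls"]]
        rec["adj_weight"] = rec["weight"] * factor if rec["responded"] else 0.0
    return records

recs = [{"cls": "A", "weight": 2.0, "responded": True},
        {"cls": "A", "weight": 2.0, "responded": False}]
print(nonresponse_adjust(recs)[0]["adj_weight"])  # 4.0
```

The adjusted weights sum to the original class total, so class-level estimates remain unbiased under the assumption that nonrespondents resemble respondents within each class.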
Related Information
American Association for Public Opinion Research. 2000. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. Lenexa, Kansas: AAPOR.
Kalton, G. 1983. Compensating for Missing Survey Data. Institute for Social Research, University of Michigan.
__________ and Flores-Cervantes, I. 2003. Weighting Methods, Journal of Official Statistics, Vol. 19, No. 2.
__________ and Kasprzyk, D. 1982. Imputing for missing survey responses. Proceedings of the Section on Survey Research Methods, American Statistical Association, 22-31.
__________ and Kasprzyk, D. 1986. The treatment of missing survey data. Survey Methodology, Vol. 12, No. 1, 1-16.
Little, R.J.A. and Rubin, D. 1987. Statistical Analysis with Missing Data. New York: Wiley.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.2 (Missing Data). Washington, DC. July 14.
Rubin, D.B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Schafer, J.L. 1997. Analysis of Incomplete Multivariate Data. London, UK: Chapman and Hall.
Approval Date: April 20, 2005
4.4 Data Coding
Standard 4.4: To allow appropriate analysis, use codes to
identify missing, edited, and imputed items.
Codes added to convert collected text information into a form that facilitates
analysis must use standardized codes, when available, to enhance comparability
with other data sources.
Key Terms: coding, editing,
external source, imputation, skip pattern
Guideline 4.4.1: Codes for Missing and Inapplicable Data
Use codes on the file that clearly distinguish between cases where an item is missing and cases where an item does not apply, such as when it is skipped over by a skip pattern.
- Distinguish between data missing initially from the source, unreadable data, and data deleted in the editing process.
- If the data collection instrument contains skip patterns, distinguish between items skipped and items not ascertained (such as refusals).
- Do not use blanks or zeros to identify missing data, as they tend to be confused with actual data. Similarly, do not use numeric codes such as a series of nines or eights for missing numeric items if these could be legitimate reported values.
- If a data file acquired from an external source was not previously coded, the level of coding effort should depend on how BTS plans to use the file and on whether BTS plans to further disseminate the file.
- For data in tabular form, the BTS Guide to Style and Publishing Procedures contains a number of symbols and abbreviations to place in cells with various types of missing or inapplicable data.
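One way to keep these cases distinct without overloading the data value itself is to pair each item with an explicit status field. The status labels below are illustrative, not BTS-standard codes.

```python
# Illustrative status labels for the distinctions the guideline requires.
STATUS = {
    "reported": "R",         # value as provided by the source
    "missing_source": "M",   # missing initially from the source
    "unreadable": "U",       # present but could not be read
    "deleted_edit": "D",     # deleted in the editing process
    "valid_skip": "S",       # legitimately skipped by a skip pattern
    "not_ascertained": "N",  # asked but not answered (e.g., refusal)
}

def item(value, status):
    """Store a data value together with its status code."""
    assert status in STATUS.values()
    return {"value": value, "status": status}

record = {"age": item(42, "R"),
          "income": item(None, "N"),       # refused -> not ascertained
          "spouse_age": item(None, "S")}   # no spouse -> valid skip
print(record["income"]["status"], record["spouse_age"]["status"])  # N S
```

With this layout a blank or zero never stands in for "missing," and skipped items are never confused with refusals.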
Guideline 4.4.2: Indicating Edit Actions and Imputations
Code the data set to indicate edit actions and imputed values.
- Indicate whether cases passed or failed each edit. If a case fails an edit, indicate the edit disposition (Guideline 4.2.3).
- If more than one method could be used to impute a missing data item, indicate the imputation method used.
Guideline 4.4.3: Coding Text Information
Although it is preferable to pre-code responses, it may be necessary to code open-ended text fields for further use.
- To code text data for easier analysis, use standardized codes if they exist (Guideline 2.3.3). Develop other types of codes by following existing DOT or other federal agency practice, or by using standard codes from industry or international organizations, when they exist.
- When manually coding text, create a quality assurance process that verifies at least a sample of the coding to determine whether a specified level of coding accuracy and reliability is being maintained.
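The sample-verification step can be sketched as an independent re-coding of a random sample of cases; the coding scheme and agreement target are hypothetical.

```python
import random

def coding_accuracy(codes, recheck, sample_size, seed=0):
    """Estimate manual-coding accuracy by independently re-coding a random
    sample of cases. `codes` maps case id -> assigned code; `recheck` is a
    function returning the verifier's code for a case id."""
    rng = random.Random(seed)  # fixed seed for a reproducible QA sample
    sample = rng.sample(sorted(codes), sample_size)
    agree = sum(1 for cid in sample if recheck(cid) == codes[cid])
    return agree / sample_size

# Hypothetical batch where the verifier agrees with every assigned code.
assigned = {i: "4841" for i in range(100)}  # illustrative industry codes
print(coding_accuracy(assigned, lambda cid: "4841", sample_size=20))  # 1.0
```

If the estimated accuracy falls below the target level, the affected batch would be re-coded and the coding instructions revisited.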
Related Information
American Association for Public Opinion Research. 1998. “Standard Definitions – Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys.” Available at http://www.aapor.org/ethics/stddef.html.
Bureau of Transportation Statistics (BTS). 2003. BTS Guide to Style and Publishing Procedures. Washington, DC.
__________. 2005. BTS Statistical Standards Manual, Chapter 2 (Data Collection Planning and Design). Washington, DC.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.3 (Coding). Washington, DC. July 14.
Approval Date: April 20, 2005
4.5 Monitoring and
Evaluation
Standard 4.5: Monitor and evaluate each data processing activity, both to assess the impact on data
quality and to inform data users.
Key Terms: frame, imputation, item nonresponse, incident
data, longitudinal, missing at random, multivariate modeling, nonresponse bias,
overall unit nonresponse, population, response rates, unit nonresponse, weight
Guideline 4.5.1: Quality Control
Establish quality control procedures to monitor and report on the operation of data processing procedures.
- Incorporate quality control into the processing procedures so that they automatically produce outputs usable by data system managers. Use the outputs produced during data processing to adjust procedures for higher quality results and greater efficiency.
- Monitor failure rates for each edit and by case. Analyze the pattern of edit failures graphically to pinpoint problems more easily and to prioritize items for follow-up.
- When applicable, automate the process of referring data problems to data providers for quicker resolution.
- Maintain information on the amount of missing data, actions taken, and problems encountered during imputation, for inclusion in the data processing documentation (Guidelines 4.6.2 and 4.6.3) and user documentation (Guideline 6.8.1).
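Monitoring failure rates per edit and per case can be sketched as a small summary function; the edit labels and counts are hypothetical.

```python
from collections import Counter

def edit_failure_rates(failures_by_case, n_cases):
    """Summarize edit performance from a mapping of
    case id -> list of failed-edit labels (empty list = clean case).
    Returns (failure rate per edit rule, share of cases failing any edit)."""
    per_edit = Counter(label
                       for fails in failures_by_case.values()
                       for label in fails)
    edit_rates = {edit: n / n_cases for edit, n in per_edit.items()}
    case_rate = sum(1 for f in failures_by_case.values() if f) / n_cases
    return edit_rates, case_rate

fails = {1: ["range"], 2: [], 3: ["range", "consistency"], 4: []}
rates, case_rate = edit_failure_rates(fails, n_cases=4)
print(rates, case_rate)
# the "range" edit fails on half the cases; half of all cases fail at least one edit
```

These per-edit rates are the kind of output that could feed the graphical failure-pattern analysis described above.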
Guideline 4.5.2: Unit Response Analysis Requirement
Conduct an analysis of nonresponse for any data collection with an overall unit response rate (Guideline 4.3.2) less than 80 percent. The objective is to measure the impact of the nonresponse and to determine whether the data are missing at random.
- Compare respondents and nonrespondents across subgroups using external or frame data, if available, or through a nonresponse follow-back survey.
- Compare respondents’ characteristics to known characteristics of the population from an external source. This comparison can indicate possible bias, especially if the characteristics in question are related to the data collection effort’s key variables.
- Consider multivariate modeling of response, using external data on respondents and nonrespondents, to determine whether nonresponse bias exists.
- For a multi-stage data collection effort, focus the response analysis on the stages with the higher missing data rates.
- Evaluate the impact of weighting adjustments on nonresponse bias.
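The first comparison — respondents versus nonrespondents across subgroups known for the whole frame — can be sketched as follows; the subgroup variable and units are hypothetical.

```python
from collections import defaultdict

def subgroup_response_rates(units):
    """Unweighted response rate by frame subgroup. Large differences
    across subgroups suggest the data may not be missing at random."""
    totals, resp = defaultdict(int), defaultdict(int)
    for u in units:
        totals[u["region"]] += 1
        resp[u["region"]] += u["responded"]
    return {g: resp[g] / totals[g] for g in totals}

# Hypothetical frame with a subgroup variable known for every unit.
frame = [{"region": "NE", "responded": True},
         {"region": "NE", "responded": False},
         {"region": "SW", "responded": True},
         {"region": "SW", "responded": True}]
print(subgroup_response_rates(frame))  # {'NE': 0.5, 'SW': 1.0}
```

A gap like the one shown (0.5 versus 1.0) would flag the subgroup for follow-up or for a weighting-class nonresponse adjustment.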
Guideline 4.5.3: Item Response Analysis Requirement
If the item response rate (Guideline 4.3.3) is less than 70 percent, conduct an item nonresponse analysis, similar to that in Guideline 4.5.2, to determine whether the data are missing at random at the item level.
- Analyze missing data rates at the item level and compare the characteristics of the reporters and the non-reporters.
- For some data collections, such as incident data collections, missing data rates may not be known. In such cases, provide estimates or qualitative information on what is known.
Guideline 4.5.4: Timing of Nonresponse Bias Analyses
Conduct unit and item nonresponse bias analyses prior to the release of any information products.
- Analyze the missing data effect at least annually if the data collection occurs more than once a year or is continuous.
- Analyze the missing data effect each time data are collected if the collection occurs annually or less often.
- For data collections from longitudinal panels, analyze after each collection the effect of missing data due to attrition of respondents over time.
Guideline 4.5.5: Publishable Items
In cases where the analysis indicates that the data are not missing at random, base the decision to publish individual items on the amount of potential bias due to missing data.
- If the missing data bias analysis shows that the data are not missing at random and the total item response rate (Guideline 4.3.3) is less than 70 percent, regard the estimate as unreliable.
- Suppress or flag estimates that are unreliable due to missing data.
Related Information
Bureau of Transportation Statistics. 2005. BTS Statistical Standards Manual, Chapter 6 (Dissemination of Information). Washington, DC.
Groves, R. 1989. Survey Errors and Survey Costs. New York, NY: Wiley, Chapters 10 and 11.
Interagency Household Survey Nonresponse Group. Information available at http://www.fcsm.gov/committees/ihsng/ihsng.htm as of April 18, 2005.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.2 (Nonresponse Analysis and Response Rate Calculation). Washington, DC. July 14.
Approval Date: April 20, 2005
4.6 Documentation of Data
Processing Procedures
Standard 4.6: The data processing procedures must be
documented for both BTS and public use.
For external source data, the documentation must include procedures used
by the external source as well as procedures that were implemented on the data
at BTS. Documentation must allow
reproduction of the steps leading to the results.
Key Terms: coding, derived
data, edit, external source, imputation, item response, response rates, unit
response, weight
Guideline 4.6.1: Edit Procedures
Documentation must describe:
- The edit rules and their purpose,
- Procedures for handling records that fail edits,
- The codes used to indicate edit disposition (Guideline 4.2.3), and
- The procedures for, and the results of, any edit performance evaluations.
Guideline 4.6.2: Measures of Edit Performance
For key edits as
identified by the data collection staff, maintain measures for the number of:
- Edit messages, by edit disposition (Guideline 4.2.3),
- Edit messages resulting in revisions of the original data, and
- Edit messages overridden, by reason for overriding the edit.
Guideline 4.6.3: Procedures for Handling Missing Data
Documentation of procedures for handling missing data must include:
- The unit response rate or rates,
- Item response rates for key variables as identified by the data collection staff,
- Item response rates for any items with response rates less than 70 percent,
- Formulas used to calculate unit and item response rates,
- Results of nonresponse bias analyses,
- Full documentation of the methods of imputation or weight adjustments,
- A description of the coding schemes used to identify missing and imputed values, and
- An assessment of the nature, extent, and effects of imputation or weight adjustments.
Guideline 4.6.4: Procedures for Coding
Text Information
Document both the source for any coding scheme used and the coding process (whether automated or manual), and make the documentation available to data users. Any reliability or accuracy studies of the coding process should also be documented and made available.
Guideline 4.6.5: Derived Data Items
Documentation should include all formulas, detailed descriptions of how each item was created, and the sources of any external information used to derive additional data items for the file.
Guideline 4.6.6: Information Systems Documentation
Systems for the processing of data should
have documentation of all operations (both automated and manual) necessary to
operate, maintain, and update the systems.
- The documentation should provide an overview of integrated manual and automated operations, workflow, interfaces, and personnel requirements.
- Documentation should be sufficiently detailed and complete that personnel unfamiliar with the systems can become knowledgeable and operate them, if necessary.
- Information systems documentation may be incorporated into existing documentation or written as a separate document.
Guideline 4.6.7: Documentation Updates
Update documentation whenever a major change is made to the processing system, and at least annually when major changes occur less often than once a year.
Related Information
American Association for Public Opinion Research. 1998. “Standard Definitions – Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys.” Available at http://www.aapor.org/ethics/stddef.html as of April 18, 2005.
Office of Management and Budget. 2002. Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies. Federal Register, Vol. 67, No. 36, pp. 8450-8460. Washington, DC. February 22.
Approval Date: April 20, 2005