Chapter 4 - Processing Data
Once the collected data are in electronic form, some "processing"
is usually necessary to mitigate obvious errors, and some analysis is usually
necessary to convert the data into useful information for decision documents, publications,
and postings on the Internet.
4.1 Data Editing and Coding
Principles
- Data editing is the application of checks that identify missing, invalid,
duplicate, or inconsistent entries, or that otherwise point to data records that
are potentially in error.
- Typical data editing includes range checks, validity checks, consistency
checks (comparing answers to related questions), and checks for duplicate
records.
- For numerical data, "outliers" are not necessarily bad data. They
should be examined for possible correction, rather than systematically deleted.
Note: By "examine" we mean you can check the original
forms, compare data items with each other for consistency, and/or follow up
with the original source, all to see whether the data are accurate or error has
been introduced.
- Editing is a final inspection-correction method. It is almost always necessary,
but data quality is better achieved much earlier in the process through
clarity of definitions, forms design, data collection procedures, etc.
- Coding is the process of adding codes to the data set as additional information
or converting existing information into a more useful form. Some codes indicate
information about the collection. Other codes are conversions of data, such
as text data, into a form more useful for data analysis.
For example, a code is usually added to indicate the "outcome" of
each case. If there were multiple follow-up phases, the code may indicate
in which phase the result was collected. Codes may also be added to indicate
editing and missing data actions taken. Text entries are often coded to facilitate
analysis; for example, a free-form text entry of a person's occupation may be
assigned a standard occupation code.
- Many coding schemes have been standardized.
Examples: the North American Industry Classification System (NAICS) codes,
the Federal Information Processing Standards (FIPS) for geographic codes (country,
state, county, etc.), and the Standard Occupational Classification (SOC) codes.
Guidelines
- An editing process should be applied to every data collection and to third-party
data to reduce obvious error in the data. A minimum editing process should
include range checks, validity checks, checks for duplicate entries, and consistency
checks.
Examples of edits: If a data element has five categories numbered from 1 to
5, an answer of 8 should be edited to delete the 8 and flag it as a missing
data value. Range checks should be applied to numerical values (e.g., income
should not be negative). Rules should be created to deal with inconsistency
(e.g., if dates are given for a train accident and the accident date is before
the departure date, the rule would say how to deal with it). Data records
should be examined for obvious duplicates.
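For illustration, a minimal sketch (in Python with pandas) of such automated checks;
the column names used here (record_id, category, income, departure_date, accident_date)
are hypothetical, and actual edit rules should follow the collection's documented
specifications.

    import numpy as np
    import pandas as pd

    def apply_edits(df: pd.DataFrame) -> pd.DataFrame:
        """Apply basic validity, range, consistency, and duplicate checks."""
        df = df.copy()

        # Validity check: a category outside 1-5 is deleted and flagged as missing.
        bad_cat = ~df["category"].isin([1, 2, 3, 4, 5])
        df.loc[bad_cat, "category"] = np.nan
        df["category_edit_flag"] = np.where(bad_cat, "set_to_missing", "")

        # Range check: income should not be negative.
        bad_income = df["income"] < 0
        df.loc[bad_income, "income"] = np.nan
        df["income_edit_flag"] = np.where(bad_income, "out_of_range", "")

        # Consistency check: an accident date should not precede the departure date.
        inconsistent = df["accident_date"] < df["departure_date"]
        df["date_consistency_flag"] = np.where(inconsistent, "accident_before_departure", "")

        # Duplicate check: flag repeated record identifiers for review.
        df["duplicate_flag"] = df.duplicated(subset="record_id", keep=False)
        return df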
- Most editing decisions should be made in advance and automated. Reliance
on manual intervention in editing should be minimized, since it may introduce
human error.
- Do not apply outlier edits so aggressively that real effects and trends would
be hidden. Outliers can be very informative for analysis. Over-editing can
lead to severe biases resulting from fitting data to implicit models imposed
by the edits.
Rapid industry changes could be missed if an agency follows an overly restrictive
editing regimen that rejects large changes.
- Some method should be used to allow after-the-fact identification of edits.
One method is to add a separate field containing an edit code (i.e., a "flag").
Another is to keep "version" files, though this provides less
information to the users.
- To avoid quality problems from inconsistent coding and spelling variations, text
information to be used for data analysis should be coded using a standard
coding scheme (e.g., NAICS, SOC, and FIPS discussed above). Retain the text
information for troubleshooting.
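For illustration, a minimal sketch of coding free-text entries to a standard scheme
while retaining the original text; the lookup table and codes below are hypothetical
and only show the pattern, not a real coding operation.

    import pandas as pd

    # Hypothetical lookup from cleaned text to an SOC-style occupation code.
    occupation_lookup = {
        "airline pilot": "53-2011",
        "truck driver": "53-3032",
    }

    df = pd.DataFrame({"occupation_text": ["Airline Pilot ", "truck driver", "ship captain"]})

    # Retain the original text; add the standard code in a separate column.
    cleaned = df["occupation_text"].str.strip().str.lower()
    df["occupation_code"] = cleaned.map(occupation_lookup)  # unmatched entries become NaN for review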
- The editing and coding process should clearly identify missing values
on the data file. The method of identifying missing values should be clearly
described in the file documentation. Special consideration should be given
to files that will be directly manipulated by analysts or users. Blanks
or zeros used to indicate missing data have historically caused confusion.
Also, using a code to identify the reason for the missing data will facilitate
missing data analysis.
- The editing and coding process and editing statistics should be documented
and clearly posted with the data, or with disseminated output from the data.
References
- Little, R. and P. Smith (1987) "Editing and Imputation for Quantitative
Survey Data," Journal of the American Statistical Association,
Vol. 82, No. 397, pp. 58-68.
4.2 Handling Missing Data
Principles
- Untreated, missing data can introduce serious error into estimates. Frequently,
there is a correlation between the characteristics of those missing and
variables to be estimated, resulting in biased estimates. For this reason,
it is often best to employ adjustments and imputation to mitigate this damage.
- Without weight adjustments or imputation, estimates of totals will be underestimated.
Essentially, zeroes are implicitly imputed for the missing items.
- One method used to deal with unit-level missing data is weighting adjustments.
All cases, including the missing cases, are put into classes using variables
known for both types. Within the classes, the weights for the missing cases
are evenly distributed among the non-missing cases.
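For illustration, a minimal sketch of such a weighting-class adjustment; the column
names (weight, adj_class, responded) are hypothetical, and the classes are assumed
to be defined from variables known for all cases.

    import pandas as pd

    def weighting_class_adjustment(df: pd.DataFrame) -> pd.DataFrame:
        """Inflate respondent weights within each class so the class weight total is preserved."""
        df = df.copy()
        df["adj_weight"] = 0.0  # nonrespondents carry no weight after adjustment
        for _, cell in df.groupby("adj_class"):
            total = cell["weight"].sum()                          # all cases in the class
            resp_total = cell.loc[cell["responded"], "weight"].sum()
            factor = total / resp_total                           # assumes each class has respondents
            resp_idx = cell.index[cell["responded"]]
            df.loc[resp_idx, "adj_weight"] = df.loc[resp_idx, "weight"] * factor
        return df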
- "Imputation" is a process that substitutes values for missing
or inconsistent reported data. Such substitutions may be strongly implied
by known information or derived as statistical estimates.
- If imputation is employed and flagged, users can either use the imputed
values or deal with the missing data themselves.
- The impact of missing data for a given estimate is a combination of how
much is missing (often known via the missing data rates) and how much the
missing differ from the sources that provided data in relation to the estimate
(usually unknown).
For example, given a survey of airline pilots that asks about near misses
they were involved in and whether they reported them, it is known how many
of the sampled pilots did not respond. It will not be known whether the pilots
who responded had fewer near misses than those who did not.
- For samples with unequal probabilities, weighted missing data rates give
a better indication of impact of missing data across the population than
do unweighted rates.
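For illustration, a minimal sketch contrasting unweighted and weighted unit missing
data rates; the weights and response statuses below are hypothetical.

    import pandas as pd

    df = pd.DataFrame({
        "weight":    [10, 10, 10, 200, 200],
        "responded": [True, True, False, True, False],
    })

    unweighted_rate = (~df["responded"]).mean()  # 2 of 5 cases missing = 0.40
    weighted_rate = df.loc[~df["responded"], "weight"].sum() / df["weight"].sum()
    # 210 / 430, about 0.49: the missing cases represent a larger share of the population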
Guidelines
- Unit nonresponse should normally be handled with a weighting adjustment
as described above; if no adjustment is made, data users should be informed
about the missing values.
- Imputing for missing item-level data (see definition above) should be
considered to mitigate bias. A missing data expert should make or review
decisions about imputation. If imputation is used, a separate field containing
a code (i.e., a flag) should be added to the imputed data file indicating
which variables have been imputed and by what method.
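For illustration, a minimal sketch of class-mean imputation with an accompanying flag;
the column names are hypothetical, and the choice of imputation model (hot deck,
regression, etc.) should be made or reviewed by a missing data expert.

    import pandas as pd

    def impute_income(df: pd.DataFrame) -> pd.DataFrame:
        """Fill missing income with the mean of its imputation class and flag the imputed values."""
        df = df.copy()
        missing = df["income"].isna()
        class_means = df.groupby("imp_class")["income"].transform("mean")
        df.loc[missing, "income"] = class_means[missing]
        df["income_impute_flag"] = ""
        df.loc[missing, "income_impute_flag"] = "class_mean"
        return df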
- All methods of imputation or weight adjustments should be fully documented.
- The missing data effect should be analyzed. For periodic data collections,
it should be analyzed after each collection. For continuous collections,
it should be analyzed at least annually. As a minimum, the analysis should
include missing data rates at the unit and item levels and analysis of the
characteristics of the reporters and the non-reporters to see how they differ.
For some reporting collections, such as with incidents, missing data rates
may not be known. For such cases, estimates or just text information on
what is known should be provided.
- For sample designs using unequal probabilities (e.g., stratified designs
with optimal allocation), weighted missing data rates should be reported
along with unweighted missing data rates.
References
- Chapter 4, Statistical Policy Working Paper 31, Measuring and Reporting
Sources of Error in Surveys, Statistical Policy Office, Office of Information
and Regulatory Affairs, Office of Management and Budget, July 2001.
- The American Association for Public Opinion Research. 2000. Standard
Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys.
Ann Arbor, Michigan: AAPOR.
4.3 Production of Estimates and Projections
Principles
- "Derived" data items are additional case-level data that are
either directly calculated from other data collected (e.g., # of days from
two dates), added from a separate data source (e.g., the weather on a given
date), or some combination of the two (e.g., give the departing and arriving
airports, calculating distance from an external source). Deriving data is
a way to enhance the data set without increasing respondent burden or significantly
raising costs.
- An "estimate" is an approximation of some characteristic of
the target group, like the average age, constructed from the data.
- A "projection" is a prediction of an outcome from the target group,
usually in the future.
Examples: The average daily traffic volume at a given point of the Garden
State Parkway in New Jersey two years from now. Total airline operations ten
years from now.
- Estimates from samples should be calculated taking the sample design into
account. The most common way to do this is to compute weighted estimates using
weights based on the design.
- The estimated standard error of an estimate gives an indication of
its precision. However, it does not include a measure of any bias
that may be introduced by problems in collection or design.
Guidelines
- Use derived data to enhance the data set without additional burden on data
suppliers.
For example, the data collection can note the departure and arrival airports,
and the distance of the flight can then be derived from a separate lookup table,
as sketched below.
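For illustration, a minimal sketch of that derivation; the airports and mileages in
the lookup table are only illustrative.

    import pandas as pd

    # Hypothetical lookup table of distances between airport pairs.
    distance_table = pd.DataFrame({
        "origin":      ["BWI", "BWI"],
        "destination": ["ORD", "ATL"],
        "miles":       [621, 577],       # illustrative values only
    })

    flights = pd.DataFrame({"origin": ["BWI", "BWI"], "destination": ["ATL", "ORD"]})

    # Add the derived distance without asking data suppliers for it.
    flights = flights.merge(distance_table, on=["origin", "destination"], how="left")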
- Weights should be used in all estimates from samples. Weights give the number
of cases in the target group that each case represents, and are calculated
as the inverse of the sampling probability. If using weights, adjust weights
for nonresponse as discussed in section 4.2.
For example, the National Household Travel Survey is designed to be a sample
representing the households of the United States, so the total of the weights
for all sample households should equal the number of households in the United
States. Due to sampling variability, it won’t. Since we have a very good count
of households in the United States from the 2000 Census, we can do a ratio
adjustment of all weights to make them total to that count.
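For illustration, a minimal sketch of such a ratio adjustment to a known control
total; the weights and the control count are hypothetical.

    import pandas as pd

    households = pd.DataFrame({"weight": [1200.0, 950.0, 1100.0, 875.0]})

    control_total = 4500.0  # household count known from an outside source (e.g., a census)
    factor = control_total / households["weight"].sum()
    households["adj_weight"] = households["weight"] * factor
    # The adjusted weights now sum to the control total (up to rounding).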
- Construct estimation methods using published techniques or your own documented
derivations appropriate for the characteristic being estimated. Forecasting
experts should be consulted when determining projections.
Example: You have partial year data and you want to estimate whole year data.
A simple method is to use past partial year to whole year ratios (if stable
year to year) to construct an extrapolation projection (Armstrong 2001).
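For illustration, a minimal sketch of that ratio extrapolation with hypothetical
partial-year and whole-year totals.

    # Historical January-June and full-year totals (hypothetical values).
    partial_by_year = {2001: 410.0, 2002: 430.0, 2003: 455.0}
    whole_by_year   = {2001: 830.0, 2002: 860.0, 2003: 905.0}

    ratios = [whole_by_year[y] / partial_by_year[y] for y in partial_by_year]
    avg_ratio = sum(ratios) / len(ratios)   # appropriate only if the ratio is stable year to year

    current_partial = 470.0                 # January-June total for the current year
    projected_whole_year = current_partial * avg_ratio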
- Standard error estimates should accompany any estimates from samples.
Standard errors should be calculated taking the sample design into account.
For more complex sample designs, use replicated methods (e.g., jackknife,
successive differences) incorporating the sample weights. Consult with a
variance estimation expert.
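For illustration, a minimal sketch of a delete-one jackknife standard error for a
weighted mean; it assumes independent units, so for stratified or clustered designs
replicate weights or specialized survey software should be used instead.

    import numpy as np

    def jackknife_se_weighted_mean(y, w):
        """Delete-one jackknife standard error of the weighted mean of y with weights w."""
        y = np.asarray(y, dtype=float)
        w = np.asarray(w, dtype=float)
        n = len(y)
        full = np.average(y, weights=w)
        # Recompute the estimate with each unit deleted in turn.
        reps = np.array([np.average(np.delete(y, i), weights=np.delete(w, i)) for i in range(n)])
        variance = (n - 1) / n * np.sum((reps - full) ** 2)
        return np.sqrt(variance)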
- Ensure that any statistical software used in constructing estimates and
their standard errors uses methods that take into account the design of the
data collection.
- The methods used for estimations and projections should be documented
and clearly posted with the resulting data.
References
- Armstrong, J.S. (2001). "Extrapolation of Time Series and Cross-Sectional
Data," in Principles of Forecasting: A Handbook for Researchers
and Practitioners, edited by J. S. Armstrong, Boston: Kluwer.
- Cochran, William G. (1977), Sampling Techniques (3rd
Ed.). New York: Wiley.
- Wolter, K.M. (1985). Introduction to Variance Estimation. New York:
Springer-Verlag.
4.4 Data Analysis and Interpretation
Principles
- Careful planning of complex analyses should involve all concerned parties.
Data analysis starts with questions that need to be answered. Analyses should
be designed to focus on answering the key questions rather than showing
all data results from a collection.
- Analysis methods are built on probability theory, allowing the analyst
to separate genuine information from uncertainty.
- For analysis of data collected using complex sample designs, such as surveys,
the design must be taken into account when determining data analysis methods
(e.g., use weights, replication for variances).
- Estimates from 100% data collections do not have sampling error, though
they are usually measuring a random phenomenon (e.g., highway fatalities),
and therefore have a non-zero variance.
- Data collected at sequential points in time often require analysis with
time series methods to account for inter-correlation of the sequential points.
Similarly, data collected from contiguous geographical areas require spatial
data analysis.
Note: Methods like linear regression assume independence of the data points,
which may make them invalid in time and geographical cases. The biggest impact
is in variance estimation and testing.
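For illustration, a minimal sketch of a quick check for serial correlation before
applying a method that assumes independent observations; the series is hypothetical.

    import numpy as np

    y = np.array([100, 104, 103, 110, 115, 113, 120, 126, 125, 131], dtype=float)

    # Lag-1 autocorrelation: values near zero are consistent with independence;
    # strongly positive values suggest time series methods are needed.
    lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]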
- Interpretation should take into account the stability of the process being
analyzed. If the analysis interprets something about a process, but the
process has been altered significantly since the data collection, the analysis
results may have limited usefulness in decision making.
- The "robustness" of analytical methods is their sensitivity
to assumption violation. Robustness is a critical factor in planning and
interpreting an analysis.
Guidelines
- The planning of data analysis should begin with identifying the questions
that need to be answered. For all but simplistic analyses, a project plan
should be developed. Subject matter experts should review the plan to ensure
that the analysis is relevant to the questions that need answering. Data
analysis experts should review the plan (even if one of them wrote it) to ensure
proper methods are used. Even "exploratory analyses" should be
planned.
- All statistical methods used should be justifiable by statistical derivation
or reference to statistical literature. The analysis process should be accompanied
by a diagnostic evaluation of the analysis assumptions. The analysis should
also include an examination of the probability that statistical assumptions
will be violated to various degrees, and the effect such violations would
have on the conclusions. All methods, derivations or references, assumption
diagnostics, and the robustness checks should be documented in the plan and
the final report.
Choices of data analysis methods include descriptive statistics for each variable,
a wide range of graphical methods, comparison tests, multiple linear regression,
logistic regression, analysis of variance, nonparametric methods, nonlinear
models, Bayesian methods, control charts, data mining, cluster analysis, and
factor analysis (this list is not exhaustive).
- Any analysis of data collected using a complex sample design should incorporate
the sample design into the methods via weights and changes to variance estimation
(e.g., replication).
- Data analysis for the relationship between two or more variables should
include other related variables to assist in the interpretation. For example,
an analysis may find a relationship between race and travel habits. That
analysis should probably include income, education, and other variables
that vary with race. Missing important variables can lead to bias. A subject
matter expert should choose the related variables.
- Results of the analysis should be documented and either included with
any report that uses the results or posted with it. It should be written
to focus on the questions that are answered, identify the methods used (along
with the accompanying assumptions) with derivation or reference, and include
limitations of the analysis. The analysis report should always contain a
statement of the limitations including coverage and response limitations
(e.g., not all private transit operators are included in the National Transit
Database; any analysis should take this into account). The wording of the
results of the analysis should reflect the fact that statistically significant
results are only an indication that the null hypothesis may not hold true.
They are not absolute proof. Similarly, when a test does not show significance,
it does not mean that the null hypothesis is true, it only means that there
was insufficient evidence to reject it.
- Results from analysis of 100 percent data typically should not include tests
or confidence intervals that are based on a sampling concept. Any test or
confidence interval should use a measure of the variability of the underlying
random phenomenon.
For example, the standard error of the time series can be used to measure
the variance of the underlying random phenomenon with 100 percent data over
time. It can also be used to measure sampling error and underlying variance
when the sample is not 100 percent.
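For illustration, a minimal sketch of one way to measure the variability of a complete
(100 percent) annual series: the standard deviation of the counts around a simple
linear trend. The counts are hypothetical, and the choice of trend model is an assumption.

    import numpy as np

    counts = np.array([42800, 42100, 43500, 42900, 42600, 43200], dtype=float)
    years = np.arange(len(counts), dtype=float)

    # Fit a simple linear trend and measure spread around it.
    slope, intercept = np.polyfit(years, counts, 1)
    residuals = counts - (intercept + slope * years)
    underlying_se = residuals.std(ddof=2)   # ddof=2 because two trend parameters were fit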
- The interpretation of the analysis results should comment on the stability
of the process analyzed.
For example, if an analysis were performed on two years of airport security
data prior to the creation of the Transportation Security Administration and the new
screening workforce, the interpretation of the results relative to the new
processes would be questionable.
References
- Skinner, C., D. Holt, and T. Smith. 1989. Analysis of Complex Surveys.
New York, NY: Wiley.
- Tukey, J. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
- Agresti, A. 1990. Categorical Data Analysis. New York, NY: Wiley.