Appendix A
Key Terms
Note: Sections referencing each key term are
indicated by square brackets.
accuracy [2.2, 5.2]: Accuracy refers to the
closeness of an estimate to the value of the population parameter.
administrative data collection [3.2]: Administrative data are records produced in
conjunction with the administration of a program, such as motor vehicle
registrations. In addition to providing
a source for the data itself, administrative records may also provide
information helpful in the design of the data collection process (e.g.,
sampling lists, stratification information).
archive [6.8]: Archiving is the
preservation of records or documents in long-term storage.
bias [2.3, 3.2, 4.3, 5.2]: Bias refers to a systematic deviation
of an estimate from the value of the population parameter. In statistical estimation, bias exists when
the expected value of an estimator does not equal the parameter that it is
intended to estimate.
bridge estimates, bridge study [2.3, 5.2]: A bridge study defines the relationship
between an existing methodology and a new methodology for the purpose of reconciling
the estimates from both methods.
coding [4.4, 4.6]: Coding is the process of adding alphanumeric values to a data file either to convert
text information to categories that can be more easily counted, tabulated, or
analyzed, or to indicate case-level operational information such as missing
data information.
collection instrument [2.3]: Collection instruments are
devices, such as forms, survey questionnaires, file layouts, online computer
entry screens, traffic sensors, etc., used to collect data.
confidential
[6.3]: “Confidential” is a status accorded to
information identified as sensitive by the authority (law) under which the
information was collected. (The
information is not classified “confidential” in a national security
sense.) Confidential information must be
protected and access to it controlled. See also confidentiality, disclosure limitation, sensitive
material.
confidentiality [2.3, 3.1, 3.3, 4.1, 6.5]:
The term “confidentiality” implies both
a pledge used during data collection, which guarantees that the uses of the
data will be limited to those purposes specified in an Information Collection
Request (ICR), and the active implementation of administrative procedures and
security protocols to protect confidential data from unauthorized
disclosure. See also confidential, disclosure limitation, Information Collection Request (ICR), sensitive material.
coverage [2.2, 2.4, 3.2]: Coverage refers to the relationship
between the elements on a list used as a frame and the target population units. Undercoverage errors occur when target
population units are missed during frame construction, and overcoverage errors
occur when units are duplicated or enumerated in error. See
also frame.
crosswalk
[2.3]: A crosswalk relates categories from one
classification system to categories in another classification system.
data collection [3.3]: Data collection refers
to all processes involved in acquiring data from a target population, including
cases where previously gathered data are obtained from an external source.
derived data [4.6]: Derived data are additional unit-level data
that are either directly calculated from other collected data or added from a
separate data source. Assumptions may be
used in deriving data. For example,
flight distances can be derived from reported origin and destination airports by
assuming that planes fly the most direct routes.
disclosure limitation [6.5]: Disclosure limitation involves
techniques that are used to prevent the public release of individually
identifiable data that were obtained under a pledge of confidentiality. See also confidential, confidentiality, sensitive
material.
edit, editing [4.2, 4.4, 4.6]: Editing is the application of checks that
identify missing, invalid, duplicate, or inconsistent entries, or otherwise
point to data records that are potentially in error.
eligible unit [4.3]: An eligible unit is a unit that is in
the target population. An eligible
sample unit is a unit selected for a sample that is confirmed to be a member of
the target population.
estimates [5.2]: Estimates are numerical values for
population parameters based on data collected from a survey or other sources.
external source [3.1, 3.4, 4.4, 4.6, 6.2, 6.7, 7.2]: An external source is a data source over which
BTS has little or no control during design, planning, and implementation of the
data collection.
frame [2.2, 2.4, 3.2, 3.4, 4.5]: A
frame consists of one or more lists and/or procedures that allow the
target population to be enumerated.
imputation [4.2, 4.3, 4.4, 4.5, 4.6]: Imputation is a statistical procedure
that uses available information and some assumptions to derive substitute
values for missing values in a data file.
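As a minimal illustration (not a BTS procedure), mean imputation replaces each missing value with the mean of the observed values for that item; the function name and data below are hypothetical:

```python
def mean_impute(values):
    """Replace None (missing) entries with the mean of the observed values.

    A deliberately simple sketch of imputation; production methods
    (hot-deck, regression, multiple imputation) are more sophisticated.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# The two missing values are replaced by the observed mean, 20.0.
print(mean_impute([10, None, 20, 30, None]))  # [10, 20.0, 20, 30, 20.0]
```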
incident data [4.5]: Incident data consist of reports submitted only
when a certain kind of event occurs, such as the release of a hazardous
material during shipment.
inference [5.2]: Inference is the statistical derivation of
information from data.
Information Collection Request
(ICR) [3.3]: An ICR is a set of information, required by
the Privacy Act, which is given to a data provider prior to the collection of
any information. See also
confidential, confidentiality, disclosure
limitation, sensitive
material.
information product [6.1, 7.1, 7.2]: An information
product is any agency release of information to the public, regardless
of physical form or characteristic.
Printed reports, micro-data files, press releases, and tables posted on
the web are all information products.
information security [4.1]: Information security refers to the
safeguards, whether administrative or physical, in information systems or the
building space that protect information against unauthorized disclosure and
limit access to only authorized users in accordance with established
procedures.
item [4.3]: An item is the smallest piece of information
that can be obtained from a data collection instrument.
item nonresponse [4.3, 4.5]: Item nonresponse occurs when data are
missing for one or more items in an otherwise complete report.
item response [4.6]: See item nonresponse.
key variable [2.3, 3.3, 5.1]: Key variables are data collection
items for which aggregate estimates are commonly published. Key variables may include important analytic
composites and other policy-relevant variables that are essential elements of
the data collection.
longitudinal [4.5]: A longitudinal data collection is a series of repeated
data collections on the same units over time.
The data from a single unit on a single variable over time constitute a
time series. The analysis of
interrelations within the time series is longitudinal analysis.
major data user [2.1]: See
primary data user.
measurement error [2.3, 3.3]: Measurement error is the difference
between observed values of a variable recorded under similar conditions and its
actual value (e.g., errors in reporting, reading, calculating, or recording a
numerical value).
metadata
[6.4]: Metadata
is descriptive information about a data file.
micro data [6.4]: Micro data are sets of unit-level records. A micro-data file includes the detailed
responses for individual respondents.
missing at random [4.5]: A variable is missing at random if the
probability that an item is missing does
not depend on its value, but may depend on the values of other observed
variables. A variable is missing
completely at random if the probability that an item is missing does not depend
on the values of any items, missing or not.
multivariate analysis [4.3]: Multivariate analysis is a generic term
for many methods of analysis that are used to investigate relationships among two
or more variables.
multivariate modeling [4.5]: Multivariate
modeling is a method of analyzing the relationships between two or more
variables by assuming some form of mathematical model, fitting the model, and statistically
testing the model fit.
nonresponse bias [3.3, 4.3, 4.5]: Nonresponse bias is the impact on the
observed value of an estimate due to differences between respondents and
nonrespondents. The impact of nonresponse on a given estimate is affected
by both the degree of nonresponse (missing data rates) and the degree that the respondents’
reported values differ from what the nonrespondents would have reported (usually
unknown).
objectivity
[6.3]: Objectivity is the accurate, clear, complete,
and unbiased presentation of information developed using sound statistical and
research methods.
outliers [4.2]: An outlier is an isolated extreme high or low
value, not necessarily erroneous, in a statistical distribution.
overall unit nonresponse [4.3, 4.5]: Overall unit nonresponse combines
unit nonresponse across two or more levels of data collection, where
participation at the second stage of data collection is conditional upon
participation in the first stage of data collection.
peer review [6.1]: A peer review is an evaluation
conducted by one or more technical experts independent of an information
product’s development.
population [4.5]: See
target population.
precision
[2.1, 6.6]: Numerical precision refers to the number of significant digits of numerical
values. The precision of a sample estimate
refers to its reliability. See also
reliability.
primary data user [7.3]:
Primary data users are people or organizations who use information
products, in either raw or aggregate form, and are identified in strategic
plans and legislation that support the creation and maintenance of a data
system. See also secondary data user.
probability of selection [4.3]: The probability of selection is the
probability that a given population unit will be selected by a sampling process,
based on the probabilistic methods used in sampling.
record layout [6.4]: A record layout is a description
of the data elements in a file (variable names, data types, and length of space
on the file) and their physical locations.
regulatory data collection [3.2]: A regulatory data collection is mandated by a
regulation to provide information for regulatory purposes. In addition to providing a source for the
data itself, regulatory data may also provide information helpful in the design
of the data collection process (e.g., sampling lists, stratification
information).
reliability
[5.2, 6.3]: Reliability refers to the degree of
consistency of an estimate, such as measured by its relative standard error.
reproducibility
[5.3, 6.8]: Reproducibility is the ability to
substantially replicate the disseminated information.
response rates [2.2, 2.3, 3.3, 4.3, 4.5, 4.6]: A response rate is the proportion of
the eligible units that is represented by the responding units.
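For instance, a simple unweighted response rate is the ratio of responding eligible units to all eligible units; the sketch below uses hypothetical counts and ignores weighting and unknown-eligibility cases that official formulas must address:

```python
def response_rate(responding_units, eligible_units):
    """Unweighted response rate: responding eligible units / all eligible units.

    Illustrative only; standard response-rate definitions also handle
    weighting and units of unknown eligibility.
    """
    return responding_units / eligible_units

# 850 usable responses from 1,000 eligible sample units.
print(response_rate(850, 1000))  # 0.85
```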
revision
[6.7]: A revision is a change made to a previously
disseminated information product.
robustness
[5.2]: The robustness
of an estimator or analysis method is the degree to which the required
calculations are insensitive to violations of their assumptions.
sample substitution [4.3]: Sample substitution refers
to the practice of sampling matched pairs (in which the members of the pair do
not have an independent probability of selection), and obtaining data from the
second member only if the first member does not respond.
secondary data user [7.3]:
Secondary data users are people or organizations who use information
products, in either raw or aggregate form, but who are not identified in
strategic plans and legislation that support the creation and maintenance of a
data system. See also primary data user.
sensitive material [6.1]: Sensitive material is information
whose release would jeopardize confidentiality, privacy, or other guarantees
given to data providers. See also
confidential, confidentiality,
disclosure limitation.
significant
[6.3]: A result is statistically significant if a
statistical test indicates, at a pre-specified probability level, that the
result is unlikely to have occurred by chance.
significant digit [6.6]: A significant digit is a
digit needed to express a number to within the uncertainty of measurement.
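As a rough sketch of rounding to a fixed number of significant digits (the function name is illustrative and assumes a nonzero input):

```python
import math

def round_significant(x, digits):
    """Round a nonzero number to the given count of significant digits.

    Finds the order of magnitude of x, then rounds at the decimal
    position that keeps exactly `digits` significant digits.
    """
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)

print(round_significant(123456, 3))     # 123000
print(round_significant(0.0012345, 2))  # 0.0012
```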
skip pattern [4.2, 4.4]: A
skip pattern in a data collection instrument is the process of skipping
over non-applicable questions depending upon the answer to a prior question.
standard error [6.4]: The standard error is the standard
deviation (or square root of the variance) of an estimator.
See also variance.
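For a sample mean under simple random sampling, the standard error can be estimated as the square root of the sample variance divided by the sample size. This is a textbook sketch, not an estimator for complex survey designs:

```python
import math

def standard_error_of_mean(sample):
    """Estimated standard error of the sample mean for a simple random sample.

    Uses the sample variance with an n - 1 denominator.
    """
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return math.sqrt(variance / n)

# Hypothetical sample of eight observations.
print(standard_error_of_mean([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```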
statistical map [6.2]: Statistical maps are depictions of geographically related
data on maps using colors and symbols. Statistical
maps are also known as thematic maps.
storage [4.1]: Storage refers to warehousing of project
documents and/or data in a secure location.
target audience [5.1]: The target audience is the
set of the data users that a particular information product is intended to
serve.
target population [2.2, 2.4, 3.2]: The target population is the set of all people, businesses, objects, or events about
which information is required.
time series [5.2, 6.3]: A time series is a series of values of a variable at successive
times.
transparency
[5.3, 6.8]: Transparency means possessing sufficient
detail and clarity about data and methods to facilitate reproducibility.
See also reproducibility.
trend
[5.2]: A trend
is a long-term change in the mean of a time series.
unit [4.3]: A unit in a data collection is the entity that provides the
lowest-level raw data, such as a household in a household survey.
unit nonresponse [4.3, 4.5]: Unit nonresponse occurs when a report that should have been
received is completely missing or is received and cannot be used (e.g., garbled
data, missing key variables).
unit response [4.6]: Unit response
occurs when a report is received and contains usable data. See unit nonresponse.
variance [5.2,
6.3, 6.4]: The variance of a sample-based estimator is a measure of the degree
to which the estimate would vary about its mean if it were recomputed on
successive, identically designed samples of the population. The variance of a
population is a measure of the degree to which individual values vary about the
population mean. Technically, it is the expected value of the square of the
difference between a random variable and its mean.
See also standard error.
weight [4.3, 4.5, 4.6, 6.4]: Weights are relative values associated with
each sample unit that are intended to correct for unequal probabilities of
selection for each unit due to sample design.
Weights most frequently represent the number of units in the population
that the sampled unit represents. Weights
may be adjusted for nonresponse.
weighted average [6.2]: A weighted average is a
mean in which the components have non-negative weights that sum to one but are
not necessarily equal.
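A minimal sketch of a weighted average as defined above, with non-negative weights that sum to one (the values and weights are hypothetical):

```python
def weighted_average(values, weights):
    """Weighted mean with non-negative weights that sum to one.

    Validates the weights before combining the values.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(w * v for w, v in zip(weights, values))

# Three estimates combined with unequal weights.
print(weighted_average([10.0, 20.0, 40.0], [0.5, 0.3, 0.2]))  # 19.0
```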
Approval Date: October 5, 2005