Bureau of Transportation Statistics (BTS)

Appendix A
Key Terms

Note:  Sections referencing each key term are indicated by square brackets.

accuracy [2.2, 5.2]:  Accuracy refers to the closeness of an estimate to the value of the population parameter.

administrative data collection [3.2]:  Administrative data are records produced in conjunction with the administration of a program, such as motor vehicle registrations.  In addition to providing a source for the data itself, administrative records may also provide information helpful in the design of the data collection process (e.g., sampling lists, stratification information).

archive [6.8]:  Archiving is the preservation of records or documents in long-term storage.

bias [2.3, 3.2, 4.3, 5.2]:  Bias refers to a systematic deviation of an estimate from the value of the population parameter.  In statistical estimation, bias exists when the expected value of an estimator does not equal the parameter that it is intended to estimate.
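
In symbols, the second sentence can be written as follows.  The notation is assumed here for illustration and is not part of the glossary: $\hat{\theta}$ is the estimator and $\theta$ is the parameter it is intended to estimate.

```latex
\operatorname{Bias}(\hat{\theta}) = \operatorname{E}[\hat{\theta}] - \theta ,
\qquad
\hat{\theta} \text{ is unbiased when } \operatorname{E}[\hat{\theta}] = \theta .
```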

bridge estimates, bridge study [2.3, 5.2]:  A bridge study defines the relationship between an existing methodology and a new methodology for the purpose of reconciling the estimates from both methods.

coding [4.4, 4.6]:  Coding is the process of adding alphanumeric values to a data file either to convert text information to categories that can be more easily counted, tabulated, or analyzed, or to indicate case-level operational information such as missing data information.

collection instrument [2.3]:  Collection instruments are devices, such as forms, survey questionnaires, file layouts, online computer entry screens, traffic sensors, etc., used to collect data.

confidential [6.3]:  “Confidential” is a status accorded to information identified as sensitive by the authority (law) under which the information was collected.  (The information is not classified “confidential” in a national security sense.)  Confidential information must be protected and access to it controlled.  See also confidentiality, disclosure limitation, sensitive material.

confidentiality [2.3, 3.1, 3.3, 4.1, 6.5]:  The term “confidentiality” implies both a pledge used during data collection, which guarantees that the uses of the data will be limited to those purposes specified in an Information Collection Request (ICR), and the active implementation of administrative procedures and security protocols to protect confidential data from unauthorized disclosure.  See also confidential, disclosure limitation, Information Collection Request (ICR), sensitive material.

coverage [2.2, 2.4, 3.2]:  Coverage refers to the relationship between the elements on a list used as a frame and the target population units.  Undercoverage errors occur when target population units are missed during frame construction, and overcoverage errors occur when units are duplicated or enumerated in error.  See also frame.

crosswalk [2.3]:  A crosswalk relates categories from one classification system to categories in another classification system.
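
A crosswalk can be pictured as a lookup table.  The sketch below is only an illustration: the category codes and the function name `recode` are hypothetical and do not come from any BTS classification system.

```python
# Hypothetical crosswalk from an old classification to a new one.
# Two old categories ("LT" and "PU") map to a single new category.
CROSSWALK = {
    "PC": "passenger_car",
    "LT": "light_truck",
    "PU": "light_truck",
    "MC": "motorcycle",
}


def recode(old_codes, crosswalk):
    """Translate old-system codes to new-system categories.

    Codes with no entry in the crosswalk are returned separately
    so that they can be reviewed rather than silently dropped.
    """
    recoded, unmatched = [], []
    for code in old_codes:
        if code in crosswalk:
            recoded.append(crosswalk[code])
        else:
            unmatched.append(code)
    return recoded, unmatched
```

For example, `recode(["PC", "PU", "XX"], CROSSWALK)` separates the two recognized codes from the unrecognized code `"XX"`.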

data collection [3.3]:  Data collection refers to all processes involved in acquiring data from a target population, including cases where previously gathered data are obtained from an external source.

derived data [4.6]:  Derived data are additional unit-level data that are either directly calculated from other collected data or added from a separate data source.  Assumptions may be used in deriving data.  For example, flight distances can be derived from reported origin and destination airports by assuming that planes fly the most direct routes.
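
The flight-distance example might be sketched as below, taking "most direct route" to mean the great-circle path on a spherical Earth.  That reading, the haversine formula, the Earth radius constant, and the airport codes and coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km; inputs are in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical airports: made-up codes with (latitude, longitude) in degrees.
AIRPORTS = {"AAA": (0.0, 0.0), "BBB": (0.0, 90.0)}


def derive_flight_distance(origin, destination):
    """Derive a distance item from the reported origin and destination codes."""
    lat1, lon1 = AIRPORTS[origin]
    lat2, lon2 = AIRPORTS[destination]
    return great_circle_km(lat1, lon1, lat2, lon2)
```

The derived value is only as good as the assumption behind it: actual flown distances are typically longer than the great-circle distance.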

disclosure limitation [6.5]:  Disclosure limitation involves techniques that are used to prevent the public release of individually identifiable data that were obtained under a pledge of confidentiality.  See also confidential, confidentiality, sensitive material.

edit, editing [4.2, 4.4, 4.6]:  Editing is the application of checks that identify missing, invalid, duplicate, or inconsistent entries, or otherwise point to data records that are potentially in error.
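
A minimal sketch of such checks follows.  The record fields (`id`, `mode`, `tons`, `depart_hour`, `arrive_hour`) and the specific rules are hypothetical; a real edit specification would come from the data collection's own documentation.

```python
def run_edit_checks(records, valid_modes):
    """Return (record_index, problem) flags without altering the records."""
    flags = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # Missing entry: a required item has no value.
        if rec.get("tons") is None:
            flags.append((i, "missing tons"))
        # Invalid entry: the value is outside the allowed code list.
        if rec.get("mode") not in valid_modes:
            flags.append((i, "invalid mode"))
        # Duplicate entry: the same identifier was already seen.
        if rec.get("id") in seen_ids:
            flags.append((i, "duplicate id"))
        seen_ids.add(rec.get("id"))
        # Inconsistent entries: two items contradict each other.
        dep, arr = rec.get("depart_hour"), rec.get("arrive_hour")
        if dep is not None and arr is not None and arr < dep:
            flags.append((i, "arrival before departure"))
    return flags
```

The function only flags records as potentially in error; deciding how to resolve each flag is a separate step.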

eligible unit [4.3]:  An eligible unit is a unit that is in the target population.  An eligible sample unit is a unit selected for a sample that is confirmed to be a member of the target population.

estimates [5.2]:  Estimates are numerical values for population parameters based on data collected from a survey or other sources.

external source [3.1, 3.4, 4.4, 4.6, 6.2, 6.7, 7.2]:  An external source is a data source over which BTS has little or no control during design, planning, and implementation of the data collection.

frame [2.2, 2.4, 3.2, 3.4, 4.5]:  A frame consists of one or more lists and/or procedures that allow the target population to be enumerated.

imputation [4.2, 4.3, 4.4, 4.5, 4.6]:  Imputation is a statistical procedure that uses available information and some assumptions to derive substitute values for missing values in a data file.
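
One simple procedure of this kind is class-mean imputation, sketched below.  It is shown only as an example of the idea, not as a recommended or BTS-endorsed method; the assumption is that units in the same class have similar values, and the field names are hypothetical.

```python
from statistics import mean


def impute_class_mean(records, item, class_key):
    """Fill missing (None) values of `item` with the mean of the reported
    values in the same class.  Returns new records with an `imputed` flag;
    the input records are left unchanged."""
    reported = {}
    for rec in records:
        if rec[item] is not None:
            reported.setdefault(rec[class_key], []).append(rec[item])

    result = []
    for rec in records:
        new_rec = dict(rec)
        new_rec["imputed"] = False
        # Impute only when the class has at least one reported value.
        if rec[item] is None and rec[class_key] in reported:
            new_rec[item] = mean(reported[rec[class_key]])
            new_rec["imputed"] = True
        result.append(new_rec)
    return result
```

Keeping an `imputed` flag on each record lets data users distinguish substitute values from reported ones.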

incident data [4.5]:  Incident data consist of reports submitted only when a certain kind of event occurs, such as the release of a hazardous material during shipment.

inference [5.2]:  Inference is the statistical derivation of information from data.

Information Collection Request (ICR) [3.3]:  An ICR is a set of information, required by the Privacy Act, which is given to a data provider prior to the collection of any information.  See also confidential, confidentiality, disclosure limitation, sensitive material.

information product [6.1, 7.1, 7.2]:  An information product is any agency release of information to the public, regardless of physical form or characteristic.  Printed reports, micro-data files, press releases, and tables posted on the web are all information products.

information security [4.1]:  Information security refers to the safeguards, whether administrative or physical, in information systems or the building space that protect information against unauthorized disclosure and limit access to only authorized users in accordance with established procedures.

item [4.3]:  An item is the smallest piece of information that can be obtained from a data collection instrument.

item nonresponse [4.3, 4.5]:  Item nonresponse occurs when data are missing for one or more items in an otherwise complete report.

item response [4.6]:  See item nonresponse.

key variable [2.3, 3.3, 5.1]:  Key variables are data collection items for which aggregate estimates are commonly published.  Key variables may include important analytic composites and other policy-relevant variables that are essential elements of the data collection.

longitudinal [4.5]:  A longitudinal data collection is a series of repeated data collections on the same units over time.  The data from a single unit on a single variable over time constitute a time series.  The analysis of interrelations within the time series is longitudinal analysis.

major data user [2.1]:  See primary data user.

measurement error [2.3, 3.3]:  Measurement error is the difference between observed values of a variable recorded under similar conditions and its actual value (e.g., errors in reporting, reading, calculating, or recording a numerical value).

metadata [6.4]:  Metadata is descriptive information about a data file.

micro data [6.4]:  Micro data are sets of unit-level records.  A micro-data file includes the detailed responses for individual respondents.

missing at random [4.5]:  A variable is missing at random if the probability that an item is missing does not depend on its value, but may depend on the values of other observed variables.  A variable is missing completely at random if the probability that an item is missing does not depend on the values of any items, missing or not.
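
The two conditions can be stated in symbols.  The notation is assumed for illustration only: $M = 1$ indicates that the item is missing, $Y$ is the item's own value, and $X$ stands for the other observed variables.

```latex
\text{Missing at random:} \quad
\Pr(M = 1 \mid Y, X) = \Pr(M = 1 \mid X)

\text{Missing completely at random:} \quad
\Pr(M = 1 \mid Y, X) = \Pr(M = 1)
```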

multivariate analysis [4.3]:  Multivariate analysis is a generic term for many methods of analysis that are used to investigate relationships among two or more variables.

multivariate modeling [4.5]:  Multivariate modeling is a method of analyzing the relationships between two or more variables by assuming some form of mathematical model, fitting the model, and statistically testing the model fit.

nonresponse bias [3.3, 4.3, 4.5]:  Nonresponse bias is the impact on the observed value of an estimate due to differences between respondents and nonrespondents.  The impact of nonresponse on a given estimate is affected by both the degree of nonresponse (missing data rates) and the degree that the respondents’ reported values differ from what the nonrespondents would have reported (usually unknown).

objectivity [6.3]:  Objectivity is the accurate, clear, complete, and unbiased presentation of information developed using sound statistical and research methods.

outliers [4.2]:  An outlier is an isolated extreme high or low value, not necessarily erroneous, in a statistical distribution.

overall unit nonresponse [4.3, 4.5]:  Overall unit nonresponse combines unit nonresponse across two or more levels of data collection, where participation at the second stage of data collection is conditional upon participation in the first stage of data collection.
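
One common way to combine stages, assumed here and not stated in the definition itself, is to multiply the stage-level response rates, since only units that respond at the first stage can respond at the second.  The function name is hypothetical.

```python
def overall_response_rate(stage_rates):
    """Multiply conditional stage-level response rates (each between 0 and 1)
    to obtain an overall response rate across all stages."""
    overall = 1.0
    for rate in stage_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each stage rate must lie between 0 and 1")
        overall *= rate
    return overall
```

With a first-stage rate of 0.8 and a second-stage rate of 0.5, the overall response rate is about 0.4, so overall unit nonresponse is about 0.6.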

peer review [6.1]:  A peer review is an evaluation conducted by one or more technical experts independent of an information product’s development.

population [4.5]:  See target population.

precision [2.1, 6.6]:  Numerical precision refers to the number of significant digits of numerical values.  Precision of a sample estimate refers to its reliability.  See also reliability.

primary data user [7.3]:  Primary data users are people or organizations who use information products, in either raw or aggregate form, and are identified in strategic plans and legislation that support the creation and maintenance of a data system.  See also secondary data user.

probability of selection [4.3]:  The probability of selection is the probability that a given population unit will be selected by a sampling process, based on the probabilistic methods used in sampling.

record layout [6.4]:  A record layout is a description of the data elements in a file (variable names, data types, and length of space on the file) and their physical locations.

regulatory data collection [3.2]:  A regulatory data collection is mandated by a regulation to provide information for regulatory purposes.  In addition to providing a source for the data itself, regulatory data may also provide information helpful in the design of the data collection process (e.g., sampling lists, stratification information).

reliability [5.2, 6.3]:  Reliability refers to the degree of consistency of an estimate, such as measured by its relative standard error.

reproducibility [5.3, 6.8]:  Reproducibility is the ability to substantially replicate the disseminated information.

response rates [2.2, 2.3, 3.3, 4.3, 4.5, 4.6]:  A response rate is the proportion of the eligible units that is represented by the responding units.
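
As a minimal sketch, an unweighted response rate is a simple ratio of counts.  The function name is hypothetical, and the definition above does not say whether units should be weighted, so the unweighted form is an assumption.

```python
def unit_response_rate(n_responding, n_eligible):
    """Unweighted response rate: responding units divided by eligible units."""
    if n_eligible <= 0:
        raise ValueError("there must be at least one eligible unit")
    if not 0 <= n_responding <= n_eligible:
        raise ValueError("responding units cannot exceed eligible units")
    return n_responding / n_eligible
```

For example, 150 responding units out of 200 eligible units give a response rate of 0.75.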

revision [6.7]:  A revision is a change made to a previously disseminated information product.

robustness [5.2]:  The robustness of an estimator or analysis method is the degree to which the required calculations are insensitive to violations of their assumptions.

sample substitution [4.3]:  Sample substitution refers to the practice of sampling matched pairs (in which the members of the pair do not have an independent probability of selection), and obtaining data from the second member only if the first member does not respond.

secondary data user [7.3]:  Secondary data users are people or organizations who use information products, in either raw or aggregate form, but who are not identified in strategic plans and legislation that support the creation and maintenance of a data system.  See also primary data user.

sensitive material [6.1]:  Sensitive material is information whose release would jeopardize confidentiality, privacy, or other guarantees given to data providers.  See also confidential, confidentiality, disclosure limitation.

significant [6.3]:  A result is statistically significant if a statistical test indicates, at a pre-specified probability level, that the result is unlikely to have occurred by chance.

significant digit [6.6]:  A significant digit is a digit needed to express a number to within the uncertainty of measurement.
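
When values are prepared for presentation, they are often rounded to a chosen number of significant digits.  The helper below is a hypothetical sketch of that rounding step; choosing how many digits are justified by the uncertainty of measurement remains a judgment outside the code.

```python
import math


def round_to_significant(x, digits):
    """Round x to the given number of significant digits."""
    if digits < 1:
        raise ValueError("digits must be at least 1")
    if x == 0:
        return 0.0
    # Position of the leading digit, e.g. 5 for 123456 and -3 for 0.004567.
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)
```

For example, `round_to_significant(123456, 2)` keeps only the two leading digits and replaces the rest with zeros.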

skip pattern [4.2, 4.4]:  A skip pattern in a data collection instrument is the process of skipping over non-applicable questions depending upon the answer to a prior question.

standard error [6.4]:  The standard error is the standard deviation (or square root of the variance) of an estimator.  See also variance.
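
The sketch below computes a standard error from a variance, together with the relative standard error mentioned under reliability.  The function names are hypothetical, and the relative standard error is taken here to be the standard error divided by the absolute value of the estimate, which is the usual convention but is not spelled out in this glossary.

```python
import math


def standard_error(variance):
    """Standard error: the square root of the estimator's variance."""
    if variance < 0:
        raise ValueError("variance cannot be negative")
    return math.sqrt(variance)


def relative_standard_error(estimate, variance):
    """Standard error expressed as a fraction of the estimate's magnitude."""
    if estimate == 0:
        raise ValueError("relative standard error is undefined for a zero estimate")
    return standard_error(variance) / abs(estimate)
```

An estimate of 50 with a variance of 4 has a standard error of 2 and a relative standard error of about 0.04, or 4 percent.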

statistical map [6.2]:  Statistical maps are depictions of geographically related data on maps using colors and symbols.  Statistical maps are also known as thematic maps.

storage [4.1]:  Storage refers to warehousing of project documents and/or data in a secure location.

target audience [5.1]:  The target audience is the set of the data users that a particular information product is intended to serve.

target population [2.2, 2.4, 3.2]:  The target population is the set of all people, businesses, objects, or events about which information is required.

time series [5.2, 6.3]:  A time series is a series of values of a variable at successive times.

transparency [5.3, 6.8]:  Transparency means possessing sufficient detail and clarity about data and methods to facilitate reproducibility.  See also reproducibility.

trend [5.2]:  A trend is a long-term change in the mean of a time series.

unit [4.3]:  A unit in a data collection is the entity that provides the lowest level raw data, such as a household in a household survey.

unit nonresponse [4.3, 4.5]:  Unit nonresponse occurs when a report that should have been received is completely missing or is received and cannot be used (e.g., garbled data, missing key variables).

unit response [4.6]:  Unit response occurs when a report is received and contains usable data.  See also unit nonresponse.

variance [5.2, 6.3, 6.4]:  The variance of a sample-based estimator is a measure of the degree to which the estimator would vary about its mean if it were recomputed on successive, identically designed samples of the population.  The variance of a population is a measure of the degree to which individual values vary about the population mean.  Technically, it is the expected value of the square of the difference between a random variable and its expected value.  See also standard error.
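
The technical definition in the last sentence can be written in symbols.  The notation is assumed for illustration: $X$ is the random variable and $\hat{\theta}$ is an estimator.

```latex
\operatorname{Var}(X) = \operatorname{E}\!\left[ \left( X - \operatorname{E}[X] \right)^{2} \right],
\qquad
\operatorname{SE}(\hat{\theta}) = \sqrt{\operatorname{Var}(\hat{\theta})} .
```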

weight [4.3, 4.5, 4.6, 6.4]:  Weights are relative values associated with each sample unit that are intended to correct for unequal probabilities of selection for each unit due to sample design.  Weights most frequently represent the number of units in the population that the sampled unit represents.  Weights may be adjusted for nonresponse.
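
A minimal sketch, assuming the common design-based convention that a unit's base weight is the inverse of its probability of selection.  The function names are hypothetical, and nonresponse adjustment is not shown.

```python
def base_weight(prob_selection):
    """Base weight: the number of population units one sampled unit represents,
    taken here as the inverse of its probability of selection."""
    if not 0 < prob_selection <= 1:
        raise ValueError("probability of selection must be in (0, 1]")
    return 1.0 / prob_selection


def weighted_total(values, weights):
    """Estimate a population total by summing weight times reported value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * y for y, w in zip(values, weights))
```

A unit selected with probability 0.25 receives a base weight of 4, meaning it stands in for four units in the population.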

weighted average [6.2]:  A weighted average is a mean in which the components have non-negative weights that sum to one but are not necessarily equal.
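
The sketch below accepts any non-negative weights and rescales them so that they sum to one, which matches the definition above.  The function name is hypothetical.

```python
def weighted_average(values, weights):
    """Mean of `values` using non-negative weights rescaled to sum to one."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    # Dividing by the total is equivalent to normalizing each weight first.
    return sum(v * w for v, w in zip(values, weights)) / total
```

With values 10 and 20 and weights 1 and 3, the normalized weights are 0.25 and 0.75, and the weighted average is 17.5.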

Approval Date:  October 5, 2005