SUBJECT: MACHINE READABLE PRODUCTS
NCES STANDARD: 7-1
PURPOSE: To ensure the utility of data files created by NCES staff and
contractors, all NCES data files must be accompanied by easily accessible
documentation that clearly describes the metadata necessary for users to
access and manipulate the data.
KEY TERMS: confidentiality, confidentiality edit, edits, imputation,
metadata, reference year, response rates, survey system, survey year,
universe, and variance.
STANDARD 7-1-1: Machine-readable products must be released in ASCII
format. Machine-readable products include flat files, relational databases,
and spreadsheets. Each record must contain a unique case identifier
such as ID. Files with multiple records per case must also contain unique
record type identifiers (e.g., record number, year of data). Data files
must be in one of two acceptable formats:
- Delimited, text quoted file format that is importable, or
- Positional files where the locations of all variables are identified
(i.e., file, record within file, and position within record).
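The two acceptable formats can be illustrated with a minimal Python sketch. The variable names (ID, STATE, ENROLL), the sample values, and the column positions are hypothetical and are not taken from any NCES file; the sketch only shows one way to write the same records as a delimited, text-quoted file and as a positional file.

```python
import csv
import io

# Hypothetical records: a unique case identifier plus two variables.
records = [
    {"ID": "0001", "STATE": "VA", "ENROLL": 1250},
    {"ID": "0002", "STATE": "MD", "ENROLL": 87},
]

# Hypothetical positional layout: (variable, begin, end), 1-based inclusive.
LAYOUT = [("ID", 1, 4), ("STATE", 5, 6), ("ENROLL", 7, 12)]


def to_delimited(rows):
    """Return comma-delimited text in which every text field is quoted."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[name for name, _, _ in LAYOUT],
        quoting=csv.QUOTE_NONNUMERIC,  # quote text, leave numbers bare
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_positional(rows):
    """Return fixed-width text; numbers are right-justified, text left-justified."""
    lines = []
    for row in rows:
        line = ""
        for name, begin, end in LAYOUT:
            width = end - begin + 1
            value = row[name]
            if isinstance(value, int):
                field = str(value).rjust(width)
            else:
                field = str(value).ljust(width)
            if len(field) > width:
                raise ValueError(f"{name} does not fit in {width} columns")
            line += field
        lines.append(line)
    return "\n".join(lines) + "\n"


def parse_positional(text):
    """Read fixed-width text back into dictionaries of stripped strings."""
    return [
        {name: line[begin - 1:end].strip() for name, begin, end in LAYOUT}
        for line in text.splitlines()
    ]


# Both outputs should encode cleanly as ASCII before release.
to_delimited(records).encode("ascii")
to_positional(records).encode("ascii")
```

A positional file can only be read back when the layout is published alongside it, which is why the standard requires the location of every variable to be identified.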
GUIDELINE 7-1-1A: Data producers are invited to provide additional
data sets in alternate formats that may be helpful to users. For guidance
on Web-based formats, see the NCES public Web publishing standards;
request a copy by sending an e-mail to NCESWebmaster@ed.gov.
GUIDELINE 7-1-1B: To facilitate the sharing and use of data
elements, national and international standards organizations have
produced drafts of several standards for the creation of metadata
on data elements. Examples are the International Organization for
Standardization "Specification and Standardization of Data Elements"
standard (ISO/IEC 11179) and the more detailed American National Standards
Institute "Metadata for the Management of Shareable Data" standard
(ANSI X3.285; see www.ansi.org). These standards continue to be refined.
Data producers should determine what metadata standards are current at
the time data files are prepared and produce associated metadata for
their files that comply with applicable standards.
STANDARD 7-1-2: A file description and record layout must be
provided for each file. The file information/metadata header must
include the following:
- Title of the survey (survey name, part, and year as applicable);
- Name(s) of each file;
- Reference year for the data;
- Version number and date of release;
- Logical record length (in positional files) or number of variables
on the file (delimited files);
- Number of records per case or observation;
- Number of cases in the data file; and
- For delimited files, the delimiter used (e.g., comma, space).
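As a rough illustration, the header items listed above could be held in a small structure and checked for completeness before release. This is a sketch under stated assumptions: the class name, field names, and the `missing_items` helper are hypothetical and are not part of the standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FileHeader:
    """Hypothetical container for the file-level items in Standard 7-1-2."""
    survey_title: str
    file_names: List[str]
    reference_year: str
    version: str
    release_date: str
    records_per_case: int
    case_count: int
    logical_record_length: Optional[int] = None  # positional files only
    variable_count: Optional[int] = None         # delimited files only
    delimiter: Optional[str] = None              # delimited files only


def missing_items(header, delimited):
    """Return the names of required header items that are empty."""
    format_specific = {"logical_record_length", "variable_count", "delimiter"}
    missing = [
        f.name
        for f in fields(header)
        if f.name not in format_specific
        and getattr(header, f.name) in (None, "", [])
    ]
    # The remaining requirements depend on the file format.
    if delimited:
        required = ("variable_count", "delimiter")
    else:
        required = ("logical_record_length",)
    missing += [name for name in required if getattr(header, name) is None]
    return missing
```

For example, a header for a delimited file that records the number of variables but not the delimiter would be reported as missing `delimiter`.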
STANDARD 7-1-3: For each variable on the file, the file description
must include the following:
- Variable name;
- Data type (alpha or numeric);
- Record number (if multiple records per case);
- Position within the record (beginning-end, or variable number if
delimited), field length, and variable label; and
- The survey question wording and response categories.
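One way to keep the variable-level items consistent is to derive field length from the beginning and end positions and to check the layout for overlaps. The sketch below assumes a positional file; the `VariableEntry` class, its field names, and the `layout_problems` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VariableEntry:
    """Hypothetical record-layout entry for one variable (Standard 7-1-3)."""
    name: str
    data_type: str          # "alpha" or "numeric"
    begin: int              # 1-based beginning position
    end: int                # 1-based ending position, inclusive
    label: str
    question: str = ""      # survey question wording and response categories
    record_number: int = 1  # relevant when there are multiple records per case

    @property
    def length(self):
        """Field length implied by the beginning and end positions."""
        return self.end - self.begin + 1


def layout_problems(entries, logical_record_length):
    """Return descriptions of overlapping or out-of-range fields."""
    problems = []
    ordered = sorted(entries, key=lambda e: (e.record_number, e.begin))
    for prev, cur in zip(ordered, ordered[1:]):
        same_record = prev.record_number == cur.record_number
        if same_record and cur.begin <= prev.end:
            problems.append(f"{prev.name} and {cur.name} overlap")
    for e in ordered:
        if e.data_type not in ("alpha", "numeric"):
            problems.append(f"{e.name} has unknown data type")
        if e.begin < 1 or e.end < e.begin or e.end > logical_record_length:
            problems.append(f"{e.name} has invalid positions")
    return problems
```

Deriving the length from the positions, rather than storing it separately, avoids a layout in which the two disagree.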
STANDARD 7-1-4: Data set naming conventions must be standardized
and must conform to Information Systems Security Organization (ISSO)
(or more recent) standards for pressing a CD, which currently require
a name with the following format: "xxxxxxxx.xxx".
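A file name can be checked against this pattern mechanically. The sketch below assumes that "xxxxxxxx.xxx" means a base name of up to eight characters and an extension of up to three, drawn from letters, digits, and underscores; that reading, and the function name `is_8dot3`, are assumptions rather than part of the standard.

```python
import re

# Assumed reading of "xxxxxxxx.xxx": 1-8 base characters, a period, and a
# 1-3 character extension. The permitted character set is an assumption.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,8}\.[A-Za-z0-9]{1,3}")


def is_8dot3(file_name):
    """Return True if file_name fits the assumed 8.3 naming pattern."""
    return _NAME_PATTERN.fullmatch(file_name) is not None
```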
STANDARD 7-1-5: Jewel box covers and Web links or URL links must
identify the survey system (e.g., HS&B, CCD), component, survey year,
and version number.
STANDARD 7-1-6: All variables must be clearly identified and described.
- The description of variables must include the universe for the
variable.
- In the case of composite variables, the description must identify
all survey items used to construct the variables and must include
the algorithm used to construct the variables.
- Upper and lower case labels that clearly describe the variables
must be used.
- For all categorical variables, each value must be associated with
a frequency, a percentage of total cases, and a label. In public-use
and restricted-use file documentation, unweighted frequencies must be
included (see Standard 4-2-10 for public-use files without
confidentiality edits).
- For all continuous variables, the distribution of values (e.g.,
minimum, maximum, mean, and standard deviation) must be provided.
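The codebook statistics described above can be produced with the Python standard library. This is a minimal sketch: the function names and the "UNLABELED" placeholder are hypothetical, the frequencies are unweighted, and reserve codes are assumed to have been removed from continuous variables before summarizing.

```python
import statistics
from collections import Counter


def categorical_summary(values, labels):
    """Return value, label, unweighted frequency, and percent per category."""
    if not values:
        raise ValueError("no cases to summarize")
    counts = Counter(values)
    total = len(values)
    return [
        {
            "value": value,
            "label": labels.get(value, "UNLABELED"),
            "frequency": counts[value],
            "percent": round(100 * counts[value] / total, 1),
        }
        for value in sorted(counts)
    ]


def continuous_summary(values):
    """Return minimum, maximum, mean, and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # requires at least two values
    }
```

A category that appears in the data without a label surfaces as "UNLABELED", which flags a gap in the documentation rather than hiding it.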
GUIDELINE 7-1-6A: FIPS Standards should be used where applicable. NCES
standard definitions and codes should be used where applicable (see
Standard 1-4).
GUIDELINE 7-1-6B: Variable names should be consistent across
surveys within a survey system, within and across years.
GUIDELINE 7-1-6C: In a printable record layout file, line length should
be specified so that it prints correctly without wrapping and without
special modification (e.g., 72 characters, 12 point type).
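A simple check can confirm that a record layout file stays within the chosen line length. The sketch below uses the 72-character figure from the example above as its default; the function name is hypothetical.

```python
def overlong_lines(text, max_width=72):
    """Return (line number, length) for each line longer than max_width."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > max_width
    ]
```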
STANDARD 7-1-7: Data file documentation must be complete for
all data files. This includes an abstract or summary that cites the
methodology report or technical notes associated with the survey
and a description of survey methodology that is consistent with the
NCES standard for survey system documentation (see Standard 3-4). In
general, survey methodology documentation for data files must include
the following:
- Description of data collection methods;
- Weighting and imputation procedures;
- Description of editing, error resolution, and imputation flags;
- Guidelines for processing the data;
- The reference year for the data;
- Unweighted frequency counts and response rates;
- Information on how to use replicate weights or PSUs and strata
for variance estimation; and
- Procedures for using weights to produce estimates.
STANDARD 7-1-8: The following data element conventions must be
used:
- Numeric fields must contain only numbers or blanks. Reserve codes
for numeric fields should be extreme negative values (e.g., lower
than the lowest real value).
- "0" must represent zeros. Blanks or "-" may
not be used to represent 0s.
- Unique values must be used to distinguish between legitimate skips
and nonresponse.
- Suppression symbols must be removed from numeric fields and stored
in associated "flag" fields.
- Separate record locations must be used for all data items.
- Imputed data must be flagged in associated "flag" fields. Imputation
methods must be identified in the flag. Blanks are not legitimate
values for flags.
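Several of these conventions can be checked field by field. The sketch below tests that a numeric field holds only a number or blanks and that its associated flag holds a known, non-blank code. The flag codes shown are hypothetical; actual codes and imputation methods would be defined in each file's documentation.

```python
import re

# A numeric field may hold an optionally negative number or be entirely blank.
_NUMERIC_FIELD = re.compile(r" *(-?\d+(\.\d+)?)? *")

# Hypothetical flag codes; a blank is deliberately not among them.
KNOWN_FLAGS = {
    "0": "reported, not imputed",
    "1": "imputed, hot deck",
    "2": "imputed, regression",
    "9": "suppressed",
}


def check_pair(value_field, flag_field):
    """Return convention violations for one numeric field and its flag."""
    problems = []
    if _NUMERIC_FIELD.fullmatch(value_field) is None:
        problems.append("numeric field contains non-numeric characters")
    if flag_field.strip() == "":
        problems.append("flag field is blank")
    elif flag_field not in KNOWN_FLAGS:
        problems.append("flag field holds an undocumented code")
    return problems
```

A suppression symbol such as "*" or a "-" standing in for zero fails the numeric test, which mirrors the requirement that such symbols live in the flag field instead.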
GUIDELINE 7-1-8A: When practical, numeric data fields containing continuous
variables should be identical in length.