Frequently Asked Questions (FAQ)
These Frequently Asked Questions (FAQs) and answers
cover the most common questions encountered when working
with Continuous NHANES (1999 and on), NHANES III, NHANES II, and
NHANES I data. The FAQs are arranged by tutorial module topic.
Survey Orientation
Survey Overview
Question 1. I noticed your survey acronyms changed from NHES to NHANES. Are
there any differences between these surveys?
Answer: The first three National Health Examination Surveys (NHES) were
conducted between 1960 and 1970, each with its own specific age
groupings as a target population. These three surveys were known as NHES I, II, and III.
Between 1971 and 1975, a large nutrition component was added to the
fourth series and all subsequent surveys. The name of the survey was
thus changed to the National Health and Nutrition Examination Survey
(NHANES). Two subsequent periodic surveys were
conducted in 1976-1980 and 1988-1994. These three surveys are
known as NHANES I, II, and III. Since 1999,
NHANES has been conducted annually.
Question 2. Have the data contents remained constant across surveys?
Answer: No. Survey contents have changed from NHES to NHANES. Compared with
the NHES series, the NHANES series not only added a nutrition
component, but also incorporated additional examination components (in
certain years), as well as an environmental health component and a
dental component.
NHANES added an in-person interview at the participant's home, and
additional interviewer-administered survey questionnaires, including a
dietary questionnaire and questionnaires on selected special topics,
during the MEC interview.
In NHANES III and continuous NHANES, selected groups of participants
are also asked to complete questionnaires on other sensitive topics,
such as illicit drug use or sexual behaviors using a self-administered
interview (Audio-CASI).
For the dietary component, participants who completed the MEC
dietary questionnaire participated in two follow-up
activities: a telephone interview with survey staff using
Computer-Assisted Telephone Interview (CATI) and a food
frequency questionnaire.
Please also be mindful that certain examination components have been
added or removed in different NHANES surveys.
Question 3. Why do you call NHANES conducted
after 1999 "continuous NHANES"?
Answer: NHANES I, II, and III were periodic
surveys, each conducted over a fixed span of years with an interval
before the next wave. Since 1999, NHANES has been conducted annually without
interruption.
The continuous NHANES program provides increased flexibility in
changing survey contents to meet emerging needs. It also allows for
increased timeliness in releasing two-year datasets and estimates on
topics of public health interest.
Navigate the NHANES Website
Question 1. Can I access data from current and historic
surveys conducted by your agency from the website?
Answer: Yes. All publicly
available data and related documentation are released and updated in a
centralized location: the NHANES website.
The website contains the public-use data
files from the initial National Health Examination Survey (NHES) dataset
up to the most current continuous NHANES dataset. Codebooks
and documentation accompany each dataset released.
Previously, some datasets were released on CD
or DVD. However, the most up-to-date versions of these datasets are
available on the website.
Question 2. Why are certain variables or data
files not publicly released on the website?
Answer: Some variables or entire data files are
not publicly released due to disclosure concerns, for example,
geographic identifiers. These files are only available through the
Research Data Center (RDC). You may review the
Data Release and Access Policy for more information.
You can find more information on this
topic from the link on the main section of the homepage:
Access to Public and Non-Public NHANES Data Sets
Question 3. How do you decide the contents for
each NHANES survey?
Answer: The NHANES program solicits survey
proposals from various federal agencies such as NIH, USDA, EPA, and many
other CDC centers. These proposals are reviewed by expert panels, and
content is determined after a rigorous evaluation process including
consideration of criteria such as public health importance, feasibility
of the proposed survey items and burden to survey participants.
If you would like to find more information on this topic, you can click
on the link on the main section of the homepage:
Proposal Guidelines for New Survey
Content.
Question 4. I noticed there is a lot of
information in the data documentation. Do I have to read these
documents?
Answer: The data documentation contains important
and relevant information for your analysis. It is therefore very
important that you consult the documentation to help you
determine the scope of your analysis and which variables to include.
These documents are released with each
survey dataset, and contain information pertaining to:
- Survey Contents, which shows the years
components were collected and when changes to the components
occurred,
- Sample Person Questionnaire protocols, and
- MEC Component Descriptions, which contain the
examination and laboratory protocol for obtaining these measures in
the survey.
Data Structure & Contents
Question 1.
Are NHANES data structured the same way throughout the years?
Answer: No. The data structure and content of the
NHANES surveys have changed over the years. This is important to
understand when searching across different NHANES surveys. When you
use the tutorial, please consult the NHANES Data Structure and Contents
module to understand how the continuous NHANES (1999-present) and
historic NHANES (I, II, and III) are laid out differently.
Question 2. What are the main differences in data structure between the
continuous NHANES and NHANES III?
Answer: The continuous NHANES (1999-present) are
conducted annually and released in two-year cycles, so the continuous
NHANES is organized by survey cycles, namely, NHANES 1999-2000,
2001-2002, 2003-2004, etc.
Each cycle is divided into four sections
labeled by data collection method: Demographics, Examination,
Laboratory, and Questionnaire. Within each section are many components
– groups of related variables packaged in a data file according to
topics.
NHANES III was conducted
in two phases, from 1988-91 and from 1992-94. However, the NHANES III
data releases were not structured according to survey phases. Instead,
they were organized according to the time of the data component release.
Two specific releases contain the majority of data of interest to most
researchers – these are Series 11, No. 1a and Series
11, No. 2a.
Series 11, No. 1a contains the majority of the data and corresponding documentation for
the survey interview and examination components. It is further
divided into five separate data files:
- NHANES III Household Adult Data File
- NHANES III Household Youth Data File
- NHANES III Examination Data File
- NHANES III Laboratory Data File
- NHANES III Dietary Recall Data Files
Question 3. I cannot find my
variables in NHANES III Series 11, No. 1a or No. 2a. Where else should I search?
Answer: There are
a number of other data releases (e.g. series No. 3a through No. 25a)
which contain additional data based on a subsample of the survey, very
specific topic areas which were delayed in their original release, or
data files based on surplus sera projects. Also, new data releases on
NHANES III continue to occur, so please check the NHANES website for
these new releases.
Preparing an Analytic Dataset
Locate Variables
Question 1. How are continuous NHANES data
(1999-present) organized in the publicly accessible website?
Answer:
- The current NHANES data are first grouped
into survey cycles (i.e. 2005-2006, 2003-2004, 2001-2002,
1999-2000);
- Within each cycle, the data files are
organized into four major components: Demographic, Examination,
Laboratory, and Questionnaire files;
- Survey variables are stored in these
different components, and the component's variable list
contains the list of all the publicly released variables and
their file locations.
Question 2. Why are there so many
data files?
Answer:
The data files have been separated to reduce
the amount of time to download data and documentation from the Internet,
along with the greater ease in producing, editing, and validating data
files. This does require that you merge files together for analysis.
Please refer to the following SAS code example to learn how to merge
files together:
NHANES data merge code example
Question 3. How do I know which component contains the
variables of interest to me?
Answer: Generally speaking, the
continuous NHANES data files are organized by their collection method. For instance, if the information is collected via household
interview or MEC interview, the items are most likely stored in the
Questionnaire component. If the variables of interest are lab items,
they are in the Laboratory component. In summary:
- Demographics files: survey design (e.g.
weights, design strata) and demographic variables
- Examination files: information
collected through physical exams, dental exams, and dietary
interview components (Note: not every survey participant agreed to
a physical examination)
- Laboratory files: results from
specimens such as blood, urine, hair, air, tuberculosis skin test,
and household dust and water specimens
- Questionnaire files: data collected
through household interview and mobile examination center (MEC)
interview
Question 4. Once I know which cycle and component to
search for my variables, what is the fastest way to find them?
Answer: The continuous NHANES component's
variable list contains the list of all the publicly released variables
and their file locations. You can use the Adobe search function to speed
up the process of finding your variables of interest. The variable
lists include the following information:
- filename the variable is found in,
- name of the component,
- variable name, and
- variable's English label (a short
description of the variable).
Question 5. My search
resulted in a long list of variables. Which one is appropriate for my
analysis?
Answer: Not every result returned will be
relevant to your analysis. To decide which variables are appropriate,
you need to review the data file documentation. The search results give you
variables with similar names, but some could be auxiliary variables, or
variables used for excluding sample persons from certain measures, rather
than the actual measurements you are interested in. Therefore, you have to READ THE
DOCUMENTATION!
To find the relevant documentation, you
should first write down the file names that the variables are
stored in, and then use this information to identify the data file
and documentation to download. The data file documentation will guide
you to include the appropriate variables for your analysis.
Question 6. What kinds of NHANES documents are
available, and how are they best used?
Answer: There are three types of data
documentation for each data file that you need to consult, and each
provides valuable background information on your variables. The three
documentation types are the codebook, the data file documentation, and
the frequency file.
Before analyzing the data, you will need
to know how each variable is coded; how the data were collected, edited,
and processed; and the frequency (or sample size) of the
variable. The codebook lists all the variables in the data file. Use it
to determine what the values associated with a variable mean. Use the
data file documentation to determine if the collection or measurement is
appropriate for your analysis. The frequency files for each data file
contain the frequency count for each item in the data file and can be
used to verify the sample size for a particular data item.
Question 7. On the NHANES 2003-2004 data page
I see links for data. How do I access the data from these links?
Answer:
The Docs and Procedures files are in Adobe PDF format, so
you should be able to view these directly in your browser, if configured
with Adobe Acrobat. A PDF file can be saved from this view using the
"File/Save As..." menu and specifying a location on your local computer
or network to store the file. Or you can right-click the file name
directly on the webpage and select "Save Target As..." from the popup
box, then specify a location to save the file on your computer.
Clicking on the Data link will open a dialog box from which you
can specify a location to store the file (using the "Save" button) or
open it directly with SAS (using the "Open" button).
Question 8. Next to the name of each questionnaire section, laboratory
component, or exam component on the NHANES 2003-2004 data page
there are links that appear as follows: [Data, Docs, Procedures].
What are these links for?
Answer: In previous years (1999-2002), the codebook, documentation, file frequencies, and SAS
transport dataset were each directly accessible through a
separate link, in brackets, next to the data section name.
Starting with data years 2003-2004, the documentation, codebook, and
frequencies have been combined in a single Adobe PDF file, accessible
from the single Docs link. A new, direct link to Procedures
is also provided for each data section on the data page.
For exam sections this will link to the
examination procedures manual for that section, and for questionnaire
sections this will link to the questionnaire instrument. Procedure
manuals for laboratory sections are also available. Because there may be
multiple documents for a given lab, these remain on a separate
page, accessible by clicking the Laboratory Procedures Manuals
link at the top of the list of laboratory data sections.
Question 9. I know NHANES collected information
on certain topics, but I couldn't find them on the variable lists. What
happened to those data collected?
Answer: The component variable lists include only
variables that are publicly released.
These data files are available for download from the NHANES website. If
you wish to use variables that are NOT listed in a component
variable list, you will need to use the Research Data Center. You can
review the NCHS Research Data Center for more information about how to obtain
access to non-publicly released variables.
Question 10. Why isn't the adolescent data on
alcohol use, smoking, sexual behavior, reproductive health and drug use
available as a public release file?
Answer:
These files have not been released on the
NHANES website due to confidentiality concerns. Adolescent data files
containing this sensitive information will be made available at the
NCHS Research Data Center.
Download Data Files
Question 1. Where can I access NHANES data files?
Answer: Both current and historic NHANES data are
made available for download on the NHANES website. The only exceptions
are files that have not been publicly released because of disclosure risk
or that are still being processed.
Question 2. What format are the data files in? Can they be used with SAS,
SPSS, or Stata?
Answer: The Continuous NHANES files
are in SAS transport file format (.xpt).
They can be used with any package that supports this file format. For
statistical/analytical packages that do not support SAS transport file
format, you need to convert the file to a different format using an
appropriate software package. Users desiring alternate data formats can use the
SAS System Viewer (a free download from SAS Institute) to convert the
transport file into space-, tab-, or
comma-delimited text files for use in additional software programs, such
as Microsoft Excel. All prior NHANES data files
are in ASCII format.
Question 3. I have downloaded the files, but I cannot run any of them with my
statistical program. What happened?
Answer: This might be due to the fact that you
have NOT extracted the files yet. These transport files are NOT usable
until you first extract them using the XPORT engine. Then you will need
to use proc copy and save them as permanent SAS datasets.
Question 4. What operating system do I need to extract these files?
Answer: NHANES data files can be extracted on
Windows, UNIX, or Macintosh based systems.
Question 5. Why do I need
to go through the trouble of extracting and saving these files? Can't I
just double click on the files and let the SAS program extract and save
them automatically?
Answer: It is true that some users might be able
to double click on the downloaded files and have their SAS
software extract and save them as temporary WORK files automatically. However,
depending on the operating system or version of SAS, some users
may not be able to open the files by double clicking. That is why we
provide instructions for extracting and saving these files.
Even for those users who can double click
and create temporary files, we still recommend that you save them as
permanent SAS datasets, because as soon as you exit the SAS program, the
temp (WORK) files will no longer exist.
Append & Merge Datasets
Question 1. I noticed that continuous NHANES data files
are released in 2-year cycles. What do I do if I need to combine
different years together?
Answer: When you want to
combine multiple years, you need to append the data files from different
survey cycles on the same variables.
Question 2. Why do I have to check the contents of the
data files before appending the data? What do I do if I find
variables named or labeled differently?
Answer: You should always check
the variable lists first because variable names may be different from
cycle to cycle, or recoded or derived variables may be added in
different cycles.
- If the names or labels of the variables of interest are identical in the selected cycles, you can append the data files
directly.
- If the names or labels of the variables of interest have
changed, you will have to find out whether the wording,
definition, and/or response categories have been modified, and then recode the variables to make their names and response categories
consistent before appending.
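The check-then-append workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up, and `BPXSY` and `BPXSYS` are hypothetical names standing in for a variable whose name changed between cycles. Renaming is appropriate only after the documentation confirms that the wording and response categories are comparable.

```python
# Hypothetical records from two survey cycles; names and values are illustrative.
cycle_a = [{"SEQN": 1, "BPXSY": 118}, {"SEQN": 2, "BPXSY": 131}]
cycle_b = [{"SEQN": 10001, "BPXSYS": 124}]  # same measure, different name


def harmonize(records, rename):
    """Return copies of the records with variables renamed per `rename`."""
    return [{rename.get(k, k): v for k, v in rec.items()} for rec in records]


def append_cycles(*cycles):
    """Stack cycles after checking that every record has the same variables."""
    combined = [rec for cycle in cycles for rec in cycle]
    if len({frozenset(rec) for rec in combined}) > 1:
        raise ValueError("variable names differ across cycles; recode first")
    return combined


# Make the names consistent, then append.
combined = append_cycles(cycle_a, harmonize(cycle_b, {"BPXSYS": "BPXSY"}))
```

Appending `cycle_a` and `cycle_b` without the renaming step raises an error in this sketch, which mirrors the advice to check the variable lists first.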
Question 3. The variables I'm interested in come from interview, examination
and laboratory components. How do I combine them together?
Answer: If you want to join variables from
different components, you need to merge these data files.
The first step in merging data is to
sort each of the data files by a unique identifier. Then you merge
the data by that unique identifier.
Question 4. What is the
unique identifier in NHANES data that we need to append or merge data
by?
Answer: In NHANES, each sampled person is
identified by a unique sequence number, and the variable name for it is
SEQN. Every time you extract variables from an NHANES data file, you
should always include the SEQN variable in your selection. The SEQN will
later be used for sorting the data, and for appending and merging the
data files.
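As a rough illustration of merging by SEQN, the following Python sketch combines records from three components. The variable names other than `SEQN` and all values are hypothetical, and the sketch keeps every sample person found in any file, which is one possible choice and not necessarily the one your analysis needs.

```python
# Hypothetical extracts from three components; values are made up.
demographics = [{"SEQN": 2, "AGE": 41}, {"SEQN": 1, "AGE": 67}]
examination = [{"SEQN": 1, "WEIGHT_KG": 80.5}]
laboratory = [{"SEQN": 2, "GLUCOSE": 5.4}, {"SEQN": 1, "GLUCOSE": 6.1}]


def merge_by_seqn(*files):
    """Combine records that share a SEQN; keep every SEQN seen in any file."""
    merged = {}
    for records in files:
        for rec in records:
            merged.setdefault(rec["SEQN"], {}).update(rec)
    # One record per sample person, sorted by the unique identifier.
    return [merged[seqn] for seqn in sorted(merged)]


analytic = merge_by_seqn(demographics, examination, laboratory)
```

In this example the person with SEQN 2 has no examination record, so the merged record simply lacks `WEIGHT_KG`; you would treat that value as missing.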
Clean & Recode Data
Question 1. What percent of missing data is usually
acceptable for NHANES data analysis?
Answer: As a general rule, if
10% or less of your data for a variable are missing from your
analytic dataset, it is usually acceptable to continue your analysis
without further evaluation or adjustment. However, if more than 10% of
the data for a variable are missing, you may need to determine whether
the missing values are distributed equally across socio-demographic
characteristics, and decide whether further imputation of missing
values or use of adjusted weights is necessary.
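The 10% rule of thumb can be checked with a short calculation. The sketch below is a minimal example with made-up records and a hypothetical `BMI` variable; it assumes missing values have already been recoded to `None` (or are absent from the record).

```python
def percent_missing(records, variable):
    """Percentage of records where `variable` is absent or None."""
    if not records:
        raise ValueError("no records supplied")
    missing = sum(1 for rec in records if rec.get(variable) is None)
    return 100.0 * missing / len(records)


def needs_evaluation(records, variable, threshold=10.0):
    """True when more than `threshold` percent of the values are missing."""
    return percent_missing(records, variable) > threshold


# Ten hypothetical sample persons, two of them without a usable BMI value.
sample = [{"SEQN": i, "BMI": 25.0} for i in range(1, 9)]
sample += [{"SEQN": 9, "BMI": None}, {"SEQN": 10}]
flag = needs_evaluation(sample, "BMI")
```

A flagged variable is a prompt to look at how the missing values are distributed across socio-demographic groups, not an automatic reason to drop it.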
Question 2. How are missing values, "blank but
applicable", "don't know" and other values coded?
Answer: There are codes for refused (7-fill: that is 7, or 77, or
777…, depending on the number of digits required for a particular data
value), don't know (9-fill), and missing values (a blank field), which
means the person was not asked the question or given the test. There is
no longer a specific code for those cases where the variable response is
"blank but applicable"; for such cases the values are designated as
missing values. For laboratory data there are special considerations.
When a laboratory value was less than the lower limit of detection (LOD),
a "fill" value based on the LOD was used instead of the sample value, as
the sample value was deemed "not detectable." An indicator variable
(taking the value 0 or 1) is used to identify which values are real and
which values are fill values.
Question 3. Why do I have to
check the missing data?
Answer: If you fail to identify "refusal" or
"don't know" as types of missing data, and treat the assigned values for
"refused" or "don't know" as real values, you will get distorted results
in your statistical analyses. Therefore, it is important to recode
"refused" or "don't know" responses as missing values (either as a
period (.) for numeric variables or as a blank for character variables).
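A minimal Python sketch of this recoding is shown below. It assumes you have looked up the field width of the variable in the codebook and confirmed that the 7-fill and 9-fill codes really mean "refused" and "don't know" for that variable; the function name and the use of `None` for missing are choices made for this example.

```python
def recode_fill(value, width):
    """Return None for the 7-fill (refused) or 9-fill (don't know) code.

    `width` is the number of digits the codebook gives for the variable,
    so width=2 treats 77 and 99 as missing but leaves 7 and 9 alone.
    """
    refused = int("7" * width)
    dont_know = int("9" * width)
    return None if value in (refused, dont_know) else value


# A two-digit variable: 77 and 99 become missing, 12 and 7 are kept.
cleaned = [recode_fill(v, 2) for v in [12, 77, 99, 7]]
# cleaned is [12, None, None, 7]
```

Passing the width explicitly avoids accidentally wiping out legitimate values such as a real answer of 7 in a two-digit field.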
Question 4. How do I determine the skip patterns
for a questionnaire section?
Answer: The first step is to review all of the
documentation for the questionnaires. To review skip patterns look at
the complete questionnaire
specifications. Please note that not all questionnaire
items are released due to small sample sizes and
confidentiality/sensitivity issues, but all skip pattern integrity was
maintained and validated.
The
significance of a skip pattern depends on the question leading to
the skip pattern, the questions within that skip pattern, and the
variables you intend to analyze. If you fail to check for skip
patterns, your analysis may cover only a portion of the
study population instead of the entire study
population. Check the codebook to determine if a skip pattern affects
the variables in your analysis.
Question 5. How do I check for outliers, and
what do I do with influential outliers?
Answer: For continuous variables, you identify
outliers by using a univariate analysis to check for normality. If the
distribution is highly skewed, you can do a data transformation
to make the distribution of the data closer to normal.
After checking the distribution and
normality of the data, plot the survey weight against the variable to
determine which of the extreme values identified in the univariate
analysis are outliers. You must also determine whether the outliers represent
valid values and, if so, whether they also carry extremely large survey
weights.
Outliers with extremely large weights
could have an influential impact on your estimates. You will have to
decide whether to keep these influential outliers in your analysis or
not. That decision is up to the analyst.
Please consult the
Analytical Guidelines for more information on this topic.
Format & Label Data
Question 1. Do I have to format and label all variables?
Answer: No. Formatting and labeling variables in
SAS is optional and does not need to be done for all variables in
the dataset. However, it is especially useful for frequently used
variables and for clarity in your output.
Question 2. Are there rules on how to format
and label variables?
Answer: Formats and labels are user-defined tools
that provide a convenient way to define variables in your SAS or
SUDAAN output. Formatting is used to assign descriptive text
names to numeric and character values of a variable. Labeling,
on the other hand, allows you to assign descriptive titles to
variable names.
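The distinction between formats (text for values) and labels (text for variable names) can be illustrated outside SAS as well. The Python sketch below uses two lookup tables; the coding shown for `RIAGENDR` is given as an example only, so check the codebook for the cycle you are using before relying on it.

```python
# Illustrative value formats (code -> descriptive text) and variable labels.
VALUE_FORMATS = {"RIAGENDR": {1: "Male", 2: "Female"}}
VARIABLE_LABELS = {"RIAGENDR": "Gender"}


def display(variable, value):
    """Render a value using its label and format, if any are defined."""
    label = VARIABLE_LABELS.get(variable, variable)
    text = VALUE_FORMATS.get(variable, {}).get(value, value)
    return f"{label}: {text}"


line = display("RIAGENDR", 2)
```

Variables without a defined format or label fall back to their raw name and value, which matches the point that formatting and labeling are optional.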
Survey Design Factors
Sample Design
Question 1. What do you mean by the phrase "NHANES is a complex survey"?
Answer: We frequently refer to NHANES as a complex
survey because the data are not obtained using a simple random sample.
Rather, a complex, multistage, probability sampling design is used to
select participants representative of the civilian,
non-institutionalized US population.
Question 2. How do you draw an NHANES sample?
Answer: The NHANES study draws
its sample in four stages described below:
- Stage 1: Primary sampling units (PSUs) are
selected with probability proportional to a
measure of size (PPS). These are mostly single counties or, in a few cases,
groups of contiguous counties.
- Stage 2: The PSUs are divided up into segments
(generally city blocks or their equivalent). As with each PSU,
sample segments are selected with PPS.
- Stage 3: Households within each segment are
listed, and a sample is randomly drawn. In geographic areas where
the proportion of age, ethnic, or income groups selected for
oversampling is high, the probability of selection for those groups
is greater than in other areas.
- Stage 4: Individuals are chosen to participate
in NHANES from a list of all persons residing in selected
households. Individuals are drawn at random within designated
age-sex-race/ethnicity screening subdomains. On average, 1.6 persons
are selected per household.
Question 3. What is a Sample Weight?
Answer: A sample weight is assigned to each sample
person. It is a measure of the number of people in the population
represented by that sample person in NHANES, reflecting the unequal
probability of selection, nonresponse adjustment, and adjustment to
independent population controls. When unequal selection probability is
applied, as in the NHANES sample, the sample weights are used to produce
an unbiased national estimate. More information about sample weights and
how they are created can be found in the Weighting module.
Question 4. Do I have to use
sample weights and other survey design variables?
Answer: Yes. For NHANES datasets, the use of
sampling weights and sample design variables is recommended for all
analyses because the sample design is a clustered design and
incorporates differential probabilities of selection. Accounting for the
complex sampling design of NHANES is especially critical when
calculating statistical estimates and estimating standard errors of
means, geometric means, percentages and other statistics.
If you fail to account for the sampling
parameters, you may obtain biased estimates and overstate significance
levels.
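To see why the weights matter for point estimates, the following Python sketch compares an unweighted mean with a weighted mean. The values and weights are made up. The sketch covers only the point estimate: correct standard errors also require the strata and PSU variables and survey software, which are not shown here.

```python
def unweighted_mean(values):
    """Simple average that ignores the sample design."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Average in which each sample person counts in proportion to the weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two hypothetical sample persons; the second represents three times as many people.
values = [10.0, 20.0]
weights = [1000.0, 3000.0]
naive = unweighted_mean(values)
estimate = weighted_mean(values, weights)
```

Here the unweighted mean is 15.0 while the weighted mean is 17.5, because the second person represents more of the population.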
Question 5. What are Masked
Variance Units (MVUs) and why do we need them in analyses?
Answer: Primary Sampling Units (PSUs) are selected
from strata defined by geography and proportions of minority
populations. Most strata contain two PSUs. Together, these strata and
the PSUs represent the variance units (sampling units used to estimate
sampling error).
To protect the confidentiality of data
obtained from sample persons, Masked Variance Units (MVUs) are
constructed. MVUs are equivalent to Pseudo-PSUs used to estimate
sampling errors in past NHANES. The MVUs on the data file are not the
"true" design PSUs. They are a collection of secondary sampling units
aggregated into groups for the purpose of variance estimation. They
produce variance estimates that closely approximate the variances that
would have been estimated using the "true" design variables.
These MVUs have been created for each
two-year cycle of NHANES and have been created in a way that allows them
to be used for any combination of data cycles. These MVUs are used to
define the strata and PSU variables on the public release files. The
variable name for the stratum is sdmvstra and the variable name
for the PSU is sdmvpsu.
Question 6. Why does NHANES oversample some groups but not others? Do you
oversample different groups over the years?
Answer: NHANES typically samples larger numbers of
certain subgroups who are of particular public health interest.
Oversampling is done to increase the reliability and precision of
estimates of health status indicators for these population subgroups.
Other subgroups in the population may not be oversampled because
doing so is cost prohibitive and/or
operationally not feasible.
Which subgroups get oversampled does
change from cycle to cycle. Therefore, it is critical to carefully
review the documentation for each survey cycle to determine which
subgroups were oversampled.
Specifying Weighting Parameters
Question 1. How are NHANES weights constructed?
Answer: In general a sample person is assigned a
base weight that is equivalent to the reciprocal of his/her probability
of selection. However, calculating the base weight in NHANES is much
more complicated due to the survey's complex, multistage design. In
summary, NHANES weights are constructed:
- based on the final probability of selection through 4
sampling stages;
- adjusted for nonresponse to the in-home interview when
creating interview weights, and further adjusted for non-response to the
MEC exam when creating exam weights; and
- post-stratified to match the population control totals for
each sampling subdomain.
Question 2. How do NHANES weights account for different response rates to the
in-home interview and MEC exam?
Answer: In NHANES, an individual can be classified
as a non-respondent to the interview portion of the survey and/or the
exam portion. An individual is considered a non-respondent to the
interview if he/she was selected to be in the sample, but did not
participate in the in-home interview. Similarly, an individual who
agreed to complete the interview but did not agree to, or come in for,
the MEC portion of the survey is considered a non-respondent to the
exam.
Adjustments made for survey non-response
account only for sample person interview or exam non-response, but
not for component/item non-response (e.g., a sample person declined
to have their blood pressure measured in the examination component but
completed all other examination components).
To produce estimates appropriately
adjusted for survey non-response it is important to check all of the
variables in your analysis and select the weight of the smallest
analysis subpopulation. All interview and MEC exam weights can be found
on the demographic file for the respective survey. Weights for a given
component conducted on only a subsample of the original NHANES sample
are available on the data file for that particular component.
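The rule of selecting the weight of the smallest analysis subpopulation can be sketched as a simple lookup. This Python example assumes a simplified three-level ordering (interview, MEC exam, subsample) and hypothetical analysis variable names; it does not cover analyses that draw on more than one subsample, for which you should consult the documentation.

```python
# Ordered from the largest analysis subpopulation to the smallest.
LEVELS = ["interview", "mec", "subsample"]


def weight_level(variable_sources):
    """Return the most restrictive collection level among the variables."""
    return max(variable_sources.values(), key=LEVELS.index)


# Hypothetical analysis: one interview item, one exam item, one subsample lab item.
sources = {"age": "interview", "bmi": "mec", "fasting_glucose": "subsample"}
level = weight_level(sources)
```

With these inputs the result is `"subsample"`, so the subsample weight from that component's data file would be used in place of the interview or MEC weight.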
Question 3. Will data and weights be available on
public use files for single years such as 1999, 2000, 2001, or 2002?
Answer: No. Even though each single year in NHANES
comprises a nationally representative sample of the U.S. population,
two years of data are necessary for sufficient sample sizes. Hence
the data are released in two-year cycles, and no single-year weights will
be available to the public.
Sometimes, even two years of data do not
guarantee sufficient sample size to produce statistically reliable
estimates. This is especially true when you are dealing with rare
events or demographic subdomains (e.g., sex-age-race/ethnicity groups).
Therefore, combining two or more 2-year
cycles of the continuous NHANES is strongly recommended, whereas
analyzing a single year of NHANES data is discouraged.
However, since some components were
collected across three years during 1999-2002, single year datasets for
1999-2002 are available in the
Research Data Center (RDC).
Question 4. I was told I have to use
the 4-year weights provided on public use files for 1999-2002. Why can't
I combine weights together myself?
Answer: Sample weights for NHANES 1999-2000 were
based on population estimates developed by the Bureau of the Census
before the Year 2000 Decennial Census counts became available. The
2-year sample weights for NHANES 2001-2002, and all other subsequent
2-year cycles, are based on population estimates that incorporate the
year 2000 Census counts.
Because different population bases were
used, the 2-year weights for 1999-2000 and 2001-2002 are not directly
comparable. Therefore, when combining 1999-2000 with 2001-2002
survey years in analyses, you must use the 4-year sample weights
provided by NCHS since these have been created to account for the two
different reference populations.
For both 1999-2000 and 2001-2002 survey
cycles, the demographic file contains the weight variables for your use:
- wtint2yr and wtint4yr for all
interviewed sample persons,
- wtmec2yr and wtmec4yr for the
sample persons who have MEC data items, and
- two-year and four-year (for subsample datasets
with consistent data elements across two survey cycles) subsample
weights for selected sample persons.
Question 5. How do I calculate 6-year weights?
Answer: Researchers should calculate 6-year sample weights for NHANES
1999-2004 themselves. Because the 4-year weights for NHANES 1999-2002
already combine the first two cycles, the 6-year weight is
WT99-04 = (2/3) x WT99-02 + (1/3) x WT03-04, where
WT99-02 is the variable WTMEC4YR from the NHANES 1999-2000 and 2001-2002
demographic files, and WT03-04 is the variable WTMEC2YR from the NHANES
2003-2004 demographic file. Each sample person belongs to only one
cycle, so only one term applies to any given person: persons examined in
1999-2002 receive (2/3) x WTMEC4YR, and persons examined in 2003-2004
receive (1/3) x WTMEC2YR. Please refer to the NHANES Analytic
Guidelines provided with the data release files to determine the
appropriate methodology for analyses of combined years of data.
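The 6-year weight calculation can be sketched in Python as follows. This is an illustrative sketch only, not NCHS code: the function name is hypothetical, and it assumes the survey cycle indicator is coded 1 for 1999-2000, 2 for 2001-2002, and 3 for 2003-2004 (as in the SDDSRVYR variable on the demographic files).

```python
def six_year_mec_weight(cycle, wtmec4yr=None, wtmec2yr=None):
    """Return a 6-year MEC weight for one sample person (NHANES 1999-2004).

    cycle: assumed coding 1 = 1999-2000, 2 = 2001-2002, 3 = 2003-2004.
    wtmec4yr: 4-year MEC weight (used for persons in 1999-2002).
    wtmec2yr: 2-year MEC weight (used for persons in 2003-2004).
    """
    if cycle in (1, 2):
        # 1999-2002 covers 4 of the 6 years, so it gets 2/3 of the weight.
        return (2.0 / 3.0) * wtmec4yr
    if cycle == 3:
        # 2003-2004 covers 2 of the 6 years, so it gets 1/3 of the weight.
        return (1.0 / 3.0) * wtmec2yr
    raise ValueError("cycle must be 1, 2, or 3 for NHANES 1999-2004")


# Hypothetical weights, for illustration only.
w_early = six_year_mec_weight(1, wtmec4yr=30000.0)
w_late = six_year_mec_weight(3, wtmec2yr=30000.0)
```

The same logic applies to the interview weights if you substitute the corresponding 4-year and 2-year interview weight variables.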
Question 6. What are the
subsample weights and how are they constructed?
Answer: NHANES respondents are
asked to participate in a variety of survey components that are
statistically defined (or random) subsamples of the NHANES MEC-examined
sample. These include a variety of lab, nutrition/dietary,
environmental, or mental health components. (Please see the respective
survey protocol/documentation for more specific information.)
For example, some, but
not all, participants are selected to give a fasting blood sample on the
morning of their MEC exam. The subsamples selected for these components
are chosen at random with a specified sampling fraction (for example,
1/2 or 1/3 of the total examined group) according to the protocol for
that component. Each component subsample has its own designated weight,
which accounts for the additional probability of selection into the
subsample component, as well as the additional nonresponse.
Question 7. Can you combine subsample weights?
Answer: No. Subsample weights are not designed to
be combined. In fact, many subsamples are mutually exclusive. If it is
necessary to combine two or more subsamples for your analyses, then
appropriate weights would need to be recalculated. However, details on
how to recalculate weights when combining subsamples go well beyond the
scope of this tutorial. Therefore, it is strongly advised that you do
not attempt to combine subsamples in any analysis.
Question 8. When I subset NHANES data, should I do it in
SUDAAN or in SAS data steps?
Answer: For SUDAAN procedures it is important that you do not create a
smaller subgroup based on any non weight-related groups of interest
(e.g. demographic, laboratory or examination variables) in the SAS data
step before executing the SUDAAN procedure. Instead, it is highly
recommended that you create a subset of your sample population using
the subpopn statement in the SUDAAN procedure itself and not
in the SAS data step. In addition, SUDAAN procedures require that every
observation in the dataset being read into a procedure has a value for
the sample weight used in the analysis. Therefore, prior to the SUDAAN
procedure you should create a subset of your data to include only those
observations with the appropriate sample weight for your analysis.
For SAS Survey procedures, there is no
subpopn statement. Instead, most SAS Survey procedures use a
domain statement for domain analysis, also known as subgroup
analysis or subpopulation analysis.
Variance Estimation
Question 1. What kind of sampling features may affect the variance estimates
of NHANES data?
Answer: NHANES has a complex, multistage,
probability cluster design, and the statistical analysis must take
these sample design features into account. Specifically,
attributes of the complex sample design such as differential weighting,
clustering, and stratification all affect variance
estimates and estimated standard errors, and thereby test statistics and
confidence intervals.
Question 2. What would happen to the variance
estimates if standard statistical software for simple random samples is
used?
Answer: In a complex sample survey setting such as
NHANES, variance estimates computed using standard statistical software
packages that assume simple random sampling are generally too low and
therefore biased. As a result, significance levels are overstated and
type I error is more likely to occur. This is because these software
packages do not account for the differential weighting and the
correlation among sample persons within a cluster.
Question 3. How do you estimate the impact of
a complex sample design on variance estimates?
Answer: The impact of the complex sample design
upon variance estimates is measured by the design effect (DEFF). It is
defined as the ratio of the variance of a statistic which accounts for
the complex sample design to the variance of the same statistic based on
a hypothetical simple random sample of the same size.
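As a worked illustration of this ratio, the short Python sketch below computes a design effect for a prevalence estimate. The function names and the numbers are hypothetical; the simple random sample variance of a proportion is taken as p(1-p)/n, and the design-based standard error is assumed to come from survey software.

```python
def srs_variance_of_proportion(p, n):
    """Variance of a proportion under simple random sampling of size n."""
    return p * (1.0 - p) / n


def design_effect(var_complex, var_srs):
    """DEFF: design-based variance divided by the SRS variance."""
    if var_srs <= 0:
        raise ValueError("SRS variance must be positive")
    return var_complex / var_srs


# Hypothetical example: prevalence 25%, 1,200 sample persons,
# design-based standard error of 2 percentage points.
p, n, se_design = 0.25, 1200, 0.02
deff = design_effect(se_design ** 2, srs_variance_of_proportion(p, n))
```

A DEFF above 1 indicates that the complex design yields a larger variance than a simple random sample of the same size would.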
Question 4.
Are there specific mathematical formulas you recommend to use for
computing variance estimates for complex survey
data?
Answer: For complex sample surveys, exact
mathematical formulas for variance estimates are usually not available.
Variance approximation procedures are required to provide reasonable
estimates of sampling error.
Two variance approximation procedures
which account for the complex sample design and compute design effects
are replication methods and Taylor Series Linearization. Initially, the
delete-1 jackknife method, a replication method, was used to estimate
variances based on data from the NHANES 1999-2000 survey. Balanced
repeated replication was used for NHANES III. Currently NCHS recommends
the use of the Taylor Series Linearization methods for variance
estimation in all NHANES surveys. SUDAAN, Stata and the SAS Survey
procedures can be used to obtain variance estimates by this method.
Survey design variables identifying strata and PSU are required in order
to utilize these software packages. If replication methods are used, you
must compute your own replicate weights.
Question 5. Why do you emphasize degrees of
freedom so much in your NHANES tutorial?
Answer: There are several reasons for the emphasis
given to the proper calculation of the degrees of freedom:
- To calculate the correct value for the t-statistic from a
t-distribution and a selected level of significance, you must calculate
the proper degrees of freedom for the estimate.
- Continuing research on issues related to the stability of
variance estimates in subdomains of NHANES has shown that standard
error estimates based on small numbers of paired PSUs (i.e., degrees of
freedom) are prone to instability. Therefore, it is important to examine
the number of degrees of freedom on which a standard error estimate is
based.
- The reliability of the estimated standard error, as
measured by its relative standard error (i.e., (standard error of the
standard error of the estimate / standard error of the estimate) x 100),
is inversely related to its degrees of freedom.
As the number of degrees of freedom increases, the
relative standard error decreases and the reliability of the estimate
increases. The NHANES guidelines recommended a relative standard error
of at most 30%. This corresponds to at least 22 degrees of freedom.
Question 6. How do you properly calculate the
degrees of freedom?
Answer: Degrees of freedom are properly calculated
by subtracting the number of clusters in the first level of sampling
(strata) from the number of clusters in the second level of sampling
(PSUs) for each subgroup you are analyzing.
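A minimal Python sketch of this calculation follows. It is illustrative only: the function name is hypothetical, and each record is assumed to be a (stratum, PSU) pair for one sample person in the subgroup (in continuous NHANES these design variables are typically SDMVSTRA and SDMVPSU).

```python
def design_degrees_of_freedom(records):
    """Number of PSUs with observations minus number of strata with observations.

    records: iterable of (stratum, psu) pairs, one per sample person
    in the subgroup being analyzed.
    """
    # A PSU is identified by its stratum and its PSU code within that stratum.
    psus = {(stratum, psu) for stratum, psu in records}
    strata = {stratum for stratum, _ in psus}
    return len(psus) - len(strata)


# Hypothetical full sample: 3 strata with 2 PSUs each.
full_sample = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]

# Hypothetical subgroup found only in stratum 1 (both PSUs)
# and stratum 2 (one PSU).
subgroup = [(1, 1), (1, 1), (1, 2), (2, 1)]

df_full = design_degrees_of_freedom(full_sample)
df_subgroup = design_degrees_of_freedom(subgroup)
```

In this example the subgroup has fewer degrees of freedom than the full sample, because it is not represented in every stratum and PSU.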
Question 7. Are there any differences between
SAS and SUDAAN software in terms of handling the degrees of freedom?
Answer: For both SUDAAN and SAS Survey procedures,
the degrees of freedom are calculated in the same way when looking at
the entire sample population or in subgroups where all strata and PSUs
are represented.
However, when you analyze data on a
subgroup of sample persons who may not be represented in all strata and
PSUs (e.g., Mexican Americans), the degrees of freedom provided in the
SUDAAN and SAS Survey Procedures output may differ. For example, SAS
Survey procedures, such as proc surveymeans, compute the degrees
of freedom as the number of clusters (PSUs) in the non-empty strata
minus the number of non-empty strata.
This means that if your data have empty
strata (no persons in the population for either PSU) the number of
degrees of freedom will increase. This is incorrect and SAS is currently
working on correcting this problem.
Question 8. How do you generate confidence
intervals using SAS or SUDAAN?
Answer: Both SAS Survey
procedures (proc surveymeans) and SUDAAN version 9.1 (proc
descript) produce 95% confidence intervals (CI). These 95% CIs are
calculated using the Wald method, which is based on a t-statistic for
the number of degrees of freedom in the entire NHANES sample.
However, these procedures do not
correct for the reduction in the degrees of freedom in subdomains
where not all strata and PSUs are represented. Please see the Variance
Estimation module for instructions on how to correctly calculate a 95%
confidence interval. Also, the Wald method should not be used when the
proportion is close to 0% or 100%. For prevalence estimates near 0% or
near 100%, standard methods of calculating confidence limits, such as
the Wald method, may produce lower limits less than 0% or upper limits
greater than 100%. In these cases, it is often recommended to use
alternative methods for calculating 95% confidence limits using
transformations (such as the logit or arcsine transformation), using the
Wilson method, or calculating exact confidence limits such as the
Clopper-Pearson approach.
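To illustrate why the Wald method can fail near 0%, the Python sketch below compares a Wald interval with a logit-transformed interval for a small prevalence. It is a sketch under stated assumptions, not the tutorial's method: the function names and numbers are hypothetical, the standard error on the logit scale is approximated with the delta method, and the critical value is supplied by the caller (2.131 is the t value for a 95% interval with 15 degrees of freedom).

```python
import math


def wald_ci(p, se, crit):
    """Wald interval: estimate plus or minus crit times the standard error."""
    return p - crit * se, p + crit * se


def logit_ci(p, se, crit):
    """Interval built on the logit scale and transformed back to a proportion."""
    logit = math.log(p / (1.0 - p))
    se_logit = se / (p * (1.0 - p))  # delta-method approximation

    def inverse_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return (inverse_logit(logit - crit * se_logit),
            inverse_logit(logit + crit * se_logit))


# Hypothetical estimate: prevalence 1%, design-based standard error 0.6%.
p, se, crit = 0.01, 0.006, 2.131
wald_low, wald_high = wald_ci(p, se, crit)
logit_low, logit_high = logit_ci(p, se, crit)
```

Here the Wald lower limit falls below 0%, while the logit-based limits stay between 0% and 100%.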
NHANES Analysis
Descriptive Statistics
Question 1. In the tutorial, you recommend
checking the frequency distribution of each variable before analysis.
Why?
Answer: A frequency distribution not only presents an
organized picture of how individual scores are distributed on a
measurement scale, but also reveals extreme values and outliers which
may affect the analysis. Researchers can make decisions on whether and
how to recode or perform data transformation based on the distribution
statistics.
Question 2. Is it a good idea to get frequency tables for
all variables in your analysis, and print them out for reference?
Answer: In general, it is a good idea to check the
frequency distribution for all variables before analysis. However, you
may want your frequency distributions to be structured as either tables
or graphs. Because NHANES data have very large sample sizes with a
potentially long list of different values for continuous variables, it
is recommended that you use a graphic format to check the distribution
for continuous variables, and either frequency tables or graphic forms
for nominal or interval variables. If you request frequency tables for
all variables, you should always examine the length of your
printout before you press the "print" button, as there may be hundreds
of pages involved.
Question 3. If the statistics for normality
turn out to be significant in my analysis, does that mean I cannot use
parametric tests any more?
Answer: Not necessarily. Statistics of normality do reveal
whether a data distribution is normal or not, and help determine whether
parametric or non-parametric tests should be used, or data
transformation is needed. However, since NHANES is a large,
representative sample of the U.S. population, most continuous variables
from this sample are expected to be normally distributed. If you
conduct only tests for normality, results for most variables will be
significant, because even the slightest deviation from normality can
lead to rejection of the null hypothesis when sample sizes are extremely
large. Therefore, you should not base your decision solely on these
tests for normality.
A Q-Q plot, or quantile-quantile plot, may offer
additional information. A Q-Q plot is a graphical data analysis technique
for assessing whether the data follow a particular
distribution. In a normal Q-Q plot, the quantiles of the variable in
question are plotted against the quantiles of a normal distribution. The
variable of interest is approximately normally distributed if the
plotted points fall close to a straight reference line (a 45 degree line
when the variable is standardized). Based on your tests of normality and
the Q-Q plot, you can make a more informed decision about parametric or
non-parametric tests.
Question 4. What do you use percentiles for?
Answer: Compared with raw scores, percentiles provide additional information
about the distribution of values. For example, if you are told that a
boy is 27 inches tall and weighs 30 pounds, information such as the
average height and weight for his age group, or the number of boys who
score above or below this boy in his group would be very helpful. It is
much more informative if you could transform the height and weight of
the boy into percentile rank, such as 75th percentile in
height, and 50th percentile in weight for his age group.
Question 5. Can you generate percentiles with SAS Survey Procedures?
Answer: No, not with the current versions of SAS.
Question 6. When should you use geometric means instead of arithmetic
means?
Answer: In instances where the data are highly skewed,
geometric means should be used. A geometric mean, unlike an arithmetic
mean, minimizes the effect of very high or low values, which could bias
the mean if a straight average (arithmetic mean) were calculated. The
geometric mean is obtained by averaging the log-transformed data and
exponentiating the result; equivalently, it is the Nth root of
the product of N numbers.
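A short Python sketch of a weighted geometric mean is shown below for illustration. The function name and data are hypothetical, and the sketch produces only a point estimate; standard errors for NHANES data still require design-based methods in survey software.

```python
import math


def weighted_geometric_mean(values, weights):
    """Exponentiated weighted mean of the natural logs of the values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    total_weight = sum(weights)
    mean_log = sum(w * math.log(v) for v, w in zip(values, weights)) / total_weight
    return math.exp(mean_log)


# Hypothetical skewed data with equal weights.
values = [1.0, 10.0, 100.0]
weights = [1.0, 1.0, 1.0]

geometric = weighted_geometric_mean(values, weights)
arithmetic = sum(w * v for v, w in zip(values, weights)) / sum(weights)
```

With these values the geometric mean is much smaller than the arithmetic mean, because the single large value has less influence on the log scale.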
Question 7. In the Descriptive Statistics module you demonstrated
how to calculate prevalence for hypertension. But the definition you
used in this tutorial is different from the one I usually use. Why is
that?
Answer: Definitions of many conditions or risk factors, such as hypertension,
diabetes, osteoporosis, or obesity have changed over time. In addition,
definitions also vary by different health or medical organizations'
guidelines. Over the years, publications using historic or current
NHANES have reflected these changes in definitions.
In this tutorial, all the definitions used are for illustration
purposes only, rather than definitive guidelines. For the appropriate or
most recent definition of medical conditions, please consult official
publications from corresponding medical or public health agencies.
Hypothesis Testing
Question 1. Can we use Student's t-test for NHANES data?
Answer: Yes. Student's t-test assumes that the data have a normal
distribution, and that the covariance is small. NHANES data do meet both
assumptions on most occasions provided that you do not divide the data
into very small sub-domains. Therefore, the t-test is frequently used
for NHANES data to detect differences in health outcomes or risk factors
between subpopulations.
Question 2. How should I handle the degrees of freedom when conducting
hypothesis testing with NHANES data?
Answer: Unlike with simple random samples, you cannot simply use n-1 as the
degrees of freedom in NHANES since it is a multi-stage, area probability
sample. So the number of independent pieces of information, or degrees
of freedom, depends upon the number of PSUs rather than on the number of
sample persons. Therefore, the degrees of freedom are calculated as the
number of first stage units (PSUs) containing observations minus the
number of strata (please see Sample Design module for more
information).
Question 3. When I calculate confidence intervals for a point estimate in
NHANES, should I use the t score or the Z score in the formula?
Answer: You should use a t-statistic with degrees of freedom equal to the
difference between the number of PSUs and the number of strata
containing observations.
Question 4. Do I have to use weights and design based methods when
calculating confidence intervals?
Answer: Yes. Sample weights must be incorporated in calculating the estimate
and its standard error, and design-based methods must be used to
estimate the standard error. Taylor Series Linearization is one example
of a design-based method. The design variables needed to obtain
estimates of standard errors through this method are provided on the
demographic files for the continuous NHANES.
Question 5. Can I get confidence intervals for highly skewed variables?
Answer: If you have highly skewed variables, transformations are recommended
before constructing the confidence intervals. One of the most common
transformations used in the literature is the natural log (log base e). We
recommend that users verify that the transformed variable is normally
distributed before proceeding to construct confidence intervals.
Sometimes, applying the log-transformation does not necessarily yield a
normally distributed random variable. Furthermore, in instances in which
0 is a plausible value, the log is undefined. We recommend that users
try other transformations, for example the square root, in these
instances.
Question 6. Can I obtain geometric means and their confidence intervals
using SAS proc surveymeans?
Answer: At the present time, SAS proc surveymeans does not have an
option to produce geometric means and their standard errors. However,
they can be obtained by running proc surveymeans on the log
transformed variables to produce means and standard errors of the log
transformed variable, constructing the confidence interval on the
log-transformed scale, and then back transforming the endpoints.
In our tutorial, we demonstrated how you can obtain the geometric
mean and its standard error directly from SUDAAN proc descript.
If you have both software, you can then output the results to a SAS
dataset where the confidence interval can be constructed directly.
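The back-transformation step can be sketched in Python as follows. This is an illustration under stated assumptions: the function name is hypothetical, the mean and standard error of the log-transformed variable are assumed to come from a design-based procedure, and the t critical value for the appropriate degrees of freedom is supplied by the caller.

```python
import math


def geometric_mean_ci(mean_log, se_log, crit):
    """Geometric mean and confidence limits from log-scale results.

    mean_log: design-based mean of the natural-log-transformed variable.
    se_log: design-based standard error of that mean.
    crit: t critical value for the design degrees of freedom.
    """
    lower = math.exp(mean_log - crit * se_log)
    upper = math.exp(mean_log + crit * se_log)
    return math.exp(mean_log), lower, upper


# Hypothetical log-scale results; 2.131 is the 95% t value for 15 df.
gm, lower, upper = geometric_mean_ci(math.log(50.0), 0.02, 2.131)
```

The resulting interval is symmetric on the log scale but not on the original scale, which is expected for a back-transformed interval.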
Question 7. What procedures would you recommend for chi square testing?
Answer: For a complex sample like NHANES, we recommend that you use SAS proc
surveyfreq (CHISQ, based on the Rao-Scott chi-square with an
adjusted F statistic). This takes the survey design into account,
with degrees of freedom equal to the number of PSUs minus the number of
strata containing observations.
If you use SUDAAN, this test can be performed with the proc
crosstab procedure in SUDAAN version 9.0. It provides limited
chi-square statistics based on the Wald chi-square but does not provide an F
adjusted p-value. However, SUDAAN regression models do provide F
adjusted chi-square statistics which are recommended for analyzing
NHANES data.
The Cochran-Mantel-Haenszel test, an extension of the Pearson
Chi-Square, can be applied to stratified two-way tables to test for
homogeneity or independence in a non-survey setting. For a complex
sample its analogue can be obtained in SUDAAN proc crosstab (cmh).
Age Standardization
Question 1. When do I have to use age standardization?
Answer: Age standardization, or age adjustment, is used when comparing two or
more populations at one point in time, or one population at two or more
points in time. In other words, age-adjusted rates make two groups that
differ in their age distribution more comparable. This method is
particularly relevant when populations being compared have different age
structure, as is true, for example in the U.S. white and Hispanic
populations. In addition to being associated with population structure,
age also is frequently associated with many health outcomes and their
risk factors. Therefore, age standardization becomes a necessary method
to control for the confounding effects of age.
Question 2. Are age-adjusted rates usually different from unadjusted rates
in NHANES data?
Answer: That depends mainly on two factors: 1) whether the two
subgroups being compared differ substantially in age distribution; and 2)
whether the health outcomes or risk factors being compared are
associated with age. If both are true, you will usually find the age-adjusted rates
considerably different from crude rates, which suggests that
age standardization is necessary. Nevertheless, it is generally good
practice to use age-adjusted estimates when comparing health outcomes
among subgroups, or at least compare the age-adjusted estimates with the
crude rates to make sure there are no substantial differences, before
using the crude estimates.
Question 3. There are different methods for age-standardization. Which do
you recommend for NHANES data?
Answer: For NHANES analysis, we usually adopt the direct method for age
standardization. This involves three steps:
- selecting a standard population, typically a U.S. Census
population
- calculating age-standardizing proportions for the age
categories of interest
- applying the adjustment factors to the subpopulations under
comparison
For continuous NHANES, we recommend using the 2000 Census
population. A spreadsheet with the year 2000 U.S. population structure
by age is included in the tutorial for your convenience.
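The three steps of the direct method can be sketched in Python as shown below. This is a minimal illustration: the function name is hypothetical, and the standard population counts and age-specific rates are made-up numbers, not actual 2000 Census figures or NHANES estimates.

```python
def age_standardized_rate(age_specific_rates, standard_population):
    """Direct age standardization.

    age_specific_rates: dict mapping age group to the rate in the subpopulation.
    standard_population: dict mapping age group to the standard population count.
    """
    total = sum(standard_population.values())
    # Step 2: age-standardizing proportions. Step 3: apply them to the rates.
    return sum(age_specific_rates[group] * count / total
               for group, count in standard_population.items())


# Step 1: a hypothetical standard population (NOT actual Census counts).
standard_population = {"20-39": 40, "40-59": 35, "60+": 25}

# Hypothetical age-specific prevalence in one subpopulation.
rates_group_a = {"20-39": 0.10, "40-59": 0.30, "60+": 0.60}

adjusted_a = age_standardized_rate(rates_group_a, standard_population)
```

Applying the same standard proportions to every subpopulation removes differences that are due only to age structure.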
Question 4. When do you recommend the use of population estimates?
Answer: We most frequently use prevalence rates to describe health outcomes
or risk factors. These rates describe occurrences ranging from rare to
common in a population at a given time, but it is hard to see the impact
or magnitude of the issue just by looking at the prevalence rate.
Population estimates allow researchers to look at
the estimated total
numbers of persons in the U.S. affected with a given condition, thus
they can better describe the public health impact of an outcome or risk
factor.
Question 5. How do you calculate population estimates for NHANES data?
Answer:
- Calculate the crude prevalence (as a
percentage) for the age-, sex-, or race/ethnicity subgroups you are
interested in reporting. Then output these results to a SAS file.
- Use population totals from the Current
Population Survey (CPS) to determine population estimates in
NHANES.
- Multiply the prevalence of the health
condition of interest by the corresponding CPS-based population
total to obtain an estimate of the number of non-institutionalized
U.S. individuals with the condition.
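The multiplication in the last step can be sketched in Python as follows. The function name and the numbers are hypothetical and are not actual CPS totals or NHANES estimates.

```python
def population_estimate(prevalence_percent, cps_total):
    """Estimated number of persons with the condition in a subgroup.

    prevalence_percent: crude prevalence for the subgroup, as a percentage.
    cps_total: CPS-based population total for the same subgroup.
    """
    if not 0.0 <= prevalence_percent <= 100.0:
        raise ValueError("prevalence must be between 0 and 100 percent")
    return prevalence_percent / 100.0 * cps_total


# Hypothetical subgroup: prevalence 25%, CPS total of 40 million persons.
estimate = population_estimate(25.0, 40000000)
```

The prevalence and the CPS total must refer to the same age, sex, and race/ethnicity subgroup.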
Question 6. Where can we obtain CPS totals for continuous NHANES data?
Answer: CPS-based population tables for NHANES by race/ethnicity, gender and
age are located on the respective survey cycle NHANES web page at:
http://www.cdc.gov/nchs/about/major/nhanes/nhanes01-02.htm and as
SAS data files located on the Download Sample Code and Dataset page of
our tutorial.
Question 7. Can you combine population totals
across survey cycles,
or for multiple age and gender or race/ethnic subgroups?
Answer: Yes. It is possible to combine NHANES survey cycles. For example,
to combine two survey cycles (e.g., 2001-2002 and 2003-2004), you must
use the midpoint of each cycle, and combine them as follows: ½ (NHANES
2001-2002 population totals) + ½ (NHANES 2003-2004 population totals) in
order to get a population total for 2001-2004. Similarly, you would do
this for each of the age-, sex-, or race/ethnicity groups you wanted to
combine to get a population total for that group.
The only exception would be when combining NHANES 1999-2000 with
2001-2002 data. As stated in the weighting module, these survey years
used a different reference population for sampling, so population totals
for 1999-2002 are provided by NCHS.
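The averaging of population totals across cycles can be sketched in Python as follows. The function name and totals are hypothetical, and the sketch does not apply to combining 1999-2000 with 2001-2002, for which the totals provided by NCHS should be used instead.

```python
def combined_population_total(cycle_totals):
    """Average the population totals of equally long survey cycles.

    cycle_totals: list of CPS-based totals for one subgroup, one per
    2-year cycle (for two cycles this is 1/2 of each total).
    """
    if not cycle_totals:
        raise ValueError("at least one cycle total is required")
    return sum(cycle_totals) / len(cycle_totals)


# Hypothetical totals (in persons) for one subgroup in 2001-2002 and 2003-2004.
total_2001_2004 = combined_population_total([200000000.0, 210000000.0])
```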
Question 8. Why can't you just sum the final sampling weights for the
population totals?
Answer: Since the non-institutionalized CPS population totals are used to
calculate the final sampling weights for the NHANES survey, you may
wonder why you cannot just sum the final sampling weights for all sample
persons with the health condition of interest, in order to arrive at
population estimates for the health condition. For example, the total
population estimate for a given health condition from the interviewed
sample should equal the sum of the final interview weights for that
health condition within the demographic domains among all interviewed
persons. However, if there are a significant number of exclusions or
missing data for a health condition, summing the weights will not
produce an accurate population estimate. Therefore, using this method
is NOT RECOMMENDED.
Linear Regression
Question 1. When do you use linear regression
for NHANES data?
Answer: A linear
regression model is typically used to assess the association
between independent variable(s) (Xi) and a continuous
dependent variable (Y). In cross-sectional surveys such as NHANES,
linear regression analyses can be used to examine associations between
covariates and health outcomes. For instance, you can use
a multiple linear
regression model to assess the association between high density
lipoprotein cholesterol (Y) and selected covariates (Xi) such
as race/ethnicity, age, sex, body mass index (BMI), smoking status, and
education level.
Question 2. Which test statistics would you
recommend for regression analysis of NHANES data, WALD F, Satterthwaite
adjusted F, or Satterthwaite adjusted chi square?
Answer: For regression analyses, SUDAAN produces the Wald F, Satterthwaite
adjusted F, and Satterthwaite adjusted chi square statistics with their
corresponding p-values. SAS Survey Procedures produce only the Wald F
test with its corresponding p-value.
The current NHANES Analytic Guidelines do not make a recommendation
about which test statistic is the "best." Generally speaking, the
Satterthwaite adjusted F is the most conservative of the three
statistics; it rejects the null hypothesis less often than do the
other two statistics. However, we encourage analysts to examine all
three statistics and the corresponding p-values for consistency. We also
encourage analysts to compare the nominal degrees of freedom (i.e. the
number of PSUs minus the number of strata containing observations) to
the adjusted Satterthwaite degrees of freedom. Nominal degrees of
freedom that are much larger than the adjusted Satterthwaite degrees of
freedom may indicate model instability.
Question 3. How do you specify a multiple
regression model in SUDAAN with both continuous and discrete independent
variables plus interaction terms?
Answer: In SUDAAN, the association between the dependent and independent
variables is expressed using the model statement in the
proc regress procedure. The dependent variable must be a continuous
variable and will always appear on the left hand side of the equation.
The variables on the right hand side of the equation are the independent
variables and may be discrete, continuous or both. Continuous variables
are simply listed in the model. Discrete variables are specified using a
subgroup or a class statement.
When interactions are included in the model, they are denoted with an
asterisk, *, between the two variables. An interaction can occur between
a discrete and a continuous variable, or between two discrete variables.
An interaction term will always appear on the right hand side of an
equation.
Question 4. Can you do multiple regression analysis in SAS?
Answer: You can conduct multiple regression analysis on NHANES using SAS
Survey Procedures. However, you need to be mindful that version 9.1 of
SAS Survey Procedures does not have a domain statement for
subpopulation analyses. Therefore, you have to use a macro provided on
the SAS website. You need to download that file, save it to your
computer, and make sure to note the location, as you will use SAS code
to refer to this file later.
In SAS version 9.2 and later versions, a domain statement will be
added to proc surveyreg, so you will no longer need to use the SAS
macro to deal with subpopulation analyses.
The model statement in SAS is very similar to the SUDAAN procedure:
the dependent variable Y is continuous, and always appears on the left
hand side of the equation. The variables on the right hand side of the
equation are the independent variables and may be discrete or
continuous. Interactions always appear on the right hand side of an
equation, and are denoted with an asterisk, *, between the two
variables.
Question 5. How do you select a reference category in a regression
analysis?
Answer: In SUDAAN, the default reference category for a discrete variable is
set to the last category. However, you can use the reflevel
statement to change the reference level of a categorical variable to
your desired category.
In SAS, the default reference category is the highest level of a discrete
variable, and there is no option to change the reference category in the model.
Therefore, you will need to recode the desired reference category as the
highest level before specifying the model.
Logistic Regression
Question 1. What statistical software can I use for logistic regression
analysis?
Answer: You can run logistic regression with
stand-alone SUDAAN or SAS-callable SUDAAN, SAS Survey procedure, or
Stata.
Question 2. Are these software packages very
similar in programming languages?
Answer: Not really. For instance, each SAS or
SUDAAN version has its own unique commands for executing logistic
regression analysis. You need to use the correct command for the
software that you are using. Please also note that different versions of
SAS and SUDAAN use slightly different statements to specify categorical
variables and reference groups. Make sure that you are using the correct
commands for the version of software on your computer.
This tutorial module
generally uses SAS 9.1 and SUDAAN 9.0, and the commands are
slightly different for SAS and SUDAAN. For example:
- in the stand-alone version of SUDAAN, the
procedure is logistic
- in SAS-callable SUDAAN, the procedure is
called rlogist
- in SAS Survey procedures, the procedure is
surveylogistic
Question 3. How do you select weights for logistic regression models?
Answer: You always use the weight of the smallest common denominator for
all variables in the model. For instance, if you have both household
interview variables and MEC examination variables in the model, you will
choose MEC examination weights, since not all respondents who were
interviewed have participated in the MEC exam. Therefore, it is always
important to check all the variables in the model, and identify the
variable(s) with the smallest sample size.
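The selection rule can be sketched in Python as follows. This is a simplified illustration: the function name and the level labels are hypothetical, the 2-year weight names WTINT2YR and WTMEC2YR are used for the interview and MEC levels, and the subsample weight name must be supplied by the caller because it differs by component.

```python
# Levels ordered from the largest sample to the smallest.
LEVEL_RANK = {"interview": 0, "mec": 1, "subsample": 2}


def choose_weight(variable_levels, subsample_weight=None):
    """Pick the weight matching the most restrictive variable in the model.

    variable_levels: dict mapping each model variable to the level at which
    it was collected ("interview", "mec", or "subsample").
    subsample_weight: name of the component's subsample weight, if needed.
    """
    most_restrictive = max(variable_levels.values(), key=LEVEL_RANK.__getitem__)
    if most_restrictive == "interview":
        return "WTINT2YR"
    if most_restrictive == "mec":
        return "WTMEC2YR"
    if subsample_weight is None:
        raise ValueError("a subsample weight name is required")
    return subsample_weight


# Age comes from the interview and BMI from the MEC exam,
# so the MEC weight applies.
weight = choose_weight({"RIDAGEYR": "interview", "BMXBMI": "mec"})
```

The sketch handles only one subsample at a time, since subsample weights are not designed to be combined.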
Question 4. How do you code the dependent variable for event and non-event?
Answer: For SUDAAN and SAS, you usually code the dependent variable as 1 for
an event, and 0 for a non-event.
Question 5. When I run both the SAS Survey and SUDAAN programs for the same
logistic regression model, why do I sometimes get different results?
Answer: This is because there may be slight differences caused by missing
data in any paired PSU or how each software program handles degrees of
freedom. Specifically:
- The variance estimates and standard errors are
identical if there are no missing data in any paired PSUs. They will
be different if any one of the paired PSUs contains missing data, as
SAS and SUDAAN handle stratum contribution from the missing cells
differently.
- The confidence intervals are slightly
different since SAS and SUDAAN handle degrees of freedom
differently.
Page Last Modified:
August 05, 2008