Recognizing the Problem
NRSA Trainees Research Conference Slide Presentation (Text Version)
By Maureen Smith, M.D., Ph.D.
On June 5, 2004, Maureen Smith, M.D., Ph.D., made a presentation at the 10th Annual National Research Service Award (NRSA) Trainees Research Conference. This is the text version of her slide presentation.
Slide 1
Multi-level Analysis I: Recognizing the Problem
Maureen Smith, M.D., Ph.D.
Dept. of Population Health Sciences
University of Wisconsin-Madison
Slide 2
A day in the life of a researcher:
- We have data:
- ID (observation #).
- X (variable 1).
- Y (variable 2).
- We want to use the value of X to explain the value of Y.
ID | X | Y
1 | 60 | 3
2 | 75 | 6
3 | 81 | 10
4 | 70 | 7
5 | 65 | 5
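As a concrete illustration of using X to explain Y, the following is a minimal sketch that fits a least-squares line to the five observations in the table above. It uses plain Python. The helper name `fit_line` is made up for this example and does not come from the presentation.

```python
# Least-squares fit of Y on X for the five observations on Slide 2.
# fit_line is a hypothetical helper name used only for this illustration.

def fit_line(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariation of x and y divided by variation in x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

X = [60, 75, 81, 70, 65]
Y = [3, 6, 10, 7, 5]

a, b = fit_line(X, Y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

With only five observations the fit is purely illustrative. The point is the form of the model, not the estimates.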
Slide 3
Welcome to the fantasy world of
linear regression
Slide 4
Reality check
How often are
observations truly
independent from one
another?
The slide features a square containing orange or green dots.
- Dot indicates geographic location of teenager.
- Orange or green indicates hair color.
Do these teenagers look independent?
Slide 5
1) Clustering introduced in sampling
The slide features a square containing 4 large circles. The circle labeled "Block 1" contains a number of orange dots. The circle labeled "Block 2" contains yellow dots. The circle labeled "Block 3" contains green dots. The circle labeled "Block 4" contains text that reads, "Not all blocks are selected".
- Multistage sampling:
- Circles represent city blocks.
- Blocks randomly sampled.
- All persons in block surveyed to determine attitudes.
- Persons in one block are more like their neighbors than persons who live in another block.
- Nesting or clustering of data:
- Persons within city blocks.
Slide 6
Effect of sample design on errors
The slide features a square containing 4 large circles, as described on Slide 5 above, except that the circle labeled "Block 4" is now empty.
- Errors in linear regression:
- Assume independence.
- Each person => info.
- Each person worth "1."
- If clustering occurs:
- Obs not independent.
- Each person => less info.
- Each person worth < "1."
Slide 7
Simple linear regression won't work!
- Violates assumption of independence:
- If you don't account for it:
- Standard errors are too small.
- Makes coefficients look more significant.
- "You think there is more information in the data than actually exists."
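The claim that standard errors come out too small can be checked with a small simulation. The sketch below is an assumption-laden illustration, not part of the presentation. It uses the simplest possible "regression" (estimating a mean), invents 10 clusters of 20 people, and gives every person a shared cluster component plus individual noise. All sizes, variances, and the function name `simulate` are made up for the example.

```python
import random
import statistics

def simulate(n_clusters=10, per_cluster=20, n_reps=500, seed=0):
    """Compare the naive standard error of a mean with the mean's actual
    variability when people in the same cluster share a common component."""
    rng = random.Random(seed)
    sample_means = []
    naive_ses = []
    for _ in range(n_reps):
        values = []
        for _ in range(n_clusters):
            cluster_effect = rng.gauss(0, 1)  # shared by everyone in the cluster
            for _ in range(per_cluster):
                values.append(cluster_effect + rng.gauss(0, 1))  # individual noise
        n = len(values)
        sample_means.append(statistics.fmean(values))
        # Naive standard error: treats all n observations as independent.
        naive_ses.append(statistics.stdev(values) / n ** 0.5)
    # Spread of the sample mean across repeated samples: the "true" uncertainty.
    actual_sd = statistics.stdev(sample_means)
    mean_naive_se = statistics.fmean(naive_ses)
    return mean_naive_se, actual_sd

mean_naive_se, actual_sd = simulate()
print(f"average naive standard error: {mean_naive_se:.3f}")
print(f"actual SD of the sample mean: {actual_sd:.3f}")
```

Under these assumptions the naive standard error should come out well below the actual variability of the estimate, which is the sense in which the analyst "thinks there is more information in the data than actually exists."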
Slide 8
How much information is lost?
"Design Effect"
If designing a study using multistage sampling, need to increase sample size to account for loss of information.
- Design effect:
- Each observation is "worth less."
- Need to estimate your "effective" sample size.
- Used for sample size calculations in multi-stage sampling.
Neffective = Nn / Design effect
Slide 9
Questions—Pair up!
The slide features a square containing 4 large circles, as described on Slide 6 above.
- Multi-stage sample design:
- City blocks N = 3.
- Persons N = 26.
- Design effect = 2.
- What is the effective sample size?
- What sample size would you use in your power calculations?
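One way to work through the arithmetic is sketched below. It assumes the numerator of the effective-sample-size formula is the total number of persons surveyed (26). The function names and the planning figure of 100 are hypothetical and are not taken from the slides.

```python
def effective_sample_size(n_total, design_effect):
    """Number of independent observations the clustered sample is 'worth'."""
    return n_total / design_effect

def required_sample_size(n_needed_if_independent, design_effect):
    """Inflate a simple-random-sample size to allow for the design effect."""
    return n_needed_if_independent * design_effect

# Slide 9: 26 persons, design effect of 2.
n_effective = effective_sample_size(n_total=26, design_effect=2)
print(n_effective)  # 13.0

# Hypothetical planning figure (not from the slides): if a power calculation
# calls for 100 independent observations, the multistage design needs more.
n_to_recruit = required_sample_size(n_needed_if_independent=100, design_effect=2)
print(n_to_recruit)  # 200
```

On this reading, the 26 surveyed persons carry about as much information as 13 independent observations, so a power calculation would use the smaller figure.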
Slide 10
2) Clustering introduced naturally
The slide features a square containing 4 large circles, as described on Slide 6 above, except that the circles are now labeled "Hospital 1, 2, 3, 4" instead of "Block."
- Analyze costs of care for hospitalized patients.
- Patients in one hospital are more alike than patients in another hospital.
- Nesting or clustering of data:
- Patients within hospitals.
Slide 11
Effect of natural clusters on errors
The slide features a square containing 4 large circles, as described on Slide 10 above.
- Same effect on errors:
- Obs not independent.
- Each person => less info.
- Each person worth < "1."
- Simple linear regression won't work!
Slide 12
What do we do?
- First question—do we care?
- Is clustering a nuisance?
or
- Is clustering an interesting phenomenon?
- Leads to different analytic strategies.
Slide 13
If clustering is a nuisance
- Example—Multi-stage sampling:
- Don't care how people vary within city blocks versus between city blocks.
- Artificially imposed by the sampling design.
- Not interested in measuring it.
- Just want to correct for it.
- Use analytic strategies that correct for clustering.
Slide 14
How to correct errors for clustering
- Robust estimates of variance:
- Stata ", robust cluster (____)".
- SAS empirical estimates of variance.
- Programs that account for complex survey design (weights, strata, clusters):
- Stata "svy" commands.
- SAS "survey___" commands.
- Other strategies.
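To show what a cluster-robust estimate of variance does, the sketch below computes both the usual and a cluster-robust standard error for the slope of a simple regression. It is not the code that Stata or SAS runs. It omits the finite-sample adjustments that statistical packages typically apply, so the numbers would not match package output exactly. The data are simulated, and every setting (40 blocks of 10 persons, the variances, the function name `slope_standard_errors`) is an assumption made for illustration.

```python
import random

def slope_standard_errors(x, y, cluster):
    """Fit y = a + b*x by least squares and return
    (b, naive SE of b, cluster-robust SE of b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    xc = [xi - mean_x for xi in x]  # centered x
    sxx = sum(v * v for v in xc)
    b = sum(v * (yi - mean_y) for v, yi in zip(xc, y)) / sxx
    a = mean_y - b * mean_x
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

    # Usual standard error: assumes independent errors.
    s2 = sum(e * e for e in resid) / (n - 2)
    naive_se = (s2 / sxx) ** 0.5

    # Cluster-robust ("sandwich") standard error: add up x*residual
    # within each cluster first, then square the cluster totals.
    score_by_cluster = {}
    for v, e, g in zip(xc, resid, cluster):
        score_by_cluster[g] = score_by_cluster.get(g, 0.0) + v * e
    robust_se = sum(s * s for s in score_by_cluster.values()) ** 0.5 / sxx
    return b, naive_se, robust_se

# Simulated clustered data (made up for this example).
rng = random.Random(1)
x, y, cluster = [], [], []
for g in range(40):
    block_x = rng.gauss(0, 1)       # x is mostly a block-level quantity
    block_effect = rng.gauss(0, 1)  # unobserved block-level shift in y
    for _ in range(10):
        xi = block_x + 0.3 * rng.gauss(0, 1)
        x.append(xi)
        y.append(1.0 + 0.5 * xi + block_effect + rng.gauss(0, 1))
        cluster.append(g)

b, naive_se, robust_se = slope_standard_errors(x, y, cluster)
print(f"slope = {b:.3f}, naive SE = {naive_se:.3f}, cluster-robust SE = {robust_se:.3f}")
```

The coefficient is the same either way. Only the standard error changes, which is why this approach suits the case where clustering is a nuisance to be corrected rather than a phenomenon to be studied.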
Slide 15
If clustering is interesting
- Example—examine costs for hospitalized patients.
- Split out the variation in costs:
- How much variation due to differences in patients?
- How much variation due to differences in hospitals?
- Examine factors that explain variation in costs:
- Characteristics of patients.
- Characteristics of hospitals.
- Analytic strategy = Multi-level modeling!
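A rough sense of "splitting out the variation" can be had from a plain sum-of-squares decomposition. The sketch below uses the five-patient table that appears on Slide 19 (hospital ID and cost). It is only a descriptive split of the raw variation in cost, not a multi-level model, and the variable names are chosen for this example.

```python
from collections import defaultdict

# (hospital ID, cost) pairs copied from the five-patient table on Slide 19.
records = [(1, 3), (1, 6), (2, 10), (2, 7), (2, 5)]

costs = [cost for _, cost in records]
grand_mean = sum(costs) / len(costs)

by_hospital = defaultdict(list)
for hospital_id, cost in records:
    by_hospital[hospital_id].append(cost)

ss_between = 0.0  # variation of hospital means around the grand mean
ss_within = 0.0   # variation of patients around their own hospital mean
for hospital_costs in by_hospital.values():
    hospital_mean = sum(hospital_costs) / len(hospital_costs)
    ss_between += len(hospital_costs) * (hospital_mean - grand_mean) ** 2
    ss_within += sum((c - hospital_mean) ** 2 for c in hospital_costs)

ss_total = sum((c - grand_mean) ** 2 for c in costs)
share_between = ss_between / ss_total

print(f"total = {ss_total:.2f}, between = {ss_between:.2f}, within = {ss_within:.2f}")
print(f"share of variation between hospitals = {share_between:.2f}")
```

The between and within pieces add up to the total, which is the basic idea a multi-level model formalizes when it assigns variation to patients and to hospitals.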
Slide 16
Questions
1. Identify 3 patient characteristics that
might explain variation in costs.
2. Identify 3 hospital characteristics that
might explain variation in costs.
3. Do you think more of the variation in
costs is explained by the patient or the
hospital?
Slide 17
Multi-Level Models
(Hierarchical linear models)
(Random effects models)
Slide 18
The concept of "levels"
The slide features a square containing 4 large circles, as described on Slide 10 above.
- Our example—2 levels:
- Micro = patients (N=26).
- Macro = hospitals (N=3).
- At each level:
- Patient characteristics.
- Hospital characteristics.
Slide 19
Data Structure—Patient
Patient-level data ( = "unit-level data"):
Patient ID | Hospital ID | Age (X) | Cost (Y)
1 | 1 | 60 | 3
2 | 1 | 75 | 6
3 | 2 | 81 | 10
4 | 2 | 70 | 7
5 | 2 | 65 | 5
- Y represents a patient characteristic:
- Cost.
- X represents a patient characteristic:
- Age.
- Note—understand process at each step.
- "Older patients are sicker and tend to cost more."
Slide 20
Simple Linear Regression
yi = a + bxi + ei
- i indexes patients (i=1 to N).
- Relates x to y.
- Both variables are patient characteristics.
- Remember the assumptions.
Slide 21
Questions
Patient ID | Hospital ID | Age (X) | Cost (Y)
1 | 1 | 60 | 3
2 | 1 | 75 | 6
3 | 2 | 81 | 10
4 | 2 | 70 | 7
5 | 2 | 65 | 5
costi = a + b(agei) + ei
- Is there a problem with this model when applied to these data?
- If so, what?
Slide 22
The Problem
- Does not account for the clustering of patients within hospitals:
- Data have a structure that is not represented.
- ei—Assumption of independence is not met.
- Do we care?
- If clustering is nuisance => Stata robust option.
- If clustering is interesting => Multilevel model.
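One informal way to see that the data have a structure the model does not represent is to fit the simple regression of cost on age and then look at the residuals hospital by hospital. The sketch below does this for the five patients on Slide 21. With so few observations it is only an illustration, and the variable names are chosen for the example.

```python
# Data copied from the five-patient table on Slide 21.
age = [60, 75, 81, 70, 65]
cost = [3, 6, 10, 7, 5]
hospital = [1, 1, 2, 2, 2]

# Fit cost = a + b*age by least squares, ignoring hospitals.
n = len(age)
mean_age = sum(age) / n
mean_cost = sum(cost) / n
sxy = sum((x - mean_age) * (y - mean_cost) for x, y in zip(age, cost))
sxx = sum((x - mean_age) ** 2 for x in age)
b = sxy / sxx
a = mean_cost - b * mean_age

residuals = [y - (a + b * x) for x, y in zip(age, cost)]

# Average residual within each hospital.
mean_residual = {}
for h in sorted(set(hospital)):
    in_hospital = [e for e, g in zip(residuals, hospital) if g == h]
    mean_residual[h] = sum(in_hospital) / len(in_hospital)
    print(f"hospital {h}: mean residual = {mean_residual[h]:.3f}")
```

The residuals average to zero overall, but not within each hospital: patients in one hospital sit below the fitted line on average and patients in the other sit above it. That shared, hospital-specific shift is what the independence assumption on ei rules out.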
Slide 23
Data Structure—Hospital
Hospital-level data ( = "group-level data"):
Hospital ID | Beds (W)
1 | 10
2 | 65
- W represents a hospital characteristic:
- # of beds in the hospital.
- Bigger hospitals are more expensive:
- More technology.
- More high-cost specialists.
- "A built bed is a filled bed."
Slide 24
Combined Data Structure
Patient-level data:
Patient ID | Hospital ID | Age (X) | Cost (Y)
1 | 1 | 60 | 3
2 | 1 | 75 | 6
3 | 2 | 81 | 10
4 | 2 | 70 | 7
5 | 2 | 65 | 5
+
Hospital-level data:
Hospital ID | Beds (W)
1 | 10
2 | 65
= ?
Slide 25
Combined Data Structure
Patient- and hospital-level data:
Patient ID | Hospital ID | Age (X) | Cost (Y) | Beds (W)
1 | 1 | 60 | 3 | 10
2 | 1 | 75 | 6 | 10
3 | 2 | 81 | 10 | 65
4 | 2 | 70 | 7 | 65
5 | 2 | 65 | 5 | 65
- Age (X) and Cost (Y):
- Variation between patients.
- Beds (W):
- Only variation between hospitals.
- No variation within hospitals.
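The combined structure amounts to looking up each patient's hospital in the hospital-level table and copying the hospital's value of W onto the patient's row. A minimal sketch in plain Python follows. The field names are chosen for this example, and the values are those shown on the slides.

```python
# Patient-level data (one row per patient).
patients = [
    {"patient_id": 1, "hospital_id": 1, "age": 60, "cost": 3},
    {"patient_id": 2, "hospital_id": 1, "age": 75, "cost": 6},
    {"patient_id": 3, "hospital_id": 2, "age": 81, "cost": 10},
    {"patient_id": 4, "hospital_id": 2, "age": 70, "cost": 7},
    {"patient_id": 5, "hospital_id": 2, "age": 65, "cost": 5},
]

# Hospital-level data (one entry per hospital): hospital ID -> number of beds.
beds_by_hospital = {1: 10, 2: 65}

# Attach the hospital characteristic to every patient in that hospital.
combined = [
    {**patient, "beds": beds_by_hospital[patient["hospital_id"]]}
    for patient in patients
]

for row in combined:
    print(row)

# Within any one hospital, beds takes a single value.
beds_values_by_hospital = {}
for row in combined:
    beds_values_by_hospital.setdefault(row["hospital_id"], set()).add(row["beds"])
print(beds_values_by_hospital)
```

The last step makes the slide's point explicit: after the merge, beds differs between hospitals but is constant for all patients within the same hospital.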
Slide 26
WARNING—Equations coming up!
Remember—In multi-level modeling...
SUBSCRIPTS ARE YOUR FRIENDS!
Slide 27
Simple Linear Regression (one approach to modeling this data structure)
yij = a + bxij + dwj + eij
- j indexes hospitals (j=1 to N).
- i indexes patients within hospitals (i=1 to nj).
costij = a + b(ageij) + d(bedsj) + eij
Slide 28
Questions
Patient ID | Hospital ID | Age (X) | Cost (Y) | Beds (W)
1 | 1 | 60 | 3 | 10
2 | 1 | 75 | 6 | 10
3 | 2 | 81 | 10 | 65
4 | 2 | 70 | 7 | 65
5 | 2 | 65 | 5 | 65
costij = a + b(ageij) + d(bedsj) + eij
- Is there a problem with this model when applied to these data?
- If so, what?
Slide 29
The Problem, Part 2
- You must assume that all of the data structure is represented by the explanatory variables.
- Unlikely this will account for the clustering of patients within hospitals.
- Assumes that all clustering within hospitals is explained by the number of beds in the hospital (W).
- If "beds" does not explain all clustering, then assumption of independence is not met for eij.
Slide 30
How do we represent the clustering?
Slide 31
Starting simple—random intercept
- Model the clustering between groups:
- Let the intercept only (aj) vary from group to group.
- Take out all group-level variables (W):
yij = aj + bxij + eij
- Groups j—higher or lower values of aj only.
- Assumes some groups tend to have, on average, higher or lower values of Y.
Slide 32
Question
yij = aj + bxij + eij
- Why take the group-level variable (W) out of this model?
- Must W be taken out of the model?
Slide 33
How do we want to model variation between groups?
- W—a "partial" way to model variation between groups:
- If included, it will pick up part of the variation between groups.
- "Part of the variation in costs between hospitals will be explained by the number of beds in the hospital."
- Goal of a random intercept model:
- Model the actual structure of the data.
- Let groups vary, on average, in Y.
- "Let the hospitals vary, on average, in cost."
Slide 34
How do we actually do it?
yij = aj + bxij + eij
- Split aj into (a0 + uj):
yij = a0 + uj + bxij + eij
- a0 = average intercept (constant).
- uj = deviation from the average intercept for group j.
- That is, conditional on X, individuals in group j have Y values that are uj higher than in the average group.
- "Conditional on patient age, patients in Hospital j have costs that are uj higher than the average costs for all patients."
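Written out with explicit subscripts, the substitution on this slide is a single step. The LaTeX below is a restatement of the slide's equations using the same symbols. Nothing new is assumed beyond typesetting the subscripts.

```latex
\begin{align*}
y_{ij} &= a_j + b\,x_{ij} + e_{ij}, \qquad \text{where } a_j = a_0 + u_j \\
       &= a_0 + u_j + b\,x_{ij} + e_{ij}
\end{align*}
```

The subscripts carry the meaning: a_0 and b have no subscript because they are shared by everyone, u_j has only a group subscript because it is shared by all individuals in group j, and e_ij has both because it is specific to individual i in group j.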
Slide 35
What do we do with uj?
Part 1—Fixed effects
- Are groups j regarded as unique?
- Do you want to draw conclusions about each group?
TREAT AS "FIXED EFFECTS"
- Create N - 1 indicator variables (0/1), where N is the number of groups.
- Leads to N - 1 regression parameters.
Slide 36
Questions
Patient ID | Hospital ID | Age (X) | Cost (Y)
1 | 1 | 60 | 3
2 | 1 | 75 | 6
3 | 2 | 81 | 10
4 | 2 | 70 | 7
5 | 2 | 65 | 5
costij = a0 + b(ageij) + uj + eij
- For our data, what does this equation look like if uj is modeled as a fixed effect?
- Are all indicator variables in a model also fixed effects?
Slide 37
Modeling uj as a fixed effect
(uj = "differences between hospitals")
costij = a0 + b(ageij) + c(hosp2ij) + eij
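The fixed-effects equation above can be fit by ordinary least squares once the indicator for Hospital 2 is built. The sketch below does this in plain Python for the five patients in the table, solving the normal equations directly. The helper names `solve` and `ols` are made up for the example, and with five observations the estimates are illustrative only.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Data copied from the five-patient table.
age = [60, 75, 81, 70, 65]
cost = [3, 6, 10, 7, 5]
hospital = [1, 1, 2, 2, 2]

# Two hospitals, so one indicator: hosp2 = 1 for Hospital 2, 0 for Hospital 1.
hosp2 = [1 if h == 2 else 0 for h in hospital]

# Columns of the design matrix: constant, age, hosp2.
design = [[1, x, d] for x, d in zip(age, hosp2)]
a0, b, c = ols(design, cost)
print(f"a0 = {a0:.3f}, b = {b:.3f}, c = {c:.3f}")

residuals = [y - (a0 + b * x + c * d) for x, y, d in zip(age, cost, hosp2)]
```

Here Hospital 1 is the reference group, a0 is its intercept, and c is the difference in intercepts between Hospital 2 and Hospital 1. A side effect of including the indicator is that the residuals now sum to zero within each hospital.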
Slide 38
What do we do with uj?
Part 2—Random effects
- Model uj explicitly.
- Additional assumption that uj is i.i.d.
- Groups (hospitals) considered exchangeable.
- Can include group-level explanatory variables (W).
Slide 39
Questions
Patient ID | Hospital ID | Age (X) | Cost (Y) | Beds (W)
1 | 1 | 60 | 3 | 10
2 | 1 | 75 | 6 | 10
3 | 2 | 81 | 10 | 65
4 | 2 | 70 | 7 | 65
5 | 2 | 65 | 5 | 65
yij = a0 + b(xij) + uj + eij
- For our data, what does this equation look like if uj is modeled as a random effect?
- How would we include our hospital-level explanatory variable?
Slide 40
Modeling uj as a random effect
(uj = "differences between hospitals")
costij = a0 + b(ageij) + uj + eij
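For reference, the random-effects version can be written with its assumptions alongside. In the LaTeX sketch below, the variance symbols \tau^2 and \sigma^2 are notation introduced here, not taken from the slides. The third line shows one way the hospital-level variable could enter, reusing the coefficient d from the regression on Slide 27. The last line is the standard intraclass correlation, included only as a pointer to how the split between hospital and patient variation is usually summarized.

```latex
\begin{align*}
\text{cost}_{ij} &= a_0 + b\,\text{age}_{ij} + u_j + e_{ij} \\
u_j &\sim \text{i.i.d. with mean } 0 \text{ and variance } \tau^2, \qquad
e_{ij} \sim \text{i.i.d. with mean } 0 \text{ and variance } \sigma^2 \\
\text{cost}_{ij} &= a_0 + b\,\text{age}_{ij} + d\,\text{beds}_{j} + u_j + e_{ij} \\
\rho &= \frac{\tau^2}{\tau^2 + \sigma^2}
\end{align*}
```

Unlike the fixed-effects version, no separate parameter is estimated for each hospital. The hospitals are treated as exchangeable draws, and the model estimates the variance of u_j instead.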
Slide 41
What we did and didn't do today
We discussed:
- Clustering (artificial and natural).
- Accounting for clustering:
- Nuisance = robust estimates of variance.
- Interesting = multilevel models.
- Representing clustering in simple model:
- Fixed effects.
- Random effects with group-level explanatory variables.
We didn't discuss:
- Random coefficients other than the intercept.
- Interaction terms (cross-level effects).
- Many other things.
Slide 42
Follow-up
maureensmith@wisc.edu
Current as of September 2004
Internet Citation:
Multi-level Analysis I: Recognizing the Problem. Text Version of a Slide Presentation at a National Research Service Award (NRSA) Trainees Research Conference. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/fund/training/smithtxt.htm