Multi-level Analysis I

Recognizing the Problem

NRSA Trainees Research Conference Slide Presentation (Text Version)

By Maureen Smith, M.D., Ph.D.

On June 5, 2004, Maureen Smith, M.D., Ph.D., made a presentation at the 10th Annual National Research Service Award (NRSA) Trainees Research Conference. This is the text version of her slide presentation.

Slide 1

Multi-level Analysis I: Recognizing the Problem

Maureen Smith, M.D., Ph.D.
Dept. of Population Health Sciences
University of Wisconsin-Madison

Slide 2

A day in the life of a researcher:

We have data:
- ID (observation #).
- X (variable 1).
- Y (variable 2).
We want to use the value of X to explain the value of Y.

ID	X	Y
1	60	3
2	75	6
3	81	10
4	70	7
5	65	5

Slide 3

Welcome to the fantasy world of linear regression

A simple model:
y_i = intercept + slope(x_i) + error
i indexes observations (i = 1 to N)
(a graph illustrates)
Assumptions:
- Linearity.
- Independence.
- Normality.
- Constant variance.
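
As a minimal sketch of the simple model, the following Python fits the intercept and slope by ordinary least squares to the five observations from Slide 2. The function name fit_simple_ols is hypothetical and the choice of Python is only for illustration; the slides do not prescribe software at this point.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) for y = intercept + slope * x by least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x, and cross-products of x and y, around their means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Data from Slide 2.
x = [60, 75, 81, 70, 65]
y = [3, 6, 10, 7, 5]

intercept, slope = fit_simple_ols(x, y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

For these five points the slope comes out at roughly 0.29. Whether the four assumptions hold is a separate question that the fit itself cannot answer.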

Slide 4

Reality check

How often are observations truly independent from one another?

The slide features a square containing orange or green dots.

Dot indicates geographic location of teenager.
Orange or green indicates hair color.

Do these teenagers look independent?

Slide 5

1) Clustering introduced in sampling

The slide features a square containing 4 large circles. The circle labeled "Block 1" contains a number of orange dots. The circle labeled "Block 2" contains yellow dots. The circle labeled "Block 3" contains green dots. The circle labeled "Block 4" contains text that reads, "Not all blocks are selected".

Multistage sampling:
- Circles represent city blocks.
- Blocks randomly sampled.
- All persons in block surveyed to determine attitudes.
Persons in one block are more like their neighbors than persons who live in another block.
Nesting or clustering of data:
- Persons within blocks.

Slide 6

Effect of sample design on errors

The slide features a square containing 4 large circles, as described on Slide 5 above, except that the circle labeled "Block 4" is now empty.

Errors in linear regression:
- Assume independence.
- Each person => info.
- Each person worth "1."
If clustering occurs:
- Obs not independent.
- Each person => less info.
- Each person worth < "1."

Slide 7

Simple linear regression won't work!

Clustering violates the assumption of independence.
If you don't account for it:
- Standard errors are too small.
- Makes coefficients look more significant.
- "You think there is more information in the data than actually exists."
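
A small simulation can make the "standard errors are too small" point concrete. The sketch below is an illustration under assumed settings, none of which come from the slides: 20 clusters of 10 people, a shared cluster shift and individual noise that both have variance 1, 500 repeated samples, and a fixed random seed. It compares how much the sample mean actually varies across samples with the standard error that the independence assumption would report.

```python
import random
import statistics


def simulate_mean_and_naive_se(rng, n_clusters=20, cluster_size=10):
    """Draw one clustered sample; return its mean and the naive SE of that mean."""
    values = []
    for _ in range(n_clusters):
        cluster_shift = rng.gauss(0, 1)  # shared by everyone in the cluster
        for _ in range(cluster_size):
            values.append(cluster_shift + rng.gauss(0, 1))  # plus individual noise
    # The naive SE treats all observations as independent.
    naive_se = statistics.stdev(values) / len(values) ** 0.5
    return statistics.fmean(values), naive_se


rng = random.Random(2004)  # fixed seed so the run is repeatable
results = [simulate_mean_and_naive_se(rng) for _ in range(500)]

# How much the sample mean really varies across repeated samples.
actual_sd_of_mean = statistics.stdev([mean for mean, _ in results])
# What the independence assumption claims, averaged over the samples.
average_naive_se = statistics.fmean([se for _, se in results])

print(f"actual SD of the sample mean: {actual_sd_of_mean:.3f}")
print(f"average naive standard error: {average_naive_se:.3f}")
```

Under these settings the naive standard error should come out well below the actual spread of the mean, which is the sense in which the analyst "thinks there is more information in the data than actually exists."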

Slide 8

How much information is lost?

"Design Effect"

If you are designing a study that uses multistage sampling, you need to increase the sample size to account for the loss of information.

Design effect:
- Each observation is "worth less."
- Need to estimate your "effective" sample size.
- Used for sample size calculations in multi-stage sampling.

N_effective = N / (Design effect)
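
The relation on this slide (effective sample size equals actual sample size divided by the design effect) can be sketched in a few lines of Python. The function names and the numbers are hypothetical. The slides do not say how the design effect itself is obtained; the second function uses a common approximation for equal-sized clusters, 1 + (m - 1) x ICC, where m is the cluster size and ICC is the intraclass correlation. That formula is an assumption added here, not something stated in the presentation.

```python
def effective_sample_size(n_actual, design_effect):
    """Effective sample size = actual sample size / design effect."""
    if design_effect <= 0:
        raise ValueError("design effect must be positive")
    return n_actual / design_effect


def approximate_design_effect(cluster_size, icc):
    """Common approximation for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def required_sample_size(n_if_independent, design_effect):
    """Inflate a sample size that was computed assuming independence."""
    return n_if_independent * design_effect


# Hypothetical numbers, not taken from the slides.
deff = approximate_design_effect(cluster_size=11, icc=0.05)
print(f"design effect: {deff:.2f}")
print(f"effective n for 300 people: {effective_sample_size(300, deff):.0f}")
print(f"people needed to match 200 independent: {required_sample_size(200, deff):.0f}")
```

The two directions mirror the two uses on the slide: deflating an existing sample to its effective size, and inflating a planned sample so that it still delivers the intended power.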

Slide 9

Questions—Pair up!

The slide features a square containing 4 large circles, as described on Slide 6 above.

Multi-stage sample design:
- City blocks N = 3.
- Persons N = 26.
Design effect = 2.

What is the effective sample size?
What sample size would you use in your power calculations?

Slide 10

2) Clustering introduced naturally

The slide features a square containing 4 large circles, as described on Slide 6 above, except that the circles are now labeled "Hospital 1, 2, 3, 4" instead of "Block."

Analyze costs of care for hospitalized patients.
Patients in one hospital are more alike than patients in another hospital.
Nesting or clustering of data:
- Patients within hospitals.

Slide 11

Effect of natural clusters on errors

The slide features a square containing 4 large circles, as described on Slide 10 above.

Same effect on errors:
- Obs not independent.
- Each person => less info.
- Each person worth < "1."
Simple linear regression won't work!

Slide 12

What do we do?

First question—do we care?
- Is clustering a nuisance?
  or
- Is clustering an interesting phenomenon?
Leads to different analytic strategies.

Slide 13

If clustering is a nuisance

Example—Multi-stage sampling:
- Don't care how people vary within city blocks versus between city blocks.
- Artificially imposed by the sampling design.
- Not interested in measuring it.
- Just want to correct for it.
Use analytic strategies that correct for clustering.

Slide 14

How to correct errors for clustering

Robust estimates of variance:
- Stata ", robust cluster (____)".
- SAS empirical estimates of variance.
Programs that account for complex survey design (weights, strata, clusters):
- Stata "svy" commands.
- SAS "survey___" commands.
Other strategies.
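
To show what a robust (cluster-adjusted) variance estimate does, here is a hand-rolled Python sketch for a regression with one predictor. It is not the Stata or SAS implementation: it omits the finite-sample adjustments that packages typically apply, and the function name, the simulated data (30 clusters of 10, a predictor measured at the cluster level) and the seed are all assumptions made for illustration.

```python
import random


def slope_with_standard_errors(x, y, cluster_ids):
    """Fit y = a + b*x by OLS; return (slope, naive SE, cluster-robust SE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x_dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in x_dev)
    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Naive SE: assumes independent errors with constant variance.
    naive_se = (sum(e * e for e in resid) / (n - 2) / sxx) ** 0.5

    # Cluster-robust SE: add up (x deviation * residual) within each cluster
    # first, then square the cluster totals, so that correlation between
    # people in the same cluster is retained instead of assumed away.
    totals = {}
    for d, e, g in zip(x_dev, resid, cluster_ids):
        totals[g] = totals.get(g, 0.0) + d * e
    robust_se = sum(t * t for t in totals.values()) ** 0.5 / sxx
    return slope, naive_se, robust_se


# Simulated clustered data (hypothetical settings).
rng = random.Random(14)
x, y, cluster_ids = [], [], []
for g in range(30):
    x_g = rng.gauss(0, 1)  # predictor is the same for everyone in the cluster
    u_g = rng.gauss(0, 1)  # unmeasured cluster-level shift
    for _ in range(10):
        x.append(x_g)
        y.append(1.0 + 0.5 * x_g + u_g + rng.gauss(0, 1))
        cluster_ids.append(g)

slope, naive_se, robust_se = slope_with_standard_errors(x, y, cluster_ids)
print(f"slope = {slope:.3f}, naive SE = {naive_se:.3f}, robust SE = {robust_se:.3f}")
```

The coefficient is the same either way; only the standard error changes. This kind of correction relies on having a reasonable number of clusters, so it would not be meaningful on a toy data set with two or three of them.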

Slide 15

If clustering is interesting

Example—examine costs for hospitalized patients.
Split out the variation in costs:
- How much variation due to differences in patients?
- How much variation due to differences in hospitals?
Examine factors that explain variation in costs:
- Characteristics of patients.
- Characteristics of hospitals.
Analytic strategy = Multi-level modeling!
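
Before any modeling, "splitting out the variation" can be illustrated descriptively. The Python sketch below uses the five-patient example data that appears later in the presentation (two patients in Hospital 1, three in Hospital 2) and divides the total sum of squares in cost into a within-hospital part and a between-hospital part. It is a simple decomposition for intuition, not a multilevel model, and five patients are far too few to draw conclusions from.

```python
# Costs (thousands of $) grouped by hospital, from the example data in the slides.
costs_by_hospital = {1: [3, 6], 2: [10, 7, 5]}

all_costs = [c for costs in costs_by_hospital.values() for c in costs]
grand_mean = sum(all_costs) / len(all_costs)

total_ss = sum((c - grand_mean) ** 2 for c in all_costs)

within_ss = 0.0   # patients around their own hospital's mean
between_ss = 0.0  # hospital means around the grand mean
for costs in costs_by_hospital.values():
    hospital_mean = sum(costs) / len(costs)
    within_ss += sum((c - hospital_mean) ** 2 for c in costs)
    between_ss += len(costs) * (hospital_mean - grand_mean) ** 2

print(f"total = {total_ss:.2f}, within = {within_ss:.2f}, between = {between_ss:.2f}")
print(f"share between hospitals = {between_ss / total_ss:.2f}")
```

The within and between parts add up to the total, and in this toy example a little over a third of the variation lies between hospitals.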

Slide 16

Questions

1. Identify 3 patient characteristics that might explain variation in costs.

2. Identify 3 hospital characteristics that might explain variation in costs.

3. Do you think more of the variation in costs is explained by the patient or the hospital?

Slide 17

Multi-Level Models

(Hierarchical linear models)

(Random effects models)

Slide 18

The concept of "levels"

The slide features a square containing 4 large circles, as described on Slide 10 above.

Our example—2 levels:
- Micro = patients (N=26).
  - Micro-level = "units."
- Macro = hospitals (N=3).
  - Macro-level = "groups."
At each level:
- Patient characteristics.
- Hospital characteristics.

Slide 19

Data Structure—Patient

Patient-level data ( = "unit-level data"):

Patient ID	Hospital ID	Age (X)	Cost (Y)
1	1	60	3
2	1	75	6
3	2	81	10
4	2	70	7
5	2	65	5

Y represents a patient characteristic:
- Cost (thousands of $).
X represents a patient characteristic:
- Age.
- Note—understand process at each step.
- "Older patients are sicker and tend to cost more."

Slide 20

Simple Linear Regression

y_i = a + bx_i + e_i

i indexes patients (i=1 to N).
Relates x to y.
Both variables are patient characteristics.
Remember the assumptions.

Slide 21

Questions

Patient ID	Hospital ID	Age (X)	Cost (Y)
1	1	60	3
2	1	75	6
3	2	81	10
4	2	70	7
5	2	65	5

cost_i = a + b(age_i) + e_i

Is there a problem with this model when applied to these data?
If so, what?

Slide 22

The Problem

Does not account for the clustering of patients within hospitals:
- Data have a structure that is not represented.
- e_i—Assumption of independence is not met.
Do we care?
- If clustering is nuisance => Stata robust option.
- If clustering is interesting => Multilevel model.
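
One way to see the unrepresented structure is to fit the simple regression of cost on age and then average the residuals by hospital. The Python sketch below does this for the five example patients; the variable names are hypothetical. If the errors were unrelated to hospital, the hospital averages would be expected to sit near zero.

```python
# (patient ID, hospital ID, age, cost in thousands of $), from the slides.
patients = [
    (1, 1, 60, 3),
    (2, 1, 75, 6),
    (3, 2, 81, 10),
    (4, 2, 70, 7),
    (5, 2, 65, 5),
]

ages = [p[2] for p in patients]
costs = [p[3] for p in patients]
n = len(patients)
age_bar = sum(ages) / n
cost_bar = sum(costs) / n

# Simple linear regression that ignores hospital: cost = a + b * age + e.
b_hat = sum((x - age_bar) * (y - cost_bar) for x, y in zip(ages, costs)) / sum(
    (x - age_bar) ** 2 for x in ages
)
a_hat = cost_bar - b_hat * age_bar

# Collect the residuals separately for each hospital.
residuals_by_hospital = {}
for _, hospital, age, cost in patients:
    residuals_by_hospital.setdefault(hospital, []).append(cost - (a_hat + b_hat * age))

mean_residual = {h: sum(r) / len(r) for h, r in residuals_by_hospital.items()}
for hospital, value in sorted(mean_residual.items()):
    print(f"hospital {hospital}: mean residual = {value:+.2f}")
```

Here both Hospital 1 patients fall below the fitted line and all three Hospital 2 patients fall above it, so the residuals share a hospital-level component that the model has no term for.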

Slide 23

Data Structure—Hospital

Hospital-level data ( = "group-level data"):

Hospital ID	Beds (W)
1	10
2	65

W represents a hospital characteristic:
- # of beds in the hospital.
Bigger hospitals are more expensive:
- More technology.
- More high-cost specialists.
- "A built bed is a filled bed."

Slide 24

Combined Data Structure

Patient-level data:

Patient ID	Hospital ID	Age (X)	Cost (Y)
1	1	60	3
2	1	75	6
3	2	81	10
4	2	70	7
5	2	65	5

Hospital-level data:

Hospital ID	Beds (W)
1	10
2	65

= ?

Slide 25

Combined Data Structure

Patient- and hospital-level data:

Patient ID	Hospital ID	Age (X)	Cost (Y)	Beds (W)
1	1	60	3	10
2	1	75	6	10
3	2	81	10	65
4	2	70	7	65
5	2	65	5	65

Age (X) and Cost (Y):
- Variation between patients.
Beds (W):
- Only variation between hospitals.
- No variation within hospitals.
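
The combined structure amounts to copying each hospital's value of W onto every patient from that hospital. A minimal Python sketch of that merge, using the example data and hypothetical field names, is:

```python
# Patient-level data from the slides.
patients = [
    {"patient_id": 1, "hospital_id": 1, "age": 60, "cost": 3},
    {"patient_id": 2, "hospital_id": 1, "age": 75, "cost": 6},
    {"patient_id": 3, "hospital_id": 2, "age": 81, "cost": 10},
    {"patient_id": 4, "hospital_id": 2, "age": 70, "cost": 7},
    {"patient_id": 5, "hospital_id": 2, "age": 65, "cost": 5},
]

# Hospital-level data from the slides: hospital ID -> number of beds.
beds_by_hospital = {1: 10, 2: 65}

# Copy the hospital's bed count onto every patient row from that hospital.
combined = [dict(p, beds=beds_by_hospital[p["hospital_id"]]) for p in patients]

# Within a hospital, beds takes a single value while age still varies.
for hospital in sorted(beds_by_hospital):
    rows = [r for r in combined if r["hospital_id"] == hospital]
    distinct_beds = sorted({r["beds"] for r in rows})
    row_ages = sorted(r["age"] for r in rows)
    print(f"hospital {hospital}: beds values = {distinct_beds}, ages = {row_ages}")
```

The merged table has one row per patient, so a hospital with more patients contributes its bed count more often even though it is still only one hospital.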

Slide 26

WARNING—Equations coming up!

Remember—In multi-level modeling...

SUBSCRIPTS ARE YOUR FRIENDS!

Slide 27

Simple Linear Regression (one approach to modeling this data structure)

y_ij = a + bx_ij + dw_j + e_ij

j indexes hospitals (j=1 to N).
i indexes patients within hospitals (i=1 to n_j).

cost_ij = a + b(age_ij) + d(beds_j) + e_ij

Frequently used.

Slide 28

Questions

Patient ID	Hospital ID	Age (X)	Cost (Y)	Beds (W)
1	1	60	3	10
2	1	75	6	10
3	2	81	10	65
4	2	70	7	65
5	2	65	5	65

cost_ij = a + b(age_ij) + d(beds_j) + e_ij

Is there a problem with this model when applied to these data?
If so, what?

Slide 29

The Problem, Part 2

You must assume that all of the data structure is represented by the explanatory variables.
Unlikely this will account for the clustering of patients within hospitals.
- Assumes that all clustering within hospitals is explained by the number of beds in the hospital (W).
- If "beds" does not explain all clustering, then assumption of independence is not met for e_ij.

Slide 30

How do we represent the clustering?

Let the regression coefficients vary from group to group:
y_ij = a_j + b_jx_ij + dw_j + e_ij
Groups j can have higher or lower values of a_j and b_j.
Why not create d_j?

Slide 31

Starting simple—random intercept

Model the clustering between groups:
- Let the intercept only (a_j) vary from group to group.
- Take out all group-level variables (W):
  y_ij = a_j + bx_ij + e_ij
Groups j—higher or lower values of a_j only.
Assumes some groups tend to have, on average, higher or lower values of Y.
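
It can help to see data generated from this model. The Python sketch below draws one intercept a_j per group and then builds each observation as y_ij = a_j + b x_ij + e_ij. The function name, the parameter values, the range of x and the normal distributions are all assumptions chosen for illustration.

```python
import random


def simulate_random_intercepts(rng, group_sizes, mean_intercept, slope,
                               sd_between, sd_within):
    """Generate rows from y_ij = a_j + slope * x_ij + e_ij."""
    rows = []
    for j, n_j in enumerate(group_sizes, start=1):
        # One intercept per group: groups sit higher or lower on average.
        a_j = rng.gauss(mean_intercept, sd_between)
        for i in range(1, n_j + 1):
            x_ij = rng.uniform(60, 85)      # unit-level predictor
            e_ij = rng.gauss(0, sd_within)  # unit-level error
            rows.append({"j": j, "i": i, "a_j": a_j, "x": x_ij, "e": e_ij,
                         "y": a_j + slope * x_ij + e_ij})
    return rows


rng = random.Random(31)  # fixed seed so the run is repeatable
rows = simulate_random_intercepts(rng, group_sizes=[2, 3], mean_intercept=-12.0,
                                  slope=0.25, sd_between=1.5, sd_within=1.0)

for r in rows:
    print(f"group {r['j']} unit {r['i']}: a_j = {r['a_j']:.2f}, "
          f"x = {r['x']:.1f}, y = {r['y']:.2f}")
```

Every unit in a group shares the same a_j while the slope is common to all groups, which is exactly what "let the intercept only vary" means.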

Slide 32

Question

y_ij = a_j + bx_ij + e_ij

Why take the group-level variable (W) out of this model?
Must W be taken out of the model?

Slide 33

How do we want to model variation between groups?

W—a "partial" way to model variation between groups:
- If included, it will pick up part of the variation between groups.
- "Part of the variation in costs between hospitals will be explained by the number of beds in the hospital."
Goal of a random intercept model:
- Model the actual structure of the data.
- Let groups vary, on average, in Y.
- "Let the hospitals vary, on average, in cost."

Slide 34

How do we actually do it?

y_ij = a_j + bx_ij + e_ij

Split a_j into (a₀ + u_j):
y_ij = a₀ + u_j + bx_ij + e_ij
a₀ = average intercept (constant).
u_j = deviation from the average intercept for group j
= conditional on X, individuals in group j have Y values that are u_j higher than in the average group.
"Conditional on patient age, patients in Hospital j have costs that are u_j higher than the average costs for all patients."

Slide 35

What do we do with u_j?

Part 1—Fixed effects

Are groups j regarded as unique?
- Do you want to draw conclusions about each group?
TREAT AS "FIXED EFFECTS"
Create J - 1 indicator variables (0/1), where J is the number of groups.
Leads to J - 1 regression parameters.

Slide 36

Questions

Patient ID	Hospital ID	Age (X)	Cost (Y)
1	1	60	3
2	1	75	6
3	2	81	10
4	2	70	7
5	2	65	5

cost_ij = a₀ + b(age_ij) + u_j + e_ij

For our data, what does this equation look like if u_j is modeled as a fixed effect?
Are all indicator variables in a model also fixed effects?

Slide 37

Modeling u_j as a fixed effect

(u_j = "differences between hospitals")

cost_ij = a₀ + b(age_ij) + c(hosp2_ij) + e_ij

hosp2 = 0/1:
- 1 = patient i in hospital 2, 0 = patient i in hospital 1
Do we need index j? No - why?
cost_i = a₀ + b(age_i) + c(hosp2_i) + e_i
What assumptions does this model make?
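
As a sketch of the fixed-effect version, the Python below builds the hosp2 indicator for the five example patients and estimates a₀, b and c by least squares. The small solver and the function names are hypothetical helpers written for this illustration; in practice any regression routine would do the same job.

```python
def solve_linear_system(a, b):
    """Solve a small system a * coef = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(design_rows, y):
    """Least squares via the normal equations (X'X) coef = X'y."""
    k = len(design_rows[0])
    xtx = [[sum(r[i] * r[j] for r in design_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design_rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# (hospital ID, age, cost in thousands of $), from the slides.
patients = [(1, 60, 3), (1, 75, 6), (2, 81, 10), (2, 70, 7), (2, 65, 5)]

# Columns: constant, age, hosp2 indicator (1 = hospital 2, 0 = hospital 1).
design = [[1.0, float(age), 1.0 if hospital == 2 else 0.0] for hospital, age, _ in patients]
cost = [float(c) for _, _, c in patients]

a0, b, c = ols(design, cost)
print(f"a0 = {a0:.2f}, b = {b:.2f}, c = {c:.2f}")
```

The coefficient c is the estimated difference in cost between Hospital 2 and Hospital 1 for patients of the same age. With only two hospitals there is one indicator; each additional hospital would add one more column and one more parameter.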

Slide 38

What do we do with u_j?

Part 2—Random effects

Three issues:
- Are groups regarded as a sample from a population?
- Do you want to test the effect of group level variables (remember W = # beds)?
- Do you have small group sizes (2-50 or 100)?
TREAT AS "RANDOM EFFECTS"

Model u_j explicitly.
Additional assumption that u_j is i.i.d.
- Groups (hospitals) considered exchangeable.
Can include group-level explanatory variables (W).

Slide 39

Questions

Patient ID	Hospital ID	Age (X)	Cost (Y)	Beds (W)
1	1	60	3	10
2	1	75	6	10
3	2	81	10	65
4	2	70	7	65
5	2	65	5	65

y_ij = a₀ + b(x_ij) + u_j + e_ij

For our data, what does this equation look like if u_j is modeled as a random effect?
How would we include our hospital-level explanatory variable?

Slide 40

Modeling u_j as a random effect

(u_j = "differences between hospitals")

cost_ij = a₀ + b(age_ij) + u_j + e_ij

u_j = deviation from the average cost for hospital j
= estimated using HLM, SAS, Stata (get a number!)
cost_ij = a₀ + b(age_ij) + d(beds_j) + u_j + e_ij
Uses the number of beds in the hospital to explain some of the variation in u_j.
Last question—what happens to u_j if the number of beds explains all of the differences between hospitals?
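
To give a feel for what "get a number" involves, the Python sketch below works through a deliberately simplified case: a random-intercept model for cost with no explanatory variables (age and beds are left out), estimated by a one-way analysis-of-variance method rather than the likelihood-based methods that multilevel software typically uses. The grand mean is used as a simple stand-in for the average cost. All of these simplifications are assumptions made here, and two hospitals are far too few for real estimation.

```python
# Costs (thousands of $) grouped by hospital, from the example data in the slides.
costs_by_hospital = {1: [3, 6], 2: [10, 7, 5]}

sizes = {h: len(c) for h, c in costs_by_hospital.items()}
means = {h: sum(c) / len(c) for h, c in costs_by_hospital.items()}
n_total = sum(sizes.values())
n_groups = len(sizes)
grand_mean = sum(sum(c) for c in costs_by_hospital.values()) / n_total

within_ss = sum((y - means[h]) ** 2 for h, c in costs_by_hospital.items() for y in c)
between_ss = sum(sizes[h] * (means[h] - grand_mean) ** 2 for h in sizes)

ms_within = within_ss / (n_total - n_groups)
ms_between = between_ss / (n_groups - 1)
# Average group size, adjusted for unequal group sizes.
n0 = (n_total - sum(n * n for n in sizes.values()) / n_total) / (n_groups - 1)

var_e = ms_within                                # patient-level variance
var_u = max(0.0, (ms_between - ms_within) / n0)  # hospital-level variance
icc = var_u / (var_u + var_e)                    # share of variance between hospitals

# Predicted u_j: each hospital's raw deviation, pulled toward zero.
# Hospitals with fewer patients are pulled more strongly.
predicted_u = {}
for h in sizes:
    weight = sizes[h] * var_u / (sizes[h] * var_u + var_e)
    predicted_u[h] = weight * (means[h] - grand_mean)

print(f"var_u = {var_u:.2f}, var_e = {var_e:.2f}, ICC = {icc:.2f}")
for h in sorted(predicted_u):
    print(f"hospital {h}: raw deviation = {means[h] - grand_mean:+.2f}, "
          f"predicted u_j = {predicted_u[h]:+.2f}")
```

Unlike the fixed-effect indicator, the predicted u_j is not simply the hospital's observed difference from the average. It is shrunk toward zero according to how much of the total variance is attributed to hospitals and how many patients the hospital contributes.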

Slide 41

What we did and didn't do today

We discussed:

Clustering (artificial and natural).
Accounting for clustering:
- Nuisance = robust estimates of variance.
- Interesting = multilevel models.
Representing clustering in simple model:
- Fixed effects.
- Random effects with group-level explanatory variables.

We didn't discuss:

Random coefficients other than the intercept.
Interaction terms (cross-level effects).
Many other things.

Slide 42

Follow-up

maureensmith@wisc.edu

Current as of September 2004

Internet Citation:

Multi-level Analysis I: Recognizing the Problem. Text Version of a Slide Presentation at a National Research Service Award (NRSA) Trainees Research Conference. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/fund/training/smithtxt.htm
