
Multi-level Analysis I

Recognizing the Problem


NRSA Trainees Research Conference Slide Presentation (Text Version)

By Maureen Smith, M.D., Ph.D.


On June 5, 2004, Maureen Smith, M.D., Ph.D., made a presentation at the 10th Annual National Research Service Award (NRSA) Trainees Research Conference. This is the text version of her slide presentation.


Slide 1

Multi-level Analysis I: Recognizing the Problem

Maureen Smith, M.D., Ph.D.
Dept. of Population Health Sciences
University of Wisconsin-Madison

Slide 2

A day in the life of a researcher:

  • We have data:
    • ID (observation #).
    • X (variable 1).
    • Y (variable 2).
  • We want to use the value of X to explain the value of Y.

ID   X    Y
1    60   3
2    75   6
3    81   10
4    70   7
5    65   5

Slide 3

Welcome to the fantasy world of linear regression

  • A simple model:

    yi = intercept + slope(xi) + error
    i indicates observations (1... N)
    (a graph illustrates)

  • Assumptions:
    • Linearity.
    • Independence.
    • Normality.
    • Constant variance.
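
As a rough illustration of the simple model above, the following sketch fits an ordinary least squares line to the five observations from Slide 2. The function name fit_simple_regression is an assumption made for this example and is not part of the presentation.

```python
# Slide 2 data: X and Y for the five observations.
x = [60, 75, 81, 70, 65]
y = [3, 6, 10, 7, 5]


def fit_simple_regression(x, y):
    """Return (intercept, slope) from ordinary least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x and cross-products of x and y around their means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_simple_regression(x, y)

# The "error" term in the model is estimated by the residuals.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

With only five points the fit is purely illustrative. The four assumptions listed on this slide all concern the residuals computed on the last lines.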

Slide 4

Reality check

How often are observations truly independent from one another?

The slide features a square containing orange or green dots.

  • Dot indicates geographic location of teenager.
  • Orange or green indicates hair color.

Do these teenagers look independent?

Slide 5

1) Clustering introduced in sampling

The slide features a square containing 4 large circles. The circle labeled "Block 1" contains a number of orange dots. The circle labeled "Block 2" contains yellow dots. The circle labeled "Block 3" contains green dots. The circle labeled "Block 4" contains text that reads, "Not all blocks are selected".

  • Multistage sampling:
    • Circles represent city blocks.
    • Blocks randomly sampled.
    • All persons in block surveyed to determine attitudes.
  • Persons in one block are more like their neighbors than persons who live in another block.
  • Nesting or clustering of data:
    • Persons within blocks.

Slide 6

Effect of sample design on errors

The slide features a square containing 4 large circles, as described on Slide 5 above, except that the circle labeled "Block 4" is now empty.

  • Errors in linear regression:
    • Assume independence.
    • Each person => info.
    • Each person worth "1."
  • If clustering occurs:
    • Obs not independent.
    • Each person => less info.
    • Each person worth < "1."

Slide 7

Simple linear regression won't work!

  • Violates assumption of independence.
  • If you don't account for it:
    • Standard errors are too small.
    • Makes coefficients look more significant.
    • "You think there is more information in the data than actually exists."
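
The claim that standard errors come out too small can be checked with a small simulation. The sketch below is an assumption-laden toy, not the presenter's analysis: the number of blocks, the block size, and both standard deviations are hypothetical values chosen to make the clustering strong.

```python
import random
import statistics

random.seed(0)

N_CLUSTERS = 50      # hypothetical number of city blocks
CLUSTER_SIZE = 20    # hypothetical number of persons surveyed per block

clusters = []
for _ in range(N_CLUSTERS):
    block_effect = random.gauss(0, 2)  # shared by everyone in the block
    clusters.append(
        [block_effect + random.gauss(0, 1) for _ in range(CLUSTER_SIZE)]
    )

all_values = [value for block in clusters for value in block]

# Naive standard error of the mean: treats all 1,000 persons as independent.
naive_se = statistics.stdev(all_values) / len(all_values) ** 0.5

# Cluster-aware standard error: treats the 50 block means as the independent
# units (reasonable here because every block has the same size).
block_means = [statistics.mean(block) for block in clusters]
cluster_se = statistics.stdev(block_means) / N_CLUSTERS ** 0.5

print(f"naive SE   = {naive_se:.3f}")
print(f"cluster SE = {cluster_se:.3f}")
print(f"approximate design effect = {(cluster_se / naive_se) ** 2:.1f}")
```

The naive standard error is the smaller of the two, which is the sense in which ignoring clustering makes you think there is more information in the data than actually exists.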

Slide 8

How much information is lost?

"Design Effect"

If designing a study using multistage sampling, need to increase sample size to account for loss of information.

  • Design effect:
    • Each observation is "worth less."
    • Need to estimate your "effective" sample size.
    • Used for sample size calculations in multi-stage sampling.
Neffective = N / (Design effect)
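
The relationship on this slide (effective sample size equals the number of observations divided by the design effect) can be sketched in a few lines. The function names and the numbers 300, 200, and 1.5 are hypothetical and were chosen only for this illustration.

```python
import math


def effective_sample_size(n_observations, design_effect):
    """Number of independent observations the clustered sample is 'worth'."""
    return n_observations / design_effect


def required_sample_size(n_needed_if_independent, design_effect):
    """Inflate a simple-random-sample size to allow for the design effect."""
    return math.ceil(n_needed_if_independent * design_effect)


# Hypothetical study: 300 persons sampled, design effect of 1.5.
print(effective_sample_size(300, 1.5))

# Hypothetical power calculation: 200 independent observations are needed.
print(required_sample_size(200, 1.5))
```

The second function runs the same relationship in reverse, which is how the design effect enters sample size planning for a multistage design.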

Slide 9

Questions—Pair up!

The slide features a square containing 4 large circles, as described on Slide 6 above.

  • Multi-stage sample design:
    • City blocks N = 3.
    • Persons N = 26.
  • Design effect = 2.
  1. What is the effective sample size?
  2. What sample size would you use in your power calculations?

Slide 10

2) Clustering introduced naturally

The slide features a square containing 4 large circles, as described on Slide 6 above, except that the circles are now labeled "Hospital 1, 2, 3, 4" instead of "Block."

  • Analyze costs of care for hospitalized patients.
  • Patients in one hospital are more alike than patients in another hospital.
  • Nesting or clustering of data:
    • Patients within hospitals.

Slide 11

Effect of natural clusters on errors

The slide features a square containing 4 large circles, as described on Slide 10 above.

  • Same effect on errors:
    • Obs not independent.
    • Each person => less info.
    • Each person worth < "1."
  • Simple linear regression won't work!

Slide 12

What do we do?

  • First question—do we care?
    • Is clustering a nuisance?
      or
    • Is clustering an interesting phenomenon?
  • Leads to different analytic strategies.

Slide 13

If clustering is a nuisance

  • Example—Multi-stage sampling:
    • Don't care how people vary within city blocks versus between city blocks.
    • Artificially imposed by the sampling design.
    • Not interested in measuring it.
    • Just want to correct for it.
  • Use analytic strategies that correct for clustering.

Slide 14

How to correct errors for clustering

  • Robust estimates of variance:
    • Stata ", robust cluster (____)".
    • SAS empirical estimates of variance.
  • Programs that account for complex survey design (weights, strata, clusters):
    • Stata "svy" commands.
    • SAS "survey___" commands.
  • Other strategies.
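
To give a sense of what a robust, cluster-adjusted variance estimate computes, here is a minimal sketch of the "sandwich" standard error for the slope of a one-predictor regression. It is not the Stata or SAS implementation: the function name is hypothetical, and no finite-sample correction is applied, so packaged routines may report somewhat different values.

```python
from collections import defaultdict


def cluster_robust_slope_se(x, y, cluster_ids):
    """Return (slope, cluster-robust SE of the slope) for y = a + b*x + e.

    Residual contributions are summed within each cluster before being
    squared, so correlated errors inside a cluster are not treated as
    independent pieces of information.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Sum (x - x_bar) * residual within each cluster.
    cluster_scores = defaultdict(float)
    for xi, yi, g in zip(x, y, cluster_ids):
        residual = yi - (intercept + slope * xi)
        cluster_scores[g] += (xi - x_bar) * residual

    variance = sum(s ** 2 for s in cluster_scores.values()) / sxx ** 2
    return slope, variance ** 0.5


# Toy data: the five-patient table used elsewhere in this presentation.
age = [60, 75, 81, 70, 65]
cost = [3, 6, 10, 7, 5]
hospital = [1, 1, 2, 2, 2]

slope, robust_se = cluster_robust_slope_se(age, cost, hospital)
print(f"slope = {slope:.3f}, cluster-robust SE = {robust_se:.3f}")
```

Two clusters are far too few for this estimator to be trustworthy, so the printed value only demonstrates the mechanics.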

Slide 15

If clustering is interesting

  • Example—examine costs for hospitalized patients.
  • Split out the variation in costs:
    • How much variation due to differences in patients?
    • How much variation due to differences in hospitals?
  • Examine factors that explain variation in costs:
    • Characteristics of patients.
    • Characteristics of hospitals.
  • Analytic strategy = Multi-level modeling!
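
A crude way to see what "splitting out the variation" means is to divide the total sum of squares in cost into a between-hospital part and a within-hospital part. The sketch below uses the five-patient example table from this presentation. It is a descriptive decomposition only, not the variance components a multi-level model would estimate.

```python
from collections import defaultdict

# Hospital ID and cost (thousands of $) for the five example patients.
hospital = [1, 1, 2, 2, 2]
cost = [3, 6, 10, 7, 5]

by_hospital = defaultdict(list)
for h, c in zip(hospital, cost):
    by_hospital[h].append(c)

grand_mean = sum(cost) / len(cost)

# Total variation of patient costs around the overall mean.
total_ss = sum((c - grand_mean) ** 2 for c in cost)

# Variation of patients around their own hospital's mean.
within_ss = sum(
    (c - sum(costs) / len(costs)) ** 2
    for costs in by_hospital.values()
    for c in costs
)

# Variation of hospital means around the overall mean, weighted by size.
between_ss = sum(
    len(costs) * (sum(costs) / len(costs) - grand_mean) ** 2
    for costs in by_hospital.values()
)

print(f"total   = {total_ss:.2f}")
print(f"within  = {within_ss:.2f}")
print(f"between = {between_ss:.2f}")
```

The within and between parts add up to the total, so each can be read as a share of the overall variation in cost.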

Slide 16

Questions

1. Identify 3 patient characteristics that might explain variation in costs.

2. Identify 3 hospital characteristics that might explain variation in costs.

3. Do you think more of the variation in costs is explained by the patient or the hospital?

Slide 17

Multi-Level Models

(Hierarchical linear models)

(Random effects models)

Slide 18

The concept of "levels"

The slide features a square containing 4 large circles, as described on Slide 10 above.

  • Our example—2 levels:
    • Micro = patients (N=26).
      • Micro-level = "units."
    • Macro = hospitals (N=3).
      • Macro-level = "groups."
  • At each level:
    • Patient characteristics.
    • Hospital characteristics.

Slide 19

Data Structure—Patient

Patient-level data ( = "unit-level data"):

Patient ID   Hospital ID   Age (X)   Cost (Y)
1            1             60        3
2            1             75        6
3            2             81        10
4            2             70        7
5            2             65        5

  • Y represents a patient characteristic:
    • Cost (thousands of $).
  • X represents a patient characteristic:
    • Age.
    • Note—understand process at each step.
    • "Older patients are sicker and tend to cost more."

Slide 20

Simple Linear Regression

yi = a + bxi + ei

  • i indexes patients (i=1 to N).
  • Relates x to y.
  • Both variables are patient characteristics.
  • Remember the assumptions.

Slide 21

Questions

Patient ID   Hospital ID   Age (X)   Cost (Y)
1            1             60        3
2            1             75        6
3            2             81        10
4            2             70        7
5            2             65        5

costi = a + b(agei) + ei

  1. Is there a problem with this model when applied to these data?
  2. If so, what?

Slide 22

The Problem

  • Does not account for the clustering of patients within hospitals:
    • Data have a structure that is not represented.
    • ei—Assumption of independence is not met.
  • Do we care?
    • If clustering is nuisance => Stata robust option.
    • If clustering is interesting => Multilevel model.
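
One informal way to see the unrepresented structure is to fit costi = a + b(agei) + ei to the five example patients and then average the residuals by hospital. This is a sketch for illustration only, and five observations cannot establish anything.

```python
from collections import defaultdict

# Patient-level table from the slides.
hospital = [1, 1, 2, 2, 2]
age = [60, 75, 81, 70, 65]
cost = [3, 6, 10, 7, 5]

# Ordinary least squares for cost_i = a + b * age_i + e_i.
n = len(age)
age_bar = sum(age) / n
cost_bar = sum(cost) / n
b = sum((x - age_bar) * (y - cost_bar) for x, y in zip(age, cost)) / sum(
    (x - age_bar) ** 2 for x in age
)
a = cost_bar - b * age_bar

# Group the residuals e_i by hospital.
residuals_by_hospital = defaultdict(list)
for h, x, y in zip(hospital, age, cost):
    residuals_by_hospital[h].append(y - (a + b * x))

mean_residual = {h: sum(r) / len(r) for h, r in residuals_by_hospital.items()}
for h, m in sorted(mean_residual.items()):
    print(f"hospital {h}: mean residual = {m:+.3f}")
```

In this toy table the residuals for Hospital 1 average below zero and those for Hospital 2 average above zero, which is the pattern that correlated errors within hospitals would produce.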

Slide 23

Data Structure—Hospital

Hospital-level data ( = "group-level data"):

Hospital ID   Beds (W)
1             10
2             65

  • W represents a hospital characteristic:
    • # of beds in the hospital.
  • Bigger hospitals are more expensive:
    • More technology.
    • More high-cost specialists.
    • "A built bed is a filled bed."

Slide 24

Combined Data Structure

Patient-level data:

Patient ID   Hospital ID   Age (X)   Cost (Y)
1            1             60        3
2            1             75        6
3            2             81        10
4            2             70        7
5            2             65        5

+

Hospital-level data:

Hospital ID   Beds (W)
1             10
2             65

= ?

Slide 25

Combined Data Structure

Patient- and hospital-level data:

Patient ID   Hospital ID   Age (X)   Cost (Y)   Beds (W)
1            1             60        3          10
2            1             75        6          10
3            2             81        10         65
4            2             70        7          65
5            2             65        5          65

  • Age (X) and Cost (Y):
    • Variation between patients.
  • Beds (W):
    • Only variation between hospitals.
    • No variation within hospitals.
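
The combined structure can be built by attaching each hospital's row to every patient in that hospital. The sketch below does this with plain Python dictionaries. The key names (patient_id, hospital_id, and so on) are assumptions made for the example.

```python
# Patient-level ("unit-level") data.
patients = [
    {"patient_id": 1, "hospital_id": 1, "age": 60, "cost": 3},
    {"patient_id": 2, "hospital_id": 1, "age": 75, "cost": 6},
    {"patient_id": 3, "hospital_id": 2, "age": 81, "cost": 10},
    {"patient_id": 4, "hospital_id": 2, "age": 70, "cost": 7},
    {"patient_id": 5, "hospital_id": 2, "age": 65, "cost": 5},
]

# Hospital-level ("group-level") data, keyed by hospital ID.
hospitals = {
    1: {"beds": 10},
    2: {"beds": 65},
}

# Copy the hospital characteristics onto each patient row.
combined = [{**p, **hospitals[p["hospital_id"]]} for p in patients]

for row in combined:
    print(row)
```

Every patient in the same hospital receives the same value of beds, which is why W varies between hospitals but not within them.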

Slide 26

WARNING—Equations coming up!

Remember—In multi-level modeling...

SUBSCRIPTS ARE YOUR FRIENDS!

Slide 27

Simple Linear Regression (one approach to modeling this data structure)

yij = a + bxij + dwj + eij

  • j indexes hospitals (j=1 to N).
  • i indexes patients within hospitals (i=1 to nj).

costij = a + b(ageij) + d(bedsj) + eij

  • Frequently used.

Slide 28

Questions

Patient ID   Hospital ID   Age (X)   Cost (Y)   Beds (W)
1            1             60        3          10
2            1             75        6          10
3            2             81        10         65
4            2             70        7          65
5            2             65        5          65

costij = a + b(ageij) + d(bedsj) + eij

  1. Is there a problem with this model when applied to these data?
  2. If so, what?

Slide 29

The Problem, Part 2

  • You must assume that all of the data structure is represented by the explanatory variables.
  • Unlikely this will account for the clustering of patients within hospitals.
    • Assumes that all clustering within hospitals is explained by the number of beds in the hospital (W).
    • If "beds" does not explain all clustering, then assumption of independence is not met for eij.

Slide 30

How do we represent the clustering?

  • Let the regression coefficients vary from group to group:

    yij = aj + bjxij + dwj + eij

  • Groups j can have higher or lower values of aj and bj.
  • Why not create dj?

Slide 31

Starting simple—random intercept

  • Model the clustering between groups:
    • Let the intercept only (aj) vary from group to group.
    • Take out all group-level variables (W):

      yij = aj + bxij + eij

  • Groups j—higher or lower values of aj only.
  • Assumes some groups tend to have, on average, higher or lower values of Y.

Slide 32

Question

yij = aj + bxij + eij

  1. Why take the group-level variable (W) out of this model?
  2. Must W be taken out of the model?

Slide 33

How do we want to model variation between groups?

  • W—a "partial" way to model variation between groups:
    • If included, it will pick up part of the variation between groups.
    • "Part of the variation in costs between hospitals will be explained by the number of beds in the hospital."
  • Goal of a random intercept model:
    • Model the actual structure of the data.
    • Let groups vary, on average, in Y.
    • "Let the hospitals vary, on average, in cost."

Slide 34

How do we actually do it?

yij = aj + bxij + eij

  • Split aj into (a0 + uj):

    yij = a0 + uj + bxij + eij

  • a0 = average intercept (constant).

  • uj = deviation from the average intercept for group j

    = conditional on X, individuals in group j have Y values that are uj higher than in the average group.

  • "Conditional on patient age, patients in Hospital j have costs that are uj higher than the average costs for all patients."

Slide 35

What do we do with uj?

Part 1—Fixed effects

  • Are groups j regarded as unique?
    • Do you want to draw conclusions about each group?

    TREAT AS "FIXED EFFECTS"

  • Create J - 1 indicator variables (0/1), where J is the number of groups.
  • Leads to J - 1 regression parameters.

Slide 36

Questions

Patient ID   Hospital ID   Age (X)   Cost (Y)
1            1             60        3
2            1             75        6
3            2             81        10
4            2             70        7
5            2             65        5

costij = a0 + b(ageij) + uj + eij

  1. For our data, what does this equation look like if uj is modeled as a fixed effect?
  2. Are all indicator variables in a model also fixed effects?

Slide 37

Modeling uj as a fixed effect

(uj = "differences between hospitals")

costij = a0 + b(ageij) + c(hosp2ij) + eij

  • hosp2 = 0/1:
    • 1 = patient i in hospital 2, 0 = patient i in hospital 1
  • Do we need index j? No - why?

    costi = a0 + b(agei) + c(hosp2i) + ei

  • What assumptions does this model make?
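
The fixed-effect version of the model can be fitted to the five example patients by building the hosp2 indicator and solving the least squares normal equations. This is a minimal sketch: the helper solve_linear_system is a hypothetical name written for this example, and a statistics package would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Patient-level table from the slides.
hospital = [1, 1, 2, 2, 2]
age = [60, 75, 81, 70, 65]
cost = [3, 6, 10, 7, 5]

# Design matrix columns: constant, age, hosp2 (1 = hospital 2, 0 = hospital 1).
rows = [[1.0, float(x), 1.0 if h == 2 else 0.0] for x, h in zip(age, hospital)]

# Normal equations (X'X) beta = X'y.
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, cost)) for i in range(3)]

a0, b, c = solve_linear_system(xtx, xty)
print(f"a0 = {a0:.3f}, b = {b:.3f}, c = {c:.3f}")
```

Here c is the estimated difference in cost between Hospital 2 and Hospital 1 at a given age. With J hospitals the same code would need J - 1 indicator columns.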

Slide 38

What do we do with uj?

Part 2—Random effects

  • Three issues:
    • Are groups regarded as sample from pop.?
    • Do you want to test the effect of group level variables (remember W = # beds)?
    • Do you have small group sizes (2-50 or 100)?

    TREAT AS "RANDOM EFFECTS"

  • Model uj explicitly.
  • Additional assumption that uj is i.i.d.
    • Groups (hospitals) considered exchangeable.
  • Can include group-level explanatory variables (W).
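
To make the random-effects assumptions concrete, the sketch below simulates data from a random intercept model rather than estimating one. Estimation is left to software such as HLM, SAS, or Stata. Every parameter value, the number of hospitals, and the number of patients per hospital are hypothetical, and drawing uj from a normal distribution is one common way to satisfy the i.i.d. assumption.

```python
import random

random.seed(1)

A0 = 2.0     # hypothetical average intercept
B = 0.1      # hypothetical age slope
SD_U = 1.5   # hypothetical standard deviation of hospital effects u_j
SD_E = 1.0   # hypothetical standard deviation of patient errors e_ij

rows = []
for j in range(1, 4):                # three hypothetical hospitals
    u_j = random.gauss(0, SD_U)      # drawn once per hospital
    for i in range(1, 5):            # four hypothetical patients each
        age = random.randint(55, 85)
        e_ij = random.gauss(0, SD_E)  # drawn once per patient
        cost = A0 + B * age + u_j + e_ij
        rows.append(
            {"hospital": j, "patient": i, "age": age, "u_j": u_j, "cost": cost}
        )

for row in rows:
    print(
        f"hospital {row['hospital']}  patient {row['patient']}  "
        f"age {row['age']}  u_j {row['u_j']:+.2f}  cost {row['cost']:.2f}"
    )
```

All patients in a hospital share one draw of uj, which is what makes their costs correlated. Because the hospitals' uj values come from a single distribution, the hospitals are treated as exchangeable.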

Slide 39

Questions

Patient ID   Hospital ID   Age (X)   Cost (Y)   Beds (W)
1            1             60        3          10
2            1             75        6          10
3            2             81        10         65
4            2             70        7          65
5            2             65        5          65

yij = a0 + b(xij) + uj + eij

  1. For our data, what does this equation look like if uj is modeled as a random effect?
  2. How would we include our hospital-level explanatory variable?

Slide 40

Modeling uj as a random effect

(uj = "differences between hospitals")

costij = a0 + b(ageij) + uj + eij

  • uj = deviation from the average cost for hospital j
    = estimated using HLM, SAS, Stata (get a number!)

    costij = a0 + b(ageij) + d(bedsj) + uj + eij

  • Uses the number of beds in the hospital to explain some of the variation in uj.
  • Last question—what happens to uj if the number of beds explains all of the differences between hospitals?

Slide 41

What we did and didn't do today

We discussed:

  • Clustering (artificial and natural).
  • Accounting for clustering:
    • Nuisance = robust estimates of variance.
    • Interesting = multilevel models.
  • Representing clustering in simple model:
    • Fixed effects.
    • Random effects with group-level explanatory variables.

We didn't discuss:

  • Random coefficients other than the intercept.
  • Interaction terms (cross-level effects).
  • Many other things.

Slide 42

Follow-up

maureensmith@wisc.edu

Current as of September 2004


Internet Citation:

Multi-level Analysis I: Recognizing the Problem. Text Version of a Slide Presentation at a National Research Service Award (NRSA) Trainees Research Conference. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/fund/training/smithtxt.htm


 
