Bureau of Transportation Statistics (BTS)

Survey Methodology
Appendix B. Using SUDAAN and Other Software for the Analysis of the 2002 National Transportation Availability and Use Survey

Variance estimation procedures have been developed to account for complex sample designs. Using these procedures, factors such as the selection of the sample, the use of differential sampling rates to subsample a subpopulation, and nonresponse adjustments can be appropriately reflected in estimates of sampling error. The two main methods for estimating variances from a complex survey are Taylor series variance estimation (linear approximation) and replication, which includes the jackknife and balanced repeated replication (BRR) methods. Wolter (1985) is a useful reference on the theory and applications of these methods. Shao (1996) is a more recent review paper that compares them.

Standard statistical software packages that assume simple random sampling do not properly compute variance estimates from weighted data collected under any other design. Using the variable RAKEDW00 as the final full sample weight in a standard statistical program yields accurate point estimates; however, the variance estimates the program reports will not be accurate.

To overcome this limitation, this document gives guidance for analyzing the survey data using the software package SUDAAN® (Software for the Statistical Analysis of Correlated Data) based on the Taylor series and replication methods (Research Triangle Institute, 1997). SUDAAN is a statistical package developed by Research Triangle Institute (RTI) to analyze data from complex sample surveys. SUDAAN computes the standard errors of the estimates taking the survey design into account. While later versions of SUDAAN (version 8 or later) can use replication methods, it is most often used for computing variances based on the first-order Taylor series approximation, also known as linearization. Though this section only provides details on the use of SUDAAN, the software packages STATA and WesVar can also be used for linear approximation and replication methods, respectively.

Although SUDAAN's estimates of variance based on linearization take into account the sample design of the survey, they do not properly reflect the variance reduction due to raking and poststratification. The weights in this survey were raked to control totals in the final step of the weighting process, so replication methods are more appropriate for computing estimates of variance. However, the magnitude of the reduction will depend on the type of estimate (e.g., total or proportion) and the correlation between the variable being analyzed and the dimensions used in raking.

Analysis of the Survey Data Using SUDAAN

This section describes how to use SUDAAN with both Taylor series and replication methods to analyze the survey data and compute appropriate standard errors, and it shows which options to use. The data file contains 5,019 records, one for every completed extended interview.

I. Using Taylor Series Linear Approximation (SUDAAN and STATA)

Required Variables

The variables that provide information about the sample design in SUDAAN are:

Variable TSVUNIT (Taylor’s series variance unit). The variable TSVUNIT indicates the primary sampling unit (PSU) to be used for computing the estimates of variance using the Taylor series method. In the survey, the PSU corresponds to the household.

Variable RAKEDW00 (final full sample weight). The variable RAKEDW00 contains the final weight for the full sample. This weight is positive for all the records.

SUDAAN Keywords

The statements and keywords needed to run SUDAAN to compute variance estimates based on the Taylor Series approximation are:

DESIGN=WR (required). The sample was drawn without replacement; however, the WR (with replacement) design option is used because the finite population correction factor (fpc) is negligible. (Note: STRWR is not used because this requires that each record be a PSU, which is not the case because two persons could be sampled from the same household.)

NEST TSVUNIT /PSULEV = 1 (required). The keyword NEST lists the variables whose values identify the sampling stages. The option /PSULEV = 1 tells SUDAAN that the PSU-level variable, TSVUNIT, is in position 1 of the NEST statement.

WEIGHT RAKEDW00 (required). The keyword WEIGHT lists the final weight to be used in the analysis. In this case, the variable for the weight is the final full sample weight RAKEDW00.

The variable TSVSTR in combination with the variable TSVUNIT can also be used to compute the standard errors with the appropriate changes in the NEST statement. The variable TSVSTR indicates the sampling stratum. In the survey, TSVSTR is set to 1 for all the records. An example of the use of this variable is also included in the following section.
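To make the with-replacement (WR) calculation concrete, the following Python sketch computes a weighted total and its standard error for a single stratum, treating each household as a PSU. It uses the standard between-PSU variance estimator for a total without a finite population correction; it is an illustration under those assumptions, not SUDAAN's implementation, and it does not cover the linearization SUDAAN applies to percentages and means. The function name and the toy records are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def wr_total_and_se(records):
    """Weighted total and its with-replacement (WR) standard error.

    records: iterable of (psu_id, weight, y) tuples from a single stratum.
    """
    # Sum the weighted values within each PSU (household).
    psu_totals = defaultdict(float)
    for psu_id, weight, y in records:
        psu_totals[psu_id] += weight * y

    n = len(psu_totals)                # number of PSUs
    total = sum(psu_totals.values())   # weighted total estimate
    mean_psu_total = total / n

    # Between-PSU variance of the estimated total, no fpc (WR assumption).
    sum_sq = sum((z - mean_psu_total) ** 2 for z in psu_totals.values())
    variance = n / (n - 1) * sum_sq
    return total, sqrt(variance)


# Hypothetical records: (TSVUNIT, RAKEDW00, indicator for a category of interest).
toy_records = [
    (1, 100.0, 1),
    (1, 100.0, 0),  # second person sampled from household 1
    (2, 150.0, 1),
    (3, 120.0, 0),
    (4, 130.0, 1),
]
total, se = wr_total_and_se(toy_records)
print(f"estimated total = {total:.1f}, SE = {se:.3f}")
```

Because the two persons in household 1 are summed before the variance is computed, the household, not the person, contributes one term to the variance, which is the role TSVUNIT plays in the NEST statement.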

SUDAAN is not the only statistical software that can be used to generate approximate standard errors using linear approximation.  The statistical software STATA can be used as well.  The variables TSVUNIT and TSVSTR can be used as the nesting variables and RAKEDW00 as the full sample weight in STATA to correctly generate both point estimates and standard errors.

II. Using Jackknife Replication Methods (SUDAAN and WesVar)

The additional statements and keywords needed to run SUDAAN to compute estimates of variance based on replication methods are:

DESIGN= JACKKNIFE (required). The survey data file includes replicate weights that can be used in SUDAAN. The replication method used to create the weights is a form of the jackknife method. If estimates of variance based on replication methods are computed, the option JACKKNIFE should be used in the design statement.

JACKWGTS RAKEDW01 - RAKEDW80 / ADJJACK=1 (required). The keyword JACKWGTS is followed by the list of variable names for the 80 replicate weights created for the survey (RAKEDW01-RAKEDW80). When computing variances, replicate-based estimates need to be adjusted by a constant value c that depends on the replication method used. For the replicates in this survey, the value of c is 1, and SUDAAN applies this adjustment through the option ADJJACK=1.
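The role of the constant c can be sketched in a few lines of Python. The sketch assumes the usual replicate variance form, in which the variance is c times the sum of squared differences between each replicate estimate and the full-sample estimate, with c = 1 as stated above. The data, the two replicate weights, and the function names are hypothetical; the survey file carries 80 replicate weights.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Weighted mean of values under one set of weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def replicate_variance(full_estimate, replicate_estimates, c=1.0):
    """c times the sum of squared deviations of the replicate estimates
    from the full-sample estimate."""
    return c * sum((r - full_estimate) ** 2 for r in replicate_estimates)


# Hypothetical data: four respondents, a full-sample weight, and two
# replicate weights (the survey file has 80, RAKEDW01-RAKEDW80).
ages = [30.0, 40.0, 50.0, 20.0]
full_weight = [10.0, 10.0, 10.0, 10.0]
replicate_weights = [
    [20.0, 0.0, 10.0, 10.0],
    [0.0, 20.0, 10.0, 10.0],
]

# Re-estimate the statistic once per replicate weight.
full_mean = weighted_mean(ages, full_weight)
replicate_means = [weighted_mean(ages, w) for w in replicate_weights]

variance = replicate_variance(full_mean, replicate_means, c=1.0)
print(f"mean = {full_mean:.3f}, SE = {sqrt(variance):.3f}")
```

The same pattern applies to any statistic: the estimate is recomputed once per replicate weight, so adjustments built into the replicate weights, such as raking, carry through to the variance.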

WesVar can also be used to generate point estimates and appropriate standard errors using replication methods. This dataset contains 80 replicates (RAKEDW01-RAKEDW80) for the full sample weight RAKEDW00. These replicates should be included in the file when creating the WesVar dataset, and JK2 should be selected as the jackknife method. The ID variable on this file is PERSID.

Estimates Using SUDAAN based on the Taylor Series approximation

Listing 1 shows an example of running SUDAAN’s PROC CROSSTAB to compute totals, percentages, and standard errors for the variable GENDER based on the Taylor Series approximation. The procedure CROSSTAB produces weighted frequencies and percentage distributions for categorical variables. The following statements were used to produce the output in Listing 1.

proc crosstab data = btsall design=WR ;
  weight RAKEDW00 ;
  NEST TSVUNIT /PSULEV=1 ;
  subgroup gender ;
  levels 2;
  setenv colwidth = 17 decwidth= 3 ;
run ;

The following statements also produce the same output as Listing 1. The difference is the use of the variable TSVSTR in the NEST statement.

proc crosstab data = btsall design=WR ;
  weight RAKEDW00 ;
  NEST TSVSTR TSVUNIT;
  subgroup gender ;
  levels 2;
  setenv colwidth = 17 decwidth= 3 ;
run ;

Listing 1.

Sample PROC CROSSTAB Output of Marginal Totals, Percentages, and Standard Errors*

Date: 12-12-2002    Research Triangle Institute    Page  : 1
Time: 11:31:59      The CROSSTAB Procedure         Table : 1

Variance Estimation Method: Taylor Series (WR)
by: WHAT IS YOUR/SUBJECT'S GENDER.

                  WHAT IS YOUR/SUBJECT'S GENDER
                  Total          1              2
Sample Size       5011.000       2322.000       2689.000
Weighted Size     273335024.970  133394837.990  139940186.980
SE Weighted       3826319.579    3328823.884    3319195.188
Row Percent       100.000        48.803         51.197
Col Percent       100.000        48.803         51.197
Tot Percent       100.000        48.803         51.197
SE Row Percent    0.000          0.995          0.995
SE Col Percent    0.000          0.995          0.995
SE Tot Percent    0.000          0.995          0.995

*The standard errors of both the estimated totals and percentages in Listing 1 are much larger than standard errors that take raking into account. This is because the effect of raking cannot be accounted for in PROC CROSSTAB when using Taylor series linearization.

Listing 2 shows an example of running SUDAAN’s PROC DESCRIPT to compute means and standard errors for the variable AGE based on the Taylor Series approximation. The procedure DESCRIPT produces weighted totals and means and their standard errors for continuous variables. The following statements were used to produce the output in Listing 2.

PROC DESCRIPT DATA = btsall design = WR ;
  WEIGHT RAKEDW00 ;
  NEST TSVUNIT /PSULEV=1 ;
  VAR AGE ;
  setenv colwidth = 17 decwidth= 3 ;
  print / style = nchs ;
run ;     

Listing 2.

Sample PROC DESCRIPT Output of Means and Standard Errors

S U D A A N
Software for the Statistical Analysis of Correlated Data Copyright
Research Triangle Institute
July 2001
Release 8.0.0 

Date: 12-12-2002    Research Triangle Institute    Page  : 1
Time: 11:32:24      The DESCRIPT Procedure         Table : 1

Variance Estimation Method: Taylor Series (WR)
by: Variable, One.

Variable         One  Sample Size  Weighted Size  Total           Mean    SE Mean
AGE AT SCREENER  1    4952.000     269936641.060  9544546622.010  35.358  0.423

Estimates Using SUDAAN based on replication

Listing 3 shows an example of running SUDAAN’s PROC CROSSTAB to compute totals, percentages, and standard errors for the variable GENDER based on replication. The standard errors are smaller than those in Listing 1 because replication methods can reflect the reduction in variance caused by raking. The survey weights were raked to five dimensions in the last step of weighting. For GENDER, the standard errors are much smaller (in particular for totals) because GENDER was used to create one of the raking dimensions. The following statements were used to produce the output in Listing 3.

proc crosstab data = btsall design=JACKKNIFE;
  weight RAKEDW00 ;
  JACKWGTS RAKEDW01-RAKEDW80 /ADJJACK=1;
  subgroup gender ;
  levels 2;
  setenv colwidth = 17 decwidth= 3 ;
run ;

Listing 3.

Sample PROC CROSSTAB Output of Marginal Totals, Percentages, and Standard Errors

S U D A A N
Software for the Statistical Analysis of Correlated Data Copyright
Research Triangle Institute
July 2001
Release 8.0.0

Number of observations read: 5019 Weighted count :273643273

Denominator degrees of freedom : 80                                 

Date: 01-08-2003    Research Triangle Institute
Time: 13:00:12         The CROSSTAB Procedure

Variance Estimation Method: Replicate Weight Jackknife
by: WHAT IS YOUR/SUBJECT'S GENDER.

                  WHAT IS YOUR/SUBJECT'S GENDER
                  Total          1              2
Sample Size       5011.000       2322.000       2689.000
Weighted Size     273335024.970  133394837.990  139940186.980
SE Weighted       129773.082     83463.088      95960.303
Row Percent       100.000        48.803         51.197
Col Percent       100.000        48.803         51.197
Tot Percent       100.000        48.803         51.197
SE Row Percent    0.000          0.023          0.023
SE Col Percent    0.000          0.023          0.023
SE Tot Percent    0.000          0.023          0.023
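The size of the reduction can be checked directly from the printed figures. The short Python sketch below copies the weighted sizes and standard errors for GENDER from Listings 1 and 3, recomputes the percentages from the weighted sizes, and reports how many times larger each Taylor series standard error is than its jackknife counterpart. The dictionary names are illustrative only, and the category labels 1 and 2 are kept as they appear in the output.

```python
# Figures copied from Listings 1 and 3 (variable GENDER).
weighted_size = {"Total": 273335024.970, "1": 133394837.990, "2": 139940186.980}
se_taylor = {"Total": 3826319.579, "1": 3328823.884, "2": 3319195.188}
se_jackknife = {"Total": 129773.082, "1": 83463.088, "2": 95960.303}

# Percentage of the weighted total in each column.
percent = {
    key: 100.0 * size / weighted_size["Total"]
    for key, size in weighted_size.items()
}

# How many times larger the Taylor series SE is than the jackknife SE.
se_ratio = {key: se_taylor[key] / se_jackknife[key] for key in weighted_size}

for key in weighted_size:
    print(f"{key:>5}: percent = {percent[key]:7.3f}, "
          f"Taylor SE / jackknife SE = {se_ratio[key]:5.1f}")
```

The same comparison for a variable that was not used as a raking dimension would be expected to show a smaller ratio, consistent with the earlier remark that the reduction depends on the correlation between the analysis variable and the raking dimensions.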

Listing 4 shows an example of running SUDAAN’s PROC DESCRIPT to compute means and standard errors for the variable AGE based on replication. The following statements were used to produce the output in Listing 4.

PROC DESCRIPT DATA = btsall design = JACKKNIFE  ;
  WEIGHT RAKEDW00 ;
  JACKWGTS RAKEDW01-RAKEDW80 /ADJJACK=1;
  VAR AGE ;
  setenv colwidth = 17 decwidth= 3 ;
  print / style = nchs ;
run ;

Listing 4.

Sample PROC DESCRIPT Output of Means and Standard Errors

Date: 01-08-2003    Research Triangle Institute    Page  : 1
Time: 13:26:21      The DESCRIPT Procedure         Table : 1

Variance Estimation Method: Replicate Weight Jackknife
by: Variable, One.

Variable         One  Sample Size  Weighted Size  Total           Mean    SE Mean
AGE AT SCREENER  1    4952.000     269936641.060  9544546622.010  35.358  0.081
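As a rough illustration of what the two standard errors imply for inference, the Python sketch below recomputes the mean age from the printed total and weighted size and forms approximate 95 percent confidence intervals from the standard errors in Listings 2 and 4. The critical value 1.96 is an assumption (a normal approximation); SUDAAN may instead use a t distribution with the denominator degrees of freedom it reports. The function and variable names are hypothetical.

```python
# Figures copied from Listings 2 and 4 (AGE AT SCREENER).
weighted_size = 269936641.060
weighted_total = 9544546622.010
se_taylor = 0.423
se_jackknife = 0.081

# The weighted mean is the weighted total divided by the weighted size.
mean_age = weighted_total / weighted_size

Z_95 = 1.96  # assumption: normal critical value for a 95 percent interval


def confidence_interval(estimate, se, z=Z_95):
    """Symmetric interval: estimate plus or minus z standard errors."""
    half_width = z * se
    return estimate - half_width, estimate + half_width


ci_taylor = confidence_interval(mean_age, se_taylor)
ci_jackknife = confidence_interval(mean_age, se_jackknife)

print(f"mean age = {mean_age:.3f}")
print(f"Taylor series CI: {ci_taylor[0]:.3f} to {ci_taylor[1]:.3f}")
print(f"Jackknife CI:     {ci_jackknife[0]:.3f} to {ci_jackknife[1]:.3f}")
```

The point estimate is identical under both methods; only the width of the interval changes.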

References

Shao, J. (1996). Resampling methods in sample surveys (with discussion). Statistics, 27, 203-254.

Wolter, K. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.

Research Triangle Institute. (1997). SUDAAN® user’s manual (Release 7.5). Research Triangle Park: Author.


