National Science Foundation Division of Science Resources Statistics

Statistical Issues

 

Sample Design

Both SESTAT and CPS are based on sample surveys that use complex probability sample designs and as such are subject to various limitations. As described in the section "SESTAT Coverage," the main limitation of the SESTAT design is that over time it excludes an increasingly large part of the target S&E population. In addition, SESTAT currently excludes individuals who do not have a bachelor's or higher degree. CPS, on the other hand, by definition excludes individuals in the military (a group covered in SESTAT). Because CPS is based on an area probability sample, it is also subject to undercoverage of certain subgroups of the civilian population (see the section "CPS Coverage"). Thus, an important distinction between the two designs concerns coverage. As long as these differences are recognized, results from CPS and SESTAT can be analyzed and compared despite the differences in sample design. Additional factors that bear on the ability to make comparisons across the studies include nonresponse and imputation, weighting and estimation, and sampling errors; each is discussed below.


Nonresponse and Imputation

Nonresponse (both unit and item nonresponse) is a concern in both SESTAT and CPS because it can introduce biases of unknown magnitude in the survey estimates. Although weighting adjustments are made in both SESTAT and CPS to compensate for nonresponse, it is unlikely that the biases are completely eliminated. However, even if the characteristics of the nonrespondents are different from those of the respondents, the effect of nonresponse on the survey estimates will be minimal if nonresponse rates are relatively low and are not highly variable among demographic groups. In other words, the higher the nonresponse rate, the greater the potential for serious biases in the survey estimates.
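The dependence of bias on both the nonresponse rate and the respondent/nonrespondent difference can be made concrete with the standard decomposition of the bias of an unadjusted respondent mean. This is a hedged illustration of the general principle, not a formula taken from this report; the numeric inputs are hypothetical.

```python
def nonresponse_bias(resp_rate, resp_mean, nonresp_mean):
    """Deterministic-view bias of the unadjusted respondent mean:
    bias = (nonrespondent share) * (respondent mean - nonrespondent mean)."""
    return (1.0 - resp_rate) * (resp_mean - nonresp_mean)

# Same respondent/nonrespondent gap, two hypothetical response rates:
# a higher response rate sharply limits the potential bias.
print(nonresponse_bias(0.90, 55.0, 50.0))  # 0.5 (approximately)
print(nonresponse_bias(0.60, 55.0, 50.0))  # 2.0 (approximately)
```

The example shows why a low and demographically uniform nonresponse rate keeps the effect on survey estimates minimal even when respondents and nonrespondents differ.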

Table 14 shows the unweighted unit response rate for SESTAT components in 1993, 1995, and 1997. For the NSCG, the response rates in 1995 and 1997 were conditional on prior respondent status in 1993. The 1993 NSCG response rate was 80%; only respondents of the 1993 NSCG were eligible for subsequent cycles. That is, there was no follow-up of nonrespondents from one cycle to the next. The conditional NSCG response rates for 1995 and 1997 were 95% and 94%, respectively. The unconditional response rate for the 1997 NSCG (i.e., the cumulative response rate for all three cycles) was approximately 71%. The response rates for the NSRCG and the SDR are unconditional response rates computed independently at each cycle. The unconditional response rates for the 1997 NSRCG and SDR were 82% and 84%, respectively.
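The unconditional rate cited above is simply the product of the per-cycle rates, which can be verified directly:

```python
# Cumulative (unconditional) response rate for a longitudinal survey is the
# product of the per-cycle conditional rates, using the NSCG figures cited above.
cycle_rates = [0.80, 0.95, 0.94]  # 1993, 1995, 1997 NSCG

unconditional = 1.0
for r in cycle_rates:
    unconditional *= r

print(round(unconditional, 3))  # 0.714, i.e., approximately 71%
```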


TABLE 14.

The CPS unit response rates are generally higher than the SESTAT response rates, ranging from 91% to 94% per month (e.g., see U.S. Census Bureau 2000 and the CPS website[10]). The lower response rates in SESTAT raise the concern that the potential for bias resulting from unit nonresponse is greater for SESTAT than for CPS. However, SESTAT has fairly complete and rich data on demographics and degrees that are used for nonresponse adjustment to attenuate nonresponse biases. Because CPS nonresponse and poststratification adjustments are made without regard to degree status, the effect of the CPS adjustments for subsets of individuals with degrees is unknown.

Item nonresponse can also have adverse effects on survey data. The extent of item nonresponse is relatively minor for both SESTAT and CPS. In CPS, item nonresponse is generally low for demographic and labor force items (about 1% or less). In SESTAT, only those questionnaires that provide complete data for all "critical" items relating to degrees and occupation were considered to be completed questionnaires (i.e., respondents). Thus, by definition, there was no item nonresponse among respondents for the critical items. Any nonresponse in SESTAT is included in the unit response rates discussed above. For the noncritical items, item nonresponse rates are generally low. For example, the item nonresponse rates for variables included in this evaluation were approximately 1% or less. Both SESTAT and CPS use hotdeck methods to impute missing data items. SESTAT uses hotdeck imputation after some logical edit imputation is completed.
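A minimal sketch of sequential hot-deck imputation conveys the idea of donating a reported value from a similar case within an imputation class. This is an illustrative simplification, not the SESTAT or CPS production algorithm, and the imputation classes and records are hypothetical.

```python
from collections import defaultdict

def hotdeck_impute(records, class_key, item):
    """Minimal sequential hot-deck: within each imputation class, replace a
    missing item with the most recently seen reported value (the 'donor')."""
    last_donor = defaultdict(lambda: None)
    for rec in records:
        cls = class_key(rec)
        if rec.get(item) is None:
            if last_donor[cls] is not None:
                rec[item] = last_donor[cls]
                rec[item + "_imputed"] = True  # flag imputed values
        else:
            last_donor[cls] = rec[item]  # reported value becomes the new donor
    return records

people = [
    {"degree": "BS", "salary": 52000},
    {"degree": "BS", "salary": None},    # takes the previous BS donor
    {"degree": "PhD", "salary": 80000},
    {"degree": "PhD", "salary": None},   # takes the previous PhD donor
]
hotdeck_impute(people, lambda r: r["degree"], "salary")
print([r["salary"] for r in people])  # [52000, 52000, 80000, 80000]
```

Production systems add randomized donor selection and richer class definitions, and, as the text notes, SESTAT applies logical edits before hot-deck imputation.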


Weighting and Estimation

Both SESTAT and CPS require the use of weights to inflate the sample results to population levels. The purpose of the weights is to compensate for variable probabilities of selection, differential response rates, and undercoverage. All of the population estimates presented in this report are weighted estimates using person-level weights available in public-use files.

Although some aspects of weighting have varied from year to year, the main features of the weighting procedures used in SESTAT can be summarized as follows:

  • In the 1993 NSCG, nonresponse adjustment was incorporated in the poststratification adjustment used to adjust the weights to match the population counts of the 1990 census. In 1995 and subsequent years, there was a separate adjustment for nonresponse. In addition, in 1995 an adjustment was made to account for the cases subsampled out after the CATI phase.
  • In the 1993, 1995, and 1997 NSRCG, two separate adjustments were made. The first one was a poststratification adjustment that was applied to the (first-stage) institution weight. For this adjustment, a ratio was calculated using data from the Integrated Postsecondary Education Data System (IPEDS)[11] in each of the 12 ratio-adjustment strata based on degree level and major field. For each ratio, the numerator was the sum of the number of degrees awarded over all institutions in the universe (i.e., in IPEDS), and the denominator was the weighted sum of degrees awarded in the sampled responding institutions as reported in IPEDS, using the institution nonresponse-adjusted weight. The resulting (poststratified) institution weight was then used to develop an initial person-level weight, which was subsequently adjusted for survey nonresponse within designated weighting classes.
  • In the 1993, 1995, and 1997 SDR, the base weights were adjusted for nonresponse within specified weighting classes. A nonresponse adjustment factor was calculated for each sampling cell; it was equal to the ratio of sample cases in the sampling cell to the number of usable responses in the sampling cell. If a nonresponse adjustment factor exceeded a prespecified ratio, collapsing procedures were used, i.e., the cell was combined with other cells with similar characteristics on the variables used for stratification. If this failed to provide adequate safeguards on the range of weights, the nonresponse adjustment weight was constrained to equal the maximum allowable rate. There was no additional poststratification of the nonresponse-adjusted weights.
  • Each survey database (NSCG, NSRCG, and SDR) was designed to be combined with the other two surveys to capture the advantages of a larger sample size and greater coverage of the target population. However, combining the three databases meant that the issue of cross-survey multiplicity had to be addressed. That is, scientists and engineers in SESTAT could belong to the surveyed population of more than one component survey, depending on their degrees and when they were received. For example, someone with a bachelor's degree at the time of the 1990 census who went on to complete a master's degree in 1991 could be selected in the 1993 NSCG and the 1993 NSRCG. The following unique-linkage rule was devised to remove these multiple-selection opportunities: each member of SESTAT's target population is uniquely linked to one and only one component survey, and that individual is included in SESTAT only when he or she is selected for the linked survey. As a result, each person had only one chance of being selected into the combined SESTAT database. The priority for determining overlap used the hierarchy SDR, NSRCG, and NSCG. For individuals in SDR, their analysis weight was set equal to their SDR final weight. In NSRCG, the analysis weight was set to zero for individuals who had a chance of selection in SDR. In NSCG, the analysis weight was set to zero for individuals who had a chance of selection in either SDR or NSRCG. For the remaining individuals who did not have a chance of selection in the other components, their analysis weight was the same as their survey component final weight.
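The unique-linkage rule described in the last bullet can be sketched as a simple decision function. This is a hedged paraphrase of the hierarchy (SDR over NSRCG over NSCG) as stated in the text; the function name, arguments, and weight values are illustrative, not SESTAT production code.

```python
def analysis_weight(component, final_weight, eligible_sdr, eligible_nsrcg):
    """SESTAT unique-linkage rule: a person keeps a nonzero analysis weight
    only in the highest-priority component survey (SDR > NSRCG > NSCG)
    for which he or she had a chance of selection.

    component     -- survey the person was actually selected in
    final_weight  -- that survey's final weight for the person
    eligible_*    -- whether the person had a chance of selection there
    """
    if component == "SDR":
        return final_weight  # SDR cases always keep their SDR final weight
    if component == "NSRCG":
        return 0.0 if eligible_sdr else final_weight
    if component == "NSCG":
        return 0.0 if (eligible_sdr or eligible_nsrcg) else final_weight
    raise ValueError("unknown component: %s" % component)

# A 1991 master's recipient sampled in the NSCG but also eligible for the
# NSRCG gets a zero NSCG analysis weight; the NSRCG carries that population.
print(analysis_weight("NSCG", 310.5, False, True))  # 0.0
```

This guarantees each member of the target population has exactly one chance of entering the combined SESTAT database.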

For analyses of the CPS data, weights have been developed that reflect probabilities of selection and include both nonresponse adjustments and poststratification adjustments to current population counts. As described in detail in U.S. Census Bureau (2000), chapter 10, the weights derived for analysis of CPS data include the following components:

  • A base weight equal to the reciprocal of the probability of selecting a household for the sample. Individuals in a household use the same base weight. The base weight reflects adjustments for special sampling situations such as periodic sample reductions and in-field subsampling designed to control workload.
  • An adjustment for nonresponse. This adjustment is made at the household level within cells defined by geography and metropolitan status.
  • Ratio adjustment to known population distributions. This adjustment is done in two stages. In the first stage, weighted counts of the sampled noncertainty primary sampling units within each state are adjusted to agree with the corresponding statewide population totals by race. In the second stage, person-level weights are adjusted (poststratified) to independent population control totals by state; cross-classification by sex, age, and Hispanic origin; and cross-classification by sex, age, and race (black and other).
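The three CPS weight components listed above compose multiplicatively. The sketch below shows only the arithmetic of that composition; the actual adjustment factors come from Census cell definitions and independent population controls not reproduced here, and all numbers are hypothetical.

```python
def cps_person_weight(p_select, nr_factor, stage1_factor, stage2_factor):
    """Compose a CPS-style person weight:
    base weight (reciprocal of selection probability) * household-level
    nonresponse adjustment * first-stage ratio adjustment * second-stage
    (poststratification) adjustment."""
    base = 1.0 / p_select
    return base * nr_factor * stage1_factor * stage2_factor

# Household selected with probability 1/2000, modest adjustments at each step:
w = cps_person_weight(1 / 2000, 1.05, 1.01, 0.98)
print(round(w, 1))  # 2078.6
```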

In conclusion, CPS adjusts weights by geographic area and demographic characteristics such as sex, age, race, and ethnicity. SESTAT adjusts weights by degree level and field as well as demographic characteristics such as sex, race/ethnicity, disability status, and citizenship.


Sampling Errors

All of the estimates cited in this report are based on sample data and are thus subject to sampling errors. Both SESTAT and CPS publish generalized variance functions (GVFs) that can be used to estimate the standard error of an estimated total. These GVFs have been used to obtain the standard errors of the estimates presented in this report. For example, table 15 shows the standard errors and the coefficients of variation (CV) of estimates of individuals working in S&E occupations by highest degree attained for SESTAT and CPS. CV is the standard error divided by the estimated total, expressed as a percentage. As shown in the table, the standard errors for SESTAT estimates are considerably smaller than those for the corresponding CPS estimates. This reflects the relative sample sizes of the two studies (see table 16). Thus, although CPS can provide useful information about the S&E population, detailed analyses are severely limited by the comparatively large sampling errors. In particular, analysis by subgroups, such as detailed occupation (e.g., economists) or demographic groups (e.g., women and minorities), is limited, even if several months of CPS data are accumulated. For additional information about the standard errors of the SESTAT and CPS estimates and corresponding subgroup sample sizes, see appendix C.
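GVFs for estimated totals commonly take the form SE(x) = sqrt(a·x² + b·x), from which a CV follows directly. The parameters below are illustrative placeholders, not the published SESTAT or CPS GVF parameters.

```python
import math

def gvf_standard_error(x, a, b):
    """Generalized variance function of the common form SE(x) = sqrt(a*x^2 + b*x).
    The parameters a and b are illustrative, not published SESTAT/CPS values."""
    return math.sqrt(a * x * x + b * x)

def coefficient_of_variation(x, a, b):
    """CV: the standard error divided by the estimated total, as a percentage."""
    return 100.0 * gvf_standard_error(x, a, b) / x

est = 1_000_000  # hypothetical estimated total
se = gvf_standard_error(est, -0.00001, 2500)
cv = coefficient_of_variation(est, -0.00001, 2500)
print(round(se), round(cv, 1))  # roughly 49900 and a CV near 5%
```

With a fixed GVF, larger samples show up as smaller a and b parameters, which is why the SESTAT standard errors in table 15 are considerably smaller than the CPS ones.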


TABLE 15.

TABLE 16.




Footnotes

[10] http://www.bls.census.gov/cps/basic/perfmeas/typea.htm

[11] The National Center for Education Statistics has established IPEDS as its core postsecondary education data collection program. It is a single, comprehensive system that encompasses all identified institutions whose primary purpose is to provide postsecondary education. The IPEDS system is built around a series of interrelated surveys to collect institution-level data in areas such as enrollments, program completions, faculty, staff, and finances. The NSRCG poststratification adjustments used IPEDS data on number of bachelor's and master's degrees awarded by degree level and major field.


 
Comparison of the National Science Foundation's Scientists and Engineers Statistical Data System (SESTAT) with the Bureau of Labor Statistics' Current Population Survey (CPS)
Working Paper | SRS 07-205 | August 2007
National Science Foundation Division of Science Resources Statistics (SRS)
Last Updated: Jul 10, 2008