Characteristics of Doctoral Scientists and Engineers in the United States: 2003.

Appendix A: Technical Notes


The Survey of Doctorate Recipients is designed to complement two other surveys of scientists and engineers conducted by the National Science Foundation (NSF), Division of Science Resources Statistics. Together, these three surveys provide a comprehensive picture of the number and characteristics of individuals with training and/or employment in science and engineering in the United States. This combined system is known as the Scientists and Engineers Statistical Data System (SESTAT, http://sestat.nsf.gov). Additional education and demographic data in the SDR come from the Survey of Earned Doctorates (SED), an annual census of research doctorates earned in the United States. The annual SED data are accumulated to form the Doctorate Records File (DRF), a complete record of U.S. doctorate recipients since 1920.

Target Population and Sampling Frame

The 2003 SDR target population definition was the same as that of previous SDR cycles, except that a single day, 1 October, was used as the survey reference date instead of the week of 15 April.

The target population consisted of individuals who
  had earned a doctoral degree in a science, engineering, or health field from a U.S. institution;[1]
  were under age 76;
  were not permanently institutionalized; and
  were residing in the United States on the survey reference date, 1 October 2003.

To select a probability sample from this population, a sampling frame must be constructed. As in prior cycles, the 2003 SDR frame was constructed as two separate databases, the old cohort frame and the new cohort frame. The cohorts are defined by the year of receipt of the first U.S.-granted doctoral degree.[2] The old cohort frame represents individuals who received their science, engineering, or health doctorate before 1 July 2000, whereas the new cohort frame represents individuals who received their science, engineering, or health doctorate between 1 July 2000 and 30 June 2002.

The old cohort frame was constructed from the 2001 SDR sample by removing the ineligible cases: those who had reached age 76, those who were permanently institutionalized or deceased, and non-U.S. citizens found to have resided outside the United States for the two previous consecutive survey cycles. The new cohort frame was developed from the 2001 and 2002 SED. The total 2003 SDR sampling frame consisted of 89,139 cases, including 39,436 cases from the old cohort and 49,703 cases from the new cohort. Note that the old cohort frame represents a much larger population because the frame itself was developed from a weighted sample of doctorate recipients.

The approach to frame construction for the 2003 SDR departed significantly from the prior cycles in two respects. First, the eligibility rules for inclusion in the old cohort frame were revised to include U.S. citizens who had been living outside the United States for two or more consecutive prior cycles. In the past, if a doctorate recipient was a U.S. citizen and had been outside the U.S. for two consecutive survey cycles, the individual would be classified as "permanently ineligible" and excluded from the frame. NSF determined that this policy ran an unacceptable risk of excluding sampled individuals who lived abroad briefly but then returned to the United States. This change had the effect of restoring a total of 713 U.S. citizens who had been removed from the 1999 and 2001 SDR frames because they had been living outside the United States for two consecutive survey cycles. Second, the most recent information available for the old cohort portion of the frame, including SDR-derived data, was used to determine case eligibility and to update the sample stratification variables. Because analysts typically use survey variables rather than frame variables to define analysis domains, this frame-variable updating was expected to bring sampling strata into closer agreement with reporting domains and reduce the standard errors of estimates for these reporting domains.

Sample Design

The sampling frame was stratified using three variables: demographic group, degree field, and sex. The 2003 SDR sample of 40,000 cases was systematically selected from the 164 resulting strata. This stratified, systematic sample design was similar in principle to that used in previous surveys, but with sample stratification and allocation substantially modified.

The objective of the stratified sample design was to create strata that both conformed as closely as possible to the reporting domains used by analysts and had associated subpopulations large enough to be suitable for separate estimation and reporting. The revised demographic-group variable features 10 categories defined by race/ethnicity, disability status, and citizenship at birth. Frame cases were classified into these categories hierarchically to ensure higher selection probability for rarer population groups. In the past, a 15-category degree-field code frame (recode) was used to stratify all demographic groups, resulting in a large number of strata with very small populations. NSF decided that an alternative degree-field recode was needed to stratify the smaller demographic groups. In 2003 only the three largest demographic groups (U.S. white, non-U.S. white, and non-U.S. Asian) were stratified by the 15-category degree-field recode. All other demographic groups were stratified by a 7-category degree-field recode, except that American Indians and Native Hawaiians/other Pacific Islanders were stratified only by sex. Thus, the 2003 SDR sample design features a total of 164 strata defined by a revised demographic group variable, a degree-field variable of 7 or 15 categories, and sex.

The 2003 sample allocation also differed from that of previous cycles. The 2001 SDR allocation was based on a simplified alternative to optimal allocation, where precision constraints were set for domains of interest and the total sample was then optimally allocated to the strata and substrata based on a full cross of the stratification variables as well as cohorts. Under this strategy, the sample size allocated to the smallest strata tended to be too small to support separate analyses. The 2003 SDR sample allocation used the following strategy: (1) allocate a minimum sample size for the smallest strata through a supplemental stratum allocation; (2) allocate extra sample for specific demographic-by-sex domains through a supplemental domain allocation; and (3) allocate the remaining sample proportionately across all strata. The final sample allocation was therefore based on the sum of a proportional allocation across all strata, a domain-specific supplement allocated proportionately across strata in that domain, and a stratum-specific supplement added to obtain the minimum stratum size.
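The three-part allocation lends itself to a short illustration. The sketch below is a simplified Python rendering of the logic just described, not NORC's production code; the input dictionaries and the rounding rule are hypothetical.

```python
def allocate_sample(frame_sizes, total_n, domain_strata, domain_extra, min_size):
    """Sketch of a three-part allocation: a proportional allocation across
    all strata, a domain-specific supplement spread proportionately within
    each domain, and a stratum-specific top-up to the minimum stratum size.

    frame_sizes   -- {stratum: frame count}
    domain_strata -- {domain: [strata in that domain]}
    domain_extra  -- {domain: supplemental sample size for that domain}
    min_size      -- minimum sample size per stratum
    """
    total_frame = sum(frame_sizes.values())
    # Proportional allocation of the base sample across all strata.
    alloc = {s: total_n * c / total_frame for s, c in frame_sizes.items()}
    # Supplemental domain allocation, proportional within the domain.
    for d, extra in domain_extra.items():
        dom_frame = sum(frame_sizes[s] for s in domain_strata[d])
        for s in domain_strata[d]:
            alloc[s] += extra * frame_sizes[s] / dom_frame
    # Supplemental stratum allocation to reach the minimum stratum size.
    return {s: max(round(a), min_size) for s, a in alloc.items()}
```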

The 2003 SDR sample selection was carried out independently for each stratum and cohort substratum. For the old cohort strata, the past practice of selecting the sample with probability proportional to size continued, where the measure of size was the sampling weight associated with the previous survey cycle without any adjustments for nonresponse or undercoverage. For each stratum, the sampling algorithm started by identifying and removing self-representing cases through an iterative procedure. A case was self-representing if its selection probability was equal to or greater than unity based on its measure of size. Iteration ended when all self-representing cases had been identified and removed. Next, the nonself-representing cases within each stratum were sorted by citizenship, disability status, DRF degree field, and year of doctoral degree award. Finally, the balance of the sample (i.e., the total allocation minus the number of self-representing cases) was selected from each stratum systematically with probability proportional to size.
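As an illustration of this selection scheme, the sketch below implements iterative removal of self-representing cases followed by a systematic PPS pass. It assumes each case carries a measure of size equal to its prior-cycle sampling weight; the function and data layout are illustrative, not the survey's actual software.

```python
import random

def pps_systematic(cases, n):
    """Select n cases with probability proportional to size (PPS):
    iteratively remove self-representing cases (selection probability >= 1),
    then draw a systematic PPS sample of the balance.

    cases -- list of (case_id, size) pairs, already sorted by the stratum
             sort variables (citizenship, disability, field, degree year)
    """
    selected, remaining = [], list(cases)
    while remaining:
        k = n - len(selected)
        total = sum(size for _, size in remaining)
        certain = [c for c in remaining if c[1] * k >= total]
        if not certain or k <= 0:
            break
        selected += certain                      # self-representing cases
        remaining = [c for c in remaining if c not in certain]
    k = n - len(selected)                        # balance of the sample
    if k > 0:
        step = sum(size for _, size in remaining) / k
        point = random.uniform(0, step)          # random start
        cum = 0.0
        for case in remaining:
            cum += case[1]
            while point <= cum:                  # a selection point falls here
                selected.append(case)
                point += step
    return selected
```

After the self-representing cases are removed, every remaining measure of size is smaller than the sampling interval, so no case can be selected twice in the systematic pass.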

The new cohort sample was selected using exactly the same algorithm as was used to select the old cohort sample. However, because the sampling weight for every case in the new cohort frame was equal to 1, there were no self-representing cases. For the same reason, each stratum sample from the new cohort was actually a self-weighting sample.

The 2003 SDR sample of 40,000 consisted of 36,582 cases from the old cohort frame and 3,418 cases from the new cohort frame. The overall sampling rate was about 1 in 18 (5.5 percent). However, sampling rates varied considerably across the strata. Sampling rates for selected demographic groups in the 2003 SDR universe are in table A-1.

TABLE A-1.  Frame counts and sampling rates for 2003 Survey of Doctorate Recipients, by characteristics of doctorate recipient.

Survey Content

The 2003 SDR maintained the questionnaire design changes that were implemented in 1993 (for the survey questionnaire, see appendix D). The questionnaire comprises a large set of core data items that are retained in each survey round to enable trend comparisons, and several sets of module questions asked intermittently on special topics of interest. For example, the 1995 SDR questionnaire had a module on temporary postdoctoral appointments awarded primarily for gaining additional education and training in research, and the 1997 questionnaire had special modules on alternative work arrangements, job security concerns, and recent doctorate recipients' initial career experiences.

A special module on publication and patenting first introduced in 1995 and fielded in 2001 was fielded again in 2003 for activities during the past 2-year period. Questions added in 2001 on individual satisfaction and importance of various job attributes were retained in the 2003 SDR questionnaire. New questions, asked only of foreign-born citizens, were added to obtain data on immigrants. Additionally, a new question determining academic positions for those working at a postsecondary academic institution was added along with a question on overall job satisfaction.

Data Collection

The SDR was a paper-based, self-administered survey until the 1990s, when it became a mixed-mode survey. Since 1991 the data collection protocol has been to mail notification letters, paper questionnaires, and finally postcard reminders, followed by remailing materials to nonresponding sample members according to a set schedule, and then by contacting nonresponders by telephone. The telephone contact was used to prompt the return of the self-administered paper survey or to complete the survey by telephone interview.

With the 2003 SDR, the data collection protocol changed, and three main data collection modes were implemented: self-administered paper questionnaire (SAQ), computer-assisted telephone interview (CATI), and self-administered online questionnaire (Web).

Data collection began in October 2003, with sampled cases starting data collection concurrently in each of the three modes. The 2003 SDR was the first time the Web mode option was offered and the first time that CATI was used as a primary, initial data collection mode for some respondents. Although the project team and sponsors sought ways to improve the SDR, the highest priority was to maintain the high response rates and data quality obtained in prior rounds. To that end, using the CATI and Web as initial modes was introduced as a controlled experiment.

A control group of 29,923 cases received the paper questionnaire in the mail as their initial mode, 7,334 cases started in the CATI mode, and 2,743 cases started in the Web mode. Based on Dillman's Total Design Method (Dillman 1978), different data collection protocols were developed for each of the three different data collection approaches.

The data collection protocol for the SAQ group was as follows: sample members first received an advance notification letter from NSF to acquaint them with the survey. The first questionnaire mailing occurred a week later, followed by a thank you/reminder postcard the following week. Approximately eight weeks after the first questionnaire mailing, the sample members who had not returned a completed questionnaire were sent a second questionnaire by U.S. priority mail. Eight weeks later, any cases still not complete received a single telephone-call prompt to encourage completion of the SAQ. Telephone follow-up to complete the CATI for all mail nonrespondents began three weeks later. Data collection protocols for the CATI and Web start mode experiment groups were similar and ran in parallel to the SAQ data collection protocol.[3]

At any given time, a sample member could ask to complete the survey in a mode other than the mode originally assigned, and 33.1 percent of the sample members did so (n = 10,446).

Quality assurance procedures were in place at each step (address updating, printing, package assembly and mailing, questionnaire receipt, data entry, coding, CATI, and post-data-collection processing). The data collection field period ended in July 2004. The CATI and data entry processes ended on 9 July 2004, and the Web questionnaire was closed down on 16 July 2004.

Response Rates

The unweighted response rate for the 2003 SDR was 79.1 percent, based on 29,915 completed, eligible cases. A total of 1,663 cases were found to be ineligible during the 2003 SDR, and 43 cases were found to be out of scope for the SDR frame. Of the ineligible cases, 391 were found to be permanently ineligible for the SDR sample and, along with the 43 out-of-scope cases, will be dropped from the panel. Table A-2 shows a breakdown of the 2003 SDR sample by final outcome. The weighted response rate for the 2003 SDR is 79.5 percent, based on a target population size of 720,241 science, engineering, and health doctorate holders. The 2003 SDR unweighted and weighted response rates are comparable to those obtained in past survey cycles. Lower response rates generally clustered among non-U.S. citizens and people with large amounts of missing demographic data (table A-2). Missing demographic data indicated incomplete frame records in the Doctorate Records File, which made locating these cases more difficult. Data collection experience has shown that sample members who are located are generally disposed to complete the survey; individuals who could not be located accounted for the largest number of nonrespondents.

TABLE A-2.  Survey outcomes and response rates for doctoral scientists and engineers, by characteristics of doctorate recipient: 2003.

Weights

To enable weighted analyses of the 2003 SDR data, a final weight was calculated for every person in the sample. Informally, a final weight approximates the number of persons in the population of doctorate recipients that a sampled person represents. The main goal of weighting is to reduce the nonresponse bias in the survey estimates.

The first step of the weighting process calculated a base weight for all cases selected into the 2003 SDR sample. The base weight accounts for the sample design, and it is defined as the reciprocal of the probability of selection under the sample design. In the next step, an adjustment for nonresponse was performed on completed cases to account for the sample cases that did not complete the survey. Nonresponse adjusted weights were assigned to both respondents and known ineligible cases (i.e., cases who were deceased, institutionalized, over 75 years of age, or living abroad during the survey reference period), but eligible nonrespondents and cases with unknown eligibility received a weight of zero. The total weight carried by unknown eligibility cases was distributed to respondents and known ineligible cases, assuming the same eligibility rates between the two groups of cases. By this method, the respondents represent all eligible cases in the frame, the known ineligible cases represent all ineligible cases, and cases with unknown eligibility carry no weight. Thus the sum of weights equals the frame size.
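The weight accounting described above can be summarized in a short sketch. In practice the nonresponse adjustment is carried out within weighting cells; the sketch below collapses everything into a single cell to show the bookkeeping, and the status labels are hypothetical.

```python
def final_weights(cases):
    """Sketch of the nonresponse adjustment: respondents absorb the weight
    of eligible nonrespondents, known ineligibles absorb their share of the
    unknown-eligibility weight, and all other cases get a weight of zero.

    cases -- list of dicts with keys 'base_weight' and 'status', where
             status is 'respondent', 'ineligible', 'nonrespondent', or
             'unknown' (hypothetical labels)
    """
    w = lambda st: sum(c['base_weight'] for c in cases if c['status'] == st)
    w_resp, w_inel = w('respondent'), w('ineligible')
    w_nonresp, w_unknown = w('nonrespondent'), w('unknown')
    # Split the unknown-eligibility weight using the eligibility rate
    # observed among cases whose eligibility is known.
    elig_rate = (w_resp + w_nonresp) / (w_resp + w_nonresp + w_inel)
    f_resp = (w_resp + w_nonresp + elig_rate * w_unknown) / w_resp
    f_inel = (w_inel + (1 - elig_rate) * w_unknown) / w_inel
    for c in cases:
        factor = {'respondent': f_resp, 'ineligible': f_inel}.get(c['status'], 0.0)
        c['final_weight'] = c['base_weight'] * factor
    return cases
```

By construction, the sum of the final weights over respondents and known ineligibles equals the sum of the base weights over the whole frame.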

Data Editing

Complete case data were captured in four separate data collection instruments for the 2003 SDR: the computer-assisted data-entry system, which captured data from the completed paper forms; the CATI system; the Web survey; and the "retrieval" instrument, an additional CATI instrument used to collect critical-item follow-up data.

Data exported from each of these four instruments were coded to produce SESTAT variables with the same characteristics (i.e., code frames, lengths, names, and types) across the different instruments. In some cases, this procedure required special coding to standardize code frames across platforms. The result of these procedures was a single database on which all subsequent coding, editing, and cleaning were performed.

Once the merged dataset was created, data from a number of external sources were added to it. These additional data included occupational and educational codes, state/country geographic codes, race/ethnicity and gender data from past SDR surveys and from frame data, Integrated Postsecondary Education Data System (IPEDS) institution codes, and codes assigning "Other/Specify" verbatim responses to existing variable code frames. After all externally coded variables were merged into the dataset, the survey data were edited. These edits included checks for range errors, skip errors, multiple responses to "Mark one" questions, and data inconsistencies between items and across years.

Imputation of Missing Data

The 2003 SDR used a combination of logical imputation and statistical imputation. For the most part, logical imputation was accomplished as part of editing. In the editing phase, the answer to a question with missing data was sometimes determined by the answer to another question. In some circumstances, editing was also used to create "missing" data for statistical imputation. During sample frame building for the SDR, some demographic frame variables were found to be missing for sample members. The values for these variables were imputed at the frame construction stage.

The 2003 SDR's primary method of statistical imputation was hot-deck imputation. Almost all SDR variables were subjected to hot-deck imputation, with each variable assigned its own structure of class and sort variables on the basis of a regression analysis. Critical items (which must be complete for all completed cases) and text variables were not imputed.
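A minimal hot-deck sketch follows, assuming the common sequential form in which each recipient borrows the value of the nearest preceding donor after the file is sorted within imputation classes; the variable names in the usage comment are illustrative only.

```python
import pandas as pd

def hot_deck(df, target, class_vars, sort_vars):
    """Sequential hot-deck: within each imputation class, sort the cases and
    fill each missing value of `target` from the nearest preceding donor
    (falling back to the nearest following donor at the top of a class)."""
    df = df.sort_values(class_vars + sort_vars).copy()
    df[target] = df.groupby(class_vars)[target].transform(
        lambda s: s.ffill().bfill()
    )
    return df

# Hypothetical usage: impute salary within degree-field-by-sex classes,
# sorted by age so that donors resemble their recipients.
# df = hot_deck(df, 'salary', ['degree_field', 'sex'], ['age'])
```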

For some variables, there was no set of class and sort variables reliably related to, or suitable for predicting, the missing value. In these instances, consistency was better achieved outside the hot-deck procedure by using random imputation. For example, a respondent with a missing marital status (question E1) may have answered question E2 or E3, regarding a spouse or partner's employment status, implying that E1 should be "1" (married) or "2" (living in a marriage-like relationship). The procedure was to assign a random value for E1 with probability proportional to the number of cases in each of the valid values (e.g., if there were three married respondents for every respondent living in a marriage-like relationship, then missing values of E1 would be filled in with "1" 75 percent of the time and "2" 25 percent of the time).
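The marital-status example translates directly into code. The sketch below draws a donor value at random, which automatically reproduces the observed category proportions; the E1 codes follow the questionnaire values quoted above.

```python
import random

def random_impute(values, valid_codes):
    """Fill missing entries (None) by drawing a valid code with probability
    proportional to its observed frequency among the non-missing cases."""
    donors = [v for v in values if v is not None and v in valid_codes]
    return [v if v is not None else random.choice(donors) for v in values]

# E1 example: three "1" (married) cases per "2" (marriage-like relationship),
# so a missing E1 is filled with "1" about 75 percent of the time.
e1 = ["1", "1", "1", "2", None, None]
print(random_impute(e1, {"1", "2"}))
```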

Reliability of Estimates

Because the estimates produced from the SDR are based on a random sample, they may vary from those that would have been obtained if all members of the target population had been surveyed using the same data collection procedures. Two types of error are possible when population estimates are derived from a sample survey: sampling error and nonsampling error. By looking at these errors, the accuracy and precision of the survey estimates can be assessed.

Sampling Errors

Sampling error is the variation that occurs by chance because a sample, rather than the entire population, is surveyed. The particular sample used to estimate the 2003 population of science, engineering, and health doctorate recipients in the United States is one of a large number of samples that could have been selected using the same sample design and sample size. Estimates based on each of these samples would have differed, and such random variation across all possible samples is called sampling error. Sampling error is measured by the variance or standard error of the survey estimate. The 2003 SDR sample is a systematic sample selected independently from each sampling stratum. The successive difference replication method (SUD) was used to estimate the sampling errors. The theoretical basis for the SUD is described in Wolter (1984) and in Fay and Train (1995).

Table A-3 contains the standard errors for the key sampling variables.

TABLE A-3.  Unweighted number, weighted estimates, standard errors, and design effects for 2003 Survey of Doctorate Recipients, by characteristics of doctorate recipient.

Standard errors like those reported in table A-3 can be used to construct confidence intervals around the estimates. If all possible samples under the sample design were surveyed under the same conditions, and a 95 percent confidence interval were constructed from each sample, then 95 percent of all these intervals would contain the true population value. For example, the estimated total number of agriculture sciences doctorate recipients is 26,656, with a standard error of 259. The 95 percent confidence interval for this estimate is [26,656 – (1.96 × 259), 26,656 + (1.96 × 259)] or [26,148, 27,164]. The standard errors can also be used in testing hypotheses about population parameters.
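In code, the interval above is reproduced as follows (the estimate and standard error are the table A-3 values quoted in the text):

```python
est, se, z = 26_656, 259, 1.96        # agricultural sciences total and SE
low, high = est - z * se, est + z * se
print(round(low), round(high))        # 26148 27164
```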

Nonsampling Errors

In addition to sampling error, survey estimates are subject to nonsampling error, which can arise at many points in the survey process. Sources of nonsampling error include (1) nonresponse error, which arises when the characteristics of respondents differ systematically from nonrespondents; (2) measurement error, which arises when the variables of interest cannot be precisely measured; (3) coverage error, which arises when some members of the target population are excluded from the frame and thus do not have a chance to be selected for the sample; (4) respondent error, which occurs when respondents provide incorrect data; and (5) processing error, which can arise at the point of data editing, coding, or data entry. The analyst should be aware of potential nonsampling errors, but these errors are much harder to quantify than sampling errors.

Generalized Variance Functions

The SDR generates a large number of estimates. In 1999 and 2001, the U.S. Census Bureau used the SUD to compute the variance for a subset of estimates (Tupek 2003). These so-called direct variance estimates were then used to fit generalized variance functions (GVFs) for various population subgroups that represent potential analysis domains. GVFs are provided because it is not feasible to directly calculate and publish the variance for all SDR estimates. In particular, it is impossible to anticipate the numerous analysis domains that may be of interest to SDR data users. The GVFs provide a mechanism for data users to compute the variance of their estimates that are not directly provided by the SDR.

Direct variance estimates are computed for a set of key SDR variables. The lists of key variables used in the 2001 and 2003 GVF estimations are similar. These variables have been determined to be important analysis variables and are sufficiently diverse in that the observed totals cover a wide range within each analysis domain. Some of the key variables are recoded to reduce the number of response categories. Then, a binary variable is created for each response category. Overall, the set of key variables has a total of 103 categories among them; therefore, direct point and variance estimates involve 103 binary variables.

For a binary variable X, the estimate of the population total is

$$\hat{X} = \sum_{i=1}^{n} X_i W_i \qquad (1)$$

where $X_i$ is the value of X for sample member i, $W_i$ is the final weight for that individual, and n is the sample size. The variance of $\hat{X}$, based on the SUD replicate weights, is estimated by

$$\operatorname{Var}(\hat{X}) = \frac{4}{R} \sum_{r=1}^{R} \left(\hat{X}_r - \hat{X}\right)^2 \qquad (2)$$

where R is the total number of replicates and $\hat{X}_r$ is the estimated population total based on the rth replicate.
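Equation 2 translates directly into code. The sketch below assumes the replicate weights are already available (their construction follows Fay and Train 1995 and is not shown here):

```python
import numpy as np

def sud_variance(x, final_w, replicate_w):
    """Successive difference replication (SUD) variance of a total.

    x           -- array of shape (n,), the binary variable X
    final_w     -- array of shape (n,), full-sample final weights
    replicate_w -- array of shape (R, n), one row of weights per replicate
    """
    x_hat = np.sum(x * final_w)        # full-sample estimate of the total
    x_hat_r = replicate_w @ x          # R replicate estimates of the total
    R = replicate_w.shape[0]
    return 4.0 / R * np.sum((x_hat_r - x_hat) ** 2)
```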

The direct estimates are calculated using SUDAAN's DESCRIPT procedure. Many SDR estimates are based on small populations. This is true for most estimates associated with Blacks, American Indians/Alaska Natives, Native Hawaiians/other Pacific Islanders, and Hispanics. For such small populations, the use of the finite population correction (FPC) factor is generally recommended. As was done in 2001, the FPC was applied to all survey estimates, although its impact is minimal on populations sampled at a rate of less than 10 percent. For each GVF subgroup or domain, the FPC is calculated as

$$\mathrm{fpc}_d = 1 - \frac{n_d}{N_d} \qquad (3)$$

where $n_d$ is the domain sample size and $N_d$ is the domain population size. The domain population size is estimated by the sum of the base weights in the domain, where the base weight reflects the selection probability when the case was last selected into the SDR sample.

To account for potential differences across different population subgroups, the GVFs are estimated independently for each subgroup. (To be consistent with terminology used in the past, the analysis domains defined by degree field and demographic characteristics are called subgroups. These subgroups are not mutually exclusive.) For the GVFs to be successful, statistics that are grouped together should follow a common model, which generally implies that statistics within a subgroup have a similar design effect. Empirically, the grouping is often successful when it is defined by the main design variables, such as demographic, geographic, and racial characteristics.

In estimating the 2001 GVFs, the U.S. Census Bureau defined a total of 261 population subgroups based on the cross-classification of 29 degree-field groups and 9 demographic groups. To reflect changes in both the degree-field definition and the demographic-group definition in the 2003 SDR, NORC defined 352 subgroups for separate GVF estimation based on a cross-classification of 32 degree-field groups and 11 demographic groups. These definitions are consistent with those used in the 2003 detailed statistical tables. For subgroups that are not covered by this classification, the analyst may use the GVF estimated for all doctorate recipients combined. The 32 degree-field groups and 11 demographic groups are listed below.

Degree-field groups
  All doctorate recipients
    Science
      Biological, agricultural, and environmental life sciences
        Agricultural/food sciences
        Biochemistry/biophysics
        Cell/molecular biology
        Environmental life sciences
        Microbiology
        Zoology
        Other biological sciences
      Computer and information sciences
      Mathematics and statistics
      Physical sciences
        Astronomy/astrophysics
        Chemistry, except biochemistry
        Earth/atmospheric/ocean sciences
        Physics
      Psychology
      Social sciences
        Economics
        Political sciences
        Sociology
        Other social sciences
    Engineering
      Aerospace/aeronautical/astronautical engineering
      Chemical engineering
      Civil engineering
      Electrical/computer engineering
      Materials/metallurgical engineering
      Mechanical engineering
      Other engineering
    Health

Demographic groups
  Male
  Female
  American Indian/Alaska Native
  Asian
  Black
  Hispanic
  White
  Other/multi-race/unknown race/ethnicity (including Native Hawaiian/other Pacific Islander)
  2001–02 cohort
  Foreign born

Many mathematical models can be used as generalized variance functions to describe the relationship between the variance of a survey estimate and its expectation. Most models are based on the assumption that the relative variance is a decreasing function of the magnitude of the mean or expectation (Wolter 1985). A commonly used functional form is expressed as a two-parameter model:

$$\operatorname{Var}(\hat{X}) = aX^2 + bX \qquad (4)$$

where $\hat{X}$ is an estimator of the total number of cases possessing some characteristic, $X = E(\hat{X})$ is the expectation of $\hat{X}$, $\operatorname{Var}(\hat{X})$ is the variance of $\hat{X}$, and a and b are the generalized variance function parameters to be estimated.

Dividing both sides of equation 4 by $X^2$ yields

$$\frac{\operatorname{Var}(\hat{X})}{X^2} = a + \frac{b}{X} \qquad (5)$$

which states that the relative variance of the estimate is a linear function of the inverse of its expectation. The model shown in equation 5 is probably the most commonly used functional form for GVF modeling. NORC used it to estimate the GVFs for the 1997 SDR, and this is the model used for the 2003 GVF estimation.[4]

For each population subgroup, the parameters of the generalized variance function were estimated through an iterative weighted linear regression procedure using the direct point and variance estimates as input. Using weighted linear regression improves the reliability of the fitted model by assigning relatively smaller weights to less reliable direct-variance estimates and larger weights to more reliable direct-variance estimates.

The iterative weighted linear regression procedure involves four regression runs: (1) a weighted linear regression of the relative variance $\operatorname{Var}(\hat{X})/X^2$ on $1/X$, using as the initial regression weight the square of the inverse of the relative variance; (2) a second weighted regression of the relative variance on $1/X$, using as regression weight the square of the inverse of the predicted relative variance from the first regression model; (3) a third weighted regression, using as regression weight the square of the inverse of the predicted relative variance from the second regression model; and (4) a fourth weighted regression, using as regression weight the square of the inverse of the predicted relative variance from the third regression model. At the end of the fourth regression run, observations with an absolute standardized residual exceeding 3 are identified as outliers and removed from consideration. The four-step regression procedure is then repeated on the remaining observations, and this iterative process continues until all absolute standardized residuals are smaller than 3.
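The sketch below renders this procedure in Python. The four-run reweighting and the outlier rule follow the description above; the weighted-least-squares details and the simple standardization of residuals are this sketch's own simplifications, not a specification of NORC's software.

```python
import numpy as np

def fit_gvf(X, relvar, resid_cut=3.0):
    """Fit relvar = a + b/X by iteratively reweighted least squares,
    dropping outliers (|standardized residual| > resid_cut) and refitting.

    X      -- array of direct point estimates (totals)
    relvar -- array of direct relative-variance estimates
    """
    def wls(x, y, w):
        A = np.column_stack([np.ones_like(x), 1.0 / x])
        beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
        return beta, A @ beta

    keep = np.ones(len(X), dtype=bool)
    while True:
        x, y = X[keep], relvar[keep]
        w = 1.0 / y**2                     # initial weight: (1/relvar)^2
        for _ in range(3):                 # runs 1-3, reweighting each time
            beta, pred = wls(x, y, w)
            w = 1.0 / pred**2              # weight from predicted relvar
        beta, pred = wls(x, y, w)          # run 4
        resid = y - pred
        std_resid = resid / resid.std(ddof=2)
        if np.all(np.abs(std_resid) <= resid_cut):
            return beta                    # (a, b)
        keep[np.where(keep)[0][np.abs(std_resid) > resid_cut]] = False
```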

The estimated GVF parameters, along with relevant goodness-of-fit statistics for each model, are presented in appendix B. Note that estimated GVF parameters are available for 345 of the 352 subgroups or domains. The other 7 subgroups are either empty or have only one case, so direct variance estimation is not possible.

With the estimated generalized variance parameters, it is possible to approximate the variance (or standard error) for any 2003 SDR estimate. The following estimation formulas are for standard errors of totals, proportions, and differences.

Standard Errors of Estimated Totals. An estimator of the variance of an estimated total $\hat{X}$ can be obtained by evaluating the GVF at $\hat{X}$ with the estimated parameters a and b. The standard error of an estimated total can be derived using the following equation:

$$\operatorname{SE}(X) = \sqrt{aX^2 + bX} \qquad (6)$$

where X is the estimate of the total and a and b are the generalized variance parameters.

Standard Errors of Estimated Proportions. If p represents a proportion based on the ratio of two estimated totals, where the numerator is a subset of the denominator, the standard error of p, SE(p), can be approximated by using the following equation:

$$\operatorname{SE}(p) = p\,\sqrt{\frac{[\operatorname{SE}(X)]^2}{X^2} - \frac{[\operatorname{SE}(Y)]^2}{Y^2}} \qquad (7)$$

where X and Y are estimated totals, SE(X) and SE(Y) are the corresponding standard errors derived from equation 6, and p = 100(X/Y) is the estimated proportion. Equation 7 assumes that there is zero correlation between p and Y.

Standard Errors of Estimated Differences. The standard error of the difference between two estimated totals can be approximated by the following equation:

$$\operatorname{SE}(X - Y) = \sqrt{[\operatorname{SE}(X)]^2 + [\operatorname{SE}(Y)]^2} \qquad (8)$$

where X and Y are estimated totals and SE(X) and SE(Y) are the corresponding standard errors from equation 6.
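Equations 6 through 8 are straightforward to apply in code. In the sketch below, the parameters a and b would be taken from the relevant subgroup's row in appendix B; the values in the usage comment are made up for illustration.

```python
import math

def se_total(x, a, b):
    """Equation 6: standard error of an estimated total under the GVF."""
    return math.sqrt(a * x**2 + b * x)

def se_proportion(x, y, a, b):
    """Equation 7: SE of p = 100*(x/y), where x is a subset of y."""
    p = 100.0 * x / y
    return p * math.sqrt(se_total(x, a, b)**2 / x**2
                         - se_total(y, a, b)**2 / y**2)

def se_difference(x, y, a, b):
    """Equation 8: SE of the difference between two estimated totals."""
    return math.sqrt(se_total(x, a, b)**2 + se_total(y, a, b)**2)

# Hypothetical usage with made-up GVF parameters:
# a, b = -1e-5, 25.0
# print(se_total(26_656, a, b))
```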

Note that the estimated GVF parameters for some small domains are based on a small number of cases. Parameter estimates are provided for all domains, but the analyst is advised to use caution when using the GVFs of very small domains.

Changes in the Detailed Statistical Tables

Tables for the 2003 SDR report more detailed field-of-doctorate and occupation classifications than did those for the 2001 SDR. In the 2003 tables, the field-of-doctorate variable "Biological, agricultural, and environmental life sciences" ("Biological and agricultural sciences" in 2001) includes seven subfields rather than the three reported in 2001. Under the heading "Physical sciences" ("Physical and related sciences" in 2001), separate subfields of "Astronomy/astrophysics" and "Physics" are reported. In 2001, these two subfields were combined into a single "Physics and astronomy" subfield.

The occupational classification in the 2003 tables differs in two major respects from the one used in the 2001 tables. "Biological, agricultural, and other life scientist," the classification identified as "Life and related scientists" in 2001, reports eight subclassifications, rather than the six reported in 2001. Non-S&E occupations are treated completely differently. Health-related occupations, S&E managers, S&E pre-college teachers, and S&E technicians/technologists have been reclassified under "Science and engineering-related occupations." As a result, all "Non-science and engineering occupations" are composed of clearly non-S&E occupations, such as those involving arts and humanities or social services.

Definitions and Explanations

Employer location. Survey question A11 asked for the location of the principal employer, and data were based primarily on responses to this question. Individuals not reporting place of employment were classified by their last mailing address.

Field of doctorate. The doctoral field is as specified by the respondent in the SED at the time of degree conferral. These codes were subsequently recoded to the field of study codes used in SESTAT questionnaires. (See appendix tables C-1 and C-2 for field-of-study codes.)

Involuntarily out-of-field rate. The involuntarily out-of-field rate is the percentage of employed individuals who reported working part-time exclusively because a suitable job was not available and/or reported working in an area not related to the first doctoral degree (in their principal job), at least partially because a job in the doctoral field was not available.

Labor force participation rate. The labor force participation rate (RLF) is the ratio (E + U) / P, where E (employed) + U (unemployed; those not-employed persons actively seeking work) = the total labor force, and P = population, defined as all science, engineering, and health doctorate holders under age 76 who were residing in the United States on the survey reference date of 1 October 2003 and who earned their doctorates from U.S. institutions.

Non-U.S. citizen, temporary resident. This citizenship status category does not include individuals who at the time they received their doctorate reported plans to leave the United States and thus were excluded from the sampling frame.

Occupation data. These data were derived from responses to several questions on the kind of work primarily performed by the respondent. The occupational classification of the respondent was based on his/her principal job held during the reference week—or last job held, if not employed in the reference week (survey question A21 or A5). Also used in the occupational classification was a respondent-selected job code (survey question A22 or A6). (See appendix table C-3 for a list of occupations.)

Race/ethnicity. American Indian/Alaska Native, Asian, black, Native Hawaiian/other Pacific Islander, and white refer to non-Hispanic individuals only. These data are from prior rounds of the SDR and the SED. The most recently reported race/ethnicity data were given precedence.

Salary. Median annual salaries are reported, rounded to the nearest $100 and computed for full-time employed scientists and engineers. For individuals employed by education institutions, no accommodation was made to convert academic-year salaries to calendar-year salaries. Users are advised that due to changes in the salary question since 1993, the 1995 through 2003 salary data are not strictly comparable with the 1993 salary data.

Sector of employment. Employment sector was a derived variable based on responses to survey questions A15 and A17. In the detailed tables, the category "Universities and 4-year colleges" includes 4-year colleges or universities, medical schools (including university-affiliated hospitals or medical centers), and university-affiliated research institutions. "Private-for-profit" includes those self-employed in business.

Unemployment rate. The unemployment rate (Ru) is the ratio U / (E + U), where U = unemployed (those not-employed persons actively seeking work) and E (employed) + U = the total labor force.

References

Dillman, D.A. 1978. Mail and Telephone Surveys: The Total Design Method. New York: Wiley-Interscience.

Fay, R.E., and Train, G.F. 1995. Aspects of survey and model-based postcensal estimation of income and poverty characteristics for states and counties. ASA Proceedings of the Section on Government Statistics: 154–159.

National Opinion Research Center (NORC). 2003. 2003 SDR Experiment Summary Plan — Amended. Issued in August 2003. Unpublished report prepared under contract SRS-0214279 for the National Science Foundation.

Tupek, A.R. 2003. Calculation of generalized variance parameters for the 2001 Survey of Doctorate Recipients (SDR01-VAR-3). Internal Census Bureau memorandum, 11 February.

Wolter, K. 1984. An investigation of some estimators of variance for systematic sampling. Journal of the American Statistical Association, 79(388):781–790.

Wolter, K. 1985. Introduction to Variance Estimation. New York: Springer-Verlag New York Inc.



Footnotes

[1] See appendix table C-1 for science, engineering, and health fields included in the 2003 SDR sampling frame.

[2] The SDR frame is based on the first U.S. doctorate earned. Recipients of two doctorates whose first degree is not in a science, engineering, or health field are not included in the SDR frame, even if their second doctorate is in a science, engineering, or health field. Based on information collected annually by the SED on the number and characteristics of those earning two doctorates, this exclusion results in a slight undercoverage bias. In 1983–2000, for example, the total number of double doctorate recipients with a non-science, engineering, or health first doctorate and a science, engineering, or health second doctorate was 154, representing 0.046 percent of the total number of science, engineering, or health doctorates awarded in that period.

[3] For more complete details regarding the mode experiments, see NORC 2003.

[4] In the 2001 GVF estimation, an additional restriction was applied to this model such that the relative variance is zero when the survey estimate is equal to the population control total T, where the values of T were derived from the population control totals used in the ratio raking adjustment in 2001. This was done to avoid a situation where the estimated relative variance could be negative for large values of the estimate. This restriction forced the value of a to be equal to (−b/T), and the model was thus reduced to a one-parameter model: $\operatorname{Var}(\hat{X})/X^2 = b\left(\frac{1}{X} - \frac{1}{T}\right)$. For the 2003 SDR, however, NSF decided not to implement ratio adjustments to population control totals, and thus the T are not available. For practical purposes, the two models should give very similar results.

