National Science Foundation
 


Survey of Doctorate Recipients

Overview  Survey Design  Survey Quality Measures  Trend Data  Availability of Data

1. Overview (2006 survey cycle)

a. Purpose

The Survey of Doctorate Recipients (SDR) gathers information from individuals who have obtained a doctoral degree in a science, engineering, or health (SEH) field. The SDR is conducted every 2 years and is a longitudinal survey that follows recipients of research doctorates from U.S. institutions until age 76. This group is of special interest to many decision makers because it represents some of the most highly educated individuals in the U.S. workforce. The SDR results are used by employers in the education, industry, and government sectors to understand and predict trends in employment opportunities and salaries in SEH fields for graduates with doctoral degrees. The results are also used to evaluate the effectiveness of equal opportunity efforts. NSF also finds the results important for internal planning, because most NSF grants go to individuals with doctoral degrees. The data from this survey are combined with data from two other NSF surveys of scientists and engineers, the National Survey of College Graduates (NSCG) and the National Survey of Recent College Graduates (NSRCG). The three surveys are closely coordinated and share the same reference date and nearly identical instruments. The database developed from the three surveys, the Scientists and Engineers Statistical Data System (SESTAT), provides a comprehensive picture of the number and characteristics of individuals with training and/or employment in science, engineering, or related fields in the United States.

b. Respondents

Respondents were individuals:

  • with a research doctorate in an SEH field from a U.S. institution;
  • living in the U.S. during the survey reference week;
  • non-institutionalized; and
  • under age 76.

c. Key variables

  • Citizenship status
  • Country of birth
  • Country of citizenship
  • Date of birth
  • Disability status
  • Educational history (for each degree held: field, level, institution, when received)
  • Employment status (unemployed, employed part time, or employed full time)
  • Geographic place of employment
  • Marital status
  • Number of children
  • Occupation (current or past job) [1]
  • Primary work activity (e.g., teaching, basic research, etc.)
  • Postdoctorate status (current and/or 3 most recent postdoctoral appointments)
  • Race/ethnicity
  • Salary
  • Satisfaction and importance of various aspects of job
  • School enrollment status
  • Sector of employment (e.g., academia, industry, government, etc.)
  • Sex
  • Work-related training

2. Survey Design

a. Target population and sample frame

The target population of the 2006 survey consisted of all individuals:

Under the age of 76 as of the survey reference date (i.e., born on or after April 1, 1930) [2] who received a research doctorate in a science, engineering or health field from a U.S. academic institution, were non-institutionalized, and were living in the U.S. or one of its territories during the survey reference week of April 1, 2006.

The sample frame used to identify these individuals was the Doctorate Records File (DRF), maintained by the National Science Foundation. The primary source of information for the DRF is the Survey of Earned Doctorates (SED). For individuals who received a doctoral degree prior to 1957, when the SED started, information was taken from a register of highly qualified scientists and engineers that the National Academy of Sciences had assembled from a variety of sources, including university and college catalogues of doctorate-granting institutions, federal laboratories, selected industrial organizations, and American Men and Women in Science.

The 2006 SDR sampling frame included individuals who:
  • earned a research doctoral degree from a U.S. college or university in a science, engineering, or health field through June 30, 2005;
  • at the time of receipt of their doctoral degree were U.S. citizens or, if non-U.S. citizens, indicated on the SED that they had plans to remain in the United States after degree award [3]; and
  • were younger than 76 years of age as of April 1, 2006 (the survey reference date) or, if age was unknown, had not received a baccalaureate prior to 1948.

The 2006 SDR frame was constructed as two separate databases, the new cohort frame and the old cohort frame. The cohorts are defined by the year of receipt of the first U.S.-granted doctoral degree [4]. The new cohort frame included individuals who received a science, engineering, or health doctorate between July 1, 2002 and June 30, 2005; and the old cohort frame represented individuals who received a science, engineering, or health doctorate prior to July 1, 2002. The new cohort frame was a "primary frame" including all known eligible cases, while the old cohort frame was a "secondary frame" because it was the SDR sample selected for the previous survey cycle and each frame member carried a sampling weight from the previous cycle.

New Cohort Frame Construction. The data source for constructing the SDR new cohort sampling frame for 2006 was the three most-recent doctoral cohorts included in the DRF. The most recent SED cohort always lags one year behind the current SDR reference year; the three most-recent cohorts for the 2006 SDR were thus the 2003, 2004 and 2005 (academic year, ending June 30) doctoral cohorts.

One element of the 2006 SDR new cohort frame construction is important to note.  Until 2006, the SDR new cohort frame consisted of the two most recent cohorts from the DRF.  The previous reference date for the SDR was October 1, 2003, and that survey incorporated the 2001 and 2002 DRF cohorts.  Because the reference date for 2006 was April 1, 2006 (a gap of 2.5 years since the previous reference date), it was possible to incorporate three recent cohorts from the DRF.  However, because the 2005 frame was not available until June 2006, two months after the reference date, the new cohort frame was constructed in two files.  The first file, available in September 2005, included the 2003 and 2004 doctoral cohorts, and the second file, available in June 2006, included the 2005 cohort.

Old Cohort Frame Construction. The SDR old cohort frame was constructed principally from the final sample file from the previous cycle plus additional cases from the previous cycle frame file that were classified as temporarily ineligible for that survey [5]. Old cohort cases were all originally selected into the SDR as new cohort members, and as such were originally selected from the DRF. However, the DRF cannot be used as a primary frame for the old cohort because information relevant to the SDR frame on the old cohort population is not systematically updated. The final sample from the previous SDR cycle thus was used as the basic frame for the current old cohort sample.

The cases within these two frame sources were analyzed individually against the SDR eligibility requirements. Persons who did not meet the age criteria or who were known to be deceased, terminally ill or incapacitated, or permanently institutionalized in a correctional or health care facility were dropped from the 2006 sampling frames. Sample persons who were non-U.S. citizens and were known to be residing outside the United States or one of its territories during at least two prior consecutive survey cycles were also eliminated from the sampling frame. After ineligible cases were removed, the remaining cases from the two sources were combined to create the 2006 SDR sampling frame. In total, there were 116,637 cases in the 2006 SDR frame: 77,834 new cohort cases and 38,803 old cohort cases.

b. Sample design

The 2006 SDR sample design was basically the same as the 2003 SDR design. The 2006 SDR sample consisted of 42,955 cases. The frame was stratified into 164 strata by three variables: demographic group, degree field, and sex. The sample was then selected from each stratum systematically.  The goal of the 2006 SDR sample stratification design was to create strata that conformed as closely as possible to the reporting domains used by analysts and for which the associated subpopulations were large enough to be suitable for separate estimation and reporting. The demographic group variable included 10 categories defined by race/ethnicity, disability status, and citizenship at birth. The classification of frame cases into these categories was done in a hierarchical manner to ensure higher selection probability for rarer population groups. Prior to 2003, a 15-category degree field variable was used to stratify all demographic groups, resulting in a large number of strata with very small populations. NSF decided that an alternative degree field variable was needed to stratify the smaller demographic groups. Beginning in 2003, only the three largest demographic groups (U.S. white, non-U.S. white, and non-U.S. Asian) were stratified by the 15-category degree field variable. All other demographic groups were stratified by a 7-category degree field variable except that American Indians and Pacific Islanders were stratified only by sex. Thus, the 2006 SDR design featured a total of 164 strata defined by a revised demographic group variable, two degree field variables, and sex.
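
The hierarchical classification of frame cases into demographic groups can be sketched as follows. The group names, priority order, and predicates below are illustrative assumptions for exposition, not the actual 2006 SDR definitions:

```python
def classify_demographic_group(case, groups):
    """Assign a frame case to the FIRST group in a priority-ordered
    list that it qualifies for. Ordering rarer population groups
    first ensures their members a higher selection probability, as
    described above."""
    for name, predicate in groups:
        if predicate(case):
            return name
    return "other"

# Illustrative priority order: rarest population groups first.
GROUPS = [
    ("disabled",
     lambda c: c.get("disability", False)),
    ("underrepresented minority",
     lambda c: c.get("race") in {"Black", "Hispanic", "American Indian"}),
    ("non-U.S. Asian",
     lambda c: c.get("race") == "Asian" and not c.get("us_citizen_at_birth")),
    ("U.S. white",
     lambda c: c.get("race") == "White" and c.get("us_citizen_at_birth")),
]
```

Under this scheme a case that is, for example, both disabled and white lands in the "disabled" stratum, because that group is checked first.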

The 2006 SDR sample allocation strategy consisted of three main components: (1) Allocate a minimum sample size for the smallest strata through a supplemental stratum allocation; (2) allocate extra sample for specific demographic group-by-sex domains through a supplemental domain allocation; and (3) allocate the remaining sample proportionately across all strata. The final sample allocation was therefore based on the sum of a proportional allocation across all strata, a domain-specific supplement allocated proportionately across strata in that domain, and a stratum-specific supplement added to obtain the minimum stratum size.
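
The allocation strategy can be sketched as below. The function name, the rounding rule, and the single minimum stratum size are simplifying assumptions; the domain-specific supplement (component 2) is omitted for brevity:

```python
def allocate_sample(stratum_sizes, total_n, min_per_stratum):
    """Sketch of the allocation: a proportional allocation across all
    strata (component 3), then a supplemental allocation that tops up
    the smallest strata to a minimum size (component 1)."""
    frame_total = sum(stratum_sizes.values())
    # Proportional allocation across all strata.
    alloc = {s: round(total_n * n / frame_total)
             for s, n in stratum_sizes.items()}
    # Supplemental stratum allocation: enforce the minimum sample
    # size, capped at the stratum's frame size.
    return {s: min(stratum_sizes[s], max(a, min_per_stratum))
            for s, a in alloc.items()}
```

For example, with a frame of 10,000 cases split 9,000/900/100 across three strata, a total sample of 1,000, and a minimum of 25 per stratum, the smallest stratum is lifted from its proportional share of 10 up to 25.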

The 2006 SDR sample selection was carried out independently for each stratum and cohort-substratum. For the old cohort strata, the past practice of selecting the sample with probability proportional to size continued, where the measure of size was the base weight associated with the previous survey cycle.  For each stratum, the sampling algorithm started by identifying and removing self-representing cases through an iterative procedure. Next, the non-self-representing cases within each stratum were sorted by citizenship, disability status, DRF degree field, and year of doctoral degree award. Finally, the balance of the sample (i.e., the total allocation minus the number of self-representing cases) was selected from each stratum systematically with probability proportional to size.
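
The selection steps above (iterative removal of self-representing cases, then systematic probability-proportional-to-size selection of the balance) can be sketched roughly as follows. The data layout and the deterministic start point are illustrative assumptions; a production implementation would use a random start:

```python
def select_stratum_sample(frame, n, start_fraction=0.5):
    """`frame` is a list of (case_id, size) pairs, pre-sorted by the
    stratum sort variables; `size` is the prior-cycle base weight.

    Step 1: iteratively pull out self-representing (certainty) cases,
    i.e., cases whose PPS selection probability would reach 1.
    Step 2: select the balance systematically with probability
    proportional to size, using a fixed fractional start point here
    for reproducibility."""
    selected, remaining = [], list(frame)
    while remaining:
        k = n - len(selected)
        if k <= 0:
            break
        total = sum(size for _, size in remaining)
        certainties = [c for c in remaining if c[1] * k / total >= 1.0]
        if not certainties:
            break
        selected.extend(cid for cid, _ in certainties)
        remaining = [c for c in remaining if c not in certainties]
    k = n - len(selected)
    if k > 0 and remaining:
        total = sum(size for _, size in remaining)
        interval = total / k
        point = start_fraction * interval
        cum = 0.0
        for cid, size in remaining:
            cum += size
            while point < cum and len(selected) < n:
                selected.append(cid)
                point += interval
    return selected
```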

The new cohort sample was selected using the same algorithm used to select the old cohort sample. However, because the base weight for every case in the new cohort frame was identical, each stratum sample from the new cohort was actually an equal-probability or self-weighting sample.

The 2006 SDR sample of 42,955 consisted of 38,027 cases from the old cohort frame and 4,928 cases from the new cohort frame. The overall sampling rate was about 1 in 18 (5.5 percent). However, sampling rates varied considerably across the strata.

c. Data collection techniques

The SDR has been conducted for NSF/SRS by survey contractors. The National Opinion Research Center at the University of Chicago (Chicago, IL) conducted the 2006 survey.  Prior to 1997, the survey was conducted by the National Research Council of the National Academy of Sciences under contract to SRS; the 1997 and 2003 surveys were conducted by the National Opinion Research Center, and the U.S. Census Bureau conducted the 1999 and 2001 surveys.

The data collection approach from 1991 to 2001 consisted of mailing letters, self-administered paper questionnaires (SAQ), and postcards, with a sequence of follow-up mailings to non-responders according to a set schedule, followed finally by contacting the non-responding sample members via telephone. The telephone contact was used to prompt the return of the self-administered paper survey or completion of the survey by telephone interview. In 2003, CATI and a self-administered online questionnaire (Web) were introduced as initial response modes on an experimental basis.  The experiment results indicated that both of these approaches had merit: for certain types of cases, starting in CATI or Web improved both response quality and response rate.

The multiple starting mode approach initiated in 2003 was expanded in the 2006 survey cycle, and a tri-mode approach to data collection was fully implemented.  Using mode preference information reported in the 2003 SDR and response information from the 2003 SDR mode experiments, the 2006 selected sample was assigned to various starting mode data collection protocols.  Old cohort sample members who responded to the 2003 SDR were stratified by explicit or implicit mode preference, and the cases were assigned to a start mode accordingly.  Explicit preferences were determined by the answer to the mode preference question on the 2003 SDR survey; for those who did not respond to the preference question or indicated no preference, implicit preference was defined as the mode they used to complete the 2003 SDR.  2003 SDR non-respondents were assigned a starting mode based on analysis of the 2003 data, which indicated that past refusals were more likely to cooperate if started in the SAQ mode, while other non-respondents were most likely to cooperate if started in the Web mode.  All new cohort members were assigned to the CATI mode; this decision was also based on analysis conducted on the 2003 SDR data.  Sample members living abroad who had not completed the 2003 SDR were started in the Web mode to reduce mailing costs for those most likely to be ineligible for the 2006 SDR.  Those without any physical or e-mail address were started in CATI.

At the start of data collection, 15,733 cases received the paper questionnaire in the mail as their initial mode, 7,902 cases started in the CATI mode, and 19,082 cases started in the Web mode. Based on Dillman's Total Design Method [6], different data collection protocols were developed for each of the three data collection approaches. The protocol for the SAQ start mode consisted of the following steps: sample members first received an advance notification letter from NSF to acquaint them with the survey. A week later, the first questionnaire was mailed, followed by a thank-you/reminder postcard the following week. Approximately eight weeks after the first questionnaire mailing, sample members who had not returned a completed questionnaire received a second questionnaire. Four weeks later, any cases still not finalized received an e-mail prompt encouraging completion of the SAQ. Telephone follow-up to complete the CATI interview for all mail nonrespondents began four weeks after that.  Two additional prompting letters were sent late in the data collection period: the first approximately six months into the field period as part of an incentive experiment, and the final letter, announcing the end of the field period, at the beginning of December 2006.  The data collection protocols for the CATI and Web start mode groups were similar and ran in parallel to the SAQ protocol.

At any given time, a sample member could request to complete the survey in a mode other than the one originally assigned. A total of 41.0 percent of the sample members (n = 13,110) completed the survey in a mode different from their start mode.  Additionally, 238 cases had strongly refused to ever participate in prior SDR rounds; these cases received only a single letter at the start of data collection.  The data collection field period ended in December 2006, and the CATI, data entry, and Web operations were all closed in January 2007.

One exception to the data collection protocol for the new cohort is important to note.  As mentioned previously, the 2005 academic cohort was not available until June 2006, two months after the start of data collection.  These new cohort cases began data collection at the beginning of July 2006 and received a condensed data collection protocol.  Quality assurance procedures were in place at each step (address updating, printing, package assembly and mailout, questionnaire receipt, data entry, coding, CATI, and post-data-collection processing).

d. Estimation techniques

Because the SDR is a sample survey, sampling weights are needed to develop accurate population estimates. Weights are attached to each responding graduate record to make it simple to estimate characteristics of the population of graduates. The primary purpose of the weights is to correct for potential bias due to unequal selection probabilities and nonresponse. The final analysis weights were calculated in three stages:

  • First, a base weight was calculated for every case in the sample to account for its selection probability under the sample design.
  • Second, an adjustment for unknown eligibility was made to the base weight by distributing the weight of the unknown eligibility cases to the known eligibility cases proportionately to the observed eligibility rate within each adjustment class.
  • Third, an adjustment for nonresponse was made to the adjusted base weight to account for the eligible sample cases for which no response was obtained.
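
A toy version of the three stages listed above, under the simplifying assumption of a single adjustment class for both the unknown-eligibility and nonresponse adjustments; all field names are illustrative, not the actual SDR variable names:

```python
def compute_final_weights(cases):
    """Each case is a dict with: 'base_w' (the stage-1 design weight,
    i.e., the inverse of the selection probability), 'status' in
    {'eligible', 'ineligible', 'unknown'}, and 'responded' (bool).
    Mutates the dicts, adding 'final_w' for eligible respondents."""
    known = [c for c in cases if c["status"] != "unknown"]
    # Stage 2: spread the weight of unknown-eligibility cases over the
    # known-eligibility cases via a ratio adjustment.
    f_elig = sum(c["base_w"] for c in cases) / sum(c["base_w"] for c in known)
    for c in known:
        c["w2"] = c["base_w"] * f_elig
    # Stage 3: spread eligible nonrespondents' weight over respondents.
    eligible = [c for c in known if c["status"] == "eligible"]
    respondents = [c for c in eligible if c["responded"]]
    f_nr = sum(c["w2"] for c in eligible) / sum(c["w2"] for c in respondents)
    for c in respondents:
        c["final_w"] = c["w2"] * f_nr
    return cases
```

With equal base weights and one case in each status category, the single eligible respondent ends up carrying the weight of both the unknown-eligibility case and the eligible nonrespondent, so weighted totals still estimate the full population.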

3. Survey Quality Measures

a. Sampling variability

The sample size is sufficiently large that estimates based on the total sample should be subject to no more than moderate sampling error. However, sampling error can be quite substantial in estimating the characteristics of small subgroups of the population. Estimates of the sampling errors associated with various measures are included in the methodology report for the survey and in the basic publications.

b. Coverage

Coverage for the Survey of Earned Doctorates is believed to be excellent. Because the SED is the sample frame for most of the SDR sample, the SDR benefits from this excellent coverage. For years prior to 1957 (the commencement of the SED), the sample frame was compiled from a variety of sources. While this component of the sample frame was likely more subject to coverage problems than later cohorts, pre-1957 doctorates constitute less than 1 percent of the target population in 2006.

c. Nonresponse

(1) Unit nonresponse - The weighted response rate for this survey in 2006 was 78 percent. In order to minimize the impact of this source of error, data were adjusted for nonresponse through the use of statistical weighting techniques.

(2) Item nonresponse - In 2006, the item nonresponse rates for key items (employment status, sector of employment, field of occupation, and primary work activity) ranged from 0.0 percent to 2.2 percent. Some of the remaining variables had considerably higher nonresponse rates. For example, salary and earned income, particularly sensitive variables, had item nonresponse rates of 8.2 and 12.2 percent, respectively. Personal demographic data such as marital status, citizenship, and race/ethnicity had item nonresponse rates ranging from 0.0 to 3.6 percent. Any case missing the critical complete items was treated as survey nonresponse. Critical complete items included working for pay or profit, looking for work, last job code, principal job code, living in the U.S., and birth date.  All non-critical missing data were imputed using a hot-deck imputation procedure, except for verbatim text items and some coded variables based on verbatim text items.

(3) Imputation - The 2006 SDR used a combination of logical imputation and statistical imputation.

Logical Imputation. For the most part, logical imputation was accomplished as part of editing. In the editing phase, the answer to a question with missing data was sometimes determined by the answer to another question. In some circumstances, editing was also used to create "missing" data for statistical imputation [7].  During sample frame building for the SDR, some demographic frame variables were found to be missing for sample members [8].  The values for these variables were imputed at the frame construction stage and, when possible, updated with data obtained during data collection.

Statistical Imputation. The primary method of statistical imputation in the 2006 SDR was hot-deck imputation. Almost all SDR variables were subjected to hot-deck imputation, where each variable had its own class and sort variable structure created based on a regression analysis. Critical items (which must be complete for all completed cases) and text variables were not imputed.  Efforts were made to collect missing sampling variables directly from the sample members during data collection, in order to replace the imputed values with respondent-reported information.
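
A minimal sequential hot-deck sketch for a single variable follows. In the actual SDR, the class and sort variables for each item came from regression analysis; everything named below is illustrative:

```python
from itertools import groupby

def hot_deck_impute(records, var, class_vars, sort_vars):
    """Within each imputation class (defined by `class_vars`), sort
    records by `sort_vars` and fill each missing value (None) with
    the most recent reported value seen in that class (the 'donor').
    A leading missing value with no prior donor stays missing in
    this sketch."""
    def class_key(r):
        return tuple(r[v] for v in class_vars)

    ordered = sorted(records, key=lambda r: (class_key(r),
                                             tuple(r[v] for v in sort_vars)))
    for _, group in groupby(ordered, key=class_key):
        donor = None
        for rec in group:
            if rec[var] is not None:
                donor = rec[var]    # this record can serve as a donor
            elif donor is not None:
                rec[var] = donor    # impute from the nearest prior donor
    return records
```

For instance, a respondent with missing salary would receive the salary of the nearest preceding donor within the same degree-field class after sorting, rather than a donor from a different class.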

For some variables, there is no set of class and sort variables reliably related to, or predictive of, the missing value. In these instances, consistency was better achieved outside the hot-deck procedures using random imputation. For example, respondents with a missing marital status (question E1) may have answered questions E2 or E3 regarding their spouse or partner's employment status, which implies that E1 should be '1' (Married) or '2' (Living in a marriage-like relationship). Our procedure was to assign a random value for E1 with probability proportional to the number of cases in each of the valid values (e.g., if there are three married respondents for every respondent living in a marriage-like relationship, then missing values of E1 would be filled in with a '1' 75 percent of the time and a '2' 25 percent of the time).
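
The E1 example (three married respondents for every one in a marriage-like relationship) can be reproduced with a frequency-proportional random draw; the function name and the use of a seed are illustrative:

```python
import random

def random_impute(values, valid_counts, seed=None):
    """Fill each missing value (None) by drawing a code with
    probability proportional to its observed frequency among
    valid responses."""
    rng = random.Random(seed)
    codes = list(valid_counts)
    weights = [valid_counts[c] for c in codes]
    return [v if v is not None else rng.choices(codes, weights=weights)[0]
            for v in values]

# E1 example: codes '1' (Married) and '2' (Marriage-like relationship)
# observed in a 3:1 ratio, so a missing E1 becomes '1' about 75% of
# the time.
filled = random_impute(["1", None, "2", None], {"1": 3, "2": 1}, seed=42)
```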

d. Measurement

Several of the key variables in this survey are difficult to measure and thus are relatively prone to measurement error. For example, individuals do not always know the precise definitions of occupations that are used by experts in the field and may thus select occupational fields that are technically incorrect. In order to reduce measurement error, the instrument was pretested, using cognitive interviews and a mail pretest. The SDR instrument also benefited from the extensive pretesting of the NSCG and NSRCG instruments, because most SDR questions also appear on the NSCG and NSRCG.

As is true for any multimodal survey, it is likely that the measurement errors associated with the different modes are somewhat different. This possible source of measurement error is especially troublesome because the proclivity to respond by one mode or another is likely to be associated with variables of interest in the survey. To the extent that certain types of individuals may be relatively likely to respond by one mode compared with another, the multimodal approach may have introduced some systematic biases into the data. A study of differences across modes conducted after the 2003 survey showed that all three modes yielded comparable data for the most critical data items.  Further, data captured in the Web mode had lower item nonresponse for contact information variables and more complete verbatim responses for the occupation questions than data captured in the SAQ mode.

4. Trend Data

There have been a number of changes in the definition of the population surveyed over time. For example, prior to 1991, the survey included some individuals who had received doctoral degrees in fields outside of SEH or had received their degrees from non-U.S. universities. Because coverage of these individuals had declined over time, the decision was made to delete them beginning with the 1991 survey. Survey improvements made in 1993 were sufficiently great that SRS staff suggest that trend analyses between the data from the surveys after 1991 and the surveys in prior years must be performed very cautiously, if at all. Individuals who wish to explore such analyses are encouraged to discuss this issue further with the survey project officer listed below.

5. Availability of Data

The data from this survey are published biennially in Detailed Statistical Tables in the series Characteristics of Doctoral Scientists and Engineers in the United States, as well as in several InfoBriefs and Special Reports. Information from this survey is also included in Science and Engineering Indicators, Women, Minorities, and Persons With Disabilities in Science and Engineering, and Science and Engineering State Profiles.

Data are also available in the SESTAT data system. Selected aggregate data are available in public use data files upon request. For researchers interested in analyzing restricted microdata, access can be arranged through a licensing agreement.

Additional information about this survey may be obtained by contacting:

Daniel Foley
Survey Statistician
Human Resources Statistics Program
Division of Science Resources Statistics
National Science Foundation
4201 Wilson Boulevard, Suite 965
Arlington, VA 22230

Phone: (703) 292-7811
E-mail: dfoley@nsf.gov


Footnotes

[1] For a complete listing of occupational classifications used in the SDR, see http://sestat.nsf.gov/docs/occ03maj.html.

[2] Individuals who turn 76 on the survey reference date are considered eligible, in order to simplify survey operations.

[3] In 2003 and 2006, a sample of non-U.S. citizens who indicated plans to leave the United States was followed.  The purpose of this experiment (called the International SDR or "ISDR") was to determine the feasibility of locating, following, and obtaining information from this portion of new graduates, who are not included in the regular SDR sampling frame.

[4] The SDR frame is based on the first U.S. research doctoral degree earned. Persons who have earned two research doctoral degrees where the first degree was a non-S&E degree and the second degree was an S&E or health degree were not included in the SDR frame. Based on information collected annually by the SED on the number and characteristics of those earning two doctorates, this exclusion results in a slight undercoverage bias. In 2000, for example, the total number of double doctorate recipients with a non-S&E first degree and an S&E or health second doctorate was 154, representing 0.046 percent of the total number of S&E or health doctorates awarded in that period.

[5] Individuals are considered temporarily ineligible for a current cycle if they were either temporarily institutionalized or living outside the U.S. on the current SDR reference date. Those living outside the U.S. are classified as temporarily ineligible if the previous SDR surveys found they were (1) U.S. citizens; or (2) non-U.S. citizens living abroad for only the previous SDR cycle.

[6] Dillman, Don A. 1978. Mail and Telephone Surveys: The Total Design Method. New York: Wiley-Interscience.

[7] This type of edit would occur when the respondent provided data that were inconsistent with previously reported data or with the allowable range of responses according to the logic of the question. For instance, if the respondent reported working in 2006 and reported starting the job before 10/2003, but consistency checks show that the respondent marked never worked in 2003, then reported start year and month would be set to missing and a start date between 10/2003 and 4/2006, inclusive, would be imputed. Another example would be a case in which a respondent was asked to designate the two most important reasons for taking a post-doc, but reported the same reason for the first and second reason. The second reason would have to be set to missing and have its value imputed from the list of reasons provided in the previous question if the respondent supplied more than two valid reasons for pursuing a post-doc.

[8] A small number of 2006 SDR old cohort and new cohort cases are missing critical SDR sampling stratification variables. Those variables are race, ethnicity, citizenship and sex.

Last updated: February 27, 2008

 
