Statistical Policy Working Paper 21 - Indirect Estimators in Federal Programs
Prepared by the Subcommittee on Small Area Estimation, Federal Committee on Statistical Methodology

Statistical Policy Office, Office of Information and Regulatory Affairs, Office of Management and Budget

July 1993

MEMBERS OF THE FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY (July 1993)

Maria E. Gonzalez, Chair, Office of Management and Budget
Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Cynthia Z.F. Clark, National Agricultural Statistics Service
Steven Cohen, Administration for Health Policy and Research
Zahava D. Doering, Smithsonian Institution
Roger A. Herriot, National Center for Education Statistics
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, National Center for Education Statistics
Nancy Kirkendall, Energy Information Administration
Daniel Melnick, Substance Abuse and Mental Health Services Administration
Robert P. Parker, Bureau of Economic Analysis
Charles P. Pautler, Jr., Bureau of the Census
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Wesley L. Schaible, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census
Alan R. Tupek, National Science Foundation

PREFACE

The Federal Committee on Statistical Methodology was organized by OMB in 1975 to investigate issues of data quality affecting Federal statistics. Members of the committee, selected by OMB on the basis of their individual expertise and interest in statistical methods, serve in a personal capacity rather than as agency representatives. The committee conducts its work through subcommittees that are organized to study particular issues. The subcommittees are open by invitation to Federal employees who wish to participate.
Working papers are prepared by the subcommittee members and reflect only their individual and collective ideas.

The Subcommittee on Small Area Estimation was formed in 1991 to document the uses of indirect estimators by Federal statistical agencies to prepare and publish estimates. An indirect estimator uses values of the variable of interest from a domain and/or time period other than the domain and time period of the estimate being produced. Users of indirect estimators should consider the errors to which these estimates are subject.

Eight programs that publish indirect estimates are described in this report. These programs sometimes respond to legislative requirements or, alternatively, to State data needs. The programs and sponsor agencies are: infant and maternal health characteristics (National Center for Health Statistics (NCHS)); personal income, annual income, and gross product (Bureau of Economic Analysis); postcensal population estimates for counties (Bureau of the Census (BOC)); employment and unemployment for States (Bureau of Labor Statistics); cotton, rice, and soybean acreage (National Agricultural Statistics Service (NASS)); livestock inventories, crop production, and acreage (NASS); disabilities, hospital utilization, physician and dental visits (NCHS); and median income for 4-person families (BOC).

The Subcommittee on Small Area Estimation was chaired by Wesley L. Schaible of the Bureau of Labor Statistics, Department of Labor.

MEMBERS OF THE SUBCOMMITTEE ON SMALL AREA ESTIMATION

Wesley L. Schaible (Chair), Bureau of Labor Statistics, Department of Labor
Robert E. Fay, Bureau of the Census, Department of Commerce
Joe Fred Gonzalez, National Center for Health Statistics, Department of Health and Human Services
Linnea Hazen, Bureau of Economic Analysis, Department of Commerce
William C. Iwig, National Agricultural Statistics Service, Department of Agriculture
John F. Long, Bureau of the Census, Department of Commerce
Donald J. Malec, National Center for Health Statistics, Department of Health and Human Services
Alan R. Tupek, National Science Foundation

ACKNOWLEDGMENTS

This report is the result of the collaborative efforts of members of the Subcommittee on Small Area Estimation of the Federal Committee on Statistical Methodology. Subcommittee members volunteered a considerable amount of time over a two-year period to complete individual chapters in the report. Chapter authors are identified at the beginning of each chapter. Although the introductory and concluding chapters were authored by the Subcommittee Chair, they resulted from discussions which included the entire Subcommittee as well as other interested parties.

Throughout the preparation of the report, a number of reviewers read drafts and provided valuable comments. The Subcommittee thanks Maria Gonzalez, Chair of the Federal Committee on Statistical Methodology, for her support and contributions throughout the development and preparation of the report. The Subcommittee also expresses its appreciation to the members of the Federal Committee on Statistical Methodology for reviewing the report and providing many useful suggestions. The Subcommittee extends special thanks to the following Committee members: Yvonne Bishop, Cynthia Clark, Robert Parker, David Pierce, Thomas Plewes, Monroe Sirken, Robert Tortora, and, in particular, Fritz Scheuren. The Subcommittee also thanks Alan Dorfman, Steve Miller, and especially Robert Casady, all of the Bureau of Labor Statistics, for helpful discussions and comments. In addition, the Subcommittee extends its thanks to Gordon Brackstone and the staff of Statistics Canada, to Wayne Fuller of Iowa State University, and especially to Graham Kalton of Westat, Inc. for valuable comments and the time so generously provided to review the report.

TABLE OF CONTENTS

Chapter 1. Introduction and Summary
Chapter 2. Synthetic Estimation in Followback Surveys at the National Center for Health Statistics
Chapter 3. State, Metropolitan Area, and County Income Estimation
Chapter 4. Postcensal Population Estimates: States, Counties and Places
Chapter 5. Bureau of Labor Statistics' State and Local Area Estimates of Employment and Unemployment
Chapter 6. County Estimation of Crop Acreage Using Satellite Data
Chapter 7. The National Agricultural Statistics Service County Estimates Program
Chapter 8. Model-Based State Estimates from the National Health Interview Survey
Chapter 9. Estimation of Median Income for Four-Person Families by State
Chapter 10. Recommendations and Cautions

CHAPTER 1
Introduction and Summary

1.1 Introduction

Federal statistical agencies produce estimates of a variety of population quantities for both the nation as a whole and for subnational domains. Domains are commonly defined by demographic and socioeconomic variables. However, geographic location is perhaps the single variable used most frequently to define domains. Regions, states, counties, and metropolitan areas are common geographic domains for which estimates are required.

Federal agencies use different data systems and estimation methods to produce domain estimates. Those systems designed for the purpose of producing published estimates use standard, direct estimation methods. Data systems are designed within time, cost and other constraints which restrict the number of estimates that can be produced by standard methods.
However, the demand for additional information and the lack of resources to design the required data systems have led federal statistical agencies to consider non-standard methods. Estimation methods of a particular type, referred to as small area or indirect estimators, have sometimes been used in these situations.

The purpose of this report is to document, in a manner that will facilitate comparisons, the practices and estimation methods of the federal statistical programs that use indirect estimators. Only programs that use indirect estimators for the production of published estimates are included; whether a data system is based on a census (including administrative records) or a sample survey has no bearing on the inclusion of a program. The focus of this report is on the method by which estimates are produced. The methods and practices of eight programs are documented here; three are located in the Department of Commerce, two in the Department of Agriculture, two in the Department of Health and Human Services, and one in the Department of Labor.

Other applications of indirect estimators occur in federal statistical agencies but descriptions of these applications have not been included in this report. Most of these methods were not included because they were not used to produce published estimates. This publication restriction, a somewhat arbitrary indicator of program importance, keeps the number of programs included to a manageable level but leads to the omission of other interesting methods (for example, Fay and Herriot 1979).

This introductory chapter includes brief discussions of small area estimation terminology; definitions of direct and indirect estimators; some characteristics of indirect estimators; and summary descriptions of the programs included in the report. Each program is documented, following a standard format, in the individual chapters of the report.
The intent is to create program descriptions that will not only provide complete, self-contained documentation for each individual program but also facilitate comparisons among programs. Although the focus of the report is on estimation methods, the description of each program includes material on program history, policies, evaluation practices, estimation methods, and current problems and activities. In addition to the standard chapter format, attempts have been made to employ common notation throughout the chapters to facilitate comparisons of estimation methods. The report concludes with a number of recommendations and cautions.

1.2 Terminology

Terms used to describe indirect estimators can be confusing. Increased interest in nontraditional estimators for domain statistics has occurred recently among survey statisticians and, even though the term "small area estimator" is commonly used, uniform terminology has not yet evolved. This term is frequently used because in most applications of these estimators the domains of interest have been geographic areas. However, the word "small" is misleading. It is the small number of sample observations and the resulting large variance of standard direct estimators that is of concern, rather than the size of the population in the area or the size of the area itself. The word "area" is also misleading since these methods may be applied to any arbitrary domain, not just those defined by geographic boundaries.

Other terms used to describe these estimators include "local area" (Ericksen 1974), "small domain" (Purcell and Kish 1979), "subdomain" (Laake 1979), "small subgroups" (Holt, Smith, and Tomberlin 1979), "subprovincial" (Brackstone 1987), "indirect" (Dalenius 1987), and "model-dependent" (Sarndal 1984). The term "synthetic estimator" has also been used to describe this class of estimators (NIDA 1979) and, in addition, to describe a specific indirect estimator (NCHS 1968).
Survey practitioners sometimes refer to indirect estimators as "model-based" whereas this term is rarely, if ever, used to describe direct estimators. However, direct estimators can be motivated by and justified under models as readily as indirect estimators. There is also lack of agreement on what to call the class of direct estimators. In addition to "direct" (Royall 1973), authors have used "unbiased" (Gonzalez 1973), "standard" (Holt, Smith, and Tomberlin 1979), "valid" (Gonzalez 1979), and "sample-based" (Kalton 1987). In the remainder of this paper, the words "direct" and "indirect" will be used to describe traditional and small area estimators, respectively. 1.3 Direct and Indirect Estimators Perhaps the most common measure of error of an estimator is the mean square error, composed of the sum of the variance of the estimator and the squared bias of the estimator. Biases can rarely be estimated with any degree of confidence. If an estimator is unbiased or approximately unbiased, the variance of the estimator, which can be estimated from the available data, is a satisfactory measure of error of the estimator. This leads to the selection of estimators that are unbiased or approximately unbiased in most applications. Such estimators allow data systems to be designed so that estimates with a predictable level of error can be produced with high probability and, in addition, estimated measures of error can be provided to accompany estimates. Federal statistical programs are generally designed using direct estimators which are unbiased, or approximately unbiased, under finite population sampling theory. Samples are often used and, given adequate resources, the sample design specifies population and domain sample sizes large enough to produce direct estimates that meet reliability requirements for the survey. 
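The decomposition of the mean square error stated at the beginning of this section can be written out explicitly. The notation here (θ for the population quantity and θ-hat for its estimator) is an assumption introduced for illustration and is not taken from the report.

```latex
\mathrm{MSE}(\hat{\theta})
  = E\!\left[(\hat{\theta} - \theta)^{2}\right]
  = \mathrm{Var}(\hat{\theta}) + \left[\mathrm{Bias}(\hat{\theta})\right]^{2},
\qquad
\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta .
```

When the bias is zero or negligible, the second term vanishes and the variance alone is an adequate measure of error, which is why unbiased or approximately unbiased estimators are preferred at the design stage.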
When a domain sample size is too small to make a reliable domain estimate using the direct estimator, a decision must be made whether to produce estimates using an alternative procedure. The alternative estimators considered are those that increase the effective sample size and decrease the variance by using data from other domains and/or time periods through models that assume similarities across domains and/or time periods. These estimators are generally biased, but if the mean square error of the alternative estimator can be demonstrated to be small compared to the variance of the direct estimator, the selection of the alternative estimator may be justified. In extreme situations, there may be no sample units in the domain of interest and, if an estimate is to be produced, an alternative estimator will be required. Indirect estimators have been characterized in the Bayesian and empirical Bayes literature as estimators that "borrow strength" by the use of values of the variable of interest from domains other than the domain of interest. This approach can be used to provide a working definition of direct and indirect estimators for a broad class of population quantities including means and totals. A direct estimator uses values of the variable of interest only from the time period of interest and only from units in the domain of interest. An indirect estimator uses values of the variable of interest from a domain and/or time period other than the domain and time period of interest. Three types of indirect estimators can be identified. A domain indirect estimator uses values of the variable of interest from another domain but not from another time period. A time indirect estimator uses values of the variable of interest from another time period but not from another domain. An estimator that is both domain and time indirect uses values of the variable of interest from another domain and another time period. 
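The working definitions above depend only on which domains and time periods supply values of the variable of interest. A minimal Python sketch of that classification follows; the `Observation` type and the `classify_estimator` function are hypothetical names introduced here for illustration, not part of any program described in the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Where and when a value of the variable of interest was observed."""
    domain: str
    period: int


def classify_estimator(observations, target_domain, target_period):
    """Classify an estimator by the origin of the values it uses.

    Follows the working definition in the text: only values of the
    variable of interest matter, not auxiliary variables.
    """
    uses_other_domain = any(o.domain != target_domain for o in observations)
    uses_other_period = any(o.period != target_period for o in observations)
    if uses_other_domain and uses_other_period:
        return "domain and time indirect"
    if uses_other_domain:
        return "domain indirect"
    if uses_other_period:
        return "time indirect"
    return "direct"


# Hypothetical inputs: an estimate for state "MD" in 1990.
same_state_same_year = [Observation("MD", 1990)]
other_states_same_year = [Observation("MD", 1990), Observation("VA", 1990)]
same_state_other_years = [Observation("MD", 1989), Observation("MD", 1990)]
other_states_other_years = [Observation("VA", 1989), Observation("MD", 1990)]

for obs in (same_state_same_year, other_states_same_year,
            same_state_other_years, other_states_other_years):
    print(classify_estimator(obs, "MD", 1990))
```

The classification deliberately ignores auxiliary variables, since the definition of direct and indirect estimators refers only to values of the variable of interest.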
Indirect estimators depend on values of the variable of interest from domains and/or time periods other than that of interest. These values are brought into the estimation process through a model that, except in the most trivial case, depends on one or more auxiliary variables that are known for the domain and time period of interest. To the extent that applicable models can be identified and the required auxiliary variables are available, indirect estimators can be created to produce estimates.

Perhaps the simplest example of an indirect estimator is the use of the sample mean of the entire sample as the estimator for a specific domain; for example, the mean from a national sample may be used as an estimate for a particular state. To the extent that information related to the variable of interest is available for the state, an indirect estimator which is "better" than the national mean can be defined. The availability of auxiliary variables and an appropriate model relating the auxiliary variables to the variable of interest are crucial to the formation of indirect estimators. However, the definition of direct and indirect estimators does not depend on whether or not auxiliary variables from outside the domain or time period of interest are used.

The clear distinction between direct and indirect estimators made in the discussion above reflects the situation during the design stage of a data system. However, when estimators reflect the realities associated with data system implementation, the distinction becomes a little less clear. For example, nonresponse is a common problem in data collection efforts. To the extent that nonresponse occurs, even direct estimators must rely on model-based assumptions relating the known information for responders to the unknown information for nonresponders.
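The simplest indirect estimator mentioned above, the mean of the entire sample used as the estimate for one domain, can be contrasted with the direct estimator in a few lines of Python. This is a sketch under simplifying assumptions: the sample is treated as unweighted, and the state labels and values are made up for illustration.

```python
from statistics import mean

# Hypothetical unweighted sample values of the variable of interest, by state.
# State "D" has no sample units at all.
sample = {
    "A": [10.0, 14.0],
    "B": [20.0, 22.0, 24.0],
    "C": [30.0],
    "D": [],
}


def direct_estimate(sample, domain):
    """Mean of the values observed in the domain of interest only."""
    values = sample[domain]
    return mean(values) if values else None  # undefined with no sample units


def national_mean_estimate(sample):
    """Domain indirect estimator: the mean of the entire sample."""
    all_values = [v for values in sample.values() for v in values]
    return mean(all_values)


national = national_mean_estimate(sample)
for state in sample:
    print(state, direct_estimate(sample, state), national)
```

The direct estimates differ by state and are unavailable for state "D", while the national mean gives the same value for every state. That is the trade described in the text: the indirect estimator is always available and less variable, but it is biased for any state whose true mean differs from the national mean.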
1.4 Organization of Program Chapters

As discussed in the previous section, indirect estimators borrow strength and can be classified into three types: domain indirect, time indirect, and domain/time indirect. In addition to this classification, indirect estimators are commonly expressed in different forms, that is, different algebraic expressions. Each of the eight programs described in this report uses one of the following three common indirect estimators: synthetic, regression, or composite. The order of chapters describing programs follows this classification of estimators. That is, the program that uses a synthetic estimator is presented first in Chapter 2, followed by the programs that use regression estimators in Chapters 3 through 6; those programs that use composite estimators are presented in Chapters 7 through 9. Some of the programs have used different estimators at different times; however, emphasis is placed on the estimator that was last used to publish estimates.

As with all indirect estimators, synthetic estimators may be domain indirect, time indirect, or domain and time indirect. For example, a domain indirect synthetic estimator for a population total in domain d and time t may be written as

[Equation graphic not reproduced in this text version.]

Regression estimators may be direct or, like the synthetic estimator, domain indirect, time indirect, or domain and time indirect depending on how the parameters are estimated. For example, a domain indirect regression estimator for a population total may be written as

[Two equation graphics not reproduced in this text version.]

It should be noted that not all indirect estimators are linear. For examples of nonlinear indirect estimators see MacGibbon and Tomberlin (1989) and Malec, Sedransk, and Tompkins (1993). This latter, nonlinear indirect estimator is being considered for use in conjunction with the National Health Interview Survey and is discussed in Chapter 8.
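The equations in this section appear as graphics in the original and are not available in this text version. For orientation only, the three estimator families are often written in forms like the following. These are commonly used textbook forms, not necessarily the expressions or the notation used in the report; every symbol below is an assumption introduced here.

```latex
% Synthetic (domain indirect): known domain-by-group auxiliary totals X_{d,g,t}
% applied to group means \bar{y}_{\cdot g,t} estimated from all domains.
\hat{Y}^{\,\mathrm{syn}}_{d,t} = \sum_{g} X_{d,g,t}\, \bar{y}_{\cdot g,t}

% Regression (domain indirect): auxiliary vector x_{d,t} for the domain,
% coefficients \hat{\beta}_{t} estimated from data on all domains at time t.
\hat{Y}^{\,\mathrm{reg}}_{d,t} = \mathbf{x}_{d,t}^{\top} \hat{\boldsymbol{\beta}}_{t}

% Composite: weighted average of a direct and an indirect estimator,
% with weight 0 \le w_{d} \le 1.
\hat{Y}^{\,\mathrm{com}}_{d,t} = w_{d}\, \hat{Y}^{\,\mathrm{dir}}_{d,t}
  + (1 - w_{d})\, \hat{Y}^{\,\mathrm{ind}}_{d,t}
```

In each form the estimate for domain d draws on values of the variable of interest from outside d, either through the pooled group means, the pooled regression coefficients, or the indirect component of the composite.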
1.5 Characteristics of Indirect Estimators

There are several fairly well-known characteristics of indirect estimators that are important for producers and users to keep in mind.

o In general, indirect estimators have relatively small variances since they incorporate observations not only from the domain and time period of interest but also from other domains and/or time periods. The variance of a modified synthetic estimator is discussed by Holt, Smith, and Tomberlin (1979) and variances of several indirect estimators resulting from different prediction models are discussed by Royall (1979). Care must be taken since the variance alone may be a misleading measure of error. See, for example, Raback and Sarndal (1982) and Sarndal and Hidiroglou (1989).

o An indirect estimator will be biased if the model assumptions leading to the estimator are not satisfied. Even so, an indirect estimator may be a useful alternative to a direct estimator when the mean squared error of the indirect estimator is sufficiently small compared to the variance of the direct estimator. However, the magnitude of the bias is likely to vary with each application and estimation of biases is difficult. Gonzalez and Waksberg (1973) consider the problem of estimating the mean squared error of synthetic estimators, and Prasad and Rao (1990) discuss the estimation of the mean squared error of indirect estimators. Care must be taken when interpreting estimated mean squared errors of indirect estimators. Some approaches provide an average measure over all domains rather than a measure associated with a specific domain. The construction of confidence intervals for biased estimators is a related issue that has been addressed by Miller (1992).

o For a given application and estimator, biases in different domains will differ since the model will likely be a better representation of reality in some domains than in others. In general, when an indirect estimator is used to produce estimates for a number of domains, the distribution of estimates will have a smaller variance than the corresponding distribution of domain population values. This is a result of the tendency for indirect estimators to have relatively small biases when domain population values are close to the overall population value and, when domain population values are not close to the overall population value, to have relatively large directional biases which make the estimates closer to the overall population value. There is considerable evidence illustrating this characteristic (Gonzalez and Hoza 1978; Schaible et al. 1977 and 1979; and Heeringa 1981). Not all indirect estimators display this characteristic to the same extent. Spjotvoll and Thomsen (1987), Lahiri (1990), and Ghosh (1992) have addressed this problem and suggest constrained approaches.

o From a model-based, prediction point of view, direct and indirect estimators are model unbiased under the model that generates the estimator. A direct estimator is robust against model failure in the sense that it is unbiased, not only under the domain/time specific model which generates the estimator, but under each of the models associated with the corresponding indirect estimators. Indirect estimators are not robust in the same sense. However, the domain indirect estimator and the time indirect estimator are both more robust against model failure, in a similar sense, than the estimator that is both domain and time indirect. The bias of indirect estimators, under the domain and time specific model, is a source of concern that results in a reluctance to fully accept indirect estimators in many applications. An example and additional discussion of this aspect of indirect estimator bias is given in Schaible (1993).

1.6 Program Summaries

The programs described in this report were initiated in response to a variety of needs and directives.
Several are a direct result of legislative requirements to allocate federal funds. Others were created in response to state needs for data and/or to standardize estimation methods across states. Others are viewed as research programs that periodically publish estimates when an improved methodology has been developed. Table 1 below allows a comparison of summary information on the programs described in Chapters 2 through 9 in this report. The eight programs that use indirect estimators to publish estimates are located in five large statistical agencies. In some instances, a program produces estimates for a single variable; in other instances, estimates are produced for numerous variables. States and counties are the only domains for which indirect estimates are presently published. Four of the programs publish estimates for states, three for counties, and one for both states and counties. There is considerable variability in the frequency with which estimates are published. Two programs publish estimates only periodically, every few years. The remainder publish indirect estimates on a fixed schedule: four annually, one annually with selected estimates on a quarterly schedule, and one monthly. As noted above, a variety of indirect estimators are used to produce estimates. Synthetic, regression, and composite estimators that borrow strength over domains, over time, or over both domain and time are found among these programs. The estimation procedures for six of the programs are based on data from sample surveys. There is no sampling involved in the procedures used in the two programs that produce estimates of personal income and postcensal populations. Given the differing demands on Federal statistical agencies, it is not surprising that considerable variation is seen in the programs described in this report. Further investigations and improvements in the quality of indirect estimates published by Federal agencies are needed. 
It is hoped that recognition of the differences, as well as the similarities, in these programs will help provide a foundation for this further effort.

[Graphic not reproduced in this text version.]

REFERENCES

Brackstone, G.J. (1987), "Small Area Data: Policy Issues and Technical Challenges," in Small Area Statistics, New York: John Wiley and Sons.

Dalenius, T. (1987), "Panel Discussion," in Small Area Statistics, New York: John Wiley and Sons.

Ericksen, E.P. (1974), "A Regression Method for Estimating Population Changes of Local Areas," Journal of the American Statistical Association, 69, 867-875.

Fay, R.E. and Herriot, R.A. (1979), "Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data," Journal of the American Statistical Association, 74, 269-277.

Ghosh, M. (1992), "Constrained Bayes Estimation With Applications," Journal of the American Statistical Association, 87, 533-540.

Gonzalez, M.E. (1973), "Use and Evaluation of Synthetic Estimates," Proceedings of the Social Statistics Section, American Statistical Association, 33-36.

Gonzalez, M.E. (1979), "Case Studies on the Use and Accuracy of Synthetic Estimates: Unemployment and Housing Applications," in Synthetic Estimates for Small Areas (National Institute on Drug Abuse, Research Monograph 24), Washington, D.C.: U.S. Government Printing Office.

Gonzalez, M.E. and Hoza, C. (1978), "Small-Area Estimation with Application to Unemployment and Housing Estimates," Journal of the American Statistical Association, 73, 7-15.

Gonzalez, M.E. and Waksberg, J. (1973), "Estimation of the Error of Synthetic Estimates," paper presented at the first meeting of the International Association of Survey Statisticians, Vienna, Austria, 18-25 August, 1973.

Heeringa, S.G. (1981), "Small Area Estimation Prospects for the Survey of Income and Program Participation," Proceedings of the Section on Survey Research Methods, American Statistical Association, 133-138.

Holt, D., Smith, T.M.F., and Tomberlin, T.J. (1979), "A Model-Based Approach to Estimation for Small Subgroups of a Population," Journal of the American Statistical Association, 74, 405-410.

Kalton, G. (1987), "Panel Discussion," in Small Area Statistics, New York: John Wiley and Sons.

Laake, P. (1979), "A Prediction Approach to Subdomain Estimation in Finite Populations," Journal of the American Statistical Association, 74, 355-358.

Lahiri, P. (1990), "Adjusted Bayes and Empirical Bayes Estimation in Finite Population Sampling," Sankhya B, 52, 50-66.

MacGibbon, B. and Tomberlin, T.J. (1989), "Small Area Estimation of Proportions Via Empirical Bayes Techniques," Survey Methodology, 15-2, 237-252.

Malec, D., Sedransk, J., and Tompkins, L. (1993), "Bayesian Predictive Inference for Small Areas for Binary Variables in the National Health Interview Survey," in Case Studies in Bayesian Statistics, eds. Gatsonis, Hodges, Kass and Singpurwalla, New York: Springer Verlag.

Miller, S.M. (1992), "Confidence Interval Coverage for Biased Normal Estimators," Proceedings of the Section on Survey Research Methods, American Statistical Association.

National Center for Health Statistics (1968), Synthetic State Estimates of Disability (PHS Publication No. 1759), Washington, D.C.: U.S. Government Printing Office.

National Institute on Drug Abuse (1979), Synthetic Estimates for Small Areas (NIDA Research Monograph 24), Washington, D.C.: U.S. Government Printing Office.

Prasad, N.G.N. and Rao, J.N.K. (1990), "The Estimation of the Mean Squared Error of Small Area Estimators," Journal of the American Statistical Association, 85, 163-171.

Purcell, N.J. and Kish, L. (1979), "Estimation for Small Domains," Biometrics, 35, 365-384.

Raback, G. and Sarndal, C.E. (1982), "Variance Reduction and Unbiasedness for Small Area Estimators," Proceedings of the Social Statistics Section, American Statistical Association, 541-544.

Royall, R.A. (1973), "Discussion of papers by Gonzalez and Ericksen," Proceedings of the Social Statistics Section, American Statistical Association, 42-43.

Royall, R.A. (1979), "Prediction Models in Small Area Estimation," in Synthetic Estimates for Small Areas (National Institute on Drug Abuse, Research Monograph 24), Washington, D.C.: U.S. Government Printing Office.

Sarndal, C.E. (1984), "Design-Consistent versus Model-Dependent Estimation for Small Domains," Journal of the American Statistical Association, 79, 624-631.

Sarndal, C.E. and Hidiroglou, M.A. (1989), "Small Domain Estimation: A Conditional Analysis," Journal of the American Statistical Association, 84, 266-275.

Schaible, W.L. (1993), "Use of Small Area Estimators in U.S. Federal Programs," in Small Area Statistics and Survey Designs, Vol. 1, Central Statistical Office, Warsaw, Poland.

Schaible, W.L., Brock, D.B., and Schnack, G.A. (1977), "An Empirical Comparison of the Simple Inflation, Synthetic and Composite Estimators for Small Area Statistics," Proceedings of the Social Statistics Section, American Statistical Association, 1017-1021.

Schaible, W.L., Brock, D.B., Casady, R.J., and Schnack, G.A. (1979), Small Area Estimation: An Empirical Comparison of Conventional and Synthetic Estimators for States (PHS Publication No. 80-1356), Washington, D.C.: U.S. Government Printing Office.

Spjotvoll, E. and Thomsen, I. (1987), "Application of Some Empirical Bayes Methods to Small Area Statistics," Proceedings of the International Statistical Institute, Vol. 2, 435-449.

CHAPTER 2
Synthetic Estimation in Followback Surveys at the National Center for Health Statistics

Joe Fred Gonzalez, Jr., Paul J. Placek, and Chester Scott
National Center for Health Statistics

2.1 Introduction and Program History

The National Center for Health Statistics (NCHS) through its vital registration system collects and publishes data on vital events (births and deaths) for the United States (NCHS 1989).
NCHS produces national, State, county, and smaller area vital statistics for sociodemographic and health characteristics which are available from birth and death certificates. The Division of Vital Statistics of NCHS produces annual summary tables for the United States showing trends in period and cohort fertility measures and characteristics of live births. Also, NCHS produces detailed tabulations by place of residence and occurrence for each State, county, and city with a population of 10,000 or more by race and place of delivery and place of residence for population-size groups in metropolitan and nonmetropolitan counties within each State by race, attendant and place of delivery, and birth weight. These statistics are based on a complete count of vital records. In addition to the limited vital statistics tabulations which are produced annually, there has always been a continuing need for more detailed national and State level estimates of health status, health services, and health care utilization related to vital events. Because vital records (birth and death certificates) serve both legal and statistical purposes, they provide limited social, demographic, health, and medical information. Each vital record is a one page document with extremely limited information. The data from these vital records can be augmented, however, through periodic "followback" surveys. These surveys are referred to as "followback" because they obtain additional information from sources named on the vital record. A followback survey is a cost effective means of obtaining supplementary information for a sample of vital events. From the sample it is possible to make national estimates of vital events according to characteristics not otherwise available. 
Examples of supplementary information which may be needed by health researchers, health program planners, and health policy makers are: mother's smoking habits before and during pregnancy; complications of pregnancy; drug or surgical procedure to induce or maintain labor; amniocentesis during pregnancy; electronic fetal monitoring; respiratory distress syndrome; infant jaundiced; medical x-ray use; birth injuries; and, congenital anomalies. Periodic followback surveys respond to the changing data needs of the public health community without requiring changes in the vital record forms. The specific NCHS followback surveys that will be discussed in this chapter are the 1980 National Natality Survey (NNS) and the 1980 National Fetal Mortality Survey (NFMS) (NCHS 1986). In order to produce State estimates for certain health characteristics not available on the vital records, synthetic estimation (NCHS 1984a, 1984b) was applied to national data from the 1980 NNS and 1980 NFMS. In addition to the usual appeal of using synthetic estimation over direct estimation, especially when small sample sizes are concerned, synthetic State estimates were compared to direct State estimates as well as the "true" values for a limited number of variables from State vital statistics via fetal death records and birth and death certificates. 2.2 Program Description, Policies and Practices The 1980 NNS is based on a probability sample of 9,941 from a universe of 3,612,258 live births that occurred in the United States during 1980. The NNS sample included a four-fold oversampling of low birth weight infants. The live birth certificate represents the basic source of information. Based on information from the sample birth certificates, eight page Mother's questionnaires were mailed to mothers who were married. 
These mothers were asked to provide information on prenatal health practices, prenatal care, previous pregnancies, and social and demographic characteristics of themselves and their husbands. Each mother was also asked to sign a consent statement authorizing NCHS to obtain supplemental information from her medical records. If the mother did not respond after two questionnaires were sent by mail, a telephone interviewer attempted to complete an abbreviated questionnaire and to obtain a consent statement. To ensure their privacy, unmarried mothers were not contacted. As a result of sending the Mother's questionnaire only to married mothers, the 1980 NNS population of inference for data collected through the Mother's questionnaire was 2,944,580 live births. Regardless of the mother's marital status, questionnaires were mailed to the hospitals and to the attendants at delivery (for example, physicians or nurse-midwives) named on the birth certificates. A questionnaire was sent to the hospital for each sample birth that occurred either in a hospital or en route to a hospital. If the mother signed a consent statement authorizing NCHS to obtain supplemental medical information, a copy was included with the questionnaire. The focus of the hospital questionnaire was on characteristics of labor and delivery, health characteristics of the mother and infant, information on prenatal care visits, and information on radiation examinations and treatments received by the mother during the 12 months before delivery of the sample birth. For the hospital component of the 1980 NNS, the population of inference was 3,580,700 live births. The 1980 NNS is composed of information from birth certificates and information from questionnaires sent to married mothers, hospitals, attendants at delivery, and providers of radiation examinations and treatments. 
The survey represents an extensive source of information concerning specific maternal and child health conditions and obstetric practices for live births in the United States. The 1980 NNS response rates were 79.5 percent for mothers, 76.1 percent for hospitals, and 61.6 percent for physicians. The 1980 NFMS is based on a probability sample of 6,386 fetal deaths (out of a universe of 19,202 fetal deaths) with gestation of 28 weeks or more, or delivery weight of 1,000 grams or more, that occurred in the United States during 1980. The report of fetal death represents the basic source of information in this survey. Married mothers, hospitals, attendants at delivery, and providers of radiation examinations and treatments were surveyed under the same conditions as those described for the 1980 NNS. The 1980 NFMS populations of inference for all fetal deaths, fetal deaths in hospitals, and fetal deaths to married mothers were 19,202, 18,930, and 14,790, respectively. The same questionnaires were used for both surveys. Although some questions pertained only to live births and others pertained only to fetal deaths, instructions to skip inappropriate questions were included in the questionnaires. The sampling design for the NFMS was developed so that the NFMS would be large enough to permit comparisons between live births in the NNS and fetal deaths in the NFMS. The 1980 NFMS response rates were 74.5 percent for mothers, 74.0 percent for hospitals, and 55.0 percent for physicians. Table 1 presents the 1980 NNS and NFMS distribution of sample cases of live births and fetal deaths by State of occurrence. As shown in Table 1, it may be possible to produce direct State level estimates of certain health characteristics for some of the larger States. However, the sample sizes for most States are generally too small to produce reliable direct State estimates. 
This was the main justification for exploring synthetic State estimation as an alternative for producing State level estimates. 2.3 Estimator Documentation The underlying rationale for synthetic estimation is that the distribution of a health characteristic is highly related to the demographic composition of the population (NCHS 1984a). It is assumed that differences in the prevalence of the characteristics between two areas are due primarily to differences in demographic composition (e.g. age, race, sex, etc.). That is, it is assumed that a particular measure would be the same in two States that had the same population composition with respect to certain demographic variables. This rationale was used to select the demographic variables that were deemed to be the most appropriate and relevant to the 1980 NNS and NFMS in order to produce Synthetic State estimates. The following is the basic estimator that was used to produce Synthetic State estimates of proportions for certain health variables from the 1980 National Natality Survey (NNS) and the 1980 National Fetal Mortality Survey (NFMS). Click HERE for graphic. Table 2 gives an illustration of the computation of the synthetic State estimate of the percent jaundiced infants in Pennsylvania in 1980. The stub of Table 2 shows the 25 demographic cells (race, age of mother, and live-birth order groups) that were used to produce the Synthetic State estimates. Column (1) shows the national (based on the 1980 NNS) estimates of percent of live births that were jaundiced in each of the respective 25 demographic cells. Column (2) shows the number of hospital births (derived from State Vital Registration System) within the 25 demographic cells in Pennsylvania. Column (3), the estimated number of jaundiced live births in Pennsylvania, is computed by taking the product of entries in columns (1) and (2) within each of the 25 respective cells. 
Finally, the Synthetic State estimate is found by taking the ratio of the sum of column (3) to the sum of column (2). Since there were three different populations of inference (all vital events, vital events to married mothers, and vital events in hospitals) for each of the 1980 NNS and NFMS, appropriate State aggregates of vital events were incorporated into the calculation of corresponding synthetic State estimates (NCHS 1984a, 1984b). 2.4 Evaluation Practices The following is a description of some of the tabulations that were produced. Table 3 gives Synthetic State estimates of 11 health characteristics of mothers and infants for five selected States. A complete listing of all 57 NNS/NFMS health variables for which Synthetic State estimates were produced can be found in Tables 2-8 of NCHS (1984a, 1984b). The synthetic State estimates are subject to sampling error because they are based on corresponding national estimates derived from the 1980 NNS and NFMS by race, maternal age, and live-birth order group. Because they inherit the standard errors of the national estimates, the standard errors of the synthetic State estimates are relatively small. The standard errors for the NNS and NFMS were estimated by a balanced repeated replication procedure using 20 replicate half samples. This procedure estimates the standard errors for survey estimates through the observation of the variability of estimates based on replicate half samples of the total sample. This variance estimation procedure was developed and described by McCarthy (NCHS 1966, 1969). Although the synthetic State estimate has a relatively small standard error, it is subject to bias. The bias is a measure of the extent to which the national maternal age, race, live-birth order specific estimates differ from the true values for a given State. 
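As an illustration only, the column-by-column computation described for Table 2, and the comparison of a synthetic estimate with a known State value, can be sketched in Python as follows. The function names, the three demographic cells, and all numbers are hypothetical and are not taken from the 1980 NNS or from Table 2; the actual estimates used 25 cells.

```python
from typing import Sequence


def synthetic_estimate(national_rates: Sequence[float],
                       state_counts: Sequence[float]) -> float:
    """Synthetic State proportion: national cell rates weighted by State cell counts.

    national_rates -- column (1): national proportion with the characteristic, per cell
    state_counts   -- column (2): number of State vital events, per cell
    """
    if len(national_rates) != len(state_counts):
        raise ValueError("one rate and one count are needed for every cell")
    total_events = sum(state_counts)
    if total_events <= 0:
        raise ValueError("State counts must have a positive total")
    # Column (3): expected number of events with the characteristic in each cell.
    expected = [rate * count for rate, count in zip(national_rates, state_counts)]
    # Ratio of the sum of column (3) to the sum of column (2).
    return sum(expected) / total_events


def bias(estimate: float, true_value: float) -> float:
    """Difference between an estimate and a known State value."""
    return estimate - true_value


# Hypothetical example with three cells (assumed values, for illustration only).
rates = [0.20, 0.25, 0.10]        # national proportions by cell
counts = [1000.0, 3000.0, 1000.0]  # State events by cell
estimate = synthetic_estimate(rates, counts)
print(round(estimate, 4))
print(round(bias(estimate, 0.23), 4))  # 0.23 is an assumed "true" State value
```

The sketch makes the underlying assumption visible: the State enters only through its cell counts, so any difference between State and national rates within a cell goes directly into the bias.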
The closer the demographic variables used in the synthetic estimate come to accounting for all the interstate variation in a particular health characteristic, the smaller the bias will be. Unfortunately, the bias cannot be computed without knowing the true values. However, through the U.S. Vital Registration System, true State values for vital events (collected through birth and death certificates) are known for a limited number of available sociodemographic and health characteristics. Therefore, we can compare certain synthetic estimates with their corresponding true values. This yields a degree of confidence for the synthetic estimates of similar characteristics which cannot be checked against the true values from State vital statistics. Thus, the evaluation of this study only provides an indicator of the quality of the synthetic State estimates. The last two columns of Table 4 show the mean square error (MSE) of the NNS synthetic estimates as compared with the MSE of the NNS direct estimates. The MSE of an estimate x is the variance of x plus the square of the bias of x, i.e., $\mathrm{MSE}(x) = \mathrm{Var}(x) + [\mathrm{Bias}(x)]^2$. 2.5 Current Problems And Activities Work is currently underway at NCHS to produce synthetic State estimates from the 1988 National Maternal and Infant Health Survey (NMIHS), which is very similar to its predecessors, the 1980 NNS and NFMS. In the NMIHS, 9,953 live-birth certificates out of a universe of 3,898,922 are linked with mothers' responses on 35-page questionnaires about the mothers' prenatal health behavior, maternal health, the birth experience, and infant health. The 1988 NMIHS live birth estimates will be used to produce synthetic State estimates by infant's race, birth weight, and maternal age and marital status. REFERENCES National Center for Health Statistics: Vital Statistics of the United States, 1987, Vol. 1, Natality, DHHS Pub. No. (PHS) 89-1100. 
Public Health Service, Washington. U.S. Government Printing Office, 1989. National Center for Health Statistics, K.G. Keppel, R.L. Heuser, P.J. Placek, et al.: Methods and Response Characteristics, 1980 National Natality and Fetal Mortality Surveys. Vital and Health Statistics, Series 2, No. 100. DHHS Pub No. (PHS) 86-1374. Public Health Service, Washington. U.S. Government Printing Office, Sept. 1986. National Center for Health Statistics: State Uses of Followback Survey Data, R.L. Heuser, K.G. Keppel, C.A. Witt, and P.J. Placek, Presented at the Annual Meeting of the Association for Vital Records and Health Statistics, July 9-12, 1984, Niagara Falls, NY. National Center for Health Statistics: R.L. Heuser, K.G. Keppel, C.A. Witt, and P.J. Placek, Synthetic Estimation Applications from the 1980 National Natality Survey (NNS) and the 1980 National Fetal Mortality Survey (NFMS), Presented at the NCHS Data Use Conference on Small Area Statistics, August 29-31, 1984, Snowbird, Utah. National Center for Health Statistics, P.J. McCarthy: Replication: An Approach to the Analysis of Data From Complex Surveys. Vital and Health Statistics, Series 2, No. 14, PHS Pub No. 1000. Public Health Service. Washington, U.S. Government Printing Office, April 1966. National Center for Health Statistics, P.J. McCarthy: Pseudoreplication: Further Evaluation and Application of the Balanced Half-Sample Technique. Vital and Health Statistics. Series 2, No. 31. DHEW Pub No. (HSM) 73-120. Health Services and Mental Health Administration. Washington. U.S. Government Printing Office, Jan. 1969. * Chapter 8 (authored by Donald Malec) of this report contains several references on small area estimation as applied to the National Health Interview Survey of the National Center for Health Statistics. 
CHAPTER 3 State, Metropolitan Area, and County Income Estimation Wallace Bailey, Linnea Hazen, and Daniel Zabronsky Bureau of Economic Analysis 3.1 Introduction and Program History 3.1.1 Program Description The Bureau of Economic Analysis (BEA) maintains a program of State and local area (county and metropolitan area) economic measurement that centers on the personal income measure. This program originated in 1939 when estimates of income payments to individuals by State were first published. At the national level, personal income is the principal income measure in the personal income and outlay account, one of the five accounts that compose the national income and product accounts. The State and local area personal income estimates are derived by disaggregating the detailed components of the national personal income estimates to States and counties. Estimates for all other geographic areas are made by aggregating either the State or county estimates in the appropriate combinations. This building block approach permits estimates for areas whose boundaries change over time, such as metropolitan areas, to be presented on a consistent geographic definition for all years. 3.1.2 Uses of the State and Local Area Income Estimates BEA's State and local area income estimates are widely used in the public and private sector to measure and track levels and types of incomes received by persons living or working in an area. They provide a framework for the analysis of each area's economy and serve as a basis for decision making in both the public and private sectors. Personal income is among the measures used in evaluating the socioeconomic impact of public- and private-sector initiatives; for example, it is widely used in preparing the environmental impact statements required by the National Environmental Policy Act of 1969. One of the first uses made of State personal income estimates (or a derivative) was as a variable in formulas for allocating Federal funds to States. 
The most often used derivative is per capita personal income, which is computed using the Census Bureau's estimates of total population; these population estimates are described by Long in Chapter 4 of this report. At present, BEA's State personal income estimates are used by the Federal Government to allocate over $92 billion annually for various Federal domestic programs, including the medical assistance (Medicaid) program and the aid to families with dependent children program. Table 3.1 highlights the major Federal Government programs which use BEA personal income estimates in allocation formulas for Federal domestic assistance funds. Federal agencies also use the components of personal income in econometric models, such as those used to project energy and water use. The U.S. Forest Service is using these estimates to identify resource dependent rural areas and to allocate funds for their economic diversification as required by the National Forest-Dependent Rural Communities Economic Diversification Act of 1990. The U.S. Census Bureau uses the BEA estimates of State per capita personal income as the key predictor variable in its estimates of mean annual income for 4-person families by State. These estimates are described by Fay, Nelson, and Litow in Chapter 9 of this report. During the past decade, State governments have substantially increased their use of the State personal income estimates. The estimates are used in the measurement of economic bases and in models developed for planning for such things as public utilities and services. They are also used to project tax revenues. In recent years, legislation that limits a State's expenditures or tax authority by the level of, or changes in, State personal income or one of its components has been enacted in 16 States. These 16 States account for nearly one-half of the U.S. population. 
Some of these States use BEA's annual State personal income estimates; the others use fiscal year estimates derived from BEA's quarterly State personal income estimates (ACIR, 1990). State governments also use the local area estimates to measure the economic base of State planning areas. University schools of business and economics, often working under contract for State and local governments, use the BEA local area estimates for theoretical and applied economic research. Businesses use the estimates to evaluate markets for new or established products and to determine areas for the location, expansion, and contraction of their activities. Trade associations and labor organizations use them for product and labor market analyses. 3.1.3 A History of BEA's Regional Income Estimates In the mid-1930's, BEA's predecessor began work on the estimation of regional income as part of the effort to explain the processes and structure of the Nation's economy. As a result of its work, it produced a report in May 1939 that showed State estimates of total "income payments to individuals" (Nathan and Martin, 1939). These income payments were defined as the sum of (1) wages and salaries, (2) other labor income and relief, (3) entrepreneurial withdrawals, and (4) dividends, interest, and net rents and royalties. In 1942, the State estimates of wages and salaries and entrepreneurial income were expanded to include a further breakdown by broad industry group--agriculture, other commodity-producing industries, distribution, services, and government. The industry breakdown was for 1939, when the availability of census information on payrolls and the employed labor force by industry and by State made possible more reliable estimates than for prior years (Creamer and Merwin, 1942). 
The estimates for most nonagricultural industries and for the military services were based on reports in which establishments, not employees, were classified by State and in which the State of residence of the employees was not indicated; therefore, the estimates for these industries were on a "place-of-work" (where-earned) basis. No systematic adjustment was made in the total income payments series to convert the estimates to a "place-of-residence" (where-received) basis. However, using the limited information that was available, residence adjustments were made for a few States for the per capita series. During the later 1940's and early 1950's, BEA continued to work on improving these estimates by seeking additional source data and by improving the estimating methods that were used. The industrial detail of the wage and salary estimates was expanded to include each Standard Industrial Classification (SIC) division and additional detail for some SIC divisions. As one result of the major reworking and expansion of the national income and product accounts, BEA developed State personal income--a measure of income that is more comprehensive than State income payments. During the 1960's and 1970's, BEA continued its work to provide more information about regional economies. Annual State estimates of disposable personal income were published in the April 1965 Survey of Current Business (Survey), and the first set of quarterly estimates of State personal income was published in the December 1966 Survey. Estimates of personal income for metropolitan areas were published in the May 1967 Survey, for nonmetropolitan counties in the May 1974 Survey, and for metropolitan counties in the April 1975 Survey. In the late 1970's, BEA introduced annual estimates of employment for States, metropolitan areas, and counties. 
Refinement of the residence adjustment procedures and a fuller presentation of industrial detail for earnings--the term introduced to cover wages and salaries plus other labor income plus proprietors' income--emerged in the estimates published in 1974. The residence adjustment procedures had been extended to all States in 1966, but the residence adjustment estimates (i.e., the net flows of interstate commuters' earnings), along with earnings by industry on a place-of-work basis, were not published explicitly until 1974. 3.2 The Regional Economic Measurement Program 3.2.1 Estimating Schedule for State and Local Area Personal Income Series The annual estimates of State personal income for a given year are subject to successive refinement. Preliminary estimates, based on the current quarterly series, are published each April, 4 months after the close of the reference year, in the Survey. The following August, more reliable annual estimates are published. These estimates are developed independently of the quarterly series and are prepared in greater component detail, primarily from Federal and State government administrative records. The annual estimates published in August are subsequently refined to incorporate newly available information used to prepare the local area estimates for the same year. These revised State estimates, together with the local area estimates, are published the following April. The annual estimates emerging from this three-step process are subject to further revision for several succeeding years (the State estimates in April and August and the local area estimates in April), as additional data become available. For example, the 1992 State estimates that were first released in April 1993 will be revised in August 1993 and in April and August of subsequent years; the 1991 local area estimates that were first released in April 1993 will be revised in April of 1994 and of subsequent years. 
The routine revisions of the State estimates for a given year are normally completed with the fourth April publication, and the local area estimates, with the third April publication. After that, the estimates will be changed only to incorporate a comprehensive revision of the National Income and Product Accounts--which takes place approximately every 5 years--or to make important improvements to the estimates through the use of additional or more current State and local area data. Quarterly estimates of State personal income, which are available approximately 4 months after the close of the reference quarter, are published regularly in the January, April, July, and October issues of the Survey. In October and again the following April, the quarterly series for the 3 previous years is revised for consistency with the revised annual estimates. In January and July, at least the quarter immediately preceding the current quarter is revised. 3.2.2 Availability of State and Local Area Estimates The State and local area personal income and employment estimates are available through the Regional Economic Information System (REIS), which operates an information retrieval service that provides a variety of standard and specialized analytic tabulations for States, counties, and specified combinations of counties. Standard tabulations include personal income by type and earnings by industry, employment by industry, transfer payments by program, and major categories of farm gross income and expenses. These tabulations are available from REIS on magnetic tape, computer printout, microcomputer diskette, and CD-ROM; some of the tabulations are also available electronically on the Department of Commerce's Economic Bulletin Board, available through the National Technical Information Service. In addition, summary tabulations of the State and local estimates are published regularly in BEA's major publication, the Survey. 
An extensive set of State-level historical estimates is available (BEA, 1989). BEA also makes its regional estimates available through the BEA User Group, members of which include State agencies, universities, and Census Bureau Primary State Data Centers. BEA provides its estimates of income and employment for States, metropolitan areas, and counties to these organizations with the understanding that they will make the estimates readily available. Distribution in this way encourages State universities and State agencies to use data that are comparable for all States and counties and that are consistent with national totals; using comparable and consistent data enhances the uniformity of analytic approaches taken in economic development programs and improves the recipients' ability to assess local area economic developments and to service their local clientele. 3.3 BEA Annual State and County Personal Income Estimation Methodology 3.3.1 Overview The following discussion will focus on the annual estimates of State and county personal income. BEA's quarterly State personal income, annual State disposable personal income, annual State and county full- and part-time employment, and gross State product (GSP) estimates are produced in a manner similar to those described below. (The methodologies for quarterly State personal income and for annual State disposable personal income are described in BEA (1989, pp. M-32-37); the methodology for GSP is described in BEA (1985) and in Trott, Dunbar, and Friedenberg (1991, pp. 43-45).) The personal income of an area is defined as the income received by, or on behalf of, all the residents of the area. It consists of the income received by persons from all sources, that is, from participation in production, from both government and business transfer payments, and from government interest. 
Personal income is measured as the sum of wage and salary disbursements, other labor income, proprietors' income, rental income of persons, personal dividend income, personal interest income, and transfer payments less personal contributions for social insurance. Per capita personal income is measured as the personal income of the residents of an area divided by the resident population of the area. At the national level, personal income is part of the personal income and outlay account, which is one of five accounts in a set that constitutes the national income and product accounts. Such accounts do not now exist below the national level; however, personal income has long been available for States and local areas. In addition, GSP, which corresponds to the national measure gross domestic product, and some elements of personal outlays (personal tax and nontax payments) are available for States but not for local areas. GSP is estimated separately from State personal income, but the two measures share most of the elements of wages and salaries, other labor income, and proprietors' income by State of work. For a tabular representation of the relationships among gross domestic product, State earnings, and GSP, see Table 2 in Trott et al. (1991, p. 44). 3.3.2 Differences Between the National and Subnational Estimates The definitions underlying the State and local area estimates of personal income are essentially the same as those underlying the national estimates of personal income. However, the national estimates of personal income include the labor earnings (wages and salaries and other labor income) of residents of the United States temporarily working abroad, whereas the subnational estimates include the labor earnings of persons residing only in the 50 States and the District of Columbia. Specifically, the national estimates include the labor earnings of Federal civilian and military personnel stationed abroad and of residents who are employed by U.S. 
firms and are on temporary foreign assignment. An "overseas" adjustment is made to exclude the labor earnings of these workers from the national totals before the totals are used as controls for the State estimates. An important classification difference between national and subnational estimates relates to border workers--that is, residents of the United States who work in adjacent countries (such as Canada) and foreigners who work in the United States but who reside elsewhere. At the national level, the net flow of the labor earnings of border workers and the labor earnings of U.S. residents employed by international organizations and by foreign embassies and consulates in the United States are included in the measurement of the "rest-of-the-world" sector. At the State and local area levels, however, only the labor earnings of U.S. residents employed by international organizations and by foreign embassies and consulates in the United States are treated as a component of personal income. Border workers are treated as commuters and their earnings flows are reflected in personal income through the residence adjustment procedures. Statistical differences between the national and subnational series may reflect the different estimating schedules for the two series. The State and local area estimates usually incorporate source data that are not available when the national estimates are prepared. The national estimates are usually revised the following year to reflect the more current State and local area data. 3.3.3 Sources of Data BEA uses information collected by others to prepare its estimates of State and local area personal income. Generally, two kinds of information are used to measure the income of persons: Information generated at the point of disbursement of the income and information elicited from the recipient of the income. 
The first kind is data drawn from the records generated by the administration of various Federal and State government programs; the second kind is survey and census data. The following are among the more important sources of the administrative record data: The State unemployment insurance programs of the Employment and Training Administration, Department of Labor; the social insurance programs of the Social Security Administration and the Health Care Financing Administration, Department of Health and Human Services; the Federal income tax program of the Internal Revenue Service, Department of the Treasury; the veterans benefit programs of the Department of Veterans Affairs; and the military payroll systems of the Department of Defense. The two most important sources of census data are the censuses of agriculture and of population. (BEA uses little survey data to prepare the State and local area estimates; however, the Department of Agriculture makes extensive use of surveys to prepare the State farm income estimates and the county cash receipts and crop production estimates that BEA uses in the derivation of the farm income components of personal income.) The data obtained from administrative records and censuses are used to estimate about 90 percent of personal income. Data of lesser scope and relevance are used for the remaining 10 percent. When data are not available in time to be incorporated into the current estimating cycle, interim estimates are prepared using the previous year's State or county distribution. The interim estimates are revised during the next estimating cycle to incorporate the newly available data. Using data that are not primarily designed for income measurement has several advantages and disadvantages. 
Using administrative record data and census data, BEA can prepare the estimates of State and local area personal income on an annual basis, in considerable detail, at relatively low cost, and without increasing the reporting burden of businesses and households. However, because these data are not designed primarily for income measurement, they often do not precisely "match" the series being estimated and must be adjusted to compensate for differences in content (definition and coverage) and geographic detail. 3.3.4 Controls and the Allocation Procedure The national estimates for most components of wages and salaries and transfer payments, which together account for about 75 percent of personal income, are based largely on the sum of subnational source data, and the procedure used to prepare the State and county estimates causes only minor changes to the source data. For other components of personal income, either detailed geographic coding is not available for all source data, or more comprehensive and more reliable information is available for the Nation than for States and counties. For these reasons, the estimates of personal income are first constructed at the national level. The subnational estimates are constructed as elements of the national totals, using the subnational data. Thus, the national estimates, with some adjustment for definition, serve as the "control" for the State estimates, and the State estimates, in turn, serve as controls for the county estimates. The State estimates are made by allocating the national total for each component of personal income to the States in proportion to each State's share of a related economic series. Similarly, the county estimates are made, in somewhat less component detail, by allocating the State total. In some cases, the related series used for the allocation may be a composite of several items (e.g., wages, tips, and pay-in-kind) or the product of two items (e.g., average wages times the number of employees). 
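As an illustration only, the proportional allocation just described might be sketched in Python as follows. The function name, the area labels, and all figures are hypothetical; BEA's actual procedure is applied component by component, using the source data described in this chapter.

```python
from typing import Dict, Mapping


def allocate(control_total: float,
             related_series: Mapping[str, float]) -> Dict[str, float]:
    """Distribute a control total across areas in proportion to a related series."""
    series_total = sum(related_series.values())
    if series_total <= 0:
        raise ValueError("related series must have a positive total")
    return {area: control_total * value / series_total
            for area, value in related_series.items()}


# Hypothetical national control total for one income component.
national_total = 1000.0
# Hypothetical related economic series by State (for example, reported payrolls).
state_series = {"State A": 30.0, "State B": 50.0, "State C": 20.0}
state_estimates = allocate(national_total, state_series)

# Each State estimate in turn serves as the control for its counties.
county_series = {"County A1": 2.0, "County A2": 1.0}
county_estimates = allocate(state_estimates["State A"], county_series)

print(state_estimates)
print(county_estimates)
```

Because every area receives a share of the same control total, the county figures add to the State figure and the State figures add to the national figure, apart from floating-point rounding.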
In every case, the final estimating step for each income estimate is its adjustment to the appropriate higher level total. This procedure is called the allocation procedure. The allocation procedure, as used to estimate a component of State personal income, is given by [equation graphic not reproduced in this text version]. The source data that underlie the national estimates are frequently more timely, detailed, and complete than the available State and county data. The use of the allocation procedure imparts some of these aspects of the national estimates to the subnational estimates and allows the use of subnational data that are related but that do not always precisely match the series being estimated. The use of this procedure also yields an additive system wherein the county estimates sum to the State totals and the State estimates sum to the national total.
3.3.5 Place of Measurement
Personal income, by definition, is a measure of income received; therefore, estimates of State and local area personal income should reflect the residence of the income recipients. However, the data available for regional economic measurement are frequently recorded by the recipients' place of work. The data underlying the estimates can be viewed as falling in four groups according to the place of measurement.
(1) For the estimates of wages and salaries, other labor income, and personal contributions for social insurance by employees, most of the source data are reported by industry in the State and county in which the employing establishment is located; therefore, these data are recorded by place of work. The estimates based on these data are subsequently adjusted to a place-of-residence basis for inclusion in the personal income measure.
(2) For nonfarm proprietors' income and personal contributions for social insurance by the self-employed, the source data are reported by tax-filing address. These data are largely recorded by place of residence.
(3) For farm proprietors' income, the source data are reported and recorded at the principal place of production, which is usually the county in which the farm has most of its land.
(4) For military reserve pay, rental income of persons, personal dividend income, personal interest income, transfer payments, and personal contributions for supplementary medical insurance and for veterans life insurance, the source data are reported and recorded by the place of residence of the income recipients.
3.3.6 Sources and Methods for Annual State and County Income Estimates
3.3.6.1 Framework
Personal income is estimated as the sum of its detailed components; the major types of payments that comprise those components are shown in Table 3.2, together with the related percents of personal income and the principal sources of data used to estimate the components. The following methodology presentation consists of a section for each of the six types of payment and a section for the residence adjustment. The methodologies for some types of payment and for many of the individual income components are omitted from this presentation, but a complete presentation is available (BEA 1991, pp. M-7-27).
3.3.6.2 Wage and salary disbursements
Wage and salary disbursements, which accounted for about 58 percent of total personal income at the national level in 1990, are defined as the monetary remuneration of employees, including the compensation of corporate officers; commissions, tips, and bonuses; and receipts in kind that represent income to the recipient. They are measured before deductions, such as social security contributions and union dues. The estimates reflect the amount of wages and salaries disbursed during the current period, regardless of when they were earned.
The following description of the procedures used in making the estimates of wage and salary disbursements is divided into three sections: Wages and salaries that are covered under the unemployment insurance (UI) program, wages and salaries that are not covered under the UI program, and wages and salaries that are paid in kind.
Wages and salaries covered by the UI program
The estimates of about 95 percent of wages and salaries are derived from tabulations by the State employment security agencies (ESA's) from their State employment security reports (form ES-202). These tabulations summarize the data from the quarterly UI contribution reports filed with a State ESA by the employers subject to that State's UI laws. Employers usually submit reports for each "county reporting unit"--i.e., for the sum of all the employer's establishments in a county for each industry. However, in some cases, an employer may group very small establishments in a single "statewide" report without a county designation. Each quarter, the various State ESA's submit the ES-202 tabulations on magnetic tape to the Bureau of Labor Statistics (BLS), which provides a duplicate tape to BEA. The tabulations present monthly employment and quarterly wages for each county in Standard Industrial Classification (SIC) four-digit detail. (The ES-202 tabulations through 1987 reflect the 1972 SIC, and those for 1988-90, the 1987 SIC.) Under the reporting requirements of most State UI laws, wages include bonuses, tips, gratuities, and the cash value of meals and lodging supplied by the employer. The BEA estimates of wage and salary disbursements are made, with a few exceptions, at the SIC two-digit level. However, the availability of the ES-202 data in SIC four-digit detail facilitates the detection of errors and anomalies; this detail also makes it possible to isolate those SIC three-digit industries for which UI coverage is too incomplete to form a reliable basis for the estimates.
In this case, the SIC two-digit estimate is prepared as the sum of two pieces: The fully covered portion, which is based on the ES-202 data, and the incompletely covered portion, which is estimated as described in the section on wages and salaries not covered by the UI program. The ES-202 wage and salary data do not precisely meet the statistical and conceptual requirements for BEA's personal income estimates. Consequently, the data must be adjusted to meet the requirements more closely. The adjustments affect both the industrial and geographic patterns of the State and county UI-based wage estimates.
Adjustment for statewide reporting.--Wages and salaries reported for statewide units are allocated to counties in proportion to the distribution of the wages and salaries reported by county; the allocations for each State are made for each private-sector industry (generally at the SIC two-digit level) and for five government components.
Adjustment for industry nonclassification.--The industry detail of the ES-202 tabulations regularly shows minor amounts of payroll that have not been assigned to any industry. For each State and county, the amount of ES-202 payrolls in this category is distributed among the industries in direct proportion to the industry-classified payrolls.
Misreporting adjustment.--This adjustment--the addition of estimates of wages and salaries subject to UI reporting that employers do not report--is made to the ES-202 data for all covered private-sector industries. At the national level, the estimate for each industry is made in two parts--one for the underreporting of payrolls on UI reports filed by employers and one for the payrolls of employers that fail to file UI reports (Parker, 1984). The source data necessary to replicate this methodology below the national level are not available. Instead, the national adjustment for each industry is allocated to States and counties in proportion to ES-202 payrolls.
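The three adjustments above share one mechanism: an amount that lacks a county or industry assignment is spread over the assigned amounts in proportion to their size. The Python sketch below shows that mechanism for each adjustment separately. The helper name and every figure are hypothetical, and the source does not state the order in which BEA applies the adjustments, so each is shown on its own.

```python
def spread_proportionally(amount, base):
    """Add `amount` to the entries of `base` in proportion to their size."""
    total = sum(base.values())
    if total <= 0:
        raise ValueError("base amounts must have a positive sum")
    return {key: value + amount * value / total
            for key, value in base.items()}


# Statewide reporting: wages reported without a county designation are
# allocated to counties by the county-reported distribution (one industry).
county_wages = {"county_1": 60.0, "county_2": 40.0}
statewide_wages = 10.0
county_adjusted = spread_proportionally(statewide_wages, county_wages)

# Industry nonclassification: unclassified payroll in one area is spread
# over the industry-classified payrolls.
industry_payrolls = {"retail": 80.0, "manufacturing": 20.0}
unclassified_payroll = 5.0
industry_adjusted = spread_proportionally(unclassified_payroll,
                                          industry_payrolls)

# Misreporting: a national adjustment for one industry is allocated to
# areas in proportion to their ES-202 payrolls.
area_payrolls = {"state_1": 200.0, "state_2": 100.0}
national_misreporting = 30.0
area_adjusted = spread_proportionally(national_misreporting, area_payrolls)
```

In each case the adjusted values sum to the original amounts plus the amount being spread, so no payroll is lost or double counted.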
Adjustments to government components.--Alternative source data are substituted for the ES-202 data when the latter series reflects excessively large proportions of Federal civilian payrolls that are not reported by county or of State government payrolls that are apparently reported in the wrong counties. For Federal civilian wages and salaries, the alternative source data are tabulations of employment by agency and county prepared by the Office of Personnel Management. For State government wages and salaries, the alternative source data are place-of-work wage data derived from an unpublished tabulation of journey-to-work (JTW) data from the 1980 Census of Population. (All income estimates using 1980 Census of Population data will be updated to incorporate 1990 Census of Population data in a regional comprehensive revision to be released in the spring of 1994.)
Adjustments for noncovered elements of UI-covered industries.--BEA presently makes adjustments for the following noncovered elements:
o Tips;
o Commissions received by insurance solicitors and real estate agents;
o Payrolls of electric railroads, railroad carrier affiliates, and railway labor organizations;
o Salaries of corporate officers in Washington State;
o Payrolls of nonprofit organizations exempt from UI coverage because they have fewer than four employees;
o Wages and salaries of students employed by the institutions of higher education in which they are enrolled;
o Allowances paid to Federal civilian employees in selected occupations for uniforms; and
o Salaries of State and local government elected officials and members of the judiciary.
Except for tips, these elements are exempted from State UI coverage. Tips are covered by the various UI laws, but BEA assumes that this form of income payment is considerably underreported, and it therefore makes additional estimates of tips in industries where tipping is most customary.
National and State estimates of each of the noncovered elements are made (based on either direct data or indirect indicators). These estimates are added to the ES-202 payroll amount for the industry of the noncovered element to produce the final estimates for that industry. Because of the lack of relevant data, county estimates are made by allocating the final State total by the distribution of ES-202 payrolls for the appropriate industry.
Wages and salaries not covered by the UI program
Eight industries are treated as noncovered in making the State and county estimates of wage and salary disbursements: (1) Farms, (2) farm labor contractors, (3) railroads, (4) private elementary and secondary schools, (5) religious membership organizations, (6) private households, (7) military, and (8) "other." The estimates for these industries are based on a variety of sources. For example, the estimates for railroads are based mainly on employment data provided by the Association of American Railroads, and the estimates for the military services are based mainly on payroll data provided by the Department of Defense. See BEA (1991) for the methodology for the noncovered industries.
Wages and salaries paid in kind
The value of food, lodging, clothing, and miscellaneous goods and services furnished to employees by their employers as payment, in part or in full, for services performed is included in the wage and salary component of personal income and is referred to as "pay-in-kind." The estimates for UI-covered industries are prepared as an integral part of total wages and salaries for those industries, based on the ES-202 data. The estimates for most of the noncovered industries are based on pertinent employment data. See BEA (1991) for the methodology for pay-in-kind.
3.3.6.3 Other labor income
Other labor income (OLI), which accounted for about 5.5 percent of total personal income at the national level in 1990, consists primarily of employer contributions to private pension and welfare funds; these employer contributions account for approximately 98 percent of OLI. The "all other" component of OLI consists of directors' fees, judicial fees, and compensation of prisoners. Employer contributions for social insurance, which are paid into government-administered funds, are not included in OLI; under national income and product account conventions, it is the benefits paid from social insurance funds--which are classified as transfer payments--that are measured as part of personal income, not the employer contributions to the funds.
Employer contributions to private pension and welfare funds
Private pension and profit-sharing funds, group health and life insurance, and supplemental unemployment insurance.--The larger part of the national estimates of employer contributions to private pension and welfare funds is developed from Internal Revenue Service tabulations of data from proprietorship and corporate income tax returns published in Statistics of Income. However, these data are not suitable for making the subnational estimates because most multiestablishment corporations file tax returns on a companywide basis instead of for each establishment and because the State in which a corporation's principal office is located is often different from the State of its other establishments. As a result, the geographic distribution of the data tabulated from the tax returns does not necessarily reflect the place of work of the employees on whose behalf the contributions are made.
For private-sector employees, the State and county estimates of employer contributions to private pension and profit-sharing funds, group health and life insurance, and supplemental unemployment insurance are made, for all types of employer contributions combined, at the SIC two-digit level, the same level of industrial detail as the wage and salary estimates. The national total of employer contributions for each industry is allocated to the States and counties in proportion to the estimates of wage and salary disbursements for the corresponding industry. The use of subnational wage estimates to allocate the national estimates of employer contributions to private pension and welfare funds is based on the assumption that the relationship of contributions to payrolls for each industry is the same at the national, State, and county levels. The procedure reflects the wide variation in contribution rates--relative to payrolls--among industries (and therefore reflects appropriately the various mixes of industries among States and counties). It does not reflect the variation in contribution rates among States and counties for a given industry. The Federal Government makes contributions to a private pension fund, called the Thrift Savings Plan, on behalf of its civilian employees who participate in the Federal Employees Retirement System (mainly employees hired after 1983). In the absence of direct data below the national level, the national estimate is allocated to States and counties in proportion to the estimates of Federal civilian wages and salaries. State government contributions to private pension plans consist of annuity payments made by State governments on behalf of selected employee groups--primarily teachers. The State estimates are based on direct data from the Teachers Insurance and Annuity Association/College Retirement Equities Fund. 
The county estimates are prepared by allocating the State estimates in proportion to the estimates of State and local government education wages and salaries. In the absence of direct data below the national level, the national estimates of Federal, State, and local government contributions to private welfare funds on behalf of their employees are allocated to States and counties in proportion to ES-202 employment data for each level of government.
Privately administered workers' compensation.--The State estimates for this subcomponent are based mainly on direct data provided by the National Council on Compensation Insurance and by the Social Security Administration; the county estimates for each SIC two-digit industry reflect the geographic distribution of wages and salaries. The methodology for this income component is given in BEA (1991).
"All other" OLI
The methodology for "all other" OLI--primarily directors' fees and jury and witness fees--is given in BEA (1991). The State and county estimates for directors' fees--the largest of these subcomponents--reflect the geographic distribution of wages and salaries in each industry.
3.3.6.4 Proprietors' Income
Proprietors' income, which accounted for about 8.5 percent of total personal income at the national level in 1990, is the income, including income-in-kind, of sole proprietorships and partnerships and of tax-exempt cooperatives. The imputed net rental income of owner-occupants of farm dwellings is included. Dividends and monetary interest received by proprietors of nonfinancial business, monetary rental income received by persons who are not primarily engaged in the real estate business, and the imputed net rental income of owner-occupants of nonfarm dwellings are excluded; these incomes are included in dividends, net interest, and rental income of persons. Proprietors' income, which is treated in its entirety as received by individuals, is estimated in two parts--nonfarm and farm.
Nonfarm proprietors' income
Nonfarm proprietors' income is the income received by nonfarm sole proprietorships and partnerships and by tax-exempt cooperatives. The State and county estimates of the income of sole proprietors and partnerships for all but three of the SIC two-digit industries are based on 1981-83 tabulations from Internal Revenue Service (IRS) form 1040, Schedule C (for sole proprietors), and form 1065 (for partnerships). Tabulations either of gross receipts or of profit less loss from the two forms combined are used either to attribute a national total to the States or as direct data. Two national totals are used for each industry: One for income reported on the income tax returns--as adjusted to conform with national income and product accounting conventions--and one for an estimate of the income not reported on tax returns. For the adjustments for unreported income, no direct data are available below the national level. The national total for each industry is attributed to States in proportion to the IRS State distribution of gross receipts for the industry. For the reported portion of nonfarm proprietors' income, the State estimates for each of 45 industries are based on the IRS distribution of profit less loss for the industry, and the estimates for each of another 20 industries (together accounting for 3 percent of total nonfarm proprietors' income) are based on the IRS distribution of gross receipts for the industry. For the latter group, the IRS distribution of profit less loss, although preferable in concept, is not used as a basis for State estimates because the extreme year-to-year volatility of the State data suggests that they are unreliable. The 1983 State estimates prepared by the foregoing methodology are extended to later years based mainly on the number of small establishments in each industry as determined from the Census Bureau's County Business Patterns; see BEA (1989) for a full description of the methodology.
For the three remaining industries, limited partners' income presents a special estimating problem. In these industries--crude petroleum and natural gas extraction, real estate, and holding and investment companies--limited partnerships are often used as tax shelters. Limited partners' participation in partnerships is often purely financial; their participation more closely resembles that of investors than that of working partners. Accordingly, the usual assumption that the State from which the partnership files its tax return is the same as the residence of the individual partners is unsatisfactory. No direct data on the income of partners by their place of residence are available. The national estimates of proprietors' income for these industries are attributed to States in the same proportion as dividends received by individuals (based on all-industry dividends reported on IRS form 1040).
The State estimates of the income of tax-exempt cooperatives are based on data provided by the Rural Electrification Administration (for electric and telephone cooperatives) and the Agricultural Cooperative Service (for farm supply and marketing cooperatives); see BEA (1989) for the methodology.
The methodology for the county estimates of nonfarm proprietors' income is similar to the State methodology, but less direct data are used for many industries because problems with data volatility are greater at the county level. See BEA (1991) for a full description of the county methodology.
Farm proprietors' income
The estimation of farm proprietors' income starts with the computation of the realized net income of all farms, which is derived as farm gross receipts less production expenses. This measure is then modified to reflect current production through a change-in-inventory adjustment and to exclude the income of corporate farms and salaries paid to corporate officers.
Tables showing the derivation of State and county farm proprietors' income in detail are available from the Regional Economic Information System. The concepts underlying the national and State BEA estimates of farm income are generally the same as those underlying the national and State farm income estimates of the U.S. Department of Agriculture (USDA). The major definitional difference between the two sets of estimates relates to corporate farms. The USDA totals include net income of corporate farms, whereas the BEA personal income series, which measures farm proprietors' net income, by definition excludes corporate farms. Additionally, BEA classifies the salaries of officers of corporate farms as part of farm wages and salaries; USDA treats the corporate officers' salaries as returns to corporate ownership and as part of the total return to farm operators. The State control totals for the BEA county estimates of farm proprietors' income are taken from the component detail of the USDA State estimates, which are modified to reflect BEA definitions and to include interfarm intrastate sales. The methods used to estimate farm proprietors' income at the county level rely heavily on data obtained from the 1974, 1978, and 1982 censuses of agriculture and on selected annual county data prepared by the State offices affiliated with the National Agricultural Statistics Service (NASS), USDA. (Data from the 1987 Census of Agriculture will be incorporated into the estimates with the next cycle of comprehensive revisions.) The NASS data, which are described by Iwig in Chapter 7 of this report, are used, wherever possible, to interpolate and extrapolate to noncensus years. In addition, data from other sources within USDA, such as the Agricultural Stabilization and Conservation Service, are used to prepare a fairly detailed income and expense statement covering all farms in the State and county. 
For census years, BEA prepares county estimates of 46 components of gross income and 13 categories of production expenses. For intercensal and postcensus years, the component detail of the estimates for each State is set to take advantage of the best annual county data available for the State. Farm gross income includes estimates for the following items: (1) The cash receipts from farm marketing of crops and livestock (in component detail); (2) the income from other farm-related activities, including recreational services, forest products, and custom-feeding services performed by farm operators; (3) the payments to farmers under several government payment programs; (4) the value of farm products produced and consumed on farms; (5) the gross rental value of farm dwellings; and (6) the value of the net change in the physical volume of farm inventories of crops and livestock. Cash receipts from marketing is the most important component of farm gross income. The USDA generally has annual production, marketing, and price data available for preparing the State estimates for about 150 different commodities. However, annual county estimates of cash receipts--usually for total crops and for total livestock--are currently available for only 19 States (BEA 1991, fn. 15, p. M-14). For the other States, the USDA State estimates of cash receipts from the marketing of individual commodities are summed into the 13 crop and 5 livestock groups for which value-of-sales data are reported by county in the censuses of agriculture. The aggregates for the census years are then allocated by the related census county distributions. Estimates for intercensal years are based on supplemental county estimates of annual production of selected field crops and on State season average prices available from the State NASS offices, or they are calculated by straight-line interpolation between the census years and adjusted to State USDA levels. 
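The interpolation alternative in the last sentence can be sketched in Python as follows. This is a minimal illustration that assumes county values (rather than county shares) are interpolated linearly between two census years and then scaled so that they sum to the State control; the source does not say which is interpolated. The function names, county labels, and all figures are hypothetical.

```python
def interpolate(value_start, value_end, year_start, year_end, year):
    """Straight-line interpolation between two benchmark (census) years."""
    weight = (year - year_start) / (year_end - year_start)
    return value_start + weight * (value_end - value_start)


def scale_to_control(values, control_total):
    """Scale area values proportionally so that they sum to the control."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return {area: value * control_total / total
            for area, value in values.items()}


# Hypothetical county cash receipts for one commodity group in two
# census years.
census_1978 = {"county_x": 100.0, "county_y": 300.0}
census_1982 = {"county_x": 140.0, "county_y": 340.0}

# Interpolate each county to the intercensal year 1980.
interpolated_1980 = {
    county: interpolate(census_1978[county], census_1982[county],
                        1978, 1982, 1980)
    for county in census_1978
}

# Adjust the interpolated values to a hypothetical State total for 1980.
state_total_1980 = 550.0
county_1980 = scale_to_control(interpolated_1980, state_total_1980)
```

The final scaling step keeps the county estimates additive to the State level even when the interpolated values alone would not sum to it.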
The county estimates of the remaining components of gross income, of production expenses, of the adjustment for interfarm intrastate transactions, and of the adjustment to exclude the income of corporate farms are based mainly on data from the censuses of agriculture and data provided by NASS and by the Agricultural Stabilization and Conservation Service. See BEA (1991) for a full description of the methodology.
3.3.6.5 Personal Dividend Income, Personal Interest Income, and Rental Income of Persons
These components accounted for more than 17 percent of total personal income in 1990. Dividends are payments in cash or other assets, excluding stock, by corporations organized for profit to noncorporate stockholders who are U.S. residents. Interest is the monetary and imputed interest income of persons from all sources. Imputed interest represents the excess of income received by financial intermediaries from funds entrusted to them by persons over income disbursed by these intermediaries to persons. Part of imputed interest reflects the value of financial services rendered without charge to persons by depository institutions. The remainder is the property income held by life insurance companies and private noninsured pension funds on the account of persons; one example is the additions to policyholder reserves held by life insurance companies. Rental income of persons consists of the monetary income of persons (except those primarily engaged in the real estate business) from the rental of real property (including mobile homes); the royalties received by persons from patents, copyrights, and rights to natural resources; and the imputed net rental income of owner-occupants of nonfarm dwellings. The State and county estimates of dividends, interest, and rent are based mainly on data tabulated from Federal individual income tax returns by the Internal Revenue Service. The methodology for dividends, interest, and rent is given in BEA (1991).
3.3.6.6 Transfer payments
Transfer payments are payments to persons, generally in monetary form, for which they do not render current services. As a component of personal income, they are payments by government and business to individuals and nonprofit institutions. Nationally, transfer payments accounted for almost 15 percent of total personal income in 1990. At the county level, approximately 75 percent of total transfer payments are estimated on the basis of directly reported data. The remaining 25 percent are estimated on the basis of indirect, but generally reliable, data. For the State and county estimates, approximately 50 subcomponents of transfer payments are independently estimated using the best data available for each subcomponent. The methodology for all of these subcomponents is given in BEA (1991); the following items are presented here as examples.
Old-age, survivors, and disability insurance (OASDI) payments.--These payments, popularly known as social security, consist of the total cash benefits paid during the year, including monthly benefits paid to retired workers, dependents, and survivors and special payments to persons 72 years of age and over; lump-sum payments to survivors; and disability payments to workers and their dependents. The State estimates of each OASDI segment are based on Social Security Administration (SSA) tabulations of calendar year payments. The county estimates of total OASDI benefits are based on SSA tabulations of the amount of monthly benefits paid to those in current-payment status on December 31, by county of residence of the beneficiaries.
Medical vendor payments.--These are mainly payments made through intermediaries for care provided to individuals under the federally assisted State-administered medicaid program. Payments made under the general assistance medical programs of State and local governments are also included.
The county estimates are based on available payments data from the various State departments of social services. For States where no county data are available, the county estimates are based on the distribution of payments made under the aid to families with dependent children program.
Aid to families with dependent children (AFDC).--This State-administered program receives Federal matching funds to provide payments to needy families. The State estimates are based on unpublished quarterly payments data provided by the SSA. The county estimates are prepared from payments data provided by the various State departments of social services. County data are no longer being received from some States; for these States, the most recent available data are used for the county estimates for each subsequent year.
State unemployment compensation.--These are the cash benefits, including special benefits authorized by Federal legislation for periods of high unemployment, from State-administered unemployment insurance (UI) programs. Most States report benefits directly by county, but a few report by local district office. In the latter case, local district office data are distributed among the counties within the jurisdiction of the local district office in proportion to the annual average number of unemployed persons estimated by the Bureau of Labor Statistics (BLS). When the State is unable to supply the county data in time to meet the publication deadline, a preliminary set of estimates is made and is revised the following year to incorporate the delayed county data. The preliminary county estimates are prepared by extrapolating the preceding year's estimates forward by the change in the BLS estimate of the annual average number of unemployed persons.
Veterans life insurance benefit payments.--These are the claims paid to beneficiaries and the dividends paid to policyholders from the five veterans life insurance programs administered by the Department of Veterans Affairs.
The county allocations of the combined payments of death benefits and dividends are based on the distribution of the veteran population.
Interest payments on guaranteed student loans.--These are the payments to commercial lending institutions on behalf of individuals who receive low-interest deferred-payment loans from these institutions to pay the expenses of higher education. The State estimates are based on Department of Education data on the number of persons enrolled in institutions of higher education. The county allocations are based on the distribution of the civilian population.
3.3.6.6 Personal Contributions For Social Insurance
Personal contributions for social insurance are the contributions made by individuals under the various social insurance programs. These contributions are excluded from personal income by treating them as explicit deductions. Payments by employees and the self-employed for social security, medicare, and government employees' retirement are included in this component. Also included are the contributions that are made by persons participating in the veterans life insurance program and in the supplementary medical insurance portion of the medicare program. The State and county estimates of personal contributions for social insurance are generally based either on direct data from the administering agency or on the geographic distribution of the appropriate earnings component; see BEA (1989 and 1991) for the full methodologies.
3.3.6.7 Residence Adjustment
Personal income is a "place-of-residence" measure of income, but the source data for the components that compose more than 60 percent of personal income are recorded by place of work. The adjustment of the estimates of these components to a place-of-residence basis is the subject of this section. At the national level, place of residence is an issue only for border workers (mainly those living in the United States and working in Canada or Mexico and vice versa).
At the State and county levels, the issue of place of residence is more significant. Individuals commuting to work between States are a major factor where metropolitan areas extend across State boundaries--for example, the Washington, DC-MD-VA MSA. Individuals commuting between counties are a major factor in every multicounty metropolitan area and in many nonmetropolitan areas.

BEA's concept of residence as it relates to personal income refers to where the income to be measured is received rather than to "usual," "permanent," or "legal" residence. It differs from the Census Bureau's concept mainly in the treatment of migrant workers. The decennial census counts many of these workers at their usual place of residence rather than where they are on April 1 when the census is taken. Except for out-of-State workers in Alaska (where migrant workers are unusually important) and for certain groups of border workers, BEA assigns the wages of migrant workers to the area in which they reside while performing the work. Similarly, BEA assigns the income of military personnel to the county in which they reside while on military assignment, not to the county in which they consider themselves to be permanent or legal residents. Thus, in the State and local area personal income series, the income of military personnel on foreign assignment is excluded because their residence is outside of the territorial limits of the United States.

Three of the six major components of personal income are recorded, or are treated as if recorded, on a place-of-residence (where-received) basis. They are transfer payments; personal dividend income, personal interest income, and rental income of persons; and proprietors' income. Nonfarm proprietors' income is treated as income recorded on a place-of-residence basis because the source data for almost all of this part of proprietors' income are reported to the IRS by tax-filing address, which is usually the filer's place of residence.
The source data for farm proprietors' income are recorded by place of production, which is usually in the same county as the proprietor's place of residence.

The remaining three major components--wages and salaries, other labor income (OLI), and personal contributions for social insurance--are estimated, with minor exceptions, from data that are recorded by place of work (point of disbursement). The sum of these components (wages plus OLI minus contributions) is referred to as "income subject to adjustment" (ISA).

Residence adjustment procedure (excluding border workers)

The county residence adjustment estimates for 1981 and later years are based on those for 1980 because intercounty commuting data are available only from the decennial censuses of population. (Data from the 1990 Census of Population will be introduced into the residence adjustment estimates as part of the comprehensive revisions to the State and local area personal income estimates that are now underway.) The estimation of these adjustments can be understood using the example of a two-county area comprising counties f and g. The two-county example is easily generalized to more complex situations.

[Graphic not reproduced.]

data from the 1980 Census of Population on the number of wage and salary workers (W) and on their average earnings (E) by county of work for each county of residence:

[Graphic not reproduced.]

The initial 1980 BEA estimates were modified in three situations. First, for clusters of counties identified as being closely related by commuting (mostly multicounty metropolitan areas), modifications were made to incorporate the 1979 wage and salary distribution from the 1980 Census of Population. (The 1979 wage and salary distribution from the 1980 Census of Population reflects the residential distribution of the income recipients as of April 1, 1980, regardless of where they were living when they received the wages and salaries.)
These modifications are needed because in numerous cases the 1980-census JTW data and the source data for the BEA wage estimates are inconsistently coded by place of work. (For example, the source data may attribute too much of the wages of a multiestablishment firm to the county of the firm's main office, or the geographic coding of the Defense Department payroll data and of the JTW data may attribute a military base extending across county boundaries to different counties.) Initial county estimates of place-of-residence wages and salaries were derived as place-of-work wages and salaries plus net residence adjustment for wages and salaries. (For the calculation of this net residence adjustment, only the gross flows for wages and salaries were used.) Then, the initial 1980 BEA place-of-residence wage and salary estimates were summed to a total for each cluster. Finally, the BEA total for each cluster was redistributed among the counties of the cluster in the same proportion as the 1979 wage and salary distribution from the 1980 census. To facilitate the extension of the 1980 residence adjustment estimates to later years, the cluster-based modifications--derived as net additions to or subtractions from the initial residence adjustment estimates for each of the 1,287 counties--were expressed as gross flows between pairs of counties within the same cluster. In the simplest case--a two-county cluster--the additional gross flow was assumed to be from the county with the negative modification to the county with the (exactly offsetting) positive modification. Second, modifications were made for selected noncluster adjacent counties if large, offsetting differences occurred between the initial 1980 BEA estimates and the census wage data for these counties. These adjacent-county modifications were expressed as gross flows in the same way and for the same reason as the cluster-based modifications. 
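The cluster-based modification described above (summing the initial place-of-residence wage estimates over a cluster and redistributing that total in proportion to the census wage distribution) can be illustrated with a small sketch. The function name, county labels, and dollar amounts below are hypothetical and are not taken from BEA's procedures or data; the sketch only shows the arithmetic of a proportional redistribution.

```python
def redistribute_cluster(initial_estimates, census_wages):
    """Redistribute a cluster total in proportion to census wages.

    initial_estimates: initial place-of-residence wage estimates by county.
    census_wages: census wage and salary amounts by county of residence.
    Returns (modified estimates, modification for each county).
    """
    cluster_total = sum(initial_estimates.values())
    census_total = sum(census_wages.values())
    if census_total <= 0:
        raise ValueError("census wages must sum to a positive amount")

    # Each county receives the cluster total times its census share.
    modified = {
        county: cluster_total * census_wages[county] / census_total
        for county in initial_estimates
    }
    # Net addition to (or subtraction from) the initial estimate.
    modifications = {
        county: modified[county] - initial_estimates[county]
        for county in initial_estimates
    }
    return modified, modifications


# Hypothetical two-county cluster (amounts in millions of dollars).
initial = {"f": 600.0, "g": 400.0}
census = {"f": 450.0, "g": 550.0}
modified, modifications = redistribute_cluster(initial, census)
print(modified)
print(modifications)
```

Because the cluster total is preserved, the modifications sum to zero. In a two-county cluster this corresponds to a single additional gross flow from the county with the negative modification to the county with the positive one.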
Third, modifications were made for eight Alaska county equivalents (boroughs and census areas) to reflect the large amounts of labor earnings received by seasonal workers from out of State. The 1980-census JTW data reflect the "commuting" of many of these workers, and the initial 1980 residence adjustment estimates for a majority of the county equivalents did not require modifications. However, for eight county equivalents, the initial 1979 estimates yielded BEA place-of-residence wage and salary totals that were so much higher than the comparable census data that they could not be an accurate reflection of the wages of only the permanent residents. (The 1979 residence adjustment estimates, although based mainly on the 1980-census JTW data, also reflect--at the appropriate one-tenth weight--1970-census JTW data.) Based on the assumption that the excess amounts were attributable to out-of-State migrant workers, these amounts were removed by judgmentally increasing the JTW-based gross flows to the large metropolitan counties of Washington, Oregon, and California.

[Graphic not reproduced.]

[Graphic not reproduced.]

As a last step, the total place-of-residence ISA (ISA plus net residence adjustment) for each cluster is derived and then distributed to the counties of the cluster based on 1980 place-of-residence ISA extrapolated to later years by the percentage change in the IRS-based wage series. The net residence adjustment estimate for each cluster county is calculated as place-of-residence ISA minus place-of-work ISA.

3.4 Evaluation Practices

In the past few years, two major studies were undertaken by BEA to evaluate the State and local area income estimates: (1) a reliability study of the State quarterly personal income series and (2) a study of the accuracy of the county residence adjustment estimates. In addition, in March of this year, the U.S. General Accounting Office (GAO) completed a study of BEA's national and State estimates.
3.4.1 Evaluation of the State Quarterly Personal Income Series

This study provided a detailed measurement and analysis of the reliability of quarterly and annual estimates of State personal income (Brown and Stehle 1990). The study, which covered the State estimates from 1980-87, assessed the reliability of State quarterly personal income using several statistical measures to examine the size of the revisions made to the estimates. One measure analyzes the range of revisions, where revision is defined as the percent change in the final estimates minus the percent change in the preliminary estimates. Other sets of measures used were dispersion, relative dispersion, bias, and relative bias. The findings of the study were intended to: (1) help BEA isolate particular problem areas in the production of these estimates; and (2) help users of these data determine the suitability for their purposes of the estimates released at different stages of the estimating process.

The four principal findings of the study were: (1) the major sources of the revisions to the quarterly percent changes in the preliminary quarterly estimates of State personal income are farm proprietors' income and wages and salaries; (2) largely reflecting wages and salaries, the preliminary quarterly estimates of total personal income tend to be underestimated in fast-growing States and overestimated in slow-growing States; (3) beginning in 1984, the reliability of the second quarterly estimates (that is, the estimates yielded by the first routine revision) was improved by the incorporation of quarterly data from employers' payroll tax reports (the ES-202 data); and (4) the annual revisions of total personal income are smaller than the quarterly revisions.

3.4.2 Residence Adjustment Reliability Study

In October 1988, a study was completed which measured the reliability of the Census commuting data used to prepare the net residence adjustments for county personal income (Zabronsky 1988).
While the impact of the residence adjustments is generally small at the region and State level, the residence adjustments constitute a large portion of total personal income for most counties in the U.S. In 1989, for instance, the absolute value of the net residence adjustments accounted for about 12.5 percent of total personal income for all counties, on average, while accounting for about 25 percent of total personal income in metropolitan area counties, on average.

In this residence adjustment reliability study, a Census file of county commuting data constructed from the journey-to-work question on the 1980 decennial census was compared with a file of aggregate wages and salaries independently tabulated by Census. In the course of the study, comparisons between the journey-to-work and aggregate wage series were explored across a variety of geographic, demographic, and industrial detail to develop a comprehensive reliability profile for the census commuting data.

The major conclusion of this study was that, taking the 1980 Census aggregate income series as a benchmark measure of county wage and salary income, the 1980 census journey-to-work data proved to be a highly reliable source for measuring commuters' income in the development of BEA's county residence adjustment estimates. Although careful analysis of the Census journey-to-work wage data did reveal a bias in that series that was correlated with county size, wage imputations undertaken by BEA largely corrected the problem, while commuting patterns between counties indicated that, for the relevant comparisons, the Census journey-to-work data were consistent with the Census aggregate income-based wage series.

3.4.3 GAO Study of BEA's State and National Estimates

The GAO study (GAO, 1993) was conducted in response to a request by the Honorable Ernest F. Hollings, Chairman, Committee on Commerce, Science, and Transportation, U.S. Senate.
Senator Hollings expressed concern about press reports that alleged that BEA "did not incorporate, for political purposes, a downward revision of original employment levels into its October 1991 estimate of first quarter 1991 State personal income growth and its December 1991 estimate of first quarter 1991 gross domestic product (GDP) growth." The report concluded that "We found no evidence that BEA manipulated first quarter 1991 personal income or GDP estimates for political purposes. BEA generally followed its standard procedures for using employment data in these estimates and deviated from these procedures only when required by what we believe were reasonable technical judgments" (GAO, 1993, p. 1).

3.5 Current Problems and Activities

BEA's regular publication schedules are carefully developed to take into account the needs of users, balanced against the responsibility to produce data of high quality. In general, the four-month lag of the State quarterly and preliminary annual State personal income data and the eight-month lag in the release of more detailed annual State personal income estimates are timely enough for most purposes and cause few hardships for the users of these series. For county and metropolitan area data, the fifteen-month lag required to produce these estimates is considered too long for many purposes and has limited the usefulness of these data.

In an effort to address the issue of the timeliness of its local area estimates, BEA has recently been testing the feasibility of developing preliminary annual estimates of personal income for metropolitan areas and non-metropolitan portions of States. These estimates would be available with a seven-month lag.

3.6 Conclusions

Rapid advances in computer technologies continue to provide improvements in the range of regional data available for local area estimation as well as in the timing of their availability.
For example, the more timely availability of ES-202 wage and salary data coupled with BEA's improved computing capabilities and estimating procedures may allow for the much more timely release of preliminary income estimates for metropolitan areas.

These rapid advances in computer technologies also continue to expand the ease of data transfer, storage, and manipulation. For example, BEA recently introduced a CD-ROM containing the local area personal income estimates; many data users can now acquire the entire set of estimates rather than placing an order each time they need some of the data. As in the past, it is anticipated that these advances in electronic capabilities will continue to expand the uses and users of BEA's regional estimates.

Table 3.1: Programs Using BEA Personal Income Estimates in Allocation Formulas for Federal Domestic Assistance Funds, Fiscal Year 1992

--------------------------------------------------------------------------
Program  Program                                       FY 1992 Obligations
Number   Name                                              (Millions of $)
--------------------------------------------------------------------------
17.235   Senior Community Service Employment Program                 395.2
84.126   Rehabilitation Services                                   1,783.5
84.154   Public Library Construction and Technology Enhancement       29.8
93.020   Family Support Payments to States (AFDC)                 13,814.9
93.138   Protection and Advocacy for Mentally Ill Individuals         19.1
93.630   Developmental Disabilities Basic Support and Advocacy        90.2
93.645   Child-Welfare Services--State Grants                        273.9
93.658   Foster Care--Title IV-E                                   2,342.1
93.659   Adoption Assistance                                         201.9
93.778   Medical Assistance Program (Medicaid; Title XIX)         72,502.7
93.779   Health Care Financing Research                               78.4
93.992   Alcohol & Drug Abuse & Mental Health Services               292.0
         TOTAL                                                    92,823.7
--------------------------------------------------------------------------

Source: Office of Management and Budget and U.S. General Services Administration (1992), 1992 Catalog of Federal Domestic Assistance, Washington, DC: U.S. Government Printing Office. For information about the grant formulas, see U.S.
General Services Administration (1992), 1992 Formula Report to the Congress, Washington, DC: U.S. Government Printing Office.

--------------------------------------------------------------------------

[Graphic not reproduced.]

REFERENCES

Advisory Commission on Intergovernmental Relations (ACIR) (1990), Significant Features of Fiscal Federalism, Volume 1: Budget Processes and Tax Systems, M-169, pp. 10-13, Washington, DC: U.S. Government Printing Office.

Bureau of the Census, U.S. Department of Commerce (1992), Statistical Abstract of the United States: 1992, Appendix II, Washington, DC: U.S. Government Printing Office.

Bureau of Economic Analysis (BEA), U.S. Department of Commerce (1985), Experimental Estimates of Gross State Product by Industry, BEA Staff Paper 42, Washington, DC: National Technical Information Service.

______(1989), State Personal Income: 1929-87, Estimates and a Statement of Sources and Methods, Washington, DC: U.S. Government Printing Office.

______(1991), Local Area Personal Income, 1984-89, Volume 1: Summary, Washington, DC: U.S. Government Printing Office.

Brown, R.L. and Stehle, J.E. (1990), "Evaluation of the State Personal Income Estimates," pp. 20-29, Survey of Current Business 70 (December 1990).

Creamer, D. and Merwin, C. (1942), "State Distribution of Income Payments, 1929-41," Survey of Current Business 22 (July 1942).

Nathan, R.R., and Martin, J.L. (1939), "State Income Payments, 1929-37," mimeographed report, Washington, DC: Bureau of Foreign and Domestic Commerce, U.S. Department of Commerce.

Parker, R.P. (1984), "Improved Adjustments for Misreporting of Tax Return Information Used to Estimate the National Income and Product Accounts, 1977," pp. 17-25, Survey of Current Business 64 (June 1984).

Trott, E.A., Dunbar, A.E., and Friedenberg, H.L. (1991), "Gross State Product by Industry, 1977-89," pp. 43-59, Survey of Current Business 71 (December 1991).

U.S.
General Accounting Office (GAO) (1993), Gross Domestic Product: No Evidence of Manipulation in First Quarter 1991 Estimates, Washington, DC: U.S. Government Printing Office.

Zabronsky, D. (1988), "Reliability of the Census Journey-to-Work Data in the Residence Adjustment for County Personal Income," Discussion Paper #35, Bureau of Economic Analysis, U.S. Department of Commerce.

CHAPTER 4

Postcensal Population Estimates: States, Counties, and Places

John F. Long, U.S. Bureau of the Census

4.1 Introduction and Program History

The U.S. Bureau of the Census produces population estimates for the nation, states, counties, and places (cities, towns, and townships) as part of its program to quantify changes in population size and distribution since the last census. These estimates provide updates to the population counts by demographic and geographic characteristics from the last census. They also indicate the pace of population change since the last census and the relative influence of the components of population change. While the national estimates can be produced by a careful accounting system that adds annual births, deaths, and international migration to the previous year's population, subnational estimates require development of methods for dealing with the largely unmeasured component effects of internal migration. Many of these methods represent the type of small domain estimates that constitute the subject of this working paper.

4.1.1 Uses of postcensal population estimates

There are five major categories of uses for the Census Bureau's population estimates: 1) Federal and state funds allocation, 2) denominators for vital rates and per capita time series, 3) survey controls, 4) administrative planning and marketing guidance, and 5) descriptive and analytical studies (Table 4.1).

More than 70 federal programs distribute tens of billions of dollars annually on the basis of population estimates (GAO, 1990).
Even more money was distributed indirectly on the basis of indicators which used population estimates for denominators or controls (GAO, 1991). Many states also use the postcensal subnational estimates to allocate state funds to counties, townships, and incorporated places within the state.

A large number of Federal statistical series, including state and county per capita income, national and state birth and death rates, and county level cancer rates by age, sex, and race, use the results of the postcensal estimates. While many Federal agencies directly collect time series data on events and amounts, they require annual postcensal estimates of state and county population to produce per capita rates. These series provide an indication of national and subnational trends for fertility and mortality rates, incidence of cancer and other diseases, per capita economic changes, and other social, demographic, and administrative indicators.

Population surveys require independent controls from national population estimates by age, sex, race, and ethnicity as well as data on the geographic distribution of the population by states and selected metropolitan areas. These estimates are used to weight the sample cases such that the survey results equal the postcensal estimates used as controls. Each of the major surveys conducted by the Census Bureau controls to somewhat different levels of geographic and demographic detail (Table 4.2).

There are a number of reasons to control surveys to independent estimates. Controls were initially instituted to reduce the variance of the survey estimates. They are also used for a number of secondary reasons: reduction in month-to-month variability of longitudinal data from consecutive surveys, partial correction for the large rates of undercoverage of surveys relative to the census, and improved consistency between different surveys and other population data series based on independent estimates.
There are numerous other administrative and analytical uses of the postcensal population estimates. They provide the only regular mechanism by which the components of population change are combined to track changes in the size and demographic and geographic distribution of the nation's population. The postcensal estimates provide essential information for administration and planning in the government and private sectors. In addition, they are used as a standard by state and local governments and the private sector in producing their own population estimates for smaller scale geography or for greater social and economic detail.

4.1.2 History of Census Bureau estimates program

Since the early 1900s, the Census Bureau has produced national population estimates. The methodology for these estimates developed into a component method in which the measured components of population change (births, deaths, immigration, and emigration) are added to or, in the case of deaths and emigration, subtracted from the most recent decennial census to estimate the current population.

When the Census Bureau attempted state population estimates beginning in the 1940s, it faced the difficult prospect of adding internal migration to the other components of population change. Since annual measures of internal migration by state are not available, many attempts were made to develop other ways to estimate state population change. Through 1960, the principal method (known as Component Method II) was to estimate net migration based on annual changes in school enrollment.

In the 1960s, a second method was added that estimated changes in the population level rather than measuring the components of population change. This method (the ratio-correlation method) uses regression analysis that relates changes in selected independent variables to changes in state population since the last census. These independent variables come from federal or state data sources.
In the 1960s, the major proxy variables were vital events, school enrollment, tax returns, number of votes cast, motor vehicle registrations, and building permits. In the 1970s, the variables for votes cast and building permits were dropped and a variable for the size of the work force was added.

As the demand for estimates spread to the county level, the Federal-State Cooperative Program for Population Estimates was formed to involve state governments in a joint effort with the Census Bureau. This organization permitted the extension of Component Method II, the ratio-correlation method, and a housing unit method to the county level by providing data on school enrollment and various state administrative data systems at the county level. This system permitted the flexibility of using data sets selected for each individual state.

The enactment of General Revenue Sharing created a demand for population estimates for all general purpose governments (incorporated places, towns, and townships). To estimate these subcounty areas the Census Bureau returned to a component based method (the administrative record method) in which migration was estimated using income tax data from the Internal Revenue Service (IRS). This method required matching addresses on successive years of tax returns and calculating a migration rate based on the total number of exemptions that moved into and out of each area. The key challenge in developing this methodology was to design a suitable method of coding mailing addresses to counties, incorporated places, and minor civil divisions. The result was a probability coding guide based on a question on place of residence placed on the tax returns in selected years. This methodology proved so successful that it was added as an independent method in the estimation of state and county populations as well.
4.2 Program Description, Policies, and Practices

The level of geographic and characteristic detail and the methodologies of the current population estimates program are legacies of the expansion of estimates demands during the last three decades. Tables 4.3 and 4.4 show the frequency, detail, and methodology used at each geographic level of the Census Bureau's population estimates program. While the national population is estimated by age, sex, race, and Hispanic origin, the subnational population estimates vary greatly in demographic and socioeconomic detail. In general, the level of characteristic detail declines as the level of geography becomes finer. Each level of geography also has its own combination of methods and input data.

State population estimates are currently produced on an annual basis by age and sex. County estimates are produced annually for the total population and, on an experimental basis, by age, race, and sex. Estimates for the total population of metropolitan areas are produced annually by summing the appropriate county data and by making adjustments for New England areas, which are composed of townships rather than counties. Every other year, the Census Bureau produces total population estimates for incorporated places, towns, and townships.

4.3 Estimator Documentation

The methodology for postcensal estimates varies by level of geography, with the widest array of methods used in county estimates. This methodological discussion focuses on the county estimates with occasional extensions to include methods specific to states or places.

Postcensal population estimates update the last census population based on changes in the population or in components of population change. Actual information on such components of population change as births and deaths, or on changes in symptomatic indicators related to changes in the population since the last census, provides benchmarks to anchor the estimates.
The art of postcensal estimation of population comes in choosing appropriate benchmarks (or auxiliary data) to use in estimating the population change since the last census. One type of benchmark data, population flow data, consists of measures of the components of population change (e.g., births, deaths, internal and external migration). The other type of benchmark data, population stock data, includes indicators that are correlated with population size and uses changes in those indicators to estimate the total change in population. Methods based on each of these two classes of data are found in several variations in the Census Bureau's postcensal population estimates program.

4.3.1 Flow methods

Flow methods are also known as component methods. They require some estimation of each of the components of population change since the last census. In the most general form, the component method reduces to a basic accounting equation for population change.

[Graphic not reproduced.]

(IRS) for changes in filing addresses between two consecutive annual tax filings (U.S. Bureau of the Census, 1988). In the estimates process, tax returns from one year are matched with those from previous years by matching Social Security numbers of the filers. For persons with a new address, the new mailing address is coded to state, place, and county. If the state, place, or county is different from the previous year, the filer and all exemptions are classified as migrants. These data are then used to construct net migration rates for each county and place as an input to the population estimation formula.

An estimate of the rate of net migration is calculated by dividing the net flow of exemptions (the tax filer plus his or her dependents) moving into the area by the number of exemptions filed in the area (see equation 4.2).

[Graphic not reproduced.]

This net migration rate is then multiplied by the initial population as shown in equation 4.1.
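The calculation just described can be sketched in a few lines. Equations 4.1 and 4.2 are not reproduced in this text, so the forms used below are assumptions that follow the prose: the net migration rate is the net flow of exemptions into the area divided by the exemptions filed in the area, and the accounting equation adds births, subtracts deaths, and adds the rate times the initial population. The function names and all numbers are hypothetical, and the choice of base year for the denominator is left to the caller.

```python
def net_migration_rate(inflow_exemptions, outflow_exemptions, base_exemptions):
    """Net flow of exemptions into the area per exemption filed in the area."""
    if base_exemptions <= 0:
        raise ValueError("base_exemptions must be positive")
    return (inflow_exemptions - outflow_exemptions) / base_exemptions


def component_estimate(base_population, births, deaths, migration_rate):
    """Assumed accounting form: P_t = P_0 + B - D + m * P_0."""
    return base_population + births - deaths + migration_rate * base_population


# Hypothetical county: 1,200 exemptions moved in, 900 moved out,
# and 30,000 exemptions were filed in the area.
rate = net_migration_rate(1200, 900, 30000)
estimate = component_estimate(50000, births=700, deaths=400, migration_rate=rate)
print(rate, estimate)
```

The sketch treats a single population group; the separate handling of the population 65 and over and of military and institutionalized populations is not modeled here.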
A critical assumption in this method is that the population not covered by the administrative data set moves similarly to the population covered or that the uncovered population is too small to affect the results markedly. Since this assumption is especially inappropriate for the population over 65 and for certain military and institutionalized populations, those populations are handled separately as explained below. Other potential problems include the difficulty of coding addresses to geography, changes in administrative coverage over time, and the elimination of administrative data sources as governmental programs change.

[Graphic not reproduced.]

migration rate of the school-aged population in the most recent census. The critical assumption here is that the relationship of net school-aged migration and net total migration remains constant over time.

4.3.2 Change in Stock Methods

A fundamentally different approach to population estimates emphasizes the total change in population size since the last census rather than demographic components of change. These change in stock methods relate changes in population size to changes in other measured variables that are assumed to be correlated with population change. The choice of possible variables is wide: number of housing units, automobile registrations, total number of deaths (and/or births), tax returns, etc. Note that births and deaths in this method are not viewed as components but as indicators of the size of the population. Similarly, drivers licenses and tax returns are not used as indicators of migration as they were in the flow methods but as proxies for the size of the total population.

[Graphic not reproduced.]

[Graphic not reproduced.]

The key assumption in this method is that the relationship among geographic units between change in population and change in the selected indicator variables remains constant over time (Tayman and Schafer, 1985).
Complications also arise if indicator variables change over time in selected areas for reasons unrelated to population -- for example, changes in the tax law, changes in general fertility rates, increases in automobile registrations per person, etc.

Another change in stock method is the housing unit method, which estimates the current population from an updated count of housing units. In this method, tax rolls, construction permits, certificates of occupancy, or utility data could be used to calculate changes in the number of housing units in an area (Smith and Mandell, 1984). In the Census Bureau's methodology the housing stock from the last Census is updated using data on housing construction, demolitions, and conversions (Eq. 4.4).

[Graphic not reproduced.]

The number of households in area i for date t is estimated by multiplying the estimated number of housing units at time t by an updated estimate of the occupancy rate for area i at time t. By assuming that the local occupancy rate changes in the same proportion as the national rate, we can update the area's rate by multiplying the occupancy rate for area i at the time of the census by the ratio of the national occupancy rate at time t from the Current Population Survey (CPS) to the national occupancy rate at the time of the census.

[Graphic not reproduced.]

Finally, the population for area i is calculated by multiplying the area's household estimate by an updated estimate of population per household. Again we assume that the area's population per household from the last census can be updated by multiplying by the ratio of the national population per household from the CPS to the national population per household in the last census.

[Graphic not reproduced.]

All of the methods discussed so far refer to the household population under 65. The two other segments of the population, the population 65 and over and the group quarters population, are measured by their own specific change in stock methodologies.
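The three steps of the housing unit method described above (update the housing stock, convert units to households with an updated occupancy rate, convert households to population with an updated population per household) can be sketched as follows. The original equations are not reproduced in this text, so the formulas are written from the prose; treating conversions as a single net figure is an assumption, and the function name, parameter names, and all input values are hypothetical.

```python
def housing_unit_estimate(
    census_units, construction, demolitions, net_conversions,
    local_occupancy_census, national_occupancy_census, national_occupancy_now,
    local_pph_census, national_pph_census, national_pph_now,
):
    """Return (housing units, households, household population) for one area."""
    # Step 1: update the census housing stock (net_conversions may be negative).
    units_now = census_units + construction - demolitions + net_conversions

    # Step 2: update the local occupancy rate by the national ratio,
    # then convert housing units to households.
    occupancy_now = local_occupancy_census * (
        national_occupancy_now / national_occupancy_census
    )
    households_now = units_now * occupancy_now

    # Step 3: update local population per household (pph) by the national
    # ratio, then convert households to population.
    pph_now = local_pph_census * (national_pph_now / national_pph_census)
    population_now = households_now * pph_now
    return units_now, households_now, population_now


# Hypothetical area.
units, households, population = housing_unit_estimate(
    census_units=10000, construction=500, demolitions=100, net_conversions=0,
    local_occupancy_census=0.95,
    national_occupancy_census=0.90, national_occupancy_now=0.891,
    local_pph_census=2.5,
    national_pph_census=2.70, national_pph_now=2.646,
)
print(units, round(households, 1), round(population, 1))
```

The sketch makes the central assumption visible: local occupancy and household size are taken to move in step with the national figures, so any local departure from the national trend passes directly into the population estimate.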
Since the 65-and-over and group quarters populations have unique characteristics (especially in terms of their migration patterns), we use administrative records systems that are specific to each group. The population over 65 is estimated by using changes in the Medicare population since the last census as a direct measure of the change in the population 65 and over. No such nationwide system exists for the group quarters populations (defined for estimates purposes as the population in military barracks, college dormitories, prisons, and other institutions). Changes in these populations since the last census are obtained from an inventory of major group quarters locations that is maintained and updated annually through a special data collection process in the Population Estimates Branch of the Population Division, in cooperation with state agencies affiliated with the Federal-State Cooperative Program for Population Estimates.

4.3.3 Combined Methods

The U.S. Census Bureau's postcensal population estimates program combines methods in two ways. Within each level of geography (states, counties, and places) several of the above methods are combined (Table 4.4). Since certain methods represent given subpopulations better, a combination of methods may be viewed as more robust -- less likely to change due to extraneous factors that might affect one or the other of the estimates. There is a further mixing of methods since the estimates at each level of geography are controlled to the results of the estimates made at the next higher level of geography.

The methodology for making state estimates during the 1980s averaged the results of the administrative record method with those of the composite method. In the composite method, the population is divided into three age groups, each of which is estimated by a separate method. The population under 15 is estimated using changes in the levels of school enrollment (similar to Component Method II).
The population ages 15-64 is estimated by a ratio-correlation method in which the independent variables are tax returns, school enrollment, and housing units. The population over 65 is estimated using a method in which changes in the number of persons on Medicare since the last census date are added to the population aged 65 and over at the last census (U.S. Bureau of the Census, 1984). The total state population by age is then controlled to equal the estimated national population age structure.

Annual county population estimates are produced independently for each state to coincide with the state's total population estimated above. A distinct methodology for each state is developed in consultation with that state's member of the Federal-State Cooperative Program for Population Estimates. In most cases, it consists of the average of two or three of the methods described above: the administrative records method, Component Method II, and the ratio-correlation method. Moreover, within the ratio-correlation method, different states use different independent variables, which may include school enrollment, tax returns, Medicare enrollment, automobile registrations, births, deaths, dummy variables for county size, or other state-specific data series. Additional adjustments are made for changes in selected military and institutional populations and for changes in the population over 65. Final results are controlled to the state population estimate produced by the Census Bureau using a uniform method across all states (van der Vate, 1988).

Place estimates use a strict administrative record methodology in which migration is based solely on the migration rates derived from changes in addresses on tax returns. The only other adjustments for place estimates are for changes in selected military and institutional populations and a final control to county level population estimates (U.S. Bureau of the Census, 1980).
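The two kinds of combination described in this section, averaging several methods within a level of geography and controlling the results to the next higher level, can be sketched as below. The text does not say how the control is applied; proportional (pro rata) scaling is an assumption here, and the function names, method labels, and county figures are all hypothetical.

```python
def average_methods(estimates_by_method):
    """Average the estimates from several methods, area by area."""
    methods = list(estimates_by_method.values())
    areas = methods[0].keys()
    return {area: sum(m[area] for m in methods) / len(methods) for area in areas}


def control_to_total(estimates, control_total):
    """Scale area estimates proportionally so they sum to the control total."""
    factor = control_total / sum(estimates.values())
    return {area: value * factor for area, value in estimates.items()}


# Hypothetical county estimates (thousands) from two methods in one state.
county_estimates = {
    "administrative_records": {"A": 100.0, "B": 200.0, "C": 300.0},
    "ratio_correlation": {"A": 110.0, "B": 190.0, "C": 320.0},
}
averaged = average_methods(county_estimates)

# The independently produced state estimate acts as the control.
controlled = control_to_total(averaged, control_total=600.0)
print(averaged, controlled)
```

Proportional scaling preserves each county's share of the averaged total, so the control changes levels without changing the relative ranking of areas.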
4.4 Evaluation Practices

The estimation process demands continuous vigilance. Methods that appear to work well at the beginning of a decade may be unsatisfactory later in the decade. Only constant testing, data evaluation, quality control, and checks for reasonableness can ensure a sound program of population estimation.

Whatever the method of estimation chosen, a number of considerations should be kept in mind. No matter how sophisticated the methodology, the estimate will only be accurate if the underlying assumptions hold and the input data are reliable. Many things can happen to endanger these conditions. For example, the relationships that held between variables in a previous decade might no longer hold in the current decade. The data series that one is depending upon to update the population may deteriorate or fail to measure the same underlying phenomenon as conditions change. Even if the administrative or other indicator data measure the population well, there may be problems of geographic coding that fail to assign the population to the correct geography.

Finding an appropriate yardstick against which to measure the postcensal population estimates is difficult. During the decade, aside from special censuses for a handful of places, there are no suitable numbers to compare to the estimates -- thus we know little about the short run accuracy of population estimates. We can only measure their accuracy at the extreme end of their range (after 10 years) using the next decennial census. Even here, the changing level of coverage between censuses for any given area can lead to imprecision in our measurement of the estimates' accuracy.

Using the results of the 1980 and 1990 censuses as enumerated, the Census Bureau evaluated the accuracy of the population estimates program.
The results (summarized in Table 4.5) show that population estimates made for the nation, for states, and for counties were reasonably accurate, but that estimates made for small places were quite inaccurate. Estimates for places under 5,000 had a mean absolute error of more than 15 percent, while places over 50,000 had a mean absolute error of less than 5 percent.

The last two columns in Table 4.5 present a more telling comparison. Column two compares the 1990 census and the provisional 1990 postcensal estimate, while column three compares the 1990 census with the 1980 census. For most levels of geography the postcensal population estimate provides a far more accurate estimate than simply holding the population constant at the level of the last census. For example, state postcensal estimates had a mean absolute error of only 1.5 percent, while holding the last census constant would give an error of 10.0 percent. On average, the estimates methodology is also much better than using the last census for counties and incorporated places over 5,000 population. However, for many incorporated places under 5,000, holding the population constant at the 1980 level would have given more accurate results than did our postcensal estimates program.

These inaccuracies for small places may be due to a number of sources: the problem of coding administrative records to small units of political geography, the greater importance of migration in population change for small areas, and the greater likelihood that the broad assumptions that might apply on average for larger areas would not apply to small localities with very specific characteristics. Since the Census Bureau is required by law to produce data for all incorporated places and townships, we will need to show places under 5,000 as well as the larger places for which we can produce good estimates.
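The comparison described above, postcensal estimate against the new census versus the previous census held constant, can be sketched with a mean absolute percent error. The exact error definition behind Table 4.5 is not given in the text, so this formula is an assumption, and the three areas and their counts are invented for illustration.

```python
def mean_absolute_percent_error(predicted, actual):
    """Mean over areas of |predicted - actual| / actual, in percent."""
    errors = [
        abs(predicted[area] - actual[area]) / actual[area] * 100.0
        for area in actual
    ]
    return sum(errors) / len(errors)


# Hypothetical counts for three areas; "C" stands in for a small place.
census_1980 = {"A": 900, "B": 1800, "C": 510}
estimate_1990 = {"A": 980, "B": 2050, "C": 520}
census_1990 = {"A": 1000, "B": 2000, "C": 500}

# Error of the postcensal estimate, and of holding the 1980 census constant.
mape_estimate = mean_absolute_percent_error(estimate_1990, census_1990)
mape_constant = mean_absolute_percent_error(census_1980, census_1990)
print(round(mape_estimate, 2), round(mape_constant, 2))
```

In this invented data the estimate wins on average, yet for area C the 1980 count is closer to the 1990 census (2 percent off) than the estimate is (4 percent off), which mirrors the pattern reported for places under 5,000.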
It is incumbent on us, however, to show the uncertainty in the estimates for small areas in future publications, in addition to making continual progress in refining and improving our estimates methodologies and data bases.

4.5 Current Problems and Planned Activities

Many of the problems of the current population estimates system result from its past success and rapid growth during the 1960s and 1970s. Each new program, each expansion of characteristic detail, and each reduction in the size of geographic unit has been accompanied by new data sets, new methods, and new production procedures. Although the Census Bureau has done a good job of meeting users' expectations as these demands have increased, there is room for improvement in the estimates methodology and operations. We have embarked on a set of seven initiatives to revamp the population estimates program and lead it into the next century. These initiatives fall under the following headings: 1) defining the mission, 2) methodological integration, 3) input data quality, 4) geographic flexibility, 5) characteristic detail, 6) analysis of trends, and 7) production efficiency.

4.5.1 Defining the Mission

The products currently estimated by the Census Bureau's Population Estimates Program are the results of opportunities and legislative requirements over a period of three decades. We plan to reexamine the demands for and uses of population estimates. A thorough study of the needs for population estimates and the Census Bureau's proper mission in filling those needs is an initial priority. We are currently polling a number of our users -- Federal government agencies, the Federal-State Cooperative Program members, private data vendors, and a number of other groups -- to ascertain their needs for population estimates.

Some of the suggestions received so far involve modifying the population estimates program in order to produce more detailed characteristic information at the state and county level.
We hope to produce age, sex, race, and Hispanic origin data for counties. With more research, we may also be able to produce the county-level data on households -- number, size, and income -- that are currently demanded by many users. We are examining the feasibility of producing estimates for larger places on a yearly basis and producing estimates for other subcounty geography as well -- possibilities include census tract aggregates, subareas within large cities, and (for some purposes) ZIP codes.

4.5.2 Methodological Integration

The many different methods of estimating population developed over the past decades have resulted in a complex population estimates program. The need now is to integrate these disparate methods into an orderly system. Traditionally, the various estimation models used at the Bureau have been integrated by a simple averaging of the different estimates at a given level of geography and by controlling the sum of estimates at one level of geography to the averaged estimate at the next higher level. The time has come to reexamine each set of methods for suitability as parts of an integrated, parsimonious model for producing population estimates.

In order to discuss ways of integrating our current methods, it is useful to distinguish between methods that measure the changes in the population stock and those that measure the components of population change. Methods showing the change in population stock (the ratio-correlation method, the Medicare change methodology, and the change in group quarters population) use changes in proxy variables since the last census to produce estimates of the total net change since the last census. These methods permit the use of many symptomatic measures of population size that may not be amenable to a flow approach.
Component methods such as the administrative records method and Component Method II represent flow methods in which the components of population change (births, deaths, international migration, and internal migration) are each measured separately and added to or subtracted from the initial population. The advantage of this type of method is that it gives an estimate not only of the population but also of the components of population change. This provides additional information about the reasons for change and the reasonableness of the estimates, and supplies inputs for population projections. Component methods are often preferable for larger areas because they use relatively accurate counts of births and deaths to compute a large part of population change. Consequently, administrative records, which are often less accurate, need only be used to estimate the portion of population change due to migration. Research is currently underway at the Census Bureau to quantify the relative effects of errors in each component on the final population estimates. For small areas, these advantages disappear and change in stock methods such as the housing unit method may be more appropriate.

As we integrate methods, we should be careful to retain the flexibility offered by multiple independent methods of estimating population. Since methodologies for population estimates depend upon data sets collected for purposes other than population estimates, the quality and availability of a given input data set is never certain. Only with multiple methods can we be assured of the ability to produce timely and reliable population estimates. Multiple methods also provide a necessary check on the validity of the estimates results; surprising changes in demographic trends can be checked using independent sources in order to see if the results are merely idiosyncrasies of a given input data source.
The existence of independent methods of estimating population could prove a distinct advantage in trying to gauge the accuracy of estimates between censuses. We should examine the potential of using measures of divergence between independent estimates to determine the reliability and degree of confidence we have in the accuracy of postcensal estimates. If three independent estimates give very close values, we should have more confidence in those estimates than if the estimates vary widely.

4.5.3 Input Data Quality

Perhaps even more important than the type of method chosen is the choice of data set used in the estimate. Producing postcensal population estimates requires integrating traditional demographic data sets such as census results, birth and death records, and immigration statistics with nontraditional sources collected for other administrative purposes, such as tax returns, school enrollment, drivers' licenses, housing construction, survey data, etc. The art of population estimation is to combine these traditional and nontraditional sources to take maximum advantage of all the data available.

The most challenging aspect of working with population estimates is the use of data sets designed and collected for administrative purposes rather than for statistical or demographic purposes. Ideally, such data sets should have universal coverage, change in direct relation with population changes, and be consistent over time in content and form. No data set actually meets these criteria. The level of population coverage is often less than 100 percent. Programmatic changes or changes in social behavior independent of population change may affect the coverage rate. Worst of all, the administrative data set may even disappear if its programmatic need or funding disappears.

Consequently, a healthy population estimates program requires careful attention to the quality and timeliness of input data as well as to the reliability of access to the input data.
This requires working with our data providers to monitor the input databases on a number of requirements including reliability, consistency, coverage, characteristic detail, and idiosyncrasies produced by programmatic and other changes. It also entails work with data producers to address questions of mutual interest such as cost, confidentiality, and legal requirements for data handling. Since administrative data sets may disappear over time, work must also continue on nurturing alternative data sets to provide similar or superior data. The need for flexibility to address changing data set availability and quality is yet another argument for using multiple independent methods and data sets to provide redundancy in the estimates program.

4.5.4 Geographic Flexibility

Linking data on population to geography is the key to population estimation methodology. Any system for making subnational population estimates must have a credible method for developing such geographic correspondence. Population estimates are required for legally defined geographic entities such as counties and incorporated places, and the estimates methodology must take these requirements into account.

In the county estimates conducted jointly with states under the Federal-State Cooperative Program, we assume that the input data used in the ratio-correlation methodology are correctly coded by county of residence. In the administrative records method, which matches tax returns to determine state, county, and place migration, the Census Bureau must provide the geographic coding for movers based on the mailing addresses of filers from the IRS tax forms. In order to categorize these filers by county and place of residence, the current methodology uses a probability coding guide. With the aid of data from a residence question on the 1980 tax form, mailing addresses were categorized by P.O. name, state, ZIP code, and address type (street address, P.O. box, RFD route) and assigned a probability of falling within each of 3,100 counties and 39,000 places.

There are several problems that lead to deterioration in the coding guide over time. Some of the more obvious ones can be corrected by manual adjustments in the coding guide, e.g., creation of new ZIP codes or revised boundaries for old ZIP codes, changes in the boundaries of incorporated places, etc. A key cause of deterioration that cannot be fixed is the change over time in the distribution of the population within a given address key (post office, state, ZIP code, address type combination). To the extent that those changes in distribution cross county, town, and city boundaries, the resulting coding will be incorrect. Moreover, the probability system itself may well put individual persons in the wrong county or place. We know little about how these errors propagate through the system after several years and multiple migrations.

The Census Bureau is currently developing a new geographic coding system that permits frequent updating and, if possible, exact matching rather than probability matching of addresses to geography. The system is based on the "master address list" proposed by the Census Bureau's Geography Division as an outgrowth of the development of the TIGER digitized mapping project and the "address control file" created for use with the 1990 census. This system would provide an annually updated digitized data base that could place most addresses in the United States into the appropriate census block (and thus into any unit of geography that also has its boundaries in the TIGER system). In the estimates area, we are exploring the feasibility of developing a coding system that would code street addresses to subcounty areas using such a master address list.
The existence of a continuously updated master address list could provide far greater geographic detail, ease of updating and correcting for boundary changes, and flexibility in dealing with changing geographic concepts and shifts in population distribution.

This methodology also holds the promise of a far greater benefit in the future. The ability to provide exact matching based on geography might one day permit the matching of records on the basis of address rather than an identifier such as Social Security number. Such an ability would provide the opportunity to bring far more information sources to bear on the estimation effort.

4.5.5 Characteristic Detail

Another major area for innovation is the expansion of data on population characteristics: both demographic characteristics such as age, race, and sex, and social/economic characteristics such as household structure and income. There is a major increase in the demand for age, race, and sex distributions at the county level between censuses, both to get a better hold on the demographic structure of substate areas and to serve as denominators in calculating incidence rates. These data are not available from the IRS tax records that form the principal part of our administrative records processing. Consequently, we are developing alternative methods to provide these data for counties and large places as an integral part of the estimation process.

We have experimented with a number of possible approaches. One of these experimental programs developed a projected estimate in which county trends in migration by age, race, and sex from the previous decennial census were extrapolated into the current decade and combined with actual birth and death rates to produce a population by age, race, and sex that was then controlled to the official estimate of total population for a county.
Another experimental program extends the current administrative record method by adding information by age, race, and sex from Social Security records to a sample of IRS returns to provide internal migration data for states and large metropolitan areas. Current plans call for integrating these programs into our standard procedures by the mid-1990s.

There are also possibilities for using survey data combined with administrative data to obtain characteristic information. While matching survey and administrative data records on an individual basis may prove difficult, there have been efforts to combine data on an aggregate basis. A recent example is an analysis of internal migration that combined aggregate data from the decennial census, matched tax return migration data, and survey data from the Current Population Survey (CPS) to provide a time series of migration by characteristics for state to state flows. Research is proceeding on whether more information from surveys could be combined with the administrative record methods by either aggregate or individual statistical modeling approaches.

Another major effort is underway to produce estimates of housing units and households as survey controls for the American Housing Survey and other housing based surveys. This program uses data on additions to and deletions from the housing stock to update the housing inventory from the decennial census. While this method is similar to the housing unit method for population estimates described above, the resulting housing unit estimates are used directly as survey controls rather than only to estimate population.

There is also the potential for integrating more administrative data into the estimates procedure. A number of federal, state, and even private data sets have been suggested.
Possible data sets include state tax data, post office change of address forms, state driver's license information, food stamp enrollment information, utility hookup records, and telephone directory information. These and other data sets will be explored for their potential utility for making subnational estimates, ensuring that proper attention is given to protection of privacy and proper disclosure safeguards.

4.5.6 Analysis of Trends

A prime advantage of the population estimates program is its information on the changes in spatial population distribution between censuses. While the Census Bureau has put great emphasis on the production of estimates for individual states, counties, and places, we have only occasionally provided summary information on the broader trends in population redistribution. An analysis of population redistribution trends between cities and suburbs, high and low density areas, areas of high and low unemployment, and other analytical categories should be an annual part of our activities. A simple first step is to classify counties by relevant analytical characteristics so that such summaries could be a standard part of our processing. In addition, we plan an annual analytical report on population distribution trends based on the entire range of population estimates.

Much of the intermediate data on components of population change (migration, births, deaths, numbers of housing units, etc.) used in constructing the population estimates is of analytical interest in its own right. These data should be developed as their own data products and used to provide an analytical view of the dynamics of current population change. An integrated set of historically consistent data series on births, deaths, international migration, and internal migration should be developed for all major geographic areas for which population estimates are produced.
As a first step, we are producing a consistent time series of population counts for all counties and for cities over 25,000 from 1790 through 1990.

4.5.7 Production Efficiency

The uncoordinated and erratic growth pattern in the population estimates area has had a substantial effect on production efficiency. During the 1980s, delays in production and unreliable publication dates frequently resulted from the unwieldiness of the production process. For the 1990s, we have streamlined the production process through more parsimonious methodologies and a more focused set of products.

Many of our users repeatedly tell us that it is more important to have a firm production date than to be too optimistic in our timetables. Efforts toward redesigning the estimates product have as a major goal a firm production schedule with realistic deadlines. While considerable progress has been made on this commitment, we expect to strive toward continuous improvement in timeliness as well as reliability and cost reduction.

4.6 Conclusion

Postcensal population estimates are an integral part of the U.S. statistical system, combining census results with tabulations on vital events, providing the population controls by which household survey results can be weighted, and producing a continuous and up-to-date time series of changing population size and distribution between censuses. These estimates are only possible with the creative use of censuses, vital events, administrative data, and other unconventional sources for estimating changes in population on a timely basis.

As we approach the twenty-first century, the population estimates program provides an ideal starting point for an integrated demographic and social accounting system. The system already unites the decennial census and population survey results through a series of longitudinal controls.
These longitudinal controls are based on previous censuses and vital events, and could be modified to incorporate measurements of undercoverage if desired. In the 1990 census, the estimates system provided substantial information for coverage improvement during the operation of the census and in evaluating coverage after the results were in. The system provides the opportunity to integrate the results of administrative records collected for other purposes to augment and improve traditional demographic data. Our efforts to integrate our geographic coding with the decennial census data base (TIGER), to maintain estimates of housing units and households as well as population, and to use data on social and economic characteristics from surveys in the estimation process take us beyond a purely demographic system to an enhanced estimates program that could eventually provide continuously updated data on many of the variables now only measured by the census. Moreover, such an integrated estimates system could provide data on the components and rhythm of population, housing, geographic, social, and economic change that no individual data source can now provide. 
Table 4.1: USES OF CENSUS BUREAU POPULATION ESTIMATES National - Survey Controls - National Social and Economic Series - Descriptive and Analytical Studies - Controls for Subnational Estimates State - Direct Federal Fund Allocation Formulas - Indirect Federal Fund Allocation - Denominators for Federal and Other Data Series - Federal Regulatory Actions - Survey Controls - Descriptive and Analytical Studies - Controls for Substate Estimates Counties - Fund Allocation by State Governments - Denominators for Federal, State, and Other Data Series - Regulatory Action by State Governments - Guides for Government and Private Sector Planning - Descriptive and Analytical Studies - Federal Data Series - Controls for Subcounty Estimates Places - Federal Block Grants - Fund Allocation and Regulatory Actions by Federal and State Governments - Descriptive and Analytical Studies - Government and Private Sector Planning - Private Sector Marketing Efforts - Base Data for Private Sector Data Development Click HERE for graphic. Click HERE for graphic. Click HERE for graphic. Click HERE for graphic. BIBLIOGRAPHY Batutis, Michael J. 1991. "Subnational Population Estimates Methods of the U. S. Bureau of the Census," U. S. Bureau of the Census, Population Division Working Paper. General Accounting Office. 1990. Federal Formula Programs: Outdated Population Data Used to Allocate Most Funds. September. GAO/HRD-90-145. General Accounting Office. 1991. Formula Programs: Adjusted Census Data Would Redistribute Small Percentage of funds to States. November. GAO/GGD-92-12. Mandell, M. and J. Tayman. 1982. "Measuring Temporal Stability in Regression Models of Population Estimation." Demograpby, 19:135-136. Namboodiri, N. K. 1972. "On the Ratio-Correlation and Related Methods of Subnational Population Estimation." Demograpby. 9:443-453. National Academy of Sciences. 1980, Estimating Population and Income of Small Areas. Washington, D.C., National Academy Press. O'Hare, W. P. 1976. 
"Report on a Multiple Regression Method for Making Population Estimate." Demography. 13:369-379. O'Hare, W. P. 1980. "A Note on the Use of Regression Methods in Population Estimates." Demography. 17:341-343. Roe, Linda K., John F. Carlson, and David A. Swanson. "A Variation of the Housing Unit Method for Estimating the Population of Small, Rural Areas: A Case Study of the Local Expert Procedure," Survey Methodology, 19: 155-163 Smith, Stanley K. and Bart Lewis. 1980. "Some New Techniques for Applying the Housing Unit Method of Local Population Estimation," Demography, 17: 323-339. Smith, Stanley K. and Bart Lewis. 1983. "Some New Techniques for Applying the Housing Unit Method of Local Population Estimation: Further Evidence", Demography, 20: 407-413. Smith, Stanley K. and Marylou Mandell. 1984. "A Comparison of Population Estimation Methods: Housing Unit Versus Component II, Ratio Correlation, and Administrative Records,' Journal of the American Statistical Association, 79:282-289. Smith, Stanley K. 1986. "A Review and Evaluation of the Housing Unit Method of Population Estimation," Journal of the American Statistical Association, 82: 287-296. Statistics Canada. Population Estimation Methods: Canada. Ottawa: Ministry of Supply and Services. Swanson, David A. 1980. "Improving Accuracy in Multiple Regression Estimates of Population Using Principles from Causal Modelling," Demography. 17:413-427. Swanson, David W. 1989. "Confidence Intervals for Postcensal Population Estimates: A Case Study for Local Areas," Survey Methodology. 15: 217-280. Swanson, David W. and L. Tedrow. 1984. "Improving the Measurement of Temporal Change in Regression Models Used for County Population Estimates," Demography. 21: 373-381. Tayman, Jeff and Edward Scharer. 1985. "The Impact of Coefficient Drift and Measurement Error on the Accuracyor Ratio-Correlation Population Estimates." The Review of Regional Studies. 15:3-10. U.S.Bureau of the Census. 1980. 
"Population and Per Capita Money Income Estimates for Local Areas: Detailed Methodology and Evaluation," Current Population Reports. Series P-25, No 699. U. S. Bureau of the Census. 1983. "Evaluation of Population Estimation Procedures for States, 1980: an Interim Report." Current Population Reports. Series P-25, No. 933. U. S Bureau of the Census. 1984. "Estimates of the Population of States: 1970 to 1983," Current Population Reports. Series P-25, No. 957. U. S. Bureau or the Census. 1995. "Evaluation of 1980 Subcounty Population Estimates," Current Population Reports. Series P-25, No. 963. U. S. Bureau of the Census. 1986. "Evaluation of Population Estimation Procedures for Counties: 1980," Current Population Reports. Series P-25, No. 984. U. S. Bureau of the Census. 1987. "State Population and Household Estimates, With Age, Sex, and Components of Change: 1981-1986", Current Population Reports. Series P-25, No. 1010. U. 8. Bureau of the Census. 1988. "Use of Federal Tax Returns in the Bureau of the Census' Population Estimates and Projections Program". Population Division Working Paper. U. S. Bureau of the Census. 1988. "Methodology for Experimental County Population Estimates for the 1980s", Current Population Reports. Special Studies. Series P-23, No. 158. U. S. Bureau of the Census. 1989. "Population Estimates by Race and Hispanic Origin for States, Metropolitan Areas, and Selected Counties: 1980 to 1985." Current Population Reports. Series P-25, No. 1040-RD-1. U. S. Bureau of the Census. 1989. "County Population Estimates: July 1, 1988, 1987, and 1986," Current Population Reports, Series P-26, No. 88-A. U. S. Bureau of the Census. "Population Estimates for Metropolitan Statistical Areas: July 1, 1988, 1987, and 1986," Current Population Reports. Series P-26, No. 88-B. U. S. Bureau of the Census. 1990. "State Population and Household Estimates: July 1, 1989." Current Population Reports. Series P-25, No. 1058. U. S. Bureau of the Census. 
1990. "1988 Population and 1987 Per Capita Income Estimates for Counties and Incorporated Places," Current Population Reports. Series P-26, No. 88-SC.
van der Vate, Barbara J. 1988. "Methods Used in Estimating the Population of Substate Areas in the United States," Paper presented at the International Symposium on Small Area Statistics, New Orleans, LA, Aug. 26-27.

CHAPTER 5
Bureau of Labor Statistics' State and Local Area Estimates of Employment and Unemployment
Richard Tiller and Sharon Brown, Bureau of Labor Statistics
Alan Tupek, National Science Foundation

5.1 Introduction and Program History

The Bureau of Labor Statistics' (BLS) Local Area Unemployment Statistics (LAUS) Program produces state and area employment and unemployment estimates under a federal-state cooperative program. At present, monthly employment and unemployment estimates are prepared for the 50 states and the District of Columbia, all Metropolitan Statistical Areas (MSA's), all counties, and selected subcounty areas for which data are required by legislation -- more than 5,300 areas.

The Current Population Survey (CPS), conducted by the Bureau of the Census for the BLS, is the official survey instrument for measuring the labor force in the United States. The CPS sample provides direct monthly survey estimates of employment and unemployment for the nation, selected states, New York City, and Los Angeles. However, the CPS sample is not sufficiently large in most states and substate areas to provide reliable monthly estimates. Therefore, methods are used to combine data from other sources with current and historical CPS sample estimates to produce monthly estimates of employment and unemployment for the remaining states, the District of Columbia, and substate areas.

The CPS began during the Great Depression as a project of the Work Projects Administration (WPA). During and following World War II, the need for unemployment data at the local level began to develop.
A number of state and federal agencies began making estimates using various procedures. In 1950, the U.S. Department of Labor's Bureau of Employment Security, in an attempt to standardize the estimation methods, issued guidelines in a booklet, Techniques for Estimating Unemployment. In 1960, the Handbook Method on Estimating Unemployment was issued. This building block or accounting method for developing total employment and unemployment estimates is essentially still used for substate areas today.

About the same time, Congress began passing legislation using local unemployment data for the allocation of funds, such as the Area Redevelopment Act in 1961 and the Public Works and Economic Development Act in 1965. Legislated programs which currently allocate funds to states and local areas based on unemployment estimates include the "Disadvantaged Adults and Youths", "Summer Youth", and "Dislocated Workers" programs of the Job Training Partnership Act, the "Emergency Food and Shelter Program", and the "Public Works Program". In FY91, more than 9 billion dollars in appropriations to states and local areas were based, in full or in part, on local area unemployment statistics.

In 1972, the BLS acquired responsibility for unemployment statistics. BLS subsequently introduced changes to the Handbook methodology, including the use of annual average estimates from the CPS as controls for the state and area monthly estimates. Beginning in 1973, the CPS sample size was expanded to allow for direct sample based estimates for the 10 largest states, Los Angeles and New York City. In 1984, an 11th state was added. The Handbook method was still used for the 39 remaining states and the District of Columbia. However, a 6-month moving average adjustment using CPS data was applied to the state estimates. For substate areas, Handbook estimates are prepared for all labor market areas in the state, which are controlled to the state CPS-based estimates of employment and unemployment.
At present, monthly employment and unemployment estimates are also prepared for all MSA's, all counties, and selected subcounty areas for which data are required by legislation.

In 1989, a new methodology was introduced for producing monthly state employment and unemployment statistics for the 39 smaller states and the District of Columbia. This method is a time-series regression model and uses a state-space Kalman filter approach.

Monthly estimates for the 39 smaller states and the District of Columbia are published approximately 6 weeks after the reference week of the CPS, which is the week including the 12th of the month. Sample based estimates for the largest 11 states are released a few weeks earlier (usually the first Friday of the month following the reference month) with the national estimates. Estimates for the 39 smaller states and the District of Columbia are revised a month later to reflect revisions in the Unemployment Insurance Statistics and the Current Employment Statistics (Payroll Employment Survey) Program, which are used in both the Handbook Method and State Modeling Method. At the end of the year, monthly state estimates are revised (benchmarked) so that their annual average equals the CPS sample based annual average. For the 11 large states, data are revised to reflect population controls.

5.2 Program Description, Policies, and Practices

Only five labor force estimates are published monthly for state and substate areas: civilian noninstitutional population, civilian labor force, employed, unemployed, and the unemployment rate. Each month a press release - The Employment Situation - is issued and the Commissioner of Labor Statistics testifies before the Joint Economic Committee of Congress. The press release includes employment and unemployment estimates for the 11 largest states, in addition to national estimates.
These data, as well as data for the remaining 39 states, the District of Columbia and Metropolitan Statistical Areas (MSA's), are published about four weeks later in Employment and Earnings. Seasonally adjusted data are provided for all 50 states and the District of Columbia, beginning in January 1992. Although the data for the smaller states are published with the data for the 11 largest states, the data are published in two sets in a table. The estimating methodology for the smaller states is provided in a footnote at the bottom of the page.

A separate monthly publication - State and Metropolitan Unemployment - also includes data using direct sample survey estimates for the 11 largest states, the State Model methodology for the remaining states and the state CPS additively-adjusted Handbook method for sub-state estimates. This publication provides more detailed estimates for sub-state areas. In all, monthly labor force estimates are provided for 5,300 areas, including Metropolitan Statistical Areas (MSA's), Labor Market Areas (LMA's), all counties (cities and towns in New England), and cities of 25,000 population or more.

Estimates for all but the 11 largest states, Los Angeles, and New York City are revised a month following the initial publication in which they appear, and again at the end of the year. The first revision takes into consideration revisions to the Payroll Employment and Unemployment Insurance statistics. The end of year revision adjusts the monthly estimates such that their annual average equals the CPS sample based annual average estimates for those states and sub-state areas for which CPS data are provided.

5.2.1 Design of the Current Population Survey

The CPS monthly sample consists of 72,000 housing units. This sample size was chosen to meet national and state reliability requirements.
Assuming a 6% unemployment rate, the national sample size was chosen so that a month-to-month change of 0.2 percentage points in the unemployment rate would be statistically significant at the 90 percent confidence level. This translates to a coefficient of variation (CV) of 1.8% for the national unemployment rate. The 11 largest states have a CV of 8.0% on the monthly unemployment rate. The other 39 states and the District of Columbia have a CV of 8.0% on the annual average unemployment rate.

The CPS sample is located in 729 areas comprising over 1,000 counties and independent cities with coverage in every state and the District of Columbia. Prior to 1984, the CPS had been designed as a national sample with the goal of providing the best estimates of employment and unemployment for the U.S. as a whole.

The CPS sample is selected by first dividing the entire area of the United States into 1,973 primary sampling units (PSU's), where a PSU is a county or a number of contiguous counties. The 1,973 PSU's are grouped into strata within each state. One PSU is selected from each stratum with probability of selection proportionate to the population size of the PSU. The most populated PSU's are grouped by themselves and selected with certainty. Since the sample design is state based, the sampling ratio differs by state, ranging roughly from 1 in every 200 households to 1 in every 2,500 households.

There are several stages of selecting the housing units within PSU's. First, enumeration districts, which are administrative units and contain about 300 housing units, are ordered so that the sample reflects the demographic and residential characteristics of the PSU. Within each enumeration district the housing units are sorted geographically and are grouped into clusters of approximately four housing units. A systematic sample of these clusters of housing units is then selected.

Part of the sample is changed each month.
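
The selection of one PSU per stratum with probability proportionate to size, described above, can be illustrated with a small sketch. It assumes a hypothetical stratum of three PSU's with made-up population sizes; the function names `selection_probabilities` and `select_psu` are illustrative and are not part of any Census Bureau or BLS system.

```python
import random

def selection_probabilities(stratum):
    """Return each PSU's selection probability, proportional to its population."""
    total = sum(stratum.values())
    return {psu: size / total for psu, size in stratum.items()}

def select_psu(stratum, rng):
    """Draw one PSU from the stratum with probability proportional to size."""
    psus = list(stratum)
    sizes = [stratum[psu] for psu in psus]
    return rng.choices(psus, weights=sizes, k=1)[0]

# Hypothetical stratum: PSU label -> population (illustrative numbers only).
stratum = {"PSU-A": 50_000, "PSU-B": 30_000, "PSU-C": 20_000}
probs = selection_probabilities(stratum)
chosen = select_psu(stratum, random.Random(12345))

# First-stage weight for the chosen PSU: the inverse of its selection probability.
first_stage_weight = 1.0 / probs[chosen]
```

A larger PSU is more likely to be drawn but receives a smaller first-stage weight, which is the idea behind weighting sample data by the inverse of the selection probability.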
For each sample, eight systematic subsamples (rotation groups) or segments are identified. A given rotation group is interviewed for a total of 8 months -- 4 consecutive months in the survey, followed by 8 months out of the survey, followed by 4 more consecutive months in the survey. Under this system, 75 percent of the sample segments are common from month to month and 50 percent of the sample segments are common from year to year.

The estimation procedure involves weighting the data from each sample person by the inverse of the probability of the person being in the sample. These estimates are then adjusted for noninterviews, followed by two ratio estimation procedures to adjust the CPS estimates to known population totals. The last step in the preparation of estimates makes use of a composite estimating procedure. The composite estimate for the CPS is a weighted average of the estimate for the current month and the estimate for the previous month, adjusted for the net month-to-month change in households.

Balanced repeated replication and collapsed stratum methods are used to estimate CPS variances for selected characteristics. Generalized variance functions are used to present the sampling error estimates in publications. Sampling error estimates are provided for all direct sample based estimates, which include the annual average estimates for states and some sub-state areas, as well as monthly estimates for the 11 largest states. Error estimates are not provided for estimates which use the State Model methodology or the Handbook methodology. Generalized variance functions are used for calculating sampling error estimates for the direct sample based estimates from the CPS. The Employment and Earnings series provides methods for calculating sampling error estimates for almost any estimate in the publication. These methods can also be used to calculate sampling error estimates for unpublished CPS estimates, such as the monthly unemployment rates for the smaller states.
These sampling error estimates can be used to approximate the error in the model based estimates.

5.3 Estimator Documentation

The method used to provide monthly state estimates for the 39 states and the District of Columbia is based on the time series approach to sample survey data. Originally suggested by Scott and Smith (1974), this approach treats the population values as stochastic and uses signal extraction techniques developed in the time series literature to improve on the direct survey estimator. Recent work has been conducted by Bell and Hillmer (1990), Binder and Dick (1990), Pfeffermann (1991), and Tiller (1992a).

The actual monthly CPS sample estimates are represented in signal plus noise form as the sum of a stochastically varying true labor force series (signal) and error (noise) generated by sampling only a portion of the total population. Issues related to non-sampling errors are not considered by this approach. The signal is represented by a time series model that incorporates historical relationships in the monthly CPS estimates along with auxiliary data from the Unemployment Insurance (UI) and Current Employment Statistics (CES) programs. This time series model is combined with a noise model that reflects key characteristics of the sampling error to produce estimates of the true labor force values. This estimator has been shown to be design consistent under general conditions by Bell and Hillmer (1990) and is optimal under the model assumptions.

Unlike the typical small area estimation application that seeks to improve on the direct survey estimator by borrowing strength over areas, the time series approach borrows strength over time for a given area. While variance reduction is a primary goal of both these approaches, when there are strong overlaps in the sample design and relatively long historical series are available, the time series approach provides powerful tools for estimating the underlying population values.
As discussed in the previous section, the CPS design creates major sample overlaps resulting in very strong autocorrelations in the sampling errors. By combining a model of both the true labor force values and the sampling error, the time series approach controls for the autocorrelation induced by the sample design, making it easier to identify the population dynamics. This is particularly useful in trend analysis and seasonal adjustment. When sampling error is strongly autocorrelated, trend and sampling effects are confounded in the observed data (Tiller, 1992b).

[Equation graphics not reproduced.]

Seasonal Component

The seasonal component is the sum of six trigonometric terms associated with the 12-month frequency and its five harmonics.

[Equation graphic not reproduced.]

Irregular Component

The irregular component is a residual not explained by the regression or time series components discussed above. The convention in classical decomposition of a univariate time series is to represent the irregular as a highly transient phenomenon, i.e., as white noise or a low order MA process.

Noise

The noise component of the observed CPS estimate represents error that arises from sampling only a portion of the total population. Its structure depends upon the CPS design and population characteristics. For our purposes, we focus on those design features that are likely to have a major effect on the variance-covariance structure of the sampling error, e(t).

One of the most important features of the CPS is the large overlap in sample units from month to month. As described in the previous section, units are partially replaced each month according to a 4-8-4 rotating panel. Since this system produces large overlaps between samples one month and one year apart, we can expect e(t) to be strongly autocorrelated. Also, there is likely to be some correlation between nonidentical units in the same rotation group because of the way in which new samples are generated.
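
The overlap implied by the 4-8-4 rotation pattern can be checked with a short sketch. It assumes one rotation group enters the sample each month, so that in any month the eight groups in sample sit at months 0-3 and 12-15 of their own schedules; the name `overlap_fraction` is illustrative.

```python
# Months (counted from a group's first interview) in which a rotation group
# is in sample under the 4-8-4 pattern: 4 months in, 8 months out, 4 months in.
IN_SAMPLE = frozenset({0, 1, 2, 3, 12, 13, 14, 15})

def overlap_fraction(lag):
    """Fraction of this month's rotation groups still in sample `lag` months later."""
    common = sum(1 for month in IN_SAMPLE if month + lag in IN_SAMPLE)
    return common / len(IN_SAMPLE)

# Overlap at every lag from 1 to 16 months.
overlaps = {lag: overlap_fraction(lag) for lag in range(1, 17)}
```

Under these assumptions the sketch gives 75 percent overlap at a lag of one month and 50 percent at a lag of twelve months, matching the figures quoted for the CPS design, with no overlap at lags of 4 through 8 months.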
When a cluster of housing units permanently drops out of a rotation group, it is replaced by nearby units. Since the new units will have characteristics similar to those being replaced, this will result in correlations between nonidentical households in the same rotation group (Train, Cahoon, and Makens, 1978).

Finally, the dynamics of the sample error will also be affected by the composite estimator. This is a weighted average of an estimate based on the entire sample for the current month only and an estimate which is a sum of the prior month composite and the change that occurred in the six rotation groups common to both months (Bureau of the Census, 1978). In effect, this estimator takes a weighted average of sample data from the current and all previous months.

Another important feature of the CPS is its changing variance over time. There are three major sources of heteroscedasticity: (1) sample redesigns; (2) changes in the sample size; and (3) changes in the true value of the population characteristic of interest. The first two cause discrete shifts in the sample variance. For example, the CPS is redesigned each decade to make use of decennial census data to update the sampling frame and estimation procedures. Most recently, a state-based design was phased in during 1984/85 along with improved procedures for noninterviews, ratio adjustments, and compositing. Changes in state sample sizes have occurred more frequently than redesigns and have had a major effect on variances at the state level. Even with a fixed design and sample size, the error variance will be changing because it is a function of the size of the true labor force. Since the labor force is both highly cyclical and seasonal, we can expect the variance to follow a similar pattern.

To capture the autocorrelated and heteroscedastic structure of e(t), we may express it in multiplicative form (see Bell and Hillmer, 1990) as

[Equation graphics not reproduced.]
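
Because the original equation is not available in this text, the following is only a sketch of one common way to write a multiplicative form of this kind, in the spirit of the cited Bell and Hillmer (1990) reference. The symbols \gamma(t), e^{*}(t), \phi_i, \theta_j and a(t), and the sign convention on the moving average terms, are assumptions introduced for illustration.

```latex
e(t) = \gamma(t)\, e^{*}(t), \qquad
e^{*}(t) = \sum_{i=1}^{p} \phi_i\, e^{*}(t-i) + a(t) + \sum_{j=1}^{q} \theta_j\, a(t-j)
```

Here \gamma(t) is a time-varying scale factor that carries the heteroscedasticity, and e^{*}(t) is a stationary ARMA(p, q) process driven by white noise a(t) that carries the autocorrelation.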
The autocovariance structure may also change over time with redesigns of the sample. However, since the most important source of autocorrelation is the 4-8-4 rotation scheme, which has not changed, it seems reasonable to treat this structure as stable, at least between sample designs.

The application of the signal-plus-noise approach requires information on the variance-covariance structure of the CPS at the state level. In principle, this information can be estimated directly from the sample unit data using conventional design-based methods. In practice, this is not always feasible, since the CPS variance estimation involves complex computations on large microdata files. In the initial implementation of models in 1989 for the 39 states and the District of Columbia, not enough information was available to explicitly model the sampling error. Instead, the noise component was estimated as a correlated residual (Tiller, 1989). More recently, sampling error autocorrelations have been developed and new models are being tested incorporating this information (Tiller, 1992a).

Estimation

The models described in the previous section are estimated using the Kalman filter (KF). The KF is a highly efficient algorithm for estimating unobserved components of a time series model, when that model can be represented in state-space form. The state-space form consists of two sets of equations, transition and measurement equations, and a set of initial conditions. The unobserved signal and noise components are collected into the state vector, Z(t). The transition equations represent the state vector as a first-order vector-autoregressive process (VAR) with a normal and independently distributed disturbance vector, v(t), which contains the white noise disturbances associated with each of the unobserved component processes.

[Equation graphic not reproduced.]

The transition equations are set out below in a simplified form appropriate for our specific application.
In the new models under development, the CPS error correlation structure is estimated outside of the time series model from design-based information. Variances for the state CPS estimates are computed using the method of generalized variances (Tiller, 1992). Autocorrelations were derived in a study by Dempster and Hwang (1992). In that study, state-specific variance component models were fit to a time series of data for the 8 CPS rotation groups. From the estimated variance parameters, autocorrelations were derived and then ARMA parameters were estimated from these autocorrelations.

[Equation graphic not reproduced.]

Once an additional observation, y(t), becomes available, the update equations revise the conditional moments with the new information in that observation.

[Equation graphic not reproduced.]

To initialize these equations, it is necessary to specify starting values for the conditional moments, Z(0) and P(0). Those elements of the state vector that are stationary, i.e., sampling error and the irregular, are initialized with their unconditional moments. The nonstationary and nonstochastic state variables are initialized with diffuse priors.

Together, the prediction and update equations constitute the KF. The KF updates its latest prediction of the state vector with current sample data, prepares a prediction for the next period and updates that prediction when new sample data become available, but the estimate of Z(t) will not be revised with data later than period t. Thus, the KF estimator at time t is optimal only with respect to data up to and including period t. The estimator of Z(t) optimal for all observations, before and after t, is known as a smoother. By taking a linear combination of a forward and a backward KF, which is the KF run in reverse, starting at the end of the sample period at time t=T and proceeding to the beginning, a Kalman (fixed interval) smoother (KS) is obtained.
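
The prediction and update steps described above can be sketched for the simplest possible case. The example below is not the BLS model: it assumes a scalar state that follows a random walk (disturbance variance `q`) observed with white measurement noise (variance `r`), and it approximates a diffuse prior with a large initial variance `p0`. All names and numbers are illustrative.

```python
def kalman_filter(observations, q, r, z0=0.0, p0=1.0e6):
    """Scalar Kalman filter for a random-walk signal observed with white noise.

    Transition:  z(t) = z(t-1) + v(t),  Var[v(t)] = q
    Measurement: y(t) = z(t) + e(t),    Var[e(t)] = r
    """
    z, p = z0, p0  # starting values; a large p0 stands in for a diffuse prior
    filtered, variances, std_errors = [], [], []
    for y in observations:
        # Prediction: project the state and its variance one period ahead.
        z_pred = z
        p_pred = p + q
        # Update: revise the prediction with the new observation y(t).
        error = y - z_pred             # one-step-ahead prediction error
        error_var = p_pred + r         # variance of the prediction error
        gain = p_pred / error_var      # Kalman gain
        z = z_pred + gain * error
        p = (1.0 - gain) * p_pred
        filtered.append(z)
        variances.append(p)
        std_errors.append(error / error_var ** 0.5)
    return filtered, variances, std_errors

# Illustrative observations only; not CPS data.
y = [5.2, 4.9, 5.1, 5.0, 4.8, 5.3]
filtered, variances, std_errors = kalman_filter(y, q=0.01, r=1.0)
```

Each pass through the loop needs only the previous state estimate and its variance, so earlier observations do not have to be stored or reprocessed. The standardized prediction errors returned in `std_errors` are the quantities a model diagnostic would examine.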
Let the backward filter prediction of the state vector at time t, conditional on data from t+1 to T, be denoted by Z(t/t+1) and its covariance by P(t/t+1). The smoothed estimator is

[Equation graphic not reproduced.]

State agency staff prepare their official monthly estimates using software developed by BLS that implements the KF. This algorithm is particularly well suited for the preparation of current estimates as they become available each month. Since it is a recursive data processing algorithm, it does not require all previous data to be kept in storage and reprocessed every time a new sample observation becomes available. All that is required is an estimate of the state vector and its covariance matrix for the previous month. The software is interactive, querying users for their UI and CES data and then combining these data with CPS estimates to produce model based estimates. At the end of the year, the monthly estimates are revised, along with previous-year estimates, using the smoothing algorithm.

5.4 Evaluation Practices

[Graphic not reproduced.]

For each of the 39 states and the District of Columbia, signal plus noise models of the CPS unemployment rate and employment level were fit to monthly data beginning in 1976. Each of the 80 models has been subjected to a wide variety of statistical tests. An analysis of the model's prediction errors is the primary tool for assessing model adequacy. The prediction errors are computed as the difference between the current values of the CPS and the predictions of the CPS made from the model based on data prior to the current period. Since these errors represent movements not explained by the model, they should not contain any systematic information about the behavior of the signal or noise component of the CPS. Specifically, the prediction errors, when standardized, should approximate a randomly distributed variate with zero mean and constant variance (white noise).
The tests used to check the prediction errors for departure from these properties included:
- General tests for non-zero correlations in the innovations
- Departures from white noise behavior at the seasonal frequencies
- Heteroscedasticity
- Non-normality
- Prediction bias

About 50 to 60 percent of the total variance in the monthly CPS series is attributable to the estimated signal, with the remainder due to the aggregate noise term. The time varying regression mean is considerably smoother than the underlying CPS series. Based on the diagnostic tests, the 80 models appear to fit the systematic underlying movements in the CPS fairly well. The major problems with the models were high autocorrelations in 11 states, and heteroscedasticity in 9 of the 40 unemployment rate models. The heteroscedasticity is in part a reflection of changing variances (and sample sizes) in the CPS. Explicitly modeling the CPS sample errors would alleviate this problem and is discussed in the current problems and activities section, below.

The current state model estimates were introduced in January 1989. The previous Handbook method could be classified as an accounting method. Several years of research and development, beginning in the early 1980's, examined numerous regression and time series approaches to replace the accounting method. A number of workgroups were set up to determine the criteria to be used to select the new methodology as well as how to implement the new methodology.

Ongoing evaluation of models includes annual reassessment of the regressor variables, if requested by staff in the state employment security agencies. Typically, state agency staff express concerns with the models if either the month to month movements in the unemployment rate estimates are larger than they expect or the unemployment rate level seems unreasonable compared to other economic data. Diagnostic tests, similar to the ones used for developing the model, are run.
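
One general test for non-zero correlations in standardized prediction errors is a portmanteau statistic. The sketch below computes sample autocorrelations and the Ljung-Box Q statistic in plain Python. It is offered as an illustration of this kind of diagnostic, not as the specific procedure BLS used; the function names and the example series are hypothetical.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at the given positive lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numer / denom

def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(x)
    return n * (n + 2) * sum(
        autocorrelation(x, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )

# Hypothetical standardized prediction errors with an obvious alternating pattern.
errors = [1.0, -1.0] * 10
r1 = autocorrelation(errors, 1)
q1 = ljung_box_q(errors, 1)
```

Under the white noise hypothesis, Q is compared with a chi-square distribution whose degrees of freedom equal the number of lags tested, less any adjustment for estimated parameters. For this deliberately non-random series, Q at lag 1 is far above 3.84, the 5 percent critical value for one degree of freedom.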
Adjustments to the regressor variables may be made if the diagnostics indicate a problem with the model. In this case historical estimates would be replaced, in addition to developing a new model for concurrent estimates.

5.5 Current Problems and Activities

The implementation of model estimates for states in January 1989 resulted, not unexpectedly, in estimates with more month to month volatility than the previous Handbook method. The previous method incorporated a 6 month moving average, which limited month to month movement. The seasonal variation in the employment and unemployment statistics series is usually large relative to the trend and cycle. However, the BLS decided to conduct a research project to investigate the issues of seasonally adjusting model based estimates prior to implementing seasonally adjusted state estimates.

In November 1989, a work group was formed to examine issues related to the seasonal adjustment of estimates of employment level and unemployment rate for the 39 smaller states and the District of Columbia (non-direct use states). The group was charged with addressing two primary areas:
1. Evaluation of the performance of the model estimates relative to the CPS sample-based estimates, with emphasis on the trend/cycle characteristics of the series.
2. Evaluation of the use and limitations of the BLS standard seasonal adjustment method, X-11 ARIMA, to seasonally adjust the model estimates.

The evaluation of the modeling approach involved simulating a reduction in reliability of the CPS samples in two large direct-use states (Florida and Massachusetts) to nondirect-use levels, fitting models to the resulting weakened series, and then comparing model estimates to the CPS estimates from the full sample. While it would have been desirable to simulate sample cuts by subsampling the original data, this was considered too costly.
Instead, random noise was added to the full sample estimates, using an estimated variance/covariance structure of the CPS estimator. For each state, two weakened samples were generated for employment and unemployment, and separate models were fitted to the full and weakened samples. The main findings are summarized as follows:

Model Evaluation

1. Modeling the weakened unemployment rate series resulted in estimates which were closer to the full CPS sample than the unmodeled weak CPS series for all four unemployment rate series. Values for the root mean squared relative difference (RMSRD) comparing full CPS to model estimates were 28 to 38 percent smaller than values of RMSRD comparing the full CPS to the weakened series. Modeling also reduced the number of weakened series estimates falling outside two standard deviation intervals about the full sample estimates by 50 to 75 percent.
2. Modeling the weakened employment series resulted in modest, if any, reductions in the RMSRD from the full CPS sample. In one case for the Florida employment series, the RMSRD values for the model were actually larger than those for the weakened series. This appeared to be due primarily to the fact that the difference in reliability between the full sample and the weakened series for employment was very small compared to the difference for unemployment rate.
3. Modeling dramatically reduced the magnitude of irregular fluctuation in both employment and unemployment rate. It was not unusual for the relative contribution of the variance of monthly change in the CPS to be 4 to 8 times that of the modeled series. The much smoother quality of the model estimates has important implications for seasonal adjustment (see below).

Evaluation of Seasonal Adjustment

Using X-11 ARIMA, the CPS and modeled series were seasonally adjusted. The adequacy of the seasonal adjustment was evaluated using X-11 ARIMA quality control statistics, spectral analysis, sliding spans, and graphs of seasonal factors.
The major findings are as follows:
1. The direct sample based unemployment rate CPS series could not be adequately seasonally adjusted. Frequently, several of the X-11 ARIMA quality control statistics failed. Seasonal adjustments had poor stability properties, the seasonal variation could not be completely removed, and distortion was added to the nonseasonal variation in the series. The results were better for the CPS employment series but not nearly as good as for the modeled series.
2. The seasonal adjustments for all four of the employment and unemployment model series appeared satisfactory. Spectral analysis shows that X-11 ARIMA was able to effectively remove seasonal variation in the modeled series without introducing distortions in the nonseasonal components of the series. The sliding span statistics indicate seasonal factors remain stable as the span of the data is shifted across time. In addition, monthly seasonal factors using the model were similar to the seasonal factors of the full sample estimates of the two direct-use states. This indicates that the models were not forcing an artificial pattern, but were "picking up" the seasonal pattern of the underlying CPS series, despite the extra noise which was introduced.

In summary, despite some limitations to the methods of evaluation, the study provided important information to help understand the value of modeling and the use of X-11 ARIMA to seasonally adjust model-based estimates; however, the theoretical base for superimposing the modeling structure of X-11 ARIMA on already smoothed, model based values remains to be explored. Although the study confirmed support for modeling, further work will be done to demonstrate the utility of the employment models. The BLS introduced seasonally adjusted state employment and unemployment estimates beginning in January 1992, based on the results of this study.
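
The comparison at the heart of the model evaluation in this study can be sketched as follows. The text does not give the formula for the RMSRD, so the definition used here (root of the mean squared difference relative to the full-sample estimate) is an assumption, as are the function names and the illustrative numbers. For simplicity the sketch adds independent Gaussian noise, whereas the study used an estimated variance/covariance structure of the CPS estimator.

```python
import math
import random

def weaken(series, noise_sd, rng):
    """Simulate a less reliable series by adding independent Gaussian noise."""
    return [value + rng.gauss(0.0, noise_sd) for value in series]

def rmsrd(reference, candidate):
    """Root mean squared relative difference of candidate from reference (assumed form)."""
    rel_sq = [((c - r) / r) ** 2 for r, c in zip(reference, candidate)]
    return math.sqrt(sum(rel_sq) / len(rel_sq))

# Illustrative full-sample unemployment rates in percent; not real CPS data.
full_sample = [6.1, 6.0, 5.8, 5.9, 6.2, 6.4, 6.3, 6.1]
weakened = weaken(full_sample, noise_sd=0.5, rng=random.Random(2024))
rmsrd_weak = rmsrd(full_sample, weakened)
```

A model fitted to `weakened` would then be scored with `rmsrd(full_sample, model_estimates)`, where `model_estimates` is a hypothetical list of model output, and that value compared with `rmsrd_weak`.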
Current research is focusing on further reduction in irregular movement in employment and unemployment models by introducing several changes to the methods. The most important change is the inclusion of the variance/covariance structure of CPS estimates into the models, rather than relying on the model to make these estimates (Tiller, 1992a). Information on the structure of the CPS sample error is being used to decompose the disturbance term into its sample error and model error components. Given CPS error variances and lag covariances, ARMA models can be developed to approximate the time series behavior of the sampling error. Treating the ARMA coefficients as known parameters of the state space system, standard time series diagnostic tools may be used to model the errors in equation disturbances. The need for estimating the variance-covariance structure of the CPS estimates stems from sample redesign and changes in sample size. Other changes, such as removing some exogenous CPS variables, are expected to improve the seasonal movements in the model estimates. Florida and Massachusetts will again be used to examine the ability of the weakened series to track the full sample CPS estimates. Research is expected to be completed in FY93 for implementation in January 1994.

Long term research will focus on substate estimation. Hierarchical and Empirical Bayes methods may be considered in addition to a time-series approach for substate estimates. Spatial models, which borrow strength from CPS sample data within the state, may be appropriate for substate estimates.

A number of related studies have been conducted under the auspices of other governmental agencies. The Census Bureau has supported research by Bell and Hillmer (1990) that has been instrumental in stimulating renewed interest in the time series approach to survey estimation. In this study, the authors applied ARIMA models to retail survey data.
Binder and Dick (1990), at Statistics Canada, fitted ARMA models to Canadian Labour Force survey data. Both of these studies estimated the sampling error structure outside the time series model, using design-based methods. Pfeffermann (1991) applied a structural time series model to individual panel estimates from the Israeli labor force survey. The sampling error structure was estimated through the model rather than by design-based methods. Dempster and Hwang (1993) have developed prototype Bayesian models for estimating U.S. State employment and unemployment rates. Their basic time series models are constructed from fractional Gaussian noise processes.

REFERENCES

Bell, W.R. and Hillmer, S.C. (1990), "The Time Series Approach to Estimation for Repeated Surveys," Survey Methodology, 16, 195-215.

Binder, D.A. and Dick, J.P. (1990), "A Method for the Analysis of Seasonal ARIMA Models," Survey Methodology, 16, 239-253.

Bureau of Labor Statistics (1988), Handbook of Methods, Washington, D.C.

Bureau of Labor Statistics (1991), Report on the Seasonal Adjustment of LAUS Model Estimates, Washington, D.C.

Bureau of Labor Statistics (1991), The Current Population Survey - An Overview, Internal Document by Edwin Robison, Washington, D.C.

Bureau of the Census (1978), The Current Population Survey: Design and Methodology, Technical Paper 40, Washington, D.C.

Dempster, A.P. and Hwang, J.-S. (1993), "Component Models and Bayesian Technology for Estimation of State Employment and Unemployment Rates," paper presented at the 1993 Annual Research Conference, Census Bureau.

Harvey, A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.

Pfeffermann, D. (1991), "Estimation and Seasonal Adjustment of Population Means Using Data from Repeated Surveys," Journal of Business and Economic Statistics, 9, 163-175.

Scott, A.J. and Smith, T.M.F. (1974), "Analysis of Repeated Surveys Using Time Series Methods," Journal of the American Statistical Association, 69, 674-678.

Tiller, R. (1989), "A Kalman Filter Approach to Labor Force Estimation Using Survey Data," in Proceedings of the Survey Research Methods Section, American Statistical Association.

____ (1992a), "Time Series Modeling of Sample Survey Data from the U.S. Current Population Survey," Journal of Official Statistics, 8, 149-166.

____ (1992b), "A Time Series Approach to Small Area Estimation," in Proceedings of the Survey Research Methods Section, American Statistical Association.

Train, G., Cahoon, L., and Makens, P. (1978), "The Current Population Survey Variances, Inter-Relationships, and Design Effects," in Proceedings of the Survey Research Methods Section, American Statistical Association, 443-448.

CHAPTER 6
County Estimation of Crop Acreage Using Satellite Data
Michael Bellow, Mitchell Graham, and William C. Iwig
National Agricultural Statistics Service

6.1 Introduction and Program History

The National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA) has published county estimates of crop acreage, crop production, crop yield and livestock inventories since 1917. These estimates assist the agricultural community in local agricultural decision making. Also, the Federal Crop Insurance Corporation (FCIC) and the Agricultural Stabilization and Conservation Service (ASCS) of the USDA use NASS county crop yield estimates to administer their programs involving payments to farmers if crop yields are below certain levels. The primary source of data for these estimates has always been a large non-probability survey of U.S. farmers, ranchers, and agribusinesses who voluntarily provide information on a confidential basis (see Chapter 7). In addition, the Census of Agriculture, conducted by the Bureau of the Census every five years, serves as a valuable benchmark for the NASS county estimates.
Earth resources satellite data, particularly from the Landsat series of satellites, provide another useful ancillary data source for county estimates of crop acreage. The potential for improved estimation accuracy using satellite data is based on the fact that, with adequate coverage, all of the area within a county can be classified to a crop or ground cover type. The accuracy of the estimates is then dependent on how accurately the satellite data are classified to each crop type based on the "ground truth" data obtained from the annual June Agricultural Survey (JAS) conducted by NASS. Through the use of aerial photographs, this survey identifies the crop type of individual fields within randomly selected land segments. Segments in major agricultural areas are approximately one square mile in area and normally contain 10 to 20 fields. The satellite spectral data are matched to the corresponding fields for use in classifying all individual imaged areas, known as pixels, to a particular crop type. Recent studies (Bellow 1991; Bellow and Graham 1992) have shown that, for certain crops, approximately 80 percent of the pixels are classified correctly. This correct classification level is high enough to provide improved estimation accuracy.

NASS has been a user of remote sensing products since the 1950's when it began using mid-altitude aerial photography to construct area sampling frames (ASF's) for the 48 states of the continental United States. A new era in remote sensing began in 1972 with the launch of the Landsat I earth-resource monitoring satellite. Four additional Landsats have been launched since 1972, with Landsat IV and V still in operation in 1993. The polar-orbiting Landsat satellites contain a multi-spectral scanner (MSS) that measures reflected energy in four bands of the electromagnetic spectrum for an area of just under one acre. The spectral bands were selected to be responsive to vegetation characteristics.
In addition to the MSS sensor, Landsats IV and V have a Thematic Mapper (TM) sensor which measures seven energy bands and has increased spatial resolution. The large area (185 by 170 km) and repeat (16 day per satellite) coverage of these satellites opened new areas of remote sensing research: large area crop inventories, crop yields, land cover mapping, area frame stratification, and small area crop cover estimation. Research from 1972 to 1978 led to the creation of an operational procedure for large area crop acreage estimation. A regression estimator was developed which related the ground-gathered area frame data to the computer classification of Landsat MSS images. The basic regression approach used to produce State estimates does not produce reliable county estimates. Domain indirect regression estimators were developed for this purpose. In the 1978 crop season, corn and soybean acreage State and county estimates based on remotely sensed data were produced for Iowa. One to two States were added to the project through 1984. For the 1984-1987 crop seasons, this project covered an eight-State area in the central United States and produced regression estimates of corn, winter wheat, soybeans, rice, and cotton acreages. These regression estimates were combined with other survey indications and administrative data to provide final published county estimates. Estimation based on data from Landsat MSS sensors was discontinued in 1988 in order to implement the increased capabilities of higher resolution sensors. France entered the field of resources satellites in 1986 with the launch of SPOT I, which carries an improved multi-spectral scanner. This scanner images an even smaller area than the TM sensor but only measures three energy bands. Several NASS research projects compared the SPOT MSS and Landsat TM sensors with respect to crop estimation. 
This research led to the selection of Landsat TM as the preferred sensor for crop area estimation based on its superior spectral characteristics. The spatial characteristics of the SPOT MSS sensor provide a benefit only in areas with mostly small fields. Regression estimation of crop acreages for large and small areas based on computer classification was reinstated in 1991 with the Delta Remote Sensing Project using Landsat Thematic Mapper data imaged over the Mississippi Delta region, which is a major rice and cotton area. Results from the operational eight-State program in 1987 and from sensor comparison experiments showed that the regression approach was most effective for rice and cotton estimation. State and county estimates of rice, cotton, and soybean acreages were produced for Arkansas and Mississippi in 1991, with Louisiana added in 1992. The project only covers Arkansas in 1993 due to budgetary constraints. Three domain indirect regression estimators have been used or considered for producing small area county estimates using ancillary satellite data. From 1976 to 1982, the Huddleston-Ray estimator was used (Appendix B). In 1978, the Cardenas family of estimators was considered but not implemented (Appendix C). Beginning in 1982, the Battese-Fuller family of estimators was used for calculating county crop acreage estimates using Landsat MSS data. Since 1991, the Battese-Fuller model has been used to produce county estimates with Landsat TM data. Currently, this is the preferred model. However, non-regression estimation procedures based on total pixel counts are being evaluated. 6.2 Program Description, Policies, and Practices The basic element of Landsat spectral data is the set of measurements taken by a sensor of a square area on the earth's surface. The sensor measures the amount of radiant energy reflected from the surface in several bands of the electromagnetic spectrum. 
The individual imaged areas, known as pixels, are arrayed along east-west rows within the 185 kilometer wide north-to-south pass (swath) of the satellite. For purposes of easy data storage, the data within a swath are subdivided into overlapping square blocks, called scenes. The two satellites currently in operation (Landsats IV and V) image a given point on the earth's surface once every 16 days. The MSS sensor, formerly used for crop area estimation, contained four spectral bands with 80 meter spatial resolution. The more advanced TM sensor has seven bands (three visible and four infrared) with 30 meter resolution.

Several Landsat scenes may be required to cover an entire region of interest within a given State. It is not always possible to have the same image date for all such scenes due to schedule, cloud cover, and image quality factors. Consequently, analysis districts are created. An analysis district is a collection of counties or parts of counties contained in one or more Landsat scenes that have the same image date, or in areas for which usable Landsat data are not available to the analyst. To obtain State level crop acreage estimates, NASS sums all analysis district level estimates within the State. County level estimates are obtained using domain indirect regression and synthetic estimation methods, to be discussed later.

The area sampling frame for each State is stratified based on land use such as percentage cultivation, forest, and rangeland. NASS uses the regression estimator described by Cochran (1977, pp. 189-204) to compute crop acreage estimates for each land use stratum within an analysis district that has satellite coverage for an adequate number of JAS segments. These regression estimates are more precise than the direct expansion estimates obtained from JAS data alone. A detailed description of the procedure involved is provided by Allen (1990). Briefly, the steps required are as follows:

1. A graphics oriented registration process associates Landsat pixels with JAS sampled segments.

2. JAS data for sampled segments are used to label each pixel within the segments to a crop or other cover type.

3. Labelled pixels are clustered based on their Landsat data values to develop discriminant functions (signatures) for each cover type.

4. The discriminant functions are used to classify each pixel within the sampled segments to a cover type.

5. The segment level classification results are used to develop regression relationships for each crop between the ground and satellite data within each land use stratum. For each stratum, the independent (regressor) variable is the number of pixels classified to that crop per segment, and the dependent variable is the JAS segment reported crop acreage.

6. All pixels within the analysis district are classified, using the discriminant functions developed in Step 3.

7. For each stratum, the mean number of pixels per segment classified for a given crop over all segments in the population is substituted into the corresponding regression equation to obtain the stratum level mean crop acreage per segment. This mean is multiplied by the known total number of segments in the stratum to obtain the stratum level crop acreage estimate.

8. The stratum level estimates are summed to obtain the analysis district level crop acreage estimate for the portion of the analysis district covered by satellite data.

For land use strata lacking satellite coverage of an adequate number of JAS segments to develop the regression relationship, the direct expansion of JAS data is used to obtain estimates. These stratum level JAS estimates are also summed to obtain analysis district estimates for each crop representing the area not covered by satellite data. The total analysis district estimate for a particular crop is then:

[Equation graphic not reproduced.]
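Steps 5 through 8 and the direct expansion fallback can be sketched in a few lines of Python. This is a minimal illustration, not the NASS production procedure: it assumes a simple linear regression of reported acres on classified pixels within each stratum, and the function names, argument names, and numbers are hypothetical.

```python
from statistics import mean


def regression_stratum_total(n_segments, mean_pixels_all, sample_pixels, sample_acres):
    """Regression estimate of total crop acres in one land use stratum.

    n_segments      -- known number of segments in the stratum (population)
    mean_pixels_all -- mean classified pixels per segment over ALL segments
    sample_pixels   -- classified pixels for each sampled JAS segment
    sample_acres    -- JAS reported crop acres for each sampled segment
    """
    x_bar = mean(sample_pixels)
    y_bar = mean(sample_acres)
    sxx = sum((x - x_bar) ** 2 for x in sample_pixels)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(sample_pixels, sample_acres))
    slope = sxy / sxx  # least squares slope from the sampled segments
    # Step 7: substitute the population pixel mean into the regression line.
    mean_acres = y_bar + slope * (mean_pixels_all - x_bar)
    return n_segments * mean_acres


def direct_expansion_total(n_segments, sample_acres):
    """Direct expansion for a stratum without adequate satellite coverage."""
    return n_segments * mean(sample_acres)


def analysis_district_total(covered, uncovered):
    """Step 8 plus the uncovered strata: sum both kinds of stratum totals."""
    return (sum(regression_stratum_total(**s) for s in covered)
            + sum(direct_expansion_total(**s) for s in uncovered))


# Hypothetical inputs: one stratum with satellite coverage, one without.
covered = [dict(n_segments=10, mean_pixels_all=250.0,
                sample_pixels=[100, 200, 300], sample_acres=[25, 45, 65])]
uncovered = [dict(n_segments=4, sample_acres=[10, 20])]
district_total = analysis_district_total(covered, uncovered)
```

The regression strata gain precision from the census of classified pixels, while the uncovered strata rely on the JAS sample alone; the district total is simply the sum of the two parts.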
In many States, counties typically contain fewer than five sampled JAS segments, and may contain no segments at all. This fact makes it generally infeasible to define analysis districts to be individual counties and then use the above procedure to obtain county level estimates. Instead, the Huddleston-Ray, Cardenas, and Battese-Fuller domain indirect regression estimators have been developed and investigated for providing county estimates of crop acreage. The Battese-Fuller approach is currently favored by NASS, and is described in detail in Section 6.3. The NASS County Estimates system, described in Chapter 7, is designed to accept the Battese-Fuller values as a separate set of county crop acreage estimates. Within this system, the Battese-Fuller county estimates are first scaled to be additive to the official NASS State estimate for each commodity. The scaled Battese-Fuller values are then composited with scaled values from other NASS surveys and administrative data sources. Thus the Battese-Fuller estimates serve as an additional input to the County Estimates system in States where they are available. Currently, the composite weights are subjectively set by the statisticians in the State office to provide satisfactory and reliable estimates. Each NASS State Statistical Office (SSO) prepares their own annual publication of the final county estimates. Although sampling variances are calculated for the Battese-Fuller estimates, no variances or error information are published for the final county estimates. Mean squared error information is only published for major agricultural items at the U.S. level. 6.3 Estimator Documentation The Battese-Fuller family of estimators was first developed in the general framework of linear models with nested error structure (Fuller and Battese 1973), and later applied to the special case of county crop area estimation (Battese, Harter, and Fuller 1988). 
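Because the estimator formulas appear only as graphics in this text, the following Python sketch illustrates the nested error idea as it is commonly written following Battese, Harter, and Fuller (1988). It assumes a segment level model y = b0 + b1 x + v_c + e with a county effect v_c, and treats the regression coefficients and variance components as already estimated. All function and parameter names are hypothetical, and the weight `delta` is an assumed parameterization of the family, not necessarily the notation used by NASS.

```python
from statistics import mean


def shrinkage_weight(sigma_v2, sigma_e2, n_c):
    """Weight gamma_c = sigma_v^2 / (sigma_v^2 + sigma_e^2 / n_c).

    sigma_v2 -- variance of the county effect (assumed known here)
    sigma_e2 -- variance of the segment level error (assumed known here)
    n_c      -- number of sampled segments in the county
    """
    if n_c == 0:
        return 0.0  # no sample in the county: no residual to borrow from
    return sigma_v2 / (sigma_v2 + sigma_e2 / n_c)


def county_mean_acres(beta0, beta1, county_mean_pixels,
                      sample_pixels, sample_acres, delta):
    """Predicted mean crop acres per segment for one county.

    delta = 0 gives the purely synthetic regression prediction,
    delta = 1 adds the full mean county residual, and
    delta = shrinkage_weight(...) is the usual minimum MSE choice
    under the nested error model.
    """
    synthetic = beta0 + beta1 * county_mean_pixels
    if not sample_acres:
        return synthetic
    residual = mean(sample_acres) - (beta0 + beta1 * mean(sample_pixels))
    return synthetic + delta * residual


# Hypothetical county with two sampled segments.
gamma = shrinkage_weight(sigma_v2=4.0, sigma_e2=8.0, n_c=2)
estimate = county_mean_acres(5.0, 0.2, 250.0, [200, 300], [50, 70], gamma)
```

The weight moves toward 1 as the county sample grows or the county effect variance dominates, so counties with more segments lean more on their own residuals and less on the analysis district regression.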
The method has been used for all Landsat county estimation done by NASS since 1982. Similar to the State level estimation, land use strata are separated into those that have adequate satellite coverage and those that do not. The Battese-Fuller model can be applied within an analysis district for all strata where classification and regression have been performed. The analyst computes stratum level Battese-Fuller acreage estimates for all counties and subcounties within the boundaries of each analysis district. For land use strata where regression cannot be done due to lack of adequate satellite coverage or too few segments, a domain indirect synthetic estimator is used to obtain county estimates.

[Graphics not reproduced.]

[...] was used within stratum A for the parts of counties outside the scene, and in stratum B for all nine counties. Table 2 gives the computed county estimates by stratum and estimation method. Table 3 contains the official county estimates issued by the Iowa Agricultural Statistics Service. These published estimates are based on additional survey and administrative data (see Chapter 7), and are considered as the standard for evaluating the Battese-Fuller model values. The tables show that the computed county estimates for corn were more efficient overall than those for soybeans. For eight of the nine counties, the C.V. for corn was less than 4 percent. No county had a C.V. of less than 4 percent for soybeans. The percent difference ranged from 0.2 to 9.2 for corn, and from 0.8 to 17.8 for soybeans.

Table 2: Iowa 1988 County Estimates of Crop Acreage by Stratum and Estimation Method

            Stratum A        Stratum A     Stratum B
County      Battese-Fuller   Synthetic     Synthetic     Total         C.V.
            acres (000)      acres (000)   acres (000)   acres (000)   percent

Corn
Audubon          91.9             -            0.3           92.2         3.5
Calhoun         130.3            2.6           0.4          133.2         2.9
Carroll         140.7             -            0.7          141.4         3.2
Crawford        128.4           23.4           0.9          152.7         3.1
Greene          129.6             -            0.4          130.0         3.0
Guthrie         105.7             -            0.6          106.3         4.9
Ida              43.4           63.2           0.4          107.0         3.7
Sac             137.5             -            0.8          138.3         2.9
Shelby          140.2             -            0.5          140.7         2.9
Total          1047.7           89.2           5.0         1141.8

Soybeans
Audubon          69.8             -            0.1           69.9         6.6
Calhoun         143.2            1.7           0.1          145.0         4.0
Carroll         106.6             -            0.1          106.7         9.0
Crawford         91.3           15.5           0.2          106.9         5.4
Greene          117.4             -            0.1          117.5         4.6
Guthrie          64.3             -            0.1           64.4        10.9
Ida              34.6           41.7           0.1           76.4         6.9
Sac             112.8             -            0.1          112.9         4.9
Shelby           80.9             -            0.1           81.0         7.4
Total           820.9           58.8           1.0          880.7

6.5 Evaluation Practices

NASS first began to address the problem of applying satellite data to small area estimation in the mid 1970's. Huddleston and Ray (1976) proposed that within each stratum, the mean pixels per segment calculated by classifying all segments within an entire analysis district be replaced by the mean pixels per segment computed by classifying all segments within a given county. This county pixel mean is substituted into the corresponding stratum regression equation for the crop of interest. Amis, Martin, McGuire, and Shen (1982) describe the Huddleston-Ray estimator as an analysis district regression estimator applied to a subarea of the analysis district. The regression coefficients are estimated from sampled segments located throughout the analysis district, while the mean being estimated is from a subpopulation of the analysis district.

The Huddleston-Ray estimator is simple and intuitively appealing, but Walker and Sigman (1982) point out two major drawbacks. First, it is unclear how to accurately compute the variance of the estimator. Second, the estimator lumps together a term attributable to sampling error within a given county and another term that measures the inherent distinction between a county and the analysis district. Amis et al.
(1982) empirically demonstrate that the Huddleston-Ray method can generate biased estimates and that the variance estimation formula can overestimate the variability for a given county. The mathematical formulas for the Huddleston-Ray estimator and its variance estimator are provided in Appendix B.

The problems with the Huddleston-Ray estimator documented by Walker and Sigman (1982) and by Amis et al. (1982) were recognized soon after its development and prompted Cardenas, Blanchard, and Craig (1978) to devise a different type of estimator. The Cardenas family of estimators has three forms, each of which uses auxiliary Landsat data through a regression type estimator. However, the versions use different methods of estimating the slope term. The three forms are the ratio estimator, the separate regression estimator, and the combined regression estimator. (Appendix C gives the mathematical formulation for the Cardenas family of estimators.) As with the Huddleston-Ray method, within each stratum the Cardenas method compares the analysis district level mean pixels per segment classified to a crop to the corresponding county level mean for that crop. However, the Cardenas method uses all segments in the analysis district to calculate the analysis district mean, whereas the Huddleston-Ray approach uses only sample segments. The estimate of average crop area per segment is adjusted by an amount proportional to this difference between the county and analysis district means. Amis et al. (1982) examined the ratio and separate regression Cardenas estimators, and compared them with the Huddleston-Ray estimator.

Cardenas et al. (1978) stated that none of the estimators they presented were shown to be "best" in any sense, nor did they demonstrate any optimum properties. They did show that each of these estimators, when summed over counties, provides an unbiased stratum level estimate for the State.
Also, assuming that the within county variance is the same for all counties, the method enables unbiased estimation of the State-wide variance. Amis et al. (1982) emphasized that an unbiased estimate of the county mean crop area per segment may not be possible when there are few sample segments in a county. Whenever there are significant differences in county variances, the Cardenas estimators appear to have higher variances than the Huddleston-Ray estimator. Amis et al. (1982) concluded that there appears to be no difference between the Cardenas ratio estimator and the separate regression estimator, and that the Cardenas estimators do not perform better than the Huddleston-Ray estimator. Both Cardenas estimators studied appeared to be biased, with larger variances than the Huddleston-Ray estimator. The Cardenas method was never used in an operational remote sensing program since it did not provide sufficient improvement over the Huddleston-Ray estimator. The Huddleston-Ray estimator was used to generate county estimates for use by the NASS State Statistical Offices (SSO's) until 1982. At that time, Walker and Sigman (1982) advised that calculation of county estimates using the Huddleston-Ray method be discontinued, and that the Battese-Fuller method be used instead. Walker and Sigman (1982) studied the Battese-Fuller model using Landsat MSS data over a six county region in eastern South Dakota. They found a modest lack of fit of the model, with larger model departure corresponding to low correlation between classified pixel counts and ground survey observations. A key feature of the Battese-Fuller model is the county effect parameter and this effect was found to be highly significant for corn, the most prevalent of the four crops considered in the study. Furthermore, this effect manifested itself within several strata but was negligible across strata. 
The study nonetheless indicated robustness of the Battese-Fuller estimators against departure from certain model assumptions. Two members of the Battese-Fuller family satisfied the criterion for small relative root mean square error; i.e., less than 20 percent of the estimate was attributable to root mean square error. These members were the estimators that minimized mean square error and bias, respectively, under the model assumptions. However, the Battese-Fuller estimate closest to the Huddleston-Ray estimate was far less satisfactory, failing to meet the desired upper limits for mean square error and bias. This study provided the justification for replacing the Huddleston-Ray estimator with the Battese-Fuller family.

The reliability of the county estimates based on the Battese-Fuller model has been closely watched since its implementation in 1982. As mentioned previously, these estimates are only one of possibly four or more indications that are composited to provide the final published crop acreage values.

Table 4: County Estimates for Mississippi 1991

County          Official      Computed      % Diff*    CV
                acres (000)   acres (000)   percent    percent

Cotton
Bolivar            65.5          61.6          6.0        9.9
Coahoma           105.7          88.3         16.5        4.8
Humphreys          61.6          57.3          7.0        5.9
Issaquena          38.0          34.6          9.0       11.3
Leflore            79.2          87.8         10.9        4.0
Quitman            31.0          46.4         49.7        8.6
Sharkey            47.0          48.6          3.4        7.0
Sunflower         100.0          79.3         20.7        6.9
Tallahatchie       64.2          67.9          5.8        7.2
Tunica             45.6          38.0         16.7        6.6
Washington         95.7         102.4          7.0        3.9
Yazoo              94.5          93.9          0.6        8.0
Total             828.0         806.1

Rice
Bolivar            74.0          66.2         10.5        5.4
Coahoma            15.8          10.4         34.2       24.0
Humphreys           3.6           7.1         97.2       32.4
Leflore            16.6          19.4         16.9       18.6
Sharkey             5.0           7.8         56.0       21.8
Sunflower          36.0          37.8          5.0        9.3
Tallahatchie        9.6           8.5         11.5       35.3
Tunica             17.5           9.9         43.4       26.3
Washington         30.5          22.6         25.9       15.5
Total             208.6         189.7

* Definition supplied as a graphic in the source; not reproduced.
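The comparison measures used with Table 4 can be reproduced with a short Python sketch. The definition of the percent difference is given only as a graphic, so the formula below (absolute difference divided by the official value) is an assumption that happens to agree with the tabulated cotton values; the variable and function names are illustrative.

```python
from math import sqrt

# Cotton rows of Table 4: county -> (official, computed), in thousands of acres.
cotton = {
    "Bolivar": (65.5, 61.6), "Coahoma": (105.7, 88.3),
    "Humphreys": (61.6, 57.3), "Issaquena": (38.0, 34.6),
    "Leflore": (79.2, 87.8), "Quitman": (31.0, 46.4),
    "Sharkey": (47.0, 48.6), "Sunflower": (100.0, 79.3),
    "Tallahatchie": (64.2, 67.9), "Tunica": (45.6, 38.0),
    "Washington": (95.7, 102.4), "Yazoo": (94.5, 93.9),
}


def percent_difference(official, computed):
    """Assumed definition: absolute difference relative to the official value."""
    return 100.0 * abs(computed - official) / official


def summary_statistics(rows):
    """Return (MD, RMSD, MAD, LAD) of computed minus official values."""
    diffs = [computed - official for official, computed in rows.values()]
    n = len(diffs)
    md = sum(diffs) / n                         # mean difference
    rmsd = sqrt(sum(d * d for d in diffs) / n)  # root mean squared deviation
    mad = sum(abs(d) for d in diffs) / n        # mean absolute difference
    lad = max(abs(d) for d in diffs)            # largest absolute difference
    return md, rmsd, mad, lad


md, rmsd, mad, lad = summary_statistics(cotton)
```

Rounded to one decimal place, these four quantities agree with the Mississippi 1991 cotton row that the report gives in Table 5.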
The reliability of the Battese-Fuller estimates can vary between years, between crops, between counties and between States depending on the stage of the crop at the time of the Landsat imagery, the amount of crop acreage within the county, the number of segments within the county and cloud cover. The results presented in Tables 2 and 3 for corn are relatively good, with all CVs less than 5 percent and over half of the percentage differences from the published value less than 4 percent. The soybean results are slightly poorer, with CVs ranging from 4 percent to 11 percent and percentage differences ranging up to 18 percent. Table 4 presents more recent results for a set of counties covered by Landsat in Mississippi for 1991. A review of the CVs and percentage differences indicates that the Battese-Fuller estimates can have relatively large CVs and percentage differences when the county crop acreage is less than 30,000 acres. Some summary statistics of the differences for the four crop examples discussed are presented in Table 5. The mean absolute difference is typically less than 10,000 acres, but for small county acreages such as rice in Mississippi, large percentage differences may still occur. Consequently, NASS SSO's still use additional survey and administrative data to help set the published values.

Table 5: Summary Statistics on Accuracy of Battese-Fuller Estimates (1000 acres)

Crop/State/Year               MD*    RMSD*    MAD*    LAD*
Corn, Iowa 1988              -0.6     6.8     5.4     14.3
Soybeans, Iowa 1988          -8.6    11.9     9.1     25.5
Cotton, Mississippi 1991     -1.8    10.0     7.8     20.7
Rice, Mississippi 1991       -2.1     5.2     4.5      7.9

* MD = mean difference between Battese-Fuller and published value
  RMSD = root mean squared deviation
  MAD = mean absolute difference
  LAD = largest absolute difference

6.6 Current Problems and Activities

As technology improves, new sensors produce satellite data that can be more accurately classified to a given crop than ever before.
Consequently, the overall count of pixels classified to a given crop within a county can possibly be used directly to estimate crop acreage. The overall pixel count represents a census of pixels covering the county and therefore is not subject to sampling error. However, a nonsampling error is introduced due to inaccuracies in the classification. A general expression for such an estimator is: Click HERE for graphic. Click HERE for graphic. Both adjustment terms are conceptually simple. The combined ratio uses stratum level survey information to compute the adjustment term that may provide a more accurate conversion of pixel counts to crop area than the set conversion factor. Also, the ratio has a readily available formula for estimating the variance. Research continues to focus on identifying new geographic areas and crops where this estimator would be applicable. Also, possible benefits of remotely sensed data from alternative sources, such as radar satellites, will be investigated as the newer sources are available. In recent years TM sensor data have been used to produce county estimates in the Delta region. County estimates of rice, cotton, and soybeans were produced for Arkansas and Mississippi, in 1991, with Louisiana added in 1992. In 1993 satellite data are only being used in Arkansas due to budgetary constraints. To date, the satellite based estimates have only been produced on a limited scale. The NASS SSO's continue to rely on other data series for helping set the published county estimates of crop acreages. They conduct a large non-probability county estimates survey (see Chapter 7) that serves a dual purpose of also providing updated control data for the list sampling frame. This is an integral part of the NASS survey program and so will continue in some form for the foreseeable future. Fairly reliable administrative data sources are also available. 
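The pixel count estimator discussed earlier in this section can be sketched in Python as a county pixel count multiplied by an adjustment term. The general expression is available only as a graphic, so this is an assumed form: one adjustment is a set conversion factor (the ground area of one pixel), and the other is a combined ratio of survey acres to classified pixels computed across strata. The 30 meter pixel size comes from the TM description in Section 6.2; the function names and stratum inputs are hypothetical.

```python
SQ_METERS_PER_ACRE = 4046.8564224


def fixed_conversion_factor(pixel_side_m=30.0):
    """Acres represented by one square pixel of the given side length."""
    return pixel_side_m ** 2 / SQ_METERS_PER_ACRE


def combined_ratio(strata):
    """Combined ratio of reported acres to classified pixels.

    Each stratum dict holds the number of segments in the stratum and the
    sample means, per segment, of reported acres and classified pixels.
    """
    acres = sum(s["n_segments"] * s["mean_acres"] for s in strata)
    pixels = sum(s["n_segments"] * s["mean_pixels"] for s in strata)
    return acres / pixels


def pixel_count_estimate(county_pixels, acres_per_pixel):
    """County crop acreage from a census of classified pixels."""
    return county_pixels * acres_per_pixel


# Hypothetical survey summaries for two land use strata.
strata = [
    {"n_segments": 10, "mean_acres": 50.0, "mean_pixels": 200.0},
    {"n_segments": 20, "mean_acres": 20.0, "mean_pixels": 100.0},
]
by_fixed_factor = pixel_count_estimate(100_000, fixed_conversion_factor())
by_combined_ratio = pixel_count_estimate(100_000, combined_ratio(strata))
```

The fixed factor ignores classification error entirely, while the survey based ratio absorbs systematic over- or under-classification at the cost of some sampling variability in the adjustment term.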
NASS is continuing to investigate the benefits of satellite based county estimates in relation to these other available data sources. One by-product of the satellite data process that is attractive to the State offices is color coded land use maps at the county level. These maps provide a pictorial view of the distribution of the crops within each county. Identifying alternative uses of satellite data such as this is an important research objective of NASS.

REFERENCES

Allen, J.D. (1990), "A Look at the Remote Sensing Applications Program of the National Agricultural Statistics Service," Journal of Official Statistics, 6, pp. 393-409.

Amis, M.L., Martin, M.V., McGuire, W.G., and Shen, S.S. (1982), "Evaluation of Small Area Crop Estimation Techniques Using Landsat and Ground-Derived Data," LEMSCO-17597, Houston, TX: Lockheed Engineering and Management Services Company, Inc.

Angelici, G., Slye, R., Ozga, M., and Ritter, P. (1986), "PEDITOR - A Portable Image Processing System," Proceedings of the IGARSS '86 Symposium, Zurich, Switzerland, pp. 265-269.

Battese, G.E., Harter, R.M., and Fuller, W.A. (1988), "An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data," Journal of the American Statistical Association, 83, pp. 28-36.

Bellow, M.E. (1991), "Comparison of Sensors for Corn and Soybean Planted Area Estimation," NASS Staff Report No. SRB-91-02, U.S. Department of Agriculture.

Bellow, M.E. and Graham, M.L. (1992), "Improved Crop Area Estimation in the Mississippi Delta Region using Landsat TM Data," Proceedings of the ASPRS/ACSM Convention, Washington, D.C., pp. 423-432.

Cardenas, M., Blanchard, M.M., and Craig, M.E. (1978), "On The Development of Small Area Estimators Using LANDSAT Data as Auxiliary Information," Economics, Statistics, and Cooperatives Service, U.S. Department of Agriculture.

Cochran, W.G. (1977), Sampling Techniques, New York, N.Y.: John Wiley & Sons.

Fuller, W.A. and Battese, G.E. (1973), "Transformations for Estimation of Linear Models with Nested-Error Structure," Journal of the American Statistical Association, 68, pp. 626-632.

Huddleston, H.F. and Ray, R. (1976), "A New Approach to Small Area Crop Acreage Estimation," Proceedings of the Annual Meeting of the American Agricultural Economics Association, State College, PA.

Ozga, M. (1985), "USDA/SRS Software of Landsat MSS-Based Crop Acreage Estimation," Proceedings of the IGARSS '85 Symposium, Amherst, MA, pp. 762-772.

Prasad, N.G.N. and Rao, J.N.K. (1990), "The Estimation of the Mean Squared Error of Small-Area Estimates," Journal of the American Statistical Association, 85, pp. 163-171.

Walker, G. and Sigman, R. (1982), "The Use of LANDSAT for County Estimates of Crop Areas - Evaluation of the Huddleston-Ray and Battese-Fuller Estimators," SRS Staff Report No. AGES 820909, U.S. Department of Agriculture.

Appendix A: Estimators of Battese-Fuller Variance Components

[Equation graphics not reproduced.]

Appendix B: Huddleston-Ray Estimator

The Huddleston-Ray estimator replaces the classified pixel average for the analysis district with the classified pixel average for a county when estimating the county mean crop area per frame unit. Within the analysis district, the overall mean crop area in regression stratum h is estimated by:

[Equation graphics not reproduced.]

Appendix C: Cardenas Family of Estimators

The Cardenas family of estimators uses the stratum level differences between mean number of pixels classified to the crop of interest in the county and the analysis district, respectively, to adjust the mean reported crop area per sample segment. Within a regression stratum h, the estimate of mean crop area per segment for a county c is:

[Equation graphics not reproduced.]

CHAPTER 7
The National Agricultural Statistics Service County Estimates Program
William C.
Iwig National Agricultural Statistics Service 7.1 Introduction and Program History The National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA) publishes over 300 reports annually regarding the Nation's crop acreage, crop production, livestock inventory, commodity prices, and farm expenses. The primary source of this information is surveys of U.S. farmers, ranchers, and agribusinesses who voluntarily provide information on a confidential basis. These surveys are normally designed to provide State and U.S. level indications of agricultural commodities. There is also a need for county level estimates to assist farmers, ranchers, agribusinesses, and government agencies in local agricultural decision making. NASS has published annual county estimates for over 70 years through funding provided by cooperative agreements with State departments of agriculture and agricultural universities, and directly from other USDA agencies. The earliest known record of published county estimates is by the Wisconsin State Board of Agriculture, which issued county estimates on acreage and production of crops for 1911 and 1912 along with the number and value of livestock for 1912. Not until 1917, following the signing of the first Federal-State cooperative agreement, did the USDA assist in the preparation and publication of the Wisconsin county estimates. The cooperative agreement helped eliminate duplication of efforts between Federal and State statisticians, making possible more service for less cost. The cooperative work grew rapidly after 1917 as other State departments of agriculture and State agricultural universities established cooperative agreements with the USDA. State governments needed county level information and their funding made possible the publication of county level estimates by USDA. The New Deal Farm Programs of President Franklin D. 
Roosevelt's Administration used county estimates of agricultural commodities extensively and refocused USDA's attention on these estimates. In May 1933, the Agricultural Adjustment Act was passed and the Agricultural Adjustment Administration (AAA) was soon in place. This agency had the task of reducing supply in order to improve prices of agricultural commodities. These programs greatly increased demands on NASS for county estimates of commodities used by the AAA to set county quotas and program pay-outs for surplus items.

In more recent years, the Federal Crop Insurance Corporation (FCIC) and the Agricultural Stabilization and Conservation Service (ASCS) of the USDA have used NASS county estimates to administer their programs and they provide funding to NASS for that purpose. Their programs involve payments to farmers if crop yields are below certain levels. Both agencies have chosen to use the NASS county estimates, when available, as the basis for determining these payments.

The estimation approach has remained relatively unchanged over the years. The basic process for estimating totals such as crop acreage and livestock inventory initially involves scaling various survey estimates and other available administrative data at the county level to be additive to the official USDA State level estimate. These scaled estimates are composited together, usually with the previous year estimate, to provide the actual county estimate for the current year. This scaling and compositing process tends to strengthen the final estimate over a direct design based expansion. These estimates are checked against any available administrative data that are reliable indicators of minimum levels and modifications are made if necessary.

Program changes that have been made since 1917 involve data processing advances, which allow more data to be used, and larger sampling frames and more sophisticated sample selection techniques, which provide better coverage of the farm population.
Also, advances have been made to improve the quality of the State level estimates, which indirectly benefit the quality of the published county estimates through the scaling process. In the late 1950's, methodology was developed to conduct probability area frame surveys, where random segments of land would be selected for enumeration. In the 1960's these surveys became operational, which provided, for the first time, probability survey indications of crop acreage and livestock inventories on a State level basis. During this time frame, the State reporter lists were also increasing in size and improving in quality. With improved data processing capabilities in the 1970's, probability Multiple Frame (MF) Surveys were implemented at the U.S. and State levels, which combined the use of list and area sampling frames. Also, some States have conducted probability or quasi-probability MF County Estimates surveys (North Carolina Ag Statistics Service 1986).

States have traditionally shown a large degree of autonomy in designing and conducting their county estimates surveys. This has been due, in large part, to funding from the State cooperator, the quality of different data sources and different computing capabilities in each State. Recently, a NASS task force developed a County Estimates system for sample selection and summarization that provides a general framework, but still allows considerable flexibility to each State in their sample selection and summarization procedures (Bass et al. 1989). This system is now the standard being used by NASS State offices for their county estimates program.

7.2 Program Description, Policies, and Practices

The NASS County Estimate Program is really 45 different programs conducted separately by each NASS State Statistical Office (SSO). There is some general structure provided by the 1989 County Estimates Task Group, but still each State has considerable flexibility in the implementation of the procedures.
The quality of the county estimates is to some degree related to the amount of financial support being provided by the State cooperator, which is usually the State Department of Agriculture.

The Census of Agriculture, conducted by the Bureau of the Census, has always served as a benchmark for the USDA crop and livestock estimates, and especially for county estimates. The annual State Farm Census, funded by the State cooperator, was also an important benchmark for the county estimates in many States until the late 1970's. Since then it has been discontinued in most States due to lack of funding. The Census of Agriculture has been conducted every five years since 1920 (on a 4 year schedule from 1974 to 1982), providing county, district, State, and U.S. level estimates of most agricultural commodities. Since 1982, the Census has been conducted to coincide with the economic censuses (business, industry, etc.) in years ending in 2 and 7. Census county level estimates are closely watched since the USDA estimates are often based on very few survey returns. At the same time, the quality of the Census numbers is also closely evaluated. The completeness of the Census varies from State to State, county to county, and item to item. Consequently, the Census values are interpreted differently. After the Census values are published, NASS statisticians review their estimates and make revisions as necessary.

Another major component of the county estimate program has been the official USDA State level estimate. Preliminary survey estimates and administrative data are scaled to be additive to the official State total. State estimates are based on more data than each individual county estimate and, in recent years, have been based on probability survey indications. Consequently, the State estimates have always been considered more reliable than any individual county estimate.
In addition to being more reliable, State level estimates are usually already published before county estimates are published. For these reasons, county level indications have always been scaled to the State level estimates rather than the State level estimate being the sum of independently derived county estimates.

Over the years, the county estimate surveys have developed into a major source of information for list frame maintenance and updating. Farm operations that have not been contacted within a prescribed time frame can be targeted for sampling for the annual county estimates survey. Currently, NASS has a stated policy that all control data on the list sampling frame (LSF) should be less than five years old (USDA 1991, Policy and Standards Memorandum 14-91). Control data refers to the historic survey data values or data values from external sources that are stored on the LSF and used for stratification and sample design purposes.

Another policy that is followed in all States is the suppression of any county estimate that would disclose the data of any individual operation, as specified in Policy and Standards Memorandum 12-89 (USDA 1989). This policy preserves the confidentiality of all reports, which is a foundation of voluntary reporting to NASS. Estimates cannot be published if either: (1) the estimate is based on information from fewer than three respondents, or (2) the data for one respondent represents more than 60 percent of the estimate. Exceptions to this rule are only granted when written and signed permission is given by the respondent. Suppressed estimates may be combined with another county as long as the confidential data are not disclosed.

In most States, county estimates are made for all major crop and livestock categories. This may cover 50 - 100 separate commodity items. Estimates for crop items usually include planted acres, harvested acres, yield, production, and value of production for a particular crop year.
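The two suppression conditions stated above (fewer than three respondents, or one respondent accounting for more than 60 percent of the estimate) can be illustrated with a short check. This is only a sketch under simplifying assumptions: the function name and arguments are hypothetical, and each respondent's contribution is assumed to be known, with the estimate taken as the sum of those contributions. It does not model the written-permission exception or the combining of counties.

```python
def must_suppress(contributions, min_respondents=3, max_share=0.60):
    """Return True if an estimate built from these respondent
    contributions would have to be suppressed.

    contributions: one non-negative number per respondent (hypothetical
    input; the estimate is assumed to be their sum).
    """
    # Condition (1): fewer than three respondents.
    if len(contributions) < min_respondents:
        return True
    total = sum(contributions)
    if total <= 0:
        # Nothing reported, so no single respondent dominates.
        return False
    # Condition (2): one respondent is more than 60 percent of the estimate.
    return max(contributions) / total > max_share


print(must_suppress([100, 200]))        # two respondents only
print(must_suppress([700, 200, 100]))   # largest share is 70 percent
print(must_suppress([400, 300, 300]))   # largest share is 40 percent
```

In this sketch the first two calls flag the estimate for suppression and the third does not.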
Some States also publish separate estimates for different cropping practices, such as irrigated and non-irrigated acreages. Livestock estimates include inventory numbers on a particular date, possibly marketings, and inventory value. Each SSO develops their own county estimate publication because they are State funded. These estimates have associated sampling and non-sampling errors. No variances or error information are published for the final county estimates. Mean squared error information is only published for major agricultural items at the U.S. level.

7.3 Estimator Documentation

The new NASS County Estimate System uses a combination of scaling and compositing techniques to provide a county level total estimate for any particular agricultural item. Separate estimates that may be composited together include the previous year official estimate, current year direct expansion and ratio estimates, and other available indications. In recent years, remotely sensed data from satellites have been used to generate county level estimates of crop acreages for selected crops where this technology has been applied (see Chapter 6). County estimates of a ratio such as crop yield, which is the ratio of total crop production to total harvested acres, are dependent on the final estimates of the two items involved.

Current year data are collected using primarily a mail survey in the fall of the year with some selected telephone follow-up. State sample sizes can range up to 40,000 with usable record counts around 200 for major items in major counties. However, county estimates for many commodities are based on fewer than 20 sample records.

A key feature of the system is the sample design which involves selecting sampling units from multiple overlapping stratified designs. A separate design is developed for each commodity of interest.
The system combines data collected from sampled operations from these different designs such that the selection probabilities are not used in calculating the survey estimates. Another key feature of the system is the coordination of survey contacts from the different designs to control respondent burden. A third feature is a synthetic scaling of the county estimates in order that they sum to the official U.S. Department of Agriculture State level estimates. A fourth feature is the compositing of the different estimates to provide final county level estimates. Further details on each of these features follow.

7.3.1 Commodity Specific Stratified Designs

The NASS County Estimate Program depends primarily on a large mail survey in the fall of the year with State level sample sizes ranging up to 40,000. Some States conduct two surveys, with an early fall survey covering acreage and production of small grains, which are usually harvested by September. The late fall survey then covers the fall harvested crops and livestock. The sample units are farm operations selected from the NASS list sampling frame in each State.

One of the major goals of the new system is to provide a framework that will ensure adequate representation for each agricultural item of interest. In order to provide adequate county level estimates, major farm operations for each item of interest must be represented appropriately in the sample. This is relatively easy for the major crops in a State since a sample design representing all known operations with cropland would represent any major crop adequately. However, in order to provide adequate representation for rare crop and livestock items, the strategy used in the new system is to develop separate stratified sample designs for each agricultural commodity as needed. The sample design strata for each commodity are based on the positive control data for that particular item. Control data are the historic data values stored on the list sampling frame.
Strata boundaries typically coincide with the categories used in the Census of Agriculture publications. Table 1 illustrates the stratified design that might be developed for barley in a particular State, covering all known operations that have positive control data for barley.

Table 1: Example Stratified Design for Barley
=============================================
             Population     Boundary
Stratum        Count        (acres)
_____________________________________________
  10           2,500        1 - 49
  20           1,000        50 - 99
  30             400        100 - 299
  40             100        300+
               _____
Total          4,000
_____________________________________________

The major function of the stratified design is to provide a framework to group similar size operations for summarization (see 7.3.3). Initial sampling may occur at the State level within each stratum. Alternatively, different sampling rates may be used at the county level in order to assure an adequate sample within each county. Different sampling rates by county would typically occur when the commodity frame contains only a few records in a particular county. It may be necessary to sample all records with "probability one" in that county, whereas a smaller sampling fraction is sufficient in other counties. This most frequently occurs with rare commodities. Another sampling option keys on whether the sampling unit reported in the previous year. If the current to previous year ratio is a primary indication for a State, units that reported in the previous year may be sampled heavily, and other records sampled at a lighter rate.

7.3.2 Coordination of Multiple Samples

The samples selected from the different commodity designs contain many overlapping records. A farming operation could easily be selected from multiple commodity designs. In addition, many of the selected operations may have already provided all or some of the requested information on another current year survey. These other survey data files are used as input to the County Estimate System.
The system is designed to identify which records already have provided the requested information and questionnaires are not sent to these operations. Even if an operation has only provided some of the needed data on previous crop specific or livestock specific surveys, it will typically not be recontacted, which helps control respondent burden. Data items not included on the previous surveys are treated as "missing" in the county estimates expansions. The system also identifies which records are duplicated in multiple designs and in multiple samples. Only one questionnaire is sent to each sampled unit. The same questionnaire, containing all items of interest, is used regardless of the commodity design (barley, corn, hogs, etc.) from which the record was selected.

There is usually some telephone follow-up to non-respondents as resources allow. Telephoning may be targeted to provide sufficient data for each commodity. Since a secondary objective of the county estimate survey is to update control data on the list sampling frame, some telephoning may be targeted at operators with missing control data or control data that are more than five years old.

7.3.3 Creation of Survey Indications

The County Estimates System is designed to provide direct expansion and ratio estimates based on sample data collected from the county estimates survey and from sample data collected from other current year surveys. As mentioned previously, the same questionnaire is used for all farm operations selected specifically for the county estimates survey, regardless of the originating commodity design. Consequently, a farm operation selected from the barley design will also be asked to provide data on all other crop and livestock items. All reported data from the county estimates survey and from other surveys are used in providing the survey indications. For each operation, the system identifies the assigned strata from all of the commodity designs.
Not all records will be included in each commodity design since not all records have positive control data for all commodities. Records that do not have an original design stratum for a commodity are assigned to "pseudo stratum 99" for summary. Then corn data are summarized in the corresponding stratum from the corn design for each operation and hog data are summarized in the corresponding hog stratum. Since data are used for a particular item from records that were not selected in the original sample design, the direct expansion and ratio estimates are not based on the selection probabilities. However, this approach probably doubles the number of positive data records available for most survey items compared to just using data records from the original commodity designs. The use of this additional data is a stabilizing factor in providing reliable county level estimates.

Survey estimates from the County Estimate System are provided at State, district, and county levels for each item. Districts are groups of geographically contiguous counties with relatively homogeneous agricultural practices and climate within each district. There are usually four to nine districts per State. The State and district estimates are used primarily in the scaling process described later. The county level survey estimates are the basis for the final published estimates, but they also go through a scaling and compositing process. Population counts and usable record counts are generated by the system at each level. The direct expansion estimate for a particular commodity at any level is represented as follows:

[Equation graphic not available.]

Table 2: Examples of Direct Expansion County, District, and State Estimates for Corn Planted Acres

[Table graphic not available.]

In addition to direct expansion estimates, ratio estimates of totals and ratio estimates of ratios are also created.
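The stratum assignment and direct expansion described above can be sketched as follows. The formula used by the system is not reproduced in this text, so the expansion below is an assumption: within each summary stratum the reported values are averaged over the usable records and multiplied by the stratum population count, and the stratum totals are added. The function names, the whole-acre lower bounds (taken from the barley example in Table 1), and the sample figures are hypothetical.

```python
from collections import defaultdict

PSEUDO_STRATUM = 99  # records with no design stratum for the commodity

# Lower stratum boundaries in acres, largest first (barley example, Table 1).
BARLEY_LOWER_BOUNDS = [(40, 300), (30, 100), (20, 50), (10, 1)]


def assign_stratum(control_acres, lower_bounds=BARLEY_LOWER_BOUNDS):
    """Return the design stratum for a record's control data.

    Records without positive control data (None, zero, or below the
    smallest boundary) are placed in pseudo stratum 99.
    """
    if control_acres is None or control_acres <= 0:
        return PSEUDO_STRATUM
    for stratum, lower in lower_bounds:
        if control_acres >= lower:
            return stratum
    return PSEUDO_STRATUM


def direct_expansion(records, population_counts):
    """Assumed expansion: sum over strata of N_h * (sum of y) / n_h.

    records: iterable of (stratum, reported_value) for usable reports.
    population_counts: dict mapping stratum to its population count N_h.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in records:
        totals[stratum] += value
        counts[stratum] += 1
    return sum(population_counts[h] * totals[h] / counts[h] for h in counts)


# Hypothetical usable reports: two from stratum 10, one from stratum 40.
reports = [(10, 20.0), (10, 40.0), (40, 500.0)]
estimate = direct_expansion(reports, {10: 100, 40: 5})
print(estimate)
```

Because every usable report is summarized in its own stratum regardless of which commodity design it was sampled from, this kind of expansion uses population and record counts rather than selection probabilities, which is consistent with the description in the text.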
For crop acreage items, possible ratio estimates are based on ratios of current year planted acres to previous year planted acres, harvested to planted acres, planted acres to total cropland acres, and irrigated acres to planted acres. The ratio estimates are generated from usable reports for both the numerator and denominator and are expressed as:

[Equation graphic not available.]

7.3.4 Scaling of Indications

The first step in the process is to scale the individual county and district "indications" to the official published USDA State level estimate. Typically, "indications" that are scaled include:

1) survey direct expansion estimate
2) survey ratio estimates
3) previous year estimate
4) other indications (remotely sensed acreage estimates, Census of Agriculture, other administrative data).

Initially, each district indication (direct expansion, ratio, administrative data) is scaled. Suppose there are "M" different indications. The scaling at the district level occurs as follows:

[Equation graphic not available.]

The resulting county level estimates for each of the "M" indications (direct expansion, ratio, administrative data) then sum to the district estimate. This scaling process serves as a weighting adjustment to account for any incompleteness in the various indications. As mentioned previously, the NASS list sampling frame typically provides about 80% coverage for major commodities. Administrative data values also have varying degrees of completeness.

7.3.5 Compositing of Scaled Estimates

The next step in the process is to composite together the various scaled estimates to provide satisfactory county and district level estimates. The composite estimates generated for each county and district are represented as follows:

[Equation graphic not available.]

Rounding rules are incorporated into this process so that the final estimates are the published values.
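The scaling and compositing steps can be illustrated with a small sketch. The exact formulas are not reproduced in this text, so two assumptions are made here: scaling is constant proportional scaling (each area is multiplied by the ratio of the target total to the sum of the indication), and the composite is a weighted average of the scaled indications. Function names, weights, and figures are hypothetical, and rounding rules are omitted.

```python
def scale_to_total(indication, target_total):
    """Constant proportional scaling: multiply each area by target / sum."""
    current = sum(indication.values())
    if current <= 0:
        raise ValueError("indication must have a positive total to be scaled")
    factor = target_total / current
    return {area: value * factor for area, value in indication.items()}


def composite(scaled_indications, weights):
    """Weighted average of several scaled indications, area by area."""
    if len(scaled_indications) != len(weights):
        raise ValueError("need one weight per indication")
    weight_sum = sum(weights)
    areas = scaled_indications[0].keys()
    return {
        area: sum(w * ind[area] for w, ind in zip(weights, scaled_indications))
        / weight_sum
        for area in areas
    }


STATE_TOTAL = 1000.0  # hypothetical official State estimate (acres)

# Two hypothetical indications for counties A and B.
survey = scale_to_total({"A": 300.0, "B": 500.0}, STATE_TOTAL)
previous = scale_to_total({"A": 450.0, "B": 450.0}, STATE_TOTAL)

final = composite([survey, previous], weights=[0.5, 0.5])
print(survey, previous, final)
```

The same scaling function could be applied first to district indications against the State total and then to county indications against each scaled district total. Because each scaled indication sums to the target and the weights are normalized, the composite in this sketch also sums to the target.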
These estimates are reviewed by statisticians in the State office for reasonableness based on their knowledge of the location and general size of the largest operations in the State for each commodity. The estimates must exceed minimum levels and not exceed maximum levels provided by reliable administrative data sources. For example, a State may check that the sum of major crop acreages does not exceed the Census of Agriculture total cropland acres for each county. If estimates are not reasonable, the data will be more closely examined for outliers and insufficient sample sizes. Different weights for the compositing process or adjustments to the outlier indications may be needed to provide the final published county level estimates.

7.4 Evaluation Practices

Each NASS State Statistical Office has taken a major responsibility in developing and evaluating procedures that help provide reliable county estimates in an efficient manner in their State. The autonomy in each program is primarily a function of the funding received from the different State cooperators. The recently developed NASS County Estimates System provides a common framework for producing county estimates within each State. However, the actual sampling and estimation methods still vary to some degree. Some documented research has been conducted over the years to evaluate different procedures, but the Census of Agriculture continues to be the major evaluation tool.

Ford, Bond, and Carter (1983) examined a model-based approach that estimates the percentages of the total USDA State level crop acreage allocated to each county and district. A composite estimator was used to estimate North Carolina county and district level percentages for 1981. The composite included the estimated percentages based on direct estimates of crop acreage from two separate probability crop acreage surveys and the estimated percentage from a simple linear regression on the percentages over time (1972-1980).
The time trend component tended to have much larger weights than the survey components in the composite. Results demonstrated that indications from this procedure were more stable and closer to published values than indications from either of the separate crop acreage surveys. Since the published values tended to follow the composite, which is strongly influenced by the time trend model, the results suggested that NASS statisticians were already informally following the linear time trends in setting the county estimates, and consequently, these procedures were never implemented.

The major evaluation process of the NASS county estimates continues to be the review against the Census of Agriculture numbers every five years. NASS statisticians are actually involved in the review of the Census numbers before they are published to resolve any major discrepancies based on their knowledge of the State's agriculture and their county estimates for the comparable year. After this review, the Census data are resummarized and published. NASS State offices then go through the "Census Review" process. The county estimates series during the last five years is reviewed for consistency with the Census numbers and any necessary changes are made. This is a subjective process, and handled differently in each State. Other available check data may also be used in the revision process, such as data from livestock or crop associations.

7.5 Current Problems and Activities

Currently, research is being conducted on general small area estimation methodology through a cooperative agreement with the Department of Statistics, The Ohio State University. In addition, research needs are being identified by the developers and users of the county estimates system as they gain experience with the programs.
The methodology research with The Ohio State University has focused on statistical procedures for non-probability survey data with the constraint that the county estimates must sum to the official NASS State estimate. Initial research considered a multiple regression estimator for obtaining county estimates of wheat production in Kansas (Stasny, Goel, and Rumsey 1991). The regression model is of the form:

[Equation graphic not available.]

The county total can be estimated if county level values are known for all independent variables in the regression model. In the initial analysis of wheat production county estimates, the independent variables were planted acres of wheat and a district indicator which accounted for differences in yield for different areas of the State. Since production is closely related to planted acres and yield, these seem to be reasonable independent variables. It may be more difficult to identify independent variables for estimated planted acreage. These indications would then be scaled by some method. Evaluation of the regression estimator using simulated data indicated that it generally produced more precise indications than a direct expansion of sample data within the respective county. Analysis also indicated that a constant proportional scaling method worked just as well as more sophisticated methods involving the sum of squared differences or the sum of squared relative differences between the county indications and the final estimates. Future research is planned to consider other variables and other small-area estimators.

Research is also being conducted through the cooperative agreement with The Ohio State University on a synthetic estimator for counties that have zero or only a few positive records for a commodity. In spite of the improved sampling capabilities of the new system, this situation still occurs. Approaches that share information from neighboring counties and across States are being investigated.
Also, there is a need to evaluate survey estimates (direct expansion and ratio) generated on a probability basis. The current program combines data from different sampling designs in such a manner that the actual selection probabilities are not used. This procedure was chosen because it is easy to implement. Also, it makes use of all data collected. As stated previously, the same questionnaire is used for all sample units, regardless of the original sampling design. Consequently, barley data are collected from the barley design, from the corn design, from the hog design, etc. An alternative approach that also makes use of all data collected is to first generate, for each commodity, probability based estimates independently from each design. That is, generate separate barley acreage estimates from the barley design, from the corn design, from the hog design, etc., using the appropriate selection probabilities. These estimates can then be combined to produce an unbiased (or nearly unbiased) estimator with less variance than an estimate based on a single design. Analysis is currently being conducted to evaluate alternative post-stratification and composite estimation strategies.

As has been described, the NASS County Estimates System has evolved over the past 70 years. The published estimates continue to be a relied upon source of essential information for many data users in the agricultural community. However, there is a constant concern about the quality of the estimates and methodological improvements that could be made. The program requires a major commitment of resources for the editing, summarization, and publishing of the data. These issues will continue to be a focus of future research as resources allow.

REFERENCES

Bass, J., Guinn, B., Klugh, B., Ruckman, C., Thorson, J., and Waldrop, J. (1989), "Report of the Task Group on County Estimates," National Agricultural Statistics Service, U.S. Department of Agriculture.
Brooks, E. M. (1977), "As We Recall: the Growth of Agricultural Estimates, 1933-1961," Statistical Reporting Service, U.S. Department of Agriculture.
Ford, B. L., Bond, D., and Carter, N. (1983), "Combining Historical and Current Data to Make District and County Estimates for North Carolina," Staff Report AGES 830906, Statistical Reporting Service, U.S. Department of Agriculture.
North Carolina Agricultural Statistics Service (1986), "North Carolina Probability A&P and County Estimates Surveys," Raleigh, NC: Author.
Stasny, E. A., Goel, P. K., and Rumsey, D. J. (1991), "County Estimates of Wheat Production," Survey Methodology, Vol. 17, pp. 211-225.
U.S. Department of Agriculture (1917), "Conference of Agricultural Statisticians," Author.
U.S. Department of Agriculture, Bureau of Agricultural Economics (1933), "The Crop and Livestock Reporting Service of the United States," Misc. Publication No. 171, Author.
U.S. Department of Agriculture, Bureau of Agricultural Economics (1949), "The Agricultural Estimating and Reporting Services of the United States Department of Agriculture," Misc. Publication No. 703, Author.
U.S. Department of Agriculture, Agricultural Marketing Service (1957), "National Conference of Agricultural Statisticians: Conference Papers, Part B, Commodity Branch Sessions," Author.
U.S. Department of Agriculture, Statistical Reporting Service (1969), "The Story of U.S. Agricultural Estimates," Misc. Publication No. 1088, Author.
U.S. Department of Agriculture, National Agricultural Statistics Service (1989), "Standard for Suppressing Data Due to Confidentiality," Policy and Standards Memorandum No. 12-89, Author.
U.S. Department of Agriculture, National Agricultural Statistics Service (1991), "Sampling Frame Standards for Coverage and Maintenance," Policy and Standards Memorandum No. 14-91, Author.
U.S. Department of Agriculture, National Agricultural Statistics Service (1992), "Estimation Manual," Volume 10, Author.
CHAPTER 8
Model Based State Estimates from the National Health Interview Survey
Donald Malec
National Center for Health Statistics (NCHS)

8.1 Introduction and Program History

There is a continuing need to assess health status, health practices and health resources at both the national level and subnational levels. Estimates of these health items help determine the demand for quality health care and the access individuals have to it. Although NCHS survey data systems can provide much of this information at the national level, little can be provided directly at the subnational level, except for a few large states and metropolitan areas. The need for State and substate health statistics exists, however, because health and health care characteristics are known to vary geographically. Also, health care planning often takes place at the state and county level.

In this chapter our focus will be the production of state and substate indirect estimators from the National Health Interview Survey (NHIS). Information on health status, health practices and health resources is collected annually in the NHIS and direct national estimates of these items are also produced annually. The NHIS is a multistage, personal interview sample survey. It is redesigned every ten years, in order to make use of new population data collected in the U.S. Census of Population. The current sample design uses 1,983 primary sampling units (PSU's), each PSU consisting of a single county or a group of contiguous counties (minor civil divisions are used instead of counties in New England and Hawaii). The population of 1,983 PSU's is stratified and approximately 200 are sampled with probability roughly proportional to their population sizes. Within each sampled PSU, clusters of households are formed and sampled. Areas within a PSU with a high concentration of Blacks are oversampled.
The NHIS is a cross-sectional survey; each year a new sample containing approximately 50,000 households and 120,000 individuals is selected. For additional details about the design of the NHIS see Massey et al. (1989).

High costs are the primary reason that NCHS is unable to provide subnational estimates from its national surveys. With the current budget, the sample size in most states is too small to produce precise direct estimates. There is also a concern that direct estimates of small areas will have a larger component of nonsampling error. For example, a small area may be canvassed by only one interviewer, and the resulting direct estimate may be affected by this interviewer's style. In contrast, areas at higher geographic levels are canvassed by many independent interviewers, so their direct estimates will tend to have a smaller interviewer effect. However, even with large sample sizes and many interviewers, problems can occur in preparing direct estimates of the variance of direct state estimates because the NHIS was not designed for this purpose (Parsons, Botman and Malec, 1990).

Since 1968, the National Center for Health Statistics has produced and evaluated indirect state estimators of health items derived from the NHIS. Although NCHS does not have a program for the regular publication of subnational estimates based on indirect estimation methods, it has supported the development and evaluation of these techniques. This aim has been achieved through the support of in-house research, research grants and small-area conferences and workshops. Through these efforts, three Public Health Series reports containing indirect State estimates of disability and the use of health care have been published (see section 8.2.1). A number of methodological research projects have also taken place at the Center. Research results have appeared in the Center's series reports and in journals and conference proceedings.
The demand for small area estimates is increasing. Subnational estimates are sometimes needed for the administration of Federal block grants. States are also striving to meet the health guidelines for the year 2000 as promoted in Healthy People 2000: National Health Promotion and Disease Prevention Objectives. The assessment of the dietary and nutritional status of the U.S. requires an understanding of these factors at the subnational level. Accurate estimates are needed for all these purposes.

The Center is continuing research efforts into the development of subnational estimators. Currently, estimates based on a hierarchical logistic regression are being produced and evaluated. This model includes demographic effects, county-level effects and county-level variation. These continuing efforts are being made both to improve the accuracy of small area estimates and to produce estimates of their accuracy.

The next redesigned NHIS, which will be fielded from 1995 to 2005, will possibly oversample and screen for both Blacks and Hispanics. It is planned that approximately twice as many PSUs will be sampled as are sampled now. In addition, PSUs will most likely be stratified by state and by urban/rural PSUs within a state. This stratification will not produce precise state estimates, but it will provide a convenient framework for supplementing the NHIS state data with additional state data. The use of state strata may also improve indirect state estimates.

8.2 Program Description, Policies and Practices

8.2.1 Estimates from the NHIS

Three reports containing indirect state estimates from the NHIS have been published. The first report, Synthetic State Estimates of Disability Derived from the National Health Survey (NCHS 1968), contains estimates of long- and short-term disability measures collected during July 1962 - June 1964.
Specifically, the report contains the percent of persons who suffer from one or more chronic conditions, the percent of persons whose activity is limited by a chronic condition, the average number of restricted activity days per person, the average number of bed disability days per person and the average number of work-loss days per employed person. Estimates were made using a ratio adjusted synthetic estimate. The second report, State Estimates of Disability and Utilization of Medical Services: United States, 1969-1971 (NCHS 1977), also contains state estimates of disability as well as state estimates of short-stay hospital utilization, physician visits and dental visits. The estimates in this report are also ratio adjusted synthetic estimates. The third report, State Estimates of Disability and Utilization of Medical Services: United States, 1974-76 (NCHS 1978), contains estimates of the same health items as the preceding report: disability, short-stay hospital utilization, physician visits and dental visits. These estimates were made using a composite estimation method.

These reports present estimates of levels but contain no estimates of accuracy. Estimates of accuracy were not presented because satisfactory estimates of the error of individual state estimates did not exist for either synthetic or composite estimates.

8.2.2 Small Area Research Conferences

The Center has sponsored or cosponsored three research conferences on small area estimation. The first conference, cosponsored with the National Institute on Drug Abuse, was held in Princeton, N.J., in 1978. The second conference was held in Snowbird, Utah, in 1984. The third conference was held in New Orleans, LA, in 1988. The first two conferences produced published proceedings (see NIDA Research Monograph 24 (1979) and NCHS (1984)).

8.3 Estimator Documentation

NCHS has no regular program of producing indirect state estimates from the NHIS.
However, the following estimators have been used in the past to prepare state estimates of health characteristics. Many of these estimators were introduced to correct for the known deficiencies of the synthetic estimator.

8.3.1 Basic Synthetic Estimator

The basic synthetic estimator is used for State estimation when national estimates by class and State-specific population counts by class are available. The synthetic estimator weights the national class means by the proportion of persons in the state belonging to each class. The form of the estimator for state d is:

[Graphic not available.]

A synthetic estimator is unbiased if the population can be divided into mutually exclusive and exhaustive classes, b, in which the average health characteristic in each class does not vary among small areas. If this assumption is true and if a large enough sample is selected in each class, then the synthetic estimator will be accurate. In chapter two, entitled "Synthetic Estimation in Followback Surveys at the National Center for Health Statistics", a detailed example of the use of synthetic estimation is provided. Another example of the construction of a synthetic estimate can be found in Schaible et al. (1977), who create sixty-four demographic classes based on cross-classifications of variables defined by race, sex, age, family size and industry occupation of head of family.

[Graphic not available.]

The synthetic estimator appears in a number of NCHS related publications (e.g., NCHS 1968, 1977; Levy 1971; and Namekata, Levy and O'Rourke 1975) and has been used extensively.

8.3.2 Ratio Adjusted Synthetic Estimator

When regional direct estimates are available, state synthetic estimates are often ratio adjusted to their regional direct estimate. In this way regional estimates, obtained by combining the synthetic State estimates in a region, will equal the corresponding direct estimator.
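The verbal descriptions of the basic and ratio adjusted synthetic estimators can be illustrated with a minimal Python sketch. Because the original formulas are not reproduced in this text, the sketch assumes the usual textbook forms: class means weighted by state population shares, and a single regional scaling factor applied to every state in the region. All names and numbers below are hypothetical and are not NCHS values.

```python
from typing import Dict


def synthetic_estimate(national_class_means: Dict[str, float],
                       state_class_counts: Dict[str, float]) -> float:
    """Weight national class means by the state's population share in each class."""
    total = sum(state_class_counts.values())
    return sum(national_class_means[b] * count / total
               for b, count in state_class_counts.items())


def ratio_adjust(synthetic: Dict[str, float],
                 state_populations: Dict[str, float],
                 regional_direct: float) -> Dict[str, float]:
    """Scale state synthetic estimates so that their population-weighted
    regional combination equals the regional direct estimate (assumed form)."""
    region_total = sum(state_populations[s] for s in synthetic)
    regional_synthetic = sum(synthetic[s] * state_populations[s] / region_total
                             for s in synthetic)
    factor = regional_direct / regional_synthetic
    return {s: value * factor for s, value in synthetic.items()}


# Hypothetical inputs: two demographic classes and a two-state region.
national_means = {"under_45": 0.60, "45_and_over": 0.80}
class_counts = {
    "State A": {"under_45": 300, "45_and_over": 100},
    "State B": {"under_45": 100, "45_and_over": 100},
}

synthetic = {s: synthetic_estimate(national_means, c)
             for s, c in class_counts.items()}
populations = {s: sum(c.values()) for s, c in class_counts.items()}
adjusted = ratio_adjust(synthetic, populations, regional_direct=0.70)

for state in synthetic:
    print(state, round(synthetic[state], 4), round(adjusted[state], 4))
```

In this sketch the adjusted state values, recombined with the same population weights, reproduce the regional direct estimate by construction; the adjustment makes no claim about bias within any single state.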
This adjustment removes all bias in the synthetic estimate at the regional level and is an attempt to remove bias of the synthetic estimator at the state level. The form of this estimator is:

[Graphic not available.]

This estimator has also been used in a number of NCHS publications (e.g., NCHS 1968, 1977 and Levy 1971). It is also used by the Department of Agriculture in their NASS county estimates program (see chapter 7).

[Graphic not available.]

8.3.5 Nearly Unbiased Estimator

[Graphic not available.]

Several key features can be seen in this figure. There is a predominant sex effect. While relatively fewer males in their 20's or 30's visit a physician, relatively more females visit a physician during these child-bearing years. In addition, after accounting for a linear age term, the relative propensity to visit a physician increases for both the underaged and the overaged, regardless of sex. To account for these effects, independent variables corresponding to linear splines were used. After examining a number of residual plots by age, race and sex, a final set of independent variables was chosen. Based on visual inspection, race effects were considered negligible and not included in the final model. For an individual in county c, in the age and sex class denoted by b, the final set of independent variables is defined as follows:

[Graphic not available.]

Partial residual plots based on this eight parameter model were computed, and figure 2 plots the averages of these residuals within age by sex group. As can be seen, the partial residual plot indicates a relative absence of age and sex effects. (The apparent heterogeneity in the plots is at least partly due to an unequal sample size in each age group.)

[Graphic not available.]
These residuals, based on the eight parameter model outlined above, are then used to assess the effect of county covariates. The resulting residuals, with the individual age and sex effects removed, were then averaged within counties of a given type, for example counties with a high level of educational attainment. Corresponding to various typologies of counties, plots of the residuals versus county types were used to assess the influence of county covariates on the proportion of persons visiting a physician. Figure 3 illustrates one such comparison. Here, residuals are averaged within counties exhibiting a certain education level. A number of independent variables were examined in this manner. For physician visits, economic type variables such as per capita income, percent of population below poverty and education level exhibited similar trends. Other county covariates from the March 1989 Area Resource File (1989) and the NCHS County Mortality files were also examined.

[Graphic not available.]

Based on this preliminary work, State estimates for both physician visits and health status are being planned for publication.

8.4 Evaluation Practices

8.4.1 Reports or Evaluation Studies

A number of evaluations of small area estimators have been conducted at NCHS. The first publication of synthetic estimates (NCHS 1968) includes a comparison of a synthetic estimator, a nearly unbiased estimator and Woodruff's estimator. Since then a number of other evaluation studies have been conducted at the Center. A short description of each is given below.

The Use of Mortality Data in Evaluating Synthetic Estimates

A basic synthetic estimator, a ratio adjusted synthetic estimator and a regression adjusted synthetic estimator are evaluated by constructing these estimates of mortality rates for Motor Vehicle Accidents, Major Cardiovascular Diseases, Suicides and Respiratory T.B.
using the complete population of mortality events compiled by NCHS. These state estimates are then compared to their known rates, using the same data. It was found that the accuracy of the synthetic estimates varied considerably from state to state and from item to item (Levy 1971).

Synthetic Estimates of Work Loss Disability for Each State and the District of Columbia

The basic synthetic estimates of partial and complete work disability are compared to precise estimates from the 1970 Census of Population and Housing. Here, the agreement between synthetic and direct estimates of partial work disability was found to be fairly good, while the agreement between corresponding estimates of complete work disability was fairly poor (Namekata, Levy and O'Rourke 1975).

Synthetic Estimation of State Health Characteristics Based on the Health Interview Survey

Formulas for the bias and variance of the nearly unbiased estimator and the ratio-adjusted synthetic estimator are developed. Correlations and average percentage errors between state synthetic estimates based on different formulations of demographic groups are determined for each of a number of NHIS items. For the cases evaluated here, the coarseness of the demographic groupings had little effect on the resulting synthetic estimates (Levy and French 1977).

An Empirical Comparison of the Simple Inflation, Synthetic and Composite Estimators for Small Area Statistics

Using NHIS data, state unemployment rates and percent completing college were estimated using a simple direct estimator, a synthetic estimator and a composite estimator. Each of these estimates was compared to accurate estimates obtained from the 1970 Census. It was demonstrated that the composite estimator was much more accurate than either the synthetic estimator or the inflation estimator for the items under study
(Schaible, Brock, Casady and Schnack 1977).

8.4.2 Comparison with a known standard

The following is a summary of the types of evaluation methods that have been used in the aforementioned reports.

Measurements such as work-loss disability, unemployment rates, percent completing college and marital status are compiled in the U.S. Population Census and are known exactly or with a high degree of accuracy for small geographic areas. In addition, vital statistics such as the mortality rates from motor vehicle accidents, cardiovascular disease, suicides and tuberculosis are known exactly for counties. Although these measurements are not the same as the morbidity rates or health care utilization rates measured in the Center's surveys, they are related. Small area estimation procedures have been tested by making state estimates for these quantities and comparing them to the known quantity. This technique has shown that synthetic state estimates will often have less variability than their corresponding ensemble of true values (Schaible, Brock, Casady and Schnack 1977, 1979). By estimating known quantities, some of the deficiencies of an estimation method can be ascertained. However, the fact that a particular method works well in estimating known quantities does not imply that the method will work well in other situations.

When estimating a known quantity, the distribution of the errors is usually presented in summary form. Usually, the average absolute error or the average squared error is calculated. To assess the similarity of two estimation methods, the simple correlation between their errors is calculated.

There are other ways to obtain accurate standards for comparison. For example, accurate direct estimates are often available for the very large states and for groups of small states (Levy 1971).
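The summary measures mentioned above (average absolute error, average squared error and the simple correlation between the errors of two methods) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the correlation is taken to be the ordinary Pearson coefficient, and the three "known" state values and the two sets of estimates are invented for the example.

```python
from math import sqrt
from typing import List, Sequence


def errors(estimates: Sequence[float], truth: Sequence[float]) -> List[float]:
    """Signed error of each state estimate against its known value."""
    return [e - t for e, t in zip(estimates, truth)]


def average_absolute_error(errs: Sequence[float]) -> float:
    return sum(abs(e) for e in errs) / len(errs)


def average_squared_error(errs: Sequence[float]) -> float:
    return sum(e * e for e in errs) / len(errs)


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, used here to compare the errors of two methods."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical known values and two competing sets of state estimates.
known = [10.0, 20.0, 30.0]
method_a = [12.0, 19.0, 33.0]
method_b = [11.0, 18.0, 34.0]

err_a = errors(method_a, known)
err_b = errors(method_b, known)

print("average absolute error (A):", average_absolute_error(err_a))
print("average squared error (A):", average_squared_error(err_a))
print("correlation of errors (A, B):", correlation(err_a, err_b))
```

A high correlation between the two error vectors would suggest that the methods tend to miss in the same states, which says nothing by itself about whether either method is accurate.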
In addition, years of data can be pooled together to create a larger sample in which direct estimates can be made for a number of subnational areas and then compared to an indirect estimate (Malec, Sedransk and Tompkins, 1993). NCHS publications utilizing this method include NCHS (1968); Levy (1971); Malec, Sedransk and Tompkins (1993); Namekata, Levy and O'Rourke (1975); and Schaible, Brock, Casady and Schnack (1977, 1979).

8.4.3 Comparison of alternative estimators

An alternative estimator can be constructed to specifically deal with a deficiency in another estimator. For example, a composite estimator of a state will preferentially use its state data, whereas a synthetic estimator will not. The relevance of this problem can be evaluated by comparing the two estimates. This method can be used to improve estimates but, since the true value is not known, the quality of the improved estimate is not known. For example, two types of estimators may yield similar state estimates but both be in error. In contrast, a number of different types of estimators may each yield vastly different estimates but one of them may be accurate.

8.4.4 Model Evaluation with Data Analysis

When a statistical model is used to produce indirect estimates, features of the model can be checked using the data that have been sampled. In fact, data from a national survey can be used to develop a model that fits the data well (Malec, Sedransk and Tompkins, 1993), although some features of a complex model may be difficult to evaluate.

8.5 Current Problems and Activities

The Bayesian method, utilizing a hierarchical model (in section 8.3.8), is currently being refined and evaluated with the aim of producing state estimates based on a single year of data. A relatively efficient method of Gibbs sampling due to Gilks and Wild (1992) has been used to produce state estimates.
This procedure uses the exact specifications from (1) and (2) of section 8.3.7 and does not require the pooling of four years of data or a normal approximation to the likelihood. Gilks and Wild's method is still computationally very time consuming, and it is being compared to both an alternative normal approximation to the conditional posterior of A, and to the normal approximation of the likelihood. The comparison is in terms of both accuracy and computational effort.

The variable selection procedure for the hierarchical model has also been improved. Stepwise spline selection is now being used for individual-level variables, county-level variables and interactions between individual-level and county-level variables. In addition, more specific methods to evaluate the model fit are being considered. The impact of the design on inference is also being further evaluated.

In addition to the Bayesian hierarchical method presented above, three other research projects have recently been undertaken at NCHS.

o As part of a research contract to evaluate and redesign the NHIS, a generalization of the synthetic estimator is being developed. First, the population is partitioned into mutually exclusive demographic cells, as in synthetic estimation. Then, within each demographic cell, a hierarchical model is specified for the responses among the small areas. A distribution is not specified; only the first two moments are used. Estimators are derived in a Bayesian framework where data-based prior information, also specified by only the first two moments, may be incorporated. Alternative estimators, derived within this framework, are compared. Estimates of mean square error that are more state specific are also examined. For more information, see Marker and Waksberg (1993).
o The regression adjusted synthetic estimation method of Elston, Koch and Weissert (1991), (see section 8.3.4 above) has been extended by them, under an NCHS contract, to produce estimates of disability covering individuals of all ages. o An empirical study utilizing 1990 Census data on disability is also being planned. A subsample of the census data, similar to the NHIS sample, will be used to model and predict states. These estimates will then be compared to values based on the entire census. Both the hierarchical model, in section 8.3.7, and the aforementioned synthetic regression method, of Elston, Koch and Weissert, will be used to produce and compare estimates. 8.6 Summary and Some Conclusions The National Center for Health Statistics has been developing, producing and evaluating indirect state estimates for over two decades. The newer methods proposed generally incorporate local data with the aim of correcting known deficiencies of established estimates and providing more accurate estimates. Recently, the goal to provide both a good, indirect estimate as well as an estimate of its accuracy has received increased attention. Improvements or deficiencies of indirect estimators have been evaluated using related data for which the actual population values are known. Improvements have also been made by developing classes of estimators which include competing estimators as a special case. Small area estimation research has also followed new developments in computing algorithms and the availability of cheaper computing. In particular, the Gibbs Sampler and related methods have opened up the possibility of utilizing more realistic and complex models for small area estimation. It is safe to say that much more can be accomplished in this domain. The continued research and evaluation of indirect estimators of small areas has helped educate the consumer of these statistics to be more critical of them and to be aware of the underlying assumptions. 
The fact that small area estimation research continues to be an active and supported area of research is a testament to the continued demand for detail.

REFERENCES

Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis, second edition. Springer-Verlag.
Gilks, W.R., and Wild, P. (1992) Adaptive Rejection Sampling for Gibbs Sampling. Applied Statistics, 41: 337-348.
Elston, J.M., Koch, G.G., and Weissert, W.G. (1991) Regression-Adjusted Small Area Estimates of Functional Dependency in the Noninstitutionalized American Population Age 65 and Over. American Journal of Public Health, 81: 335-343.
Kass, R.E., and Steffey, D. (1989) Approximate Bayesian Inference in Conditionally Independent Hierarchical Models (Parametric Empirical Bayes Models). Journal of the American Statistical Association, 84: 717-726.
Landwehr, J.M., Pregibon, D., and Shoemaker, A.C. (1984) Graphical Methods for Assessing Logistic Regression Models. Journal of the American Statistical Association, 79: 61-83.
Levy, P.S. (1971) The Use of Mortality Data in Evaluating Synthetic Estimates. Proceedings of the American Statistical Association, Social Statistics Section: 328-331.
Levy, P.S., and French, D.K. (1977) Synthetic Estimation of State Health Characteristics Based on the Health Interview Survey. Vital and Health Statistics: Series 2, No. 75, DHEW Publication (PHS) 78-1349. Washington: U.S. Government Printing Office.
MacGibbon, B. and Tomberlin, T.J. (1989) Small Area Estimates of Proportions Via Empirical Bayes Techniques. Survey Methodology, 15: 237-252.
Malec, D. and Sedransk, J. (1993) Bayesian Predictive Inference for Units with Small Sample Sizes: The Case of Binary Random Variables. Medical Care, 5: YS66-YS70.
Malec, D., Sedransk, J. and Tompkins, L. (1993) Bayesian Predictive Inference for Small Areas for Binary Variables in the National Health Interview Survey. In Case Studies in Bayesian Statistics, eds., C. Gatsonis, J.S. Hodges, R.E. Kass and N.D. Singpurwalla.
Springer-Verlag.
Marker, D.A. and Waksberg, J. (1993) Small Area Estimation for the U.S. National Health Interview Survey. In Small Area Statistics and Survey Designs, Vol. 1, Central Statistical Office, Warsaw, Poland.
Massey, J.T., Moore, T.F., Parsons, V.L. and Tadros, W. (1989) Design and Estimation for the National Health Interview Survey, 1985-94. National Center for Health Statistics. Vital and Health Statistics: Series 2, No. 110.
Namekata, T., Levy, P.S., and O'Rourke, T.W. (1975) Synthetic Estimates of Work Loss Disability for Each State and the District of Columbia. Public Health Reports, 90: 532-538.
National Center for Health Statistics. (1968) Synthetic State Estimates of Disability. PHS Publication No. 1759. U.S. Government Printing Office.
National Center for Health Statistics. (1977) State Estimates of Disability and Utilization of Medical Services, United States, 1969-1971. DHEW Publication No. (HRA) 77-1241. Health Resources Administration. Washington: U.S. Government Printing Office.
National Center for Health Statistics. (1978) State Estimates of Disability and Utilization of Medical Services, United States, 1974-1976. DHEW Publication No. (PHS) 78-1241. Public Health Service. Washington: U.S. Government Printing Office.
National Center for Health Statistics. (1984) Invited Papers to the Data Use Conference on Small Area Statistics. Proceedings of the 1984 NCHS Data Use Conference on Small Area Statistics, Snowbird, Utah.
NIDA Research Monograph 24 (1979) Synthetic Estimates for Small Areas. DHEW Publication No. (ADM) 79-801. Health Resources Administration. Washington: U.S. Government Printing Office.
Parsons, V.L., Botman, S.L. and Malec, D. (1990) State Estimates for the NHIS. 1989 Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 854-859.
Prasad, N.G.N. and Rao, J.N.K. (1990) The Estimation of the Mean Squared Error of Small-Area Estimators. Journal of the American Statistical Association, 85: 163-171.
Sarndal, C.E. (1984) Design-Consistent Versus Model-Dependent Estimation for Small Domains. Journal of the American Statistical Association, 79: 624-631.
Schaible, W.L., Brock, D.B., Casady, R.J. and Schnack, G.A. (1977) An Empirical Comparison of the Simple Inflation, Synthetic and Composite Estimators for Small Area Statistics. Proceedings of the American Statistical Association, Social Statistics Section: 1017-1021.
Schaible, W.L., Brock, D.B., Casady, R.J. and Schnack, G.A. (1979) Small Area Estimation: An Empirical Comparison of Conventional and Synthetic Estimators for States. Vital and Health Statistics: Series 2, No. 82, DHEW Publication (PHS) 80-1356. Washington: U.S. Government Printing Office.
Schaible, W.L. (1979) A Composite Estimator for Small Area Statistics. In Synthetic Estimates for Small Areas, DHEW Publication No. (ADM) 79-801. Health Resources Administration. Washington: U.S. Government Printing Office.
Smith, A.F.M. and Roberts, G.O. (1993) Bayesian Computation Via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods. Journal of the Royal Statistical Society, Series B, 55: 3-23.
U.S. Department of Health and Human Services (1989) The Area Resource File (ARF) System. ODAM Report No. 7-89.
U.S. Department of Health and Human Services, Public Health Service. (1990) Healthy People 2000: National Health Promotion and Disease Prevention Objectives. DHHS Publication No. (PHS) 91-50213. Washington: U.S. Government Printing Office.
Wong, G.Y. and Mason, W.W. (1985) The Hierarchical Logistic Regression Model for Multilevel Analysis. Journal of the American Statistical Association, 80: 513-524.
Woodruff, R.A. (1966) Use of a Regression Technique to Produce Area Breakdowns of the Monthly National Estimates of Retail Trade. Journal of the American Statistical Association, 61: 497-504.

CHAPTER 9

Estimation of Median Income for 4-Person Families by State

Robert E. Fay and Charles T. Nelson, U.S.
Bureau of the Census
Leon Litow, Department of Health and Human Services

9.1 Introduction and Program History

Starting with income year 1974, the U.S. Census Bureau has computed model-based estimates of median annual income for 4-person families by state using data from the decennial censuses, the Current Population Survey (CPS), and estimates of per capita income (PCI) from the Bureau of Economic Analysis (BEA). Originally, these estimates were used in determining eligibility for the former Title XX Program of the Social Security Act, which provided social services for individuals and families. Since fiscal year (FY) 1982, the Department of Health and Human Services (HHS) has employed the estimated 4-person family medians to administer the Low Income Home Energy Assistance Program (LIHEAP). This program is one of six block grant programs authorized by the Omnibus Budget Reconciliation Act of 1981 (PL 97-35) and administered by HHS. The Augustus F. Hawkins Human Services Reauthorization Act of 1990 (PL 101-501) reauthorized the LIHEAP through FY 1994.

States, the District of Columbia, Indian tribes and tribal organizations, and territories that wish to assist low income households in meeting the costs of home energy may apply for a LIHEAP block grant. "Home energy" is defined by the LIHEAP statute as "a source of heating or cooling in residential dwellings." Section 2603(7) of Title XXVI of PL 97-35 requires the Secretary of HHS to establish the state median incomes for purposes of the program. Section 2605(b)(2)(B)(ii) of PL 97-35 provides that 60% of the state median income is one of the income criteria that states can use in determining a household's eligibility for the LIHEAP. HHS publishes the estimated 4-person family medians by state annually in the Federal Register.
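As a minimal sketch, the 60% criterion can be combined with the family-size percentages listed in the paragraph that follows (52% for 1 person, rising in steps of 16 points to 132% for 6 persons, plus 3 points for each additional person). The function names and the $40,000 median are hypothetical, chosen only for illustration; they are not an HHS implementation or a published figure, and the 60% figure is only one of the criteria a state may use.

```python
def family_size_percentage(size: int) -> int:
    """Percent of the 4-person median assigned to a family of the given size."""
    if size < 1:
        raise ValueError("family size must be at least 1")
    if size <= 6:
        return 52 + 16 * (size - 1)   # 52, 68, 84, 100, 116, 132
    return 132 + 3 * (size - 6)       # each person beyond 6 adds 3 points


def state_median_for_size(four_person_median: float, size: int) -> float:
    """State median income implied for a family of the given size."""
    return four_person_median * family_size_percentage(size) / 100


def income_threshold(four_person_median: float, size: int,
                     criterion_percent: int = 60) -> float:
    """Income limit at a given percent (default 60) of the size-adjusted median."""
    return state_median_for_size(four_person_median, size) * criterion_percent / 100


# Hypothetical 4-person family median for one state.
example_median = 40_000.0
for n in (1, 4, 8):
    print(n, family_size_percentage(n),
          state_median_for_size(example_median, n),
          income_threshold(example_median, n))
```

Keeping the percentages as integers until the final division avoids accumulating floating-point error in the lookup itself.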
For purposes of administration, state median incomes are established for families of other sizes as a fixed proportion, depending on family size, of the estimated median for 4-person families. The following percentages of the 4-person family medians are used: 52% for 1-person households, 68% for 2 persons, 84% for 3 persons, 100% for 4 persons, 116% for 5 persons, and 132% for 6 persons. For families with more than 6 persons, each person beyond 6 adds an additional 3%. U.S. Bureau of the Census (1991) provides further details, as does the Federal Register on March 3, 1988 at 53 FR 6824.

In addition to their programmatic use in the administration of the LIHEAP and the earlier Title XX Program of the Social Security Act, the estimates represent the only intercensal state-specific family income estimates produced by the Census Bureau. Consequently, these estimates have been of interest to a number of general data users. Until the recent publication of the historical series in U.S. Bureau of the Census (1991), however, the estimates did not appear in a regular publication series of the Census Bureau.

9.2 Program Description, Policies, and Practices

Throughout this period the methodology has relied on three sources:

1. Estimates of median family income by state from the decennial censuses. Since the census asks about income during the previous year, the census medians pertain to income years 1969, 1979, etc. Although the estimates are based on the long-form sample, the size of this sample provides estimates with virtually negligible sampling errors at the state level every 10 years.

2. Sample estimates of median income by family size by state from the March CPS. Although the CPS estimates are available annually, their direct use is limited by substantial sampling variability due to the size of the CPS sample.

3. Annual estimates of PCI from BEA.
These estimates, based on aggregate statistics on components of income from administrative series, have negligible sampling error. The PCI estimates are measures of average income per person, however, and so are only indirectly linked to median income for families.

U.S. Bureau of the Census (1991) describes each of these series in detail. In brief, however:

1. The decennial census provides geographically detailed estimates by sampling roughly 1/6 of the households in the U.S. to receive the long-form census questionnaire. The census income concept includes as sources: wages and salaries, self-employment income (including losses), Social Security, Supplemental Security Income (SSI), cash public assistance, interest, dividends, rents, royalties, estates and trusts, veterans' payments, unemployment and workers' compensation, private and government survivor, disability, and retirement pensions, alimony, child support, and any other source of money income that is regularly received. Capital gains (or losses) and lump-sum or one-time payments such as life insurance settlements are excluded. Noncash benefits, such as government noncash transfers (food stamps, Medicaid, etc.) and private sector in-kind benefits, are also excluded from the money income definition.

2. The CPS is a monthly labor force survey of about 60,000 households across the country. Each March, the CPS asks additional questions about money income during the previous year, using the same concepts as the decennial census. The primary purpose of the CPS sample is to produce national estimates. For example, the CPS provides a national estimate of median income for 4-person families, which is published annually along with many other estimates from the survey.

3. The BEA income series is based on a different concept of income than the one used by the Census Bureau in the decennial censuses and the CPS.
The major difference is that BEA personal income attempts to represent income from all sources, noncash as well as cash. (Appendix A of U.S. Bureau of the Census, 1991, compares the BEA personal income concept to the census money income concept underlying both the CPS and the decennial censuses. Budd, Radner, and Hinrichs, 1973, provide additional detail on this point.) Another conceptual difference is that the BEA estimates of PCI represent the ratio of estimates of aggregate income to the number of persons in each state. They are not disaggregated by size of family and do not distinguish the income of family members from unrelated individuals and persons in 1-person households. The PCI series is developed from a variety of government statistics, including Federal tax records from the Department of the Treasury, the insurance files of the Social Security Administration, and state unemployment records collected by the U.S. Department of Labor. The BEA produces annual estimates of per capita personal income for states and other geographic areas. Thus, the BEA estimates generally do not have associated sampling errors, unlike the CPS estimates, since they essentially do not employ sampling techniques. These estimates are described by Bailey, Hazen, and Zabronsky in Chapter 3 of this report.

Before outlining the elements of the methodology, we first compare the estimates for income year 1989, based on the March 1990 CPS and published in March 1991, with medians for 4-person families from the 1990 census. Figure 9.1 shows the geographic distribution of the true increase in median income for 4-person families during the decade, since the 1980 census.

[Graphic omitted.]

Figure 9.1 indicates that the greatest relative increase during the decade in the median income of 4-person families occurred in the Northeast region, where most states more than doubled their medians, according to the census.
Other areas of active increase include additional states in the East and South Atlantic, and Tennessee, Minnesota, California, and Hawaii. Figure 9.1 also shows that median income in some areas of the country has grown considerably more slowly.

Figure 9.2 presents the estimated increase since the 1980 census according to the model. Although there are some differences between the census and the model predictions, the comparison of the two maps shows that the model is successful in capturing most real sources of change in median family income. Some of the states are not classified into the same grouping in Figures 9.1 and 9.2, but, in each case, the difference is by at most one category. For example, states estimated to be among the fastest growing group were either in that category or the next one down, and so forth.

[Graphic omitted.]

Figure 9.3 shows the geographic distribution of the key predictor variable, the increase in estimated BEA per capita income. Note that the scale of percent income increase is shifted on this third map compared to the other two; in general, the proportional increase in per capita personal income outstripped the increase in the median income of 4-person families during the decade. With the rescaling, however, the BEA income figures are quite successful predictors of the corresponding increase in the median income of 4-person families at the state level.

Figures 9.4 and 9.5 illustrate additional features of the performance of the model. Both figures include a regression line from the simple linear regression as an aid in assessing fit, although neither line is formally part of the model.

[Graphic omitted.]

Figure 9.4 compares the estimated increase with the actual increase in median incomes, according to the census. Again, the predictions are not perfect but, nonetheless, appear to capture most of the variation among states in the increase in median income.
Figure 9.5 shows that the relationship between increase in BEA PCI and increase in the census median income is essentially linear over the entire spectrum. As previously indicated by the scaling of Figure 9.3, Figure 9.5 provides further evidence of somewhat greater dispersion in the increase in the BEA estimates than in the census medians.

The figures provide a summary of the basic features and performance of the model, and they may be of help to many readers. The remainder of this chapter aims specifically toward a technical audience interested in the exact form of the model, plans for further assessment, and a brief description of potential enhancements.

U.S. Bureau of the Census (1991) furnishes a more detailed history of the program, the estimates for calendar years 1974-1989, and citations for the publication of the estimates in the Federal Register annually since 1983. The current methodology has been in place since income year 1984, with minor refinements over this period of time.

The methodology is applied separately for each year, t, in the series. (For simplicity, the implicit subscript, t, is not shown in the following, except where necessary to avoid confusion. Section 9.5 will discuss the possibility of alternative models more attuned to the longitudinal nature of the problem.) The primary elements of the current methodology are:

[Graphic omitted.]

[Graphic omitted.]

A key feature of the model is the joint (multivariate) estimation of the target variable of interest, median income for 4-person families, together with an auxiliary variable, the combined 3- and 5-person family medians, even though the auxiliary variable is not itself a subject of interest. In fact, the purpose of the multivariate approach is to realize additional gains in the estimation of 4-person family medians. Fuller and Harter (1987) and Fay (1987) motivate the possible advantages of the multivariate approach for problems of this sort.
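The exact form of the model is not reproduced in the text above, so the following Python sketch does not implement it. It illustrates only the general idea that models of this family build on: a composite estimator of the type introduced by Fay and Herriot (1979), in which a noisy direct sample estimate is pulled toward a regression prediction, with a weight determined by the two variances. This is a univariate simplification under stated assumptions. The function name, the argument names, and the numbers are hypothetical, and the model variance is treated as known, whereas in practice it would have to be estimated.

```python
def composite_estimate(direct, sampling_var, prediction, model_var):
    """Combine a direct estimate with a regression prediction.

    direct       -- direct sample estimate for one state (e.g., a CPS median)
    sampling_var -- sampling variance of the direct estimate
    prediction   -- regression prediction for the same state
    model_var    -- variance of the true value around the regression
                    prediction (assumed known here for illustration)

    Returns (estimate, weight), where weight is the share given to the
    direct estimate: model_var / (model_var + sampling_var).
    """
    if sampling_var < 0 or model_var < 0:
        raise ValueError("variances must be nonnegative")
    total = sampling_var + model_var
    if total == 0:
        raise ValueError("at least one variance must be positive")
    weight = model_var / total
    estimate = weight * direct + (1.0 - weight) * prediction
    return estimate, weight


# Hypothetical numbers: a direct estimate of $41,000 with a standard error
# of $2,000, and a regression prediction of $39,000 with a model standard
# deviation of $1,000.
estimate, weight = composite_estimate(
    direct=41000.0,
    sampling_var=2000.0 ** 2,
    prediction=39000.0,
    model_var=1000.0 ** 2,
)
print(f"weight on direct estimate: {weight:.2f}")
print(f"composite estimate: {estimate:,.0f}")
```

The larger the sampling variance of the direct estimate relative to the model variance, the more the composite leans on the regression prediction. A multivariate version would replace the scalar variances with covariance matrices so that a correlated auxiliary estimate can contribute as well.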
The current model replaced an earlier version, whose major features were:

[Graphic omitted.]

[Graphic omitted.]

[Graphic omitted.]

Although Woodruff's original method was based on computing the density of the distribution in the interval containing the estimated median, the small expected number of sample cases falling into each of the $2,500 intervals in the estimated CPS income distribution by family size within states makes this approach unstable. Empirically, experimentation with samples of 100 and 400 cases drawn from the national income distribution showed that it was preferable to use a broader interval to estimate the density for samples of these sizes. Specifically, in addition to the interval containing the median, the 4 intervals of $2,500 immediately below and the 2 intervals immediately above are used to form a combined interval of width $17,500 for purposes of estimating the density, d. The variance estimator is:

[Graphic omitted.]

[Graphic omitted.]

[Graphic omitted.]

9.4 Evaluation Practices

9.4.1 Comparisons to the 1980 Census

The availability of direct census estimates every 10 years affords a significant opportunity to evaluate and recalibrate the estimation technique. The current methodology grew from its predecessor, described at the end of section 9.2, primarily as a consequence of comparisons to 1980 census results. The conclusions of that comparison, discussed in Fay (1986), were, in brief:

1) The earlier method yielded generally useful state estimates, but 1 estimate was in error by more than 10% and 11 additional estimates were in error by 5% or more.

[Graphic omitted.]

9.4.2 Comparisons to the 1990 Census

Figures 9.1 - 9.5 compare the model to recently available estimates from the 1990 census. Overall, the results of the comparison are quite encouraging. For example, no estimate was in error by 10% or more, and only 7 were in error by 5% or more.
These findings reflect only the first steps in a more complete analysis. The next critical step, however, will be to react to a surprising finding reported in Section 9.3.3, namely, that the CPS sample estimates of the medians by state appear to differ from the census values by more than sampling error alone would suggest. This is in contrast to the comparison of the 1980 CPS and census. Consequently, some form of nonsampling error is possible, but a more systematic study of components of differences between the CPS and the census will be required to isolate the significant source or sources of these differences.

Figures 9.1 and 9.2 provide one suggestion of a possible source of difference: relative to the census, the model underpredicted the increase in four large states: California, Florida, New York, and Texas. In each of these states, the CPS sample estimates themselves fall below the census values. All four states also have appreciable Hispanic populations. Furthermore, preliminary comparisons suggest higher estimated medians for Hispanics in the census compared to the CPS. Hispanic income may, however, be only one of several factors underlying differences between census and CPS state medians. The outcome of a more complete investigation should provide a firmer basis to separate the issues of limitations of the model from possible nonsampling error in either the CPS or the census.

Once issues of nonsampling error are more firmly understood, the census results should permit assessment of a number of features of the current model:

1) The average error of the model predictions.

2) Whether errors are differential for certain classes of states, e.g., small vs. large, rapidly changing vs. static, lower income vs. higher, etc.

3) Whether errors cluster geographically.

4) Whether modification of the current predictors would yield significant improvement in prediction.
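As a rough illustration of the first two assessments listed above, the Python sketch below computes relative errors of model estimates against census values, summarizes them overall, and repeats the summary within classes of states. It also counts estimates in error by 5% or more and by 10% or more, the thresholds used in Section 9.4. It is a minimal sketch, not the Census Bureau's evaluation procedure: the function names, the state labels, the size classes, and all dollar figures are made up for illustration, and geographic clustering (item 3) is not addressed.

```python
from statistics import mean


def relative_errors(model, census):
    """Signed relative error, (model - census) / census, for each state."""
    return {s: (model[s] - census[s]) / census[s] for s in census}


def summarize(errors, thresholds=(0.05, 0.10)):
    """Average signed and absolute relative error, plus the number of
    states whose absolute relative error meets or exceeds each threshold."""
    signed = list(errors.values())
    absolute = [abs(e) for e in signed]
    return {
        "mean_signed": mean(signed),
        "mean_abs": mean(absolute),
        "counts": {t: sum(a >= t for a in absolute) for t in thresholds},
    }


def summarize_by_class(errors, classes):
    """Apply summarize() separately within each class of states."""
    groups = {}
    for state, err in errors.items():
        groups.setdefault(classes[state], {})[state] = err
    return {label: summarize(group) for label, group in groups.items()}


# Made-up medians for four hypothetical states.
model = {"A": 38000.0, "B": 42000.0, "C": 30000.0, "D": 45000.0}
census = {"A": 40000.0, "B": 40000.0, "C": 30000.0, "D": 50000.0}
size_class = {"A": "large", "B": "small", "C": "small", "D": "large"}

errors = relative_errors(model, census)
overall = summarize(errors)
by_class = summarize_by_class(errors, size_class)

print(overall)
print(by_class)
```

A negative mean signed error within a class, such as the hypothetical "large" states here, is the kind of pattern that would indicate differential underprediction of the sort noted for California, Florida, New York, and Texas.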
The census data permit assessment of the current model but also offer the occasion for consideration of more significant changes for subsequent years. A number of these are described in the next section.

In addition to relying on the census for evaluation, work on alternative models, such as the hierarchical Bayes model described by Datta, Fay, and Ghosh (1991), has addressed methods to obtain estimates of individual and aggregate measures of performance from the sample estimates when census data are not available. The 1990 census data should help to calibrate these procedures for future use. (Unfortunately, these procedures may be adversely affected by nonsampling error producing differences between the expected values of the CPS and the census medians at the state level, so that understanding sources of nonsampling error is a critical step here as well.)

9.5 Current Problems and Activities

Implemented a year at a time, the current model and its predecessor have produced a series spanning income years 1974 to 1991 without taking any advantage of the longitudinal or time series nature of this problem. Several researchers have investigated longitudinal extensions that attempt to address this aspect.

The current model relies simply on observed relationships that appear to be quite linear, without taking advantage of any specific knowledge about income distributions. Possibly, a more explicit parametric model for the income distribution may represent a fruitful alternative. On the other hand, the utility of such efforts would have to be balanced against requirements of parsimony imposed by the relatively small sizes of the CPS state samples.

As noted at the end of Section 9.4, another area of potential research is to attempt to improve measures of error from the model for the intercensal period. Recent research in fully Bayes procedures may prove promising for estimation of error.

References

Budd, E. C., Radner, D. B., and Hinrichs, J. C.
(1973), "Size Distribution of Family Personal Income: Methodology and Estimates for 1964," BEA-SP 73-21, Washington, DC: Bureau of Economic Analysis.

Datta, G., Fay, R. E., and Ghosh, M. (1991), "Hierarchical and Empirical Multivariate Bayes Analysis in Small Area Estimation," in Proceedings of the Annual Research Conference, Washington, DC: U.S. Bureau of the Census, pp. 63-79.

Efron, B., and Morris, C. (1971), "Limiting the Risk of Bayes and Empirical Bayes Estimators - Part I: The Bayes Case," Journal of the American Statistical Association, 66, 807-815.

Fay, R. E. (1986), "Multivariate Components of Variance Models as Empirical Bayes Procedures for Small Domain Estimation," in Proceedings of the Survey Research Methods Section, Washington, DC: American Statistical Association, pp. 99-107.

__________ (1987), "Application of Multivariate Regression to Small Domain Estimation," in Small Area Statistics, An International Symposium, R. Platek, J. N. K. Rao, C. E. Sarndal, and M. P. Singh, eds., New York: John Wiley & Sons, pp. 91-102.

Fay, R. E. and Herriot, R. A. (1979), "Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data," Journal of the American Statistical Association, 74, 269-277.

Fuller, W. A. and Harter, R. M. (1987), "The Multivariate Components of Variance Model for Small Domain Estimation," in Small Area Statistics, An International Symposium, R. Platek, J. N. K. Rao, C. E. Sarndal, and M. P. Singh, eds., New York: John Wiley & Sons, pp. 103-123.

U.S. Bureau of the Census (1991), "Estimates of Median 4-Person Family Income by State: 1974-1989," Current Population Reports, Technical Paper No. 61, U.S. Government Printing Office, Washington, DC.

Woodruff, R. (1952), "Confidence Intervals for Medians and Other Position Measures," Journal of the American Statistical Association, 47, 635-646.
CHAPTER 10

Recommendations and Cautions

During the design of a data system, indirect estimators rarely, if ever, are considered for Federal statistical programs when resources to produce direct estimates of adequate precision are available. However, given an existing system, if direct estimation is judged to be inadequate for a domain not specified in the design, indirect estimation may, in some cases, prove to be a valuable alternative. There are a number of reasons that direct estimators are preferable to indirect ones and, if Federal statistical agencies are to improve the usefulness of indirect estimates, a number of important issues, including those that follow, should receive additional attention.

o Traditionally, statistical programs are designed with only direct estimators for large domains in mind; indirect estimators for small domains are considered only after the design has been determined. Planning for both direct estimators and indirect estimators at the sample design stage should lead to improved indirect estimates.

o The purpose of the analysis should be kept in mind when selecting an indirect estimator. This is important with direct estimators, but even more so with indirect estimators where there may be implications for the choice between a domain indirect and a time indirect estimator.

o More coordination and cooperation among Federal agencies would increase accessibility to auxiliary information for use with indirect estimators.

o Additional empirical evaluations of existing and proposed indirect estimators are needed. Existing evaluations are generally limited in the conclusions they are able to draw.

o Additional research on errors associated with indirect estimators is necessary. Additional attention should be directed to the estimation, not only of variances, but also of biases, mean square errors, and confidence intervals.
o When indirect estimates are published, they should be distinguished from direct estimates and accompanied by a clear explanation of model assumptions and appropriate cautions.

These points are developed further in the sections that follow.

10.1 Sample Designs for Small Areas

The design of samples for the production of both direct and indirect estimates should receive more attention.

Since the 1940's, considerable research has been conducted on the design of samples for use with direct estimators. More recently, indirect estimators have received the attention of researchers, but most often in a way that treats the sample design as fixed and beyond the scope of the research effort. An exception to this statement is found in the work on rolling samples and censuses (for example, Kish, 1990 with discussion by Scheuren, 1990 and Hansen, 1990) where direct and/or indirect estimators may be defined depending on the population quantity being estimated and whether the design requires a subset of units to be observed in more than one time period of interest.

Present applications of indirect estimators generally evolved from data systems designed for other purposes. Few, if any, existing data sources were created with the production of indirect estimates as a consideration. Sample designs considering the production of both direct estimates and indirect estimates deserve much more attention than they have received thus far. This is particularly true for continuing surveys where information useful for design and estimator evaluation is obtained periodically.

None of the programs described in this report were designed with indirect estimation as an explicit consideration. However, indirect estimators which benefit from observations in the domain and time of interest have been enhanced by certain design decisions.
For example, the redesign of the National Health Interview Survey, discussed in Chapter 8, will include stratification by individual states so that the sample size within each state is controlled.

10.2 Use of Estimates and the Selection of an Estimator

Selection of an appropriate indirect estimation method should take into account the purpose for which estimates are to be used.

Indirect estimators should be selected with great caution and perhaps even avoided in some situations. Indirect estimators may perform poorly when the purpose of the analysis is to identify domains with extreme population values, to rank domains, or to identify domains that fall above or below some predetermined level.

A domain indirect estimator borrows strength across domains and is justified under a model that assumes model parameters are the same across domains. If the purpose of the analysis is to make comparisons across domains for a given time period, an inconsistency between objective and method can be avoided if a time indirect estimator is chosen. Of course, depending on the application and the available auxiliary information, this may not always be the appropriate course of action. Similarly, if the purpose of the analysis is to make comparisons across time periods within a given domain, it may be more appropriate to select from among domain indirect estimators.

10.3 Auxiliary Information

More coordination and cooperation among Federal agencies would allow expanded access to the auxiliary information on which indirect estimators depend.

Regardless of how appropriate the conceptual and theoretical basis of a particular indirect estimator may be, the estimator cannot be used in practice if the required auxiliary variables, which usually come from administrative sources or censuses, are not available. Without auxiliary information related to the variable of interest and for the domain and time period of interest, only the crudest indirect estimators can be implemented.
For programs described in this report, the search for auxiliary variables generally seems to have been somewhat ad hoc, with little coordination or cooperation among statistical agencies. An integrated data system for geographical areas would make auxiliary information more readily available and would potentially lead to improved indirect estimators. Such a system might also take advantage of recent computational technologies. Although not previously discussed in this report and designed with the objectives of mental health needs assessment, policy, and research in mind, the National Institute of Mental Health's Health Demographic Profile System (Goldsmith et al. 1984) provides a variety of social indicators for geographic areas. In addition, Statistics Canada has addressed this issue in their Small Area Data Program (Brackstone 1987).

10.4 Empirical Evaluations

Additional empirical evaluations are needed to help determine whether indirect estimators are adequate for the intended purposes.

The decision whether or not to use an indirect estimator is rarely an easy one. Empirical evaluations play a critical role in the decision process. The performance of an indirect estimator in a given application depends on the variable(s) of interest and their relationship to the auxiliary variables through the underlying model. Generalization from one application to another is difficult, so each application requires a different empirical evaluation.

In practice, indirect estimators are considered for use in situations where data are not available to support the use of direct estimators. The data that, if available, would support the use of direct estimators are the same data that would be most useful in the evaluation of indirect estimators and models. In other words, the need for indirect estimators is the greatest in precisely those situations where data are not available for their adequate empirical evaluation.
For this reason, it is rare that a single empirical evaluation of an indirect estimator is completely convincing. There seems to be no satisfying solution to this problem. Resourcefulness in locating data sources and the use of multiple empirical evaluations will be required in most, if not all, situations.

Two approaches can be used for empirical evaluations of indirect estimators. In the first, the estimators under consideration are used to produce estimates; these estimates then are compared to a better estimate or census value. The estimator that performs best using an empirical average squared error or similar criterion is judged to be most appropriate for the given application. The great majority of evaluation efforts connected with indirect estimates have used this approach. The second approach is to evaluate how well the models associated with the competing estimators fit the data. A principled approach is needed; models should not only fit the data but also have a conceptual basis. An indiscriminate search through a large number of models does not often produce appealing results.

Empirical evaluations of indirect estimators are critical, and careful evaluations should include consideration of underlying models as well as the corresponding estimators. In addition to initial evaluations leading to the selection of an estimator, continuing evaluations of the underlying models should be conducted for those series that are published periodically.

10.5 Measures of Errors for Indirect Estimators

Care should be taken in the production of measures of errors of indirect estimators. Estimates of variances alone may be misleading to data users. Additional research on the estimation of biases, mean squared errors, and confidence intervals is needed.

At present, none of the programs described in this report provide measures of error to accompany published indirect estimates. It is difficult to produce meaningful measures of error for indirect estimators.
Expressions for indirect estimator variances and biases under the assumed model are usually straightforward to derive, and estimation of variances is usually possible. If the model leading to an estimator is a good approximation of reality in a given application, then the variance of the estimator derived under the model should serve as an adequate measure of error. However, if the model associated with the estimator is not a good approximation, the estimator will have a bias due to model failure. If the bias is large relative to the variance, the variance, by itself, will not be an adequate measure of error, and an estimate of the mean squared error will be required. This is a difficult problem since an estimate of the mean squared error requires an estimate of the bias.

Bias in an indirect estimator arises from model failure; that is, failure of the model to adequately represent the variability of the variable of interest over domains and time. Since the population quantity being estimated is specific to a given domain and time, it follows that an estimate of this bias requires data from that domain and time. If the available data are inadequate to produce reliable direct estimates, it is unlikely that they would be adequate to support acceptable estimates of biases.

Estimation of confidence intervals for indirect estimators is also a difficult problem in practice. The existing research in this area provides valuable results, but additional work is needed. In the interim, measures of error as indicated by empirical evaluation studies may be the only source of error information for users.

10.6 Publication of Indirect Estimates

A clear distinction should be made between direct and indirect estimators. When indirect estimates are published, they should be accompanied by appropriate cautions and clear explanations of the model assumptions.

Direct estimates published by Federal statistical agencies usually meet expected reliability and validity criteria.
Even unsophisticated users of statistics have come to expect estimates from Federal statistical agencies to be trustworthy in some sense. Rarely is enough known about the structure of indirect estimators to produce adequate measures of their quality. For this reason, it is misleading to the public and potentially damaging to the reputation of Federal statistical agencies to publish indirect estimates that are not clearly distinguished from direct estimates and that are not offered with appropriate cautions. In any case, a clear statement of the assumptions required for the indirect estimator to be model unbiased should be included with all published estimates.

This issue has been addressed differently by various programs. For example, the Bureau of Labor Statistics produces estimates for a limited number of states using a direct estimator and for the remainder of the states using an indirect estimator. The two sets of estimates are published in the same table but separated into the two groups with explanatory notes. The National Center for Health Statistics publishes indirect estimates from the National Health Interview Survey in a separate publication containing explanations and cautions.

10.7 Cautions for Producers and Users of Indirect Estimates

As evidenced by the large and growing literature on indirect estimation methods, numerous researchers have been working on the challenging problems facing those who must produce estimates with inadequate resources. Many authors suggest new approaches or variations of existing approaches, but few caution about the dangers associated with the use of indirect estimation methods. The following exceptions should be noted.

"When first one casts his eye upon the synthetic estimate, he shrinks away in horror; with a second and then a third look, the aversion begins to fade, until finally one clasps the estimator to his bosom, and embraces it with affection. . . .
The synthetic estimator is a dangerous tool, but with careful further development, it has an attractive potential." (Simmons 1979, paraphrasing Alexander Pope) "A workshop of this sort, focused on a specific technique, can spur development, but it can also be dangerous. The danger is that from hearing many people speak many words about synthetic estimation we become comfortable with the technique. The idea and the jargon become familiar, and it is easy to accept that 'Since all these people are studying synthetic estimation, it must be okay.' We must remain skeptical and not allow familiarity to dull our healthy skepticism. There is reason for some optimism, but it must be guarded optimism." (Royall 1979) ". . .a cautious approach should be adopted to the use of small area estimates, and especially to their publication by government statistical agencies. When government statistical agencies do produce model-dependent small area estimates, they need to distinguish them clearly from conventional sample-based estimates. Before small area estimates can be considered fully credible, carefully conducted evaluation studies are needed to check on the adequacy of the model being used. Sometimes model dependent small area estimators turn out to be of superior quality to sample-based estimators, and this may make them seem attractive. However, the proper criterion for assessing their quality is whether they are sufficiently accurate for the purposes for which they are to be used. In many cases, even though they are better than sample-based estimators, they are subject to too high a level of error to make them acceptable as the basis for policy decisions." (Kalton 1987) Indirect estimation should be considered when other, more robust alternatives are unavailable, and then only with appropriate caution and in conjunction with substantial research and evaluation. 
Even after such efforts, neither producers nor users should forget that indirect estimates may not be adequate for the intended purpose.

REFERENCES

Brackstone, G.J. (1987), "Small Area Data: Policy Issues and Technical Challenges," in Small Area Statistics, New York: John Wiley and Sons.

Goldsmith, H.G., Jackson, D.J., Doenhoefer, S., Johnson, W., Tweed, D.L., Stiles, D., Barbano, J.P., and Warheit, G. (1984), "The Health Demographic Profile System's Inventory of Small Area Social Indicators," National Institute of Mental Health, Series BN No. 4, DHHS Pub. No. (ADM) 84-1354, Washington, D.C.: U.S. Government Printing Office.

Hansen, M.H. (1990), "Discussion of paper by Kish," Survey Methodology, 16-1, 81-86.

Kalton, G. (1987), "Panel Discussion," in Small Area Statistics, New York: John Wiley and Sons.

Kish, L. (1990), "Rolling Samples and Censuses," Survey Methodology, 16-1, 63-71.

Royall, R.A. (1979), "Prediction Models in Small Area Estimation," in Synthetic Estimates for Small Areas (National Institute on Drug Abuse, Research Monograph 24), Washington, D.C.: U.S. Government Printing Office.

Scheuren, F. (1990), "Discussion of paper by Kish," Survey Methodology, 16-1, 72-79.

Simmons, W.R. (1979), "Discussion of a paper by Levy," in Synthetic Estimates for Small Areas (National Institute on Drug Abuse, Research Monograph 24), Washington, D.C.: U.S. Government Printing Office.

Reports Available in the Statistical Policy Working Paper Series

1. Report on Statistics for Allocation of Funds (Available through NTIS Document Sales, PB86-211521/AS)
2. Report on Statistical Disclosure and Disclosure-Avoidance Techniques (NTIS Document Sales, PB86-211539/AS)
3. An Error Profile: Employment as Measured by the Current Population Survey (NTIS Document Sales, PB86-214269/AS)
4. Glossary of Nonsampling Error Terms: An Illustration of a Semantic Problem in Statistics (NTIS Document Sales, PB86-211547/AS)
5.
Report on Exact and Statistical Matching Techniques (NTIS Document Sales, PB86-215829/AS)
6. Report on Statistical Uses of Administrative Records (NTIS Document Sales, PB86-214265/AS)
7. An Interagency Review of Time-Series Revision Policies (NTIS Document Sales, PB86-232451/AS)
8. Statistical Interagency Agreements (NTIS Document Sales, PB86-230570/AS)
9. Contracting for Surveys (NTIS Document Sales, PB83-233148)
10. Approaches to Developing Questionnaires (NTIS Document Sales, PB84-105055/AS)
11. A Review of Industry Coding Systems (NTIS Document Sales, PB84-135276)
12. The Role of Telephone Data Collection in Federal Statistics (NTIS Document Sales, PB85-105971)
13. Federal Longitudinal Surveys (NTIS Document Sales, PB86-139730)
14. Workshop on Statistical Uses of Microcomputers in Federal Agencies (NTIS Document Sales, PB87-166393)
15. Quality in Establishment Surveys (NTIS Document Sales, PB88-232921)
16. A Comparative Study of Reporting Units in Selected Employer Data Systems (NTIS Document Sales, PB90-205238)
17. Survey Coverage (NTIS Document Sales, PB90-205246)
18. Data Editing in Federal Statistical Agencies (NTIS Document Sales, PB90-205253)
19. Computer Assisted Survey Information Collection (NTIS Document Sales, PB90-205261)
20. Seminar on the Quality of Federal Data (NTIS Document Sales, PB91-142414)
21. Indirect Estimators in Federal Programs (NTIS Document Sales, PB93-209294)

Copies of these working papers may be ordered from NTIS Document Sales, 5285 Port Royal Road, Springfield, VA 22161, (703) 487-4650.

1 A more appropriate method would be to use the ratio of the net migration rate for the total population to the net migration rate of the school-aged population, but component method II has traditionally used the difference.
Page Last Modified: April 20, 2007