Statistical Policy Working Paper 21 - Indirect Estimators in Federal Programs

Federal Committee on Statistical Methodology
Office of Management and Budget

Statistical Policy

Working Paper 21

Indirect Estimators

in Federal Programs

Prepared by

Subcommittee on Small Area Estimation

Federal Committee on Statistical Methodology

Statistical Policy Office

Office of Information and Regulatory Affairs

Office of Management and Budget

July 1993

MEMBERS OF THE FEDERAL COMMITTEE ON

STATISTICAL METHODOLOGY

(July 1993)

Maria E. Gonzalez, Chair

Office of Management and Budget

Yvonne M. Bishop, Energy Information Administration

Warren L. Buckler, Social Security Administration

Cynthia Z.F. Clark, National Agricultural Statistics Service

Steven Cohen, Administration for Health Policy and Research

Zahava D. Doering, Smithsonian Institution

Roger A. Herriot, National Center for Education Statistics

C. Terry Ireland, National Computer Security Center

Charles D. Jones, Bureau of the Census

Daniel Kasprzyk, National Center for Education Statistics

Nancy Kirkendall, Energy Information Administration

Daniel Melnick, Substance Abuse and Mental Health Services Administration

Robert P. Parker, Bureau of Economic Analysis

Charles P. Pautler, Jr., Bureau of the Census

David A. Pierce, Federal Reserve Board

Thomas J. Plewes, Bureau of Labor Statistics

Wesley L. Schaible, Bureau of Labor Statistics

Fritz J. Scheuren, Internal Revenue Service

Monroe G. Sirken, National Center for Health Statistics

Robert D. Tortora, Bureau of the Census

Alan R. Tupek, National Science Foundation

PREFACE

The Federal Committee on Statistical Methodology was organized by

OMB in 1975 to investigate issues of data quality affecting

Federal statistics. Members of the committee, selected by OMB on

the basis of their individual expertise and interest in

statistical methods, serve in a personal capacity rather than as

agency representatives. The committee conducts its work through

subcommittees that are organized to study particular issues. The

subcommittees are open by invitation to Federal employees who

wish to participate. Working papers are prepared by the

subcommittee members and reflect only their individual and

collective ideas.

The Subcommittee on Small Area Estimation was formed in 1991 to

document the uses of indirect estimators by Federal statistical

agencies to prepare and publish estimates. An indirect estimator

uses values of the variable of interest from a domain and/or time

period other than the domain and time period of the estimate

being produced. Users of indirect estimators should consider the

errors to which these estimates are subject.

Eight programs that publish indirect estimators are described in

this report. These programs sometimes respond to legislative

requirements or, alternatively, to State data needs. The

programs and sponsor agencies are: infant and maternal health

characteristics (National Center for Health Statistics (NCHS));

personal income, annual income, and gross product (Bureau of

Economic Analysis); postcensal population estimates for counties

(Bureau of the Census (BOC)); employment and unemployment for

States (Bureau of Labor Statistics); cotton, rice, and soybean

acreage (National Agricultural Statistics Service (NASS));

livestock inventories, crop production, and acreage (NASS);

disabilities, hospital utilization, physician and dental visits

(NCHS); and median income for 4-person families (BOC).

The Subcommittee on Small Area Estimation was chaired by

Wesley L. Schaible of the Bureau of Labor Statistics, Department

of Labor.

MEMBERS OF THE SUBCOMMITTEE

ON SMALL AREA ESTIMATION

Wesley L. Schaible (Chair)

Bureau of Labor Statistics

Department of Labor

Robert E. Fay

Bureau of the Census

Department of Commerce

Joe Fred Gonzalez

National Center for Health Statistics

Department of Health and Human Services

Linnea Hazen

Bureau of Economic Analysis

Department of Commerce

William C. Iwig

National Agricultural Statistics Service

Department of Agriculture

John F. Long

Bureau of the Census

Department of Commerce

Donald J. Malec

National Center for Health Statistics

Department of Health and Human Services

Alan R. Tupek

National Science Foundation

ACKNOWLEDGMENTS

This report is the result of the collaborative efforts of members of

the Subcommittee on Small Area Estimation of the Federal Committee on

Statistical Methodology. Subcommittee members volunteered a

considerable amount of time over a two-year period to complete

individual chapters in the report. Chapter authors are identified at

the beginning of each chapter. Although the introductory and

concluding chapters were authored by the Subcommittee Chair, they

resulted from discussions which included the entire Subcommittee as

well as other interested parties.

Throughout the preparation of the report, a number of reviewers read

drafts and provided valuable comments. The Subcommittee thanks Maria

Gonzalez, Chair of the Federal Committee on Statistical Methodology,

for her support and contributions throughout the development and

preparation of the report. The Subcommittee also expresses its

appreciation to the members of the Federal Committee on Statistical

Methodology for reviewing the report and providing many useful

suggestions. The Subcommittee extends special thanks to the following

Committee members: Yvonne Bishop, Cynthia Clark, Robert Parker, David

Pierce, Thomas Plewes, Monroe Sirken, Robert Tortora, and, in

particular, Fritz Scheuren. The Subcommittee also thanks Alan

Dorfman, Steve Miller, and especially Robert Casady, all of the Bureau

of Labor Statistics, for helpful discussions and comments. In

addition, the Subcommittee extends its thanks to Gordon Brackstone and

the staff of Statistics Canada, to Wayne Fuller of Iowa State

University, and especially to Graham Kalton of Westat, Inc. for

valuable comments and the time so generously provided to review the

report.

TABLE OF CONTENTS

Chapter 1. Introduction and Summary

Chapter 2. Synthetic Estimation in Followback Surveys at the National Center for Health Statistics

Chapter 3. State, Metropolitan Area, and County Income Estimation

Chapter 4. Postcensal Population Estimates: States, Counties and Places

Chapter 5. Bureau of Labor Statistics' State and Local Area Estimates of Employment and Unemployment

Chapter 6. County Estimation of Crop Acreage Using Satellite Data

Chapter 7. The National Agricultural Statistics Service County Estimates Program

Chapter 8. Model-Based State Estimates from the National Health Interview Survey

Chapter 9. Estimation of Median Income for Four-Person Families by State

Chapter 10. Recommendations and Cautions

CHAPTER 1

Introduction and Summary

1.1 Introduction

Federal statistical agencies produce estimates of a variety of

population quantities for both the nation as a whole and for

subnational domains. Domains are commonly defined by demographic and

socioeconomic variables. However, geographic location is perhaps the

single variable used most frequently to define domains. Regions,

states, counties, and metropolitan areas are common geographic domains

for which estimates are required. Federal agencies use different data

systems and estimation methods to produce domain estimates. Those

systems designed for the purpose of producing published estimates use

standard, direct estimation methods. Data systems are designed within

time, cost and other constraints which restrict the number of

estimates that can be produced by standard methods. However, the

demand for additional information and the lack of resources to design

the required data systems have led federal statistical agencies to

consider non-standard methods. Estimation methods of a particular

type, referred to as small area or indirect estimators, have sometimes

been used in these situations.

The purpose of this report is to document, in a manner that will

facilitate comparisons, the practices and estimation methods of the

federal statistical programs that use indirect estimators. Only

programs that use indirect estimators for the production of published

estimates are included; whether a data system is based on a census

(including administrative records) or a sample survey has no bearing

on the inclusion of a program. The focus of this report is on the

method by which estimates are produced. The methods and practices of

eight programs are documented here; three are located in the

Department of Commerce, two in the Department of Agriculture, two in

the Department of Health and Human Services, and one in the Department

of Labor. Other applications of indirect estimators occur in federal

statistical agencies but descriptions of these applications have not

been included in this report. Most of these methods were not included

because they were not used to produce published estimates. This

publication restriction, a somewhat arbitrary indicator of program

importance, keeps the number of programs included to a manageable

level but leads to the omission of other interesting methods (for

example, Fay and Herriot 1979).

This introductory chapter includes brief discussions of small area

estimation terminology; definitions of direct and indirect estimators;

some characteristics of indirect estimators; and summary descriptions

of the programs included in the report. Each program is documented,

following a standard format, in the individual chapters of the report.

The intent is to create program descriptions that will not only

provide complete, self-contained documentation for each individual

program but also facilitate comparisons among programs. Although the

focus of the report is on estimation methods, the description of each

program includes material on program history, policies, evaluation

practices, estimation methods, and current problems and activities.

In addition to the standard chapter format, attempts have been made to

employ common notation throughout the chapters to facilitate

comparisons of estimation methods. The report concludes with a

number of recommendations and cautions.

1.2 Terminology

Terms used to describe indirect estimators can be confusing.

Increased interest in nontraditional estimators for domain

statistics has occurred recently among survey statisticians and, even

though the term "small area estimator" is commonly used, uniform

terminology has not yet evolved. This term is frequently used because

in most applications of these estimators the domains of interest have

been geographic areas. However, the word "small" is misleading. It

is the small number of sample observations and the resulting large

variance of standard direct estimators that is of concern, rather than

the size of the population in the area or the size of the area itself.

The word "area" is also misleading since these methods may be applied

to any arbitrary domain, not just those defined by geographic

boundaries. Other terms used to describe these estimators include

"local area" (Ericksen 1974), "small domain" (Purcell and Kish 1979),

"subdomain" (Laake 1979), "small subgroups" (Holt, Smith, and Tomberlin

1979), "subprovincial" (Brackstone 1987), "indirect" (Dalenius 1987),

and "model-dependent" (Sarndal 1984). The term "synthetic estimator"

has also been used to describe this class of estimators (NIDA 1979)

and, in addition, to describe a specific indirect estimator (NCHS

1968). Survey practitioners sometimes refer to indirect estimators as

"model-based" whereas this term is rarely, if ever, used to describe

direct estimators. However, direct estimators can be motivated by and

justified under models as readily as indirect estimators.

There is also a lack of agreement on what to call the class of

direct estimators. In addition to "direct" (Royall 1973), authors

have used "unbiased" (Gonzalez 1973), "standard" (Holt, Smith, and

Tomberlin 1979), "valid" (Gonzalez 1979), and "sample-based" (Kalton

1987). In the remainder of this paper, the words "direct" and

"indirect" will be used to describe traditional and small area

estimators, respectively.

1.3 Direct and Indirect Estimators

Perhaps the most common measure of error of an estimator is the mean

square error, composed of the sum of the variance of the estimator and

the squared bias of the estimator. Biases can rarely be estimated

with any degree of confidence. If an estimator is unbiased or

approximately unbiased, the variance of the estimator, which can be

estimated from the available data, is a satisfactory measure of error

of the estimator. This leads to the selection of estimators that are

unbiased or approximately unbiased in most applications. Such

estimators allow data systems to be designed so that estimates with a

predictable level of error can be produced with high probability and,

in addition, estimated measures of error can be provided to accompany

estimates.
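
The decomposition described above can be written compactly. The notation is introduced here only for illustration and is an assumption, not the report's own: \hat{Y} denotes an estimator and Y the population quantity being estimated.

```latex
\mathrm{MSE}(\hat{Y}) = E\left[(\hat{Y} - Y)^2\right]
                      = \mathrm{Var}(\hat{Y}) + \left[\mathrm{Bias}(\hat{Y})\right]^2,
\qquad \mathrm{Bias}(\hat{Y}) = E(\hat{Y}) - Y.
```

When the bias is zero or negligible, the mean square error reduces to the variance, which is why the variance alone can serve as the measure of error for unbiased or approximately unbiased estimators.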

Federal statistical programs are generally designed using direct

estimators which are unbiased, or approximately unbiased, under finite

population sampling theory. Samples are often used and, given

adequate resources, the sample design specifies population and domain

sample sizes large enough to produce direct estimates that meet

reliability requirements for the survey. When a domain sample size is

too small to make a reliable domain estimate using the direct

estimator, a decision must be made whether to produce estimates using

an alternative procedure. The alternative estimators considered are

those that increase the effective sample size and decrease the

variance by using data from other domains and/or time periods through

models that assume similarities across domains and/or time periods.

These estimators are generally biased, but if the mean square error of

the alternative estimator can be demonstrated to be small compared to

the variance of the direct estimator, the selection of the alternative

estimator may be justified. In extreme situations, there may be no

sample units in the domain of interest and, if an estimate is to be

produced, an alternative estimator will be required.

Indirect estimators have been characterized in the Bayesian and

empirical Bayes literature as estimators that "borrow strength" by the

use of values of the variable of interest from domains other than the

domain of interest. This approach can be used to provide a working

definition of direct and indirect estimators for a broad class of

population quantities including means and totals. A direct estimator

uses values of the variable of interest only from the time period of

interest and only from units in the domain of interest. An indirect

estimator uses values of the variable of interest from a domain and/or

time period other than the domain and time period of interest. Three

types of indirect estimators can be identified. A domain indirect

estimator uses values of the variable of interest from another domain

but not from another time period. A time indirect estimator uses

values of the variable of interest from another time period but not

from another domain. An estimator that is both domain and time

indirect uses values of the variable of interest from another domain

and another time period.

Indirect estimators depend on values of the variable of interest

from domains and/or time periods other than that of interest. These

values are brought into the estimation process through a model that,

except in the most trivial case, depends on one or more auxiliary

variables that are known for the domain and time period of interest.

To the extent that applicable models can be identified and the

required auxiliary variables are available, indirect estimators can be

created to produce estimates. Perhaps the simplest example of an

indirect estimator is the use of the sample mean of the entire sample

as the estimator for a specific domain: for example, the use of the

mean from a national sample as an estimate for a particular state. To

the extent that information related to the variable of interest is

available for the state, an indirect estimator which is "better" than

the national mean can be defined. The availability of auxiliary

variables and an appropriate model relating the auxiliary variables to

the variable of interest are crucial to the formation of indirect

estimators. However, the definition of direct and indirect estimators

does not depend on whether or not auxiliary variables from outside the

domain or time period of interest are used.
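
The distinction drawn above can be illustrated with a minimal sketch. It contrasts a direct estimate of a state mean with the simplest domain indirect estimate mentioned in the text, the mean of the entire sample. The sample values, the state labels, and the function names are hypothetical, and sampling weights are ignored for simplicity.

```python
from statistics import mean

# Hypothetical sample: (state, value of the variable of interest).
sample = [
    ("A", 10.0), ("A", 14.0),
    ("B", 20.0), ("B", 22.0), ("B", 24.0),
    ("C", 30.0),
]


def direct_mean(sample, domain):
    """Direct estimator: uses values only from units in the domain of interest."""
    values = [y for d, y in sample if d == domain]
    if not values:
        # With no sample units in the domain, no direct estimate exists.
        raise ValueError(f"no sample units in domain {domain!r}")
    return mean(values)


def national_mean(sample):
    """Simplest domain indirect estimator: the mean of the entire sample,
    used as the estimate for any single domain."""
    return mean(y for _, y in sample)


print(direct_mean(sample, "A"))   # 12.0, from the two units in state A only
print(national_mean(sample))      # 20.0, borrows values from states B and C
```

In this sketch the indirect estimate for state A rests on six observations rather than two, which lowers its variance, but it is pulled toward the values observed in the other states, which is the source of its potential bias.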

The clear distinction between direct and indirect estimators made in

the discussion above reflects the situation during the design stage of

a data system. However, when estimators reflect the realities

associated with data system implementation, the distinction becomes a

little less clear. For example, nonresponse is a common problem in

data collection efforts. To the extent that nonresponse occurs, even

direct estimators must rely on model-based assumptions relating the

known information for responders to the unknown information for

nonresponders.

1.4 Organization of Program Chapters

As discussed in the previous section, indirect estimators borrow

strength and can be classified into three types: domain indirect, time

indirect, and domain/time indirect. In addition to this

classification, indirect estimators are commonly expressed in

different forms, that is, different algebraic expressions. Each of the

eight programs described in this report uses one of the following

three common indirect estimators: synthetic, regression, or composite.

The order of chapters describing programs follows this classification

of estimators. That is, the program that uses a synthetic estimator

is presented first in Chapter 2, followed by the programs that use

regression estimators in Chapters 3 through 6; those programs that use

composite estimators are presented in Chapters 7 through 9. Some of

the programs have used different estimators at different times;

however, emphasis is placed on the estimator that was last used to

publish estimates.

As with all indirect estimators, synthetic estimators may be

domain indirect, time indirect, or domain and time indirect. For

example, a domain indirect synthetic estimator for a population total

in domain d and time t may be written as

[Equation: domain indirect synthetic estimator; graphic not available.]
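
The report's own expression is not reproduced in this text. As a rough illustration only, one widely used textbook form of a domain indirect synthetic estimator of a total multiplies known domain population counts for a set of groups by group means estimated from the whole sample, and sums over groups. The sketch below assumes that form; the group labels, counts, means, and the function name are hypothetical and should not be read as the notation or method of any program in this report.

```python
def synthetic_total(domain_counts, group_means):
    """Synthetic estimate of a domain total.

    domain_counts: known population counts in the domain, by group
                   (auxiliary information for the domain of interest).
    group_means:   group means estimated from the entire sample,
                   that is, from all domains (the borrowed information).
    """
    return sum(count * group_means[group]
               for group, count in domain_counts.items())


# Hypothetical national group means of the variable of interest.
national_group_means = {"under_65": 2.0, "65_plus": 5.0}

# Hypothetical known population counts for one state.
state_counts = {"under_65": 1000, "65_plus": 200}

print(synthetic_total(state_counts, national_group_means))  # 3000.0
```

The estimate is unbiased only to the extent that the group means in the state equal the national group means, which is the kind of similarity assumption the text describes.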

Regression estimators may be direct or, like the synthetic estimator,

domain indirect, time indirect, or domain and time indirect depending

on how the parameters are estimated. For example, a domain indirect

regression estimator for a population total may be written as

[Equation: domain indirect regression estimator; graphic not available.]

It should be noted that not all indirect estimators are linear. For

examples of nonlinear indirect estimators see MacGibbon and Tomberlin

(1989) and Malec, Sedransk, and Tompkins (1993). This latter,

nonlinear indirect estimator is being considered for use in

conjunction with the National Health Interview Survey and is discussed

in Chapter 8.

1.5 Characteristics of Indirect Estimators

There are several fairly well-known characteristics of indirect

estimators that are important for producers and users to keep in mind.

o In general, indirect estimators have relatively small variances

since they not only incorporate observations from the domain and time

period of interest, but also from other domains and/or time periods.

The variance of a modified synthetic estimator is discussed by Holt,

Smith, and Tomberlin (1979) and variances of several indirect

estimators resulting from different prediction models are discussed by

Royall (1979). Care must be taken since the variance alone may be a

misleading measure of error. See, for example, Raback and Sarndal

(1982) and Sarndal and Hidiroglou (1989).

o An indirect estimator will be biased if the model assumptions

leading to the estimator are not satisfied. Even so, an indirect

estimator may be a useful alternative to a direct estimator when the

mean squared error of the indirect estimator is sufficiently small

compared to the variance of the direct estimator. However, the

magnitude of the bias is likely to vary with each application and

estimation of biases is difficult. Gonzalez and Waksberg (1973)

consider the problem of estimating the mean squared error of synthetic

estimators, and Prasad and Rao (1990) discuss the estimation of the

mean squared error of indirect estimators. Care must be taken when

interpreting estimated mean squared errors of indirect estimators.

Some approaches provide an average measure over all domains rather

than a measure associated with a specific domain. Confidence

intervals for biased estimators is a related issue that has been

addressed by Miller (1992).

o For a given application and estimator, biases in different

domains will differ since the model will likely be a better

representation of reality in some domains than in others. In general,

when an indirect estimator is used to produce estimates for a number

of domains, the distribution of estimates will have a smaller variance

than the corresponding distribution of domain population values. This

is a result of the tendency for indirect estimators to have relatively

small biases when domain population values are close to the average

total population value and, when domain population values are not

close to the overall population value, to have relatively large

directional biases which make the estimates closer to the overall

population value. There is considerable evidence illustrating this

characteristic (Gonzalez and Hoza 1978; Schaible et al. 1977 and 1979;

and Heeringa 1981). Not all indirect estimators display this

characteristic to the same extent. Spjotvoll and Thomsen (1987),

Lahiri (1990), and Ghosh (1992) have addressed this problem and

suggest constrained approaches.

o From a model-based, prediction point of view, direct and

indirect estimators are model unbiased under the model that generates

the estimator. A direct estimator is robust against model failure in

the sense that it is unbiased, not only under the domain/time specific

model which generates the estimator, but under each of the models

associated with the corresponding indirect estimators. Indirect

estimators are not robust in the same sense. However, the domain

indirect estimator and the time indirect estimator are both more

robust against model failure, in a similar sense, than the estimator

that is both domain and time indirect. The bias of indirect

estimators, under the domain and time specific model, is a source of

concern that results in a reluctance to fully accept indirect

estimators in many applications. An example and additional discussion

of this aspect of indirect estimator bias is given in Schaible (1993).
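
The compression of estimates toward the overall population value, described in the third characteristic above, can be illustrated with a small numerical sketch. It assumes a composite estimator written as a weighted average of a direct estimate and a single overall value, with a fixed weight of 0.5. The function name, the weight, and the domain values are hypothetical and are not taken from any program in this report.

```python
from statistics import pvariance


def composite(direct, overall, weight):
    """Weighted average of a direct estimate and an overall (indirect) value."""
    return weight * direct + (1 - weight) * overall


# Hypothetical domain values; for illustration the direct estimates are
# taken to equal the true domain population values.
domain_values = [10.0, 20.0, 30.0]
overall_value = 20.0

estimates = [composite(y, overall_value, 0.5) for y in domain_values]
print(estimates)  # [15.0, 20.0, 25.0]

# The spread of the estimates is smaller than the spread of the domain values.
print(pvariance(estimates) < pvariance(domain_values))  # True
```

The middle domain, whose value equals the overall value, is estimated without error, while the two outlying domains are each pulled toward the overall value. Their biases have opposite signs and both point toward the center.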

1.6 Program Summaries

The programs described in this report were initiated in response to a

variety of needs and directives. Several are a direct result of

legislative requirements to allocate federal funds. Others were

created in response to state needs for data and/or to standardize

estimation methods across states. Others are viewed as research

programs that periodically publish estimates when an improved

methodology has been developed. Table 1 below allows a comparison of

summary information on the programs described in Chapters 2 through 9

in this report. The eight programs that use indirect estimators to

publish estimates are located in five large statistical agencies. In

some instances, a program produces estimates for a single variable; in

other instances, estimates are produced for numerous variables.

States and counties are the only domains for which indirect estimates

are presently published. Four of the programs publish estimates for

states, three for counties, and one for both states and counties.

There is considerable variability in the frequency with which

estimates are published. Two programs publish estimates only

periodically, every few years. The remainder publish indirect

estimates on a fixed schedule: four annually, one annually with

selected estimates on a quarterly schedule, and one monthly. As noted

above, a variety of indirect estimators are used to produce estimates.

Synthetic, regression, and composite estimators that borrow strength

over domains, over time, or over both domain and time are found among

these programs. The estimation procedures for six of the programs are

based on data from sample surveys. There is no sampling involved in

the procedures used in the two programs that produce estimates of

personal income and postcensal populations.

Given the differing demands on Federal statistical agencies, it is not

surprising that considerable variation is seen in the programs

described in this report. Further investigations and improvements in

the quality of indirect estimates published by Federal agencies are

needed. It is hoped that recognition of the differences, as well as

the similarities, in these programs will help provide a foundation for

this further effort.

[Table 1. Summary information on the programs described in Chapters 2 through 9; graphic not available.]

REFERENCES

Brackstone, G.J. (1987), "Small Area Data: Policy Issues and

Technical Challenges," in Small Area Statistics, New York: John Wiley

and Sons.

Dalenius, T. (1987), "Panel Discussion" in Small Area Statistics, New

York: John Wiley and Sons.

Ericksen, E.P. (1974), "A Regression Method for Estimating Population

Changes of Local Areas," Journal of the American Statistical Association,

69, 867-875.

Fay, R.E. and Herriot, R.A. (1979), "Estimates of Income for Small

Places: An Application of James-Stein Procedures to Census Data,"

Journal of the American Statistical Association, 74, 269-277.

Ghosh, M. (1992), "Constrained Bayes Estimation With Applications,"

Journal of the American Statistical Association, 87, 533-540.

Gonzalez, M.E. (1973), "Use and Evaluation of Synthetic Estimates,"

Proceedings of the Social Statistics Section, American Statistical

Association, 33-36.

Gonzalez, M.E. (1979), "Case Studies on the Use and Accuracy of

Synthetic Estimates: Unemployment and Housing Applications" in

Synthetic Estimates for Small Areas (National Institute on Drug Abuse,

Research Monograph 24), Washington, D.C.: U.S. Government Printing

Office.

Gonzalez, M.E. and Hoza, C. (1978), "Small-Area Estimation with

Application to Unemployment and Housing Estimates," Journal of the

American Statistical Association, 73, 7-15.

Gonzalez, M.E. and Waksberg, J. (1973), "Estimation of the Error of

Synthetic Estimates," paper presented at the first meeting of the

International Association of Survey Statisticians, Vienna, Austria,

18-25 August, 1973.

Heeringa, S.G. (1981), "Small Area Estimation Prospects for the Survey

of Income and Program Participation," Proceedings of the Section on

Survey Research Methods, American Statistical Association, 133-138.

Holt, D., Smith, T.M.F., and Tomberlin, T.J. (1979), "A Model-Based

Approach to Estimation for Small Subgroups of a Population," Journal

of the American Statistical Association, 74, 405-410.

Kalton, G. (1987), "Panel Discussion" in Small Area Statistics, New

York: John Wiley and Sons.

Laake, P. (1979), "A Prediction Approach to Subdomain Estimation in

Finite Populations," Journal of the American Statistical Association,

74, 355-358.

Lahiri, P. (1990), "Adjusted Bayes and Empirical Bayes Estimation in

Finite Population Sampling," Sankhya B, 52, 50-66.

MacGibbon, B. and Tomberlin, T.J. (1989), "Small Area Estimation of

Proportions Via Empirical Bayes Techniques," Survey Methodology, 15-2,

237-252.

Malec, D., Sedransk, J., and Tompkins, L. (1993), "Bayesian Predictive

Inference for Small Areas for Binary Variables in the National Health

Interview Survey." In Case Studies in Bayesian Statistics, eds.,

Gatsonis, Hodges, Kass and Singpurwalla. New York: Springer Verlag.

Miller, S.M. (1992), "Confidence Interval Coverage for Biased Normal

Estimators," Proceedings of the Section on Survey Research Methods,

American Statistical Association.

National Center for Health Statistics (1968), Synthetic State

Estimates of Disability (PHS Publication No. 1759), Washington, D.C.:

U.S. Government Printing Office.

National Institute on Drug Abuse (1979), Synthetic Estimates for Small

Areas (NIDA Research Monograph 24), Washington, D.C.: U.S. Government

Printing Office.

Prasad, N.G.N. and Rao, J.N.K. (1990), "The Estimation of the Mean

Squared Error of Small Area Estimators," Journal of the American

Statistical Association, 85, 163-171.

Purcell, N.J. and Kish, L. (1979), "Estimation for Small Domains,"

Biometrics, 35, 365-384.

Raback, G. and Sarndal, C.E. (1982), "Variance Reduction and

Unbiasedness for Small Area Estimators," Proceedings of the Social

Statistics Section, American Statistical Association, 541-544.

Royall, R.A. (1973), "Discussion of papers by Gonzalez and Ericksen,"

Proceedings of the Social Statistics Section, American Statistical

Association, 42-43.

Royall, R.A. (1979), "Prediction Models in Small Area Estimation," in

Synthetic Estimates for Small Areas (National Institute on Drug Abuse,

Research Monograph 24), Washington, D.C.: U.S. Government Printing

Office.

Sarndal, C.E. (1984), "Design-Consistent versus Model-Dependent

Estimation for Small Domains," Journal of the American Statistical

Association, 79, 624-631.

Sarndal, C.E. and Hidiroglou, M.A. (1989), "Small Domain Estimation: A

Conditional Analysis," Journal of the American Statistical

Association, 84, 266-275.

Schaible, W.L. (1993), "Use of Small Area Estimators in U.S. Federal

Programs," in Small Area Statistics and Survey Designs, Vol. 1,

Central Statistical Office, Warsaw, Poland.

Schaible, W.L., Brock, D.B., and Schnack, G.A. (1977), "An Empirical

Comparison of the Simple Inflation, Synthetic and Composite Estimators

for Small Area Statistics," Proceedings of the Social Statistics

Section, American Statistical Association, 1017-1021.

Schaible, W.L., Brock, D.B., Casady, R.J., and Schnack, G.A. (1979),

Small Area Estimation: An Empirical Comparison of Conventional and

Synthetic Estimators for States, (PHS Publication No. 80-1356),

Washington, D.C.: U.S. Government Printing Office.

Spjotvoll, E. and Thomsen, I. (1987), "Application of Some Empirical

Bayes Methods to Small Area Statistics," Proceedings of the

International Statistical Institute, Vol. 2, 435-449.

CHAPTER 2

Synthetic Estimation in Followback Surveys

at the National Center for Health Statistics

Joe Fred Gonzalez, Jr., Paul J. Placek, and Chester Scott

National Center for Health Statistics

2.1 Introduction and Program History

The National Center for Health Statistics (NCHS) through its vital

registration system collects and publishes data on vital events

(births and deaths) for the United States (NCHS 1989). NCHS produces

national, State, county, and smaller area vital statistics for

sociodemographic and health characteristics which are available from

birth and death certificates. The Division of Vital Statistics of

NCHS produces annual summary tables for the United States showing

trends in period and cohort fertility measures and characteristics of

live births. Also, NCHS produces detailed tabulations by place of

residence and occurrence for each State, county, and city with a

population of 10,000 or more, by race and place of delivery, and by

place of residence for population-size groups in metropolitan and

nonmetropolitan counties within each State, by race, attendant and

place of delivery, and birth weight. These statistics are based on a

complete count of vital records.

In addition to the limited vital statistics tabulations which are

produced annually, there has been a continuing need for more

detailed national and State level estimates of health status, health

services, and health care utilization related to vital events.

Because vital records (birth and death certificates) serve both legal

and statistical purposes, they provide limited social, demographic,

health, and medical information. Each vital record is a one-page

document with extremely limited information. The data from these

vital records can be augmented, however, through periodic "followback"

surveys. These surveys are referred to as "followback" because they

obtain additional information from sources named on the vital record.

A followback survey is a cost-effective means of obtaining

supplementary information for a sample of vital events. From the

sample it is possible to make national estimates of vital events

according to characteristics not otherwise available. Examples of

supplementary information which may be needed by health researchers,

health program planners, and health policy makers are: mother's

smoking habits before and during pregnancy; complications of

pregnancy; drug or surgical procedure to induce or maintain labor;

amniocentesis during pregnancy; electronic fetal monitoring;

respiratory distress syndrome; infant jaundice; medical x-ray use;

birth injuries; and congenital anomalies. Periodic followback

surveys respond to the changing data needs of the public health

community without requiring changes in the vital record forms.

The specific NCHS followback surveys that will be discussed in this

chapter are the 1980 National Natality Survey (NNS) and the 1980

National Fetal Mortality Survey (NFMS) (NCHS 1986). In order to

produce State estimates for certain health characteristics not

available on the vital records, synthetic estimation (NCHS 1984a,

1984b) was applied to national data from the 1980 NNS and 1980 NFMS.

Synthetic estimation has the usual appeal over direct estimation,

especially when sample sizes are small. In addition, synthetic State

estimates were compared with direct State estimates and, for a limited

number of variables, with the "true" values from State vital

statistics (fetal death records and birth and death certificates).

2.2 Program Description, Policies and Practices

The 1980 NNS is based on a probability sample of 9,941 from a universe

of 3,612,258 live births that occurred in the United States during

1980. The NNS sample included a four-fold oversampling of low birth

weight infants. The live birth certificate represents the basic

source of information. Based on information from the sample birth

certificates, eight-page Mother's questionnaires were mailed to

mothers who were married. These mothers were asked to provide

information on prenatal health practices, prenatal care, previous

pregnancies, and social and demographic characteristics of themselves

and their husbands. Each mother was also asked to sign a consent

statement authorizing NCHS to obtain supplemental information from her

medical records. If the mother did not respond after two

questionnaires were sent by mail, a telephone interviewer attempted to

complete an abbreviated questionnaire and to obtain a consent

statement. To ensure their privacy, unmarried mothers were not

contacted. As a result of sending the Mother's questionnaire only to

married mothers, the 1980 NNS population of inference for data

collected through the Mother's questionnaire was 2,944,580 live

births.

Regardless of the mother's marital status, questionnaires were mailed

to the hospitals and to the attendants at delivery (for example,

physicians or nurse-midwives) named on the birth certificates. A

questionnaire was sent to the hospital for each sample birth that

occurred either in a hospital or en route to a hospital. If the

mother signed a consent statement authorizing NCHS to obtain

supplemental medical information, a copy was included with the

questionnaire. The focus of the hospital questionnaire was on

characteristics of labor and delivery, health characteristics of the

mother and infant, information on prenatal care visits, and

information on radiation examinations and treatments received by the

mother during the 12 months before delivery of the sample birth. For

the hospital component of the 1980 NNS, the population of inference

was 3,580,700 live births.

The 1980 NNS is composed of information from birth certificates and

information from questionnaires sent to married mothers, hospitals,

attendants at delivery, and providers of radiation examinations and

treatments. The survey represents an extensive source of information

concerning specific maternal and child health conditions and obstetric

practices for live births in the United States. The 1980 NNS response

rates were 79.5 percent for mothers, 76.1 percent for hospitals, and

61.6 percent for physicians.

The 1980 NFMS is based on a probability sample of 6,386 fetal deaths

(out of a universe of 19,202 fetal deaths) with gestation of 28 weeks

or more, or delivery weight of 1,000 grams or more, that occurred in

the United States during 1980. The report of fetal death represents

the basic source of information in this survey. Married mothers,

hospitals, attendants at delivery, and providers of radiation

examinations and treatments were surveyed under the same conditions as

those described for the 1980 NNS. The 1980 NFMS populations of

inference for all fetal deaths, fetal deaths in hospitals, and fetal

deaths to married mothers were 19,202, 18,930, and 14,790,

respectively. The same questionnaires were used for both surveys.

Although some questions pertained only to live births and others

pertained only to fetal deaths, instructions to skip inappropriate

questions were included in the questionnaires. The sampling design

for the NFMS was developed so that the NFMS would be large enough to

permit comparisons between live births in the NNS and fetal deaths in

the NFMS. The 1980 NFMS response rates were 74.5 percent for mothers,

74.0 percent for hospitals, and 55.0 percent for physicians.

Table 1 presents the 1980 NNS and NFMS distribution of sample cases of

live births and fetal deaths by State of occurrence. As shown in

Table 1, it may be possible to produce direct State level estimates of

certain health characteristics for some of the larger States.

However, the sample sizes for most States are generally too small to

produce reliable direct State estimates. This was the main

justification for exploring synthetic State estimation as an

alternative for producing State level estimates.

2.3 Estimator Documentation

The underlying rationale for synthetic estimation is that the

distribution of a health characteristic is highly related to the

demographic composition of the population (NCHS 1984a). It is assumed

that differences in the prevalence of the characteristics between two

areas are due primarily to differences in demographic composition

(e.g. age, race, sex, etc.). That is, it is assumed that a particular

measure would be the same in two States that had the same population

composition with respect to certain demographic variables. This

rationale was used to select the demographic variables that were

deemed to be the most appropriate and relevant to the 1980 NNS and

NFMS in order to produce Synthetic State estimates.

The following is the basic estimator that was used to produce

Synthetic State estimates of proportions for certain health variables

from the 1980 National Natality Survey (NNS) and the 1980 National

Fetal Mortality Survey (NFMS).

[Graphic: synthetic State estimator formula.]

Table 2 gives an illustration of the computation of the synthetic

State estimate of the percent jaundiced infants in Pennsylvania in

1980. The stub of Table 2 shows the 25 demographic cells (race, age

of mother, and live-birth order groups) that were used to produce the

Synthetic State estimates. Column (1) shows the national (based on

the 1980 NNS) estimates of percent of live births that were jaundiced

in each of the respective 25 demographic cells. Column (2) shows the

number of hospital births (derived from State Vital Registration

System) within the 25 demographic cells in Pennsylvania. Column (3),

the estimated number of jaundiced live births in Pennsylvania, is

computed by taking the product of entries in columns (1) and (2)

within each of the 25 respective cells. Finally, the Synthetic State

estimate is found by taking the ratio of the sum of column (3) to the

sum of column (2).
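
The computation described for Table 2 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the three-cell inputs, and all numeric values are hypothetical and are not taken from Table 2, which uses 25 demographic cells. It assumes the national cell proportions and the State cell counts are supplied in the same cell order.

```python
def synthetic_state_estimate(national_cell_proportions, state_cell_counts):
    """Return a synthetic State proportion.

    Mirrors the Table 2 description: column (1) times column (2) gives
    column (3); the estimate is sum of column (3) over sum of column (2).
    """
    if len(national_cell_proportions) != len(state_cell_counts):
        raise ValueError("cell lists must have the same length")
    state_total = sum(state_cell_counts)
    if state_total <= 0:
        raise ValueError("State total must be positive")
    # Column (3): estimated number of events with the characteristic, by cell.
    estimated_counts = [
        p * n for p, n in zip(national_cell_proportions, state_cell_counts)
    ]
    return sum(estimated_counts) / state_total


# Hypothetical three-cell illustration (not survey data).
national_p = [0.20, 0.25, 0.10]   # column (1): national cell proportions
state_n = [1000, 3000, 1000]      # column (2): State births by cell
estimate = synthetic_state_estimate(national_p, state_n)
print(f"{estimate:.3f}")
```

Because the State enters only through its cell counts, two States with the same demographic composition receive the same synthetic estimate, which is the assumption stated at the start of this section.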

Since there were three different populations of inference (all vital

events, vital events to married mothers, and vital events in

hospitals) for each of the 1980 NNS and NFMS, appropriate State

aggregates of vital events were incorporated into the calculation of

corresponding synthetic State estimates (NCHS 1984a, 1984b).

2.4 Evaluation Practices

The following is a description of some of the tabulations that were

produced. Table 3 gives Synthetic State estimates of 11 health

characteristics of mothers and infants for five selected States. A

complete listing of all 57 NNS/NFMS health variables for which

Synthetic State estimates were produced can be found in Tables 2-8 of

NCHS (1984a, 1984b).

The synthetic State estimates are subject to sampling error because

they are based on corresponding national estimates derived from the

1980 NNS and NFMS by race, maternal age, and live-birth order group.

Because they are based on the standard errors of the national

estimates, the standard errors of the synthetic State estimates are

relatively small. The standard errors for the NNS and NFMS were

estimated by a balanced repeated replication procedure using 20

replicate half samples. This procedure estimates the standard errors

for survey estimates through the observation of the variability of

estimates based on replicate half samples of the total sample. This

variance estimation procedure was developed and described by McCarthy

(NCHS 1966, 1969).
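
As a rough illustration of the half-sample idea, the sketch below computes a standard error from a set of replicate estimates. It uses one common form of the balanced half-sample variance estimator (the mean squared deviation of the replicate estimates from the full-sample estimate); the chapter does not state which variant was applied, so this choice is an assumption. The function name and all numbers are hypothetical, and the construction of the balanced half samples themselves is not shown.

```python
import math


def half_sample_standard_error(full_sample_estimate, replicate_estimates):
    """Standard error from replicate half-sample estimates.

    Assumed form: variance = mean of squared deviations of each
    replicate estimate from the full-sample estimate.
    """
    r = len(replicate_estimates)
    if r == 0:
        raise ValueError("at least one replicate estimate is required")
    variance = sum(
        (t - full_sample_estimate) ** 2 for t in replicate_estimates
    ) / r
    return math.sqrt(variance)


# Hypothetical example with 20 replicate half-sample estimates.
replicates = [0.21, 0.19] * 10
se = half_sample_standard_error(0.20, replicates)
```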

Although the synthetic State estimate has a relatively small standard

error, it is subject to bias. The bias is a measure of the extent to

which the national maternal age, race, live-birth order specific

estimates differ from the true values for a given State. The closer

the demographic variables used in the synthetic estimate come to

accounting for all the interstate variation in a particular health

characteristic, the smaller the bias will be. Unfortunately, the bias

cannot be computed without knowing the true values. However, through

the U.S. Vital Registration System, true State values for vital events

(collected through birth and death certificates) are known for a

limited number of available sociodemographic and health

characteristics. Therefore, we can compare certain synthetic

estimates with their corresponding true values. This yields a degree

of confidence for the synthetic estimates of similar characteristics

which cannot be checked against the true values from State vital

statistics. Thus, the evaluation of this study only provides an

indicator of the quality of the synthetic State estimates.

The last two columns of Table 4 show the mean square error (MSE) of

the NNS synthetic estimates as compared with the MSE of the NNS

direct estimates. The MSE of an estimate x is the variance of x plus

the square of the bias of x, i.e.

MSE(x) = Var(x) + [Bias(x)]^2
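
The trade-off behind this comparison can be shown with a short sketch. The values below are hypothetical and are not taken from Table 4; they only illustrate how a biased estimate with a small variance (as a synthetic estimate may be) can have a smaller MSE than an unbiased estimate with a large variance (as a direct estimate from a small State sample may be).

```python
def mean_square_error(variance, bias):
    """MSE of an estimate: its variance plus the square of its bias."""
    if variance < 0:
        raise ValueError("variance cannot be negative")
    return variance + bias ** 2


# Hypothetical values, in squared percentage points.
direct_mse = mean_square_error(variance=9.0, bias=0.0)      # unbiased, noisy
synthetic_mse = mean_square_error(variance=0.25, bias=2.0)  # biased, stable
```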

2.5 Current Problems And Activities

Work is currently underway at NCHS to produce synthetic State

estimates from the 1988 National Maternal and Infant Health Survey

(NMIHS), which is very similar to its predecessors, the 1980 NNS and

NFMS. In the NMIHS, 9,953 out of a universe of 3,898,922 live-birth

certificates are linked with mothers' responses on 35-page

questionnaires about the mothers' prenatal health behavior, maternal

health, the birth experience, and infant health. The 1988 NMIHS live

birth estimates will be used to produce synthetic State estimates by

infant's race, birth weight, and maternal age and marital status.

REFERENCES

National Center for Health Statistics: Vital Statistics of the United

States, 1987 Vol. 1, Natality, DHHS Pub. No. (PHS) 89-1100. Public

Health Service, Washington. U.S. Government Printing Office, 1989.

National Center for Health Statistics, K.G. Keppel, R.L. Heuser,

P.J. Placek, et al.: Methods and Response Characteristics, 1980

National Natality and Fetal Mortality Surveys. Vital and Health

Statistics, Series 2, No. 100. DHHS Pub No. (PHS) 86-1374. Public

Health Service, Washington. U.S. Government Printing Office,

Sept. 1986.

National Center for Health Statistics: State Uses of Followback Survey

Data, R.L. Heuser, K.G. Keppel, C.A. Witt, and P.J. Placek, Presented

at the Annual Meeting of the Association for Vital Records and Health

Statistics, July 9-12, 1984, Niagara Falls, NY.

National Center for Health Statistics: R.L. Heuser, K.G. Keppel,

C.A. Witt, and P.J. Placek, Synthetic Estimation Applications from the

1980 National Natality Survey (NNS) and the 1980 National Fetal

Mortality Survey (NFMS), Presented at the NCHS Data Use Conference on

Small Area Statistics, August 29-31, 1984, Snowbird, Utah.

National Center for Health Statistics, P.J. McCarthy: Replication: An

Approach to the Analysis of Data From Complex Surveys. Vital and

Health Statistics, Series 2, No. 14, PHS Pub No. 1000. Public Health

Service. Washington, U.S. Government Printing Office, April 1966.

National Center for Health Statistics, P.J. McCarthy:

Pseudoreplication: Further Evaluation and Application of the Balanced

Half-Sample Technique. Vital and Health Statistics. Series 2,

No. 31. DHEW Pub No. (HSM) 73-120. Health Services and Mental Health

Administration. Washington. U.S. Government Printing Office,

Jan. 1969.

* Chapter 8 (authored by Donald Malec) of this report contains several

references on small area estimation as applied to the National Health

Interview Survey of the National Center for Health Statistics.

CHAPTER 3

State, Metropolitan Area, and County

Income Estimation

Wallace Bailey, Linnea Hazen, and Daniel Zabronsky

Bureau of Economic Analysis

3.1 Introduction and Program History

3.1.1 Program Description

The Bureau of Economic Analysis (BEA) maintains a program of State and

local area (county and metropolitan area) economic measurement that

centers on the personal income measure. This program originated in

1939 when estimates of income payments to individuals by State were

first published. At the national level, personal income is the

principal income measure in the personal income and outlay account,

one of the five accounts that compose the national income and product

accounts. The State and local area personal income estimates are

derived by disaggregating the detailed components of the national

personal income estimates to States and counties. Estimates for all

other geographic areas are made by aggregating either the State or

county estimates in the appropriate combinations. This building block

approach permits estimates for areas whose boundaries change over

time, such as metropolitan areas, to be presented on a consistent

geographic definition for all years.
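
The disaggregation and building block steps can be sketched as follows. The sketch assumes, purely for illustration, that a national component is distributed to counties in proportion to a related indicator series; the text above does not specify the allocation rule, and the function names, county names, and values are all hypothetical.

```python
def allocate_to_areas(control_total, indicator_by_area):
    """Distribute a control total to areas in proportion to an indicator.

    Proportional allocation is an assumption made for this sketch.
    """
    indicator_total = sum(indicator_by_area.values())
    if indicator_total <= 0:
        raise ValueError("indicator total must be positive")
    return {
        area: control_total * value / indicator_total
        for area, value in indicator_by_area.items()
    }


def aggregate_areas(estimate_by_area, member_areas):
    """Building block step: sum the estimates of the member areas."""
    return sum(estimate_by_area[area] for area in member_areas)


# Hypothetical county indicator values and national control total.
county_indicator = {"County A": 50.0, "County B": 30.0, "County C": 20.0}
county_income = allocate_to_areas(1000.0, county_indicator)

# A metropolitan area defined as Counties A and B under current boundaries.
metro_income = aggregate_areas(county_income, ["County A", "County B"])
```

Because the metropolitan estimate is a sum of county estimates, applying the current list of member counties to every year's county estimates gives a series on one consistent geographic definition.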

3.1.2 Uses of the State and Local Area Income Estimates

BEA's State and local area income estimates are widely used in the

public and private sector to measure and track levels and types of

incomes received by persons living or working in an area. They

provide a framework for the analysis of each area's economy and serve

as a basis for decision making in both the public and private sectors.

Personal income is among the measures used in evaluating the

socioeconomic impact of public- and private-sector initiatives; for

example, it is widely used in preparing the environmental impact

statements required by the National Environmental Policy Act of 1969.

One of the first uses made of State personal income estimates (or a

derivative) was as a variable in formulas for allocating Federal funds

to States. The most often used derivative is per capita personal

income, which is computed using the Census Bureau's estimates of total

population; these population estimates are described by Long in

Chapter 4 of this report. At present, BEA's State personal income

estimates are used by the Federal Government to allocate over $92

billion annually for various Federal domestic programs, including the

medical assistance (Medicaid) program, and the aid to families with

dependent children program. Table 3.1 highlights the major Federal

Government programs which use BEA personal income estimates in

allocation formulas for Federal domestic assistance funds.

Federal agencies also use the components of personal income in

econometric models, such as those used to project energy and water

use. The U.S. Forest Service is using these estimates to identify

resource dependent rural areas and to allocate funds for their

economic diversification as required by the National Forest-Dependent

Rural Communities Economic Diversification Act of 1990.

The U.S. Census Bureau uses the BEA estimates of State per capita

personal income as the key predictor variable in its estimates of mean

annual income for 4-person families by State. These estimates are

described by Fay, Nelson, and Litow in Chapter 9 of this report.

During the past decade, State governments have substantially increased

their use of the State personal income estimates. The estimates are

used in the measurement of economic bases and in models developed for

planning for such things as public utilities and services. They are

also used to project tax revenues. In recent years, legislation that

limits a State's expenditures or tax authority by the level of, or

changes in, State personal income or one of its components has been

enacted in 16 States. These 16 States account for nearly one-half of

the U.S. population. Some of these States use BEA's annual State

personal income estimates; the others use fiscal year estimates

derived from BEA's quarterly State personal income estimates (ACIR,

1990).

State governments also use the local area estimates to measure the

economic base of State planning areas. University schools of business

and economics, often working under contract for State and local

governments, use the BEA local area estimates for theoretical and

applied economic research.

Businesses use the estimates to evaluate markets for new or

established products and to determine areas for the location,

expansion, and contraction of their activities. Trade associations

and labor organizations use them for product and labor market

analyses.

3.1.3 A History of BEA's Regional Income Estimates

In the mid-1930's, BEA's predecessor began work on the estimation of

regional income as part of the effort to explain the processes and

structure of the Nation's economy. As a result of its work, it

produced a report that showed State estimates of total "income

payments to individuals" in May 1939 (Nathan and Martin, 1939). These

income payments were defined as the sum of

(1) wages and salaries, (2) other labor income and relief, (3)

entrepreneurial withdrawals, and (4) dividends, interest, and net

rents and royalties.

In 1942, the State estimates of wages and salaries and entrepreneurial

income were expanded to include a further breakdown by broad industry

group--agriculture, other commodity-producing industries,

distribution, services, and government. The industry breakdown was

for 1939, when the availability of census information on payrolls and

the employed labor force by industry and by State made possible more

reliable estimates than for prior years (Creamer and Merwin, 1942).

The estimates for most nonagricultural industries and for the military

services were based on reports in which establishments, not employees,

were classified by State and in which the State of residence of the

employees was not indicated; therefore, the estimates for these

industries were on a "place-of-work" (where-earned) basis. No

systematic adjustment was made in the total income payments series to

convert the estimates to a "place-of-residence" (where-received)

basis. However, using the limited information that was available,

residence adjustments were made for a few States for the per capita

series.

During the late 1940's and early 1950's, BEA continued to work on

improving these estimates by seeking additional source data and by

improving the estimating methods that were used. The industrial

detail of the wage and salary estimates was expanded to include each

Standard Industrial Classification (SIC) division and additional

detail for some SIC divisions. As one result of the major reworking

and expansion of the national income and product accounts, BEA

developed State personal income--a measure of income that is more

comprehensive than State income payments.

During the 1960's and 1970's, BEA continued its work to provide more

information about regional economies. Annual State estimates of

disposable personal income were published in the April 1965 Survey of

Current Business (Survey), and the first set of quarterly estimates of

State personal income was published in the December 1966 Survey.

Estimates of personal income for metropolitan areas were published in

the May 1967 Survey, for nonmetropolitan counties in the May 1974

Survey, and for metropolitan counties in the April 1975 Survey. In

the late 1970's, BEA introduced annual estimates of employment for

States, metropolitan areas, and counties.

Refinement of the residence adjustment procedures and a fuller

presentation of industrial detail for earnings--the term introduced to

cover wages and salaries plus other labor income plus proprietors'

income--emerged in the estimates published in 1974. The residence

adjustment procedures had been extended to all States in 1966, but the

residence adjustment estimates (i.e., the net flows of interstate

commuters' earnings), along with earnings by industry on a

place-of-work basis, were not published explicitly until 1974.

3.2 The Regional Economic Measurement Program

3.2.1 Estimating Schedule for State and Local Area Personal Income Series

The annual estimates of State personal income for a given year are

subject to successive refinement. Preliminary estimates, based on the

current quarterly series, are published each April, 4 months after the

close of the reference year, in the Survey. The following August,

more reliable annual estimates are published. These estimates are

developed independently of the quarterly series and are prepared in

greater component detail, primarily from Federal and State government

administrative records. The annual estimates published in August are

subsequently refined to incorporate newly available information used

to prepare the local area estimates for the same year. These revised

State estimates, together with the local area estimates, are published

the following April. The annual estimates emerging from this

three-step process are subject to further revision for several

succeeding years (the State estimates in April and August and the

local area estimates in April), as additional data become available.

For example, the 1992 State estimates that were first released in

April 1993 will be revised in August 1993 and in April and August of

subsequent years; the 1991 local area estimates that were first

released in April 1993 will be revised in April of 1994 and of

subsequent years. The routine revisions of the State estimates for a

given year are normally completed with the fourth April publication,

and the local area estimates, with the third April publication. After

that, the estimates will be changed only to incorporate a

comprehensive revision of the National Income and Product

Accounts--which takes place approximately every 5 years--or to make

important improvements to the estimates through the use of additional

or more current State and local area data.

Quarterly estimates of State personal income, which are available

approximately 4 months after the close of the reference quarter, are

published regularly in the January, April, July, and October issues of

the Survey. In October and again the following April, the quarterly

series for the 3 previous years is revised for consistency with the

revised annual estimates. In January and July, at least the quarter

immediately preceding the current quarter is revised.

3.2.2 Availability of State and Local Area Estimates

The State and local area personal income and employment estimates are

available through the Regional Economic Information System (REIS),

which operates an information retrieval service that provides a

variety of standard and specialized analytic tabulations for States,

counties and specified combinations of counties. Standard tabulations

include personal income by type and earnings by industry, employment

by industry, transfer payments by program, and major categories of

farm gross income and expenses. These tabulations are available from

REIS in magnetic tape, computer printout, microcomputer diskette and

CD-ROM; some of the tabulations are also available electronically on

the Department of Commerce's Economic Bulletin Board, available

through the National Technical Information Service. In addition,

summary tabulations of the State and local estimates are published

regularly in BEA's major publication, the Survey. An extensive set of

State-level historical estimates is available (BEA, 1989).

BEA also makes its regional estimates available through the BEA User

Group, members of which include State agencies, universities, and

Census Bureau Primary State Data Centers. BEA provides its estimates

of income and employment for States, metropolitan areas, and counties

to these organizations with the understanding that they will make the

estimates readily available. Distribution in this way encourages

State universities and State agencies to use data that are comparable

for all States and counties and that are consistent with national

totals; using comparable and consistent data enhances the uniformity

of analytic approaches taken in economic development programs and

improves the recipients' ability to assess local area economic

developments and to service their local clientele.

3.3 BEA Annual State and County Personal Income Estimation

Methodology

3.3.1 Overview

The following discussion will focus on the annual estimates of State

and county personal income. BEA's quarterly State personal income,

annual State disposable personal income, annual State and county full-

and part-time employment, and gross State product (GSP) estimates are

produced in a manner similar to those described below. (The

methodologies for quarterly State personal income and for annual State

disposable personal income are described in BEA (1989, pp. M-32-37);

the methodology for GSP is described in BEA (1985) and in Trott,

Dunbar, and Friedenberg (1991, pp. 43-45).)

The personal income of an area is defined as the income received by,

or on behalf of, all the residents of the area. It consists of the

income received by persons from all sources, that is, from

participation in production, from both government and business

transfer payments, and from government interest. Personal income is

measured as the sum of wage and salary disbursements, other labor

income, proprietors' income, rental income of persons, personal

dividend income, personal interest income, and transfer payments less

personal contributions for social insurance. Per capita personal

income is measured as the personal income of the residents of an area

divided by the resident population of the area.
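
The two definitions above translate directly into arithmetic. The sketch below is illustrative only: the function names, the dictionary keys, and every dollar and population figure are hypothetical, and the unit choice (millions of dollars) is an assumption made for the example.

```python
def personal_income(components, contributions_for_social_insurance):
    """Sum of the income components less personal contributions
    for social insurance (all in the same units)."""
    return sum(components.values()) - contributions_for_social_insurance


def per_capita_personal_income(income_millions, resident_population):
    """Personal income of residents divided by resident population,
    returned in dollars per person."""
    if resident_population <= 0:
        raise ValueError("resident population must be positive")
    return income_millions * 1_000_000 / resident_population


# Hypothetical area, values in millions of dollars.
components = {
    "wage and salary disbursements": 600.0,
    "other labor income": 60.0,
    "proprietors' income": 80.0,
    "rental income of persons": 20.0,
    "personal dividend income": 30.0,
    "personal interest income": 90.0,
    "transfer payments": 150.0,
}
income = personal_income(components, contributions_for_social_insurance=30.0)
per_capita = per_capita_personal_income(income, resident_population=50_000)
```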

At the national level, personal income is part of the personal income

and outlay account, which is one of five accounts in a set that

constitutes the national income and product accounts. Such accounts

do not now exist below the national level; however, personal income

has long been available for States and local areas. In addition, GSP,

which corresponds to the national measure gross domestic product, and

some elements of personal outlays (personal tax and nontax payments)

are available for States but not for local areas. GSP is estimated

separately from State personal income, but the two measures share most

of the elements of wages and salaries, other labor income, and

proprietors' income by State of work. For a tabular representation of

the relationships among gross domestic product, State earnings, and

GSP, see Table 2 in Trott et al. (1991, p. 44).

3.3.2 Differences Between the National and Subnational Estimates

The definitions underlying the State and local area estimates of

personal income are essentially the same as those underlying the

national estimates of personal income. However, the national

estimates of personal income include the labor earnings (wages and

salaries and other labor income) of residents of the United States

temporarily working abroad, whereas the subnational estimates include

the labor earnings of persons residing only in the 50 States and the

District of Columbia. Specifically, the national estimates include

the labor earnings of Federal civilian and military personnel

stationed abroad and of residents who are employed by U.S. firms and

are on temporary foreign assignment. An "overseas" adjustment is made

to exclude the labor earnings of these workers from the national

totals before the totals are used as controls for the State estimates.

An important classification difference between national and

subnational estimates relates to border workers--that is, residents of

the United States who work in adjacent countries (such as Canada) and

foreigners who work in the United States but who reside elsewhere. At

the national level, the net flow of the labor earnings of border

workers and the labor earnings of U.S. residents employed by

international organizations and by foreign embassies and consulates in

the United States are included in the measurement of the

"rest-of-the-world" sector. At the State and local area levels,

however, only the labor earnings of U.S. residents employed by

international organizations and by foreign embassies and consulates in

the United States are treated as a component of personal income.

Border workers are treated as commuters and their earnings flows are

reflected in personal income through the residence adjustment

procedures.
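
A minimal sketch of the idea behind a residence adjustment follows, treating it as the net flow of commuters' earnings as described in Section 3.1. The function name, the argument names, and the values are hypothetical, and the sketch does not represent BEA's actual estimating procedure.

```python
def earnings_by_place_of_residence(
    earnings_by_place_of_work, earned_elsewhere_by_residents,
    earned_here_by_nonresidents,
):
    """Convert place-of-work earnings to a place-of-residence basis.

    The residence adjustment is the net flow of commuters' earnings:
    what residents earn outside the area minus what nonresidents
    earn inside it.
    """
    residence_adjustment = (
        earned_elsewhere_by_residents - earned_here_by_nonresidents
    )
    return earnings_by_place_of_work + residence_adjustment


# Hypothetical area that is a net importer of commuting workers.
resident_earnings = earnings_by_place_of_residence(
    earnings_by_place_of_work=500.0,
    earned_elsewhere_by_residents=40.0,
    earned_here_by_nonresidents=65.0,
)
```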

Statistical differences between the national and subnational series

may reflect the different estimating schedules for the two series.

The State and local area estimates usually incorporate source data

that are not available when the national estimates are prepared. The

national estimates are usually revised the following year to reflect

the more current State and local area data.

3.3.3 Sources of Data

BEA uses information collected by others to prepare its estimates of

State and local area personal income. Generally, two kinds of

information are used to measure the income of persons: Information

generated at the point of disbursement of the income and information

elicited from the recipient of the income. The first kind is data

drawn from the records generated by the administration of various

Federal and State government programs; the second kind is survey and

census data.

The following are among the more important sources of the

administrative record data: The State unemployment insurance programs

of the Employment and Training Administration, Department of Labor;

the social insurance programs of the Social Security Administration

and the Health Care Financing Administration, Department of Health and

Human Services; the Federal income tax program of the Internal Revenue

Service, Department of the Treasury; the veterans benefit programs of

the Department of Veterans Affairs; and the military payroll systems

of the Department of Defense. The two most important sources of

census data are the censuses of agriculture and of population. (BEA

uses little survey data to prepare the State and local area estimates;

however, the Department of Agriculture makes extensive use of surveys

to prepare the State farm income estimates and the county cash

receipts and crop production estimates that BEA uses in the derivation

of the farm income components of personal income.) The data obtained

from administrative records and censuses are used to estimate about 90

percent of personal income. Data of lesser scope and relevance are

used for the remaining 10 percent.

When data are not available in time to be incorporated into the

current estimating cycle, interim estimates are prepared using the

previous year's State or county distribution. The interim estimates

are revised during the next estimating cycle to incorporate the newly

available data.

Using data that are not primarily designed for income measurement has

several advantages and disadvantages. Using administrative record

data and census data, BEA can prepare the estimates of State and local

area personal income on an annual basis, in considerable detail, at

relatively low cost, and without increasing the reporting burden of

businesses and households. However, because these data are not

designed primarily for income measurement, they often do not precisely

"match" the series being estimated and must be adjusted to compensate

for differences in content (definition and coverage) and geographic

detail.

3.3.4 Controls and the Allocation Procedure

The national estimates for most components of wages and salaries and

transfer payments, which together account for about 75 percent of

personal income, are based largely on the sum of subnational source

data, and the procedure used to prepare the State and county estimates

causes only minor changes to the source data. For other components of

personal income, either detailed geographic coding is not available

for all source data, or more comprehensive and more reliable

information is available for the Nation than for States and counties.

For these reasons, the estimates of personal income are first

constructed at the national level. The subnational estimates are

constructed as elements of the national totals, using the subnational

data. Thus, the national estimates, with some adjustment for

definition, serve as the "control" for the State estimates, and the

State estimates, in turn, serve as controls for the county estimates.

The State estimates are made by allocating the national total for each

component of personal income to the States in proportion to each

State's share of a related economic series. Similarly, the county

estimates are made, in somewhat less component detail, by allocating

the State total. In some cases, the related series used for the

allocation may be a composite of several items (e.g., wages, tips, and

pay-in-kind) or the product of two items (e.g., average wages times

the number of employees). In every case, the final estimating step

for each income estimate is its adjustment to the appropriate higher

level total. This procedure is called the allocation procedure.

The allocation procedure, as used to estimate a component of State

personal income, is given by

[Equation graphic not reproduced: allocation formula for a component of State personal income.]

The source data that underlie the national estimates are frequently

more timely, detailed, and complete than the available State and

county data. The use of the allocation procedure imparts some of

these aspects of the national estimates to the subnational estimates

and allows the use of subnational data that are related but that do

not always precisely match the series being estimated. The use of

this procedure also yields an additive system wherein the county

estimates sum to the State totals and the State estimates sum to the

national total.
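
The allocation procedure described in this section can be illustrated with a minimal Python sketch. The function name `allocate`, the area labels, and all amounts are hypothetical and are not drawn from BEA's estimating system; the sketch only assumes that a control total is distributed in proportion to each area's share of a related series, as the text states.

```python
def allocate(control_total, related_series):
    """Distribute control_total across areas in proportion to related_series."""
    series_sum = sum(related_series.values())
    if series_sum <= 0:
        raise ValueError("related series must sum to a positive value")
    return {area: control_total * value / series_sum
            for area, value in related_series.items()}

# Hypothetical national control total and State-level related series.
national_total = 1000.0
state_series = {"State A": 30.0, "State B": 50.0, "State C": 20.0}
state_estimates = allocate(national_total, state_series)

# Each State estimate then serves as the control for its own counties.
county_series_a = {"County 1": 4.0, "County 2": 6.0}
county_estimates_a = allocate(state_estimates["State A"], county_series_a)
```

Because every level is scaled to the total above it, the county values sum to the State estimate and the State values sum to the national total, which is the additive property noted above.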

3.3.5 Place of Measurement

Personal income, by definition, is a measure of income received;

therefore, estimates of State and local area personal income should

reflect the residence of the income recipients. However, the data

available for regional economic measurement are frequently recorded by

the recipients' place of work. The data underlying the estimates can

be viewed as falling in four groups according to the place of

measurement.

(1) For the estimates of wages and salaries, other labor income, and

personal contributions for social insurance by employees, most of the

source data are reported by industry in the State and county in which

the employing establishment is located; therefore, these data are

recorded by place of work. The estimates based on these data are

subsequently adjusted to a place-of-residence basis for inclusion in

the personal income measure. (2) For nonfarm proprietors' income and

personal contributions for social insurance by the self-employed, the

source data are reported by tax-filing address. These data are largely

recorded by place of residence. (3) For farm proprietors' income, the

source data are reported and recorded at the principal place of

production, which is usually the county in which the farm has most of

its land. (4) For military reserve pay, rental income of persons,

personal dividend income, personal interest income, transfer payments,

and personal contributions for supplementary medical insurance and for

veterans life insurance, the source data are reported and recorded by

the place of residence of the income recipients.

3.3.6 Sources and Methods for Annual State and County Income Estimates

3.3.6.1 Framework

Personal income is estimated as the sum of its detailed components;

the major types of payments that comprise those components are shown

in Table 3.2, together with the related percents of personal income

and the principal sources of data used to estimate the components.

The following methodology presentation consists of a section for each

of the six types of payment and a section for the residence

adjustment. The methodologies for some types of payment and for many

of the individual income components are omitted from this

presentation, but a complete presentation is available (BEA 1991, pp.

M-7-27).

3.3.6.2 Wage and salary disbursements

Wage and salary disbursements, which accounted for about 58 percent of

total personal income at the national level in 1990, are defined as

the monetary remuneration of employees, including the compensation of

corporate officers; commissions, tips, and bonuses; and receipts in

kind that represent income to the recipient. They are measured before

deductions, such as social security contributions and union dues. The

estimates reflect the amount of wages and salaries disbursed during

the current period, regardless of when they were earned.

The following description of the procedures used in making the

estimates of wage and salary disbursements is divided into three

sections: Wages and salaries that are covered under the unemployment

insurance (UI) program, wages and salaries that are not covered under

the UI program, and wages and salaries that are paid in kind.

Wages and salaries covered by the UI program

The estimates of about 95 percent of wages and salaries are derived

from tabulations by the State employment security agencies (ESA's)

from their State employment security reports (form ES-202). These

tabulations summarize the data from the quarterly UI contribution

reports filed with a State ESA by the employers subject to that

State's UI laws. Employers usually submit reports for each "county

reporting unit"--i.e., for the sum of all the employer's

establishments in a county for each industry. However, in some cases,

an employer may group very small establishments in a single

"statewide" report without a county designation. Each quarter, the

various State ESA's submit the ES-202 tabulations on magnetic tape to

the Bureau of Labor Statistics (BLS), which provides a duplicate tape

to BEA. The tabulations present monthly employment and quarterly

wages for each county in Standard Industrial Classification four-digit

detail. (The ES-202 tabulations through 1987 reflect the 1972 SIC,

and those for 1988-90, the 1987 SIC.) Under the reporting

requirements of most State UI laws, wages include bonuses, tips,

gratuities, and the cash value of meals and lodging supplied by the

employer.

The BEA estimates of wage and salary disbursements are made, with a

few exceptions, at the SIC two-digit level. However, the availability

of the ES-202 data in SIC four-digit detail facilitates the detection

of errors and anomalies; this detail also makes it possible to isolate

those SIC three-digit industries for which UI coverage is too

incomplete to form a reliable basis for the estimates. In this case,

the SIC two-digit estimate is prepared as the sum of two pieces: The

fully covered portion, which is based on the ES-202 data, and the

incompletely covered portion, which is estimated as described in the

section on wages and salaries not covered by the UI program.

The ES-202 wage and salary data do not precisely meet the statistical and

conceptual requirements for BEA's personal income estimates.

Consequently, the data must be adjusted to meet the requirements

more closely. The adjustments affect both the industrial and

geographic patterns of the State and county UI-based wage estimates.

Adjustment for statewide reporting.--Wages and salaries reported for

statewide units are allocated to counties in proportion to the

distribution of the wages and salaries reported by county; the

allocations for each State are made for each private-sector industry

(generally at the SIC two-digit level) and for five government

components.

Adjustment for industry nonclassification.--The industry detail of the

ES-202 tabulations regularly shows minor amounts of payroll that have

not been assigned to any industry. For each State and county, the

amount of ES-202 payrolls in this category is distributed among the

industries in direct proportion to the industry-classified payrolls.
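
The statewide-reporting and industry-nonclassification adjustments are both pro rata distributions, one across counties and one across industries. A minimal sketch follows, assuming hypothetical amounts and a hypothetical helper named `spread_pro_rata`; it is not BEA's actual implementation.

```python
def spread_pro_rata(base, extra):
    """Add `extra` to the entries of `base` in proportion to their size."""
    total = sum(base.values())
    if total <= 0:
        raise ValueError("base amounts must sum to a positive value")
    return {key: amount + extra * amount / total for key, amount in base.items()}

# Statewide reporting: one industry's wages reported by county, plus a
# statewide amount that carries no county designation (hypothetical values).
county_wages = {"County 1": 600.0, "County 2": 400.0}
with_statewide = spread_pro_rata(county_wages, extra=50.0)

# Industry nonclassification: one county's payroll by industry, plus
# payroll not assigned to any industry (hypothetical values).
industry_payroll = {"SIC 20": 400.0, "SIC 52": 100.0}
with_unclassified = spread_pro_rata(industry_payroll, extra=25.0)
```

In both cases the relative distribution of the reported amounts is unchanged; only the level rises by the amount being distributed.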

Misreporting adjustment.--This adjustment--the addition of estimates

of wages and salaries subject to UI reporting that employers do not

report--is made to the ES-202 data for all covered private-sector

industries. At the national level, the estimate for each industry is

made in two parts--one for the underreporting of payrolls on UI

reports filed by employers and one for the payrolls of employers that

fail to file UI reports (Parker, 1984). The source data necessary to

replicate this methodology below the national level are not available.

Instead, the national adjustment for each industry is allocated to

States and counties in proportion to ES-202 payrolls.

Adjustments to government components.--Alternative source data are

substituted for the ES-202 data when the latter series reflects

excessively large proportions of Federal civilian payrolls that are

not reported by county or of State government payrolls that are

apparently reported in the wrong counties. For Federal civilian wages

and salaries, the alternative source data are tabulations of

employment by agency and county prepared by the Office of Personnel

Management. For State government wages and salaries, the alternative

source data are place-of-work wage data derived from an unpublished

tabulation of journey-to-work (JTW) data from the 1980 Census of

Population. (All income estimates using 1980 Census of Population

data will be updated to incorporate 1990 Census of Population data in

a regional comprehensive revision to be released in the spring of

1994.)

Adjustments for noncovered elements of UI-covered industries.--BEA

presently makes adjustments for the following noncovered elements:

o Tips;

o Commissions received by insurance solicitors and real estate agents;

o Payrolls of electric railroads, railroad carrier affiliates, and railway labor organizations;

o Salaries of corporate officers in Washington State;

o Payrolls of nonprofit organizations exempt from UI coverage because they have fewer than four employees;

o Wages and salaries of students employed by the institutions of higher education in which they are enrolled;

o Allowances paid to Federal civilian employees in selected occupations for uniforms; and

o Salaries of State and local government elected officials and members of the judiciary.

Except for tips, these elements are exempted from State UI coverage.

Tips are covered by the various UI laws. BEA assumes that this form

of income payment is considerably underreported, and it therefore

makes additional estimates of tips in industries where tipping is most

customary.

National and State estimates of each of the noncovered elements are

made (based on either direct data or indirect indicators). These

estimates are added to the ES-202 payroll amount for the industry of

the noncovered element to produce the final estimates for that

industry. Because of the lack of relevant data, county estimates are

made by allocating the final State total by the distribution of ES-202

payrolls for the appropriate industry.

Wages and salaries not covered by the UI program

Eight industries are treated as noncovered in making the State and

county estimates of wage and salary disbursements: (1) Farms, (2) farm

labor contractors, (3) railroads, (4) private elementary and secondary

schools, (5) religious membership organizations, (6) private

households, (7) military, and (8) "other." The estimates for these

industries are based on a variety of sources. For example, the

estimates for railroads are based mainly on employment data provided by

the Association of American Railroads, and the estimates for the

military services are based mainly on payroll data provided by the

Department of Defense. See BEA (1991) for the methodology for the

noncovered industries.

Wages and salaries paid in kind

The value of food, lodging, clothing, and miscellaneous goods and

services furnished to employees by their employers as payment, in part

or in full, for services performed is included in the wage and salary

component of personal income and is referred to as "pay-in-kind." The

estimates for UI-covered industries are prepared as an integral part

of total wages and salaries for those industries, based on the ES-202

data. The estimates for most of the noncovered industries are based

on pertinent employment data. See BEA (1991) for the methodology for

pay-in-kind.

3.3.6.3 Other labor income

Other labor income (OLI), which accounted for about 5.5 percent of

total personal income at the national level in 1990, consists

primarily of employer contributions to private pension and welfare

funds; these employer contributions account for approximately 98

percent of OLI. The "all other" component of OLI consists of

directors' fees, judicial fees, and compensation of prisoners.

Employer contributions for social insurance, which are paid into

government-administered funds, are not included in OLI; under national

income and product account conventions, it is the benefits paid from

social insurance funds--which are classified as transfer

payments--that are measured as part of personal income, not the

employer contributions to the funds.

Employer contributions to private pension and welfare funds

Private pension and profit-sharing funds, group health and life

insurance, and supplemental unemployment insurance.--The larger part

of the national estimates of employer contributions to private pension

and welfare funds is developed from Internal Revenue Service

tabulations of data from proprietorship and corporate income tax

returns published in Statistics of Income. However, these data are

not suitable for making the subnational estimates because most

multiestablishment corporations file tax returns on a companywide

basis instead of for each establishment and because the State in which

a corporation's principal office is located is often different from

the State of its other establishments. As a result, the geographic

distribution of the data tabulated from the tax returns does not

necessarily reflect the place of work of the employees on whose behalf

the contributions are made.

For private-sector employees, the State and county estimates of

employer contributions to private pension and profit-sharing funds,

group health and life insurance, and supplemental unemployment

insurance are made, for all types of employer contributions combined,

at the SIC two-digit level, the same level of industrial detail as the

wage and salary estimates. The national total of employer

contributions for each industry is allocated to the States and

counties in proportion to the estimates of wage and salary

disbursements for the corresponding industry. The use of subnational

wage estimates to allocate the national estimates of employer

contributions to private pension and welfare funds is based on the

assumption that the relationship of contributions to payrolls for each

industry is the same at the national, State, and county levels. The

procedure reflects the wide variation in contribution rates--relative

to payrolls--among industries (and therefore reflects appropriately

the various mixes of industries among States and counties). It does

not reflect the variation in contribution rates among States and

counties for a given industry.
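
The consequence of this assumption can be seen in a small sketch. All figures and labels below are hypothetical; the code only assumes, as described above, that each industry's national contributions are allocated in proportion to the wage estimates for that industry.

```python
# Hypothetical State wage estimates by industry.
state_wages = {
    "State A": {"manufacturing": 300.0, "retail": 60.0},
    "State B": {"manufacturing": 100.0, "retail": 240.0},
}
# Hypothetical national employer contributions by industry.
national_contrib = {"manufacturing": 48.0, "retail": 15.0}

# Wage totals by industry across all States (the allocation denominator).
wage_totals = {
    ind: sum(wages[ind] for wages in state_wages.values())
    for ind in national_contrib
}

# Allocate each industry's national contributions by the State's wage share.
state_contrib = {
    state: {ind: national_contrib[ind] * w / wage_totals[ind]
            for ind, w in wages.items()}
    for state, wages in state_wages.items()
}

# Within an industry the implied contribution rate is the same in every State,
mfg_rates = {s: state_contrib[s]["manufacturing"] / state_wages[s]["manufacturing"]
             for s in state_wages}
# but the all-industry rate varies with each State's industry mix.
overall_rates = {s: sum(state_contrib[s].values()) / sum(state_wages[s].values())
                 for s in state_wages}
```

In this illustration both States show the same manufacturing rate, while the State with relatively more manufacturing payroll shows the higher all-industry rate.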

The Federal Government makes contributions to a private pension fund,

called the Thrift Savings Plan, on behalf of its civilian employees

who participate in the Federal Employees Retirement System (mainly

employees hired after 1983). In the absence of direct data below the

national level, the national estimate is allocated to States and

counties in proportion to the estimates of Federal civilian wages and

salaries.

State government contributions to private pension plans consist of

annuity payments made by State governments on behalf of selected

employee groups--primarily teachers. The State estimates are based on

direct data from the Teachers Insurance and Annuity

Association/College Retirement Equities Fund. The county estimates

are prepared by allocating the State estimates in proportion to the

estimates of State and local government education wages and salaries.

In the absence of direct data below the national level, the national

estimates of Federal, State, and local government contributions to

private welfare funds on behalf of their employees are allocated to

States and counties in proportion to ES-202 employment data for each

level of government.

Privately administered workers' compensation.--The State estimates for

this subcomponent are based mainly on direct data provided by the

National Council on Compensation Insurance and by the Social Security

Administration; the county estimates for each SIC two-digit industry

reflect the geographic distribution of wages and salaries. The

methodology for this income component is given in BEA (1991).

"All other" OLI

The methodology for "all other" OLI--primarily directors' fees and

jury and witness fees--is given in BEA (1991). The State and county

estimates for directors' fees--the largest of these

subcomponents--reflect the geographic distribution of wages and

salaries in each industry.

3.3.6.4 Proprietors' Income

Proprietors' income, which accounted for about 8.5 percent of total

personal income at the national level in 1990, is the income,

including income-in-kind, of sole proprietorships and partnerships and

of tax-exempt cooperatives. The imputed net rental income of

owner-occupants of farm dwellings is included. Dividends and monetary

interest received by proprietors of nonfinancial business, monetary

rental income received by persons who are not primarily engaged in the

real estate business, and the imputed net rental income of

owner-occupants of nonfarm dwellings are excluded; these incomes are

included in dividends, net interest, and rental income of persons.

Proprietors' income, which is treated in its entirety as received by

individuals, is estimated in two parts--nonfarm and farm.

Nonfarm proprietors' income

Nonfarm proprietors' income is the income received by nonfarm sole

proprietorships and partnerships and by tax-exempt cooperatives. The

State and county estimates of the income of sole proprietors and

partnerships for all but three of the SIC two-digit industries are

based on 1981-83 tabulations from Internal Revenue Service (IRS) form

1040, Schedule C (for sole proprietors), and form 1065 (for

partnerships). Tabulations either of gross receipts or of profit less

loss from the two forms combined are used either to attribute a

national total to the States or as direct data. Two national totals

are used for each industry: One for income reported on the income tax

returns--as adjusted to conform with national income and product

accounting conventions--and one for an estimate of the income not

reported on tax returns.

For the adjustments for unreported income, no direct data are

available below the national level. The national total for each

industry is attributed to States in proportion to the IRS State

distribution of gross receipts for the industry. For the reported

portion of nonfarm proprietors' income, the State estimates for each

of 45 industries are based on the IRS distribution of profit less loss

for the industry, and the estimates for each of another 20 industries

(together accounting for 3 percent of total nonfarm proprietors'

income) are based on the IRS distribution of gross receipts for the

industry. For the latter group, the IRS distribution of profit less

loss, although preferable in concept, is not used as a basis for State

estimates because the extreme year-to-year volatility of the State

data suggests that they are unreliable.
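
For an industry in the first group, the two-part State estimate described above might be sketched as follows. The amounts, State labels, and the helper `shares` are hypothetical, and the sketch assumes that the State profit-less-loss amounts sum to a positive total.

```python
def shares(series):
    """Convert a State series into shares of its total."""
    total = sum(series.values())
    if total <= 0:
        raise ValueError("series must sum to a positive value")
    return {state: value / total for state, value in series.items()}

# Hypothetical national totals for one industry.
national_reported = 800.0    # income reported on tax returns, as adjusted
national_unreported = 200.0  # estimate of income not reported on tax returns

# Hypothetical IRS State distributions for the same industry.
profit_less_loss = {"State A": 60.0, "State B": 40.0}
gross_receipts = {"State A": 500.0, "State B": 500.0}

reported_shares = shares(profit_less_loss)   # used for the reported portion
unreported_shares = shares(gross_receipts)   # used for the unreported portion

state_income = {
    state: national_reported * reported_shares[state]
           + national_unreported * unreported_shares[state]
    for state in profit_less_loss
}
```

For an industry in the second group, `gross_receipts` would be used for both portions in place of the more volatile profit-less-loss distribution.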

The 1983 State estimates prepared by the foregoing methodology are

extended to later years based mainly on the number of small

establishments in each industry as determined from the Census Bureau's

County Business Patterns; see BEA (1989) for a full description of the

methodology.

For the three remaining industries, limited partners' income presents

a special estimating problem. In these industries--crude petroleum

and natural gas extraction, real estate, and holding and investment

companies--limited partnerships are often used as tax shelters.

Limited partners' participation in partnerships is often purely

financial; their participation more closely resembles that of

investors than that of working partners. Accordingly, the usual

assumption that the State from which the partnership files its tax

return is the same as the residence of the individual partners is

unsatisfactory. No direct data on the income of partners by their

place of residence are available. The national estimates of

proprietors' income for these industries are attributed to States in

the same proportion as dividends received by individuals (based on

all-industry dividends reported on IRS form 1040).

The State estimates of the income of tax-exempt cooperatives are based

on data provided by the Rural Electrification Administration (for

electric and telephone cooperatives) and the Agricultural Cooperative

Service (for farm supply and marketing cooperatives); see BEA (1989)

for the methodology.

The methodology for the county estimates of nonfarm proprietors'

income is similar to the State methodology, but less direct data are

used for many industries because problems with data volatility are

greater at the county level. See BEA (1991) for a full description of

the county methodology.

Farm proprietors' income

The estimation of farm proprietors' income starts with the computation

of the realized net income of all farms, which is derived as farm

gross receipts less production expenses. This measure is then

modified to reflect current production through a change-in-inventory

adjustment and to exclude the income of corporate farms and salaries

paid to corporate officers. Tables showing the derivation of State

and county farm proprietors' income in detail are available from the

Regional Economic Information System.
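
The sequence of steps in this derivation can be written as a short sketch. The function name, argument names, and figures are hypothetical, and the signs follow only the wording above (the inventory change is added; corporate farm income and corporate officers' salaries are removed); the detailed tables remain the authoritative statement of the derivation.

```python
def farm_proprietors_income(gross_receipts, production_expenses,
                            inventory_change, corporate_farm_income,
                            corporate_officer_salaries):
    """Sketch of the derivation described in the text (assumed sign conventions)."""
    # Realized net income of all farms.
    realized_net_income = gross_receipts - production_expenses
    # Reflect current production through the change-in-inventory adjustment.
    net_income_all_farms = realized_net_income + inventory_change
    # Exclude corporate farm income and salaries paid to corporate officers.
    return (net_income_all_farms
            - corporate_farm_income
            - corporate_officer_salaries)

# Hypothetical amounts for one State.
example = farm_proprietors_income(
    gross_receipts=1000.0,
    production_expenses=700.0,
    inventory_change=20.0,
    corporate_farm_income=40.0,
    corporate_officer_salaries=10.0,
)
```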

The concepts underlying the national and State BEA estimates of farm

income are generally the same as those underlying the national and

State farm income estimates of the U.S. Department of Agriculture

(USDA). The major definitional difference between the two sets of

estimates relates to corporate farms. The USDA totals include net

income of corporate farms, whereas the BEA personal income series,

which measures farm proprietors' net income, by definition excludes

corporate farms. Additionally, BEA classifies the salaries of

officers of corporate farms as part of farm wages and salaries; USDA

treats the corporate officers' salaries as returns to corporate

ownership and as part of the total return to farm operators.

The State control totals for the BEA county estimates of farm

proprietors' income are taken from the component detail of the USDA

State estimates, which are modified to reflect BEA definitions and to

include interfarm intrastate sales.

The methods used to estimate farm proprietors' income at the county

level rely heavily on data obtained from the 1974, 1978, and 1982

censuses of agriculture and on selected annual county data prepared by

the State offices affiliated with the National Agricultural Statistics

Service (NASS), USDA. (Data from the 1987 Census of Agriculture will

be incorporated into the estimates with the next cycle of

comprehensive revisions.) The NASS data, which are described by Iwig

in Chapter 7 of this report, are used, wherever possible, to

interpolate and extrapolate to noncensus years. In addition, data

from other sources within USDA, such as the Agricultural Stabilization

and Conservation Service, are used to prepare a fairly detailed income

and expense statement covering all farms in the State and county.

For census years, BEA prepares county estimates of 46 components of

gross income and 13 categories of production expenses. For

intercensal and postcensus years, the component detail of the

estimates for each State is set to take advantage of the best annual

county data available for the State.

Farm gross income includes estimates for the following items: (1) The

cash receipts from farm marketing of crops and livestock (in component

detail); (2) the income from other farm-related activities, including

recreational services, forest products, and custom-feeding services

performed by farm operators; (3) the payments to farmers under several

government payment programs; (4) the value of farm products produced

and consumed on farms; (5) the gross rental value of farm dwellings;

and (6) the value of the net change in the physical volume of farm

inventories of crops and livestock.

Cash receipts from marketing is the most important component of farm

gross income. The USDA generally has annual production, marketing,

and price data available for preparing the State estimates for about

150 different commodities. However, annual county estimates of cash

receipts--usually for total crops and for total livestock--are

currently available for only 19 States (BEA 1991, fn. 15, p. M-14).

For the other States, the USDA State estimates of cash receipts from

the marketing of individual commodities are summed into the 13 crop

and 5 livestock groups for which value-of-sales data are reported by

county in the censuses of agriculture. The aggregates for the census

years are then allocated by the related census county distributions.

Estimates for intercensal years are based on supplemental county

estimates of annual production of selected field crops and on State

season average prices available from the State NASS offices, or they

are calculated by straight-line interpolation between the census years

and adjusted to State USDA levels.
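
The interpolation alternative can be sketched as follows. County labels, census values, and the State total are hypothetical; the sketch assumes simple linear interpolation of county values between two census years, followed by proportional scaling to the State level for the intercensal year.

```python
def interpolate(value_start, value_end, year_start, year_end, year):
    """Straight-line interpolation between two benchmark years."""
    weight = (year - year_start) / (year_end - year_start)
    return value_start + weight * (value_end - value_start)

def adjust_to_control(values, control_total):
    """Scale values proportionally so that they sum to control_total."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive value")
    return {key: value * control_total / total for key, value in values.items()}

# Hypothetical county cash receipts for one commodity group in two census years.
census_1978 = {"County 1": 100.0, "County 2": 300.0}
census_1982 = {"County 1": 140.0, "County 2": 260.0}

# Interpolate each county to 1980, then adjust to a hypothetical State total.
interpolated_1980 = {
    county: interpolate(census_1978[county], census_1982[county], 1978, 1982, 1980)
    for county in census_1978
}
state_total_1980 = 440.0
county_1980 = adjust_to_control(interpolated_1980, state_total_1980)
```

The interpolation supplies the county pattern for the intercensal year, while the State estimate supplies the level.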

The county estimates of the remaining components of gross income, of

production expenses, of the adjustment for interfarm intrastate

transactions, and of the adjustment to exclude the income of corporate

farms are based mainly on data from the censuses of agriculture and

data provided by NASS and by the Agricultural Stabilization and

Conservation Service. See BEA (1991) for a full description of the

methodology.

3.3.6.5 Personal Dividend Income, Personal Interest Income, and

Rental Income of Persons

These components accounted for more than 17 percent of total personal

income in 1990. Dividends are payments in cash or other assets,

excluding stock, by corporations organized for profit to noncorporate

stockholders who are U.S. residents. Interest is the monetary and

imputed interest income of persons from all sources. Imputed interest

represents the excess of income received by financial intermediaries

from funds entrusted to them by persons over income disbursed by these

intermediaries to persons. Part of imputed interest reflects the

value of financial services rendered without charge to persons by

depository institutions. The remainder is the property income held by

life insurance companies and private noninsured pension funds on the

account of persons; one example is the additions to policyholder

reserves held by life insurance companies.

Rental income of persons consists of the monetary income of persons

(except those primarily engaged in the real estate business) from the

rental of real property (including mobile homes); the royalties

received by persons from patents, copyrights, and rights to natural

resources; and the imputed net rental income of owner-occupants of

nonfarm dwellings.

The State and county estimates of dividends, interest, and rent are

based mainly on data tabulated from Federal individual income tax

returns by the Internal Revenue Service. The methodology for

dividends, interest, and rent is given in BEA (1991).

3.3.6.6 Transfer payments

Transfer payments are payments to persons, generally in monetary form,

for which they do not render current services. As a component of

personal income, they are payments by government and business to

individuals and nonprofit institutions. Nationally, transfer payments

accounted for almost 15 percent of total personal income in 1990. At

the county level, approximately 75 percent of total transfer payments

are estimated on the basis of directly reported data. The remaining

25 percent are estimated on the basis of indirect, but generally

reliable, data.

For the State and county estimates, approximately 50 subcomponents of

transfer payments are independently estimated using the best data

available for each subcomponent. The methodology for all of these

subcomponents is given in BEA (1991); the following items are

presented here as examples.

Old-age, survivors, and disability insurance (OASDI) payments.--These

payments, popularly known as social security, consist of the total

cash benefits paid during the year, including monthly benefits paid to

retired workers, dependents, and survivors and special payments to

persons 72 years of age and over; lump-sum payments to survivors; and

disability payments to workers and their dependents. The State

estimates of each OASDI segment are based on Social Security

Administration (SSA) tabulations of calendar year payments. The

county estimates of total OASDI benefits are based on SSA tabulations

of the amount of monthly benefits paid to those in current-payment

status on December 31, by county of residence of the beneficiaries.

Medical vendor payments.--These are mainly payments made through

intermediaries for care provided to individuals under the federally

assisted State-administered medicaid program. Payments made under the

general assistance medical programs of State and local governments are

also included. The county estimates are based on available payments

data from the various State departments of social services. For

States where no county data are available, the county estimates are

based on the distribution of payments made under the aid to families

with dependent children program.

Aid to families with dependent children (AFDC).--This

State-administered program receives Federal matching funds to provide

payments to needy families. The State estimates are based on

unpublished quarterly payments data provided by the SSA. The county

estimates are prepared from payments data provided by the various

State departments of social services. County data are no longer being

received from some States; for these States, the most recent available

data are used for the county estimates for each subsequent year.

State unemployment compensation.--These are the cash benefits,

including special benefits authorized by Federal legislation for

periods of high unemployment, from State-administered unemployment

insurance (UI) programs. Most States report benefits directly by

county, but a few report by local district office. In the latter

case, local district office data are distributed among the counties

within the jurisdiction of the local district office in proportion to

the annual average number of unemployed persons estimated by the

Bureau of Labor Statistics (BLS). When the State is unable to supply

the county data in time to meet the publication deadline, a

preliminary set of estimates is made and is revised the following year

to incorporate the delayed county data. The preliminary county

estimates are prepared by extrapolating the preceding year's estimates

forward by the change in the BLS estimate of the annual average number

of unemployed persons.
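
The two calculations described above for State unemployment compensation can be sketched in a few lines of Python. This is an illustration only, not BEA's production procedure: the function names and figures are hypothetical, and "extrapolating by the change" is assumed here to mean scaling by the ratio of the current to the preceding year's BLS unemployment estimate.

```python
def allocate_district_benefits(district_benefits, unemployed_by_county):
    """Split one district office's benefits among its counties in
    proportion to each county's annual average number of unemployed."""
    total_unemployed = sum(unemployed_by_county.values())
    if total_unemployed <= 0:
        raise ValueError("unemployed counts must sum to a positive number")
    return {
        county: district_benefits * unemployed / total_unemployed
        for county, unemployed in unemployed_by_county.items()
    }


def extrapolate_preliminary(prior_year_estimate, unemployed_prior, unemployed_current):
    """Carry last year's county estimate forward by the relative change in
    the county's annual average number of unemployed (assumed ratio form)."""
    return prior_year_estimate * unemployed_current / unemployed_prior


# Hypothetical figures, for illustration only.
shares = allocate_district_benefits(1200.0, {"County A": 300, "County B": 100})
preliminary_a = extrapolate_preliminary(shares["County A"], 300, 330)
```

The allocated county amounts always sum to the district office total, so the State total is preserved by construction.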

Veterans life insurance benefit payments.--These are the claims paid

to beneficiaries and the dividends paid to policyholders from the five

veterans life insurance programs administered by the Department of

Veterans Affairs. The county allocations of the combined payments of

death benefits and dividends are based on the distribution of the

veteran population.

Interest payments on guaranteed student loans.--These are the payments

to commercial lending institutions on behalf of individuals who

receive low-interest deferred-payment loans from these institutions to

pay the expenses of higher education. The State estimates are based

on Department of Education data on the number of persons enrolled in

institutions of higher education. The county allocations are based on

the distribution of the civilian population.

3.3.6.6 Personal Contributions For Social Insurance

Personal contributions for social insurance are the contributions made

by individuals under the various social insurance programs. These

contributions are excluded from personal income by treating them as

explicit deductions. Payments by employees and the self-employed for

social security, medicare, and government employees' retirement are

included in this component. Also included are the contributions that

are made by persons participating in the veterans life insurance

program and in the supplementary medical insurance portion of the

medicare program.

The State and county estimates of personal contributions for social

insurance are generally based either on direct data from the

administering agency or on the geographic distribution of the

appropriate earnings component; see BEA (1989 and 1991) for the full

methodologies.

3.3.6.7 Residence Adjustment

Personal income is a "place-of-residence" measure of income, but the

source data for the components that compose more than 60 percent of

personal income are recorded by place of work. The adjustment of the

estimates of these components to a place-of-residence basis is the

subject of this section.

At the national level, place of residence is an issue only for border

workers (mainly those living in the United States and working in

Canada or Mexico and vice versa). At the State and county levels, the

issue of place of residence is more significant. Individuals

commuting to work between States are a major factor where metropolitan

areas extend across State boundaries--for example, the Washington,

DC-MD-VA MSA. Individuals commuting between counties are a major

factor in every multicounty metropolitan area and in many

nonmetropolitan areas.

BEA's concept of residence as it relates to personal income refers to

where the income to be measured is received rather than to "usual,"

"permanent," or "legal" residence. It differs from the Census

Bureau's concept mainly in the treatment of migrant workers. The

decennial census counts many of these workers at their usual place of

residence rather than where they are on April 1 when the census is

taken. Except for out-of-State workers in Alaska (where migrant

workers are unusually important) and for certain groups of border

workers, BEA assigns the wages of migrant workers to the area in which

they reside while performing the work. Similarly, BEA assigns the

income of military personnel to the county in which they reside while

on military assignment, not to the county in which they consider

themselves to be permanent or legal residents. Thus, in the State and

local area personal income series, the income of military personnel on

foreign assignment is excluded because their residence is outside of

the territorial limits of the United States.

Three of the six major components of personal income are recorded, or

are treated as if recorded, on a place-of-residence (where-received)

basis. They are transfer payments; personal dividend income, personal

interest income, and rental income of persons; and proprietors'

income. Nonfarm proprietors' income is treated as income recorded on

a place-of-residence basis because the source data for almost all of

this part of proprietors' income are reported to the IRS by tax-filing

address, which is usually the filer's place of residence. The source

data for farm proprietors' income are recorded by place of production,

which is usually in the same county as the proprietor's place of

residence.

The remaining three major components--wages and salaries, other labor

income (OLI), and personal contributions for social insurance--are

estimated, with minor exceptions, from data that are recorded by place

of work (point of disbursement). The sum of these components (wages

plus OLI minus contributions) is referred to as "income subject to

adjustment" (ISA).

Residence adjustment procedure (excluding border workers)

The county residence adjustment estimates for 1981 and later years are

based on those for 1980 because intercounty commuting data are

available only from the decennial censuses of population. (Data from

the 1990 Census of Population will be introduced into the residence

adjustment estimates as part of the comprehensive revisions to the

State and local area personal income estimates that are now underway.)

The estimation of these adjustments can be understood using the

example of a two-county area comprising counties f and g. The

two-county example is easily generalized to more complex situations.

[Graphic not reproduced.]

data from the 1980 Census of Population on the number of wage and

salary workers (W) and on their average earnings (E) by county of work

for each county of residence:

[Graphic not reproduced.]

The initial 1980 BEA estimates were modified in three situations.

First, for clusters of counties identified as being closely related by

commuting (mostly multicounty metropolitan areas), modifications were

made to incorporate the 1979 wage and salary distribution from the

1980 Census of Population. (The 1979 wage and salary distribution from

the 1980 Census of Population reflects the residential distribution of

the income recipients as of April 1, 1980, regardless of where they

were living when they received the wages and salaries.) These

modifications are needed because in numerous cases the 1980-census journey-to-work

(JTW) data and the source data for the BEA wage estimates are inconsistently

coded by place of work. (For example, the source data may attribute

too much of the wages of a multiestablishment firm to the county of

the firm's main office, or the geographic coding of the Defense

Department payroll data and of the JTW data may attribute a military

base extending across county boundaries to different counties.)

Initial county estimates of place-of-residence wages and salaries were

derived as place-of-work wages and salaries plus net residence

adjustment for wages and salaries. (For the calculation of this net

residence adjustment, only the gross flows for wages and salaries were

used.) Then, the initial 1980 BEA place-of-residence wage and salary

estimates were summed to a total for each cluster. Finally, the BEA

total for each cluster was redistributed among the counties of the

cluster in the same proportion as the 1979 wage and salary

distribution from the 1980 census. To facilitate the extension of the

1980 residence adjustment estimates to later years, the cluster-based

modifications--derived as net additions to or subtractions from the

initial residence adjustment estimates for each of the 1,287

counties--were expressed as gross flows between pairs of counties

within the same cluster. In the simplest case--a two-county

cluster--the additional gross flow was assumed to be from the county

with the negative modification to the county with the (exactly

offsetting) positive modification.
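
The cluster-based modification can be illustrated with a short Python sketch. It assumes a hypothetical two-county cluster with made-up wage figures; the function names are not BEA's, and only the simplest (two-county) gross-flow case described above is covered.

```python
def redistribute_cluster(initial_por_wages, census_wages):
    """Redistribute the cluster total of initial place-of-residence wages
    in proportion to the census wage and salary distribution.

    Returns the modified county estimates and the modification
    (modified minus initial) for each county."""
    cluster_total = sum(initial_por_wages.values())
    census_total = sum(census_wages.values())
    modified = {
        county: cluster_total * census_wages[county] / census_total
        for county in initial_por_wages
    }
    modifications = {
        county: modified[county] - initial_por_wages[county]
        for county in initial_por_wages
    }
    return modified, modifications


def two_county_gross_flow(modifications):
    """Express the modifications of a two-county cluster as one gross flow
    from the county with the negative modification to the other county."""
    (county_a, mod_a), (county_b, _) = modifications.items()
    source, destination = (county_a, county_b) if mod_a < 0 else (county_b, county_a)
    return source, destination, abs(modifications[source])


# Hypothetical figures, for illustration only.
initial = {"f": 600.0, "g": 400.0}   # initial BEA place-of-residence wages
census = {"f": 700.0, "g": 300.0}    # census wage and salary distribution
modified, mods = redistribute_cluster(initial, census)
flow = two_county_gross_flow(mods)
```

Because the cluster total is held fixed, the modifications within a cluster sum to zero, which is why the two-county case reduces to a single offsetting flow.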

Second, modifications were made for selected noncluster adjacent

counties if large, offsetting differences occurred between the initial

1980 BEA estimates and the census wage data for these counties. These

adjacent-county modifications were expressed as gross flows in the

same way and for the same reason as the cluster-based modifications.

Third, modifications were made for eight Alaska county equivalents

(boroughs and census areas) to reflect the large amounts of labor

earnings received by seasonal workers from out of State. The

1980-census JTW data reflect the "commuting" of many of these workers,

and the initial 1980 residence adjustment estimates for a majority of

the county equivalents did not require modifications. However, for

eight county equivalents, the initial 1979 estimates yielded BEA

place-of-residence wage and salary totals that were so much higher

than the comparable census data that they could not be an accurate

reflection of the wages of only the permanent residents. (The 1979

residence adjustment estimates, although based mainly on the

1980-census JTW data, also reflect--at the appropriate one-tenth

weight--1970-census JTW data.) Based on the assumption that the

excess amounts were attributable to out-of-State migrant workers,

these amounts were removed by judgmentally increasing the JTW-based

gross flows to the large metropolitan counties of Washington, Oregon,

and California.

[Graphic not reproduced.]

As a last step, the total place-of-residence ISA (ISA plus net

residence adjustment) for each cluster is derived and then distributed

to the counties of the cluster based on 1980 place-of-residence ISA

extrapolated to later years by the percentage change in the IRS-based

wage series. The net residence adjustment estimate for each cluster

county is calculated as place-of-residence ISA minus place-of-work

ISA.
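
A minimal Python sketch of this last step follows. It is an illustration under stated assumptions rather than BEA's procedure: the county names and figures are hypothetical, the net residence adjustments fed in are taken as given from the earlier gross-flow step, and the "percentage change in the IRS-based wage series" is represented as a ratio of later-year to 1980 IRS-based wages for each county.

```python
def income_subject_to_adjustment(wages, other_labor_income, contributions):
    """ISA = wages and salaries plus OLI minus personal contributions."""
    return wages + other_labor_income - contributions


def cluster_residence_adjustment(pow_isa, net_adjustment, por_isa_1980, irs_wage_ratio):
    """Distribute a cluster's place-of-residence ISA among its counties.

    pow_isa         county -> place-of-work ISA for the estimate year
    net_adjustment  county -> net residence adjustment from the gross flows
    por_isa_1980    county -> 1980 place-of-residence ISA
    irs_wage_ratio  county -> later-year / 1980 IRS-based wages (assumed form)
    """
    cluster_total = sum(pow_isa[c] + net_adjustment[c] for c in pow_isa)
    extrapolated = {c: por_isa_1980[c] * irs_wage_ratio[c] for c in pow_isa}
    weight_total = sum(extrapolated.values())
    por_isa = {c: cluster_total * extrapolated[c] / weight_total for c in pow_isa}
    final_adjustment = {c: por_isa[c] - pow_isa[c] for c in pow_isa}
    return por_isa, final_adjustment


# Hypothetical figures, for illustration only.
isa_f = income_subject_to_adjustment(900.0, 150.0, 50.0)
por_isa, final_adj = cluster_residence_adjustment(
    pow_isa={"f": isa_f, "g": 500.0},
    net_adjustment={"f": -200.0, "g": 200.0},
    por_isa_1980={"f": 600.0, "g": 400.0},
    irs_wage_ratio={"f": 1.5, "g": 1.25},
)
```

The cluster's total place-of-residence ISA is unchanged by the redistribution; only its division among the cluster counties changes.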

3.4 Evaluation Practices

In the past few years, two major studies were undertaken by BEA to

evaluate the State and local area income estimates: (1) a reliability

study of the State quarterly personal income series and (2) a study of

the accuracy of the county residence adjustment estimates. In

addition, in March of this year, the U.S. General Accounting Office

(GAO) completed a study of BEA's national and State estimates.

3.4.1 Evaluation of the State Quarterly Personal Income Series

This study provided a detailed measurement and analysis of the

reliability of quarterly and annual estimates of State personal income

(Brown and Stehle 1990). The study, which covered the State estimates

from 1980-87, assessed the reliability of State quarterly personal

income using several statistical measures to examine the size of the

revisions made to the estimates. One measure used analyzes the range

of revisions, where revision is defined as the percent change in the

final estimates minus the percent change in the preliminary estimates.

Other sets of measures used were dispersion, relative dispersion,

bias, and relative bias. The findings of the study were intended to:

(1) help BEA isolate particular problem areas in the production of

these estimates; and (2) help users of these data determine the

suitability for their purposes of the estimates released at different

stages of the estimating process. The four principal findings of the

study were: (1) the major sources of the revisions to the quarterly

percent changes in the preliminary quarterly estimates of State

personal income are farm proprietors' income and wages and salaries;

(2) largely reflecting wages and salaries, the preliminary quarterly

estimates of total personal income tend to be underestimated in

fast-growing States and overestimated in slow-growing States; (3)

beginning in 1984, the reliability of the second quarterly estimates

(that is, the estimates yielded by the first routine revision) was

improved by the incorporation of quarterly data from employers'

payroll tax reports (the ES-202 data); and (4) the annual revisions of

total personal income are smaller than the quarterly revisions.
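
The revision measure defined above can be written out as a short Python sketch. The definition of a revision follows the text; the formulas for bias (mean revision) and dispersion (mean absolute revision) are assumptions based on common usage, since the study's exact formulas are not given here. All figures are hypothetical.

```python
def percent_change(current, previous):
    """Percent change from the previous period to the current period."""
    return 100.0 * (current - previous) / previous


def revision(prelim_prev, prelim_curr, final_prev, final_curr):
    """Percent change in the final estimates minus the percent change
    in the preliminary estimates."""
    return percent_change(final_curr, final_prev) - percent_change(prelim_curr, prelim_prev)


def bias(revisions):
    """Mean revision (assumed definition): the sign shows whether
    preliminary changes tend to be revised up or down."""
    return sum(revisions) / len(revisions)


def dispersion(revisions):
    """Mean absolute revision (assumed definition): the typical size
    of a revision regardless of sign."""
    return sum(abs(r) for r in revisions) / len(revisions)


# Hypothetical figures: preliminary growth of 2 percent revised to 3 percent.
example_revision = revision(100.0, 102.0, 100.0, 103.0)
sample_revisions = [0.5, -0.25, 0.25]
sample_bias = bias(sample_revisions)
sample_dispersion = dispersion(sample_revisions)
```

A positive bias in this sketch corresponds to preliminary estimates that understate growth, the pattern the study reports for fast-growing States.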

3.4.2 Residence Adjustment Reliability Study

In October 1988, a study was completed which measured the reliability

of the Census commuting data used to prepare the net residence

adjustments for county personal income (Zabronsky 1988). While the

impact of the residence adjustments is generally small at the region

and State level, the residence adjustments constitute a large portion

of total personal income for most counties in the U.S. In 1989, for

instance, the absolute value of the net residence adjustments

accounted for about 12.5 percent of total personal income for all

counties, on average, while accounting for about 25 percent of total

personal income in metropolitan area counties, on average.

In this residence adjustment reliability study, a comparison of a

Census file of county commuting data constructed from the

journey-to-work question on the 1980 decennial census was made with a

file of aggregate wages and salaries independently tabulated by

Census. In the course of the study, comparisons between

the journey-to-work and aggregate wage series were explored across a

variety of geographic, demographic, and industrial detail to develop a

comprehensive reliability profile for the census commuting data.

The major conclusion of this study was that taking the 1980 Census

aggregate income series as a benchmark measure of county wages and

salary income, the 1980 census journey-to-work data proved to be a

highly reliable source for measuring commuters' income in the

development of BEA's county residence adjustment estimates. Although

careful analysis of the Census journey-to-work wage data did reveal a

bias in that series that was correlated with county size, wage

imputations undertaken by BEA largely corrected the problem while

commuting patterns between counties indicated that for the relevant

comparisons, the Census journey-to-work data were consistent with the

Census aggregate income-based wage series.

3.4.3 GAO Study of BEA's State and National Estimates

The GAO study (GAO, 1993) was conducted in response to a request by

the Honorable Ernest F. Hollings, Chairman, Committee on Commerce,

Science, and Transportation, U.S. Senate. Senator Hollings expressed

concern about press reports that alleged that BEA "did not

incorporate, for political purposes, a downward revision of original

employment levels into its October 1991 estimate of first quarter 1991

State personal income growth and its December 1991 estimate of first

quarter 1991 gross domestic product (GDP) growth." The report

concluded that "We found no evidence that BEA manipulated first

quarter 1991 personal income or GDP estimates for political purposes.

BEA generally followed its standard procedures for using employment

data in these estimates and deviated from these procedures only when

required by what we believe were reasonable technical judgments" (GAO,

1993, p. 1).

3.5 Current Problems and Activities

BEA's regular publication schedules are carefully developed to take

into account the needs of users, balanced against the responsibility

to produce data of high quality. In general, the four-month lag of

the State quarterly and preliminary annual State personal income data

and the eight-month lag in the release of more detailed annual State

personal income estimates are timely enough for most purposes and

cause few hardships for the users of these series.

For county and metropolitan area data, the fifteen-month lag required

to produce these estimates is considered too long for many purposes

and has limited the usefulness of these data. In an effort to address

the issue of the timeliness of its local area estimates, BEA has

recently been testing the feasibility of developing preliminary annual

estimates of personal income for metropolitan areas and

non-metropolitan portions of States. These estimates would be

available with a seven-month lag.

3.6 Conclusions

Rapid advances in computer technologies continue to provide

improvements in the range of regional data available for local area

estimation as well as in the timing of their availability. For

example, the more timely availability of ES-202 wage and salary data

coupled with BEA's improved computing capabilities and estimating

procedures may allow for the much more timely release of preliminary

income estimates for metropolitan areas.

These rapid advances in computer technologies also continue to expand

the ease of data transfer, storage, and manipulation. For example,

BEA recently introduced a CD-ROM containing the local area personal

income estimates; many data users can now acquire the entire set of

estimates rather than placing an order each time they need some of the

data. As in the past, it is anticipated that these advances in

electronic capabilities will continue to expand the uses and users of

BEA's regional estimates.

Table 3.1: Programs Using BEA Personal Income Estimates in Allocation Formulas
for Federal Domestic Assistance Funds, Fiscal Year 1992

Program                                                 FY 1992 Obligations
Number   Program Name                                       (Millions of $)
---------------------------------------------------------------------------
17.235   Senior Community Service Employment Program                  395.2
84.126   Rehabilitation Services                                    1,783.5
84.154   Public Library Construction and Technology Enhancement        29.8
93.020   Family Support Payments to States (AFDC)                  13,814.9
93.138   Protection and Advocacy for Mentally Ill Individuals          19.1
93.630   Developmental Disabilities Basic Support and Advocacy         90.2
93.645   Child-Welfare Services--State Grants                         273.9
93.658   Foster Care--Title IV-E                                    2,342.1
93.659   Adoption Assistance                                          201.9
93.778   Medical Assistance Program (Medicaid; Title XIX)          72,502.7
93.779   Health Care Financing Research                                78.4
93.992   Alcohol & Drug Abuse & Mental Health Services                292.0
         TOTAL                                                     92,823.7
---------------------------------------------------------------------------
Source: Office of Management and Budget and U.S. General Services
Administration (1992), 1992 Catalog of Federal Domestic Assistance,
Washington, DC: U.S. Government Printing Office. For information
about the grant formulas, see U.S. General Services Administration
(1992), 1992 Formula Report to the Congress, Washington, DC: U.S.
Government Printing Office.

[Graphic not reproduced.]

REFERENCES

Advisory Commission on Intergovernmental Relations (ACIR) (1990), Significant

Features of Fiscal Federalism, Volume 1: Budget Processes and Tax Systems,

M-169, pp. 10-13, Washington, DC: U.S. Government Printing Office.

Bureau of the Census, U.S. Department of Commerce (1992), Statistical

Abstract of the United States: 1992, Appendix II, Washington, DC:

U.S. Government Printing Office.

Bureau of Economic Analysis (BEA), U.S. Department of Commerce (1985),

Experimental Estimates of Gross State Product by Industry, BEA Staff

Paper 42, Washington, DC: National Technical Information Service.

______(1989), State Personal Income: 1929-87, Estimates and a Statement

of Sources and Methods, Washington, DC: U.S. Government Printing Office.

______(1991), Local Area Personal Income, 1984-89, Volume 1:

Summary, Washington, DC: U.S. Government Printing Office.

Brown, R.L. and Stehle, J.E. (1990), "Evaluation of the State Personal Income

Estimates," pp. 20-29, Survey of Current Business 70 (December, 1990).

Creamer, D. and Merwin, C. (1942), "State Distribution of Income

Payments, 1929-41," Survey of Current Business 22 (July 1942).

Nathan, R.R., and Martin, J.L. (1939), "State Income Payments, 1929-37,"

mimeographed report, Washington, DC: Bureau of Foreign and Domestic

Commerce, U.S. Department of Commerce.

Parker, R.P. (1984), "Improved Adjustments for Misreporting of Tax

Return Information Used to Estimate the National Income and Product

Accounts, 1977," pp. 17-25, Survey of Current Business 64 (June 1984).

Trott, E.A., Dunbar, A.E., and Friedenberg, H.L. (1991), "Gross State Product

by Industry, 1977-89," pp. 43-59, Survey of Current Business 71

(December 1991).

U.S. General Accounting Office (GAO) (1993), Gross Domestic Product: No

Evidence of Manipulation in First Quarter 1991 Estimates, Washington, DC:

U.S. Government Printing Office.

Zabronsky, D. (1988), "Reliability of the Census Journey-to-Work data

in the Residence Adjustment for County Personal Income," Discussion

Paper #35, Bureau of Economic Analysis, U.S. Department of Commerce.

CHAPTER 4

Postcensal Population Estimates:

States, Counties, and Places

John F. Long, U. S. Bureau of the Census

4.1 Introduction and Program History

The U. S. Bureau of the Census produces population estimates for the

nation, states, counties, and places (cities, towns, and townships) as

part of its program to quantify changes in population size and

distribution since the last census. These estimates provide updates

to the population counts by demographic and geographic characteristics

from the last census. They also indicate the pace of population

change since the last census and the relative influence of the

components of population change. While the national estimates can be

produced by a careful accounting system that adds annual births,

deaths, and international migration to the previous year's population,

subnational estimates require development of methods for dealing with

the largely unmeasured component effects of internal migration. Many

of these methods represent the type of small domain estimates that

constitute the subject of this working paper.

4.1.1 Uses of postcensal population estimates

There are five major categories of uses for the Census Bureau's

population estimates: 1) Federal and state funds allocation, 2)

denominators for vital rates and per capita time series, 3) survey

controls, 4) administrative planning and marketing guidance, and 5)

descriptive and analytical studies (Table 4.1). More than 70 federal

programs distribute tens of billions of dollars annually on the basis

of population estimates (GAO, 1990). Even more money was distributed

indirectly on the basis of indicators which used population estimates

for denominators or controls (GAO, 1991). Many states also use the

postcensal subnational estimates to allocate state funds to counties,

townships, and incorporated places within the state.

A large number of Federal statistical series including state and

county per capita income, national and state birth and death rates,

and county level cancer rates by age, sex, and race use the results of

the postcensal estimates. While many Federal agencies directly

collect time series data on events and amounts, they require annual

postcensal estimates of state and county population to produce per

capita rates. These series provide an indication of national and

subnational trends for fertility and mortality rates, incidence of

cancer and other diseases, per capita economic changes, and other

social, demographic, and administrative indicators.

Population surveys require independent controls from national

population estimates by age, sex, race, and ethnicity as well as data

on the geographic distribution of the population by states and

selected metropolitan areas. These estimates are used to weight the

sample cases such that the survey results equal the postcensal

estimates used as controls. Each of the major surveys conducted by

the Census Bureau controls to somewhat different levels of geographic

and demographic detail (Table 4.2). There are a number of reasons to

control surveys to independent estimates. They were initially

instituted to reduce the variance of the survey estimates. They are

also used for a number of secondary reasons: reduction in month-to-

month variability of longitudinal data from consecutive surveys,

partial correction for the large rates of undercoverage of surveys

relative to the census, and improved consistency between different

surveys and other population data series based on independent

estimates.
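
The idea of weighting sample cases so that survey totals equal the independent controls can be sketched as a simple ratio adjustment within cells. This is a simplified illustration, not the weighting procedure of any particular Census Bureau survey: the cells, weights, and control totals are hypothetical, and real surveys control on several dimensions at once.

```python
def control_weights(sample, controls):
    """Scale base weights so that the weighted total in each cell
    equals the independent population estimate for that cell.

    sample    list of (cell, base_weight) pairs, one per sample case
    controls  cell -> postcensal population estimate used as the control
    """
    weighted_totals = {}
    for cell, weight in sample:
        weighted_totals[cell] = weighted_totals.get(cell, 0.0) + weight
    return [
        (cell, weight * controls[cell] / weighted_totals[cell])
        for cell, weight in sample
    ]


# Hypothetical sample cases and control totals, for illustration only.
sample_cases = [("male", 100.0), ("male", 150.0), ("female", 200.0)]
population_controls = {"male": 300.0, "female": 180.0}
adjusted = control_weights(sample_cases, population_controls)
```

In this example the survey undercovers males (weighted total 250 against a control of 300), so their weights are raised; this is the partial correction for undercoverage mentioned above.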

There are numerous other administrative and analytical uses of the

postcensal population estimates. They provide the only regular

mechanism by which the components of population change are combined to

track changes in the size and demographic and geographic distribution

of the nation's population. The postcensal estimates provide

essential information for administration and planning in the

government and private sectors. In addition, they are used as a

standard by state and local governments and the private sector in

producing their own population estimates for smaller scale geography

or for greater social and economic detail.

4.1.2 History of Census Bureau estimates program

Since the early 1900s, the Census Bureau has produced national

population estimates. The methodology for these estimates developed

into a component method in which the measured components of population

change (births, deaths, immigration, and emigration) are added to or,

in the case of deaths, subtracted from the most recent decennial

census to estimate the current population.
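
The national component method amounts to a running account of population change. The following Python sketch shows the arithmetic with hypothetical figures; the function and field names are illustrative and are not taken from Census Bureau documentation.

```python
def component_estimate(census_base, annual_components):
    """Update a census base population with measured components of change.

    annual_components is a list with one dict per year since the census,
    each holding births, deaths, immigration, and emigration."""
    population = census_base
    for year in annual_components:
        population += (
            year["births"] - year["deaths"]
            + year["immigration"] - year["emigration"]
        )
    return population


# Hypothetical figures (in thousands), for illustration only.
components = [
    {"births": 20, "deaths": 10, "immigration": 5, "emigration": 2},
    {"births": 21, "deaths": 11, "immigration": 6, "emigration": 3},
]
current_population = component_estimate(1000, components)
```

With these figures the population rises from 1,000 to 1,013 after the first year and to 1,026 after the second.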

When the Census Bureau attempted state population estimates beginning

in the 1940s, it faced the difficult prospect of adding internal

migration to the other components of population change. Since annual

measures of internal migration by state are not available, many

attempts were made to develop other ways to estimate state population

change.

Through 1960, the principal method (known as Component Method II) was

to estimate net migration based on annual changes in school

enrollment. In the 1960s, a second method was added that estimated

changes in the population level rather than measuring the components

of population change. This method (the ratio-correlation method) uses

regression analysis that relates changes in selected independent

variables to changes in state population since the last census. These

independent variables come from federal or state data sources. In the

1960s, the major proxy variables were vital events, school enrollment,

tax returns, number of votes cast, motor vehicle registrations, and

building permits. In the 1970s, the variables for votes cast and

building permits were dropped and a variable for the size of the work

force was added.
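
The regression idea behind the ratio-correlation method can be sketched in Python with a single indicator. This is a deliberately simplified illustration and not the Census Bureau's specification: the actual method uses several indicators at once, and the exact form of the ratios is not given here. The sketch assumes that each observation is a State's growth ratio between two past censuses, for one indicator (for example, school enrollment) and for population, and that the fitted line is then applied to the indicator's change since the last census.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_population(census_population, indicator_ratio, intercept, slope):
    """Apply the fitted relationship to the indicator's change since the
    last census to estimate the current population."""
    return census_population * (intercept + slope * indicator_ratio)


# Hypothetical intercensal growth ratios for three States.
indicator_ratios = [1.0, 1.1, 1.2]      # indicator, later census / earlier census
population_ratios = [1.0, 1.05, 1.10]   # population, later census / earlier census
intercept, slope = fit_line(indicator_ratios, population_ratios)

# A State with 1,000 persons at the last census whose indicator has grown 30 percent.
estimate = estimate_population(1000.0, 1.3, intercept, slope)
```

The design point is that the regression is calibrated on a period where both population and indicators are observed (two censuses) and is then carried forward to years where only the indicators are available.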

As the demand for estimates spread to the county level, the

Federal-State Cooperative Program for Population Estimates was formed to

involve state governments in a joint effort with the Census Bureau.

This organization permitted the extension of Component Method II, the

ratio-correlation method, and a housing unit method to the county

level by providing data on school enrollment and various state

administrative data systems at the county level. This system

permitted the flexibility of using data sets selected for each

individual state.

The enactment of General Revenue Sharing created a demand for

population estimates for all general purpose governments (incorporated

places, towns, and townships). To estimate these subcounty areas the

Census Bureau returned to a component based method (the administrative

record method) in which migration was estimated using income tax data

from the Internal Revenue Service (IRS). This method required

matching addresses on successive years of tax returns and calculating

a migration rate based on the total number of exemptions that moved

into and out of each area. The key challenge in developing this

methodology was to design a suitable method of coding mailing

addresses to counties, incorporated places and minor civil divisions.

The result was a probability coding guide based on a question on place

of residence placed on the tax returns in selected years. This

methodology proved so successful that it was added as an independent

method in the estimation of state and county populations as well.

4.2 Program Description, Policies, and Practices

The level of geographic and characteristic detail and the

methodologies of the current population estimates program are legacies

of the expansion of estimates demands during the last three decades.

Tables 4.3 and 4.4 show the frequency, detail, and methodology used at

each geographic level of the Census Bureau's population estimates

program.

While the national population is estimated by age, sex, race, and

Hispanic origin, the subnational population estimates vary greatly in

demographic and socioeconomic detail. In general, the level of

characteristic detail declines as the level of geography becomes

finer. Each level of geography also has its own combination of

methods and input data. State population is currently produced on an

annual basis by age and sex. County estimates are produced annually

for the total population and, on an experimental basis, by age, race,

and sex. Estimates for the total population of metropolitan areas are

produced annually by summing the appropriate county data and by making

adjustments for New England areas which are composed of townships

rather than counties. Every other year, the Census Bureau produces

total population estimates for incorporated places, towns, and

townships.

4.3 Estimator Documentation

The methodology for postcensal estimates varies by level of geography

with the widest array of methods used in county estimates. This

methodological discussion focusses on the county estimates with

occasional extensions to include methods specific to states or places.

Postcensal population estimates update the last census population

based on changes in the population or in components of population

change. Actual information on such components of population change

as births and deaths or on changes in symptomatic indicators related

to changes in the population since the last census provide benchmarks

to anchor the estimates.

The art of postcensal estimation of population comes in choosing

appropriate benchmarks (or auxiliary data) to use in estimating the

population change since the last census. One type of benchmark data,

population flow data, consists of measures of the components of

population change (e.g., births, deaths, internal and external

migration). The other type of benchmark data, population stock data,

includes indicators that are correlated with population size and uses

changes in those indicators to estimate the total change in

population. Methods based on each of these two classes of data are

found in several variations in the Census Bureau's postcensal

population estimates program.

4.3.1 Flow methods

Flow methods are also known as component methods. They require some

estimation of each of the components of population change since the

last census. In the most general form, the component method reduces

to a basic accounting equation for population change.

Click HERE for graphic.

(IRS) for changes in filing addresses between two consecutive annual

tax filings (U.S. Bureau of the Census, 1988). In the estimates

process, tax returns from one year are matched with those from

previous years by matching Social Security numbers of the filers. For

persons with a new address, the new mailing address is coded to state,

place, and county. If the state, place, or county is different from

the previous year, the filer and all exemptions are classified as

migrants. These data are then used to construct net migration rates

for each county and place as an input to the population estimation

formula. An estimate of the rate of net migration is calculated by

dividing the net flow of exemptions (the tax filer plus his or her

dependents) moving into the area by the number of exemptions filed in

the area (see equation 4.2).

Click HERE for graphic.

This net migration rate is then multiplied by the initial population

as shown in equation 4.1. A critical assumption in this method is

that the population not covered by the administrative data set moves

similarly to the population covered or that the uncovered population

is too small to affect the results markedly. Since this assumption is

especially inappropriate for the population over 65 and for certain

military and institutionalized populations, those populations are

handled separately as explained below. Other potential problems

include the difficulty of coding addresses to geography, changes in

administrative coverage over time, and the elimination of

administrative data sources as governmental programs change.
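
As a rough illustration of how the net migration rate feeds the accounting equation, the Python sketch below computes the rate from matched exemptions and applies it to a base population. The function names, parameter names, the choice of denominator, and all numbers are assumptions made for illustration; the published equations are not available in this text, so this should not be read as the Census Bureau's exact formula.

```python
def net_migration_rate(in_exemptions, out_exemptions, base_exemptions):
    """Net flow of exemptions into the area divided by exemptions filed in the area.

    Which year's filings form the denominator is an assumption of this sketch.
    """
    if base_exemptions <= 0:
        raise ValueError("base_exemptions must be positive")
    return (in_exemptions - out_exemptions) / base_exemptions


def component_estimate(base_population, births, deaths, migration_rate):
    """Basic accounting identity: base + births - deaths + net migration.

    Net migration is approximated as the rate times the base population.
    """
    net_migration = migration_rate * base_population
    return base_population + births - deaths + net_migration


# Hypothetical county: 1,200 exemptions moved in, 900 moved out,
# and 20,000 exemptions were filed in the county.
rate = net_migration_rate(in_exemptions=1_200, out_exemptions=900, base_exemptions=20_000)
estimate = component_estimate(base_population=50_000, births=700, deaths=400, migration_rate=rate)
print(rate, estimate)
```

With these made-up inputs the rate is 0.015, so net migration adds 750 persons to the natural increase of 300.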

Click HERE for graphic.

migration rate of the school-aged population in the most recent

census. The critical assumption here is that the

relationship of net school-aged migration and net total migration

remains constant over time.

4.3.2 Change in Stock Methods

A fundamentally different approach to population estimates emphasizes

the total change in population size since the last census rather than

demographic components of change. These change in stock methods

relate changes in population size to changes in other measured

variables that are assumed to be correlated with population change.

The choice of possible variables is wide: number of housing units,

automobile registrations, total number of deaths (and/or births), tax

returns, etc. Note that births and deaths in this method are not

viewed as components but as indicators of the size of the population.

Similarly, drivers' licenses and tax returns are not used as indicators

of migration as they were in the flow methods but as proxies for the

size of the total population.

Click HERE for graphic.

The key assumption in this method is that the relationship among

geographic units between change in population and change in the

selected indicator variables remains constant over time (Tayman and

Schafer, 1985). Complications also arise if indicator variables

change over time in selected areas for reasons unrelated to population

-- for example, changes in the tax law, changes in general fertility

rates, increases in automobile registrations per person, etc.
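
The idea of carrying a past relationship between indicator change and population change forward can be sketched as follows. This is a deliberately simplified, one-indicator version of a ratio-correlation style calculation: the use of share ratios, the single predictor, the final scaling to a state total, and every number are assumptions for illustration, not the regression specification actually used in the program.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_counties(census_shares, indicator_ratio_now, state_total, intercept, slope):
    """Apply the fitted relationship to current indicator change, then scale to the state total."""
    raw_shares = [s * (intercept + slope * r) for s, r in zip(census_shares, indicator_ratio_now)]
    total = sum(raw_shares)
    return [state_total * share / total for share in raw_shares]


# Previous intercensal period (hypothetical): for each county, the ratio of its
# share of the state indicator total at the later census to its share at the
# earlier census, and the same ratio for population.
indicator_ratio_prev = [0.90, 1.00, 1.10, 1.20]
population_ratio_prev = [0.94, 1.00, 1.06, 1.12]
intercept, slope = fit_line(indicator_ratio_prev, population_ratio_prev)

# Current period (hypothetical): census population shares and indicator change since the census.
county_estimates = estimate_counties(
    census_shares=[0.1, 0.2, 0.3, 0.4],
    indicator_ratio_now=[1.05, 0.95, 1.00, 1.00],
    state_total=1_000_000,
    intercept=intercept,
    slope=slope,
)
print(intercept, slope, county_estimates)
```

The sketch makes the key assumption visible: the coefficients are fitted on the earlier period and reused unchanged, so any shift in the indicator that is unrelated to population passes straight into the estimate.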

Another population stock method, the housing unit method, estimates

the current population from the change in the number of households.

In this method, tax rolls, construction permits, certificates of

occupancy, or utility data could be used to calculate changes in the

number of housing units in an area (Smith and Mandell, 1984). In the

Census Bureau's methodology, the housing stock from the last census is

updated using data on housing construction, demolitions, and

conversions (Eq. 4.4).

Click HERE for graphic.

The number of households in area i for date t is estimated by

multiplying the estimated number of housing units at time t by an

updated estimate of the occupancy rate for area i at time t. By

assuming that the local occupancy rate changes in step with the national rate,

we can update the area's rate by multiplying the occupancy rate for

area i at the time of the census by the ratio of the national

occupancy rate at time t from the Current Population Survey (CPS) to

the national occupancy rate at the time of the census.

Click HERE for graphic.

Finally, the population for the area i is calculated by multiplying

the area's household estimate by an updated estimate of population per

household. Again we assume that the area's population per household

from the last census can be updated by multiplying by the ratio of the

national population per household from the CPS to the national

population per household in the last census.

Click HERE for graphic.
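
The three steps of the housing unit method described above (update the housing stock, update the occupancy rate, update population per household) can be sketched in Python as below. Parameter names, the treatment of conversions as a net addition, and all input values are assumptions for illustration only.

```python
def housing_unit_population(
    census_units, built, demolished, net_conversions,
    census_occupancy, national_occupancy_census, national_occupancy_now,
    census_pph, national_pph_census, national_pph_now,
):
    """Estimate an area's household population with the housing unit method."""
    # Step 1: update the housing stock from the last census.
    # Treating conversions as a signed net change is an assumption of this sketch.
    units_now = census_units + built - demolished + net_conversions

    # Step 2: move the local occupancy rate by the national ratio of change.
    occupancy_now = census_occupancy * (national_occupancy_now / national_occupancy_census)
    households_now = units_now * occupancy_now

    # Step 3: move local population per household (pph) by the national ratio of change.
    pph_now = census_pph * (national_pph_now / national_pph_census)
    return households_now * pph_now


# Hypothetical area and national rates.
population = housing_unit_population(
    census_units=10_000, built=600, demolished=100, net_conversions=0,
    census_occupancy=0.90, national_occupancy_census=0.90, national_occupancy_now=0.918,
    census_pph=2.5, national_pph_census=2.6, national_pph_now=2.47,
)
print(round(population, 3))
```

Because both local rates are moved only by national ratios, an area whose occupancy or household size departs from the national trend will be misestimated by exactly that departure.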

All of the methods discussed so far refer to the household population

under 65. The two other segments of the population, the population 65

and over and the group quarters population, are measured by their own

specific change in stock methodologies. Since these two groups have

unique characteristics (especially in terms of their migration

patterns), we use administrative records systems that are unique to

each of the two groups. The population 65 and over is estimated by using

changes in the Medicare population since the last census as a direct

measure of the change in the population 65 and over. No such

nationwide system exists for the group quarters population (defined

for estimates purposes as the population in military barracks, college

dormitories, prisons and other institutions). Changes in this

population since the last census are obtained from an inventory of

major group quarters locations that is maintained and annually updated

by a special data collection process in the Population Estimates

Branch of the Population Division in cooperation with state agencies

affiliated with the Federal-State Cooperative Program for Population

Estimates.
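
The Medicare-based update for the population 65 and over amounts to adding the change in enrollment to the census count. A minimal sketch, with hypothetical names and numbers:

```python
def population_65_and_over(census_65_plus, medicare_at_census, medicare_now):
    """Census count of persons 65 and over plus the change in Medicare enrollment since the census."""
    return census_65_plus + (medicare_now - medicare_at_census)


# Hypothetical county: 12,000 persons 65 and over at the census,
# Medicare enrollment rising from 10,400 to 10,900 since then.
elderly_now = population_65_and_over(12_000, 10_400, 10_900)
print(elderly_now)
```

Using the change in enrollment rather than its level means that incomplete Medicare coverage matters only to the extent that coverage itself shifts between the two dates.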

4.3.3 Combined methods

The U.S. Census Bureau's postcensal population estimates program

combines methods in two ways. Within each level of geography (states,

counties, and places) several of the above methods are combined (Table

4.4). Since certain methods represent given subpopulations better, a

combination of methods may be viewed as more robust -- less likely to

change due to extraneous factors that might affect one or the other of

the estimates. There is a further mixing of methods since the

estimates at each level of geography are controlled to the results of

the estimates made at the next higher level of geography.

The methodology for making state estimates during the 1980s averaged

the results of the administrative record method with those of the

composite method. In the composite method, the population is divided

into three age groups, each of which is estimated by a separate

method. The population under 15 is estimated using changes in the

levels of school enrollment (similar to Component Method II). The

population ages 15-64 is estimated by a ratio-correlation method in

which the independent variables are tax returns, school enrollment,

and housing units. The population 65 and over is estimated using a method

in which changes in the number of persons on Medicare since the last

census date are added to the population aged 65 and over at the last

census (U.S. Bureau of the Census, 1984). The total state population

by age is then controlled to equal the estimated national population

age structure.

Annual county population estimates are produced independently for each

state to coincide with the state's total population estimated above.

A distinct methodology for each state is developed in consultation

with that state's member of the Federal-State Cooperative Program for

Population Estimates. In most cases, it consists of the average of

two or three of the methods described above: the administrative

records method, component method II, and the ratio-correlation method.

Moreover, within the ratio-correlation method, different states use

different independent variables which may include school enrollment,

tax returns, Medicare enrollment, automobile registrations, births,

deaths, dummy variables for county size, or other state-specific data

series. Additional adjustments are made for changes in selected

military and institutional populations and for changes in the

population over 65. Final results are controlled to the state

population estimate produced by the Census Bureau using a uniform

method across all states (van der Vate, 1988).
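
The two ways of combining methods, averaging within a level of geography and controlling to the next higher level, can be illustrated with a short Python sketch. Equal weights for the averaged methods, simple proportional scaling for the control step, and all figures are assumptions for illustration.

```python
def average_methods(estimates_by_method):
    """Average several methods' estimates area by area (equal weights assumed)."""
    n_methods = len(estimates_by_method)
    return [sum(values) / n_methods for values in zip(*estimates_by_method)]


def control_to_total(estimates, control_total):
    """Scale estimates proportionally so they sum to the higher-level control total."""
    factor = control_total / sum(estimates)
    return [value * factor for value in estimates]


# Hypothetical estimates for three counties from three methods.
admin_records = [10_200, 20_500, 30_100]
component_ii = [10_000, 20_900, 29_900]
ratio_correlation = [10_400, 20_600, 30_000]

averaged = average_methods([admin_records, component_ii, ratio_correlation])
controlled = control_to_total(averaged, control_total=61_000)  # hypothetical state estimate
print(averaged, controlled)
```

Proportional scaling preserves each county's share of the averaged total, so any error in the state control is spread across counties in proportion to their size.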

Place estimates use a strict administrative record methodology where

migration is based solely on the migration rates derived from changes

in addresses on tax returns. The only other adjustments for place

estimates are for changes in selected military and institutional

populations and a final control to county level population estimates

(U.S. Bureau of the Census, 1980).

4.4 Evaluation Practices

The estimation process demands continuous vigilance. Methods that

appear to work well at the beginning of a decade may be unsatisfactory

later in the decade. Only constant testing, data evaluation, quality

control, and checks for reasonableness can ensure a sound program of

population estimation.

Whatever the method of estimation chosen, a number of considerations

should be kept in mind. No matter how sophisticated the methodology,

the estimate will only be accurate if the underlying assumptions hold

and the input data are reliable. Many things can happen to endanger

these conditions. For example, the relationships that held between

variables in a previous decade might no longer hold in the current

decade. The data series that one is depending upon to update the

population may deteriorate or fail to measure the same underlying

phenomenon as conditions change. Even if the administrative or other

indicator data measure the population well, there may well be problems

of geographic coding that fail to assign the population to the correct

geography.

Finding an appropriate yardstick against which to measure the

postcensal population estimates is difficult. During the decade,

aside from special censuses for a handful of places, there are no

suitable numbers to compare to the estimates -- thus we know little

about the short run accuracy of population estimates. We can only

measure their accuracy at the extreme end of their range (after 10

years) using the next decennial census. Even here, the changing level

of coverage between censuses for any given area can lead to

imprecision in our measurement of estimates accuracy. Using the

results of the 1980 and 1990 censuses as enumerated, the Census Bureau

evaluated the accuracy of the population estimates program. The

results (summarized in Table 4.5) show that population estimates made

for the nation, for states, and for counties were reasonably accurate,

but that estimates made for small places were quite inaccurate.

Estimates for places under 5,000 had a mean absolute error of more

than 15 percent while places over 50,000 had a mean absolute error of

less than 5 percent.

The last two columns in Table 4.5 present a more telling comparison.

Column two compares the 1990 census and the provisional 1990

postcensal estimate while column three compares the 1990 census with

the 1980 census. For most levels of geography the postcensal

population estimate provides a far more accurate estimate than simply

holding the population constant at the level of the last census. For

example, state postcensal estimates had a mean absolute error of only

1.5 percent, while holding the last census constant would give an

error of 10.0 percent. On average, the estimates methodology is also

much better than using the last census for counties and incorporated

places over 5,000 population. However, for many incorporated places

under 5,000, holding the population constant at the 1980 level would

have given more accurate results than did our postcensal estimate

program.
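
The comparison described here, the error of the postcensal estimate against the error of simply holding the last census constant, can be sketched as below. The function name and the three-area data set are invented for illustration and do not reproduce any figure in Table 4.5.

```python
def mean_absolute_percent_error(values, census_counts):
    """Mean of |value - census| / census, expressed in percent."""
    errors = [abs(v - c) / c * 100 for v, c in zip(values, census_counts)]
    return sum(errors) / len(errors)


# Hypothetical areas: 1990 census counts, provisional 1990 estimates, and 1980 census counts.
census_1990 = [1_000, 2_000, 4_000]
estimate_1990 = [1_050, 1_960, 4_040]
census_1980 = [900, 2_200, 3_600]

estimate_error = mean_absolute_percent_error(estimate_1990, census_1990)
holding_constant_error = mean_absolute_percent_error(census_1980, census_1990)
print(estimate_error, holding_constant_error)
```

An estimates method earns its keep for a class of areas only when the first figure is smaller than the second.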

These inaccuracies for small places may be due to a number of sources:

the problem of coding administrative records to small units of

political geography, the greater importance of migration in population

change for small areas, and the greater likelihood that the broad

assumptions that might apply on average for larger areas would not

apply to small localities with very specific characteristics. Since

the Census Bureau is required by law to produce data for all

incorporated places and townships, we will need to show places under

5,000 as well as the larger places for which we can produce good

estimates. However, it is incumbent on us to show the uncertainty in

the estimates for small areas in future publications in addition to

making continual progress in refining and improving our estimates

methodologies and data bases.

4.5 Current Problems and Planned Activities

Many of the problems of the current population estimates system are

the results of its past success and rapid growth during the 1960s and

1970s. Each new program, each expansion of characteristic detail,

each reduction in the size of geographic unit has been accompanied by

new data sets, by new methods, and by new production procedures.

Although the Census Bureau has done a good job of meeting users'

expectations as these demands have increased, there is room for

improvement in the estimates methodology and operations.

We have embarked on a set of seven initiatives to revamp the

population estimates program and lead it into the next century. These

initiatives fall under the following headings: 1) defining the

mission, 2) methodological integration, 3) input data quality, 4)

geographic flexibility, 5) characteristic detail, 6) analysis of

trends, and 7) production efficiency.

4.5.1 Defining the Mission

The products currently estimated by the Census Bureau's Population

Estimates Program are the results of opportunities and legislative

requirements over a period of three decades. We plan to reexamine the

demands for and uses of population estimates. A thorough study of the

needs for population estimates and the Census Bureau's proper mission

in filling those needs is an initial priority. We are currently

polling a number of our users -- Federal government agencies, the

Federal-State Cooperative Program members, private data vendors, and a

number of other groups to ascertain their needs for population

estimates.

Some of the suggestions received so far involve modifying the

population estimates program in order to produce more detailed

characteristic information at the state and county level. We hope to

produce age, sex, race, and Hispanic Origin data for counties. With

more research, we may also be able to produce the county-level data on

households -- number, size, and income -- that is currently demanded

by many users. We are examining the feasibility of producing

estimates for larger places on a yearly basis and producing estimates

for other subcounty geography as well -- possibilities include census

tract aggregates, subareas within large cities, and (for some

purposes) Zip codes.

4.5.2 Methodological Integration

The many different methods of estimating population developed over the

past decades have resulted in a complex population estimates program.

The need now is to integrate these disparate methods into an orderly

system. Traditionally, the various estimation models used at the

Bureau have been integrated by a simple averaging of the different

estimates at a given level of geography and by controlling the sum of

estimates at one level of geography to the averaged estimate at the

next higher level.

The time has come to reexamine each set of methods for suitability as

parts of an integrated, parsimonious model for producing population

estimates. In order to discuss methods of integrating our current

methods, it is useful to distinguish between methods that measure the

changes in the population stock and those that measure the components

of population change. Methods showing the change in population stock

(the ratio-correlation method, the Medicare change methodology, and

the change in group quarters population) use changes in proxy

variables since the last census to produce estimates of the total net

change since the last census. These methods permit the use of many

symptomatic measures of population size that may not be amenable to a

flow approach.

Component methods such as the administrative records method and

component method II represent flow methods in which the components of

population change (births, deaths, international migration, and

internal migration) are each measured separately and added to or

subtracted from the initial population. The advantage of this type of

method is that it gives an estimate not only of the population but

also of the components of population change. This method provides

additional information about the reasons for change and the

reasonableness of the estimates, and provides inputs for population

projections. Component methods are often preferable for larger areas

because they use relatively accurate counts of births and deaths to

compute a large part of population change. Consequently,

administrative records which are often less accurate need only be used

to estimate the portion of population change due to migration.

Current research at the Census Bureau is underway to quantify the

relative effects of errors in each component on the final population

estimates. For small areas, these advantages disappear and change in

stock methods such as the housing unit method may be more appropriate.

As we integrate methods, we should be careful to retain the

flexibility offered by multiple independent methods of estimating

population. Since methodologies for population estimates are

dependent upon the use of data sets collected for purposes other than

population estimates, the quality and availability of a given input

data set is never certain. Only with multiple methods can we be

assured of the ability to produce timely and reliable

population estimates. Multiple methods also provide a necessary check

on the validity of the estimates results; surprising changes in

demographic trends can be checked using independent sources in order

to see if the results are merely idiosyncrasies of a given input data

source. The existence of independent methods of estimating population

could prove a distinct advantage in trying to gauge the accuracy of

estimates between censuses. We should examine the potential of using

measures of divergence between independent estimates to determine the

reliability and degree of confidence we have in the accuracy of

postcensal estimates. If three independent estimates give very close

values, we should have more confidence in those estimates than if the

estimates vary widely.
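
One simple measure of divergence between independent estimates is their range relative to their mean. The sketch below is only one possible choice of measure; the function name, the 5 percent review threshold, and the figures are all hypothetical.

```python
def relative_spread(estimates):
    """Range of the independent estimates as a percentage of their mean."""
    mean = sum(estimates) / len(estimates)
    return (max(estimates) - min(estimates)) / mean * 100


REVIEW_THRESHOLD_PERCENT = 5.0  # hypothetical cutoff for flagging an area

# Three independent estimates for each of two hypothetical counties.
close_agreement = relative_spread([10_000, 10_100, 9_900])
wide_disagreement = relative_spread([10_000, 12_000, 8_000])

for label, spread in [("county A", close_agreement), ("county B", wide_disagreement)]:
    status = "review" if spread > REVIEW_THRESHOLD_PERCENT else "ok"
    print(label, round(spread, 1), status)
```

Agreement among methods is evidence of reliability only to the degree that the methods draw on genuinely independent input data.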

4.5.3 Input Data Quality

Perhaps even more important than the type of method chosen is the

choice of data set used in the estimate. Producing postcensal

population estimates requires integrating traditional

demographic data sets such as census results, birth and death records,

and immigration statistics with nontraditional sources collected for

other administrative purposes such as tax returns, school enrollment,

drivers' licenses, housing construction, survey data, etc. The art of

population estimation is to combine these traditional and

nontraditional sources to take maximum advantage of all the data

available.

The most challenging aspect of working with population estimates is

the use of data sets designed and collected for administrative

purposes rather than for statistical or demographic purposes.

Ideally, such data sets should have universal coverage, change in

direct relation with population changes, and be consistent over time

in content and form. No data set actually meets these criteria. The

level of population coverage is often less than 100 percent.

Programmatic changes or changes in social behavior independent of

population change may affect the coverage rate. Worst of all, the

administrative data set may even disappear if its programmatic need or

funding disappears.

Consequently, a healthy population estimates program requires careful

attention to the quality and timeliness of input data as well as to

the reliability of access to the input data. This requires working

with our data providers to monitor the input databases on a number of

requirements including reliability, consistency, coverage,

characteristic detail, and idiosyncrasies produced by programmatic and

other changes. It also entails work with data producers to address

questions of mutual interest such as cost, confidentiality, and legal

requirements for data handling. Since administrative datasets may

disappear over time, work must also continue on nurturing alternative

data sets to provide similar or superior data. The need for

flexibility to address changing data set availability and quality is

yet another argument for using multiple independent methods and data

sets to provide redundancy in the estimates program.

4.5.4 Geographic Flexibility

Linking data on population to geography is the key to population

estimation methodology. Any system for making subnational population

estimates must have a credible method for developing such geographic

correspondence. Population estimates are required for legally defined

geographic entities such as counties and incorporated places and the

estimates methodology must take these requirements into account. In

the county estimates conducted jointly with states under the

Federal-State Cooperative Program, we assume that the input data used

in the ratio-correlation methodology are correctly coded by

county of residence.

In the administrative records method, which matches tax returns to determine

state, county, and place migration, the Census Bureau must provide the

geographic coding for movers based on the mailing addresses of filers

from the IRS tax forms. In order to categorize these filers by county

and place of residence, the current methodology uses a probability

coding guide. With the aid of data from a residence question on the

1980 tax form, mailing addresses were categorized by P.O. name, state,

Zip code, and address type (street address, P.O. box, RFD route) and

assigned a probability of falling within each of 3,100 counties and

39,000 places.
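
A probability coding guide of this kind can be pictured as a lookup from an address key to a set of county probabilities. The sketch below allocates exemptions to counties by expected share; that allocation rule, the key layout, the place names, and the probabilities are all assumptions for illustration, since the actual assignment procedure is not described here.

```python
# Hypothetical coding guide: (post office, state, Zip code, address type) -> county probabilities.
coding_guide = {
    ("SPRINGFIELD", "XX", "00001", "street"): {"County A": 0.8, "County B": 0.2},
    ("SPRINGFIELD", "XX", "00001", "po_box"): {"County A": 0.5, "County B": 0.5},
}


def allocate_exemptions(counts_by_key, guide):
    """Spread each address key's exemptions across counties by the guide's probabilities."""
    totals = {}
    for key, exemptions in counts_by_key.items():
        for county, probability in guide[key].items():
            totals[county] = totals.get(county, 0.0) + exemptions * probability
    return totals


# Hypothetical exemption counts observed for each address key.
observed = {
    ("SPRINGFIELD", "XX", "00001", "street"): 100,
    ("SPRINGFIELD", "XX", "00001", "po_box"): 40,
}
county_totals = allocate_exemptions(observed, coding_guide)
print(county_totals)
```

The probabilities are fixed at the time the guide is built, so the allocation drifts as the population within an address key redistributes across county lines.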

There are several problems that lead to deterioration in the coding

guide over time. Some of the more obvious ones can be corrected by

manual adjustments in the coding guide, e.g., creation of new Zip codes

or revised boundaries for old Zip codes, changes in the boundaries of

incorporated places, etc. A key cause of deterioration that cannot be

fixed is the change over time in the distribution of the population

within a given address key (post office, state, Zip code, address type

combination). To the extent that those changes in distribution cross

county, town, and city boundaries, the resulting coding will be

incorrect. Moreover, the probability system itself may well put

individual persons in the wrong county or place. We know little about

how these errors propagate through the system after several years and

multiple migrations.

The Census Bureau is currently developing a new geographic coding

system that permits frequent updating and, if possible, exact matching

rather than probability matching of addresses to geography. The

system is based on the "master address list" proposed by the Census

Bureau's Geography Division as an outgrowth of the development of the

TIGER digitized mapping project and the "address control file" created

for use with the 1990 census. This system would provide an annually

updated digitized data base that could place most addresses in the

United States into the appropriate census block (and thus into any

unit of geography that also has its boundaries in the TIGER system).

In the estimates area, we are exploring the feasibility of developing

a coding system that would code street addresses to subcounty areas

using such a master address list. The existence of a continuously

updated master address list could provide far greater geographic

detail, ease of updating and correcting for boundary changes, and

flexibility in dealing with changing geographic concepts and shifts in

population distribution.

This methodology also provides the promise of a far greater benefit in

the future. The ability to provide exact matching based on geography

might one day permit the matching of records on the basis of address

rather than an identifier such as Social Security number. Such an

ability would provide the opportunity to bring far more information

sources to bear on the estimation effort.

4.5.5 Characteristic Detail

Another major area for innovation is the expansion of data on

population characteristics, both demographic characteristics such as

age, race, and sex and social/economic characteristics such as

household structure and income. In order to get a better hold on the

demographic structure of substate areas and to use as a denominator in

calculating incidence rates, there is a major increase in the demand

for age, race, and sex distributions at the county level between

censuses. These data are not available from the IRS tax records that

form the principal part of our administrative records processing.

Consequently, we are developing alternative methods to provide these

data for counties and large places as an integral part of the

estimation process.

We have experimented with a number of possible approaches. One of

these experimental programs developed a projected estimate in which

county trends in migration by age, race, and sex from the previous

decennial census were extrapolated into the current decade and added to

actual birth and death rates to produce a population by age, race, and

sex that was then controlled to the official estimate of total

population for a county. Another experimental program extends the

current administrative record method by adding information by age,

race, and sex from Social Security records to a sample of IRS returns

to provide internal migration data for states and large metropolitan

areas. Current plans call for integrating these programs into our

standard procedures by the mid-1990s.

There are also possibilities for using survey data combined with

administrative data to obtain characteristic information. While

matching survey and administrative data records on an individual basis

may prove difficult, there have been efforts to combine data on an

aggregate basis. A recent example is an analysis of internal

migration that combined aggregate data from the decennial census,

matched tax return migration data, and survey data from the Current

Population Survey (CPS) to provide a time series of migration by

characteristics for state to state flows. Research is proceeding on

whether more information from surveys could be combined with the

administrative record methods by either aggregate or individual

statistical modeling approaches.

Another major effort is underway to produce estimates for housing

units and households for survey controls to the American Housing

Survey and other housing based surveys. This program uses data on

additions and deletions from housing stock to update the housing

inventory from the decennial census. While this method is similar to

the housing unit method for population estimates described above, the

resulting housing unit estimates are used directly as survey controls

rather than only used to estimate population.

There is also the potential for integrating more administrative data

into the estimates procedure. A number of federal, state, and even

private data sets have been suggested. Possible data sets include

state tax data, post office change of address forms, state driver's

license information, food stamp enrollment information, utility hookup

records, and telephone directory information. These and other data

sets will be explored for their potential utility for making

subnational estimates, with proper attention given to protection

of privacy and proper disclosure safeguards.

4.5.6 Analysis of Trends

A prime advantage of the population estimates program is its

information on the changes in spatial population distribution between

censuses. While the Census Bureau has put great emphasis on the

production of estimates for individual states, counties, and places, we

have only occasionally provided the summary information on the broader

trends in population redistribution. An analysis of population

redistribution trends between cities and suburbs, high and low density

areas, areas of high and low unemployment, and other analytical

categories should be an annual part of our activities. In order to do

this, a simple first step is to classify counties by relevant

analytical characteristics so that such summaries could be a standard

part of our processing. In addition, we plan an annual analytical

report on population distribution trends based on the entire range of

population estimates.

Much of the intermediate data on components of population change

(migration, births, deaths, numbers of housing units, etc.) used in

constructing the population estimates is of analytical interest in its

own right. These data should be developed as their own data products

and used to provide an analytical view of the dynamics of current

population change. An integrated set of historically consistent data

series on births, deaths, and international and internal migration should

be developed for all major geographic areas for which population

estimates are produced. As a first step, we are producing a

consistent time series of population counts for all counties and for

cities over 25,000 from 1790 through 1990.

4.5.7 Production Efficiency

The uncoordinated and erratic growth pattern in the population

estimates area has had a substantial effect on production efficiency.

During the 1980s, delays in production and unreliable publication

dates have frequently resulted from the unwieldiness of the current

production process. For the 1990s, we have streamlined the production

process as a result of more parsimonious methodologies and a more

focused set of products. Many of our users repeatedly tell us that it

is more important to have a firm production date than to be too

optimistic in our timetables. Efforts toward redesigning the

estimates product have as a major goal a firm production schedule with

realistic deadlines. While considerable progress has been made on

this commitment, we expect to strive toward continuous improvement in

timeliness as well as reliability and cost reduction.

4.6 Conclusion

Postcensal population estimates are an integral part of the

U.S. statistical system combining census results with tabulations on

vital events, providing the population controls by which household

survey results can be weighted, and producing a continuous and

up-to-date time series of changing population size and distribution

between censuses. These estimates are only possible with the creative

use of censuses, vital events, administrative data, and other

unconventional sources for estimating changes in population on a

timely basis.

As we approach the twenty-first century, the population estimates

program provides an ideal starting point for an integrated demographic

and social accounting system. The system already unites the decennial

census and population survey results through a series of longitudinal

controls. These longitudinal controls are based on previous censuses

and vital events, and could be modified to incorporate measurements of

undercoverage if desired. In the 1990 census, the estimates system

provided substantial information for coverage improvement during the

operation of the census and in evaluating coverage after the results

were in. The system provides the opportunity to integrate the results

of administrative records collected for other purposes to augment and

improve traditional demographic data. Our efforts to integrate our

geographic coding with the decennial census data base (TIGER), to

maintain estimates of housing units and households as well as

population, and to use data on social and economic characteristics

from surveys in the estimation process take us beyond a purely

demographic system to an enhanced estimates program that could

eventually provide continuously updated data on many of the variables

now only measured by the census. Moreover, such an integrated

estimates system could provide data on the components and rhythm of

population, housing, geographic, social, and economic change that no

individual data source can now provide.

Table 4.1: USES OF CENSUS BUREAU POPULATION ESTIMATES

National

- Survey Controls

- National Social and Economic Series

- Descriptive and Analytical Studies

- Controls for Subnational Estimates

State

- Direct Federal Fund Allocation Formulas

- Indirect Federal Fund Allocation

- Denominators for Federal and Other Data Series

- Federal Regulatory Actions

- Survey Controls

- Descriptive and Analytical Studies

- Controls for Substate Estimates

Counties

- Fund Allocation by State Governments

- Denominators for Federal, State, and Other Data Series

- Regulatory Action by State Governments

- Guides for Government and Private Sector Planning

- Descriptive and Analytical Studies

- Federal Data Series

- Controls for Subcounty Estimates

Places

- Federal Block Grants

- Fund Allocation and Regulatory Actions by Federal and

State Governments

- Descriptive and Analytical Studies

- Government and Private Sector Planning

- Private Sector Marketing Efforts

- Base Data for Private Sector Data Development

BIBLIOGRAPHY

Batutis, Michael J. 1991. "Subnational Population Estimates Methods

of the U. S. Bureau of the Census," U. S. Bureau of the Census,

Population Division Working Paper.

General Accounting Office. 1990. Federal Formula Programs: Outdated Population

Data Used to Allocate Most Funds. September. GAO/HRD-90-145.

General Accounting Office. 1991. Formula Programs: Adjusted Census Data Would

Redistribute Small Percentage of Funds to States. November. GAO/GGD-92-12.

Mandell, M. and J. Tayman. 1982. "Measuring Temporal Stability in Regression

Models of Population Estimation." Demography, 19:135-136.

Namboodiri, N. K. 1972. "On the Ratio-Correlation and Related Methods of

Subnational Population Estimation." Demography. 9:443-453.

National Academy of Sciences. 1980. Estimating Population and Income of Small

Areas. Washington, D.C., National Academy Press.

O'Hare, W. P. 1976. "Report on a Multiple Regression Method for Making

Population Estimates." Demography. 13:369-379.

O'Hare, W. P. 1980. "A Note on the Use of Regression Methods in Population

Estimates." Demography. 17:341-343.

Roe, Linda K., John F. Carlson, and David A. Swanson. "A Variation of

the Housing Unit Method for Estimating the Population of Small, Rural

Areas: A Case Study of the Local Expert Procedure," Survey

Methodology, 19: 155-163

Smith, Stanley K. and Bart Lewis. 1980. "Some New Techniques for Applying the

Housing Unit Method of Local Population Estimation," Demography, 17: 323-339.

Smith, Stanley K. and Bart Lewis. 1983. "Some New Techniques for Applying the

Housing Unit Method of Local Population Estimation: Further Evidence",

Demography, 20: 407-413.

Smith, Stanley K. and Marylou Mandell. 1984. "A Comparison of

Population Estimation Methods: Housing Unit Versus Component II, Ratio

Correlation, and Administrative Records," Journal of the American

Statistical Association, 79:282-289.

Smith, Stanley K. 1986. "A Review and Evaluation of the Housing Unit

Method of Population Estimation," Journal of the American Statistical

Association, 82: 287-296.

Statistics Canada. Population Estimation Methods: Canada. Ottawa: Ministry of

Supply and Services.

Swanson, David A. 1980. "Improving Accuracy in Multiple Regression

Estimates of Population Using Principles from Causal Modelling,"

Demography. 17:413-427.

Swanson, David A. 1989. "Confidence Intervals for Postcensal

Population Estimates: A Case Study for Local Areas," Survey

Methodology. 15: 271-280.

Swanson, David A. and L. Tedrow. 1984. "Improving the Measurement of Temporal

Change in Regression Models Used for County Population Estimates," Demography.

21: 373-381.

Tayman, Jeff and Edward Scharer. 1985. "The Impact of Coefficient

Drift and Measurement Error on the Accuracy of Ratio-Correlation

Population Estimates." The Review of Regional Studies. 15:3-10.

U. S. Bureau of the Census. 1980. "Population and Per Capita Money

Income Estimates for Local Areas: Detailed Methodology and

Evaluation," Current Population Reports. Series P-25, No 699.

U. S. Bureau of the Census. 1983. "Evaluation of Population

Estimation Procedures for States, 1980: an Interim Report." Current

Population Reports. Series P-25, No. 933.

U. S. Bureau of the Census. 1984. "Estimates of the Population of

States: 1970 to 1983," Current Population Reports. Series P-25,

No. 957.

U. S. Bureau of the Census. 1985. "Evaluation of 1980 Subcounty Population

Estimates," Current Population Reports. Series P-25, No. 963.

U. S. Bureau of the Census. 1986. "Evaluation of Population

Estimation Procedures for Counties: 1980," Current Population Reports.

Series P-25, No. 984.

U. S. Bureau of the Census. 1987. "State Population and Household

Estimates, With Age, Sex, and Components of Change: 1981-1986",

Current Population Reports. Series P-25, No. 1010.

U. S. Bureau of the Census. 1988. "Use of Federal Tax Returns in the

Bureau of the Census' Population Estimates and Projections Program".

Population Division Working Paper.

U. S. Bureau of the Census. 1988. "Methodology for Experimental

County Population Estimates for the 1980s", Current Population

Reports. Special Studies. Series P-23, No. 158.

U. S. Bureau of the Census. 1989. "Population Estimates by Race and

Hispanic Origin for States, Metropolitan Areas, and Selected Counties:

1980 to 1985." Current Population Reports. Series P-25,

No. 1040-RD-1.

U. S. Bureau of the Census. 1989. "County Population Estimates: July

1, 1988, 1987, and 1986," Current Population Reports, Series P-26,

No. 88-A.

U. S. Bureau of the Census. "Population Estimates for Metropolitan

Statistical Areas: July 1, 1988, 1987, and 1986," Current Population

Reports. Series P-26, No. 88-B.

U. S. Bureau of the Census. 1990. "State Population and Household

Estimates: July 1, 1989." Current Population Reports. Series P-25,

No. 1058.

U. S. Bureau of the Census. 1990. "1988 Population and 1987 Per Capita

Income Estimates for Counties and Incorporated Places," Current

Population Reports. Series P-26, No. 88-SC.

van der Vate, Barbara J. 1988. "Methods Used in Estimating the

Population of Substate Areas in the United States," Paper presented at

the International Symposium on Small Area Statistics, New Orleans, LA,

Aug. 26-27.

CHAPTER 5

Bureau of Labor Statistics' State and Local Area

Estimates of Employment and Unemployment

Richard Tiller and Sharon Brown, Bureau of Labor Statistics

Alan Tupek, National Science Foundation

5.1 Introduction and Program History

The Bureau of Labor Statistics' (BLS) Local Area Unemployment

Statistics (LAUS) Program produces state and area employment and

unemployment estimates under a federal-state cooperative program. At

present, monthly employment and unemployment estimates are prepared

for the 50 states and the District of Columbia, all Metropolitan

Statistical Areas (MSA's), all counties, and selected subcounty areas

for which data are required by legislation -- more than 5,300 areas.

The Current Population Survey (CPS), conducted by the Bureau of the

Census for the BLS, is the official survey instrument for measuring

the labor force in the United States. The CPS sample provides direct

monthly survey estimates of employment and unemployment for the

nation, selected states and New York City and Los Angeles. However,

the CPS sample is not sufficiently large in most states and substate

areas to provide reliable monthly estimates. Therefore, methods are

used to combine data from other sources with current and historical

CPS sample estimates to produce monthly estimates of employment and

unemployment for the remaining states, the District of Columbia, and

substate areas.

The CPS began during the Great Depression as a project of the

Works Project Administration (WPA). During and following World War

II, the need for unemployment data at the local level began to

develop. A number of state and federal agencies began making

estimates using various procedures. In 1950, the U.S. Department of

Labor's Bureau of Employment Security, in an attempt to standardize

the estimation methods, issued guidelines in a booklet: Techniques for

Estimating Unemployment. In 1960, the Handbook Method on Estimating

Unemployment was issued. This building block or accounting method for

developing total employment and unemployment estimates is essentially

still used for substate areas today. About the same time, Congress

began passing legislation using local unemployment data for the

allocation of funds, such as the Area Redevelopment Act in 1961 and

the Public Works Economic Development Act in 1965. Legislated

programs which currently allocate funds to states and local areas

based on unemployment estimates, include the "Disadvantaged Adults and

Youths", "Summer Youth", and "Dislocated Workers" programs of the Job

Training Partnership Act, the "Emergency Food and Shelter Program",

and the "Public Works Program". In FY91, more than 9 billion dollars

in appropriations to states and local areas were based, in full or in

part, on local area unemployment statistics.

In 1972, the BLS acquired responsibility for unemployment

statistics. BLS subsequently introduced changes to the Handbook

methodology, including the use of annual average estimates from the

CPS as controls for the state and area monthly estimates. Beginning

in 1973, the CPS sample size was expanded to allow for direct sample

based estimates for the 10 largest states, Los Angeles and New York

City. In 1984, an 11th state was added. The Handbook method was

still used for the 39 remaining states and the District of Columbia.

However, a 6-month moving average adjustment using CPS data was

applied to the state estimates. For substate areas, Handbook

estimates are prepared for all labor market areas in the state, which

are controlled to the state CPS-based estimates of employment and

unemployment. At present, monthly employment and unemployment

estimates are also prepared for all MSA's, all counties, and selected

subcounty areas for which data are required by legislation. In 1989,

a new methodology was introduced for producing monthly state

employment and unemployment statistics for the 39 smaller states and

the District of Columbia. This method is a time-series regression

model, and uses a state-space Kalman filter approach.

Monthly estimates for the 39 smaller states and the District of

Columbia are published approximately 6 weeks after the reference week

of the CPS, which is the week including the 12th. Sample based

estimates for the largest 11 states are released a few weeks earlier

(usually the first Friday of the month following the reference month)

with the national estimates. Estimates for the 39 smaller states and

the District of Columbia are revised a month later to reflect

revisions in the Unemployment Insurance Statistics and the Current

Employment Statistics (Payroll Employment Survey) Program, which are

used in both the Handbook Method and State Modeling Method. At the

end of the year, monthly state estimates are revised (benchmarked) so

that their annual average equals the CPS sample based annual average.

For the 11 large states, data are revised to reflect population

controls.

5.2 Program Description, Policies, and Practices

Only five labor force estimates are published monthly for state

and substate areas: Civilian noninstitutional population, civilian

labor force, employed, unemployed, and the unemployment rate. Each

month a press release - The Employment Situation - is issued and the

Commissioner of Labor Statistics testifies before the Joint Economic

Committee of Congress. The press release includes employment and

unemployment estimates for the 11 largest states, in addition to

national estimates. These data, as well as data for the remaining 39

states, the District of Columbia and Metropolitan Statistical Areas

(MSA's) are published about four weeks later in Employment and

Earnings. Seasonally adjusted data are provided for all 50 states and

the District of Columbia, beginning in January 1992. Although the

data for the smaller states are published with the data for the 11

largest states, the data are published in two sets in a table. The

estimating methodology for the smaller states is provided in a

footnote at the bottom of the page. A separate monthly publication -

State and Metropolitan Unemployment - also includes data using direct

sample survey estimates for the 11 largest states, the State Model

methodology for the remaining states and the state CPS additively-

adjusted Handbook method for sub-state estimates. This publication

provides more detailed estimates for sub-state areas. In all, monthly

labor force estimates are provided for 5,300 areas, including

Metropolitan Statistical Areas (MSA's), Labor Market Areas (LMA's),

all counties (cities and towns in New England), and cities of 25,000

population or more.

Estimates for all but the 11 largest states, Los Angeles, and New

York City are revised a month following the initial publication in

which they appear, and again at the end of the year. The first

revision takes into consideration revisions to the Payroll Employment

and Unemployment Insurance statistics. The end of year revision

adjusts the monthly estimates such that their annual average equals

the CPS sample based annual average estimates for those states and

sub-state areas for which CPS data are provided.
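The end-of-year revision can be illustrated with a small sketch. It assumes a simple additive adjustment, in which the same constant is added to every month so that the annual average of the monthly estimates equals the CPS annual average. The function name, the sample figures, and the additive form itself are illustrative assumptions, not the documented benchmarking procedure.

```python
def benchmark_additive(monthly_estimates, cps_annual_average):
    """Shift monthly estimates so their mean equals the CPS annual average.

    Assumption: a constant additive shift is applied to every month;
    an operational procedure may distribute the adjustment differently.
    """
    current_average = sum(monthly_estimates) / len(monthly_estimates)
    shift = cps_annual_average - current_average
    return [m + shift for m in monthly_estimates]


# Hypothetical monthly unemployment levels (thousands) for one state
monthly = [52.0, 50.5, 49.0, 47.5, 46.0, 48.0,
           49.5, 48.5, 47.0, 46.5, 47.5, 50.0]
revised = benchmark_additive(monthly, cps_annual_average=50.0)
```

An additive shift leaves the month-to-month changes of the original series untouched, which is one reason it is a convenient form for illustration.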

5.2.1 Design of the Current Population Survey

The CPS monthly sample consists of 72,000 housing units. This

sample size was chosen to meet national and state reliability

requirements. Assuming a 6% unemployment rate, the national sample

size was chosen so that a month-to-month change of 0.2 percentage

points in the unemployment rate would be statistically significant at

the 90 percent confidence level. This translates to a coefficient of

variation (CV) of 1.8% for the national unemployment rate. The 11

largest states have a CV of 8.0% on the monthly unemployment rate.

The other 39 states and the District of Columbia have a CV of 8.0% on

the annual average unemployment rate. The CPS sample is located in

729 areas comprising over 1,000 counties and independent cities with

coverage in every state and the District of Columbia. Prior to 1984,

the CPS had been designed as a national sample with the goal of

providing the best estimates of employment and unemployment for the

U.S. as a whole.
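The link between the coefficient of variation and the detectable month-to-month change can be checked with a short calculation. The sketch below takes the 6% rate, the 1.8% CV, and the 90 percent confidence level from the text. The correlation `rho` between consecutive monthly estimates is not given in the text and is a hypothetical value chosen for illustration, as are the function names.

```python
import math


def se_from_cv(rate, cv):
    """Standard error of a rate implied by its coefficient of variation."""
    return rate * cv


def se_of_change(se_level, rho):
    """Standard error of a month-to-month change.

    Assumes equal variances in both months and correlation rho between
    them, so Var(y_t - y_{t-1}) = 2 * se^2 * (1 - rho).
    """
    return se_level * math.sqrt(2.0 * (1.0 - rho))


rate = 0.06   # 6% unemployment rate (from the text)
cv = 0.018    # national CV of 1.8% (from the text)
z90 = 1.645   # two-sided 90 percent critical value
rho = 0.4     # hypothetical month-to-month correlation (assumption)

se_level = se_from_cv(rate, cv)                       # about 0.108 percentage points
detectable_change = z90 * se_of_change(se_level, rho)
```

Under this assumed correlation the smallest significant change comes out slightly below 0.2 percentage points, which is consistent with the design requirement stated above. A different value of `rho` would change the result.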

The CPS sample is selected by first dividing the entire area of

the United States into 1,973 primary sampling units (PSU's), where a

PSU is a county or a number of contiguous counties. The 1,973 PSU's

are grouped into strata within each state. One PSU is selected from

each stratum with probability of selection proportionate to the

population size of the PSU. The most populated PSU's are grouped by

themselves and selected with certainty. Since the sample design is

state based, the sampling ratio differs by state, ranging roughly from

1 in every 200 households to 1 in every 2500 households. There are

several stages of selecting the household units within PSU's. First,

enumeration districts, which are administrative units and contain

about 300 housing units, are ordered so that the sample would reflect

the demographic and residential characteristics of the PSU. Within

each enumeration district the housing units are sorted geographically

and are grouped into clusters of approximately four housing units. A

systematic sample of these clusters of housing units is then selected.
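The two selection ideas in this paragraph, one PSU per stratum with probability proportionate to size and a systematic sample of clusters, can be sketched as follows. The PSU names, populations, sampling interval, and function names are hypothetical, and the sketch ignores certainty PSUs and the ordering of enumeration districts.

```python
import random


def select_psu_pps(stratum, rng):
    """Pick one PSU from a stratum with probability proportional to population."""
    names = [name for name, _ in stratum]
    sizes = [population for _, population in stratum]
    return rng.choices(names, weights=sizes, k=1)[0]


def systematic_sample(clusters, interval, rng):
    """Take every `interval`-th cluster from a sorted list after a random start."""
    start = rng.randrange(interval)
    return clusters[start::interval]


rng = random.Random(1993)  # fixed seed so the illustration is repeatable

# Hypothetical stratum of three PSUs with their population sizes
stratum = [("PSU A", 120_000), ("PSU B", 45_000), ("PSU C", 35_000)]
chosen_psu = select_psu_pps(stratum, rng)

# Hypothetical geographically sorted clusters of about four housing units each
clusters = [f"cluster-{i}" for i in range(40)]
sample = systematic_sample(clusters, interval=10, rng=rng)
```

Because the clusters are sorted geographically before the systematic pass, the selected clusters are spread across the area rather than bunched together.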

Part of the sample is changed each month. For each sample, eight

systematic subsamples (rotation groups) or segments are identified. A

given rotation group is interviewed for a total of 8 months -- 4

consecutive months in the survey, followed by 8 months out of the

survey, followed by 4 more consecutive months in the survey. Under

this system, 75 percent of the sample segments are common from

month-to-month and 50 percent of the sample segments are common from

year-to-year.
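The overlap percentages follow directly from the 4-8-4 pattern and can be verified by enumeration. The sketch below assumes that one new rotation group enters the sample every month; months are numbered with arbitrary integers and the function names are illustrative.

```python
def months_in_sample(entry_month):
    """Months in which a rotation group entering at entry_month is interviewed:
    4 months in, 8 months out, 4 months in."""
    first_spell = range(entry_month, entry_month + 4)
    second_spell = range(entry_month + 12, entry_month + 16)
    return set(first_spell) | set(second_spell)


def groups_in_sample(month):
    """Entry months of the rotation groups interviewed in a given month,
    assuming one new group enters every month."""
    return {entry for entry in range(month - 15, month + 1)
            if month in months_in_sample(entry)}


def overlap_fraction(month, lag):
    """Share of the groups in sample at `month` that are also in sample `lag` months later."""
    now = groups_in_sample(month)
    later = groups_in_sample(month + lag)
    return len(now & later) / len(now)


month_to_month = overlap_fraction(100, 1)    # six of eight groups in common
year_to_year = overlap_fraction(100, 12)     # four of eight groups in common
```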

The estimation procedure involves weighting the data from each sample

person by the inverse of the probability of the person being in the

sample. These estimates are then adjusted for noninterviews, followed

by two ratio estimation procedures to adjust the CPS estimates to

known population totals. The last step in the preparation of

estimates makes use of a composite estimating procedure. The

composite estimate for the CPS is a weighted average of the estimate

for the current month and the estimate for the previous month,

adjusted for the net month-to-month change in households.

Balanced repeated replication and collapsed stratum methods are used

to estimate CPS variances for selected characteristics. Generalized

variance functions are used to present the sampling error estimates in

publications. Sampling error estimates are provided for all direct

sample based estimates, which include the annual average estimates for

states and some sub-state areas, as well as monthly estimates for the

11 largest states. Error estimates are not provided for estimates

which use the State Model methodology or the Handbook methodology.

Generalized variance functions are used for calculating sampling error

estimates for the direct sample based estimates from the CPS.

Employment and Earnings series provides methods for calculating

sampling error estimates for almost any estimate in the publication.

These methods can also be used to calculate error estimates for unpublished

CPS estimates, such as the monthly unemployment rates for the smaller

states. These sampling error estimates can be used to approximate the

error in the model based estimates.

5.3 Estimator Documentation

The method used to provide monthly state estimates for the 39 states

and the District of Columbia is based on the time series approach to

sample survey data. Originally suggested by Scott and Smith (1974),

this approach treats the population values as stochastic and uses

signal extraction techniques developed in the time series literature

to improve on the direct survey estimator. Recent work has been

conducted by Bell and Hillmer (1990), Binder and Dick (1990),

Pfefferman (1992), and Tiller (1992a).

The actual monthly CPS sample estimates are represented in signal plus

noise form as the sum of a stochastically varying true labor force

series (signal) and error (noise) generated by sampling only a portion

of the total population. Issues related to non-sampling errors are

not considered by this approach. The signal is represented by a time

series model that incorporates historical relationships in the monthly

CPS estimates along with auxiliary data from the Unemployment

Insurance (UI) and Current Employment Statistics (CES) programs. This

time series model is combined with a noise model that reflects key

characteristics of the sampling error to produce estimates of the true

labor force values. This estimator has been shown to be design

consistent under general conditions by Bell and Hillmer (1990) and is

optimal under the model assumptions.
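In symbols, the signal-plus-noise representation described above can be written as follows. This is a sketch only: y(t) and e(t) are the symbols used elsewhere in this chapter for the observed CPS estimate and its sampling error, while Y(t) for the true labor force value and the conditional-expectation form of the estimator are notation assumed here for illustration.

```latex
y(t) = Y(t) + e(t), \qquad
\hat{Y}(t) = \mathrm{E}\left[\, Y(t) \mid y(1), \ldots, y(t) \,\right]
```

The first expression is the decomposition of the observed estimate; the second says that the model-based estimate of the signal uses the current and past CPS observations rather than the current month alone.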

Unlike the typical small area estimation application that seeks to

improve on the direct survey estimator by borrowing strength over

areas, the time series approach borrows strength over time for a given

area. While variance reduction is a primary goal of both these

approaches, when there are strong overlaps in the sample design and

relatively long historical series are available, the time series

approach provides powerful tools for estimating the underlying

population values. As discussed in the previous section, the CPS

design creates major sample overlaps resulting in very strong

autocorrelations in the sampling errors. By combining a model of both

the true labor force values and the sampling error, the time series

approach controls for the autocorrelation induced by the sample design

making it easier to identify the population dynamics. This is

particularly useful in trend analysis and seasonal adjustment. When

sampling error is strongly autocorrelated, trend and sampling effects

are confounded in the observed data (Tiller, 1992b).

Click HERE for graphic.

Seasonal Component

The seasonal component is the sum of six trigonometric terms

associated with the 12-month frequency and its five harmonics

Click HERE for graphic.
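A trigonometric seasonal of this kind can be sketched as follows. The sketch uses fixed coefficients for simplicity, which is an assumption; in a state-space model the coefficients would normally be allowed to evolve over time. The coefficient values and the function name are hypothetical.

```python
import math


def seasonal_value(t, coefficients):
    """Sum of six trigonometric terms at the 12-month frequency and its harmonics.

    coefficients: six (a_j, b_j) pairs for frequencies 2*pi*j/12, j = 1..6.
    For integer t the sine term at j = 6 is identically zero.
    """
    total = 0.0
    for j, (a, b) in enumerate(coefficients, start=1):
        freq = 2.0 * math.pi * j / 12.0
        total += a * math.cos(freq * t) + b * math.sin(freq * t)
    return total


# Hypothetical coefficients, chosen only for illustration
coefs = [(1.0, 0.5), (0.3, -0.2), (0.1, 0.0),
         (0.0, 0.1), (0.05, 0.05), (0.02, 0.0)]
pattern = [seasonal_value(t, coefs) for t in range(12)]
```

By construction the pattern repeats every 12 months and sums to zero over any 12 consecutive months, so it redistributes the level within the year without changing the annual total.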

Irregular component

The irregular component is a residual not explained by the

regression or time series components discussed above. The convention

in classical decomposition of a univariate time series is to represent

the irregular as a highly transient phenomenon, i.e., as white noise or

a low order MA process.

Noise

The noise component of the observed CPS estimate represents

error that arises from sampling only a portion of the total

population. Its structure depends upon the CPS design and population

characteristics. For our purposes, we focus on those design features

that are likely to have a major effect on the variance-covariance

structure of the sampling error, e(t).

One of the most important features of the CPS is the large overlap

in sample units from month to month. As described in the previous

section, units are partially replaced each month according to a 4-8-4

rotating panel. Since this system produces large overlaps between

samples one month and one year apart, we can expect e(t) to be

strongly autocorrelated. Also, there is likely to be some correlation

between nonidentical units in the same rotation group because of the

way in which new samples are generated. When a cluster of housing

units permanently drops out of a rotation group, it is replaced by

nearby units. Since the new units will have characteristics similar

to those being replaced, this will result in correlations between

nonidentical households in the same rotation group (Train, Cahoon, and

Makens, 1978).

Finally, the dynamics of the sample error will also be affected by

the composite estimator. This is a weighted average of an estimate

based on the entire sample for the current month only and an estimate

which is a sum of the prior month composite and change that occurred

in the six rotation groups common to both months (Bureau of the

Census, 1978). In effect, this estimator takes a weighted average of

sample data from the current and all previous months.
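The recursion behind the composite estimator can be sketched as follows. The weight `k`, the sample figures, and the function names are hypothetical; the text does not state the weight used in the CPS, and the sketch starts the recursion from the first direct estimate as a simplifying assumption.

```python
def composite_estimate(current_direct, prior_composite, change_in_common_groups, k):
    """Weighted average of the current full-sample estimate and the prior
    composite carried forward by the change measured in the common groups."""
    return (1.0 - k) * current_direct + k * (prior_composite + change_in_common_groups)


def composite_series(direct, changes, k):
    """direct[t]: full-sample estimate for month t.
    changes[t]: change from month t-1 to t in the rotation groups common
    to both months (changes[0] is unused)."""
    composites = [direct[0]]  # assumption: initialize with the first direct estimate
    for t in range(1, len(direct)):
        composites.append(
            composite_estimate(direct[t], composites[-1], changes[t], k))
    return composites


direct = [100.0, 104.0, 101.0, 103.0]   # hypothetical direct estimates
changes = [0.0, 2.0, -1.0, 1.5]         # hypothetical changes in common groups
series = composite_series(direct, changes, k=0.4)
```

Because each composite depends on the previous one, unrolling the recursion shows that data from every earlier month enter the current estimate with geometrically declining weight, which is why the composite estimator affects the autocorrelation of the sampling error.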

Another important feature of the CPS is its changing variance over

time. There are three major sources of heteroscedasticity: (1) sample

redesigns; (2) changes in the sample size; and (3) changes in the true

value of the population characteristic of interest. The first two

cause discrete shifts in the sample variance. For example, the CPS

is redesigned each decade to make use of decennial census data to

update the sampling frame and estimation procedures. Most recently, a

state-based design was phased in during 1984/85 along with improved

procedures for noninterviews, ratio adjustments, and compositing.

Changes in state sample sizes have occurred more frequently than

redesigns and have had a major effect on variances at the state level.

Even with a fixed design and sample size, the error variance will be

changing because it is a function of the size of the true labor force.

Since the labor force is both highly cyclical and seasonal, we can

expect the variance to follow a similar pattern. To capture the

autocorrelated and heteroscedastic structure of e(t), we may express

it in multiplicative form (see Bell and Hillmer, 1990) as

Click HERE for graphic.

The autocovariance structure may also change over time with

redesigns of the sample. However, since the most important source of

autocorrelation is the 4-8-4 rotation scheme, which has not changed,

it seems reasonable to treat this structure as stable, at least,

between sample designs.
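One way to picture a sampling error that is both autocorrelated and heteroscedastic is to multiply a time-varying standard deviation by a stationary, unit-variance error process. The sketch below assumes this multiplicative form, e(t) = gamma(t) * e_star(t), and uses an AR(1) process for e_star as a stand-in for a more general ARMA structure. The symbols `gamma`, `e_star`, and `phi`, and all numerical values, are assumptions for illustration.

```python
import math
import random


def simulate_sampling_error(gamma, phi, seed=0):
    """Simulate e(t) = gamma[t] * e_star(t).

    gamma[t]: standard deviation of the sampling error in month t.
    phi: AR(1) coefficient of the unit-variance process e_star.
    """
    rng = random.Random(seed)
    innovation_sd = math.sqrt(1.0 - phi ** 2)  # keeps Var(e_star) equal to 1
    e_star = rng.gauss(0.0, 1.0)               # draw from the stationary distribution
    errors = []
    for g in gamma:
        errors.append(g * e_star)
        e_star = phi * e_star + rng.gauss(0.0, innovation_sd)
    return errors


# Hypothetical standard deviations with a seasonal pattern over three years
gamma = [0.8 + 0.2 * math.cos(2 * math.pi * t / 12) for t in range(36)]
errors = simulate_sampling_error(gamma, phi=0.6)
```

Separating the two pieces in this way lets the variance change with redesigns, sample size, and the level of the labor force while the autocorrelation structure is held fixed.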

The application of the signal-plus-noise approach requires

information on the variance-covariance structure of the CPS at the

state level. In principle, this information can be estimated directly

from the sample unit data using conventional design-based methods.

In practice, this is not always feasible, since the CPS variance

estimation involves complex computations on large microdata files. In

the initial implementation of models in 1989 for the 39 states and the

District of Columbia not enough information was available to

explicitly model the sampling error. Instead, the noise component was

estimated as a correlated residual (Tiller, 1989). More recently,

sampling error autocorrelations have been developed and new models are

being tested incorporating this information (Tiller, 1992a).

Estimation

The models described in the previous section are estimated using

the Kalman filter (KF). The KF is a highly efficient algorithm for

estimating unobserved components of a time series model when that

model can be represented in state-space form. The state-space form

consists of two sets of equations, transition and measurement

equations, and a set of initial conditions. The unobserved signal and

noise components are collected into the state vector, Z(t). The

transition equations represent the state vector as a first-order

vector-autoregressive process (VAR) with a normal and independently

distributed disturbance vector, v(t), which contains the white noise

disturbances

Click HERE for graphic.

associated with each of the unobserved component processes. The

transition equations are set out below in a simplified form

appropriate for our specific application.

In the new models under development, the CPS error correlation

structure is estimated outside of the time series model from

design-based information. Variances for the state CPS estimates are

computed using the method of generalized variances (Tiller, 1992).

Autocorrelations were derived in a study by Dempster and Hwang (1992).

In that study, state-specific variance component models were fit to a

time series of data for the 8 CPS rotation groups. From the estimated

variance parameters, autocorrelations were derived and, then, ARMA

parameters were estimated from these autocorrelations.
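The last step, going from autocorrelations to time series parameters, can be illustrated with the simplest case. The sketch below solves the Yule-Walker equations for a pure AR(2) process from the first two autocorrelations. It is a simplified stand-in chosen for illustration: the study cited above estimated ARMA parameters, and its exact procedure is not described here. The function name and the input values are hypothetical.

```python
def ar2_from_autocorrelations(r1, r2):
    """Solve the Yule-Walker equations for an AR(2) process:
        r1 = phi1 + phi2 * r1
        r2 = phi1 * r1 + phi2
    """
    denom = 1.0 - r1 ** 2
    phi1 = r1 * (1.0 - r2) / denom
    phi2 = (r2 - r1 ** 2) / denom
    return phi1, phi2


# Hypothetical autocorrelations at lags 1 and 2
phi1, phi2 = ar2_from_autocorrelations(0.6, 0.36)
```

In this example the lag-2 autocorrelation equals the square of the lag-1 autocorrelation, so the second coefficient is zero and the process reduces to an AR(1) with coefficient 0.6.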

Click HERE for graphic.

Once an additional observation, y(t), becomes available, the

update equations revise the conditional moments with the new

information in that observation.

Click HERE for graphic.

To initialize these equations, it is necessary to specify starting

values for the conditional moments, Z(0) and P(0). Those elements of

the state vector that are stationary, i.e., sampling error and the

irregular, are initialized with their unconditional moments. The

nonstationary and nonstochastic state variables are initialized with

diffuse priors.
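The prediction, update, and initialization steps can be sketched for a deliberately small model. The sketch assumes a two-element state vector: a random-walk level standing in for the signal and an AR(1) sampling error standing in for the noise, with the observation equal to their sum. This is far simpler than the state models described in this chapter, and the function name, parameter values, and data are hypothetical.

```python
def kalman_filter(y, phi, q_level, q_noise, p0_level=1e6):
    """Kalman filter for y(t) = level(t) + e(t), where
    level(t) = level(t-1) + w(t) and e(t) = phi * e(t-1) + u(t).

    Returns one tuple per observation:
    (filtered level, filtered error, innovation, innovation variance).
    """
    # Initialization: a large variance approximates a diffuse prior for the
    # nonstationary level; the stationary error starts at its unconditional moments.
    z = [0.0, 0.0]
    P = [[p0_level, 0.0], [0.0, q_noise / (1.0 - phi ** 2)]]
    results = []
    for obs in y:
        # Prediction step: Z(t/t-1) = T Z(t-1), P(t/t-1) = T P T' + Q, T = diag(1, phi)
        z_pred = [z[0], phi * z[1]]
        P_pred = [[P[0][0] + q_level, phi * P[0][1]],
                  [phi * P[1][0], phi * phi * P[1][1] + q_noise]]
        # Innovation and its variance for the measurement vector h = (1, 1)
        v = obs - (z_pred[0] + z_pred[1])
        F = P_pred[0][0] + P_pred[0][1] + P_pred[1][0] + P_pred[1][1]
        # Update step: gain K = P h' / F, then revise the state and its covariance
        K = [(P_pred[0][0] + P_pred[0][1]) / F,
             (P_pred[1][0] + P_pred[1][1]) / F]
        z = [z_pred[0] + K[0] * v, z_pred[1] + K[1] * v]
        hP = [P_pred[0][0] + P_pred[1][0], P_pred[0][1] + P_pred[1][1]]
        P = [[P_pred[i][j] - K[i] * hP[j] for j in range(2)] for i in range(2)]
        results.append((z[0], z[1], v, F))
    return results


# Hypothetical monthly unemployment rates (percent)
y = [6.1, 6.4, 5.9, 6.3, 6.6, 6.2, 6.0, 6.5]
filtered = kalman_filter(y, phi=0.5, q_level=0.01, q_noise=0.1)
```

Only the latest state estimate and its covariance are carried from one month to the next, which is what makes the recursion convenient for monthly production. The innovations and their variances returned here are the quantities that would be standardized and examined in diagnostic testing.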

Together, the prediction and update equations constitute the KF.

The KF updates its latest prediction of the state vector with current

sample data, prepares a prediction for the next period and updates

that prediction when new sample data become available, but the

estimate of Z(t) will not be revised with data later than period t.

Thus, the KF estimator at time t is optimal only with respect to data

up to and including period t. The estimator of Z(t) optimal for all

observations, before and after t, is known as a smoother. By taking a

linear combination of a forward and backward KF, which is the KF run in

reverse, starting at the end of the sample period at time t=T, and

proceeding to the beginning, a Kalman (fixed interval) smoother (KS) is

obtained. Let the backward filter prediction of the state vector at

time t, conditional on data from t+1 to T be denoted by Z(t/t+1) and

its covariance by P(t/t+1). The smoothed estimator is

Click HERE for graphic.
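For reference, the usual textbook form of the two-filter combination is sketched below. It is given as an assumption about the general shape of the smoother, not as a reproduction of the displayed equation; Z(t/t) and P(t/t) denote the forward filter estimate and covariance, and Z(t/T) and P(t/T) the smoothed quantities.

```latex
P(t/T) = \left[ P(t/t)^{-1} + P(t/t+1)^{-1} \right]^{-1}

Z(t/T) = P(t/T) \left[ P(t/t)^{-1} Z(t/t) + P(t/t+1)^{-1} Z(t/t+1) \right]
```

Each estimate is weighted by the inverse of its covariance, so whichever filter is more precise at time t has the larger influence, and the smoothed covariance is never larger than either of the two.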

State agency staff prepare their official monthly estimates using

software developed by BLS that implements the KF. This algorithm is

particularly well suited for the preparation of current estimates as

they become available each month. Since it is a recursive data

processing algorithm, it does not require all previous data to be kept

in storage and reprocessed every time a new sample observation becomes

available. All that is required is an estimate of the state vector

and its covariance matrix for the previous month. The software is

interactive, querying users for their UI and CES data and then

combining these data with CPS estimates to produce model based

estimates. At the end of the year, the monthly estimates are revised,

along with previous year estimates with the smoothing algorithm.

5.4 Evaluation Practices

Click HERE for graphic.

For each of the 39 states and the District of Columbia, signal plus

noise models of the CPS unemployment rate and employment level were

fit to monthly data beginning in 1976. Each of the 80 models has been

subjected to a wide variety of statistical tests.

An analysis of the model's prediction errors is the primary tool for

assessing model adequacy. The prediction errors are computed as the

difference between the current values of the CPS and the predictions

of the CPS made from the model based on data prior to the current

period. Since these errors represent movements not explained by the

model, they should not contain any systematic information about the

behavior of the signal or noise component of the CPS. Specifically,

the prediction errors, when standardized, should approximate a

randomly-distributed variate with zero mean and constant variance

(white noise). The tests used to check the prediction errors for

departure from these properties included:

- General tests for non-zero correlations in the innovations

- Departures from white noise behavior at the seasonal frequencies

- Heteroscedasticity

- Non-normality

- Prediction bias
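A general test for non-zero correlations in the standardized prediction errors can be sketched as follows. The sketch uses the Ljung-Box statistic, which is one common choice; the text does not name the specific tests that were applied, so this selection, the function names, and the sample data are assumptions.

```python
def standardize(errors, variances):
    """Divide each prediction error by its standard deviation."""
    return [e / v ** 0.5 for e, v in zip(errors, variances)]


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((value - mean) ** 2 for value in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box(x, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k^2 / (n - k), k = 1..max_lag."""
    n = len(x)
    return n * (n + 2) * sum(autocorrelation(x, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))


# Hypothetical standardized errors with an obvious alternating pattern
alternating = [1.0, -1.0] * 20
q_stat = ljung_box(alternating, max_lag=1)
```

Under the white noise hypothesis the statistic is compared with a chi-square distribution whose degrees of freedom equal the number of lags, less any adjustment for estimated parameters. The alternating series above gives a value far beyond the 5 percent critical value of 3.841 for one degree of freedom, as expected for strongly autocorrelated errors.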

About 50 to 60 percent of the total variance in the monthly CPS series

is attributable to the estimated signal with the remainder due to the

aggregate noise term. The time varying regression mean is

considerably smoother than the underlying CPS series. Based on the

diagnostic tests, the 80 models appear to fit the systematic

underlying movements in the CPS fairly well. The major problems with

the models were high autocorrelations in 11 states, and

heteroscedasticity in 9 of the 40 unemployment rate models. The

heteroscedasticity is in part a reflection of changing variances (and

sample sizes) in the CPS. Explicitly modeling the CPS sample errors

would alleviate this problem and is discussed in the current problems

and activities section, below.

The current state model estimates were introduced in January 1989.

The previous Handbook method could be classified as an accounting

method. Several years of research and development, beginning in the

early 1980's, examined numerous regression and time series approaches

to replace the accounting method. A number of workgroups were set up

to determine the criteria to be used to select the new methodology as

well as how to implement the new methodology.

Ongoing evaluation of models includes annual reassessment of the

regressor variables, if requested by staff in the state employment

security agencies. Typically, state agency staff express concerns

with the models if either the month to month movements in the

unemployment rate estimates are larger than they expect or the

unemployment rate level seems unreasonable compared to other economic

data. Diagnostic tests, similar to the ones used for developing the

model, are run. Adjustments to the regressor variables may be made if

the diagnostics indicate a problem with the model. In this case,

historical estimates would be replaced, in addition to developing a

new model for concurrent estimates.

5.5 Current Problems and Activities

The implementation of model estimates for states in January 1989

resulted, not unexpectedly, in estimates with more month to month

volatility than the previous Handbook method. The previous method

incorporated a 6 month moving average, which limited month to month

movement. The seasonal variation in the employment and unemployment

statistics series is usually large relative to the trend and cycle.

However, the BLS decided to conduct a research project to investigate

the issues of seasonally adjusting model based estimates prior to

implementing seasonally adjusted state estimates. In November 1989, a

work group was formed to examine issues related to the seasonal

adjustment of estimates of employment level and unemployment rate for

the 39 smaller states and the District of Columbia (non-direct use

states). The group was charged with addressing two primary areas:

1. Evaluation of the performance of the model estimates relative to

the CPS sample-based estimates, with emphasis on the trend/cycle

characteristics of the series.

2. Evaluation of the use and limitations of the BLS standard seasonal

adjustment method, X-11 ARIMA, to seasonally adjust the model

estimates.

The evaluation of the modeling approach involved simulating a

reduction in reliability of the CPS samples in two large direct-use

states (Florida and Massachusetts) to nondirect-use levels,

fitting models to the resulting weakened series, and then comparing

model estimates to the CPS estimates from the full sample. While it

would have been desirable to simulate sample cuts by subsampling the

original data, this was considered too costly. Instead, random noise

was added to the full sample estimates, using an estimated

variance/covariance structure of the CPS estimator. For each state,

two weakened samples were generated for employment and unemployment,

and separate models were fitted to the full and weakened samples. The

main findings are summarized as follows:

Model Evaluation

1. Modeling the weakened unemployment rate series resulted in

estimates which were closer to the full CPS sample than the unmodeled

weak CPS series for all four unemployment rate series. Values for the

root mean squared relative difference (RMSRD) comparing full CPS to

model estimates were 28 to 38 percent smaller than values of RMSRD

comparing the full CPS to the weakened series. Modeling also reduced

the number of weakened series estimates falling outside two standard

deviation intervals about the full sample estimates by 50 to 75

percent.

2. Modeling the weakened employment series resulted in modest, if any,

reductions in the RMSRD from the full CPS sample. In one case for the

Florida employment series, the RMSRD values for the model were

actually larger than those for the weakened series. This appeared to

be due primarily to the fact that the difference in reliability

between the full sample and the weakened series for employment was

very small compared to the difference for unemployment rate.

3. Modeling dramatically reduced the magnitude of irregular

fluctuation in both employment and unemployment rate. It was not

unusual for the relative contribution of the variance of monthly

change in the CPS to be 4 to 8 times that of the modeled series. The

much smoother quality of the model estimates has important

implications for seasonal adjustment (see below).
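The comparison measure used in the findings above can be sketched as follows. The report does not spell out the RMSRD formula, so the definition here (root mean square of the differences taken relative to the full-sample estimate, expressed in percent) is an assumption, and the series values are hypothetical.

```python
import math


def rmsrd(reference, candidate):
    """Root mean squared relative difference, in percent (assumed form)."""
    rel = [(c - r) / r for r, c in zip(reference, candidate)]
    return 100.0 * math.sqrt(sum(d * d for d in rel) / len(rel))


# Hypothetical monthly unemployment rates for one state.
full_cps = [5.0, 5.2, 5.1, 4.9]  # full-sample CPS estimates
weakened = [5.6, 4.7, 5.5, 4.4]  # full sample plus simulated noise
modeled = [5.2, 5.0, 5.2, 4.8]   # model fitted to the weakened series

reduction = 1.0 - rmsrd(full_cps, modeled) / rmsrd(full_cps, weakened)
```

Under this definition, `reduction` is the proportional drop in RMSRD obtained by modeling the weakened series, the kind of quantity reported as 28 to 38 percent for the unemployment rate series.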

Evaluation of Seasonal Adjustment

Using X-11 ARIMA, the CPS and modeled series were seasonally adjusted.

The adequacy of the seasonal adjustment was evaluated using X-11 ARIMA

quality control statistics, spectral analysis, sliding spans, and

graphs of seasonal factors. The major findings are as follows:

1. The direct sample-based unemployment rate CPS series could not be

adequately seasonally adjusted. Frequently, several of the X-11 ARIMA

quality control statistics failed. Seasonal adjustments had poor

stability properties, the seasonal variation could not be completely

removed, and distortion was added to the nonseasonal variation in the

series. The results were better for the CPS employment series but not

nearly as good as for the modeled series.

2. The seasonal adjustments for all four of the employment and

unemployment model series appeared satisfactory. Spectral analysis

shows that X-11 ARIMA was able to effectively remove seasonal

variation in the modeled series without introducing distortions in the

nonseasonal components of the series. The sliding span statistics

indicate seasonal factors remain stable as the span of the data is

shifted across time. In addition, monthly seasonal factors using the

model were similar to the seasonal factors of the full sample

estimates of the two direct-use states. This indicates that the

models were not forcing an artificial pattern, but were "picking up"

the seasonal pattern of the underlying CPS series, despite the extra

noise which was introduced.

In summary, despite some limitations to the methods of

evaluation, the study provided important information to help

understand the value of modeling and the use of X-11 ARIMA to

seasonally adjust model-based estimates; however, the theoretical base

for superimposing the modeling structure of X-11 ARIMA on already

smoothed, model based values remains to be explored. Although the

study confirmed support for modeling, further work will be done to

demonstrate the utility of the employment models.

The BLS introduced seasonally adjusted state employment and

unemployment estimates beginning in January 1992, based on the results

of this study.

Current research is focusing on further reduction in irregular

movement in employment and unemployment models by introducing several

changes to the methods. The most important change is the inclusion of

the variance/covariance structure of CPS estimates into the models,

rather than relying on the model to make these estimates (Tiller,

1992a). Information on the structure of the CPS sample error is being

used to decompose the disturbance term into its sample error and model

error components. Given CPS error variances and lag covariance, ARMA

models can be developed to approximate the time series behavior of the

sampling error. Treating the ARMA coefficients as known parameters of

the state space system, standard time series diagnostic tools may be

used to model the errors in equation disturbances. The need for

estimating the variance-covariance structure of the CPS estimates

stems from sample redesign and changes in sample size.

Other changes, such as removing some exogenous CPS variables, are

expected to improve the seasonal movements in the model estimates.

Florida and Massachusetts will again be used to examine the ability of

the weakened series to track the full sample CPS estimates. Research

is expected to be completed in FY93 for implementation in January

1994.

Long term research will focus on substate estimation.

Hierarchical and Empirical Bayes methods may be considered in addition

to a time-series approach for substate estimates. Spatial models,

which borrow strength from CPS sample data within the state, may be

appropriate for substate estimates.

A number of related studies have been conducted under the

auspices of other governmental agencies. The Census Bureau has

supported research by Bell and Hillmer (1990) that has been

instrumental in stimulating renewed interest in the time series

approach to survey estimation. In this study, the authors applied

ARIMA models to retail survey data. Binder and Dick (1990), at

Statistics Canada, fitted ARMA models to Canadian Labour Force survey

data. Both of these studies estimated the sampling error structure

outside the time series model, using design-based methods.

Pfeffermann (1991) applied a structural time series model to

individual panel estimates from the Israeli labor force survey. The

sampling error structure was estimated through the model rather than

by design-based methods.

Dempster and Hwang (1993) have developed prototype Bayesian

models for estimating U.S. State employment and unemployment rates.

Their basic time series models are constructed from fractional

Gaussian noise processes.

REFERENCES

Bell, W.R. and Hillmer, S.C. (1990), "The Time Series Approach to

Estimation for Repeated Surveys," Survey Methodology, 16, 195-215.

Binder, D.A. and Dick, J.P. (1990), "A Method for the Analysis of

Seasonal ARIMA Models," Survey Methodology, 16, 239-253.

Bureau of Labor Statistics (1988), Handbook of Methods, Washington, D.C.

Bureau of Labor Statistics (1991), Report on the Seasonal

Adjustment of LAUS Model Estimates, Washington, D.C.

Bureau of Labor Statistics (1991), The Current Population Survey

- An Overview, Internal Document by Edwin Robison, Washington, D.C.

Bureau of the Census (1978), The Current Population Survey:

Design and Methodology, Technical Paper 40, Washington, D.C.

Dempster, A.P. and Jing-Shiang Hwang (1993), "Component Models

and Bayesian Technology for Estimation of State Employment and

Unemployment Rates," paper presented at the 1993 Annual Research

Conference, Census Bureau.

Harvey, A.C. (1989), Forecasting, Structural Time Series Models

and the Kalman Filter, Cambridge University Press.

Pfeffermann, D. (1991), "Estimation and Seasonal Adjustment of

Population Means Using Data from Repeated Surveys," Journal of Business

and Economic Statistics, 9, 163-175.

Scott, A.J. and Smith, T.M.F. (1974), "Analysis of Repeated

Surveys Using Time Series Methods," Journal of the American

Statistical Association, 69, 674-678.

Tiller, R. (1989), "A Kalman Filter Approach to Labor Force

Estimation Using Survey Data," in Proceedings of the Survey Research

Methods Section, American Statistical Association.

____(1992a), "Time Series Modeling of Sample Survey Data from the

U.S. Current Population Survey," Journal of Official Statistics, 8,

149-166.

____(1992b), "A Time Series Approach to Small Area Estimation," in

Proceedings of the Survey Research Methods Section, American

Statistical Association.

Train, G., Cahoon, L., and Makens, P. (1978), "The Current

Population Survey Variances, Inter-Relationships, and Design Effects,"

in Proceedings of the Survey Research Methods Section, American

Statistical Association, 443-448.

CHAPTER 6

County Estimation of Crop Acreage Using

Satellite Data

Michael Bellow, Mitchell Graham, and William C. Iwig

National Agricultural Statistics Service

6.1 Introduction and Program History

The National Agricultural Statistics Service (NASS) of the

U.S. Department of Agriculture (USDA) has published county estimates

of crop acreage, crop production, crop yield and livestock inventories

since 1917. These estimates assist the agricultural community in

local agricultural decision making. Also, the Federal Crop Insurance

Corporation (FCIC) and the Agricultural Stabilization and Conservation

Service (ASCS) of the USDA use NASS county crop yield estimates to

administer their programs involving payments to farmers if crop yields

are below certain levels. The primary source of data for these

estimates has always been a large non-probability survey of

U.S. farmers, ranchers, and agribusinesses who voluntarily provide

information on a confidential basis (see Chapter 7). In addition, the

Census of Agriculture, conducted by the Bureau of the Census every

five years, serves as a valuable benchmark for the NASS county

estimates.

Earth resources satellite data, particularly from the Landsat series

of satellites, provide another useful ancillary data source for county

estimates of crop acreage. The potential for improved estimation

accuracy using satellite data is based on the fact that, with adequate

coverage, all of the area within a county can be classified to a crop

or ground cover type. The accuracy of the estimates is then dependent

on how accurately the satellite data are classified to each crop type

based on the "ground truth" data obtained from the annual June

Agricultural Survey (JAS) conducted by NASS. Through the use of

aerial photographs, this survey identifies the crop type of individual

fields within randomly selected land segments. Segments in major

agricultural areas are approximately one square mile in area and

normally contain 10 to 20 fields. The satellite spectral data are

matched to the corresponding fields for use in classifying all

individual imaged areas, known as pixels, to a particular crop type.

Recent studies (Bellow 1991; Bellow and Graham 1992) have shown that,

for certain crops, approximately 80 percent of the pixels are

classified correctly. This correct classification level is high

enough to provide improved estimation accuracy.

NASS has been a user of remote sensing products since the 1950's when

it began using mid-altitude aerial photography to construct area

sampling frames (ASF's) for the 48 states of the continental United

States. A new era in remote sensing began in 1972 with the launch of

the Landsat I earth-resource monitoring satellite. Four additional

Landsats have been launched since 1972, with Landsat IV and V still in

operation in 1993. The polar-orbiting Landsat satellites contain a

multi-spectral scanner (MSS) that measures reflected energy in four

bands of the electromagnetic spectrum for an area of just under one

acre. The spectral bands were selected to be responsive to vegetation

characteristics. In addition to the MSS sensor, Landsats IV and V

have a Thematic Mapper (TM) sensor which measures seven energy bands

and has increased spatial resolution. The large area (185 by 170 km)

and repeat (16 day per satellite) coverage of these satellites opened

new areas of remote sensing research: large area crop inventories,

crop yields, land cover mapping, area frame stratification, and small

area crop cover estimation.

Research from 1972 to 1978 led to the creation of an operational

procedure for large area crop acreage estimation. A regression

estimator was developed which related the ground-gathered area frame

data to the computer classification of Landsat MSS images. The basic

regression approach used to produce State estimates does not produce

reliable county estimates. Domain indirect regression estimators were

developed for this purpose. In the 1978 crop season, corn and soybean

acreage State and county estimates based on remotely sensed data were

produced for Iowa. One to two States were added to the project

through 1984. For the 1984-1987 crop seasons, this project covered an

eight-State area in the central United States and produced regression

estimates of corn, winter wheat, soybeans, rice, and cotton acreages.

These regression estimates were combined with other survey indications

and administrative data to provide final published county estimates.

Estimation based on data from Landsat MSS sensors was discontinued in

1988 in order to implement the increased capabilities of higher

resolution sensors.

France entered the field of resources satellites in 1986 with the

launch of SPOT I, which carries an improved multi-spectral scanner.

This scanner images an even smaller area than the TM sensor but only

measures three energy bands. Several NASS research projects compared

the SPOT MSS and Landsat TM sensors with respect to crop estimation.

This research led to the selection of Landsat TM as the preferred

sensor for crop area estimation based on its superior spectral

characteristics. The spatial characteristics of the SPOT MSS sensor

provide a benefit only in areas with mostly small fields.

Regression estimation of crop acreages for large and small areas based

on computer classification was reinstated in 1991 with the Delta

Remote Sensing Project using Landsat Thematic Mapper data imaged over

the Mississippi Delta region, which is a major rice and cotton area.

Results from the operational eight-State program in 1987 and from

sensor comparison experiments showed that the regression approach was

most effective for rice and cotton estimation. State and county

estimates of rice, cotton, and soybean acreages were produced for

Arkansas and Mississippi in 1991, with Louisiana added in 1992. The

project only covers Arkansas in 1993 due to budgetary constraints.

Three domain indirect regression estimators have been used or

considered for producing small area county estimates using ancillary

satellite data. From 1976 to 1982, the Huddleston-Ray estimator was

used (Appendix B). In 1978, the Cardenas family of estimators was

considered but not implemented (Appendix C). Beginning in 1982, the

Battese-Fuller family of estimators was used for calculating county

crop acreage estimates using Landsat MSS data. Since 1991, the

Battese-Fuller model has been used to produce county estimates with

Landsat TM data. Currently, this is the preferred model. However,

non-regression estimation procedures based on total pixel counts are

being evaluated.

6.2 Program Description, Policies, and Practices

The basic element of Landsat spectral data is the set of measurements

taken by a sensor of a square area on the earth's surface. The sensor

measures the amount of radiant energy reflected from the surface in

several bands of the electromagnetic spectrum. The individual imaged

areas, known as pixels, are arrayed along east-west rows within the

185 kilometer wide north-to-south pass (swath) of the satellite. For

purposes of easy data storage, the data within a swath are subdivided

into overlapping square blocks, called scenes. The two satellites

currently in operation (Landsats IV and V) image a given point on the

earth's surface once every 16 days. The MSS sensor, formerly used for

crop area estimation, contained four spectral bands with 80 meter

spatial resolution. The more advanced TM sensor has seven bands

(three visible and four infrared) with 30 meter resolution.

Several Landsat scenes may be required to cover an entire region of

interest within a given State. It is not always possible to have the

same image date for all such scenes due to schedule, cloud cover, and

image quality factors. Consequently, analysis districts are created.

An analysis district is a collection of counties or parts of counties

contained in one or more Landsat scenes that have the same image date,

or in areas for which usable Landsat data is not available to the

analyst. To obtain State level crop acreage estimates, NASS sums all

analysis district level estimates within the State. County level

estimates are obtained using domain indirect regression and synthetic

estimation methods, to be discussed later.

The area sampling frame for each State is stratified based on land use

such as percentage cultivation, forest, and rangeland. NASS uses the

regression estimator described by Cochran (1977, pp. 189-204) to

compute crop acreage estimates for each land use stratum within an

analysis district that has satellite coverage for an adequate number

of JAS segments. These regression estimates are more precise than the

direct expansion estimates obtained from JAS data alone. A detailed

description of the procedure involved is provided by Allen (1990).

Briefly, the steps required are as follows:

1. A graphics oriented registration process associates Landsat pixels

with JAS sampled segments.

2. JAS data for sampled segments are used to label each pixel within

the segments to a crop or other cover type.

3. Labelled pixels are clustered based on their Landsat data values

to develop discriminant functions (signatures) for each cover type.

4. The discriminant functions are used to classify each pixel within

the sampled segments to a cover type.

5. The segment level classification results are used to develop

regression relationships for each crop between the ground and

satellite data within each land use stratum. For each stratum, the

independent (regressor) variable is the number of pixels classified to

that crop per segment, and the dependent variable is the JAS segment

reported crop acreage.

6. All pixels within the analysis district are classified, using the

discriminant functions developed in Step 3.

7. For each stratum, the mean number of pixels per segment classified

for a given crop over all segments in the population is substituted

into the corresponding regression equation to obtain the stratum level

mean crop acreage per segment. This mean is multiplied by the known

total number of segments in the stratum to obtain the stratum level

crop acreage estimate.

8. The stratum level estimates are summed to obtain the analysis

district level crop acreage estimate for the portion of the analysis

district covered by satellite data.
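Steps 5, 7, and 8 above can be sketched as follows. This is a minimal illustration of the classical regression estimator in the form given by Cochran (1977), not NASS production code; the function names, argument names, and numbers are hypothetical.

```python
def regression_stratum_total(sample_pixels, sample_acres,
                             population_mean_pixels, total_segments):
    """Regression estimate of crop acreage for one land use stratum.

    sample_pixels:          pixels classified to the crop, per JAS segment
    sample_acres:           JAS reported crop acreage, per JAS segment
    population_mean_pixels: mean classified pixels per segment over all
                            segments in the stratum (Step 7)
    total_segments:         known number of segments in the stratum
    """
    n = len(sample_pixels)
    x_bar = sum(sample_pixels) / n
    y_bar = sum(sample_acres) / n
    s_xy = sum((x - x_bar) * (y - y_bar)
               for x, y in zip(sample_pixels, sample_acres))
    s_xx = sum((x - x_bar) ** 2 for x in sample_pixels)
    slope = s_xy / s_xx  # Step 5: least squares slope within the stratum
    # Step 7: evaluate the fitted line at the population pixel mean.
    mean_acres = y_bar + slope * (population_mean_pixels - x_bar)
    return total_segments * mean_acres


def district_total(strata):
    """Step 8: sum stratum estimates over strata with satellite coverage."""
    return sum(regression_stratum_total(**s) for s in strata)


# Hypothetical single-stratum example.
stratum = dict(sample_pixels=[100, 200, 300], sample_acres=[50, 100, 150],
               population_mean_pixels=220, total_segments=10)
estimate = district_total([stratum])
```

The gain over direct expansion comes from the adjustment term: when the sampled segments happen to have fewer classified pixels than the stratum as a whole, the estimate is moved upward in proportion to the fitted slope.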

For land use strata lacking satellite coverage of an adequate number

of JAS segments to develop the regression relationship, the direct

expansion of JAS data is used to obtain estimates. These stratum

level JAS estimates are also summed to obtain analysis district

estimates for each crop representing the area not covered by satellite

data. The total analysis district estimate for a particular crop is

then:

Click HERE for graphic.
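The formula graphic is not reproduced in this text. Based on the two preceding paragraphs, the total presumably has the following form; the symbols are assumptions introduced here, with R denoting the strata where regression is used and D the strata where direct expansion is used.

```latex
\hat{Y}_{AD} \;=\; \sum_{h \in R} \hat{Y}^{\,\mathrm{reg}}_{h} \;+\; \sum_{h \in D} \hat{Y}^{\,\mathrm{DE}}_{h}
```

Here the first sum collects the stratum level regression estimates and the second collects the stratum level JAS direct expansion estimates.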

In many States, counties typically contain fewer than five sampled JAS

segments, and may contain no segments at all. This fact makes it

generally infeasible to define analysis districts to be individual

counties and then use the above procedure to obtain county level

estimates. Instead, the Huddleston-Ray, Cardenas, and Battese-Fuller

domain indirect regression estimators have been developed and

investigated for providing county estimates of crop acreage. The

Battese-Fuller approach is currently favored by NASS, and is described

in detail in Section 6.3.

The NASS County Estimates system, described in Chapter 7, is designed

to accept the Battese-Fuller values as a separate set of county crop

acreage estimates. Within this system, the Battese-Fuller county

estimates are first scaled to be additive to the official NASS State

estimate for each commodity. The scaled Battese-Fuller values are

then composited with scaled values from other NASS surveys and

administrative data sources. Thus the Battese-Fuller estimates serve

as an additional input to the County Estimates system in States where

they are available. Currently, the composite weights are subjectively

set by the statisticians in the State office to provide satisfactory

and reliable estimates. Each NASS State Statistical Office (SSO)

prepares its own annual publication of the final county estimates.

Although sampling variances are calculated for the Battese-Fuller

estimates, no variances or error information are published for the

final county estimates. Mean squared error information is only

published for major agricultural items at the U.S. level.
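The scaling and compositing steps described above can be sketched as follows. The report does not give the formulas, so proportional (ratio) scaling and a simple weighted average are assumptions, and all names and numbers are hypothetical.

```python
def scale_to_state_total(county_estimates, state_total):
    """Scale county values proportionally so they sum to the state estimate."""
    factor = state_total / sum(county_estimates.values())
    return {county: value * factor
            for county, value in county_estimates.items()}


def composite(scaled_sources, weights):
    """Weighted average of several scaled sources, county by county.

    scaled_sources: list of dicts mapping county name to scaled estimate
    weights:        one weight per source; assumed to sum to 1
    """
    counties = scaled_sources[0].keys()
    return {c: sum(w * src[c] for w, src in zip(weights, scaled_sources))
            for c in counties}


# Hypothetical two-county state with an official state estimate of 200.
battese_fuller = scale_to_state_total({"A": 40.0, "B": 60.0}, 200.0)
other_survey = scale_to_state_total({"A": 55.0, "B": 45.0}, 200.0)
final = composite([battese_fuller, other_survey], weights=[0.25, 0.75])
```

Because every source is first scaled to the same state total and the weights sum to 1, the composited county values also add to the official state estimate.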

6.3 Estimator Documentation

The Battese-Fuller family of estimators was first developed in the

general framework of linear models with nested error structure (Fuller

and Battese 1973), and later applied to the special case of county

crop area estimation (Battese, Harter, and Fuller 1988). The method

has been used for all Landsat county estimation done by NASS since

1982.
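The nested error idea can be sketched for a single county and stratum. The sketch assumes the unit-level model of Battese, Harter, and Fuller (1988), in which a segment's reported acreage equals a linear function of its classified pixel count plus a county effect plus a segment-level error. The regression coefficients and variance components are taken as already estimated, and all names are hypothetical.

```python
def battese_fuller_county_mean(beta0, beta1, county_mean_pixels,
                               sample_pixels, sample_acres,
                               var_county, var_residual, delta=None):
    """Predicted mean crop acreage per segment for one county (sketch).

    Assumed model: acres = beta0 + beta1 * pixels + county effect + error.
    delta in [0, 1] indexes the family; by default it is the weight
    var_county / (var_county + var_residual / n).
    """
    synthetic = beta0 + beta1 * county_mean_pixels
    n = len(sample_pixels)
    if n == 0:
        # No sampled segments: no information on the county effect.
        return synthetic
    x_bar = sum(sample_pixels) / n
    y_bar = sum(sample_acres) / n
    if delta is None:
        delta = var_county / (var_county + var_residual / n)
    # Mean residual of the county's sampled segments from the fitted line.
    mean_residual = y_bar - beta0 - beta1 * x_bar
    return synthetic + delta * mean_residual


# Hypothetical county with two sampled segments.
pred = battese_fuller_county_mean(10.0, 0.5, 200.0, [100, 300], [70, 170],
                                  var_county=1.0, var_residual=2.0)
```

Setting `delta=0` ignores the county effect and evaluates the fitted line at the county pixel mean, which resembles the Huddleston-Ray approach discussed in Section 6.5; larger values of `delta` give more weight to the county's own sampled segments.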

Similar to the State level estimation, land use strata are separated

into those that have adequate satellite coverage and those that do

not. The Battese-Fuller model can be applied within an analysis

district for all strata where classification and regression have been

performed. The analyst computes stratum level Battese-Fuller acreage

estimates for all counties and subcounties within the boundaries of

each analysis district. For land use strata where regression cannot

be done due to lack of adequate satellite coverage or too few

segments, a domain indirect synthetic estimator is used to obtain

county estimates.

Click HERE for graphic.

was used within stratum A for the parts of counties outside the scene,

and in stratum B for all nine counties.

Table 2 gives the computed county estimates by stratum and estimation

method. Table 3 contains the official county estimates issued by the

Iowa Agricultural Statistics Service. These published estimates are

based on additional survey and administrative data (see Chapter 7),

and are considered as the standard for evaluating the Battese-Fuller

model values. The tables show that the computed county estimates for

corn were more efficient overall than those for soybeans. For eight

of the nine counties, the C.V. for corn was less than 4 percent. No

county had a C.V. of less than 4 percent for soybeans. The percent

difference ranged from 0.2 to 9.2 for corn, and from 0.8 to 17.8 for

soybeans.

Table 2: Iowa 1988 County Estimates of Crop Acreage by Stratum and
Estimation Method

                     Stratum A    Stratum A    Stratum B
County          Battese-Fuller    Synthetic    Synthetic        Total     C.V.
                   acres (000)  acres (000)  acres (000)  acres (000)  percent

Corn
Audubon                   91.9            -           .3         92.2      3.5
Calhoun                  130.3          2.6           .4        133.2      2.9
Carroll                  140.7            -           .7        141.4      3.2
Crawford                 128.4         23.4           .9        152.7      3.1
Greene                   129.6            -           .4        130.0      3.0
Guthrie                  105.7            -           .6        106.3      4.9
Ida                       43.4         63.2           .4        107.0      3.7
Sac                      137.5            -           .8        138.3      2.9
Shelby                   140.2            -           .5        140.7      2.9
Total                   1047.7         89.2          5.0       1141.8

Soybeans
Audubon                   69.8            -           .1         69.9      6.6
Calhoun                  143.2          1.7           .1        145.0      4.0
Carroll                  106.6            -           .1        106.7      9.0
Crawford                  91.3         15.5           .2        106.9      5.4
Greene                   117.4            -           .1        117.5      4.6
Guthrie                   64.3            -           .1         64.4     10.9
Ida                       34.6         41.7           .1         76.4      6.9
Sac                      112.8            -           .1        112.9      4.9
Shelby                    80.9            -           .1         81.0      7.4
Total                    820.9         58.8          1.0        880.7

6.5 Evaluation Practices

NASS first began to address the problem of applying satellite data to

small area estimation in the mid 1970's. In 1976, Huddleston and Ray

(1976) proposed that within each stratum, the mean pixels per segment

calculated by classifying all segments within an entire analysis

district be replaced by the mean pixels per segment computed by

classifying all segments within a given county. This county pixel

mean is substituted into the corresponding stratum regression equation

for the crop of interest. Amis, Martin, McGuire, and Shen (1982)

describe the Huddleston-Ray estimator as an analysis district

regression estimator applied to a subarea of the analysis district.

The regression coefficients are estimated from sampled segments

located throughout the analysis district, while the mean being

estimated is from a subpopulation of the analysis district. The

Huddleston-Ray estimator is simple and intuitively appealing, but

Walker and Sigman (1982) point out two major drawbacks. First, it is

unclear how to accurately compute the variance of the estimator.

Second, the estimator lumps together a term attributable to sampling

error within a given county and another term that measures the

inherent distinction between a county and the analysis district. Amis

et al. (1982) empirically demonstrate that the Huddleston-Ray method

can generate biased estimates and that the variance estimation

formula can overestimate the variability for a given county. The

mathematical formulas for the Huddleston-Ray estimator and its

variance estimator are provided in Appendix B.

The problems with the Huddleston-Ray estimator documented by Walker

and Sigman (1982) and by Amis et al. (1982) were recognized soon after

its development and prompted Cardenas, Blanchard, and Craig (1978) to

devise a different type of estimator. The Cardenas family of

estimators has three forms, each of which uses auxiliary Landsat data

through a regression type estimator. However, the versions use

different methods of estimating the slope term. The three forms are

the ratio estimator, the separate regression estimator, and the

combined regression estimator. (Appendix C gives the mathematical

formulation for the Cardenas family of estimators.) As with the

Huddleston-Ray method, within each stratum the Cardenas method

compares the analysis district level mean pixels per segment

classified to a crop to the corresponding county level mean for that

crop. However, the Cardenas method uses all segments in the analysis

district to calculate the analysis district mean, whereas the

Huddleston-Ray approach only uses sample segments. The estimate of

average crop area per segment is adjusted by an amount proportional to

this difference between the county and analysis district means. Amis

et al. (1982) examined the ratio and separate regression Cardenas

estimators, and compared them with the Huddleston-Ray estimator.

Cardenas et al. (1978) stated that none of the estimators they

presented were shown to be "best" in any sense, nor did they

demonstrate any optimum properties. They did show that each of these

estimators, when summed over counties, provides an unbiased stratum

level estimate for the State. Also, assuming that the within county

variance is the same for all counties, the method enables unbiased

estimation of the State-wide variance. Amis et al. (1982) emphasized

that an unbiased estimate of the county mean crop area per segment may

not be possible when there are few sample segments in a county.

Whenever there are significant differences in county variances, the

Cardenas estimators appear to have higher variances than the

Huddleston-Ray estimator. Amis et al. (1982) concluded that there

appears to be no difference between the Cardenas ratio estimator and

the separate regression estimator, and that the Cardenas estimators do

not perform better than the Huddleston-Ray estimator. Both Cardenas

estimators studied appeared to be biased, with larger variances than

the Huddleston-Ray estimator.

The Cardenas method was never used in an operational remote sensing

program since it did not provide sufficient improvement over the

Huddleston-Ray estimator. The Huddleston-Ray estimator was used to

generate county estimates for use by the NASS State Statistical

Offices (SSO's) until 1982. At that time, Walker and Sigman (1982)

advised that calculation of county estimates using the Huddleston-Ray

method be discontinued, and that the Battese-Fuller method be used

instead.

Walker and Sigman (1982) studied the Battese-Fuller model using

Landsat MSS data over a six county region in eastern South Dakota.

They found a modest lack of fit of the model, with larger model

departure corresponding to low correlation between classified pixel

counts and ground survey observations. A key feature of the

Battese-Fuller model is the county effect parameter and this effect

was found to be highly significant for corn, the most prevalent of the

four crops considered in the study. Furthermore, this effect

manifested itself within several strata but was negligible across

strata. The study nonetheless indicated robustness of the Battese-

Fuller estimators against departure from certain model assumptions.

Two members of the Battese-Fuller family satisfied the criterion for

small relative root mean square error; i.e. less

than 20 percent of the estimate was attributable to root mean square

error. These members were the estimators that minimized mean square

error and bias, respectively, under the model assumptions. However,

the Battese-Fuller estimate closest to the Huddleston-Ray estimate was

far less satisfactory, failing to meet the desired upper limits for

mean square error and bias.

This study provided the justification for replacing the Huddleston-Ray

estimator with the Battese-Fuller family.

Table 4: County Estimates for Mississippi 1991
================================================================
                 Official       Computed      % Diff*     CV
County          acres (000)    acres (000)    percent   percent
________________________________________________________________
Cotton
Bolivar             65.5           61.6          6.0       9.9
Coahoma            105.7           88.3         16.5       4.8
Humphreys           61.6           57.3          7.0       5.9
Issaquena           38.0           34.6          9.0      11.3
Leflore             79.2           87.8         10.9       4.0
Quitman             31.0           46.4         49.7       8.6
Sharkey             47.0           48.6          3.4       7.0
Sunflower          100.0           79.3         20.7       6.9
Tallahatchie        64.2           67.9          5.8       7.2
Tunica              45.6           38.0         16.7       6.6
Washington          95.7          102.4          7.0       3.9
Yazoo               94.5           93.9          0.6       8.0
Total              828.0          806.1

Rice
Bolivar             74.0           66.2         10.5       5.4
Coahoma             15.8           10.4         34.2      24.0
Humphreys            3.6            7.1         97.2      32.4
Leflore             16.6           19.4         16.9      18.6
Sharkey              5.0            7.8         56.0      21.8
Sunflower           36.0           37.8          5.0       9.3
Tallahatchie         9.6            8.5         11.5      35.3
Tunica              17.5            9.9         43.4      26.3
Washington          30.5           22.6         25.9      15.5
Total              208.6          189.7
________________________________________________________________
* Click HERE for graphic.

The reliability of the county estimates based on the Battese-Fuller

model has been closely watched since its implementation in 1982. As

mentioned previously, these estimates are only one of possibly four or

more indications that are composited to provide the final published

crop acreage values. The reliability of the Battese-Fuller estimates

can vary between years, between crops, between counties and between

States depending on the stage of the crop at the time of the Landsat

imagery, the amount of crop acreage within the county, the number of

segments within the county and cloud cover. The results presented in

Tables 2 and 3 for corn are relatively good with all CVs less than 5

percent and over half of the percentage differences from the published

value less than 4 percent. The soybean results are slightly poorer,

with CVs ranging from 4 percent to 11 percent and percentage

differences ranging up to 18 percent. Table 4 presents more recent

results for a set of counties covered by Landsat in Mississippi for

1991. A review of the CVs and percentage differences indicates that

the Battese-Fuller estimates can have relatively large CVs and

percentage differences when the county crop acreage is less than

30,000 acres. Some summary statistics of the differences for the four

crop examples discussed are presented in Table 5. The mean absolute

difference is typically less than 10,000 acres, but

for small county acreages such as rice in Mississippi, large

percentage differences may still occur. Consequently, NASS SSO's

still use additional survey and administrative data to help set the

published values.

Table 5: Summary Statistics on Accuracy of Battese-Fuller Estimates
(1000 acres)
==============================================================
Crop/State/Year              MD*     RMSD*     MAD*     LAD*
______________________________________________________________
Corn Iowa 1988              -0.6      6.8       5.4     14.3
Soybeans Iowa 1988          -8.6     11.9       9.1     25.5
Cotton Mississippi 1991     -1.8     10.0       7.8     20.7
Rice Mississippi 1991       -2.1      5.2       4.5      7.9
______________________________________________________________
* MD   = mean difference between Battese-Fuller and published value
  RMSD = root mean squared deviation
  MAD  = mean absolute difference
  LAD  = largest absolute difference
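
The cotton row of Table 5 can be checked against the county values in Table 4. The Python sketch below is illustrative only: the function name is hypothetical, and the sign convention (computed minus official) is an assumption chosen because it matches the negative MD reported in Table 5.

```python
import math

# Table 4 cotton values in 1,000 acres: county -> (official, computed).
COTTON_1991 = {
    "Bolivar": (65.5, 61.6), "Coahoma": (105.7, 88.3),
    "Humphreys": (61.6, 57.3), "Issaquena": (38.0, 34.6),
    "Leflore": (79.2, 87.8), "Quitman": (31.0, 46.4),
    "Sharkey": (47.0, 48.6), "Sunflower": (100.0, 79.3),
    "Tallahatchie": (64.2, 67.9), "Tunica": (45.6, 38.0),
    "Washington": (95.7, 102.4), "Yazoo": (94.5, 93.9),
}

def summary_stats(pairs):
    """Return (MD, RMSD, MAD, LAD) for (official, computed) pairs."""
    # Assumed sign convention: computed (Battese-Fuller) minus official.
    diffs = [computed - official for official, computed in pairs]
    n = len(diffs)
    md = sum(diffs) / n                              # mean difference
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)  # root mean squared deviation
    mad = sum(abs(d) for d in diffs) / n             # mean absolute difference
    lad = max(abs(d) for d in diffs)                 # largest absolute difference
    return md, rmsd, mad, lad

md, rmsd, mad, lad = summary_stats(COTTON_1991.values())
print(f"MD={md:.1f} RMSD={rmsd:.1f} MAD={mad:.1f} LAD={lad:.1f}")
```

Rounded to one decimal place, the four values agree with the Cotton Mississippi 1991 row of Table 5.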

6.6 Current Problems and Activities

As technology improves, new sensors produce satellite data that can be

more accurately classified to a given crop than ever before.

Consequently, the overall count of pixels classified to a given crop

within a county can possibly be used directly to estimate crop

acreage. The overall pixel count represents a census of pixels

covering the county and therefore is not subject to sampling error.

However, a nonsampling error is introduced due to inaccuracies in the

classification. A general expression for such an estimator is:

Click HERE for graphic.

Both adjustment terms are conceptually simple. The combined ratio

uses stratum level survey information to compute the adjustment term

that may provide a more accurate conversion of pixel counts to crop

area than the set conversion factor. Also, the ratio has a readily

available formula for estimating the variance.

Research continues to focus on identifying new geographic areas and

crops where this estimator would be applicable. Also, possible

benefits of remotely sensed data from alternative sources, such as

radar satellites, will be investigated as the newer sources become

available. In recent years TM sensor data have been used to produce

county estimates in the Delta region. County estimates of rice,

cotton, and soybeans were produced for Arkansas and Mississippi in

1991, with Louisiana added in 1992. In 1993 satellite data are only

being used in Arkansas due to budgetary constraints. To date, the

satellite based estimates have only been produced on a limited scale.

The NASS SSO's continue to rely on other data series for helping set

the published county estimates of crop acreages. They conduct a large

non-probability county estimates survey (see Chapter 7) that serves a

dual purpose of also providing updated control data for the list

sampling frame. This is an integral part of the NASS survey program

and so will continue in some form for the foreseeable future. Fairly

reliable administrative data sources are also available. NASS is

continuing to investigate the benefits of satellite based county

estimates in relation to these other available data sources. One

by-product of the satellite data process that is attractive to the

State offices is color coded land use maps at the county level. These

maps provide a pictorial view of the distribution of the crops within

each county. Identifying alternative uses of satellite data such as

this is an important research objective of NASS.

REFERENCES

Allen, J.D. (1990), "A Look at the Remote Sensing Applications Program

of the National Agricultural Statistics Service," Journal of Official

Statistics, 6, pp. 393-409.

Amis, M.L., Martin, M.V., McGuire, W.G., and Shen, S.S. (1982),

"Evaluation of Small Area Crop Estimation Techniques Using Landsat and

Ground-Derived Data," LEMSCO-17597, Houston, TX: Lockheed Engineering

and Management Services Company, Inc.

Angelici, G., Slye, R., Ozga, M., and Ritter, P. (1986), "PEDITOR - A

Portable Image Processing System," Proceedings of the IGARSS '86

Symposium, Zurich, Switzerland, pp. 265-269.

Battese, G.E., Harter, R.M., and Fuller, W.A. (1988), "An

Error-Components Model for Prediction of County Crop Areas Using

Survey and Satellite Data," Journal of the American Statistical

Association, 83, pp. 28-36.

Bellow, M.E. (1991), "Comparison of Sensors for Corn and Soybean

Planted Area Estimation," NASS Staff Report No. SRB-91-02,

U.S. Department of Agriculture.

Bellow, M.E. and Graham, M.L. (1992), "Improved Crop Area Estimation

in the Mississippi Delta Region using Landsat TM Data," Proceedings of

the ASPRS/ACSM Convention, Washington, D.C., pp. 423-432.

Cardenas, M., Blanchard, M.M., and Craig, M.E. (1978), "On The

Development of Small Area Estimators Using LANDSAT Data as Auxiliary

Information," Economics, Statistics, and Cooperatives Service,

U.S. Department of Agriculture.

Cochran, W.G. (1977), "Sampling Techniques," New York, N.Y.: John Wiley & Sons.

Fuller, W.A. and Battese, G.E. (1973), "Transformations for Estimation

of Linear Models with Nested-Error Structure," Journal of the American

Statistical Association, 68, pp. 626-632.

Huddleston, H.F. and Ray, R. (1976), "A New Approach to Small Area

Crop Acreage Estimation," Proceedings of the Annual Meeting of the

American Agricultural Economics Association, State College, PA.

Ozga, M. (1985), "USDA/SRS Software of Landsat MSS-Based Crop Acreage

Estimation," Proceedings of the IGARSS '85 Symposium, Amherst, MA,

pp. 762-772.

Prasad, N.G.N. and Rao, J.N.K. (1990), "The Estimation of the Mean

Squared Error of Small-Area Estimates," Journal of the American

Statistical Association, 85, pp. 163-171.

Walker, G. and Sigman, R. (1982), "The Use of LANDSAT for County

Estimates of Crop Areas - Evaluation of the Huddleston-Ray and

Battese-Fuller Estimators," SRS Staff Report No. AGES 820909,

U.S. Department of Agriculture.

Appendix A: Estimators of Battese-Fuller Variance Components

Click HERE for graphic.

Appendix B: Huddleston-Ray Estimator

The Huddleston-Ray estimator replaces the classified pixel average for

the analysis district with the classified pixel average for a county

when estimating the county mean crop area per frame unit. Within the

analysis district, the overall mean crop area in regression stratum h

is estimated by:

Click HERE for graphic.

Appendix C: Cardenas Family of Estimators

The Cardenas family of estimators uses the stratum level differences

between mean number of pixels classified to the crop of interest in

the county and the analysis district, respectively, to adjust the mean

reported crop area per sample segment. Within a regression stratum h,

the estimate of mean crop area per segment for a county c is:

Click HERE for graphic.

CHAPTER 7

The National Agricultural Statistics Service

County Estimates Program

William C. Iwig

National Agricultural Statistics Service

7.1 Introduction and Program History

The National Agricultural Statistics Service (NASS) of the

U.S. Department of Agriculture (USDA) publishes over 300 reports

annually regarding the Nation's crop acreage, crop production,

livestock inventory, commodity prices, and farm expenses. The primary

source of this information is surveys of U.S. farmers, ranchers, and

agribusinesses who voluntarily provide information on a confidential

basis. These surveys are normally designed to provide State and

U.S. level indications of agricultural commodities. There is also a

need for county level estimates to assist farmers, ranchers,

agribusinesses, and government agencies in local agricultural decision

making.

NASS has published annual county estimates for over 70 years through

funding provided by cooperative agreements with State departments of

agriculture and agricultural universities, and directly from other

USDA agencies. The earliest known record of published county

estimates is by the Wisconsin State Board of Agriculture, which issued

county estimates on acreage and production of crops for 1911 and 1912

along with the number and value of livestock for 1912. Not until

1917, following the signing of the first Federal-State cooperative

agreement, did the USDA assist in the preparation and publication of

the Wisconsin county estimates. The cooperative agreement helped

eliminate duplication of efforts between Federal and State

statisticians, making possible more service for less cost. The

cooperative work grew rapidly after 1917 as other State departments of

agriculture and State agricultural universities established

cooperative agreements with the USDA. State governments needed county

level information and their funding made possible the publication of

county level estimates by USDA.

The New Deal Farm Programs of President Franklin D. Roosevelt's

Administration used county estimates of agricultural commodities

extensively and refocused USDA's attention to these estimates. In May

1933, the Agricultural Adjustment Act was passed and the Agricultural

Adjustment Administration (AAA) was soon in place. This agency had

the task of reducing supply in order to improve prices of agricultural

commodities. These programs greatly increased demands on NASS for

county estimates of commodities used by the AAA to set county quotas

and program pay-outs for surplus items.

In more recent years, the Federal Crop Insurance Corporation (FCIC)

and the Agricultural Stabilization and Conservation Service (ASCS) of

the USDA have used NASS county estimates to administer their programs

and they provide funding to NASS for that purpose. Their programs

involve payments to farmers if crop yields are below certain levels.

Both agencies have chosen to use the NASS county estimates, when

available, as the basis for determining these payments.

The estimation approach has remained relatively unchanged over the

years. The basic process for estimating totals such as crop acreage

and livestock inventory initially involves scaling various survey

estimates and other available administrative data at the county level

to be additive to the official USDA State level estimate. These

scaled estimates are composited together, usually with the previous

year estimate, to provide the actual county estimate for the current

year. This scaling and compositing process tends to strengthen the

final estimate over a direct design based expansion. These estimates

are checked against any available administrative data that are reliable

indicators of minimum levels and modifications are made if necessary.

Program changes that have been made since 1917 involve data

processing advances, allowing more data to be used, and larger

sampling frames and more sophisticated sample selection techniques,

providing better coverage of the farm population. Also, advances have

been made to improve the quality of the State level estimates, which

indirectly benefit the quality of the published county estimates

through the scaling process. In the late 1950's, methodology was

developed to conduct probability area frame surveys, where random

segments of land would be selected for enumeration. In the 1960's

these surveys became operational, which provided, for the first time,

probability survey indications of crop acreage and livestock

inventories on a State level basis. During this time frame, the State

reporter lists were also increasing in size and improving in quality.

With improved data processing capabilities in the 1970's, probability

Multiple Frame (MF) Surveys were implemented at the U.S. and State

levels, which combined the use of list and area sampling frames.

Also, some States have conducted probability or quasi-probability MF

County Estimates surveys (North Carolina Ag Statistics Service 1986).

States have traditionally shown a large degree of autonomy in

designing and conducting their county estimates surveys. This has

been due, in large part, to funding from the State cooperator, the

quality of different data sources and different computing capabilities

in each State. Recently, a NASS task force developed a County

Estimates system for sample selection and summarization that provides

a general framework, but still allows considerable flexibility to each

State in their sample selection and summarization procedures (Bass et

al. 1989). This system is now the standard being used by NASS State

offices for their county estimates program.

7.2 Program Description, Policies, And Practices

The NASS County Estimate Program is really 45 different programs

conducted separately by each NASS State Statistical Office (SSO).

There is some general structure provided by the 1989 County Estimates

Task Group, but still each State has considerable flexibility in the

implementation of the procedures. The quality of the county estimates

is to some degree related to the amount of financial support being

provided by the State cooperator, which is usually the State

Department of Agriculture.

The Census of Agriculture, conducted by the Bureau of the Census, has

always served as a benchmark for the USDA crop and livestock

estimates, and especially for county estimates. The annual State Farm

Census, funded by the State cooperator, was also an important

benchmark for the county estimates in many States until the late

1970's. Since then it has been discontinued in most States due to

lack of funding. The Census of Agriculture has been conducted every

five years since 1920 (on a 4 year schedule from 1974 to 1982),

providing county, district, State, and U.S. level estimates of most

agricultural commodities. Since 1982, the Census has been conducted

to coincide with the economic censuses (business, industry, etc.) in

years ending in 2 and 7. Census county level estimates are closely

watched since the USDA estimates are often based on very few survey

returns. At the same time, the quality of the Census numbers is also

closely evaluated. The completeness of the Census varies from State

to State, county to county, and item to item. Consequently, the

Census values are interpreted differently. After the Census values

are published, NASS statisticians review their estimates and make

revisions as necessary.

Another major component to the county estimate program has been the

official USDA State level estimate. Preliminary survey estimates and

administrative data are scaled to be additive to the official State

total. State estimates are based on more data than each individual

county estimate and, in recent years, have been based on probability

survey indications. Consequently, the State estimates have always

been considered more reliable than any individual county estimate.

In addition to being more reliable, State level estimates are usually

already published before county estimates are published. For these

reasons, county level indications have always been scaled to the State

level estimates rather than the State level estimate being the sum of

independently derived county estimates.

Over the years, the county estimate surveys have developed into a

major source of information for list frame maintenance and updating.

Farm operations that had not been contacted within a prescribed time

frame can be targeted for sampling for the annual county estimates

survey. Currently, NASS has a stated policy that all control data on

the list sampling frame (LSF) should be less than five years old (USDA

1991, Policy and Standards Memorandum 14-91). Control data refers to

the historic survey data values or data values from external sources

that are stored on the LSF and used for stratification and sample

design purposes.

Another policy that is followed in all States is the suppression of

any county estimate that would disclose the data of any individual

operation, as specified in Policy and Standards Memorandum 12-89 (USDA

1989). This policy preserves the confidentiality of all reports,

which is a foundation of voluntary reporting to NASS. Estimates

cannot be published if either: (1) the estimate is based on

information from fewer than three respondents, or (2) the data for one

respondent represents more than 60 percent of the estimate.

Exceptions to this rule are only granted when written and signed

permission is given by the respondent. Suppressed estimates may be

combined with another county as long as the confidential data are not

disclosed.
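
The two suppression conditions can be written as a simple check. The following Python sketch is only an illustration of the rule as stated above: the function and parameter names are hypothetical, the written-permission exception is not modeled, and how each respondent's data are compared with the estimate in practice (for example, raw versus expanded values) is an assumption.

```python
def must_suppress(respondent_values, estimate,
                  min_respondents=3, max_share=0.60):
    """Return True if the county estimate may not be published.

    respondent_values: contribution of each respondent to the estimate,
        expressed in the same units as the estimate (assumption).
    estimate: the county estimate under review.
    """
    # Condition (1): fewer than three respondents.
    if len(respondent_values) < min_respondents:
        return True
    # Condition (2): one respondent accounts for more than 60 percent.
    return max(respondent_values) > max_share * estimate

# Hypothetical examples.
print(must_suppress([100, 200], 300))             # two respondents
print(must_suppress([700, 100, 100, 100], 1000))  # one dominant respondent
print(must_suppress([300, 300, 400], 1000))       # publishable
```

A value exactly equal to 60 percent of the estimate is not suppressed in this sketch, since the rule refers to "more than 60 percent".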

In most States, county estimates are made for all major crop and

livestock categories. This may cover 50 - 100 separate commodity

items. Estimates for crop items usually include planted acres,

harvested acres, yield, production, and value of production for a

particular crop year. Some States also publish separate estimates for

different cropping practices, such as irrigated and non-irrigated

acreages. Livestock estimates include inventory numbers on a

particular date, possibly marketings, and inventory value. Each SSO

develops its own county estimate publication because they are State

funded. These estimates have associated sampling and non-sampling

errors. No variances or error information are published for the final

county estimates. Mean squared error information is only published

for major agricultural items at the U.S. level.

7.3 Estimator Documentation

The new NASS County Estimate System uses a combination of scaling and

compositing techniques to provide a county level total estimate for

any particular agricultural item. Separate estimates that may be

composited together include the previous year official estimate,

current year direct expansion and ratio estimates, and other available

indications. In recent years, remotely sensed data from satellites

have been used to generate county level estimates of crop acreages for

selected crops where this technology has been applied (see Chapter 6).

County estimates of a ratio such as crop yield, which is the ratio of

total crop production to total harvested acres, are dependent on the

final estimates of the two items involved. Current year data are

collected using primarily a mail survey in the fall of the year with

some selected telephone follow-up. State sample sizes can range up to

40,000 with usable record counts around 200 for major items in major

counties. However, county estimates for many commodities are based on

fewer than 20 sample records.

A key feature of the system is the sample design which involves

selecting sampling units from multiple overlapping stratified designs.

A separate design is developed for each commodity of interest. The

system combines data collected from sampled operations from these

different designs such that the selection probabilities are not used

in calculating the survey estimates. Another key feature of the

system is the coordination of survey contacts from the different

designs to control respondent burden. A third feature is a synthetic

scaling of the county estimates in order that they sum to the official

U.S. Department of Agriculture State level estimates. A fourth

feature is the compositing of the different estimates to provide final

county level estimates. Further details on each of these features

follow.

7.3.1 Commodity Specific Stratified Designs

The NASS County Estimate Program depends primarily on a large mail

survey in the fall of the year with State level sample sizes ranging

up to 40,000. Some States conduct two surveys, with an early fall

survey covering acreage and production of small grains which are

usually harvested by September. Then the late fall survey covers the

fall harvested crops and livestock. The sample units are farm

operations selected from the NASS list sampling frame in each State.

One of the major goals of the new system is to provide a framework

that will ensure adequate representation for each agricultural item of

interest. In order to provide adequate county level estimates, major

farm operations for each item of interest must be represented

appropriately in the sample. This is relatively easy for the major

crops in a State since a sample design representing all known

operations with cropland would represent any major crop adequately.

However, in order to provide adequate representation for rare crop and

livestock items, the strategy used in the new system is to develop

separate stratified sample designs for each agricultural commodity as

needed. The sample design strata for each commodity are based on the

positive control data for that particular item. Control data are the

historic data values stored on the list sampling frame. Strata

boundaries typically coincide with the categories used in the Census

of Agriculture publications. Table 1 illustrates the stratified

design that might be developed for barley in a particular State,

covering all known operations that have positive control data for

barley.

Table 1: Example Stratified Design for Barley
=============================================
              Population     Boundary
Stratum         Count        (acres)
_____________________________________________
  10            2,500          1 - 49
  20            1,000         50 - 99
  30              400        100 - 299
  40              100        300+
                _____
Total           4,000
_____________________________________________

The major function of the stratified design is to provide a framework

to group similar size operations for summarization (see 7.3.3).

Initial sampling may occur at the State level within each stratum.

Or, different sampling rates may be used at the county level in order

to assure an adequate sample within each county. Different sampling

rates by county would typically occur when the commodity frame

contains only a few records in a particular county. It may be

necessary to sample all records with "probability one" in that county,

where a smaller sampling fraction is sufficient in other counties.

This most frequently occurs with rare commodities. Another sampling

option keys on whether the sampling unit reported in the previous

year. If the current to previous year ratio is a primary indication

for a State, units that reported in the previous year may be sampled

heavily, and other records sampled at a lighter rate.

7.3.2 Coordination of Multiple Samples

The samples selected from the different commodity designs contain many

overlapping records. A farming operation could easily be selected

from multiple commodity designs. In addition, many of the selected

operations may have already provided all or some of the requested

information on another current year survey. These other survey data

files are used as input to the County Estimate System. The system is

designed to identify which records already have provided the requested

information and questionnaires are not sent to these operations. Even

if an operation has only provided some of the needed data on previous

crop specific or livestock specific surveys, it will typically not be

recontacted to help control respondent burden. Data items not

included on the previous surveys are treated as "missing" in the

county estimates expansions. The system also identifies which records

are duplicated in multiple designs and in multiple samples. Only one

questionnaire is sent to each sampled unit. The same questionnaire,

containing all items of interest, is used regardless of the commodity

design (barley, corn, hogs, etc.) from which the record was selected.

There is usually some telephone follow-up to non-respondents as

resources allow. Telephoning may be targeted to provide sufficient

data for each commodity. Since a secondary objective of the county

estimate survey is to update control data on the list sampling frame,

some telephoning may be targeted at operators with missing control

data or control data that are more than five years old.

7.3.3 Creation of Survey Indications

The County Estimates System is designed to provide direct expansion

and ratio estimates based on sample data collected from the county

estimates survey and from sample data collected from other current

year surveys. As mentioned previously, the same questionnaire is used

for all farm operations selected specifically for the county estimates

survey, regardless of the originating commodity design. Consequently,

a farm operation selected from the barley design will also be asked to

provide data on all other crop and livestock items. All reported data

from the county estimates survey and from other surveys are used in

providing the survey indications. For each operation, the system

identifies the assigned strata from all of the commodity designs. Not

all records will be included in each commodity design, since not all

records have positive control data for all commodities.

Records that do not have an original design stratum for a commodity

are assigned to "pseudo stratum 99" for summary. Then corn data are

summarized in the corresponding stratum from the corn design for each

operation and hog data are summarized in the corresponding hog

stratum. Since data are used for a particular item from records that

were not selected in the original sample design, the direct expansion

and ratio estimates are not based on the selection probabilities.

However, this approach probably doubles the number of positive data

records available for most survey items compared to just using data

records from the original commodity designs. The use of this

additional data is a stabilizing factor in providing reliable county

level estimates.
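
The stratum lookup described above can be sketched in Python using the barley boundaries from Table 1. This is a minimal illustration, not the system's actual logic: the names are hypothetical, and treating a missing or non-positive control value as "no positive control data" is an assumption.

```python
# Lower boundary (acres) -> stratum code, from Table 1 (barley example).
BARLEY_LOWER_BOUNDS = [(1, 10), (50, 20), (100, 30), (300, 40)]
PSEUDO_STRATUM = 99  # records with no original design stratum

def assign_stratum(control_acres, lower_bounds=BARLEY_LOWER_BOUNDS):
    """Return the summary stratum for one record and one commodity."""
    # No positive control data: the record was never in this design.
    if control_acres is None or control_acres <= 0:
        return PSEUDO_STRATUM
    # Take the highest stratum whose lower boundary is not exceeded.
    stratum = lower_bounds[0][1]
    for low, code in lower_bounds:
        if control_acres >= low:
            stratum = code
    return stratum

for acres in (None, 0, 25, 50, 299, 300):
    print(acres, assign_stratum(acres))
```

A record selected from, say, the hog design that reports barley but has no barley control data would be summarized in pseudo stratum 99 for barley under this sketch.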

Survey estimates from the County Estimate System are provided at

State, district, and county levels for each item. Districts are

groups of geographically contiguous counties with relatively

homogeneous agricultural practices and climate within each district.

There are usually four to nine districts per State. The State and

district estimates are used primarily in the scaling process described

later. The county level survey estimates are the basis for the final

published estimates, but they also go through a scaling and

compositing process. Population counts and usable record counts are

generated by the system at each level. The direct expansion estimate

for a particular commodity at any level is represented as follows:

Click HERE for graphic.

Table 2: Examples of Direct Expansion County, District, and State

Estimates for Corn Planted Acres

Click HERE for graphic.
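
Because the original expression is not reproduced here, the following Python sketch shows only one plausible form of a direct expansion: within each stratum, the mean of the usable reports is multiplied by the stratum population count, and the results are summed over strata. That form, the function name, and the reported values are assumptions; the population counts are the barley counts from Table 1.

```python
def direct_expansion(population_counts, reports):
    """Sum over strata of population count times mean usable report.

    population_counts: stratum -> number of operations in the population.
    reports: stratum -> list of usable reported values (e.g. acres).
    """
    total = 0.0
    for stratum, values in reports.items():
        if not values:
            # No usable reports: the stratum contributes nothing here,
            # which would understate the total in practice.
            continue
        total += population_counts[stratum] * sum(values) / len(values)
    return total

# Population counts from Table 1; reported acres are hypothetical.
population = {10: 2500, 20: 1000, 30: 400, 40: 100}
reported = {10: [20, 30], 20: [60, 80], 30: [150, 250], 40: [400]}
print(direct_expansion(population, reported))
```

Note that the expansion factor is a population count over a usable record count rather than an inverse selection probability, consistent with the statement that selection probabilities are not used.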

In addition to direct expansion estimates, ratio estimates of totals

and ratio estimates of ratios are also created. For crop acreage

items, possible ratio estimates are based on ratios of current year

planted acres to previous year planted acres, harvested to planted

acres, planted acres to total cropland acres, and irrigated acres to

planted acres. The ratio estimates are generated from usable reports

for both the numerator and denominator and are expressed as:

Click HERE for graphic.
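
The requirement that a report be usable for both the numerator and the denominator can be illustrated as follows. This Python sketch is an assumption-laden simplification: it pools records without any stratum weighting, the field names and values are hypothetical, and applying the ratio to a previous year total is shown only as one way a ratio estimate of a total might be formed.

```python
def matched_ratio(records, numerator, denominator):
    """Ratio of sums over records with usable data for both items."""
    usable = [r for r in records
              if r.get(numerator) is not None
              and r.get(denominator) is not None]
    num = sum(r[numerator] for r in usable)
    den = sum(r[denominator] for r in usable)
    if den == 0:
        return None  # ratio undefined without denominator data
    return num / den

# Hypothetical reports: current and previous year planted acres.
records = [
    {"planted_now": 110, "planted_prev": 100},
    {"planted_now": 45, "planted_prev": 50},
    {"planted_now": 80, "planted_prev": None},  # dropped: no denominator
]
ratio = matched_ratio(records, "planted_now", "planted_prev")
previous_year_total = 30000  # hypothetical previous year county estimate
print(ratio, ratio * previous_year_total)
```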

7.3.4 Scaling of Indications

The first step in the process is to scale the individual county and

district "indications" to the official published USDA State level

estimate. Typically, "indications" that are scaled include:

1) survey direct expansion estimate

2) survey ratio estimates

3) previous year estimate

4) other indications (remotely sensed acreage estimates, Census of

Agriculture, other administrative data).

Initially, each district indication (direct expansion, ratio,

administrative data) is scaled. Suppose there are "M" different

indications. The scaling at the district level occurs as follows:

Click HERE for graphic.

The resulting county level estimates for each of the "M" indications

(direct expansion, ratio, administrative data) then sum to the

district estimate. This scaling process serves as a weighting

adjustment to account for any incompleteness in the various

indications. As mentioned previously, the NASS list sampling frame

typically provides about 80% coverage for major commodities.

Administrative data values also have varying degrees of completeness.
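
The two-stage scaling can be sketched in Python for a single indication. The sketch assumes simple proportional scaling (each value is multiplied by the target total divided by the sum of the unscaled values); the function name and all figures are hypothetical.

```python
def scale_to_total(indications, target_total):
    """Proportionally scale values so that they sum to target_total."""
    raw_total = sum(indications.values())
    if raw_total == 0:
        raise ValueError("cannot scale indications that sum to zero")
    factor = target_total / raw_total
    return {area: value * factor for area, value in indications.items()}

# Step 1: scale district indications to the official State estimate.
state_official = 1000.0
districts = scale_to_total({"D1": 300.0, "D2": 500.0}, state_official)

# Step 2: scale county indications to their scaled district estimate.
counties_d1 = scale_to_total({"A": 100.0, "B": 150.0}, districts["D1"])

print(districts)
print(counties_d1)
```

In this example the raw district indications cover only 80 percent of the State total, so every value is raised by the same factor, which is the sense in which scaling acts as a weighting adjustment for incompleteness.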

7.3.5 Compositing of Scaled Estimates

The next step in the process is to composite together the various

scaled estimates to provide satisfactory county and district level

estimates. The composite estimates generated for each county and

district are represented as follows:

Click HERE for graphic.

Rounding rules are incorporated into this process so that the final

estimates are the published values. These estimates are reviewed by

statisticians in the State office for reasonableness based on their

knowledge of the location and general size of the largest operations

in the State for each commodity. The estimates must exceed minimum

levels and not exceed maximum levels provided by reliable

administrative data sources. For example, a State may check that the

sum of major crop acreages does not exceed the Census of Agriculture

total cropland acres for each county. If estimates are not

reasonable, the data will be more closely examined for outliers and

insufficient sample sizes. Different weights for the compositing

process or adjustments to the outlier indications may be needed to

provide the final published county level estimates.
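
A minimal sketch of the compositing and review steps follows. It assumes that the composite is a weighted average of the scaled indications with weights that sum to one; the weights, the rounding rule and the bounds shown are hypothetical, since the actual values are set by each State office.

```python
def composite(scaled, weights):
    """Weighted average of scaled indications; weights are normalized to sum to 1."""
    weight_total = sum(weights[name] for name in scaled)
    return sum(weights[name] * value for name, value in scaled.items()) / weight_total


def within_bounds(estimate, minimum, maximum):
    """Reasonableness check against administrative minimum and maximum levels."""
    return minimum <= estimate <= maximum


# Hypothetical scaled indications (acres) for one county.
scaled = {"direct_expansion": 12400.0, "ratio": 11800.0, "previous_year": 12100.0}
weights = {"direct_expansion": 0.5, "ratio": 0.3, "previous_year": 0.2}

county_estimate = composite(scaled, weights)

# Hypothetical rounding rule: publish to the nearest 100 acres.
published = round(county_estimate, -2)

# Hypothetical bounds, for example zero and a total cropland figure.
passes_review = within_bounds(published, 0.0, 15000.0)
```

If every scaled indication sums to the district estimate and the same weights are used in every county of the district, the unrounded composites also sum to the district estimate. Rounding can break this equality slightly.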

7.4 Evaluation Practices

Each NASS State Statistical Office has taken a major responsibility in

developing and evaluating procedures that help provide reliable county

estimates in an efficient manner in their State. The autonomy in each

program is primarily a function of the funding received from the

different State cooperators. The recently developed NASS County

Estimates System provides a common framework for producing county

estimates within each State. However, the actual sampling and

estimation methods still vary to some degree. Some documented

research has been conducted over the years to evaluate different

procedures. But the Census of Agriculture continues to be the major

evaluation tool.

Ford, Bond, and Carter (1983) examined a model-based approach that

estimates the percentages of the total USDA State level crop acreage

allocated to each county and district. A composite estimator was used

to estimate North Carolina county and district level percentages for

1981. The composite included the estimated percentages based on

direct estimates of crop acreage from two separate probability crop

acreage surveys and the estimated percentage from a simple linear

regression on the percentages over time (1972-1980). The time trend

component tended to have much larger weights than the survey

components in the composite. Results demonstrated that indications

from this procedure were more stable and closer to published values

than indications from either of the separate crop acreage surveys.

Since the published values tended to follow the composite which is

strongly influenced by the time trend model, the results suggested

that NASS statisticians were already informally following the linear

time trends in setting the county estimates, and consequently, these

procedures were never implemented.
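
The structure of such a composite can be sketched as follows. The county percentages, survey values and weights are hypothetical and are not the values used by Ford, Bond, and Carter; the only features taken from the text are the simple linear regression on 1972-1980 and the larger weight on the trend component.

```python
def linear_trend_forecast(years, values, target_year):
    """Ordinary least squares line through (year, value), evaluated at target_year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return mean_y + slope * (target_year - mean_x)


# Hypothetical county share of the State acreage, in percent, for 1972-1980.
years = list(range(1972, 1981))
percentages = [4.0 + 0.1 * (year - 1972) for year in years]

trend_1981 = linear_trend_forecast(years, percentages, 1981)

# Hypothetical 1981 percentages from the two probability surveys.
survey_a, survey_b = 5.6, 4.4

# Hypothetical weights, with most of the weight on the time trend.
w_trend, w_a, w_b = 0.7, 0.15, 0.15
composite_pct = w_trend * trend_1981 + w_a * survey_a + w_b * survey_b
```

With these figures the composite stays close to the trend value even though the two survey percentages differ widely, which is the stability property described above.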

The major evaluation process of the NASS county estimates continues to

be the review against the Census of Agriculture numbers every five

years. NASS statisticians are actually involved in the review of the

Census numbers before they are published to resolve any major

discrepancies based on their knowledge of the State's agriculture and

their county estimates for the comparable year. After this review,

the Census data are resummarized and published. NASS State offices

then go through the "Census Review" process. The county estimates

series during the last five years is reviewed for consistency with the

Census numbers and any necessary changes are made. This is a

subjective process, and handled differently in each State. Other

available check data may also be used in the revision process, such as

data from livestock or crop associations.

7.5 Current Problems and Activities

Currently, research is being conducted on general small area

estimation methodology through a cooperative agreement with the

Department of Statistics, The Ohio State University. In addition,

research needs are being identified by the developers and users of the

county estimates system as they gain experience with the programs.

The methodology research with The Ohio State University has focused on

statistical procedures for non-probability survey data with the

constraint that the county estimates must sum to the

official NASS State estimate. Initial research considered a multiple

regression estimator for obtaining county estimates of wheat

production in Kansas (Stasny, Goel, and Rumsey 1991). The regression

model is of the form:

Click HERE for graphic.

The county total can be estimated if county level values are known for

all independent variables in the regression model. In the initial

analysis of wheat production county estimates, the independent

variables were planted acres of wheat and a district indicator which

accounted for differences in yield for different areas of the State.

Since production is closely related to planted acres and yield, these

seem to be reasonable independent variables. It may be more difficult

to identify independent variables for estimated planted acreage.

These indications would then be scaled by some method. Evaluation of

the regression estimator using simulated data indicated that it

generally produced more precise indications than a direct expansion of

sample data within the respective county. Analysis also indicated

that a constant proportional scaling method worked just as well as

more sophisticated methods involving the sum of squared differences or

the sum of squared relative differences between the county indications

and the final estimates. Future research is planned to consider other

variables and other small-area estimators.
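
The three scaling approaches mentioned above can be compared in a short sketch. The source does not give their formulas, so the versions below are an assumption: each is the closed-form solution obtained by minimizing the stated criterion subject to the county estimates summing to the State total. The function names and data are hypothetical.

```python
def proportional(indications, total):
    """Constant proportional scaling: every county gets the same factor."""
    s = sum(indications)
    return [x * total / s for x in indications]


def least_squares(indications, total):
    """Minimize sum (e_i - x_i)^2 subject to sum e_i = total (equal additive shift)."""
    shift = (total - sum(indications)) / len(indications)
    return [x + shift for x in indications]


def relative_least_squares(indications, total):
    """Minimize sum ((e_i - x_i) / x_i)^2 subject to sum e_i = total."""
    lam = (total - sum(indications)) / sum(x ** 2 for x in indications)
    return [x + lam * x ** 2 for x in indications]


# Hypothetical county indications and official State estimate.
county_indications = [100.0, 200.0, 700.0]
state_total = 1100.0

scaled_prop = proportional(county_indications, state_total)
scaled_ls = least_squares(county_indications, state_total)
scaled_rel = relative_least_squares(county_indications, state_total)
```

All three results satisfy the sum constraint. They differ in how the shortfall is allocated: equally per county under least squares, in proportion to the indication under proportional scaling, and in proportion to its square under the relative criterion.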

Research is also being conducted through the cooperative agreement

with The Ohio State University on a synthetic estimator for counties

that have zero or only a few positive records for a commodity. In

spite of the improved sampling capabilities of the new system, this

situation still occurs. Approaches that share information from

neighboring counties and across States are being investigated.

Also, there is a need to evaluate survey estimates (direct expansion

and ratio) generated on a probability basis. The current program

combines data from different sampling designs in such a manner that

the actual selection probabilities are not used. This procedure was

chosen because it is easy to implement. Also, it makes use of all

data collected. As stated previously, the same questionnaire is used

for all sample units, regardless of the original sampling design.

Consequently, barley data are collected from the barley design, from

the corn design, from the hog design, etc. An alternative approach

that also makes use of all data collected is to first generate, for

each commodity, probability based estimates independently from each

design. That is, generate separate barley acreage estimates from the

barley design, from the corn design, from the hog design, etc., using

the appropriate selection probabilities. These estimates can then be

combined to produce an unbiased (or nearly unbiased) estimator with

less variance than an estimate based on a single design. Analysis is

currently being conducted to evaluate alternative post-stratification

and composite estimation strategies.
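
One standard way to combine independent unbiased estimates is inverse-variance weighting, sketched below. The source does not state which combination rule is under evaluation, so this choice is an assumption made for illustration, and the estimates and variances are hypothetical.

```python
def combine_independent(estimates, variances):
    """Inverse-variance weighted mean of independent unbiased estimates.

    Returns the combined estimate and its variance, treating the variances
    as known.
    """
    weights = [1.0 / v for v in variances]
    weight_sum = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / weight_sum
    combined_variance = 1.0 / weight_sum
    return combined, combined_variance


# Hypothetical barley acreage estimates from the barley, corn and hog designs.
design_estimates = [1000.0, 1100.0, 900.0]
design_variances = [100.0, 400.0, 400.0]

combined, combined_variance = combine_independent(design_estimates, design_variances)
```

When the variances are known, the combined variance is never larger than the smallest single-design variance. In practice the variances are themselves estimated, which is one reason the text describes the combined estimator as unbiased or nearly unbiased.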

As has been described, the NASS County Estimates System has evolved

over the past 70 years. The published estimates continue to be a

relied upon source of essential information for many data users in the

agricultural community. However, there is a constant concern about

the quality of the estimates and methodological improvements that

could be made. The program requires a major commitment of resources

for the editing, summarization, and publishing of the data. These

issues will continue to be a focus of future research as resources

allow.

REFERENCES

Bass, J., Guinn, B., Klugh, B., Ruckman, C., Thorson, J., and Waldrop,

J. (1989), "Report of the Task Group on County Estimates," National

Agricultural Statistics Service, U.S. Department of Agriculture.

Brooks, E. M. (1977), "As We Recall: the Growth of Agricultural

Estimates, 1933-1961," Statistical Reporting Service, U.S. Department

of Agriculture.

Ford, B. L., Bond, D., and Carter, N. (1983), "Combining Historical

and Current Data to Make District and County Estimates for North

Carolina," Staff Report AGES 830906, Statistical Reporting Service,

U.S. Department of Agriculture.

North Carolina Agricultural Statistics Service (1986), "North Carolina

Probability A&P and County Estimates Surveys," Raleigh, NC: Author.

Stasny, E. A., Goel, P. K., and Rumsey, D. J. (1991), "County

Estimates of Wheat Production," Survey Methodology, Vol. 17, pp.

211-225.

U.S. Department of Agriculture (1917), "Conference of Agricultural

Statisticians," Author.

U.S. Department of Agriculture, Bureau of Agricultural Economics

(1933), "The Crop and Livestock Reporting Service of the United

States," Misc. Publication No. 171, Author.

U.S. Department of Agriculture, Bureau of Agricultural Economics

(1949), "The Agricultural Estimating and Reporting Services of the

United States Department of Agriculture," Misc. Publication No. 703,

Author.

U.S. Department of Agriculture, Agricultural Marketing Service (1957),

"National Conference of Agricultural Statisticians: Conference Papers,

Part B, Commodity Branch Sessions," Author.

U.S. Department of Agriculture, Statistical Reporting Service (1969),

"The Story of U.S. Agricultural Estimates," Misc. Publication

No. 1088, Author.

U.S. Department of Agriculture, National Agricultural Statistics

Service (1989), "Standard for Suppressing Data Due to

Confidentiality," Policy and Standards Memorandum No. 12-89, Author.

U.S. Department of Agriculture, National Agricultural Statistics

Service (1991), "Sampling Frame Standards for Coverage and

Maintenance," Policy and Standards Memorandum No. 14-91, Author.

U.S. Department of Agriculture, National Agricultural Statistics

Service (1992), "Estimation Manual," Volume 10, Author.

CHAPTER 8

Model Based State Estimates

from the National Health Interview Survey

Donald Malec

National Center for Health Statistics (NCHS)

8.1 Introduction and Program History

There is a continuing need to assess health status, health practices

and health resources at both the national level and subnational

levels. Estimates of these health items help determine the demand

for quality health care and the access individuals have to it.

Although NCHS survey data systems can provide much of this information

at the national level, little can be provided directly at the

subnational level, except for a few large states and metropolitan

areas. The need for State and substate health statistics exists,

however, because health and health care characteristics are known to

vary geographically. Also, health care planning often takes place at

the state and county level.

In this chapter our focus will be the production of state and substate

indirect estimators from the National Health Interview Survey (NHIS).

Information on health status, health practices and health resources is

collected annually in the NHIS and direct national estimates of these

items are also produced annually. The NHIS is a multistage, personal

interview sample survey. It is redesigned every ten years, in order

to make use of new population data collected in the U.S. Census of

Population. The current sample design uses 1,983 primary sampling

units (PSU's), each PSU consisting of a single county or a group of

contiguous counties (minor civil divisions are used instead of

counties in New England and Hawaii). The population of 1,983 PSU's is

stratified and approximately 200 are sampled with probability roughly

proportional to their population sizes. Within each sampled PSU

clusters of households are formed and sampled. Areas within a PSU

with a high concentration of Blacks are oversampled. The NHIS is a

cross-sectional survey; each year a new sample containing

approximately 50,000 households and 120,000 individuals is selected.

For additional details about the design of the NHIS see Massey et al.

(1989).

High costs are the primary reason that NCHS is unable to provide

subnational estimates from its national surveys. With the current

budget, the sample size in most states is often too small to produce

precise direct estimates. There is also a concern that direct

estimates of small areas will have a larger component of nonsampling

error. For example, a small area may be canvassed by only one

interviewer and the resulting direct estimate may be affected by this

interviewer's style. In contrast, direct estimates that cover higher

geographic levels are canvassed by many interviewers and will tend to

have a smaller interviewer effect due to the involvement of many

independent interviewers. However, even with large sample sizes and

many interviewers, problems can occur in preparing direct estimates of

the variance of direct state estimates because the NHIS was not

designed for this purpose (Parsons, Botman and Malec, 1990).

Since 1968, the National Center for Health Statistics has produced and

evaluated indirect state estimators of health items derived from the

NHIS. Although NCHS does not have a program for the regular

publication of subnational estimates based on indirect estimation

methods, it has supported the development and evaluation of these

techniques. This aim has been achieved through the support of

in-house research, research grants and small-area conferences and

workshops. Through these efforts, three Public Health Series reports,

containing indirect State estimates of disability and the use of

health care have been published (see section 8.2.1). A number of

methodological research projects have also taken place at the Center.

Research results have appeared in the Center's series reports and in

journals and conference proceedings. The demand for small area

estimates is increasing. Subnational estimates are sometimes needed

for the administration of Federal Block grants. States are also

striving to meet the health guidelines for the year 2000 as promoted

in Healthy People 2000: National Health Promotion and Disease

Prevention Objectives. The assessment of the dietary and nutritional

status of the U.S. requires an understanding of these factors at the

subnational level. Accurate estimates are needed for all these

purposes.

The Center is continuing research efforts into the development of

subnational estimators. Currently, estimates based on a hierarchical,

logistic regression are being produced and evaluated. This model

includes demographic effects and county level effects and includes

county level variation. These continuing efforts are being made to

both improve the accuracy of small area estimates and to produce

estimates of their accuracy.

The next redesigned NHIS, which will be fielded from 1995 to 2005,

will possibly oversample and screen for both Blacks and Hispanics. It

is planned that approximately twice as many PSUs will be sampled as

are sampled now. In addition PSUs will, most likely, be stratified by

state and by urban/rural PSUs within a state. This stratification

will not produce precise state estimates but it will provide a

convenient framework for supplementing the NHIS state data with

additional state data. The use of state strata may also improve

indirect state estimates.

8.2 Program Description, Policies and Practices

8.2.1 Estimates from the NHIS

Three reports containing indirect state estimates from the NHIS have

been published.

The first report, Synthetic State Estimates of Disability Derived from

the National Health Survey (NCHS 1968), contains estimates of long-

and short-term disability measures collected during July 1962 - June

1964. Specifically, the report contains the percent of persons who

suffer from one or more chronic conditions, the percent of persons

whose activity is limited by a chronic condition, the average number

of restricted activity days per person, the average number of bed

disability days per person and the average number of work-loss days

per employed person. Estimates were made using a ratio adjusted

synthetic estimate.

The second report, State Estimates of Disability and Utilization of

Medical Services: United States, 1969-1971 (NCHS 1977), also contains

state estimates of disability as well as state estimates of short-stay

hospital utilization, physician visits and dental visits. The

estimates in this report are also ratio adjusted synthetic estimates.

The third report, State Estimates of Disability and Utilization of

Medical Services: United States 1974-76 (NCHS 1978) contains estimates

of the same health items as the preceding report; disability,

short-stay hospital utilization, physician visits and dental visits.

These estimates were made using a composite estimation method.

These reports present estimates of levels but contain no estimates of

accuracy. The reason estimates of accuracy were not presented is

because satisfactory estimates of the error of individual estimates of

states did not exist for either synthetic or composite estimates.

8.2.2 Small Area Research Conferences

The Center has sponsored or cosponsored three research conferences on

small area estimation. The first conference, cosponsored with the

National Institute on Drug Abuse was held in Princeton, N.J. in 1978.

The second conference was held in Snowbird, Utah, in 1984. The third

conference was held in New Orleans, LA, in 1988. The first two

conferences produced published proceedings (see NIDA Research

Monograph 24, 1979 and NCHS 1984).

8.3 Estimator Documentation

NCHS has no regular program of producing indirect state estimates from

the NHIS. However, the following estimators have been used in the

past to prepare state estimates of health characteristics. Many of

these estimators were introduced to correct for the known deficiencies

of the synthetic estimator.

8.3.1 Basic Synthetic Estimator

The basic synthetic estimator is used for State estimation when

national estimates by class and State-specific population counts by

class are available. The synthetic estimator weights the national

class means by the proportion of persons in the state belonging to the

class.

The form of the estimator for state d is:

Click HERE for graphic.

A synthetic estimator is unbiased if the population can be divided

into mutually exclusive and exhaustive classes, b, in which the

average health characteristic in each class does not vary among small

areas. If this assumption is true and if a large enough sample is

selected in each class, then the synthetic estimator will be accurate.
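
The verbal description above can be sketched in a few lines. The class labels, national class means and State population counts are hypothetical; the calculation follows the description of weighting national class means by the State's class proportions.

```python
def synthetic_estimate(state_class_counts, national_class_means):
    """Weight national class means by the State's population share in each class."""
    state_population = sum(state_class_counts.values())
    return sum(
        (count / state_population) * national_class_means[b]
        for b, count in state_class_counts.items()
    )


# Hypothetical national class means (for example, physician visits per person).
national_class_means = {"under_45": 2.0, "45_and_over": 5.0}

# Hypothetical State population counts by class.
state_class_counts = {"under_45": 600, "45_and_over": 400}

state_synthetic = synthetic_estimate(state_class_counts, national_class_means)
```

Only the population counts are specific to the State. Two States with the same class composition therefore receive the same synthetic estimate, which is why the estimator depends on class means not varying among areas.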

In chapter two, entitled "Synthetic Estimation in Followback Surveys

at the National Center for Health Statistics", a detailed example of

the use of synthetic estimation is provided. Another example of the

construction of a synthetic estimate can be found in Schaible, et

al. (1977) where they create sixty-four demographic classes based on

cross-classifications of variables defined by race, sex, age, family

size and industry occupation of head of family.

Click HERE for graphic.

The synthetic estimator appears in a number of NCHS related

publications (e.g., NCHS 1968, 1977; Levy 1971; and Namekata, Levy and

O'Rourke 1975) and has been used extensively.

8.3.2 Ratio Adjusted Synthetic Estimator

When regional, direct estimates are available, state synthetic

estimates are often ratio adjusted to their regional direct estimate.

In this way regional estimates, obtained by combining synthetic State

estimates in a region together, will equal the corresponding direct

estimator. This adjustment removes all bias in the synthetic estimate

at the regional level and is an attempt to remove bias of the

synthetic estimator at the state level.

The form of this estimator is:

Click HERE for graphic.

This estimator has also been used in a number of NCHS publications

(e.g., NCHS 1968, 1977 and Levy 1971). It is also used by the

Department of Agriculture in their NASS county estimates program (see

chapter 7).
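
A minimal sketch of the ratio adjustment follows. It assumes that State synthetic estimates of a mean are combined to the regional level using State population sizes as weights; the State labels, populations and estimates are hypothetical.

```python
def ratio_adjust(synthetic, populations, regional_direct):
    """Scale State synthetic estimates so that their population-weighted
    regional combination equals the regional direct estimate."""
    region_population = sum(populations[s] for s in synthetic)
    regional_synthetic = (
        sum(populations[s] * synthetic[s] for s in synthetic) / region_population
    )
    factor = regional_direct / regional_synthetic
    return {s: value * factor for s, value in synthetic.items()}


# Hypothetical States in one region.
synthetic = {"State A": 3.0, "State B": 4.0}
populations = {"State A": 1000, "State B": 3000}
regional_direct = 3.9

adjusted = ratio_adjust(synthetic, populations, regional_direct)

# Recombine the adjusted State estimates to check the regional agreement.
recombined = sum(populations[s] * adjusted[s] for s in adjusted) / sum(populations.values())
```

Every State in the region receives the same adjustment factor, so the adjustment corrects the regional level exactly but cannot correct State-specific departures from the regional pattern.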

Click HERE for graphic.

8.3.5 Nearly Unbiased Estimator

Click HERE for graphic.

Several key features can be seen in this figure. There is a

predominant sex effect. While relatively fewer males in their 20's or

30's visit a physician, relatively more females visit a physician

during these child-bearing years. In addition, after accounting for a

linear age term, the relative propensity to visit a physician

increases for both the underaged and the overaged, regardless of sex.

To account for these effects, independent variables corresponding to

linear splines were used. After examining a number of residual plots

by age, race and sex, a final set of independent variables was chosen.

Based on visual inspection, race effects were considered negligible

and not included in the final model. In relation to an individual in

county, c, in the age and sex class denoted by, b, the final set of

independent variables are defined as follows:

Click HERE for graphic.
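
For readers unfamiliar with linear splines, the following sketch shows how piecewise-linear age terms of this kind can be generated. The knot locations are hypothetical and are not the ones used in the model; the sketch shows the form of such variables only, not the report's eight-parameter specification.

```python
def linear_spline_terms(age, knots):
    """Return [age, (age - k1)+, (age - k2)+, ...], where (x)+ = max(0, x).

    A regression on these terms is linear between knots and changes slope
    at each knot.
    """
    age = float(age)
    return [age] + [max(0.0, age - knot) for knot in knots]


# Hypothetical knots, chosen only for illustration.
hypothetical_knots = (20.0, 35.0, 65.0)

terms_age_40 = linear_spline_terms(40, hypothetical_knots)
terms_age_10 = linear_spline_terms(10, hypothetical_knots)
```

For an age of 40 the terms beyond the first record how far the age lies above each knot it has passed, and the term for the knot at 65 is zero.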

Partial residual plots based on this eight parameter model were

computed and figure 2 plots the averages of these residuals within age

by sex group. As can be seen, the partial residual plot indicates a

relative absence of age and sex effects. (The apparent heterogeneity

in the plots is at least partly due to an unequal sample size in each

age group.)

Click HERE for graphic.

These residuals, based on the eight parameter model outlined above,

are then used to assess the effect of county covariates. The

resulting residuals, with the individual age and sex effects removed,

were then averaged within counties of a given type, for example

counties with a high level of educational attainment. Corresponding

to various typologies of counties, plots of the residuals versus

county types were used to assess the influence of county covariates on

the proportion of persons visiting a physician. Figure 3 illustrates

one such comparison. Here, residuals are averaged within counties

exhibiting a certain education level. A number of independent

variables were examined in this manner. For physician visits,

economic type variables such as per capita income, percent of

population below poverty and education level exhibited similar trends.

Other county covariates from the March 1989 Area Resource File (1989)

and the NCHS County Mortality files were also examined.
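
The averaging of residuals within county types can be sketched as follows. The county-type labels and residual values are hypothetical; in the analysis described above the residuals would come from the individual-level age and sex model.

```python
from collections import defaultdict


def mean_residual_by_group(records):
    """records: iterable of (group_label, residual) pairs.

    Returns a dictionary mapping each group label to its mean residual.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, residual in records:
        sums[group] += residual
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical residuals, each tagged with the education level of the
# respondent's county.
records = [
    ("low_education", -0.04),
    ("low_education", -0.02),
    ("high_education", 0.03),
    ("high_education", 0.05),
]

group_means = mean_residual_by_group(records)
```

A systematic difference in mean residual across county types, as in these made-up figures, would suggest that the county covariate carries information not captured by age and sex.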

Click HERE for graphic.

Based on this preliminary work, State estimates for both physician

visits and health status are being planned for publication.

8.4 Evaluation Practices

8.4.1 Reports or Evaluation Studies

A number of evaluations on small area estimators have been conducted

at NCHS. The first publication of synthetic estimates (NCHS 1968)

includes a comparison of a synthetic estimator, a nearly unbiased

estimator and Woodruff's estimator. Since then a number of other

evaluation studies have been conducted at the Center. A short

description of each is given below.

The Use of Mortality Data in Evaluating Synthetic Estimates

A basic synthetic estimator, a ratio adjusted synthetic estimator and

a regression adjusted synthetic estimator are evaluated by

constructing these estimates of mortality rates for Motor Vehicle

Accidents, Major Cardiovascular Diseases, Suicides and Respiratory

T.B. using the complete population of mortality events compiled by

NCHS. These state estimates are then compared to their known rates,

using the same data. It was found that the accuracy of the synthetic

estimates varied considerably from state to state and from item to

item. (Levy 1971).

Synthetic Estimates of Work Loss Disability for Each State and the

District of Columbia

The basic synthetic estimates of partial and complete work disability

are compared to precise estimates from the 1970 Census of Population

and Housing. Here, the agreement between synthetic and direct

estimates of partial work disability was found to be fairly good

while the agreement between corresponding estimates of complete work

disability was fairly poor. (Namekata, Levy and O'Rourke 1975)

Synthetic Estimates of State Health Characteristics Based on the

Health Interview Survey

Formulas for the bias and variance of the nearly unbiased estimator

and the ratio-adjusted synthetic estimator are developed.

Correlations and average percentage errors between state synthetic

estimates based on different formulations of demographic groups are

determined for each of a number of NHIS items. For the cases

evaluated here, the coarseness of the demographic groupings had little

effect on the resulting synthetic estimates. (Levy and French 1977)

An Empirical Comparison of the Simple Inflation, Synthetic and

Composite Estimators for Small Area Statistics

Using NHIS data, state unemployment rates and percent completing

college were estimated using a simple direct estimator, a synthetic

estimator and a composite estimator. Each of these estimates was

compared to accurate estimates obtained from the 1970 Census. It was

demonstrated that the composite estimator was much more accurate than

either the synthetic estimator or the inflation estimator for the

items under study. (Schaible, Brock, Casady and Schnack 1977)

8.4.2 Comparison with a known standard

The following is a summary of the types of evaluation methods that

have been used in the aforementioned reports.

Measurements such as work-loss disability, unemployment rates,

percent completing college and marital status are compiled in the

U.S. Population Census and are known exactly or with a high degree of

accuracy for small geographic areas. In addition, vital statistics

such as the mortality rates from motor vehicle accidents,

cardiovascular disease, suicides and tuberculosis are known exactly

for counties. Although these measurements are not the same as the

morbidity rates or health care utilization rates, measured in the

Center's surveys, they are related. Small area estimation procedures

have been tested by making state estimates for these quantities and

comparing them to the known quantity. This technique has shown that

synthetic state estimates will often have less variability than their

corresponding ensemble of true values (Schaible, Brock, Casady,

Schnack 1977,1979). By estimating known quantities, some of the

deficiencies of an estimation method can be ascertained. However, the

fact that a particular method works well in estimating known

quantities does not imply that the method will work well in other

situations.

When estimating a known quantity, the distribution of the errors is

usually presented in summary form. Usually, the average absolute

error or the average squared error is calculated. To assess the

similarity of two estimation methods, the simple correlation between

errors is calculated.
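
These summary measures can be computed as in the sketch below. The true values and the two sets of State estimates are hypothetical; the correlation is the ordinary Pearson correlation between the two error series.

```python
import math


def errors(estimates, truth):
    """Estimate minus known value, area by area."""
    return [e - t for e, t in zip(estimates, truth)]


def mean_abs_error(estimates, truth):
    errs = errors(estimates, truth)
    return sum(abs(e) for e in errs) / len(errs)


def mean_sq_error(estimates, truth):
    errs = errors(estimates, truth)
    return sum(e * e for e in errs) / len(errs)


def error_correlation(estimates_a, estimates_b, truth):
    """Pearson correlation between the errors of two estimation methods."""
    ea, eb = errors(estimates_a, truth), errors(estimates_b, truth)
    mean_a, mean_b = sum(ea) / len(ea), sum(eb) / len(eb)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ea, eb))
    var_a = sum((a - mean_a) ** 2 for a in ea)
    var_b = sum((b - mean_b) ** 2 for b in eb)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical known values and two competing sets of State estimates.
truth = [10.0, 20.0, 30.0, 40.0]
method_a = [11.0, 19.0, 33.0, 38.0]
method_b = [12.0, 18.0, 36.0, 36.0]

mae_a = mean_abs_error(method_a, truth)
mse_a = mean_sq_error(method_a, truth)
corr_ab = error_correlation(method_a, method_b, truth)
```

In this example the errors of the second method are exactly twice those of the first, so the correlation is one: the two methods err in the same States and in the same direction even though their error sizes differ.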

There are other ways to obtain accurate standards for comparison. For

example, accurate, direct estimates are often available for the very

large states and for groups of small states (Levy 1971). In addition,

years of data can be pooled together to create a larger population in

which direct estimates can be made for a number of subnational areas

and then compared to an indirect estimate (Malec, Sedransk and Tompkins

1993).

NCHS publications utilizing this method include (NCHS, 1968; Levy,

1971; Malec, Sedransk and Tompkins, 1993; Namekata, Levy, O'Rourke,

1975; Schaible, Brock, Casady, Schnack, 1977,1979).

8.4.3 Comparison of alternative estimators

An alternative estimator can be constructed to specifically deal with

a deficiency in another estimator. For example, a composite estimator

of a state will preferentially use its state data, whereas a synthetic

estimator will not. The relevance of this problem can be evaluated by

comparing the two estimates. This method can be used to improve

estimates but, since the true value is not known, the quality of the

improved estimate is not known. For example, two types of estimators

may yield similar state estimates but both be in error. In contrast,

a number of different types of estimators may each yield vastly

different estimates but one of them may be accurate.

8.4.4 Model Evaluation with Data Analysis

When a statistical model is used to produce indirect estimates,

features of the model can be checked using the data that has been

sampled. In fact, data from a national survey can be used to develop

a model that fits the data well (Malec, Sedransk and Tompkins, 1993),

although there may be features of a model that are difficult to

evaluate, if the model is complex.

8.5 Current Problems and Activities

The Bayesian method, utilizing a hierarchical model (in section 8.3.8)

is currently being refined and evaluated with the aim of producing

state estimates based on a single year of data. A relatively

efficient method of Gibbs sampling due to Gilks and Wild (1992) has

been used to produce state estimates. This procedure uses the exact

specifications from (1) and (2) of section 8.3.7 and does not require

the pooling of four years of data or a normal approximation to the

likelihood. Gilks and Wild's method is still very computationally

time consuming and it is being compared to both an alternative normal

approximation to the conditional posterior of A, and to the normal

approximation of the likelihood. The comparison is in terms of both

accuracy and computational effort.

The variable selection procedure for the hierarchical model has also

been improved. Stepwise spline selection is now being used for

individual level variables, county level variables and interactions

between individual-level and county-level variables. In addition,

more specific methods to evaluate the model fit are being considered.

The impact of the design on inference is also being further evaluated.

In addition to the Bayesian hierarchical method presented above, three

other research projects have recently been undertaken at NCHS.

o As part of a research contract to evaluate and redesign the NHIS, a

generalization of the synthetic estimator is being developed. First,

the population is partitioned into mutually exclusive demographic

cells, as in synthetic estimation. Then, within each demographic

cell, a hierarchical model is specified for the responses among the

small areas. A distribution is not specified, only the first two

moments are used. Estimators are derived in a Bayesian framework

where data-based prior information, also specified by only the first

two moments, may be incorporated. Alternative estimators, derived

within this framework, are compared. Estimates of mean square error,

that are more state specific, are also examined. For more

information, see Marker and Waksberg (1993).

o The regression adjusted synthetic estimation method of Elston, Koch

and Weissert (1991), (see section 8.3.4 above) has been extended by

them, under an NCHS contract, to produce estimates of disability

covering individuals of all ages.

o An empirical study utilizing 1990 Census data on disability is also

being planned. A subsample of the census data, similar to the NHIS

sample, will be used to model and predict states. These estimates

will then be compared to values based on the entire census. Both the

hierarchical model, in section 8.3.7, and the aforementioned synthetic

regression method, of Elston, Koch and Weissert, will be used to

produce and compare estimates.

8.6 Summary and Some Conclusions

The National Center for Health Statistics has been developing,

producing and evaluating indirect state estimates for over two

decades. The newer methods proposed generally incorporate local data

with the aim of correcting known deficiencies of established estimates

and providing more accurate estimates. Recently, the goal to provide

both a good, indirect estimate as well as an estimate of its accuracy

has received increased attention. Improvements or deficiencies of

indirect estimators have been evaluated using related data for which

the actual population values are known. Improvements have also been

made by developing classes of estimators which include competing

estimators as a special case. Small area estimation research has also

followed new developments in computing algorithms and the availability

of cheaper computing. In particular, the Gibbs Sampler and related

methods have opened up the possibility of utilizing more realistic and

complex models for small area estimation. It is safe to say that much

more can be accomplished in this domain.

The continued research and evaluation of indirect estimators of small

areas has helped educate the consumer of these statistics to be more

critical of them and to be aware of the underlying assumptions. The

fact that small area estimation research continues to be an active and

supported area of research is a testament to the continued demand for

detail.

REFERENCES

Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis,

second edition. Springer-Verlag.

Gilks, W.R., and Wild, P. (1992) Adaptive Rejection Sampling for Gibbs

Sampling. Applied Statistics, 41: 337-348.

Elston, J.M., Koch, G.G., and Weissert, W.G. (1991)

Regression-Adjusted Small Area Estimates of Functional Dependency in

the Noninstitutionalized American Population Age 65 and Over.

American Journal of Public Health, 81: 335-343.

Kass, R.E., and Steffey, D. (1989) Approximate Bayesian Inference in

Conditionally Independent Hierarchical Models (Parametric Empirical

Bayes Models). Journal of the American Statistical Association, 84:

717-726.

Landwehr, J.M., Pregibon, D., and Shoemaker, A.C. (1984) Graphical

Methods for Assessing Logistic Regression Models. Journal of the

American Statistical Association, 79: 61-83.

Levy, P.S. (1971) The use of Mortality data in evaluating synthetic

estimates. Proceedings of the American Statistical Association,

Social Statistics Section: 328-331.

Levy, P.S., and French, D.K. (1977) Synthetic Estimation of State

Health Characteristics Based on the Health Interview Survey. Vital

and Health Statistics: Series 2, No. 75, DHEW Publication (PHS)

78-1349. Washington: U.S. Government Printing Office.

MacGibbon, B. and Tomberlin, T.J. (1989) Small Area Estimates of

Proportions Via Empirical Bayes Techniques. Survey Methodology, 15:

237-252.

Malec D. and Sedransk J. (1993) Bayesian Predictive Inference for

Units with Small Sample Sizes: The Case of Binary Random Variables.

Medical Care, 5: YS66-YS70.

Malec D., Sedransk J. and Tompkins, L. (1993) Bayesian Predictive

Inference for Small Areas for Binary Variables in the National Health

Interview Survey. In Case Studies in Bayesian Statistics, eds.,

C. Gatsonis, J.S. Hodges, R.E. Kass and N.D. Singpurwalla.

Springer-Verlag.

Marker, D.A. and Waksberg, J. (1993) Small Area Estimation for the

U.S. National Health Interview Survey. In Small Area Statistics and

Survey Designs, Vol. 1, Central Statistical Office, Warsaw, Poland.

Massey, J.T., Moore, T.F., Parsons, V.L. and Tadros, W. (1989) Design

and Estimation for the National Health Interview Survey, 1985-94.

National Center for Health Statistics. Vital and Health Statistics,

2(110).

Namekata, T., Levy, P.S., and O'Rourke, T.W. (1975) Synthetic

Estimates of Work Loss Disability for Each State and the District of

Columbia. Public Health Reports, 90: 532-538.

National Center for Health Statistics. (1968) Synthetic State

Estimates of Disability. PHS Publication No. 1759. U.S. Government

Printing Office.

National Center for Health Statistics. (1977) State Estimates of

Disability and Utilization of Medical Services, United States,

1969-1971. DHEW Publication No. (HRA) 77-1241. Health Resources

Administration. Washington: U.S. Government Printing Office.

National Center for Health Statistics. (1978) State Estimates of

Disability and Utilization of Medical Services, United

States. 1974-1976. DHEW Publication No. (PHS) 78-1241. Public Health

Service. Washington: U.S. Government Printing Office.

National Center for Health Statistics (1984) Invited Papers to the

Data Use Conference on Small Area Statistics. Proceedings of the 1984

NCHS Data Use Conference on Small Area Statistics, Snowbird, Utah.

NIDA Research Monograph 24 (1979) Synthetic Estimates for Small Areas.

DHEW Publication No. (ADM) 79-801. Health Resources Administration.

Washington: U.S. Government Printing Office.

Parsons, V.L., Botman, S.L. and Malec, D. (1990) State Estimates for

the NHIS. 1989 Proceedings of the Section on Survey Research Methods,

American Statistical Association. pp. 854-859.

Prasad, N.G.N. and Rao, J.N.K. (1990) The Estimation of the Mean

Squared Error of Small-Area Estimators. Journal of the American

Statistical Association, 85: 163-171.

Sarndal, C.E. (1984) Design-Consistent Versus Model-Dependent

Estimation for Small Domains. Journal of the American Statistical

Association, 79: 624-631.

Schaible, W.L., Brock, D.B., Casady, R.J. and Schnack, G.A. (1977) An

Empirical Comparison of the Simple Inflation, Synthetic and Composite

Estimators for Small Area Statistics. Proceedings of the American

Statistical Association, Social Statistics Section: 1017-1021.

Schaible, W.L., Brock, D.B., Casady, R.J. and Schnack, G.A. (1979)

Small Area Estimation: An Empirical Comparison of Conventional and

Synthetic Estimators for States. Vital and Health Statistics: Series

2, No. 82, DHEW Publication (PHS) 80-1356. Washington:

U.S. Government Printing Office.

Schaible, W.L. (1979) A Composite Estimator for Small Area Statistics.

In Synthetic Estimates for Small Areas. DHEW Publication No. (ADM) 79-801.

Health Resources Administration. Washington: U.S. Government Printing

Office.

Smith, A.F.M. and Roberts, G.O. (1993) Bayesian Computation Via the

Gibbs Sampler and Related Markov Chain Monte Carlo Methods. Journal

of the Royal Statistical Society, Series B, 55, 3-23.

U.S. Department of Health and Human Services (1989), The Area Resource

File (ARF) System. ODAM Report No. 7-89.

U.S. Department of Health and Human Services, Public Health

Service. (1990) Healthy People 2000: National Health Promotion and

Disease Prevention Objectives. DHHS Publication No. (PHS) 91-50213.

Washington: U.S. Government Printing Office.

Wong, G.Y. and Mason, W.W. (1985) The Hierarchical Logistic Regression

Model for Multilevel Analysis. Journal of the American Statistical

Association, 80: 513-524.

Woodruff, R.A. (1966) Use of a Regression Technique to Produce Area

Breakdowns of the Monthly National Estimates of Retail Trade. Journal

of the American Statistical Association, 61: 497-504.

CHAPTER 9

Estimation of Median Income for

4-Person Families by State

Robert E. Fay and Charles T. Nelson, U.S. Bureau of the Census

Leon Litow, Department of Health and Human Services

9.1 Introduction and Program History

Starting with income year 1974, the U.S. Census Bureau has computed

model-based estimates of median annual income for 4-person families by

state using data from the decennial censuses, the Current Population

Survey (CPS), and estimates of per capita income (PCI) from the Bureau

of Economic Analysis (BEA). Originally, these estimates were used in

determining eligibility for the former Title XX Program of the Social

Security Act, which provided social services for individuals and

families.

Beginning in fiscal year (FY) 1982, the Department of Health and Human

Services (HHS) has employed the estimated 4-person family medians to

administer the Low Income Home Energy Assistance Program (LIHEAP).

This program is one of six block grant programs authorized by the

Omnibus Budget Reconciliation Act of 1981 (PL 97-35) and administered

by HHS. The Augustus F. Hawkins Human Services Reauthorization Act of

1990 (PL 101-501) reauthorized the LIHEAP through FY 1994.

States, the District of Columbia, Indian tribes and tribal

organizations, and territories that wish to assist low income

households in meeting the costs of home energy may apply for a LIHEAP

block grant. "Home energy" is defined by the LIHEAP statute as "a

source of heating or cooling in residential dwellings."

Section 2603(7) of Title XXVI of PL 97-35 requires the Secretary of

HHS to establish the state median incomes for purposes of the program.

Section 2605(b)(2)(B)(ii) of PL 97-35 provides that 60% of the state

median income is one of the income criteria that states can use in

determining a household's eligibility for the LIHEAP.

HHS publishes the estimated 4-person family medians by state annually

in the Federal Register. For purposes of administration, state median

incomes are established for families of other sizes as a fixed

proportion, depending on the size of family, of the estimated median

for 4-person families. The following percentages of the 4-person

family medians are used: 52% for 1-person households, 68% for 2

persons, 84% for 3 persons, 100% for 4 persons, 116% for 5 persons,

and 132% for 6 persons. For families with more than 6 persons, each

person beyond 6 adds an additional 3%. U.S. Bureau of the Census

(1991) provides further details, as does the Federal Register on March

3, 1988 at 53 FR 6824.
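
As an illustration only, the household-size adjustment and the 60%

criterion described above can be sketched in Python. The function names

and the example median of $40,000 are hypothetical. The percentages are

those listed in the preceding paragraph, and the rule for households of

more than 6 persons is read here as 3 percentage points of the 4-person

median per additional person.

```python
# Percent of the 4-person median, by household size (values from the text).
PERCENT_BY_SIZE = {1: 52, 2: 68, 3: 84, 4: 100, 5: 116, 6: 132}


def size_percent(persons: int) -> int:
    """Percent of the 4-person state median applied to a household of this size."""
    if persons < 1:
        raise ValueError("household size must be at least 1")
    if persons <= 6:
        return PERCENT_BY_SIZE[persons]
    # Assumed reading: each person beyond 6 adds 3 percentage points.
    return PERCENT_BY_SIZE[6] + 3 * (persons - 6)


def income_criterion(median_4person: float, persons: int) -> float:
    """60% of the size-adjusted state median (one LIHEAP income criterion)."""
    return median_4person * size_percent(persons) * 60 / 10_000


# Hypothetical 4-person state median of $40,000.
for n in (1, 4, 7):
    print(n, size_percent(n), income_criterion(40_000, n))
```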

In addition to their programmatic use in the administration of the

LIHEAP and the earlier Title XX Program of the Social Security Act,

the estimates represent the only intercensal state-specific family

income estimates produced by the Census Bureau. Consequently, these

estimates have been of interest to a number of general data users.

Until the recent publication of the historical series in U.S. Bureau

of the Census (1991), however, the estimates did not appear in a

regular publication series of the Census Bureau.

9.2 Program Description, Policies, and Practices

Throughout this period the methodology has relied on three sources:

1. Estimates of median family income by state from the decennial

censuses. Since the census asks income during the previous year, the

census medians pertain to income years 1969, 1979, etc. Although the

estimates are based on the long-form sample, the size of this sample

provides estimates with virtually negligible sampling errors at the

state level every 10 years.

2. Sample estimates of median income by family size by state from the

March CPS. Although the CPS estimates are available annually, their

direct use is limited by substantial sampling variability due to the

size of the CPS sample.

3. Annual estimates of PCI from BEA. These estimates, based on

aggregate statistics on components of income from administrative

series, have negligible sampling error. The PCI estimates are

measures of average income per person, however, and so are only

indirectly linked to median income for families.

U.S. Bureau of the Census (1991) describes each of these series in

detail. In brief, however:

1. The decennial census provides geographically detailed estimates by

sampling roughly 1/6 of the households in the U.S. to receive the

long-form census questionnaire. The census income concept includes as

sources: wages and salaries, self-employment income (including

losses), Social Security, Supplemental Security Income (SSI), cash

public assistance, interest, dividends, rents, royalties, estates and

trusts, veterans' payments, unemployment and workers' compensations,

private and government survivor, disability, and retirement pensions,

alimony, child support, and any other source of money income that is

regularly received. Capital gains (or losses) and lump-sums or

one-time payments such as life insurance settlements are excluded.

Noncash benefits, such as government noncash transfers (food stamps,

Medicaid, etc.) and private sector in-kind benefits are also excluded

from the money income definition.

2. The CPS is a monthly labor force survey of about 60,000 households

across the country. Each March, the CPS asks additional questions

about money income during the previous year, using the same concepts

as the decennial census. The primary purpose of the CPS sample is for

national estimates. For example, the CPS provides a national estimate

of median income for 4-person families, which is published annually

along with many other estimates from the survey.

3. The BEA income series is based on a different concept of income

than the one used by the Census Bureau in the decennial censuses and

the CPS. The major difference is that BEA personal income attempts to

represent income from all sources, noncash as well as cash. (Appendix

A of U.S. Bureau of the Census, 1991, compares the BEA personal income

concept to the census money income concept underlying both the CPS and

the decennial censuses. Budd, Radner, and Hinrichs, 1973, provide

additional detail on this point.)

Another conceptual difference is that the BEA estimates of

PCI represent the ratio of estimates of aggregate income to the number

of persons in each state. They are not disaggregated by size of

family and do not distinguish the income of family members from

unrelated individuals and persons in 1-person households.

The PCI series is developed from a variety of government

statistics, including Federal tax records from the Department of the

Treasury, the insurance files of the Social Security Administration,

and state unemployment records collected by the U.S. Department of

Labor. The BEA produces annual estimates of personal per capita

income for states and other geographic areas. Thus, the BEA estimates

generally do not have associated sampling errors, unlike the CPS

estimates, since they essentially do not employ sampling techniques.

These estimates are described by Bailey, Hazen, and Zabronsky in

Chapter 3 of this report.

Before outlining the elements of the methodology, we first compare the

estimates for income year 1989, based on the March 1990 CPS and

published in March, 1991, with medians for 4-person families from the

1990 census. Figure 9.1 shows the geographic distribution of the true

increase in median income for 4-person families during the decade,

since the 1980 census.

Figure 9.1 indicates that the greatest relative increase during the

decade in the median income of 4-person families occurred in the

Northeast region, where most states more than doubled their medians,

according to the census. Other areas of active increase include

additional states in the East and South Atlantic, and Tennessee,

Minnesota, California, and Hawaii. Figure 9.1 also shows that median

income in some areas of the country has grown considerably more

slowly.

Figure 9.2 presents the estimated increase since the 1980 census

according to the model. Although there are some differences between

the census and the model predictions, the comparison of the two maps

shows that the model is successful in capturing most real sources of

change in median family income. Some of the states are not classified

into the same grouping in Figures 9.1 and 9.2, but, in each case, the

difference is by at most one category. For example, states estimated

to be among the fastest growing group were either in that category or

the next one down, and so forth.

Figure 9.3 shows the geographic distribution of the key predictor

variable, the increase in estimated BEA per capita income. Note that

the scale of percent income increase is shifted on this third map

compared to the other two; in general, the proportional increase in

per capita personal income outstripped the increase in the median

income of 4-person families during the decade. With the rescaling,

however, the BEA income figures are quite successful predictors of the

corresponding increase in the median income of 4-person families at

the state level.

Figures 9.4 and 9.5 illustrate additional features of the performance

of the model. Both figures include a regression line from the simple

linear regression as an aid in assessing fit, although each line is

not formally part of the model.

Figure 9.4 compares the estimated increase with the actual increase in

median incomes, according to the census. Again, the predictions are

not perfect but, nonetheless, appear to capture most of the variation

among states in the increase in median income. Figure 9.5 shows that

the relationship between increase in BEA PCI and increase in the

census median income is essentially linear over the entire spectrum.

As previously indicated by the scaling of Figure 9.3, Figure 9.5 provides

further evidence of somewhat greater dispersion in the increase in the

BEA estimates than in the census medians.
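
The reference lines in Figures 9.4 and 9.5 come from a simple linear

regression. A minimal Python sketch of such a fit follows. The state

values shown are invented for illustration and are not taken from the

census, the CPS, or BEA, and the fit is not the model used to produce

the published estimates.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical percent increases over the decade for five states.
pci_increase = [80.0, 95.0, 110.0, 125.0, 140.0]     # BEA per capita income
median_increase = [70.0, 82.0, 91.0, 104.0, 113.0]   # census 4-person median

intercept, slope = fit_line(pci_increase, median_increase)
print(f"intercept={intercept:.2f}, slope={slope:.3f}")
```

A slope below 1 in a fit of this kind would correspond to the greater

dispersion of the BEA increases relative to the census medians.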

The figures provide a summary of the basic features and performance of

the model, and they may be of help to many readers. The remainder of

this chapter aims specifically toward a technical audience interested

in the exact form of the model, plans for further assessment, and a

brief description of potential enhancements. U.S. Bureau of the

Census (1991) furnishes a more detailed history of the program, the

estimates for calendar years 1974-1989, and citations for the

publication of the estimates in the Federal Register annually since

1983.

The current methodology has been in place since income year 1984,

although with minor refinements over this period of time. The

methodology is applied separately for each year, t, in the series.

(For simplicity, the implicit subscript, t, is not shown in the

following, except where necessary to avoid confusion. Section 9.5

will discuss the possibility of alternative models more attuned to the

longitudinal nature of the problem.) The primary elements of the

current methodology are:

Click HERE for graphic.

A key feature of the model is the multivariate combination of

estimation of the target variable of interest, median income for

4-person families, along with an auxiliary variable, the combined 3-

and 5-person family medians, even though the auxiliary variable is not

itself a subject of interest. In fact, the purpose of the

multivariate approach is to realize additional gains in the estimation

of 4-person family medians. Fuller and Harter (1987) and Fay (1987)

motivate the possible advantages of the multivariate approach for

problems of this sort.

The current model replaced an earlier version, whose major features were:

Click HERE for graphic.

Although Woodruff's original method was based on computing the density

of the distribution in the interval containing the estimated median,

the small expected number of sample cases falling into each of the

$2,500 intervals in the estimated CPS income distribution by family

size within states leads this approach to be unstable. Empirically,

experimentation with samples of 100 and 400 cases drawn from the

national income distribution showed that it was preferable to use a

broader interval to estimate the density for samples of these sizes.

Specifically, in addition to the interval containing the median, the 4

$2,500 intervals immediately below and 2 intervals immediately above

are used to form a combined interval of width $17,500 for purposes of

estimating the density, d. The variance estimator is:

Click HERE for graphic.
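
The variance estimator itself appears only in the graphic and is not

reproduced here. As a rough sketch of the density step described above,

the following Python code pools the interval containing the sample

median with the 4 intervals below and the 2 above, and divides the share

of cases in that $17,500 window by its width. The function names are

hypothetical and sampling weights are ignored for brevity. The final

function is the textbook large-sample variance of a median under simple

random sampling with an optional design effect; it is an assumption and

may differ from the estimator used in the program.

```python
from math import floor

BIN_WIDTH = 2_500   # income interval width given in the text
BINS_BELOW = 4      # intervals pooled below the median's interval
BINS_ABOVE = 2      # intervals pooled above the median's interval


def sample_median(incomes):
    """Lower sample median of unweighted incomes."""
    ordered = sorted(incomes)
    return ordered[(len(ordered) - 1) // 2]


def pooled_density(incomes):
    """Estimate the density at the median from the pooled 7-interval window."""
    med = sample_median(incomes)
    k = floor(med / BIN_WIDTH)               # index of the interval holding the median
    lo = (k - BINS_BELOW) * BIN_WIDTH
    hi = (k + 1 + BINS_ABOVE) * BIN_WIDTH    # hi - lo == 7 * 2,500 == 17,500
    share = sum(lo <= x < hi for x in incomes) / len(incomes)
    return share / (hi - lo)


def srs_median_variance(n, density, deff=1.0):
    """Assumed textbook form: deff * 0.25 / (n * density**2)."""
    return deff * 0.25 / (n * density ** 2)


# Hypothetical incomes spread evenly between $0 and $50,000.
incomes = [25 + 50 * j for j in range(1000)]
d = pooled_density(incomes)
print(d, srs_median_variance(len(incomes), d) ** 0.5)
```

Pooling seven intervals trades some bias in the density estimate for

stability when few sample cases fall in any single $2,500 interval.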

9.4 Evaluation Practices

9.4.1 Comparisons to the 1980 Census

The availability of direct census estimates every 10 years affords a

significant opportunity to evaluate and recalibrate the estimation

technique. The current methodology grew from its predecessor,

described at the end of section 9.2, primarily as a consequence of

comparisons to 1980 census results. The conclusions of that

comparison, discussed in Fay (1986), were, in brief:

1) The earlier method yielded generally useful state estimates, but 1

estimate was in error by more than 10% and 11 additional estimates

were in error by 5% or more.

Click HERE for graphic.

9.4.2 Comparisons to the 1990 Census

Figures 9.1 - 9.5 compare the model to recently available estimates

from the 1990 census. Overall, the results of the comparison are

quite encouraging. For example, no estimate was in error by 10% or

more, and only 7 were in error by 5% or more. These findings reflect

only the first steps in a more complete analysis.
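
Counts such as those quoted above can be obtained by tabulating the

relative error of each state estimate against its census value. The

sketch below shows one way to do this in Python. The three states and

their dollar values are made up for illustration, and the helper names

are hypothetical.

```python
def relative_errors(model, census):
    """Signed relative error of each model estimate against its census value."""
    return {s: (model[s] - census[s]) / census[s] for s in census}


def count_at_least(errors, threshold):
    """Number of states whose absolute relative error is at least `threshold`."""
    return sum(abs(e) >= threshold for e in errors.values())


def mean_absolute_error(errors):
    """Average absolute relative error over all states."""
    return sum(abs(e) for e in errors.values()) / len(errors)


# Hypothetical medians in dollars for three unnamed states.
census = {"State A": 40_000, "State B": 40_000, "State C": 40_000}
model = {"State A": 41_000, "State B": 37_000, "State C": 45_000}

errors = relative_errors(model, census)
print(count_at_least(errors, 0.05), count_at_least(errors, 0.10))
print(mean_absolute_error(errors))
```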

The next critical step, however, will be to react to a surprising

finding reported in Section 9.3.3, namely, that the CPS sample

estimates of the medians by state appear to differ from the census values

by more than sampling error alone would suggest. This is in contrast

to the comparison of the 1980 CPS and census. Consequently, some form

of nonsampling error is possible, but a more systematic study of

components of differences between the CPS and the census will be

required to isolate the significant source or sources of these

differences.

Figures 9.1 and 9.2 provide one suggestion of a possible source of

difference: relative to the census, the model underpredicted increase

in four large states: California, Florida, New York, and Texas. In

each of these states, the CPS sample estimates themselves fall below

the census values. All four states also have appreciable Hispanic

populations. Furthermore, preliminary comparisons suggest higher

estimated medians for Hispanics in the census compared to the CPS.

Hispanic income may be only one of several factors underlying

differences between census and CPS state medians. The outcome of a

more complete investigation should provide a firmer basis to separate

the issues of limitations of the model from possible nonsampling error

in either the CPS or the census.

Once issues of nonsampling error are more firmly understood, the

census results should permit assessment of a number of features of the

current model:

1) The average error of the model predictions.

2) Whether errors are differential for certain classes of states,

e.g., small vs. large, rapidly changing vs. static, lower income

vs. higher, etc.

3) Whether errors cluster geographically.

4) Whether modification of the current predictors would yield

significant improvement in prediction.

The census data permit assessment of the current model but also offer

the occasion for consideration of more significant changes for

subsequent years. A number of these are described in the next

section.

In addition to relying on the census for evaluation, work on

alternative models, such as the hierarchical Bayes model described by

Datta, Fay, and Ghosh (1991), has addressed methods to obtain

estimates of individual and aggregate measures of performance from the

sample estimates when census data are not available. The 1990 census

data should help to calibrate these procedures for future use.

(Unfortunately, these procedures may be adversely affected by

nonsampling error producing differences between the expected values of

the CPS and the census medians at the state level, so that

understanding sources of nonsampling error is a critical step here as

well.)

9.5 Current Problems and Activities

Implemented a year at a time, the current model and its predecessor

have produced a series spanning income years 1974 to 1991 without

taking any advantage of the longitudinal or time series nature of this

problem. Several researchers have investigated longitudinal

extensions that attempt to address this aspect.

The current model relies simply on observed relationships that appear

to be quite linear, without taking advantage of any specific knowledge

about income distributions. Possibly, a more explicit parametric

model for the income distribution may represent a fruitful

alternative. On the other hand, the utility of such efforts would

have to be balanced against requirements of parsimony imposed by the

relatively small sizes of the CPS state samples.

As noted at the end of Section 9.4, another area of potential research

is to attempt to improve measures of error from the model for the

intercensal period. Recent research in fully Bayes procedures may

prove promising for estimation of error.

References

Budd, E. C., Radner, D. B., and Hinrichs, J. C. (1973), "Size

Distribution of Family Personal Income: Methodology and Estimates for

1964," BEA-SP 73-21, Washington, DC: Bureau of Economic Analysis.

Datta, G., Fay, R. E., and Ghosh, M. (1991), "Hierarchical and

Empirical Multivariate Bayes Analysis in Small Area Estimation," in

Proceedings of the Annual Research Conference, Washington, DC:

U.S. Bureau of the Census, pp. 63-79.

Efron, B., and Morris, C. (1971), "Limiting the Risk of Bayes and

Empirical Bayes Estimators - Part I: The Bayes Case," Journal of the

American Statistical Association, 66, 807-815.

Fay, R. E. (1986), "Multivariate Components of Variance Models as

Empirical Bayes Procedures for Small Domain Estimation," in

Proceedings of the Survey Research Methods Section, Washington, DC,

American Statistical Association, pp. 99-107.

__________(1987), "Application of Multivariate Regression to Small

Domain Estimation," in Small Area Statistics, An International

Symposium, R. Platek, J. N. K. Rao, C. E. Sarndal, and M. P. Singh,

eds., New York: John Wiley & Sons, pp. 91-102.

Fay, R. E. and Herriot, R. A. (1979), "Estimates of Income for Small

Places: An Application of James-Stein Procedures to Census Data,"

Journal of the American Statistical Association, 74, 269-277.

Fuller, W. A. and Harter, R.M. (1987), "The Multivariate Components of

Variance Model for Small Domain Estimation," in Small Area Statistics,

An International Symposium, R. Platek, J. N. K. Rao, C. E. Sarndal,

and M. P. Singh, eds., New York: John Wiley & Sons, pp. 103-123.

U.S. Bureau of the Census (1991), "Estimates of Median 4-Person Family

Income by State: 1974-1989," Current Population Reports, Technical

Paper No. 61, U.S. Government Printing Office, Washington, DC.

Woodruff, R. (1952), "Confidence Intervals for Medians and Other

Position Measures," Journal of the American Statistical Association,

47, 635-646.

CHAPTER 10

Recommendations and Cautions

During the design of a data system, indirect estimators rarely,

if ever, are considered for Federal statistical programs when

resources to produce direct estimates of adequate precision are

available. However, given an existing system, if direct estimation is

judged to be inadequate for a domain not specified in the design,

indirect estimation may, in some cases, prove to be a valuable

alternative. There are a number of reasons that direct estimators are

preferable to indirect ones and, if federal statistical agencies are

to improve the usefulness of indirect estimates, a number of important

issues, including those that follow, should receive additional

attention.

o Traditionally, statistical programs are designed with only direct

estimators for large domains in mind; indirect estimators for small

domains are considered only after the design has been determined.

Planning for both direct estimators and indirect estimators at the

sample design stage should lead to improved indirect estimates.

o The purpose of the analysis should be kept in mind when

selecting an indirect estimator. This is important with direct

estimators, but even more so with indirect estimators where there may

be implications for the choice between a domain indirect and a time

indirect estimator.

o More coordination and cooperation among Federal agencies would

increase accessibility to auxiliary information for use with indirect

estimators.

o Additional empirical evaluations of existing and proposed indirect

estimators are needed. Existing evaluations are generally limited in the

conclusions they are able to draw.

o Additional research on errors associated with indirect estimators is

necessary. Additional attention should be directed to the estimation,

not only of variances, but also of biases, mean square errors, and

confidence intervals.

o When indirect estimates are published, they should be distinguished

from direct estimates and accompanied by a clear explanation of model

assumptions and appropriate cautions.

These points are developed further in the sections that follow.

10.1 Sample Designs for Small Areas

The design of samples for the production of both direct and

indirect estimates should receive more attention.

Since the 1940's, considerable research has been conducted on the

design of samples for use with direct estimators. More recently,

indirect estimators have received the attention of researchers, but

most often in a way that treats the sample design as fixed and beyond

the scope of the research effort. An exception to this statement is

found in the work on rolling samples and censuses (for example, Kish,

1990 with discussion by Scheuren, 1990 and Hansen, 1990) where direct

and/or indirect estimators may be defined depending on the population

quantity being estimated and whether the design requires a subset of

units to be observed in more than one time period of interest.

Present applications of indirect estimators generally evolved from

data systems designed for other purposes. Few, if any, existing data

sources were created with the production of indirect estimates as a

consideration. Sample designs considering the production of both

direct estimates and indirect estimates deserve much more attention

than they have received thus far. This is particularly true for

continuing surveys where information useful for design and estimator

evaluation is obtained periodically.

None of the programs described in this report were designed with

indirect estimation as an explicit consideration. However, indirect

estimators which benefit from observations in the domain and time of

interest have been enhanced by certain design decisions. For example,

the redesign of the National Health Interview Survey, discussed in

Chapter 8, will include stratification by individual states so that

the sample size within each state is controlled.

10.2 Use of Estimates and the Selection of an Estimator

Selection of an appropriate indirect estimation method should take

into account the purpose for which estimates are to be used.

Indirect estimators should be selected with great caution and

perhaps even avoided in some situations. Indirect estimators may

perform poorly when the purpose of the analysis is to identify domains

with extreme population values, to rank domains, or to identify

domains that fall above or below some predetermined level.

A domain indirect estimator borrows strength across domains and

is justified under a model that assumes model parameters are the same

across domains. If the purpose of the analysis is to make comparisons

across domains for a given time period, an inconsistency between

objective and method can be avoided if a time indirect estimator is

chosen. Of course, depending on the application and the available

auxiliary information, this may not always be the appropriate course

of action. Similarly, if the purpose of the analysis is to make

comparisons across time periods within a given domain, it may be more

appropriate to select from among domain indirect estimators.
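
The distinction can be made concrete with a small sketch. It assumes,

for illustration only, a ratio form for both estimators: the domain

indirect version applies a ratio pooled over all domains in the current

period, while the time indirect version applies the domain's own ratio

from an earlier period. The function names and numbers are hypothetical

and are not drawn from any program described in this report.

```python
def domain_indirect(y_now, x_now, target):
    """Borrow across domains: pooled current-period ratio times the target's x."""
    pooled_ratio = sum(y_now.values()) / sum(x_now.values())
    return pooled_ratio * x_now[target]


def time_indirect(y_prev, x_prev, x_now):
    """Borrow across time: the domain's own earlier ratio times its current x."""
    return (y_prev / x_prev) * x_now


# Hypothetical direct estimates (y) and auxiliary totals (x) for two domains.
y_now = {"A": 30.0, "B": 70.0}
x_now = {"A": 100.0, "B": 100.0}

for d in y_now:
    # With equal auxiliary totals, both domains receive the same estimate.
    print(d, domain_indirect(y_now, x_now, d))
print("A", time_indirect(y_prev=25.0, x_prev=100.0, x_now=120.0))
```

Because the pooled ratio is common to every domain, domain indirect

estimates of this form differ only through the auxiliary totals, which

is why they are poorly suited to comparing domains within a period.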

10.3 Auxiliary Information

More coordination and cooperation among Federal agencies would allow

expanded access to the auxiliary information on which indirect

estimators depend.

Regardless of how appropriate the conceptual and theoretical basis of

a particular indirect estimator may be, the estimator cannot be used

in practice if the required auxiliary variables, which usually come

from administrative sources or censuses, are not available. Without

auxiliary information related to the variable of interest and for the

domain and time period of interest, only the most crude indirect

estimators can be implemented.

For programs described in this report, the search for auxiliary

variables generally seems to have been somewhat ad hoc, with little

coordination or cooperation among statistical agencies. An integrated

data system for geographical areas would make auxiliary information

more readily available and would potentially lead to improved indirect

estimators. Such a system might also take advantage of recent

computational technologies. Although not previously discussed in this

report and designed with the objectives of mental health needs

assessment, policy, and research in mind, the National Institute of

Mental Health's Health Demographic Profile System (Goldsmith et al.

1984) provides a variety of social indicators for geographic areas.

In addition, Statistics Canada has addressed this issue in their Small

Area Data Program (Brackstone 1987).

10.4 Empirical Evaluations

Additional empirical evaluations are needed to help determine

whether indirect estimators are adequate for the intended purposes.

The decision whether or not to use an indirect estimator is

rarely an easy one. Empirical evaluations play a critical role in the

decision process. The performance of an indirect estimator in a given

application depends on the variable(s) of interest and their

relationship to the auxiliary variables through the underlying model.

Generalization from one application to another is difficult so that

each application requires a different empirical evaluation.

In practice, indirect estimators are considered for use in

situations where data are not available to support the use of direct

estimators. The data that, if available, would support the use of

direct estimators are the same data that would be most useful in the

evaluation of indirect estimators and models. In other words, the

need for indirect estimators is the greatest in precisely those

situations where data are not available for their adequate empirical

evaluation. For this reason, it is rare that a single empirical

evaluation of an indirect estimator is completely convincing. There

seems to be no satisfying solution to this problem. Resourcefulness

in locating data sources and the use of multiple empirical evaluations

will be required in most, if not all, situations.

Two approaches can be used for empirical evaluations of indirect

estimators. In the first, estimators under consideration are used to

produce estimates; these estimates then are compared to a better

estimate or census value. The estimator that performs best using an

empirical average squared error or similar criterion is judged to be

most appropriate for the given application. The great majority of

evaluation efforts connected with indirect estimates have used this

approach.
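
The first approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the estimator labels, and all numeric values are hypothetical and do not come from any program described in this report. The sketch computes the empirical average squared error of each candidate estimator against a set of benchmark (for example, census) values and ranks the candidates from smallest to largest error.

```python
from statistics import mean


def average_squared_error(estimates, benchmarks):
    """Empirical average squared error of estimates against benchmark values."""
    if len(estimates) != len(benchmarks):
        raise ValueError("estimates and benchmarks must cover the same domains")
    return mean((e - b) ** 2 for e, b in zip(estimates, benchmarks))


def rank_estimators(candidates, benchmarks):
    """Return (name, error) pairs sorted from smallest to largest error."""
    scores = {
        name: average_squared_error(values, benchmarks)
        for name, values in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical benchmark values for four domains (illustration only).
census_values = [10.0, 12.0, 8.0, 15.0]

# Hypothetical estimates from two competing indirect estimators.
candidates = {
    "synthetic": [11.0, 11.0, 9.0, 13.0],
    "composite": [10.5, 12.5, 8.0, 14.0],
}

ranking = rank_estimators(candidates, census_values)
for name, error in ranking:
    print(f"{name}: average squared error = {error}")
```

A ranking of this kind applies only to the variable, domains, and time period for which benchmark values exist; it does not by itself show that the best-ranked estimator is adequate for the intended purpose.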

The second approach is to evaluate how well the models associated

with the competing estimators fit the data. A principled approach is

needed; models should not only fit the data but also have a conceptual

basis. An indiscriminate search through a large number of models does

not often produce appealing results.

Empirical evaluations of indirect estimators are critical, and

careful evaluations should include consideration of underlying models

as well as the corresponding estimators. In addition to initial

evaluations leading to the selection of an estimator, continuing

evaluations of the underlying models should be conducted for those

series that are published periodically.

10.5 Measures of Errors for Indirect Estimators

Care should be taken in the production of measures of errors of

indirect estimators. Estimates of variances alone may be misleading

to data users. Additional research on the estimation of biases, mean

squared errors, and confidence intervals is needed.

At present, none of the programs described in this report provide

measures of error to accompany published indirect estimates. It is

difficult to produce meaningful measures of error for indirect

estimators. Expressions for indirect estimator variances and biases

under the assumed model are usually straightforward to derive, and

estimation of variances is usually possible. If the model leading to

an estimator is a good approximation of reality in a given

application, then the variance of the estimator derived under the

model should serve as an adequate measure of error. However, if the

model associated with the estimator is not a good approximation, the

estimator will have a bias due to model failure. If the bias is large

relative to the variance, the variance, by itself, will not be an

adequate measure of error, and an estimate of the mean squared error will be

required. This is a difficult problem since an estimate of the mean

squared error requires an estimate of the bias. Bias in an indirect

estimator arises from model failure; that is, failure of the model to

adequately represent the variability of the variable of interest over

domains and time. Since the population quantity being estimated is

specific to a given domain and time, it follows that an estimate of

this bias requires data from that domain and time. If the available

data are inadequate to produce reliable direct estimates, it is

unlikely that they would be adequate to support acceptable estimates

of biases. Estimation of confidence intervals for indirect estimators

is also a difficult problem in practice. The existing research in

this area provides valuable results, but additional work is needed.

In the interim, measures of error as indicated by empirical evaluation

studies may be the only source of error information for users.
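
The relationship between variance, bias, and mean squared error discussed above can be written out explicitly. The notation is an assumption introduced here for illustration and is not used elsewhere in this report: \theta denotes the population quantity for a given domain and time, and \hat{\theta} denotes an indirect estimator of it.

```latex
\[
\operatorname{MSE}(\hat{\theta})
  = E\!\left[(\hat{\theta} - \theta)^{2}\right]
  = \operatorname{Var}(\hat{\theta})
  + \left[\operatorname{Bias}(\hat{\theta})\right]^{2},
\qquad
\operatorname{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta .
\]
```

When the model holds and the bias is zero, the mean squared error reduces to the variance. When the model fails, the squared bias term can dominate, and that term cannot be estimated without data from the domain and time in question.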

10.6 Publication of Indirect Estimates

A clear distinction should be made between direct and indirect

estimators. When indirect estimates are published, they should be

accompanied by appropriate cautions and clear explanations of the

model assumptions.

Direct estimates published by Federal statistical agencies usually

meet expected reliability and validity criteria. Even unsophisticated

users of statistics have come to expect estimates from Federal statistical

agencies to be trustworthy in some sense. Rarely is enough known

about the structure of indirect estimators to produce adequate

measures of their quality. For this reason, it is misleading to the

public and potentially damaging to the reputation of Federal

statistical agencies to publish indirect estimates that are not

clearly distinguished from direct estimates and that are not offered

with appropriate cautions. In any case, a clear statement of the

assumptions required for the indirect estimator to be model unbiased

should be included with all published estimates. This issue has been

addressed differently by various programs. For example, the Bureau of

Labor Statistics produces estimates for a limited number of states

using a direct estimator and for the remainder of the states using an

indirect estimator. The two sets of estimates are published in the

same table but separated into the two groups with explanatory notes.

The National Center for Health Statistics publishes indirect estimates

from the National Health Interview Survey in a separate publication

containing explanations and cautions.

10.7 Cautions for Producers and Users of Indirect Estimates

As evidenced by the large and growing literature on indirect

estimation methods, numerous researchers have been working on the

challenging problems facing those who must produce estimates with

inadequate resources. Many authors suggest new approaches or

variations of existing approaches, but few caution about the dangers

associated with the use of indirect estimation methods. The following

exceptions should be noted.

"When first one casts his eye upon the synthetic estimate, he

shrinks away in horror; with a second and then a third look, the

aversion begins to fade, until finally one clasps the estimator to his

bosom, and embraces it with affection. . . . The synthetic estimator

is a dangerous tool, but with careful further development, it has an

attractive potential." (Simmons 1979, paraphrasing Alexander Pope)

"A workshop of this sort, focused on a specific technique, can

spur development, but it can also be dangerous. The danger is that

from hearing many people speak many words about synthetic estimation

we become comfortable with the technique. The idea and the jargon

become familiar, and it is easy to accept that 'Since all these people

are studying synthetic estimation, it must be okay.' We must remain

skeptical and not allow familiarity to dull our healthy skepticism.

There is reason for some optimism, but it must be guarded optimism."

(Royall 1979)

". . .a cautious approach should be adopted to the use of small

area estimates, and especially to their publication by government

statistical agencies. When government statistical agencies do produce

model-dependent small area estimates, they need to distinguish them

clearly from conventional sample-based estimates. Before small area

estimates can be considered fully credible, carefully conducted

evaluation studies are needed to check on the adequacy of the model

being used. Sometimes model dependent small area estimators turn out

to be of superior quality to sample-based estimators, and this may

make them seem attractive. However, the proper criterion for

assessing their quality is whether they are sufficiently accurate for

the purposes for which they are to be used. In many cases, even

though they are better than sample-based estimators, they are subject

to too high a level of error to make them acceptable as the basis for

policy decisions." (Kalton 1987)

Indirect estimation should be considered when other, more robust

alternatives are unavailable, and then only with appropriate caution

and in conjunction with substantial research and evaluation. Even

after such efforts, neither producers nor users should forget that

indirect estimates may not be adequate for the intended purpose.

REFERENCES

Brackstone, G.J. (1987), "Small Area Data: Policy Issues and Technical

Challenges," in Small Area Statistics, New York: John Wiley and Sons.

Goldsmith, H.G., Jackson, D.J., Doenhoefer, S., Johnson, W., Tweed,

D.L., Stiles, D., Barbano, J.P., and Warheit, G. (1984), "The Health

Demographic Profile System's Inventory of Small Area Social

Indicators," National Institute of Mental Health. Series BN No. 4.,

DHHS Pub. No. (ADM) 84-1354. Washington, D.C.: U.S. Government

Printing Office.

Hansen, M.H. (1990), "Discussion of paper by Kish," Survey

Methodology, 16-1, 81-86.

Kalton, G. (1987), "Panel Discussion" in Small Area Statistics, New

York: John Wiley and Sons.

Kish, L. (1990), "Rolling Samples and Censuses," Survey Methodology,

16-1, 63-71.

Royall, R.A. (1979), "Prediction Models in Small Area Estimation," in

Synthetic Estimates for Small Areas (National Institute on Drug Abuse,

Research Monograph 24), Washington, D.C.: U.S. Government Printing

Office.

Scheuren, F. (1990), "Discussion of paper by Kish," Survey

Methodology, 16-1, 72-79.

Simmons, W.R. (1979), "Discussion of a paper by Levy," in Synthetic

Estimates for Small Areas (National Institute on Drug Abuse, Research

Monograph 24), Washington, D.C.: U.S. Government Printing Office.

Reports Available in the

Statistical Policy

Working Paper Series

1. Report on Statistics for Allocation of Funds (Available

through NTIS Document Sales, PB86-211521/AS)

2. Report on Statistical Disclosure and Disclosure-Avoidance

Techniques (NTIS Document Sales, PB86-211539/AS)

3. An Error Profile: Employment as Measured by the Current

Population Survey (NTIS Document Sales, PB86-214269/AS)

4. Glossary of Nonsampling Error Terms: An Illustration of a

Semantic Problem in Statistics (NTIS Document Sales,

PB86-211547/AS)

5. Report on Exact and Statistical Matching Techniques (NTIS

Document Sales, PB86-215829/AS)

6. Report on Statistical Uses of Administrative Records (NTIS

Document Sales, PB86-214265/AS)

7. An Interagency Review of Time-Series Revision Policies (NTIS

Document Sales, PB86-232451/AS)

8. Statistical Interagency Agreements (NTIS Document Sales,

PB86-230570/AS)

9. Contracting for Surveys (NTIS Document Sales, PB83-233148)

10. Approaches to Developing Questionnaires (NTIS Document

Sales, PB84-105055/AS)

11. A Review of Industry Coding Systems (NTIS Document Sales,

PB84-135276)

12. The Role of Telephone Data Collection in Federal Statistics

(NTIS Document Sales, PB85-105971)

13. Federal Longitudinal Surveys (NTIS Document Sales,

PB86-139730)

14. Workshop on Statistical Uses of Microcomputers in Federal

Agencies (NTIS Document Sales, PB87-166393)

15. Quality in Establishment Surveys (NTIS Document Sales,

PB88-232921)

16. A Comparative Study of Reporting Units in Selected

Employer Data Systems (NTIS Document Sales, PB90-205238)

17. Survey Coverage (NTIS Document Sales, PB90-205246)

18. Data Editing in Federal Statistical Agencies (NTIS

Document Sales, PB90-205253)

19. Computer Assisted Survey Information Collection (NTIS

Document Sales, PB90-205261)

20. Seminar on the Quality of Federal Data (NTIS Document Sales,

PB91-142414)

21. Indirect Estimators in Federal Programs (NTIS Document

Sales, PB93-209294)

Copies of these working papers may be ordered from NTIS Document

Sales, 5285 Port Royal Road, Springfield, VA 22161, (703) 487-4650.

1 A more appropriate method would be to use the ratio of the net

migration rate for the total population to the net migration rate of

the school-aged population but component method II has traditionally

used the difference.


Page Last Modified: April 20, 2007
