[This Transcript is Unedited]

DEPARTMENT OF HEALTH AND HUMAN SERVICES

NATIONAL COMMITTEE ON VITAL AND HEALTH STATISTICS

SUBCOMMITTEE ON POPULATIONS

WORKSHOP ON DATA LINKAGES TO IMPROVE HEALTH OUTCOMES

September 19, 2006

Renaissance Hotel
999 9th Street, NW
Washington, D.C.

Proceedings by:
CASET Associates, Ltd.
10201 Lee Highway
Fairfax, Virginia 22030
(703)352-0091


P R O C E E D I N G S (9:02 a.m.)

Agenda Item: Call to Order

DR. STEINWACHS: I'd like to welcome everyone to the second day of our Workshop on Data Linkages to Improve Health Outcomes. This workshop has been put together by the Subcommittee on Population Health of the National Committee on Vital and Health Statistics, and this workshop is being broadcast live on the Internet. So I would like to welcome those people who are listening on the Internet.

And I thought before we got started officially, we might just go around the room and introduce ourselves, for those who are on the Internet.

And I am Don Steinwachs. I chair the Subcommittee on Population Health, and I am from Johns Hopkins University.

MR. IAMS: I am Howard Iams from the Social Security Administration Research Office.

MS. MADANS: Jennifer Madans from the National Center for Health Statistics.

MR. HARRIS-KOJETIN: Brian Harris-Kojetin from the Office of Management and Budget.

MR. BJORKLUND: Rick Bjorklund, Office of the Assistant Deputy Under Secretary for Health, the Veterans Health Administration.

MR. PETSKA: I am Tom Petska. I am Director of the Statistics of Income Division of the Internal Revenue Service.

MR. CHAPMAN: Chris Chapman, U.S. Department of Education, National Center for Education Statistics.

MS. OBENSKI: Sally Obenski, Data Integration Division, U.S. Census Bureau.

MR. PREVOST: Ron Prevost, Data Integration Division, U.S. Census Bureau.

DR. DAVERN: Michael Davern, University of Minnesota.

DR. STEUERLE: Gene Steuerle from the Urban Institute and a member of the committee.

MR. LOCALIO: Russell Localio, University of Pennsylvania, School of Medicine, a member of the committee.

DR. SCANLON: Bill Scanlon, Health Policy R&D and a member of the committee.

(Additional intros around the room.)

DR. STEINWACHS: It is a pleasure to welcome everyone today.

Our first session has speakers from Internal Revenue Service, Department of Education, Department of Veterans Affairs, and Russell Localio, who is a member of the committee, is going to serve as facilitator.

Russ.

Agenda Item: IRS, Education and Veterans Administration

MR. LOCALIO: Good morning, everyone.

I just want to introduce our first speaker, Tom Petska, our friend from the Internal Revenue Service.

PARTICIPANT: Is that true? Oh, of course that is true. Of course.

Agenda Item: Tom Petska, Director, Statistics of Income Division, IRS

MR. PETSKA: I have no comment on that.

Actually, I am very pleased to be here to speak on this topic, Workshop on Data Linkages to Improve Health Outcomes. A lot of people might think what is the IRS role in that? I hope I can say a few things in the next 10 or 15 minutes about that.

In a way I feel a little bit inadequate about speaking to this group for a few reasons.

One is that my division is centrally located between IRS and Treasury, and that gives me two bosses, the Director of Research Analysis and Statistics of IRS and the Director of Tax Analysis of Treasury, and they can - they sometimes agree - put it that way - on what we should be doing and how we should be doing things. So that is a little awkward.

Also, because I am Director of SOI, I get occasional questions as to, Can you tell me about Section 8267 of the Code and how that affects my small oil and gas operation?

And the bottom line is the Internal Revenue Code is 2,000 pages long. The supporting regs are over 10,000 pages, at last count, and I really don't have that encyclopedic knowledge. I am sorry.

But I do get questions like, What is the average amount of charitable contributions for my level of income? And that is something I can look up, even though I can't look up what is the status of your refund.

So, that said, hopefully, I can tell you a little bit about IRS tax data administrative statistics as a potential source for shedding some light on health outcomes and so on.

Before I say anything further, I would like to add my disclaimer that these are my personal views and not necessarily those of the Internal Revenue Service or the Department of Treasury.

A little bit about my organization. IRS is a large organization, 100,000 employees, an $11-billion budget.

The SOI program is about .4 percent of that, $40 million, which sounds like a lot of money, but, relatively speaking, it is small, and under 500 employees.

Our primary customers - We have two very intensive customers and those being the Office of Tax Analysis of the Treasury Department and the Congressional Joint Committee on Taxation.

I'll be talking a little bit about data access and disclosure, and let me just say that those organizations do have full access to all of our data, and they have a heavy role in directing our priorities and studies.

However, we do have many other customers, many in the federal statistics community, including the Bureau of Economic Analysis and the Census Bureau.

Okay. What kind of data do we have?

Well, I probably should have given a little bit more background to this slide. I am going to be talking a little bit about two types of data. One are the sample data from my organization. We sample tax returns and we have scientifically-designed samples. We edit these samples very carefully, and we weight these to national totals.

The content is based on our user needs, and so if the Treasury Department comes back and says, We want multiple schedules on depreciation by different types of class lives, we can add that. It is a resource issue, but it is not a policy issue or program issue.

Now, separate from that is the relatively content poor and less edited data in the IRS Master File system, and we don't produce that data, but we are kind of a gateway to other federal statistical agencies and researchers who do have access for that.

It is also the main source of statistics at the sub-national level, because our SOI samples are not robust for most states and certainly not below the state level.

Okay. So those are the types of data we have.

Now, in each of these program areas - individual, corporate, partnership, estate and gift tax, tax exempt or non-profit organizations - we have pretty much these two sources, and, for the most part, the content-rich SOI samples are the preferred source, just because the data are of higher quality as well as the content much greater.

And just as a footnote, I should add that all the data that is filed on all tax returns and schedules is not transcribed, and that is one reason for our program is that I think there are something like nearly 100 schedules that an individual can append to their 1040 return, and only a limited amount of that data is transcribed. So, as far as content, we have that flexibility, although, as I said, it is a resource issue, and, again, all of our data are preaudit.

Well, two viewpoints on SOI. Well, one is that we try to be a cooperative, collaborative and efficient producer and user of data based on administrative records, and I think we do a pretty good job on that, for the most part, but, then, on the other hand, we are the - quote - tax collectors in disguise, as survey statisticians, and I'll talk about what that means a little bit further down the road, but, first and foremost, we are employees of the Internal Revenue Service, and that has certain legal issues in terms of what we do and in terms of our relationships with others as well.

Now, we have kind of generically three opportunities for data linkages, first being linkages involving solely IRS tax and information returns. Others are linkages involving tax data to survey records, and that is a very short topic, which I'll tell you why, and, then, lastly, I suspected that the kind of tone or focus of the conference would be what about microdata access by researchers and other agencies and so on, and so I thought I would spend some time on that, and I have brought my disclosure expert, Nick Greenia, who is sitting in the back there, who works with me and my boss, Mark Mazor(?), on a lot of interagency issues involving data access.

Okay. If you gave me the time, I could go on and on about this first one, linkages involving tax and information returns, for which we have control over the data. We have access to the data and so on, and first might be individual returns linking in 1099s and W2s.

For instance, you see an individual return. It's got a certain income, $75,000. It is a joint return. There's a husband and wife, apparently. Is it from one income or two incomes? Well, you can't determine that without looking at such things as the W2s and so on. So we do linkages like that to look at the whole picture of family economic income.

Partnerships. Partnerships are a major type of business, but they are untaxed. People or organizations, corporations, non-profits form partnerships. They report their financial activities on an information return, and then they distribute their taxable shares to the partners, and those could be expenses. They could be income and so on.

So to ascertain the total effect of partnership taxation, you have to link in partner tax returns, and one partnership could have 20,000 partners. So it is not a trivial matter.

Small business corporations are similar. They have the flow-through nature of a partnership.

The estate tax has become a hot political topic once again, and among the questions are, What happens to these bequests from large estates? The estate return shows who gets them, but it doesn't show the income and what happens to those individuals over time. We can link them up by matching their 1040 returns and following them over time on a panel basis.

There are a lot of issues in regard to consolidated corporations. Subsidiaries may file separately. To get a combined picture of corporations, you have to link these together, and we use the employer identification number to do that.

And then another focus of our primary customers, Treasury and the Joint Committee, are for panel files linking the same entity - corporation, individual, et cetera - over time or linking within a year, and we do quite a bit of that kind of work, which we can talk about as time allows.

Okay. What do these files have?

Well, first of all, high-quality linking variable. If you don't have that - We don't do a lot of research on using other things like names and addresses to link. If we don't have a high-quality variable, we probably don't have the resources to do a high-quality study.

Fortunately, most of our studies, most of our files, the variable, the employer identification number or Social Security number is very accurate.

Obviously, if you didn't have overlapping samples and you try to link records, you are going to get few hits because of that. So we need at least one population file or samples that are substantially overlapping.

And, lastly, what about accounting periods? We do - in a lot of our linked studies, we want to align accounting periods. We want to take the partner and the partnership or the corporation and its employment return and we want to be very specific about aligning those accounting periods, or, in other cases, we want to show a panel-like study with different periods.

Okay. Well what have we learned from all these studies over time?

Well, a few things. First of all - and I think that has been the tone of day one of the conference is that matched files can be very rich analytically and so on, but, on the other hand, linking data files is never probably as easy as it seems to be. Data quality is never optimal. Even in our SOI files linking variables, even though our Social Security numbers and EINs are high quality, but there are times when, in some cases, that they are not as high quality as we would like and we have non-matching and so on.

Resolving these discrepancies, if they are important - and I think they are - often is a very labor-intensive effort, and so we try to avoid that to the extent possible, but we don't want to produce a file that is based solely on matches and ignore the non-matches. I think that would be a mistake, and linked files sometimes don't answer all the questions, and we could talk about that also. Okay?

But I think the one key thing is that you take this data, if you develop high-quality linking variables and you put it in a relational database, I think you are way ahead of the game. Okay?
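The linkage approach Mr. Petska describes (joining files on a high-quality identifier and keeping the non-matches rather than discarding them) can be illustrated with a minimal Python sketch. The record layout, the field name `tin`, and all values are hypothetical and are not drawn from any IRS file; this is an illustration under those assumptions, not a description of the SOI method.

```python
from collections import defaultdict


def link_records(left, right, key):
    """Link two lists of dict records on `key`, keeping non-matches from both sides."""
    # Index the right-hand file by the linking variable (one key may have many records).
    right_index = defaultdict(list)
    for rec in right:
        right_index[rec[key]].append(rec)

    matched, left_only = [], []
    seen_keys = set()
    for rec in left:
        hits = right_index.get(rec[key], [])
        if hits:
            seen_keys.add(rec[key])
            for hit in hits:
                matched.append((rec, hit))
        else:
            # Retain the non-match so it can be reviewed instead of silently dropped.
            left_only.append(rec)

    right_only = [rec for rec in right if rec[key] not in seen_keys]
    return matched, left_only, right_only


# Hypothetical data: two returns and three wage statements.
returns = [
    {"tin": "A1", "income": 75000},
    {"tin": "B2", "income": 42000},
]
wage_statements = [
    {"tin": "A1", "wages": 50000},
    {"tin": "A1", "wages": 25000},
    {"tin": "C3", "wages": 31000},
]

matched, unmatched_returns, unmatched_statements = link_records(
    returns, wage_statements, "tin"
)
match_rate = (len(returns) - len(unmatched_returns)) / len(returns)
```

Returning the two non-match lists alongside the matched pairs makes the match rate visible and leaves the unresolved records available for the labor-intensive follow-up mentioned above.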

Well, this, I said, was going to be a short topic, linkages of tax data to survey records, and it is short for two reasons. One, we do very few surveys ourselves. The only surveys we do are for corporate returns, particularly multinationals.

As you know, just as individuals can request filing extensions, corporations can, too, and most of them routinely do, and so if we have an early data cutoff and we need to get that corporate return in, that major U.S. corporation, that sample at a weight of one, we often do a survey and request preliminary data from them, and so - or, in some cases with multinational corporations, they're just not completed that accurately, the presumption being that everybody provides IRS accurate and complete data, but that is not always the case, and these multinational returns are very, very complex and sometimes, even despite their best efforts, they are not complete showing all their international activities and so on.

But then the last point is we don't do matches to other survey data, because, again, getting back to my point earlier, we are the tax guys, so people don't provide us microdata. I mean, that is the other hat we wear. We are the IRS guys and the perception would be that could these data be brought into a compliance type situation.

Okay. Provision of microdata to other agencies. Who can get what? Well, this is a very, very short summary and so on, but, first of all, tax administration. By this I mean, for the most part, members of the IRS, and sometimes Treasury as well and so on, who have a tax administration motivation. Taxpayer account processing, audit, compliance and research functions are all internal and they all can get access to most of these. Although, we have to be very careful in cases where there are sample data involved and could it destroy the accuracy and representativeness of these samples.

Tax analysis. Treasury's Office of Tax Analysis, the Congressional Joint Committee, CBO and GAO all have roles in tax analysis and can get data, not 100 percent, but, for the most part, can get some identifiable tax data for specific purposes as articulated in the Internal Revenue Code.

And, then, statistical use. The Bureau of Economic Analysis can get corporate data. Census gets population data for individuals and for businesses. When the Ag census was moved to the Department of Agriculture, NASS, a few years ago, that data access was also enabled with them, and the CBO as well. Okay?

We talked yesterday - and I think the Census people presented this very well - that their goal, their mandate is to use existing data systems to the maximum extent possible, such as administrative records.

Our mandate, unfortunately, is the opposite, though. It is to provide only the federal tax information for authorized purposes and to the minimum extent necessary. So this is just a naturally conflicting mandate that we have.

Constraints on using IRS tax data. Well, first of all, it has to be for a use for authorized purposes, and this is defined in the Internal Revenue Code and in supporting regulations, the 10,000 or 12,000 pages that I mentioned earlier, and can be also further defined in separate documents, MOUs, such as the Census-IRS criteria agreement.

Again, our mandate is to disclose only the minimum confidential federal tax information necessary, and there are substantial penalties for unauthorized disclosure or inspection, and publicly released data must be anonymous. Although, we do have a public-use file. As time allows, we can talk about that, but we do remove identifiers and sanitize records in a subsample of our individual program, and it is used by a number of high-profile policy analyst groups.

Okay. Briefly, the authorization process for access to tax data.

Well, first of all statistical recipients - and this came up yesterday - need to be cited in Section 6103(j) of Title 6 (sic) of the Internal Revenue Code.

To change that, Congress must enact legislation.

The statute authorizes access purpose and may stipulate supporting regulations and so on. The regs - regulation detail may restrict uses as well.

And, as I said before, policy agreements can provide additional enumeration.

In summary, access to tax data is very restricted. Some possibilities include - and these are very limited - working as a contractor for tax administration purposes. We have had a few of these, but they are very limited. Working at an agency with current access, like the Treasury or like the CBO or the Joint Committee, or accessing limited business data via Census' Center for Economic Studies.

And to find out more, we can talk later or drop me an email or give me a call.

Thank you.

MR. LOCALIO: Thank you, Tom.

Do we have any questions?

MS. TUREK: (Off mike).

MR. PETSKA: Survey of Consumer Finances.

MS. TUREK: Yes. They use you as a sampling frame. They don't get any data items.

MR. PETSKA: I worked on that study several years ago, and where it is right now in terms of the firewalls, I am not clear exactly, but, basically, the high-wealth portion of their study is a list frame developed from our 1045. That is correct, and we have done this with all years of the Survey of Consumer Finances, going back at least to ‘83, and we did have some involvement in the first, the Survey of Financial Characteristics of Consumers.

Nick, do you want to say something about that?

MR. GREENIA: It is true that the Federal Reserve does receive some data, but the Federal Reserve is perceived as a 6103(n) contractor, which means that the purpose of that tax-data receipt is seen as fulfilling a Treasury tax-administration purpose.

As you may recall, when CIPSEA was submitted to Congress in 2002, there was a companion bill, and the companion bill had the Federal Reserve in there, so that they could receive tax data unrestricted for Survey of Consumer Finances purposes.

DR. STEUERLE: Tom, you might clarify a little the fact that it is possible to apply to your agency to have data run by outsiders.

I am wondering if you might also comment on the extent to which that is really restricted by resource constraints, because I am sure everyone who comes to you to have something run basically impinges upon your resources to some extent, because they always need some helping hand, but that is a possibility -

MR. PETSKA: Yes, that is a very good point.

DR. STEUERLE: - with respect to health data that may not be even with respect to any tax question, right?

MR. PETSKA: Yes, that is right. I mean, again, I have talked about restrictions on access to micro data, but at table-level data, we have disclosure rules to suppress cells that have fewer than three observations at the national level and so on.

But, for the most part, if we have an existing file that will meet your needs, we can enter into a small reimbursable contract to produce tables from those files and so on.

The problem gets in when there's matching required or content that we do not currently have.

For instance, a few years ago, we talked about the idea of non-cash charitable contributions. Could we produce some aggregate statistics to be published on that?

We didn't pick it up in the program back in those days, and so for us to edit those data, build it into the sample, weight it and everything else was a very expensive task.

Since then, we've gotten a push from Treasury and from Joint Committee to include that part of the program. So, now, we have, though, so we could produce additional - from that and so on.

So, again, we do have restrictions on staff, time and so on, but, for the most part, a tabulation from an existing file, we really try to service those kind of requests.
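The table-level disclosure rule Mr. Petska mentions (suppressing cells with fewer than three observations) can be sketched in Python as follows. The threshold of three comes from his remarks; the cell labels, field names, and amounts are hypothetical. Real disclosure review typically also involves further steps such as complementary suppression, which this sketch does not attempt.

```python
from collections import Counter

# Threshold taken from the remarks above: suppress cells with fewer than three observations.
MIN_CELL_COUNT = 3


def tabulate_with_suppression(records, cell_key, value_key, min_count=MIN_CELL_COUNT):
    """Sum `value_key` within each cell; report None for cells below the count threshold."""
    counts = Counter()
    totals = Counter()
    for rec in records:
        counts[rec[cell_key]] += 1
        totals[rec[cell_key]] += rec[value_key]

    table = {}
    for cell, n in counts.items():
        # A suppressed cell is published as None rather than as its true total.
        table[cell] = totals[cell] if n >= min_count else None
    return table


# Hypothetical records: three filers in one income class, two in another.
records = [
    {"income_class": "under_50k", "contributions": 1000},
    {"income_class": "under_50k", "contributions": 2000},
    {"income_class": "under_50k", "contributions": 3000},
    {"income_class": "over_1m", "contributions": 500000},
    {"income_class": "over_1m", "contributions": 250000},
]

table = tabulate_with_suppression(records, "income_class", "contributions")
```

In this example the `under_50k` cell has three observations and is published, while the `over_1m` cell has only two and is suppressed.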

DR. DAVERN: Hi. Michael Davern from the University of Minnesota. I have a question.

We heard from Census yesterday about matching Medicaid records to the current population survey. Something that might be interesting as well would be to take a look at some of the W2 information - I don't know if they have access to it - which would say deferred compensation for health insurance coverage, for example, that a person paid pre-tax dollars into an account.

I was wondering if that was some kind of matching study that may be possible to verify how well the CPS not only measures Medicaid, but how well it measures private insurance coverage.

MR. PETSKA: That is a good question.

Nick, can you help me out there in terms of does Census get the W2 records now?

MR. GREENIA: As a result of a regulation amendment two or three years ago, they now get some limited data that includes deferred compensation from the W2. So that information is in scope at the Census Bureau.

Where we get into some issues is when Social Security data might be involved, and Social Security, as you know, has a very unique arrangement with the Census Bureau. They essentially access tax data and Census data as special sworn status employees.

So, for purposes of the criteria agreement, the policy agreement that Tom was talking about that enables access to new projects by special sworn status, we treat them as, if you will, employees for purposes of this agreement.

The sticking point is that that kind of access, especially if we are talking about any tax data that Social Security does not have access to, even when they are matched to Census Bureau and they do have access as far as we are concerned, Census Bureau views them as special sworn status employees, which means, for purposes of our agreement, they are viewed as Census employees.

So once that enters the equation, the work has to be done under the criteria agreement, which means that it has to be predominantly for Title 13, Chapter 5.

MR. LOCALIO: Howard, did you want to comment on that at all?

MR. IAMS: Yes, the process for doing that would be to go to the Census Bureau and apply for permission to use these data at your Census restricted data center and they have an application process that formally requests a purpose, and what Nick has emphasized about the Title 13 is - has to be done, what you have articulated would have a clear Title 13 purpose.

I think the problem you will encounter is that I do not think the code that isolates health purpose for the deferred compensation is available. It may be available for the last earnings year, but it was not available before 2005. So you won't be able to identify how the purpose of this deferred differs from, say, 401K.

I don't know if Ron - do you remember? We recently started coding for our matched data the reason for the deferred - what kind of account it was - whether it was 401K, 403B, 457, and I do not recall if the health account was separately coded.

MR. PREVOST: Yes, I don't believe it was, but we could check into that, that is for sure. Certainly, the project that we are potentially discussing here would have a clear Title 13 benefit. I mean, it is -

MR. IAMS: Oh, it would. The question is whether the data are there.

MR. PREVOST: Whether the data are available is the main question.

MR. IAMS: The deferred compensation is identified, but there is a box that identifies what the deferred compensation reflects, and I don't recall if health account was one of the codes. It has only been available in the last earnings year.

MR. LOCALIO: Well, I think we need to go on.

I want to thank you, Tom, very much for your presentation. We gotta go on or we are going to run out of time. I am sorry.

Next, we are going to hear from Richard Bjorklund from the Veterans Administration. While he is getting his presentation set up, I just want to say that, yesterday, we heard several people comment about their potential dealings with the Veterans Administration and how those dealings were cut short by that announcement of a potential breach which never happened.

I do want to say that I got a letter from the Veterans Administration, as a veteran, saying that there had been a potential breach, and then a letter saying that it did not happen, and I think we would all be interested to find out what have been the - if you have any comments about what have been the repercussions of that from your perspective.

Thank you.

Agenda Item: Richard Bjorklund, Director, Veterans Administration

MR. BJORKLUND: Well, just to answer that question quickly, the repercussions have been a very tightening of all of the policies and procedures for distributing data both within the organization and also to other federal agencies or contractors or researchers.

In fact, researchers outside of VA who once had access with stipulations to VA data have all but been restricted from accessing that data now.

So the security and privacy procedures have been tightened extraordinarily, and I know some of you from CMS are here, and we have been working diligently for months to get specific agreements in place and protocols for transferring VHA data to CMS. That is happening shortly, but it has taken a great deal of time, too.

Well, let me launch into the presentation. My presentation is more general in nature than what Tom was talking about.

I am going to be talking about more strategic direction that our organization is taking regarding linked data and talking specifically about a project that is underway.

First, we link with our internal data a number of independent activities. We have an annual survey of enrollees where we try to identify perceptions, interests, preferences, behaviors of veterans, and we link that to their inpatient and outpatient clinical records.

We also do customer-satisfaction surveys of fairly excruciating detail, and we link that also to our administrative data both clinical and cost data. So we can get a comprehensive assessment of the performance of veterans organizations.

Today, I want to talk specifically about a very large project that we have been undertaking for the last 18 months, and it regards the integration of VHA, Medicare and Medicaid data in the production of a user-friendly system, and I am going to be talking about the opportunities that we envision for improving healthcare outcomes, some of the barriers to implementation that we have observed going through this 18-month process, more about what the process was and some of the challenges going forward.

First, to put this in context, I want to spend a little bit of time talking about the VHA organization.

First of all, VHA is a component of the Veterans Administration. It is one of three major components. The other two being cemeteries and veterans benefits.

We have approximately 156 hospitals, 876 outpatient clinics, nursing homes and domiciliaries.

The VHA budget is about $35 billion and would rank it amongst the Fortune 50 organizations if we were a private-sector organization. So we are a very, very large organization and we are a very big player in the U.S. healthcare system.

Recently, VHA has been mentioned as providing some of the best healthcare in the country, and it has been management's objective to be the world-class healthcare provider for some time, but to continue to be the world-class provider, we need to continue to be on top of our game, and that is to identify opportunities to improve quality, cost and access, and, in addition, we think, by identifying these opportunities, it facilitates what we refer to as a learning organization.

When these opportunities are identified, it challenges our employees to think about the best solution to the opportunities, and, hence, old cliches like not invented here are quickly becoming disassociated from the culture of the organization and in its place is a constant search for better ways to do things, better ways to achieve superior outcomes and looking beyond our organization to the outside world to identify ways to improve that and essentially becoming globally smart, and we think that integrated data provides opportunities to facilitate our overall objectives of maintaining our world-class status and identifying opportunities, and, specifically, the areas that we think have the greatest potential here are for best practices, and that is comparing VHA with the private sector along both outcomes and cost dimensions.

So for any of our 156 hospitals, we would be able to identify where the biggest opportunity, the biggest clinical opportunity for improvement is or where the biggest cost opportunity is and what the tradeoffs and the metrics linking cost and quality are.

So we, essentially, are beginning to use - we have internal resources devoted to developing risk-adjusted outcomes models, severity-adjusted cost models. We also have resources dedicated to identifying how veterans make decisions when they select a VA facility versus a private-sector facility, and we can compare things like quality, cost, access, benefits and service characteristics of our facilities versus those in the private sector and look at the impact of decision making.

As part of this particular effort, we were also able to identify fraudulent billing practices that occurred where healthcare plans, physicians' offices, et cetera were billing, double billing both VHA and CMS for the same set of services.

We think that when more timely data are available or integrated into our plans that physicians will be able to utilize this data online for treating patients.

And, finally, strategic opportunity identification where we can look at - from the corporate level, we can identify where we think the biggest opportunities are, whether they be in cost or quality or access, and identify corporate-level strategies that would be part of the corporate-level strategic plan for the coming year.

In terms of barriers to implementation, we have talked about the number of opportunities, I think, by the description that I have given. The size of these opportunities is potentially huge. So why haven't we taken advantage of these opportunities in the past?

These are some of the barriers that we have identified as we went through this project. First, there were few people that have knowledge about these three data sets and the ability to access the data.

Secondly, integration is very difficult and time consuming. Medicare and Medicaid and VHA data are three separate databases that were developed independently with different purposes, different data and different data definitions, and so if you can imagine every time in the past when we have tried to do an analysis of data where we had to integrate this, it was each time the data had to be integrated for that single purpose.
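The integration problem Mr. Bjorklund describes (three independently developed sources with different field names and definitions, re-integrated from scratch for each analysis) is often addressed by mapping each source once to a shared layout. The Python sketch below illustrates that idea only; every field name and value in it is hypothetical and does not reflect the actual VHA, Medicare, or Medicaid file layouts.

```python
# Hypothetical mappings from each source's native field names to one common schema.
FIELD_MAPS = {
    "vha": {"pat_id": "person_id", "svc_dt": "service_date", "cost": "amount"},
    "medicare": {"bene_id": "person_id", "from_date": "service_date", "paid_amt": "amount"},
    "medicaid": {"recipient": "person_id", "dos": "service_date", "payment": "amount"},
}


def harmonize(source, record):
    """Rename one source record's fields to the common schema and tag its origin."""
    mapping = FIELD_MAPS[source]
    out = {common: record[native] for native, common in mapping.items()}
    out["source"] = source
    return out


# Hypothetical raw extracts, one record per source.
raw = {
    "vha": [{"pat_id": "P1", "svc_dt": "2005-03-01", "cost": 120.0}],
    "medicare": [{"bene_id": "P1", "from_date": "2005-03-01", "paid_amt": 95.0}],
    "medicaid": [{"recipient": "P2", "dos": "2005-04-15", "payment": 60.0}],
}

# Integrate once; later analyses read the combined file instead of redoing the mapping.
combined = [harmonize(src, rec) for src, recs in raw.items() for rec in recs]
```

Keeping the mappings in one table means that a change in a source definition is corrected in one place rather than in every analysis that uses the data.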

Data sets are very large and generally require a higher level of programming skill, and, generally, that is SAS.

Investment in hardware and storage media can be an important consideration, depending on the number of users.

Potential users of this data have different needs, and it is - those needs have to be carefully considered in designing a system. One system will not satisfy the needs of all users.

The size of the potential demand for this integrated data is unknown within our organization, and, hence, the risk of investing in a large system that is very expensive and it may not produce a payback.

And, of course, privacy and security laws and regulations add in a very large dimension to managing this set of data and is something that is becoming increasingly more important, being elevated in terms of its priority in making investment decisions.

Next, the decision makers, by and large - and I guess I am talking in general - do not have the experience of using data that is outside of the organization. Historically - and I think this is true within most organizations - the focus has been on internal data, and, quite frankly, I would suspect that - just estimating - 70 percent of the information for solving most problems comes from internal information.

And so our managers and executives are not - do not have the extensive experience in requesting the kinds of information that comes from the integrated data.

And, finally, and another important consideration, is the economics of such a database, and when we talk about the economics, we are thinking more broadly, not specifically, at the costs, the dollars and cents numbers, but more broadly in terms of the fixed cost and the variable costs of these activities and whether it makes any sense to outsource those variable costs, those costs that could be converted from fixed to variable.

So the process in this project, first, we undertook a survey of users in our organization to try and identify both the size of demand and the timing of demand and also customer uses of integrated data, and those are potential uses, and we learned that demand would grow slowly, but would, over time, begin to increase rather dramatically as people learned how to use the data and how to access the data.

We hired a contractor who had experience in integrating Medicare and Medicaid data, but no experience with VHA data, and asked them to integrate the data and develop a user-friendly system.

The user-friendly system we considered key, because it would expand, exponentially, the number of users, and, specifically, historically, our user base for this integrated data has been researchers and data analysts.

We wanted to expand that to what we refer to as the casual users, the directors of hospitals, the directors of our VISNs or regions, our chief medical officers, et cetera, but they are not sophisticated users, and so the user-friendly system had to be simple enough for them to access the data, and we thought, as we expanded the user base, that we would be increasing the value that the organization received from the data.

The pilot test that we put together consisted of data integration of the three data sets that I have spoken about and systems design.

We had three white papers written which were basically analytical, short analytical papers by the contractor. He worked on three issues that were top priority issues for the organization at the time and presented some white papers.

We also did tutorials to researchers and data analysts about the system. We asked them to come up with a short research topic and to use this integrated system to address the quick research questions that they came up with.

After the tutorials and research projects were completed, we conducted a customer-satisfaction survey amongst the users to identify strengths and weaknesses of the system and of the integrated data. We also did a data validation study, where we validated the data that came from this integrated system with the raw CMS data that we have in our files.

Generally, for systems design, we have in mind a multiphased project. Each phase would consist of design, use and assess. So we would be coming up with a first phase, having our users assess it, going back to the design table with the contractor, redesigning it again and going out, having users use it and assess it, until we felt comfortable that the new system could be rolled out to the entire organization, and I have talked about the three customer groups, the researchers, data analysts and the casual users we were trying to look at.

In terms of Phase One, we used the five-percent sample of Medicare data for one year and a 100-percent sample of VHA and Medicaid data for the same year and merged those three data sets.
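As a minimal sketch of the kind of merge described here, assuming (hypothetically) that each source can be keyed by a shared person identifier, the three files might be linked as below. The record layouts, field names and the function `link_sources` are illustrative assumptions, not the contractor's actual method.

```python
# Hypothetical sketch: link three sources on an assumed shared person identifier.
# The VHA file drives the link, so every VHA enrollee appears exactly once.

def link_sources(vha, medicare_sample, medicaid):
    """Attach Medicare-sample and Medicaid records to each VHA record.

    Each argument is a dict mapping person_id -> record. A source with no
    record for a person contributes None for that person.
    """
    linked = {}
    for person_id, vha_record in vha.items():
        linked[person_id] = {
            "vha": vha_record,
            "medicare": medicare_sample.get(person_id),
            "medicaid": medicaid.get(person_id),
        }
    return linked


# Made-up example records, for illustration only.
vha = {"p1": {"visits": 4}, "p2": {"visits": 1}, "p3": {"visits": 7}}
medicare_sample = {"p1": {"claims": 2}}
medicaid = {"p2": {"claims": 5}, "p9": {"claims": 1}}

linked = link_sources(vha, medicare_sample, medicaid)
```

One design point worth noting: because only a five-percent Medicare sample is used, a missing Medicare record in this sketch cannot distinguish "not enrolled" from "enrolled but not sampled," so an analysis would need to account for that separately.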

As I mentioned, we did a customer-satisfaction survey, and it pointed to areas that were strengths of the system, but also some shortcomings in the system. So we are prepared to make improvements should we decide to go forward.

VHA, Medicare, Medicaid data were integrated using contractor assumptions; that is to say that VHA staff were not intimately involved at this point.

Intermediate data products, and these were basically SAS data sets that were produced from the raw data, were compared to VHA Medicare files and no significant differences were found.

Among the issues that came from the satisfaction survey, spending more time learning the system and/or making the system more user friendly were raised. There were technical questions raised about how to make the system faster, for example, with bigger machines, more memory, or processing one year's worth of data at a time.

It was thought that the risk associated with HIPAA, the Privacy Act and security regulations might be reduced by using a contractor and the contractor's customized software, and, in the future, more involvement of VHA staff was needed.

Questions remain about whether there is sufficient demand to justify the investment, and whether technical and other challenges can be overcome.

And some of those challenges are the user-friendly nature of the system: can it be made more intuitive? And reducing the learning curve for researchers, data analysts and, clearly, for casual users.

We think that a reporting system linked to the output of the user-friendly system would satisfy the needs, for the most part, of casual users.

And addressing processing time issues is also another one. Processing time was mentioned by our technical folks. Another side of the story was mentioned by one of our researchers who said that his time from the point where the project was initiated and where integration of the data was called for to the time that he received analytical results was cut almost by a fifth.

Now, at the same time, what we were hearing from some of our technical folks was that it was taking up to 24 hours to run requests for large data sets.

So there is some benchmarking that is required as we go forward and some internal agreement as to what we are going to measure and how we are going to benchmark.

MR. LOCALIO: Richard, we have to move on to the next speaker, if you could conclude as quickly as possible.

MR. BJORKLUND: Okay. There are cultural differences issues that we need to overcome. I have mentioned the economics and the importance of outsourcing, and, finally, some organizational issues.

This has been a green-house project. We are not entirely sure whether it is part of a - the next phase should be part of a planning and policy office.

MR. LOCALIO: Thank you very much.

I want to introduce our next speaker, and, then, while you are setting up, maybe entertain a question.

Our next speaker is going to be Christopher Chapman of the National Center for Education Statistics.

Do we have a quick question for Richard on his presentation?

DR. SCANLON: This is maybe a comment as much as a question.

Since you raised sort of the issue of Medicaid data again, we heard about it before, it creates for me sort of a bigger issue, which is the quality of administrative data, and while we are interested in terms of linkages to be able to expand our capacity, there is a question of did we move too far in terms of reliance upon sort of administrative data, and while averages may sort of turn out for populations to be the same, when we do validation studies, when we get down and we start to slice things more and more, we may be on relatively thin ice, because the data are not good.

I raise this because of sort of prior work that I did at GAO. Medicaid data is always suspect, and there were efforts that we had to do where we had to go out and collect new data because the kind of information that comes to CMS is not necessarily sort of accurate.

It becomes even more problematic as Medicaid moves more and more towards use of managed care and the variability, in terms of what the managed-care plans report to the state, is increasing and leaves you big gaps, and so I guess this is - maybe there's not an easy answer to this, but I think it is an issue that we should be thinking about.

As much as we are concerned, in terms of trying to protect privacy, about whether people can be identified, there should be a concern about linked data sets: what is the extent of their strength? I mean, what can they be used for and what would be pushing their limits too far in terms of reliability and accuracy?

MR. LOCALIO: Did you want to respond quickly?

MR. BJORKLUND: Yes, in terms of the Medicaid data, yes, we agree. We have concerns about that.

Our primary focus in our best practices is with the Medicare and the VHA data.

At the corporate level, we try to do what we refer to as dumbing down the data. So we downgrade it from ratio- and interval-scaled data to nominal and ordinal data, so as to eliminate some of that error, but, clearly, these are concerns.
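To make the downgrading step concrete, here is a minimal sketch of mapping a ratio-scaled value (say, an annual cost in dollars) onto ordinal categories. The cut points, labels and the function `to_ordinal` are hypothetical and chosen only for illustration; nothing here is taken from the VHA system.

```python
from bisect import bisect_right

# Hypothetical cut points (dollars) and ordinal labels; not from the source.
CUT_POINTS = [1000, 5000, 20000]
LABELS = ["low", "moderate", "high", "very high"]


def to_ordinal(value, cut_points=CUT_POINTS, labels=LABELS):
    """Map a ratio-scaled value to an ordinal category.

    A value equal to a cut point falls into the higher category.
    """
    return labels[bisect_right(cut_points, value)]


costs = [250.0, 1000.0, 4999.99, 75000.0]
categories = [to_ordinal(c) for c in costs]
```

The idea is that small measurement errors in the underlying dollar amounts matter less once values are reported only by category, although values near a cut point can still be misclassified.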

MR. LOCALIO: Thank you.

Chris, why don't you proceed. Thank you.

Agenda Item: Chris Chapman, Program Director for Early Childhood and Household Studies, National Center for Education Statistics

MR. CHAPMAN: Sure.

Hi. My name is Chris Chapman. I am from the National Center for Education Statistics, which is part of the U.S. Department of Education.

My presentation is really going to focus more sort of on our experiences at the center with using administrative record data that have been collected already through the National Center for Health Statistics.

I guess before I get too far into this I should also make a disclaimer. I am speaking here not for the department or for my organization, but more as a data user.

That said, let me sort of jump in and discuss a little bit about the kinds of data that we typically get at the center regarding health.

Not too surprisingly, we focus most of our data collection on trying to get information about students and other individuals like teachers that are key to the education system, and apart from individual-level data, we also collect information directly from institutions themselves, in particular schools.

Most of our experience there, in terms of gathering health information, has been at the elementary and secondary level, trying to determine which students actually have individualized education programs which are specifically designed to help students with disabilities get the kind of education that they are going to need in order to function in society later on after school.

The data sources that we normally get our information from are parents, students and school records. Okay. These are not health-system type data collections. They are relatively general, and we rely on parents to have relatively good information about medical evaluations of their children, and we rely on students to be relatively knowledgeable about their health conditions, and for school records, as I mentioned before, we really focus in on the IEP data that schools gather and keep for their students.

However, it would be good and would be useful for us to be able to get more information about student health linked into the school record systems.

This next couple of slides, I am going to briefly go over the types of data collections that we've got in place, so that you'll have a better understanding of what we have available and what we usually work with.

This first slide focuses on the early-childhood longitudinal studies. These studies are actually in my program office.

As you can see here, we've got two cohorts. There is a birth cohort and a kindergarten cohort. The birth cohort focuses on a group of children who were born in 2001 and the kindergarten cohort focuses on a cohort of students who were in kindergarten during the 1998-1999 school year.

The reason why I want to start off with this slide is the ECLS-B, the birth-cohort study, really is, I think, the center's most extensive experience using health-record systems - okay? - in particular the sample for the study was drawn directly from the birth-certificate record systems that are available through the National Center for Health Statistics.

And apart from using the birth-certificate data, that data set also involves some direct health assessments. Our field interviewers actually did data collections on birth weight, height, cognitive growth and motor-skill development as the child progressed from birth through at least into kindergarten now.

And then we also had, as I mentioned before for many of our data sets, parent reports of diagnosed disabilities and overall health of the child.

The kindergarten cohort data collection had much of the same kind of health information, except for the birth-certificate data, and those data would have been useful to get.

We were not as experienced, I don't think, as some of the other organizations here with actually taking an existing data set and trying to cross link it with administrative record systems. So we did not undertake that.

And apart from that information, the kindergarten cohort also collected data directly from schools about IEPs for the sample children.

This next slide has some information about a high-school cohort that is comparable to the ECLS studies in that we are tracking a group of students over time. In this case, it is tenth graders, and we are tracking them through early adulthood.

Again, the health-related information we have in this collection was reports from the students' schools about their IEP status and health-related programs that the students were in.

We have also asked the parents to provide us information about diagnosed disabilities that might not have impacted their IEPs, and then we also asked the students themselves about their health status.

The National Household Education Survey collects data about populations from preschool through adulthood. Here, our only data source is information that we get directly from the parents and the students themselves.

The types of information on these data sets that would allow us to cross link to the administrative record systems are relatively limited. The sample draw for this particular data collection is telephone numbers, and we have not, to date, found a good way to even cross link the telephone numbers with strong, address-matching records, which limits our ability to link it into some of the more detailed administrative record-data systems that are out there.

This next slide summarizes the post-secondary data collections that we have conducted to date and continue to field.

The biggest one here is the National Postsecondary Student Aid Study or NPSAS.

All these studies, again, rely on us getting reports directly from the students about their health status.

The NPSAS collects a lot more detailed information than many of our other studies, but, nonetheless, is still a self-reporting system.

Okay. I wanted to get back to the ECLS-B for a moment here, because, as I mentioned before, that is really our primary experience with this sort of activity, where we are trying to actually link in existing administrative record systems with our survey-data systems.

The birth cohort did this quite efficiently by starting out with the birth-certificate data system that is available. So, as a result, we have a very rich database available for these children right from their birth, and the birth-certificate data is actually rich enough that we treat that initial birth-certificate collection point as an initial data set, even though we didn't do any surveys. We just basically took the data off of the birth-certificate data and we linked it into the student record, and we have been tracking it ever since, but that is really our only experience to date using the statistics with our survey data.

Staying with the ECLS-B, I don't want to minimize the health-record systems that are out there. Without them we could not have even done this study. We wanted to make sure that we had a representative sample of children at birth, and the most efficient way to do that was to use the birth-certificate-record system.

In the next few slides, I am going to go through some of the ramifications of that, one of which is that because the administrative record data are so rich and it is relatively easy to identify individuals using them, even in a relatively small-n sample study, with the ECLS-B, in particular, we have gone to a model where we do not have a public-use data set. Okay?

If researchers in the room are interested in using the data, they need to apply to the center for a license, and, then, we'll grant you the license, and we have a relatively stringent - procedure to make sure that the data are not inadvertently released and that no individually-identifiable information is ever published.

That said, much like the VA, we have done a lot of work to try to figure out ways to make the data more user friendly to the public. You know, you can get a restricted-use license if you are a researcher and do a lot of very interesting analyses, but a lot of the times the types of people who want to get access to the data are school administrators or child-care providers, who just want to get a general snapshot of what the population looks like. So we have been working with an online data tool that will allow people to get access to the underlying micro data, and we have done similar studies that other agencies have done to make sure that those reports are accurate and also to make sure that the types of data you can get out of the systems cannot drill down below groups that are smaller than 50 in number. Okay? That is our primary -

Yes, 50. I know. Some people think that's pretty big. Some people think that's way too small. We have gone back and forth over the years. To date, we haven't been able to - it with 50. We haven't really tried to drop it down any further, but right now, that is where we are at.

In - reports, however, we will and do produce tables that have cells that are based on n's as low as three. If it gets below three, we go on data-suppression mode on the cells and collapse them, so people can't even figure out that we only have three or fewer cases in a cell.
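The collapsing rule described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the function `collapse_small_cells`, the label "Collapsed" and the example counts are hypothetical, and the threshold is a parameter because the remarks leave open whether a cell of exactly three is kept or collapsed.

```python
def collapse_small_cells(cell_counts, min_n=3, other_label="Collapsed"):
    """Fold cells with fewer than min_n cases into one combined cell.

    cell_counts maps a cell label to its unweighted case count. If the
    combined cell is itself below min_n, the smallest remaining cell is
    folded in as well, so no published cell falls under the threshold.
    """
    kept = {label: n for label, n in cell_counts.items() if n >= min_n}
    collapsed = sum(n for n in cell_counts.values() if n < min_n)

    # Keep folding in the smallest published cell until the combined cell is safe.
    while 0 < collapsed < min_n and kept:
        smallest = min(kept, key=kept.get)
        collapsed += kept.pop(smallest)

    if collapsed:
        kept[other_label] = collapsed
    return kept


# Made-up counts, for illustration only.
table = {"Group A": 120, "Group B": 2, "Group C": 1, "Group D": 40}
published = collapse_small_cells(table)
```

The same function could be called with `min_n=50` to mimic the larger minimum group size mentioned for the online tool; in either case the table total is preserved while the small cells are no longer individually visible.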

At the center, most of our data collections are actually done through sample surveys, and, as I mentioned, with like the in-house - the National Household Education Survey, the sample frames themselves are often limited to the extent that you have identifiers that you can easily use to link into existing administrative records systems.

So, to some extent, some of these crosswalk activities might be of relative utility to us, but I started to try to think through, well, how could we access these rich data sets that are out there now and help them inform our studies, and we have some experimental articles that have been put out that look at self reports and crosswalk them with the administrative record data on like health records or medical records, and those studies have been relatively useful, in terms of us improving our survey items.

One way that we can use the administrative record system is to do relatively small-n studies whereby we actually can sort of cross link and purposefully design our study to cross link the survey data with the administrative record data to see just how accurate self reports are and to try to figure out ways to improve those self reports.

And another area of research that we probably should consider is taking a look at linking the administrative record data that are out there on health statistics with our own school-based administrative records systems.

We have relatively extensive administrative records systems that we have for both elementary and secondary schools, and also our postsecondary schools, and, right now, the type of data that we have really focus on disabilities, and that has to do with some legal requirements that the department has to help service students with disabilities, but thinking beyond that, we could, if the data were available, use health statistics and health data to try to figure out, well, are there students who are not necessarily of disabled status who could benefit from additional services? And, right now, we don't have any way to really get those data, and I think we could use the administrative records systems to do that.

I have a feeling that this first bullet was pretty extensively covered yesterday, but one of the key issues that we have and would have with trying to do some more crosswalks with records systems is actually getting the correct identifiers in our surveys.

In order to do that, we have to have a really good understanding of what information is available in the different records systems that are out there that we might link into, so that we are not collecting data to crosswalk that only will crosswalk with one record system and not another. We don't want to waste resources there.

And then we also need to have some help interpreting the data that we do get from the health databases.

I just heard a little bit of a back and forth here about what exactly is in the Medicaid data sets. We'd have to get a better understanding of the strengths and weaknesses of the administrative records systems to use them properly. We don't have that kind of expertise in house right now. I mean, we really focus on education types of issues.

I think that is it. So not too far over.

MR. LOCALIO: Well, thank you, Chris, and we have time for a couple of quick questions before we have to take a break.

Don.

DR. STEINWACHS: It would help me to get a little bit better idea, on your longitudinal studies you are picking up information on individual children.

MR. CHAPMAN: Um-hum.

DR. STEINWACHS: I guess linking in some of that information on the schools and those resources.

MR. CHAPMAN: That is right.

DR. STEINWACHS: Are there other national data collections that the Department of Education does that tracks children or are they all these - are they special studies or is there sort of a statistical system that -

MR. CHAPMAN: There is an ongoing collection effort, basically, where we start a new cohort every so many years of a particular population, so - Then that ranges all the way from high school up to college.

DR. STEINWACHS: And those would be nationally representative and -

MR. CHAPMAN: They are nationally representative data. That's right. So drilling down below the national level really isn't an option. The cost gets prohibitive very quickly with these types of collections.

DR. STEINWACHS: Because, on the health side, there are some interesting issues these days as people are very concerned about the amount of drugs and medications being given to children, whether it is for Attention Deficit Disorder or other problems, antidepressants. There are anti-psychotics now being given to small kids, so on, and, in concept, you might think creatively about or we might think creatively, I guess, about are there ways in which you could take information like out of Medicaid or other sources that could tell you geographically populations that are getting high rates of these and link them to school districts or things that would tell you something, and I was just wondering whether or not it was possible with the kind of national data, and I guess maybe it's probably a longer discussion -

MR. CHAPMAN: Right. The first answer is maybe. The second answer is we start those - Especially with the studies that we do of children who are already in elementary and secondary school system or who are already in college, we start that sample design with basically school frames. So if there is some way that we could link a student ID, especially once they are in college, when we start getting Social Security numbers, then, the linking process becomes relatively straightforward.

But for the younger children, as you were talking about, we might be able to do some linking activities through the addresses that the schools would have for the students and then crosswalk those into the databases that are out there on health statistics. I mean, we do think about that stuff.

DR. STEINWACHS: Thank you.

MR. CHAPMAN: Yes.

DR. STEUERLE: This question is a question I am going to ask later, when we get to our final session. So you might just answer briefly, if you have some answer, but I am involved in so many projects within particular silos of government organizations, but within the research community itself. So I am involved with one group that is trying to study children's outcomes that pretty much is now focusing on early childhood education and even beyond, and has sort of at least taken up, whether correctly or not, this model that the earlier we intervene with children the more the return on the investment, and then I am involved with groups like this, which is interested in healthcare, and then, at Urban Institute, we have another group that is helping to work with some of these longitudinal studies you are creating at Education, and sometimes I wonder how much do the health, the economic and the education researchers really get together when they design some of these samples.

I guess it is probably unfair to throw this all on you, except that there are so many cases where we are talking about outcomes and opportunities and mobility and issues like that. People always seem to come back both to early intervention and to education, and sometimes I don't know how to link them.

Give you a common example. For instance, some people now think we really should start at minus nine months, starting to measure what is happening to the well being of a child, because it could be that drugs and alcohol, depression, whatever other illnesses within pregnancy could have strong educational outcomes down the road.

So the question for you is to the extent you get together, you start designing these models, how easy is it to bring in somebody from HHS and how easy is it for them to come, and how easy is it to bring in people from some of these very different worlds and try to really design the models and the longitudinal studies you have?

MR. CHAPMAN: Okay. I think at the beginning I made a disclaimer that I am speaking for myself. I am going to do that again here. Not speaking for the agency.

The ease that I have experienced so far has been great. It is relatively straightforward to contact Health and Human Services or the National Center for Health Statistics and say, We are developing this study of children from birth, and we are going to track them through the first couple of grades or at least through kindergarten. Can we start to have some meetings with staff in your agency that might be interested in related topics?

And for those early childhood longitudinal studies, we have had a lot of input from Health and Human Services, and, obviously, we had to get the birth certificate data for the - cohort, but we have also done some work with - in our own Office on Special Education - to get better measures and to think through measures in health.

I think what we run into isn't always necessarily a coordination problem, although those certainly do exist. We also run into just response-burden problems, and it is good to see OMB in the room, because we can only spend so much time with the student in a school setting or so much time with a child in their house or with the parent in their house without running into really serious burden issues.

And they are good issues. I mean, we need to consider them, because, from our perspective, we want to get as much data as we can on education. So we focus on assessments and we focus on educational resources in the household and in the school, and, then, we know, from a lot of research, that there are health issues that relate strongly to educational development. So then we say, Okay. Well, we better make sure we get some of those health statistics in there, but it is rarely the case that we can make it the primary focus of the study. So we are limited to the types of data we can collect, even with really good collaboration.

Does that answer your question?

MS. GENSER: I am Jenny Genser from the Food and Nutrition Service.

I wanted to ask if your surveys contain information on receipt of school lunch and breakfast, and also if you have obesity data, because that is a big, big health-related issue that wouldn't show up in an educational plan.

MR. CHAPMAN: Right. It varies across our studies. I like to keep going back to the longitudinal study that we have for the early-childhood populations, because that is the one I work on day to day.

In that particular study, we actually work with USDA to make sure that we had proper items in there to ask about weight, but apart from that, we also have some direct assessments where we actually weigh the child and we do a body-mass index measurement that is part of the data collection.

That said, I don't want to say that that happens regularly and across the board in all of our collections. I mean, it is actually relatively unique to our early-childhood studies.

MS. GENSER: (Off mike)?

MR. CHAPMAN: We don't ask the parents regularly whether or not children are in free and reduced-price lunch programs for similar reasons that are causing problems for the CPS item on lunch receipt.

In order to really get at that, you need a good 10 questions, and we don't always have time to nail that down, but, at the school level, we do, in our school surveys, ask what - how many students in the school are actually getting free or reduced-price lunch and whether or not the program exists in the school.

DR. STEINWACHS: I want to thank the panel speakers very much and hope you'll stay with us and continue this dialogue. We are at the point where we promised you a break. We will deliver a break at this time, but what I always say is five minutes and figure that it probably stretches a little longer than that. So please take a break and come back in about 10 minutes or so.

(Break).

Agenda Item: Maximizing the Benefits from Linked Data: Access for Research and Related Issues

DR. STEINWACHS: We are sorry that Dr. Citro is not going to be with us today. She is ill and sent her regrets, very much, that she couldn't be here.

So we have just about an hour-and-a-half to split among three speakers, and we are very happy to have Joan Turek, who deserves a very large part of the credit for bringing together this group, and we did ascertain that Joan is the one who knows everyone, and so certainly the right person to know, and so over the next hour-and-a-half, we'll be hearing from three key speakers on areas of Maximizing the Benefits from Linked Data: Access for Research and Related Issues.

Joan.

MS. TUREK: Thank you.

When they were setting it up, I asked to facilitate this section, because I am a major data user, and I am probably very obnoxious about wanting access to my data and not wanting anything to happen that could limit that.

It is unfortunate that Connie couldn't be here. There is a disc available, either from Richard Suzman at NIA or from her, that has all of the reports of the studies that they have done on data sharing.

The first one, in 1985, was called Sharing Research Data. The last one, in 2005, was Expanding Access to Research Data. So it looks like, over the last 20 years, we haven't solved all the problems.

But we have three very good speakers. We are going to start with Brian Harris-Kojetin from OMB, who is going to tell us about their activities to improve access to data, and then we are going to talk to two major users about what they really want to get.

I think it is important that we have the data available and we have high quality data, but it is equally important that it is available to the people who want to use it, and I think that the value of the data is really dependent on our ability to get access. If you just collect it and stick it in a box somewhere, you may as well save the money.

Brian.

Agenda Item: Brian Harris-Kojetin, OMB

MR. HARRIS-KOJETIN: Good morning.

As the other speakers said, I have a similar kind of disclaimer. In fact, if I say anything inappropriate, the Director of OMB may well disavow that I even work there.

But I have been listening here for the past day and - well, yesterday and this morning, and I am not sure what I have to contribute to your discussion. I don't have any data sets. I don't link data. I don't directly use any of the data sets, but I'll share with you some work our office does in terms of - kind of related to this in terms of the confidentiality issues and some of the legal issues.

Another disclaimer is, of course, I am not a lawyer. I do play one on TV every once in a while, but - and I fake it in my job a fair amount, and, as you'll see, but, again, I have to disavow any of those kinds of things that I might say that sound like I actually can make such a policy.

There's a few laws - there's a couple of issues on the charge questions here that I thought I could say something about and see if it is at all helpful to the committee.

There are several laws that have some impact on data sharing. Here are the ones that I am familiar with.

One of the things that came out in the presentations by the folks from Census yesterday, you heard Title 13 mentioned quite a number of times. Certainly, a primary consideration is whatever statute - whatever authority - legal authority the agency has that is originally collecting the information, whether that is being gathered for statistical purposes or whether it is being gathered for administrative purposes, the agency has to have some defining authority to gather that information, and, oftentimes, their statutes will specify what are appropriate uses for the information and if there are any confidentiality provisions for that, and sometimes these are very vague and, you know, Go out and gather data on health or on the economy and do good works and disseminate it. Other times, it is very specific.

One administrative data set that many of you may be well aware of is the National Directory of New Hires, and those of you familiar with it know that there are - Well, with the exception, I guess, of SSA, every other agency that has access to the data has access to it for a specific purpose that is very carefully specified, and this is - So that is one example of what you are allowed to access the data for, how it is allowed to be used, and it can be specified by each agency that may be allowed access to it, and if you are not explicitly allowed access to it, then, even though you have the grandest intentions, you can't have access to it.

Folks yesterday also mentioned a couple of broader laws that apply across government agencies. The Privacy Act and some routine uses of information came up. The Paperwork Reduction Act was also mentioned by a couple of folks.

Those of you not as familiar with it may be interested to know that under the functions of the statistical policy and coordination functions that are codified in the Paperwork Reduction Act, the director actually specifically authorizes the chief statistician to promote sharing of information collected for statistical purposes, but consistent with the privacy rights and confidentiality pledges.

What I am mostly going to talk about this morning, the thing that I mostly know about, so this is why I assume I got invited, was CIPSEA, and which many of you are, I believe, aware of and maybe you know everything I am going to say.

CIPSEA is the Confidential Information Protection and Statistical Efficiency Act of 2002, which is why we call it CIPSEA instead of all of that.

It is composed of two major titles. First one is confidential information protection, which was - which you can see the purposes here is really to strengthen public trust in pledges of confidentiality, prohibit disclosure in identifiable form, control access to and uses made of statistical information and ensure that information is used exclusively for statistical purposes.

CIPSEA provides a nice statutory floor for the use of information collected exclusively for statistical purposes.

The second part of CIPSEA is the statistical efficiency part that applies only to three designated statistical agencies - the Bureau of Labor Statistics, the Census Bureau and the Bureau of Economic Analysis.

The goal of this subtitle was to reduce paperwork burden on businesses, improve the comparability of economic statistics, specifically mentioning BLS and Census, comparing their establishment - and increasing the understanding of the economy.

Quite a number of you, I am sure, are intimately familiar with the history behind this, why CIPSEA was sought after for literally decades. This went through many evolutions. There were many bills that came so close, some that passed the House and then died, and we finally got CIPSEA in 2002.

As Nick noted earlier, if he is still here. I thought I saw him, but maybe he stepped out, there was a companion bill - a Treasury companion bill for CIPSEA that - for amending 6103(j) - that did not - I don't know if it was ever even introduced, but has not been passed that is really key to some of the data-sharing provisions within CIPSEA, but that has not gone forward yet.

But one of the reasons why this has been a very important law is that there is a real patchwork among many, many different agencies that do some kinds of statistical activities. Now, there's 10 agencies that are often referred to as principal statistical agencies that that is really their sole mission is to do statistics, but there are many others represented in this room and outside this room that do some kinds of statistical activities as part of their other mission that may be regulatory or providing services.

And there have been many attempts over the past number of years to strengthen and try and standardize the statutory protections for the confidentiality of individually-identifiable data.

As I was saying before, every agency has their own specific statutes and has variations in confidentiality protection. Some, like Title 13, are extraordinarily strong. Other agencies, prior to CIPSEA, like BLS, had practically no authority whatsoever to - legal statutory authority - to base a promise of confidentiality upon, and so this really - CIPSEA provides a ground-level kind of foundation of protection for information gathered for exclusively statistical purposes under a pledge of confidentiality. So I have been saying this, now, uniform protection. It covers all data that an agency collects for statistical purposes under a pledge of confidentiality. There's very strong penalties. This is similar to penalties that some agencies already had, like Census under Title 13, and NCES, under their statute, a $250,000 fine and/or five years in prison.

It also specifically says that FOIA requests are exempt, because it defines them as a non-statistical purpose.

When we talk about - there's a few key distinctions that are important in CIPSEA. One is between statistical and non-statistical agencies. CIPSEA provides a definition of statistical agencies, those that are predominant - whose activities are predominantly the collection, compilation, processing or analysis of information for statistical purposes.

Statistical agencies have - are given some special privileges, and also some requirements, extra requirements under CIPSEA. Specifically, one area that I think many of you have been interested in is this ability to designate agents, which may be a contractor or an external researcher or - this is similar to Census' authority to have special-status employees. CIPSEA specifically provides this authority only for statistical agencies, not all federal agencies.

The other key distinction that we talked a fair amount about in the discussion yesterday is this statistical versus non-statistical purposes. CIPSEA really puts in statute this functional separation between statistical purpose, defined here, and a non-statistical purpose and draws a bright line between these two uses; that is, any information collected for statistical purposes cannot be used for a non-statistical purpose, and a non-statistical purpose being using the information in identifiable form that affects the rights, privileges or benefits of a respondent, that we talked a little bit about yesterday afternoon.

So for the Census Bureau to get administrative data that were used for a program that were used to affect - were originally gathered for non-statistical purposes, to take that across the firewall there and say, Now, we will use it for exclusively statistical purposes. However, it does not go back. CIPSEA is drawing that same bright line there. Even if your intention was to get - give it to Census, get better race codes and say, then, Can we have it back in our administrative records? Sorry.

So what requirements does CIPSEA impose on agencies? Inform the public, basically, that CIPSEA has to - can only really take effect here is - You are only collecting something under CIPSEA if you are adequately informing the respondents that you are going to use the information for exclusively statistical purposes and keep that information confidential, and we've got some forthcoming guidance that talks about this specifically as a CIPSEA pledge, and, of course, safeguard the information and protect it. CIPSEA is law for protecting the information. Honor that pledge that you make to respondents.

In terms of data sharing, as I said, and as many of you are well aware, the provisions are very specific. Only business data are covered. Only three designated statistical agencies are authorized for this business data sharing. So that is all that CIPSEA is itself authorizing. It is important to point out that CIPSEA is not altering existing laws that may permit other data sharing among federal agencies, but CIPSEA did not itself authorize any further data sharing than this business data sharing and - between BLS, BEA and Census.

So some implications here, which I think is - which you are most interested in that you may, again, be well aware of already.

For federal agencies that are acquiring and protecting confidential statistical information, CIPSEA may offer some new protections for those agencies that didn't have strong legislative protection already.

It does not - It specifically does not restrict or diminish any existing protections. So if the Census Bureau is gathering information under Title 13, for example, and they can only use that information for a Title 13 purpose, CIPSEA does not restrict or diminish that at all. CIPSEA does not say, Oh, but why can't you do something that - CIPSEA would let me do any kind of statistical purpose. It does not affect that, is how the lawyers, I understand, are interpreting it now.

For federal agencies providing access to confidential statistical information, CIPSEA does permit statistical agencies - remember, only statistical agencies - to designate agents, to perform exclusively statistical activities. This is not a requirement for statistical agencies to do this. It is a may. They may, if they so choose, do this.

This will require from them policies and procedures for access and control, responsibilities for providing security and employee training and these things take resources, as everyone in the room is well aware. RDCs are not cheap, any other means of - Chris Chapman, if he is still here, can tell you licenses are not free, and -

Implications for researchers, just to kind of wrap up here. One of the things I wanted to make clear was that some people had viewed the language in the law regarding agents as opening up researcher access to all data collected by statistical agencies, which it really does not do. It does provide a means for statistical agencies to designate agents, but imposes stringent requirements on those agencies and the agents to protect the confidentiality of the data.

It is important to remember, as I was saying, that CIPSEA doesn't diminish any of these existing protections, and so it is not going to remove some of the barriers that currently exist, and it also doesn't provide a right of access to federal statistical data. Researchers who obtain authorization to access the confidential data, for exclusively statistical purposes, have to share that responsibility to maintain and uphold the confidentiality of any data they access.

And, as you all are well aware, different agencies have - vary in the sensitivity of the information that they have and they may not be able to provide access to their data or all of their data or may have to do so under varying circumstances or in different ways, and so researchers seeking access to those data will have to conform to the agency requirements and respect those confidentiality provisions, even if those are more limiting and restrictive than those that they may have for - in their own institutions or they may encounter with some other data sets, but I think that was - I had to share with you this morning.

MR. LOCALIO: Thank you for your presentation.

I just have to say that one of the problems that I mentioned yesterday, if you were here, is that people refer to, The lawyers have done this, and the lawyers have done that, but the lawyers are not here, and they do not understand the problems that we are discussing. In fact, I am not sure that they care about the problems that we are discussing.

MR. HARRIS-KOJETIN: I disagree with you, and some of the lawyers we work with, but they are very, very well informed on this issue.

MR. LOCALIO: What I find, and I have reviewed some of the legislation, I have a copy of CIPSEA here that I have been carrying around with me for the last three years, and I have a copy of the statute that NCHS uses, and they are in conflict. They are vague. Some of the things are vague, and it doesn't seem there has been an effort to reconcile them.

Is there any further effort, post-CIPSEA, to say how these statutes are going to work in practice? Is there any effort to evaluate the implications of these statutes, in terms of who gets what, when, where or do they just pass it and then they say, Well, this is going to work?

What is the evaluation component of - Does OMB have an evaluation component to figure out whether this is working? And I am not talking about the second provision, the data sharing among the three agencies. I am talking about essentially the first one.

Thank you.

MR. HARRIS-KOJETIN: Very pertinent question and I don't know that the evaluation is strictly in the law, but I think that is something we do care deeply about or I do, since I am speaking for myself.

We have got guidance forthcoming on CIPSEA. Some of you may have heard me say this for the past two or three years. Nick Greenia, in the room, I think has just given up asking me when it is coming out.

It is actually coming out very soon, but you can't really trust whatever I say on that issue, since I have said it has been coming out soon now for two or three years.

So we do have some fairly lengthy, in terms of relative to the statute, like about 30 pages of guidance on implementing CIPSEA that we will be issuing that will help agencies that are using that.

We have gone through quite an intensive interagency process to help develop and inform this. Again, if you folks in the room were on an interagency team that has helped give OMB input into doing that, and so as that gets out there, as agencies are going forward, we have had a number of questions that have come up from agencies in terms of what impact it has on their programs, how they need to - how it will affect their operations. We have been dealing with those on an ongoing basis, and, again, have used that to help inform the broader guidance.

So it is an evolving process and it is something that every agency is going to struggle with a little bit in terms of what does this - what kind of changes and what kinds of things does this do for us, and we have some reporting requirements for agencies to get back to us on how they are using CIPSEA, how they are using specifically the - the statistical agencies are using the agent's provisions, and so we can evaluate and monitor this and see where things are working.

Many folks are, of course, very interested in the data-sharing portion of that and hoping that we could go back to Congress after some time, after we prove how good of a job that the three designated agencies have done sharing business data, to see if we could explicitly expand that authority, which is where I thought you might go, even though you pulled back from that.

DR. STEINWACHS: I think, just clarify that. There was a reference in there to business data, and so that was the sharing among BLS and the others of information on employers in business activities in the U.S.?

MR. HARRIS-KOJETIN: Exactly. Yes, economic data, and so the business lists between Census and any of the economic surveys that Census and BLS did.

DR. STEINWACHS: Thank you.

MR. PETSKA: Can I make a comment also?

Again, speaking from my own personal view, I wish that CIPSEA would have gone further and kept in the tax component of that, because when CIPSEA, when the early discussions of CIPSEA were unfolding with Brian's boss, Kathy Wallman, at OMB and other representatives of the federal statistical agencies, there was a lot of valid research purposes that were articulated for sharing of data, including tax data and so on, and once CIPSEA started to be formulated, it became very clear that the - one of the more controversial aspects was tax-data sharing, and the question is - I don't know if it is because of congressional committees. Clearly, there was not strong support on the Hill.

Randy Kroszner, in the Council of Economic Advisers office, was pushing this and so on, but when he left the administration, it seemed like that piece had no possibilities at all.

I spoke to him a couple of months ago at a conference in Cambridge and his comment was, CIPSEA, as it was, was the best deal we could get, that we would have liked to have expanded data sharing to more agencies, including the tax component, but it was clear that that bill would be dead on arrival, which is unfortunate.

MR. GREENIA: I am Nick Greenia from IRS and I just wanted to add a couple of things.

First of all, I wanted to address the previous question in terms of the evaluation, because the guidance that is, we hope, going to be coming out next month - Brian, is that right? - in the Federal Register -

MR. HARRIS-KOJETIN: Absolutely.

MR. GREENIA: The guidance is actually - I was on the other agency committee for that. So I can speak to that a little bit.

I think the outlook is unknown. I think the answer to your question is it is not clear, and one of the reasons I say that is because there's a lot of flexibility in terms of how agencies can safeguard the data, how they make the data accessible to researchers, and I think what may come about is, if you will, a de-facto evaluation, which is that researchers and Congress, if there is another data-sharing effort, are going to look at the experience and they are going to see that, You know what, it depends on the sensitivity of the data, in terms of what sort of safeguards are prescribed, and procedures, and - are on, and there is going to be a lot of flexibility, and there is going to be a lot of variability in terms of how the data are accessible and protected.

And I just wanted to add something to what Tom Petska said, since Connie Citro is not here, as you may know, since that report on the data-sharing workshops was released last Friday, and if you would like to get a - I am doing a little plug here for CNSTAT, of course - but if you want to get an idea of some of the difficulties, including the recommendations facing tax data for purposes of data sharing, I would highly recommend you read the article coauthored by Mark Mazur and myself on tax data and some of the many, many issues that have to go into that.

And picking up on what Tom said regarding why the tax-amendment bill did not go anywhere, as you know, the tax-amendment bill accompanied CIPSEA in July of 2002 to Congress, and CIPSEA proceeded to the floor of Congress, and the J-bill, the amendment to the tax bill, went to Joint Tax Committee, and there were a number of reasons, we think, that the tax bill foundered. Tom has put his finger on one of them, which is the leadership vacuum, but there are some other lessons that we think are valuable as well, including freezing the items in the statute itself, as opposed to allowing infinite regulations to stipulate item content in the future.

So I highly recommend you take a look at that article for tax data.

MS. TUREK: Thank you.

I have one question also.

We have been talking here about new forms of data, that would be the survey and administrative data linked together.

Has OMB begun to look at the implications for users and to look at whether or not we would need any kind of new statutory language to permit this kind of data to be shared? Because, I mean, it would not be identifiable, presumably, but there was the risk-disclosure issues.

And will OMB take a position or will it look at what should be done to help users get access to this data once it is available?

If you do a SIPP that's got a match to administrative records, will you look at whether or not we could have a public-use tape?

MR. HARRIS-KOJETIN: I thought David was going to look at whether you can have a public-use tape or not.

You'll have a public-use tape.

MS. TUREK: Thank you.

I mean, I just wondered if OMB had a role in this or -

MR. HARRIS-KOJETIN: We could have a role if we need to have a role, if it seems that that would be helpful.

Obviously, when you are bringing in the linked data, there are other things that go along with it. The example I was giving before, an agency certainly can promise confidentiality to a statistical - Well, a statistical agency can promise confidentiality to another agency, and so BLS, for example, does this all the time when they gather information from states. They take data that states may not consider confidential, but say I will use this for exclusively statistical purposes and I will keep it confidential, and once it goes back there, then BLS intermingles it with their other information, and, in essence, they have elevated the level of protection required. Just like whenever anything gets intermingled with IRS tax data, it gets elevated to that status of protection.

MR. IAMS: Could I make a comment?

I am Howard Iams from Social Security.

I really think Tom is correct and Nick is correct. You have to have legislative authority to permit a broader sharing than currently exists. The agencies do not have the authority to pass data to other agencies for use at those other agencies for statistical purposes, and this is separate from disclosure and confidentiality. The agencies just cannot pass it and cannot use it without some sort of legislative authorization, and Brian can - I don't know - perhaps disagree on some instances, but I don't think that the agencies that are interested in this can go further than what they are doing now, and a lot of the limitations that you are hearing about are created by this legislation.

Now, the disclosure raises a - Well, let me finish - My train of thought would be that if you are a CIPSEA-authorized or a CIPSEA-compatible agency, which I think I am in - We are a statistical outfit in a big administrative organization, but we just do statistics and policy research.

There ought to be, ideally, permission to share confidential agency data with such a group for them to use it however they wish for whatever purpose they want, not just Title 13, not just Title XYZ, but that it is a statistical-analysis function that might have policy implications, might not, but it is for a statistical purpose. It is not going to go and administer somebody's benefits or affect some individual's rights and whatever, as CIPSEA is defined. That would open up a whole lot of sharing that outfits could do and a whole lot further analysis with this type of information than is currently possible.

Once you bring in the disclosure, confidentiality issues, you raise a whole lot of other things, and my only comment would be, the thing that I think undermines almost any public-user file is geography.

We put a copy of our New Beneficiary Survey - It is sitting on the web. It's got two surveys. It's got earnings from our tax records. It has hospital records from the MedPAR file. It has our benefit information.

We cleared this through - with Pete Saylor's(?) help at IRS through their requirements. It meets all their confidentiality requirements. It meets Medicare's, CMS's - at that point it was HCFA. It meets SSA's. The key is there is no geography. It is a big country out there. There are a whole lot of characteristics that you could say are unique, but you really can't tell, because it is a big country out there.

If you know what state someone is in, it is over. It is not a big country. It is a small state.

Now, we have a national program. For Social Security, it doesn't matter what state or what locality you are in, but if you are dealing with TANF, you want to know the state.

I, being selfish, think that they ought to put out all this administrative data on a national file with no geography, and if you want to use geography, you should have to go to these research data centers. That is what the University of Michigan does with their Health and Retirement Survey. There are restrictions. Users can use it in University X in Alaska, Hawaii, whatever. The only place you can do geography is in Ann Arbor, Michigan. That is the price you pay to have geography with those data. If you've gotta have geography, you will never have a public file with confidential data. It will not be possible in today's age. My judgment.

MS. TUREK: I find that fascinating.

I think we ought to go to our last two speakers who are both users, and actually will be from a very different perspective, I think.

Heather Boushey is an economist with the Center for Economic and Policy Research and Dr. Deb Schrag is - I guess you are on the staff of the Memorial Sloan-Kettering Cancer Center, and so we have here an economist and a medical doctor who are both heavy data users. So I think it'll be really interesting.

Heather.

Agenda Item: Heather Boushey, Economist, Center for Economic and Policy Research

MS. BOUSHEY: Great. Thank you. Thank you so much, Joan. Thank you for inviting me to speak here today. It is a pleasure to have the opportunity to talk to you about the way that we use data.

Before I talk about the main points that I want to make, I want to tell you just a little bit about myself and my organization and what we do, because my understanding is that that is why I have been invited here to speak today, to talk about how we use the data that agencies make available.

I am an economist. I work at a think tank here in town called the Center for Economic and Policy Research. We are a very small shop. We have four economists, about a staff of 15, and we do research on economic issues facing people here in the United States.

We are very heavy users of the CPS and the - the Current Population Survey - and the SIPP, the Survey of Income and Program Participation - although, I don't know, in this audience, if I need to spell out those acronyms, but I am so used to doing it.

We make use of this data in a very timely manner, both to affect media debates and policy debates around pressing policy issues.

We work on very short time frames, and, because of that, we have spent the past five years taking both the SIPP data and the CPS data, both - we are working on the March, but we have done this for the ORG - and creating what we call Uniform Data Files.

As many of you know, if you use survey data, they do these fabulous things at Census and BLS where they'll change the name from year to year or they'll do little things that mean that you can't just write one piece of code and pull out the data from every year.

So you have to invest a lot of time if you want to know what has gone on between 1973 and today or '79 and today. If you want to have a time series, you have to sort of make this huge up-front investment.

So we do that, and we have made all of this publicly available on our website, all of our code and our uniform extracts, but we have made this investment so that when there is a debate on the Hill or when the media says something that is inaccurate about what is going on in the economy, we have a data set that is up and running, that is on our desktops, and we can then comment on it in days or weeks, rather than months or years, and I know all of you know just how complex this work is, and being able to do the work up front and have it available for timely analysis is a critical part of our mission and how we use the data.

Which is not to say that we don't have longer-term research projects. We do projects that take years, but it builds on this uniform data file and it is always with this goal of policy work.

We do not do any linking with administrative data, because it is certainly beyond the kinds of timely work that we could do.

I do have experience working with administrative data in another life, when I was at the New York City Housing Authority. So I do have some understanding of just how complex some of those issues are, but I have not matched it, just so that - I don't want any questions about that. I have never done it. Don't want to. Sounds complicated.

So being able to have data on our desktops that we can use quickly and accurately and that we have confidence in because we have already done all the background work has been critical to our success, and it is that major point that I want to relate to the main points I want to make to you here today.

We are concerned about timeliness of the data, and we are concerned about our access to it. The question that Joan said about public-use files is critical to what we need to know.

And we are also concerned about - and I don't know how germane this is to this topic, but we are concerned about maintaining access to survey data. I think - and I'll talk about that just for a few seconds at the end.

So my first two concerns about timeliness and accuracy of the privacy issues, of course, they are linked, and so the major questions that we have are will administrative data that has - survey data that has administrative data matches lead to delays in releasing the data? Will it be less timely than it is now?

And, second, will it require new security measures, kinds of like the things that have already been talked about requiring us to go to special locations or special sites to use the data, and that will significantly both delay our ability to use it, but not just delay it by days or months, but could delay it by years, because we wouldn't be able to have it sort of up and running and ready to go when we have an issue.

Now, I have not been doing this kind of work for maybe as long as many of you have, but I have heard tales from people who got their Ph.D.s in the '80s and before that it used to be the case that if you used survey data, you had to go to special computers, because you couldn't do it on your laptop, and I hear they had these things called cards and that it was very time consuming, and what I find - the point here is that I think that the way that we are able to use data now has transformed the way that we are able to engage in policy debates, both at the national level and the state level.

The fact that we have access to this data, not just on our desktops, but I have access to it on my laptop at home, I can do it on the train, that we are able to do these very complex kinds of work in a much faster way than we had been able to in the past.

I think you can - There is some correlation between that and the rise of think tanks like mine and private research organizations that are affecting policy debates, both at the state and local and national level, and I think that that is an important accomplishment that technology has given us, and we don't want to sort of move backward in any way. So I think that that is just a very critical, critical point.

To give you a couple of examples, when I have been asked to testify in front of Congress, both in the House and the Senate side, I had no more than two weeks' notice, and, in one case, I had just five days' notice.

These are very short time lines, where, if you want a specific number from your data, you need to just be able to go to your computer. You don't have time to go and to sign up and to wait.

But the second kind of concern we have about timeliness is that we often only have a few months lead time to know what the issues actually are. We spend a lot of time thinking about what kinds of policy issues are going to come up, what do we need to prepare for over the next year or two, but we may not know whether or not Congress is going to vote on minimum wage this session or next session until a few months out.

So having to go through application processes is simply just not viable for those of us engaged in policy debates. It is perfect, it is fabulous for academics, and we build on their work and use it, but we need things that are, of course, shorter.

And on this issue, I might add that, although our organization is on sort of the more progressive end of the political spectrum, we have been working very closely with the Heritage Foundation on these issues about access and timeliness, because they are just as concerned as we are. It is certainly a point that transcends, I think, political boundaries and sort of goes beyond left and right, which I think is a very important point, and especially this issue about having independent organizations be able to access government data to discuss pressing policy issues is one that we all, both on left and right, agree on.

I basically am making the same point over and over again. So I won't be that much longer here. We need access to timely data. So, hopefully, that has gotten through here.

So the final question in the set of questions that we were given ahead of time to focus on was what are the potential costs to the public from failure to take advantage of these opportunities, and I have a couple of comments on that.

First of all, one of the largest projects I am engaged in right now, we are looking at take-up or effective coverage of benefit programs in 10 states. We are doing this for advocacy purposes, for public policy, and we are doing it using the SIPP and the CPS and the National Survey of American Families.

Now, we know that each of these data sets has significant problems with how people report their benefits, that there is under-reporting of benefits. It would be absolutely fabulous to be able to have data that is matched, so that you can look at eligibility for public programs and then get the numerator to be an actual estimate of coverage. That would be a significant improvement, and, right now, we are working on this project, because, in the states, many of the state groups that we work with are very concerned about take-up of public programs, and because there is no real place that makes a lot of this accessible across a wide range of programs, because the eligibility rules, on the one hand, are so complicated, if you want to look at eligibility, you have to use survey data, and, quite frankly, the SIPP is the only survey that I have used that has enough questions to really get at the complexity of eligibility, which, of course, as Mr. Iams said, this is all at the state level, this game is all at the state level, but we really don't have a good numerator, because we don't have administrative matching. Being able to have that matched data does have - I mean, significant policy implications that we could be using right now. So that would be wonderful.
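
[Illustrative sketch, not part of the spoken remarks: the numerator problem described above can be shown with a toy calculation. The record layout, field names, and values below are hypothetical and do not come from the SIPP, the CPS, or any administrative file. The sketch only shows how a take-up rate among eligible people changes when self-reported receipt is replaced by receipt from a matched administrative record.]

```python
from dataclasses import dataclass


@dataclass
class Person:
    eligible: bool        # eligibility simulated from survey answers (hypothetical)
    survey_receipt: bool  # respondent reports receiving the benefit
    admin_receipt: bool   # matched administrative record shows receipt


def take_up_rate(people, receipt_field):
    """Share of eligible people who receive the benefit, per the chosen receipt field."""
    eligible = [p for p in people if p.eligible]
    if not eligible:
        raise ValueError("no eligible persons in the sample")
    receiving = sum(1 for p in eligible if getattr(p, receipt_field))
    return receiving / len(eligible)


# Toy records; the second person receives the benefit but does not report it.
people = [
    Person(eligible=True, survey_receipt=True, admin_receipt=True),
    Person(eligible=True, survey_receipt=False, admin_receipt=True),
    Person(eligible=True, survey_receipt=False, admin_receipt=False),
    Person(eligible=True, survey_receipt=True, admin_receipt=True),
    Person(eligible=False, survey_receipt=False, admin_receipt=False),
]

survey_rate = take_up_rate(people, "survey_receipt")
admin_rate = take_up_rate(people, "admin_receipt")
print(f"take-up from survey reports: {survey_rate:.2f}")
print(f"take-up from matched records: {admin_rate:.2f}")
```

[In this toy sample the survey-based rate is lower than the matched rate only because one recipient under-reports; real data could differ in either direction.]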

Of course, all these issues about privacy, I leave it to you all to sort that out, but we would love to have access to it.

But I do have a couple of concerns about sort of the move to the matching - particularly, and, of course, my perspective is thinking about this matching either SIPP or CPS or ACS, one of the surveys that I have used, with administrative data.

My first concern is, with my limited experience using administrative data, I think that there are concerns about accuracy and how high the - what the gold standard is.

It seems to me that we need three kinds of data. We need administrative data that tells us one thing, but there are biases and there are - obviously, there's problems and there's errors in that data as well as there is with survey data. The biases run in different directions.

We need survey data to tell us about the wide populations, the full scope, and I think we also need qualitative data to tell us some of the why questions, but that is a whole different group of people.

This question about whether or not the administrative data is always perfect, and what we are going to gain from matching and how we talk about that, especially with the public and especially with the people that we are trying to convince how important this is - I think it is important to just note some of the caveats and some of the potential problems with that, in terms of accuracy.

Second, administrative data is clearly no substitute for survey data, and this cannot come at the expense of these surveys.

Again, looking at the issue that I look at, eligibility for public benefits, this is something - and we need to know how many folks are eligible for Medicaid and who aren't receiving it or who are receiving it. The only way we can do this is through surveys that ask a ton of questions of people at a subannual, a monthly level, because this is the way that people access these programs.

I mean, and just to go off on that just for a second, one of the things we learned from our work with the SIPP - and looking at take-up - is that people are - People move up - their incomes move up and down month to month, and when they access the system is not necessarily the month that they don't have any income, because it takes months - weeks or months for them to even make it to the office or get on line or get all their papers together to receive Food Stamps or another benefit program. You need access to the survey data that provides you with those dynamics. They are not substitutes.

And then my final point, which is going back to my theme here, if the cost of matching is that we lose in terms of timeliness or the ability for the public to access public-use files, then I think that is a serious concern and one that we should spend a lot of time focusing on.

And I think, having said that message about 12 times here, hopefully, it has come through, and I think I will stop there. So thank you very much for allowing me to speak to you.

Agenda Item: Deb Schrag, Memorial Sloan-Kettering Cancer Center

DR. SCHRAG: So I am Deb Schrag. I am a physician and health-services researcher at a big cancer center in New York City, Memorial Sloan-Kettering Cancer Center, and I also appreciate the opportunity to speak to this group.

Unlike Heather, I don't do anything in days to weeks. I am more representing the perspective of academics, and we do everything on a months to years time frame.

I am going to - I guess, I have to say that I had a completely different talk when I came here yesterday, and perhaps it is being towards the end of the workshop, I revised my slides and got rid of most of the data examples and slides that showed the results of various linkage projects I have been involved in, and put in what I'll call more philosophical slides, because I think some sort of more conceptual framework - Maybe it is just - at the end of a workshop, it feels like a conceptual framework tying together all these different enormously complex issues that we have heard about for the past two days is sort of in order and maybe there'll be some discussion in that regard at the end.

Again, I represent an end user, not my institution or any agency.

So types of research questions, examples of linkage attempts, challenges that we have encountered and, of course, a wish list to add to Heather's.

So I guess I am here representing academic health services researchers, and we examine, obviously, relationships between need, demand, supply, delivery and outcomes of healthcare.

The big topics for us, I would say, over the - since - in this decade - and I think that these are going to remain front and central on people's research agendas - are disparities in healthcare, access and barriers, technology dissemination. Quality measurement is a big one, and, ultimately, efficiency of healthcare delivery. So that includes all the cost issues.

We talk about data - and I think that this is an underlying theme of many of the presentations that we have heard in this workshop - is that these data are layered.

We start out with source populations, basically, United States citizens, who have IRS data. They work. They don't work. They do save for retirement. They don't. They exist in specific geographic regions of the country, and on top of the source populations are diseased populations. I happen to work in cancer. Other people work in mental illness or psychiatric disease or malnutrition, all kinds of examples of what one considers a diseased population or a population with a health concern of interest.

On top of that are providers. Typically, these are physicians, but other - nurses, other types of healthcare providers as well, and, on top of that, are healthcare delivery units, facilities, whether they are clinics, hospitals.

The issue with federal data is that federal data is better at the bottom of the pyramid. Federal data has a lot of information about source populations and some - For example, Dr. Breen is here from the National Cancer Institute. A lot of information about populations who get cancer. It tends to be a lot less rich as you go up to the top of that pyramid and have a lot less information about providers and facilities.

So when we try to link, very often, in my experience, what health services researchers are trying to link is the rich, rich government data at the bottom of the pyramid with more granular detailed data about providers and facilities at the top of the pyramid that often resides outside the public domain, if you will, and I think we have heard allusions - you have heard references to some of these data sources from the speakers yesterday. AHA data was mentioned, AMA data, and I'll give you some examples.

We talked about evaluating the quality of healthcare. Really, we are interested in health outcomes, and the main ones - main health outcome we get out of big federal databases are typically just very basic things like who lives, who dies and who gets particular diseases. So mortality and incidence, basically.

And the inputs that we want to go - that we want to relate to health outcomes are community attributes; person attributes; health risks and behaviors, which come from the big surveys; the structure of delivery systems and the processes of care, processes of care. I would probably put Medicare data in that bucket. Medicare data is what exactly are we doing to these people - all of us - that lead to these health outcomes.

As we think about linking data, I think it is helpful to think about what the frameworks are for putting data in different buckets, and, now, obviously, some data belong in multiple buckets. Medicare data also has mortality, which is an outcome, but I think that sort of as we conceptualize these linkage exercises, it is helpful to think about in what domain the data sets belong.

The other thing that I think is helpful to think about, and I always think about when I am contemplating any sort of linkage project is where it lies along the spectrum of pure population-based data.

So federal-agency data are best for big, broad population-based analyses. So a cancer example is I want to know something about everyone in New York State with lung cancer, and I can go to the registry and Census data, but, very often, increasingly, we are interested in quasi-population-based data, where we want everyone in New York State with lung cancer who is covered by a particular private insurance provider - Oxford Insurance Plan - and so I think it is really also important when we talk about linkages to be clear. Is this true, pure population-based data? Are we trying to link federal agency or state agency data with some external data source that resides elsewhere? We need some kind of nomenclature for where those boundaries are.

And, then, of course, there is non-population-based data. It may be that - You know, my research institution is always coming to me and saying, Well, you link these data and you work with these large population data sets. We want to know why cancer patients in New York State are not all coming to get their treatment from us, since we are the best center, and you have all these data. Why can't you do that?

And I say, guys, that is marketing. That is not health-services research. That is not an appropriate way to use these data, but explaining sort of analyses at the population, the quasi-population and the non-population data, we need some kind of taxonomy, and I think simply having that taxonomy or the government help develop standard taxonomy for these types of activities would be a really helpful place to start and would educate the end-user community.

Obviously, the health services research strategy - I mean, we just all want to get our hands on as much data as we possibly can as quickly as possible, and we want to juxtapose all these various data sources.

So the kinds of things that we work on, really, I would say the focus and theme of our research is to look at what we call the implementation gap, which is the difference between clinical efficacy and effectiveness.

So most healthcare - What works in healthcare is discovered and described in these very neatly, nice-packaged little clinical trials where we take 100 people and give them blue pills and 100 people and give them red pills, and we decide that the red pills are better.

What we get out of that is efficacy. These red pills work, but that really doesn't tell us anything about what happens when we unleash red pills on the population.

When we unleash red pills on the population, we are trying to measure effectiveness. That gap between efficacy and effectiveness is what I and others call the Implementation Gap, and we are really trying to get at what the reasons are for those gaps and trying to identify important sources of variation and particularly those that we can do something about, and we want to know whether the reasons for the gap are endogenous to patients, doctors, healthcare systems, background population. So that is really the unifying theme of the research and why access to these data is so incredibly important.

We told you a little bit about this data source yesterday. Very simple example, a kind of chemotherapy given after an operation for a particular stage of colon cancer, and this is a big deal. Fifty-thousand Americans get this condition a year.

Do patients in the Medicare population, who are insured and for whom this kind of chemotherapy is covered, actually receive these treatments?

Well, we went to SEER-Medicare, and, very quickly, within a week, were able to answer the question.

Now, of course, it took a while to get the data. Once we had the data, the analyses took a week. So, again, I spend 90 percent of my time trying to get data, manage permissions and so on, and actually analyzing it is a lot less time consuming.

But we identified a very simple finding, which is that there is a very steep gradient, and that, although we treat most young Medicare beneficiaries with this kind of chemotherapy, we really don't treat the older folks.

Well, this very simple finding, made possible by linked data, really sparked a whole set of subsequent more detailed analyses to go back to physicians and patients and to conduct interview studies, to really hone in on what the reasons are that underlie this important healthcare-delivery pattern. Okay?

Now, one of the problems here is that the older patients were never included in the randomized trials, those little efficacy studies. So doctors don't know what to do. So there is uncertainty, and this is what actually happens.

Okay. So there is a real circularity where we do these population-based data analyses with linked data, and that, basically, catalyzes subsequent studies to really get at the underlying reasons.

So we have this nice linked-data set, but we really wanted to know - We said, Look, these people are not getting a kind of chemotherapy that they really ought to. What is going on? Are they just not being referred to medical oncologists who have the therapy? Are they refusing the therapy, even after they go to medical oncologists and are old people just saying, Thanks, but no thanks?

Well, to do that - Sorry. This is Census data that just shows that we also - Census data can be helpful because we see that if you are married, you are much more likely to get the right kind of treatment, chemotherapy, than if you are widowed or single, and these are all adjusted for age, and these are very, very strong findings. So having Census data can be very, very important, and we can figure out who is at risk for getting inappropriate medical care.

So why doesn't everybody get chemotherapy? Do people refuse? Do they see a medical oncologist? We could use the UPINs on CMS claims, and CMS claims have access to some specialty-code information about the types of doctors patients receive, but that data is not particularly complete, updated or accurate. Maybe Gerry can comment on it. There are other better data sources for figuring out provider characteristics.

So when we use just the specialty data, we could see that, among the patients who got chemotherapy, most people saw an oncologist; that is represented by the green bar. The top 20 percent there in the chemo bar, those people got chemotherapy, but, apparently, did not see an oncologist. All those people saw internists. Well, that is just because CMS doesn't know the difference between an oncologist and an internist, because the data is not coded well, but we wanted to get our hands on better data sources.

The people who didn't get chemotherapy, most of them made those decisions without seeing an oncologist.

So we are able to basically do these kinds of analyses to say, We, in the healthcare system, have a healthcare delivery problem. Patients are not making informed decisions not to get a treatment. They are making uninformed decisions, because they are not even going to see the relevant providers.

People look at this data and it wasn't all that compelling because they say, You can't even figure out who is the medical oncologist. So, then, we wanted to get AMA data.

It took essentially 18 months to get the AMA data, which is much more complete, to be able to do the linkage to really prove that to a higher level of satisfaction. Very complicated to do.

Ultimately, successful, and, then, the green bar went all the way up to 99 percent, and the green bar in the no-chemo group went up to about 40 percent, and our conclusion was, essentially, that the mechanism was people were not appropriately being referred.

So a wish list from an end user would be linkage of UPINs on claims data to files that describe physician characteristics.

AMA data is better than CMS data. Data from the American Board of Internal Medicine, the American College of Surgeons, all the specialty societies, is still better than AMA data, and the untapped resource is state-level data. That is the most complete and the most difficult to obtain.
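
[Illustrative sketch, not part of the spoken remarks: one way to picture the UPIN linkage on this wish list is a lookup that prefers the better source whenever it knows the physician. The source labels, UPIN values, and specialties below are hypothetical; the ranking simply follows the ordering the speaker gives (specialty-society data over AMA data over CMS claims codes). State licensure data, described as the most complete, could be added at the front of the list if it were obtainable.]

```python
# Ranked from most to least trusted, following the ordering described above.
SOURCE_PRIORITY = ["specialty_society", "ama", "cms_claims"]


def resolve_specialty(upin, sources):
    """Return (specialty, source) from the highest-ranked source that lists this UPIN."""
    for name in SOURCE_PRIORITY:
        specialty = sources.get(name, {}).get(upin)
        if specialty:
            return specialty, name
    return None, None


# Hypothetical physician files keyed by UPIN.
sources = {
    "cms_claims": {"A12345": "internal medicine", "B67890": "internal medicine"},
    "ama": {"A12345": "medical oncology"},
    "specialty_society": {},
}

# Hypothetical claims, each carrying the UPIN of the treating physician.
claims = [
    {"patient": 1, "upin": "A12345"},
    {"patient": 2, "upin": "B67890"},
    {"patient": 3, "upin": "C00000"},
]

linked = []
for claim in claims:
    specialty, source = resolve_specialty(claim["upin"], sources)
    linked.append({**claim, "specialty": specialty, "specialty_source": source})

for row in linked:
    print(row)
```

[In this toy data the first physician is coded as an internist on claims but as a medical oncologist in the richer file, which is the kind of miscoding described earlier; the third UPIN is unknown to every source and stays unresolved.]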

So I have a license to practice medicine in the State of New York. They know lots about me. They know whether I have ever committed a felony, been in jail, all the tests I have taken. They maintain it. I pay them $250 every two years to update that license.

So we don't have that data to link to federal. So with respect to health, it is really critical to know physicians, physicians' characteristics, distribution of physicians.

Next on the wish list is pharmacy claims. The analysis I showed you, we want to know were people not getting intravenous chemotherapy because they are getting oral chemotherapy? Are people not getting supportive medications? Are they sticking to their therapies? Are they getting appropriate pain control?

So the wish list would be Part D data. We heard about that yesterday. Medicaid data for pharmacy claims, and private claims data sets. There are enormous pharmacy-data clearinghouses that are very - that have not been widely linked, but are very important for health-services research.

So, again, the example I just gave you involved taking federal data set and trying to link it to external data sets that are not federal. So I think developing some kind of taxonomy or framework for what these linkages are - Are you linking federal to federal? Are you linking federal data to state data? Are you linking federal data to private data with a broad public relevance?

And I put AMA data, AHA data. Those are private data maintained by non-profit - Well, yes, AMA is a for-profit organization, but they are big organizations that have - really control, monopolies on important data sets that pertain to health that have broad relevance for many researchers in and outside of government.

And, then, there are custom data, where you have your own personal data set about the patients in a particular region or with a very specific set of disease that you have that you then want to link to.

And I think developing some kind of taxonomy to understand what the activities are would help frame development of coherent policies and rules that researchers could understand.

So an example of a study I am working on - and Gerry Riley has been extremely helpful to us here - is to look at capacity to deliver mammography in the United States.

Women in the United States, age 40 to 80, need mammograms. A lot of them are unscreened. There are big racial disparities, and the lack of available facilities, mammography screening centers, and lack of radiologists are potential explanations for suboptimal use.

So the question here is does lack of capacity explain geographic variation and racial disparities? And does capacity predict breast-cancer incidence and mortality?

Well, these kinds of analyses require geo coding. They require knowledge of where the facility is, and you can get that from FDA accreditation data. Where are the radiologists? Again, that is physician data. Where are women unscreened? BRFSS, Medicare data are informative there. And where are there high rates of breast cancer? SEER.

To do these kinds of analyses, we want data ideally at the Census tract level, but if we can't get it, we'll go less granular to the Zip Code or county level.

Trying to do a project like that and figure out where to start to go to obtain the permissions is extremely complex. Approval from one agency or many. So some kind of central clearinghouse and more clearly delineated set of procedures would help.

To get these kinds of projects done, we are really dependent on personal relationships with key individuals who sit in specific agencies.

For example, Gerry, in this case, helped us get access to the FDA accreditation data for mammography facilities, but people who don't know Gerry can't do this, and that doesn't really seem fair. Although, we are very happy we know him.

Finally, I want to talk a little bit about area versus person-level data.

Access to granular-level data helps most health-services researchers, and privacy and security concerns obviously involve less risk when you are talking about area as opposed to person-level data.

I think, again, in all these discussions, it is really clear it really would be helpful if we delineated between what we are talking about, and I think Howard was alluding to this before.

So, again, area-level data at the bottom: state, county, Zip Code, Census tract. On top of that, you often have anonymized patient data, which can be linked to unit area, and, on top of that, what is the most vulnerable is individual patient data. I think one thing to think about, in terms of linking federal-agency data sets, is to release them - again, I am not pro-public-use data - but to make them available to the research community with appropriate bars and hoops you've got to jump through: make area-level data available without making individual patient-level data available, and have discussion among the agencies of what the hoops are to get state, county, Zip-Code, Census-tract level data, with higher bars the more granular you go.

So, again, this would really help us with very repetitive common tasks that we end up performing again and again when we try and make maps and figure out where are the patients, where are the providers, where are the disparities and where are the mortality rates.

Wish list in that regard would be access to choropleth maps by various geographic units, very useful for common-data elements and Census and survey-data results. It could be a shared resource for investigators, and ArcGIS and other software packages that have really just become available in the last three or four years have catapulted the possibilities in the ease - how easy it is to do this light years ahead. I would say even in the last three years. Others might wish to disagree.

I think I am going to skip this and talk, finally, about Medicaid, and if I had to put something at the top of my wish list it would be for the federal agencies to help us figure out how to tap in more effectively to Medicaid.

I think the states just don't have the organizational capacity to get this going, but the federal government does, but the states really care. The largest component of their budget, talking about healthcare, I guess, at this workshop, Medicaid funds healthcare for the poorest, sickest members of society. It is really an untapped resource. I think CMS has done an enormous amount of work over the past decade trying to get the data into some common file structures and make it a little bit easier to work with, but it is still an untapped resource.

I think when we talk about Medicaid data, we want to distinguish between two things. One is enrollment. Who are poor people enrolled in Medicaid versus the administrative data which is what is done to those people, are they getting EKGs or chest X-rays or what units of healthcare are they actually consuming, and it may not be so complete for the latter, but very informative for the former, and I think we really have to not let the perfect be the enemy of the good when we talk about Medicaid data.

So this is just an example of a study where we tried to link cancer-registry data from the entire State of California to Medi-Cal, which is Medicaid claims for the State of California, because we wanted to know about delivery for cancer care to poor patients, and we were wondering what would the yield be from a linkage of California cancer registry data and Medi-Cal.

So what we did is we started, and you'll note on the left, with all incident cancer cases reported to the California Cancer Registry, which is very complete, and we took '98 cases, and if you look at the cervical-cancer example, there were 1,690 women diagnosed with cervical cancer in the State of California in that year. About 80 percent of them were 18 to 64 at diagnosis. We don't care about the older ones, because we have them in Medicare, right? So that is about 1,350 cases.

What proportion of them were enrolled in Medicaid? About 21 percent. If you look down at hepatoma, it is about 35 percent. So these are cancers that are associated with infectious diseases. These are important health problems, Hepatitis B and C and the HPV virus can be prevented. That links, therefore, back to health surveys.
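
[Illustrative sketch, not part of the spoken remarks: the cervical-cancer figures just quoted can be chained together as a back-of-the-envelope check. The inputs are the rounded numbers from the talk (1,690 cases, about 80 percent aged 18 to 64, about 21 percent of those enrolled in Medicaid); the resulting count of linked cases is an implied approximation, not a number the speaker reported.]

```python
diagnosed = 1690          # cervical-cancer cases in the registry year, as quoted
share_age_18_64 = 0.80    # "about 80 percent" were 18 to 64 at diagnosis
share_medicaid = 0.21     # "about 21 percent" of those were enrolled in Medicaid

under_65 = diagnosed * share_age_18_64
medicaid_linked = under_65 * share_medicaid

print(f"cases aged 18-64: about {round(under_65)}")
print(f"implied Medicaid-enrolled cases: about {round(medicaid_linked)}")
```

[The first product agrees with the "about 1,350 cases" in the talk; the second is only as precise as the two rounded percentages.]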

You know, we think that 21 percent of the population of a state who has a particular cancer or 34 percent is a meaningful fraction and that figuring out how these people are cared for and whether they are getting antecedent appropriate care and appropriate care subsequent to diagnosis is important and that these kinds of linkages could be leveraged further.

A lot of problems were encountered. So for the example of Medicaid, we had big problems figuring out the duration of Medicaid enrollment. So we had enrollment files for two years, and we looked over two years - in this case, '97 and '98 - only about half of the cohorts were enrolled for the whole 24-month period, and a lot of people were in and out, and the way these denominator files are maintained is very confusing.

We had about 74 percent of patients who were enrolled during the month of diagnosis. Some were first enrolled after and some before.
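
[Illustrative sketch, not part of the spoken remarks: the two enrollment measures just described, continuous enrollment over a 24-month window and enrollment during the month of diagnosis, can be computed from month-by-month enrollment indicators. The layout below (a set of enrolled month indexes per person) and all values are hypothetical; actual Medicaid denominator files are structured differently, which is part of the difficulty the speaker describes.]

```python
def enrollment_summary(enrolled_months, diagnosis_month, window):
    """Summarize one person's enrollment.

    enrolled_months: set of 0-based month indexes with Medicaid enrollment.
    diagnosis_month: 0-based month index of the cancer diagnosis.
    window: iterable of month indexes that defines the study period.
    """
    months = set(window)
    covered = enrolled_months & months
    return {
        "continuous": covered == months,
        "enrolled_at_diagnosis": diagnosis_month in enrolled_months,
        "months_enrolled": len(covered),
    }


WINDOW = range(24)  # two calendar years, month 0 through month 23

# Toy cohort: (enrolled months, month of diagnosis) per person.
cohort = {
    "p1": (set(range(24)), 10),                         # enrolled throughout
    "p2": (set(range(12, 24)), 12),                     # enrolls in the month of diagnosis
    "p3": (set(range(0, 6)) | set(range(15, 24)), 9),   # in and out; gap at diagnosis
    "p4": (set(range(24)), 3),                          # enrolled throughout
}

summaries = {pid: enrollment_summary(m, dx, WINDOW) for pid, (m, dx) in cohort.items()}
share_continuous = sum(s["continuous"] for s in summaries.values()) / len(summaries)
share_at_dx = sum(s["enrolled_at_diagnosis"] for s in summaries.values()) / len(summaries)

print(f"continuously enrolled for 24 months: {share_continuous:.2f}")
print(f"enrolled in month of diagnosis: {share_at_dx:.2f}")
```

[The toy shares are made up for the example and should not be read as the study's results.]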

When we looked inside the claims to see how often a diagnostic and procedure code inside the claim corroborated the diagnosis that we found in the California Cancer Registry, the answer was about 80 percent of the time. We had access to a year of data. So if we looked over the whole year, it was about 70 percent. If we restricted the analysis to people diagnosed in the first half of the year, it was 80 percent, because it can take a while for these codes to catch up and appear in the claims.

Is this perfect? No. Is this useful? Absolutely.

So SEER-Medicaid, we attempted a link in California. It took us two years more to obtain the data sets.

The denominator file structure limits our ability to identify cohorts of the chronically poor easily. Challenges are retroactive enrollment, chronic versus episodic poverty, spend-downs - that is, people for whom illness precipitates enrollment - variation in state thresholds in generosity when we try to do this with other states, and I won't go into it, but, believe me, we are trying, and definition of an HMO or managed care. Just some of the challenges.

But we think that sort of a coordinated federal approach will help the states, and some of the states would undoubtedly be receptive. This is only going to get done if federal agencies get involved.

So our wish list would be consistent definitions in the Medicaid enrollment files (What does managed care mean? When are claims itemized? When aren't they?); linkages of Medicaid data files to state discharge abstracts (most of the states maintain hospital discharge registries, and you can't link that information); geocoding of where Medicaid beneficiaries reside; linkage to pharmacy data; and linkage to Census tract and socioeconomic variables.

Priorities. Coordination of procedures for obtaining access to data and the review process - and I am not in favor of public use, but a series of hoops and whether you have to actually go sit in Ann Arbor or you just have to describe how secure your computers are, whatever. There need to be various stages and procedures.

Standardization of reporting rules. So SEER, which I work with a lot, has - SEER-Medicare - you can't put in any cell N less than or equal to five to protect patient privacy, but other agencies have different rules, with a threshold of less than 10. Some standardization, and, I guess, harmonization would be helpful.
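
[Illustrative sketch, not part of the spoken remarks: the two reporting rules mentioned above differ only in the cutoff and in whether the cutoff itself is suppressed, so a single helper can express both. The function name, the county labels, the counts, and the "*" marker are hypothetical, and the thresholds are taken from the speaker's description rather than from any agency's published policy. As written, zero counts are suppressed as well, which a real rule might treat differently.]

```python
def suppress_small_cells(counts, threshold, inclusive=True, marker="*"):
    """Replace small counts with a marker before a table is reported.

    inclusive=True suppresses counts <= threshold; inclusive=False suppresses
    counts < threshold.
    """
    def too_small(n):
        return n <= threshold if inclusive else n < threshold

    return {cell: (marker if too_small(n) else n) for cell, n in counts.items()}


# Toy table of case counts by area.
counts = {"county A": 3, "county B": 5, "county C": 9, "county D": 42}

rule_five = suppress_small_cells(counts, threshold=5, inclusive=True)   # N <= 5 hidden
rule_ten = suppress_small_cells(counts, threshold=10, inclusive=False)  # N < 10 hidden

print(rule_five)
print(rule_ten)
```

[The same table yields different publishable cells under the two rules, which is the harmonization problem being raised.]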

Central clearinghouse.

Anything the federal government can do to help us federate state data would be fantastic.

Making choropleth maps available for common tasks that we do again and again in a common-base-type system, and working with the states to fulfill analyses of Medicaid enrollment and claims files. Those really would be an end-user's wish list.

Very ambitious. You can ask, right? So that's - I'll stop there.

MR. LOCALIO: Thank you both for comments. I certainly can relate to both of you in your tasks.

Heather, I just want to say to you that your particular point about having access to data and why you need access to data is something we do understand. It is something that I have raised previously at our committee meetings and our subcommittee meetings, although, in terms that are somewhat more colorful.

And I do want to stress a point that you made, that we have been talking about technical issues of access. We have been talking about protecting privacy, but the point that you brought up is there is the other issue, that it is important for organizations, other than government, to have access to data, because there are opinions other than the government's opinion about data. Let the data speak for themselves.

Unfortunately, people who work in government are not always free to say what they want and report what they want, and the example that I have raised at committee meetings before has to do with HHS, has to do with Medicare Part D, has to do with an analyst named Foster, who wanted to release information about the cost of Medicare Part D, and he was told by Mr. Scully that he was going to be fired if he did. So he did not release those data. That is well known. I got that information from the press.

Now, I, on the other hand, if I had that information, I would have just said, this is it, and nobody would have done anything to my job. In fact, I am encouraged to report things like that if they are of interest.

Now, you may have a particular point of view in your organization, but there are others that you mentioned on the opposite end of the political spectrum that have their points of view, but I think we have to stress that it is important in this entire discussion to know that the data need to be told. They need to be told, and even though we have to bear in mind it is not just researchers getting access to data. It is people getting access to data, so that they can let the data speak, and I just want to emphasize that do not feel that your point has not already been raised or has not been heard in this committee.

Thank you.

DR. STEINWACHS: Probably my education, Deb, choropleth?

DR. SCHRAG: Oh -

DR. STEINWACHS: I thought this might have some surgical, medical procedure, and then I decided, no, it didn't quite sound like it, because -

DR. SCHRAG: No. It is the technical term for - You have all seen them. You see them in the newspaper all the time. They are, essentially - I wish I could explain the derivation of the word. Unfortunately, I can't.

Essentially, they are those maps that you look at that have, typically - they can get very fancy, but they typically have population density. So, for example, you might take all the zip codes in the United States and rank them by anything. It could be number of graduates from high school. It could be number of foreign-born persons in the Zip, the Census-level data. Right? So you essentially - and you can get fancy, so you can look at the relationship between the number of foreign-born persons and the incidence of stomach cancer. Can look at the relationship between not speaking English in the home and mortality from a particular cancer, engagement in a particular - Right?

And, sometimes, the maps are shaded, you know. So, typically, they are shaded red to blue, and then they have dots on them. Those are choropleth maps, and standard data files are really helpful to make them.
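The ranking-and-shading step described above can be sketched as a quantile classification: rank the areas by some measure, then cut the ranking into a few equal-sized groups, one per shade. This is only an illustration; the function name is hypothetical, the percentages attached to the ZIP codes are made up, and drawing the actual map would need boundary files and a mapping library that are not shown.

```python
from typing import Dict


def quantile_classes(values: Dict[str, float], n_classes: int) -> Dict[str, int]:
    """Rank areas by value and split the ranking into n_classes roughly equal groups.

    Class 0 is the lowest group; class n_classes - 1 is the highest.
    """
    ranked = sorted(values, key=lambda area: values[area])
    return {area: (rank * n_classes) // len(ranked) for rank, area in enumerate(ranked)}


# Made-up percent foreign-born by ZIP code, for illustration only.
pct_foreign_born = {
    "10001": 31.0,
    "10002": 44.5,
    "10003": 18.2,
    "10004": 22.9,
    "10005": 27.3,
    "10006": 12.4,
}

# Three shade classes, lowest to highest.
shade = quantile_classes(pct_foreign_born, n_classes=3)
print(shade)
```

Each class number would then be mapped to one color on the red-to-blue scale.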

DR. STEINWACHS: Helping my education.

Let me take you to another thing. You raised the idea of having data clearinghouses, and there is a kind of function in the private sector that many times corporations use where their claims data, pharmacy data and so on go into a place that standardizes it, makes it into something that is analyzable. CMS, in the past, did some things with Medicaid data, in the old days -

DR. SCHRAG: Resdac. Resdac. We work with Resdac, which is a research data clearinghouse center. Maybe Gerry could probably talk about it, but they don't have everything that we want. Okay? We love Resdac. They are on our speed dial, but - We know them by name. They know us, but there is a lot that they don't have. They are chronically under-funded, et cetera, et cetera.

DR. STEINWACHS: So I guess there were - and you are getting at it - sort of two questions in here.

One is what would you see a clearinghouse doing? And the reason I was - Sometimes, it is just you have the data there. The other is you actually make the data more useable, and so one of the issues across the states on Medicaid data - and you were pointing this out and so on - is that -

DR. SCHRAG: What would you do?

DR. STEINWACHS: - make it more useable for researchers, there may be a variety of things you would do that is not necessary for the state.

So maybe getting both of you to sort of comment, what would a clearinghouse do, and are there some examples that you think are ones that could be looked at as exemplary?

DR. SCHRAG: So I'll give you a specific example of the kinds of things Resdac, which already does a good job, could do even better if they had a broader mandate - I think they just need a broader mandate - is they could say, If you want access to the Medicaid enrollment files, these are the hoops you want to - you need to jump through. If you - an anonymized version of that. If you want the actual unencrypted data file, you need to jump through a few more hoops. If you want the enrollment files plus the claims files to figure out who had obesity surgery, you need a few more hoops. Really laying that out in a very clear way.

So one hoop just to get the enrollment file. That way, I can figure out how many poor people there are in a particular area. Maybe that is all I need.

A few more hoops, a little bit more difficult, and I do think you need to set up barriers in researchers' ways, so that they are really clear about what they are going to do with the data and what they need it for.

If you make it too easy for me to get it, I'll just say, Give me everything, when I often don't need everything. Sometimes, I just need little bits of it. So they can really help researchers figure out what it is that they need at what level and setting up a progressive gradient of barriers.

And Resdac does some of it, but I think that they could do more. Gerry probably -

MR. RILEY: (Off mike) - that Dave Gibson talked about yesterday, and they are trying to get the data assembled into a format that is easier for researchers to use, to pre-identify people with certain kinds of conditions - searching through the claims forever to sort of identify people with common conditions and things like that. So that might be one example of where this data base will go beyond what Resdac normally does for people.

As far as the Medicaid data goes, there has been a great deal of effort put into just trying to get very basic - you know. So I haven't worked with Medicaid myself, but there's been a lot of work in our office, particularly by Dave Bah(?) and other people to try and just get consistent measures of enrollment and claims data and so forth, and they made a great deal of progress, but I think the data are just starting to be used now on a much more wide-scale basis than they have in the past. Can't speak too much about that.

MS. BOUSHEY: I would just add one quick comment. I mean, I think, in terms of many of the surveys that we use, using the administrative data capacity to correct and amend many of the program participation elements would be wonderful, and I could imagine you could do that without - one could imagine just doing that to the public-use files, rather than - and making those available in the same way that you do now, rather than - and sort of eliminating that step, so that researchers didn't have to, which then means we never have to see the Social Security numbers or whatever.

MR. PETSKA: Can I just comment on that?

I thought some of that is done already using the tax data to edit the CPS and the SIPP, which would go into public-use files and so on. I believe that has already been done.

DR. STEUERLE: Just a quick comment and a question. Now, my 2-1/2 minutes is down to 1-1/2.

Heather, I just want to say I - like Russell, I fully identify with your statements about timeliness, particularly in the policy process. I mean, so much research is oriented towards developing the status of things two years ago or three years ago when Congress is constantly in the midst of making changes that have enormous impacts, minor - not a minor example - an example being the recent drug benefit that spent over - depending on how you do your present value calculations, over $1 trillion, and often with almost no data input relative to even some of the other things research.

But I guess there is one gap in our data that has always bothered me for a long time. Deb, you are the only - really the only speaker who has referred to it, and so I am going to put this question to you and you may not have an answer. You can tell us afterwards, that is to do with trying to integrate in the provider data.

Because my background is largely in areas like budget, I know the way or at least I study a lot the way that economic systems work, and I know that part of what goes on has to do with the cost of these systems, and part of what you are talking about in the way some parts of the country provide benefits, some don't, largely relate to cost, and, sometimes, even the incomes within those particular geographic communities.

But an important side of this is, if we don't get at some of who is getting these benefits, these costs, we are not very far in our - While we do develop modest data on expenditures and what we are buying, we develop almost no data on who is getting the money.

So, for instance, HHS doesn't even do what I would call simple - not so simple - quite a few people who do it - an input-output analysis, so when costs rise by 10 percent, we know who is getting it. Is it 20 percent more to doctors or do we have 10 percent more doctors and 20 percent more practical nurses, and how much of it is going for administrative costs? How much is going to the pharmaceutical -

And we don't even have that, and you are the only one who really mentioned the provider data, and I am just curious whether you have any suggestions for ways of getting at some of the cost side of this equation by looking at the provider data.

DR. SCHRAG: I think you can get at the cost side, because you have the UPINs and you know which UPINs are doing what to whom. So I think that you can get that.

If you want to look at things like physician income, that you need separate survey data, but, procedurally, in terms of access to data, the biggest thing that I think needs to be fixed and doesn't seem to me to be that complicated to fix is that detailed information about providers is not within the government - the government doesn't have that organized at all.

The AMA makes, I think, $50 million, some humongous number, and most of it is sold to private companies, but they sell the AMA database, and people use that AMA database, which profiles all physicians in the United States, and they sell it again and again and again.

The government needs to do it itself. The states know who is a physician and what their characteristics are, but the federal government doesn't - I don't know. It doesn't suck up that information from the states. Maybe there are complicated reasons why not, but that would really help a lot, and that would be a good place to start, and I think it would help the states, because they have interests in these kinds of data for fraud and other things that they really care about, and cost-related issues.

DR. SCANLON: We don't know enough about providers. We don't know enough about cost, but we do know an awful lot about both. I mean, and maybe the analysis doesn't get presented sort of widely enough, but it is known. I mean - what Russell mentioned, in terms of Rick Foster, the actuary's office really does a lot of work on the issue of cost, and we do know sort of where Medicare costs are going. We do have the Medicare Payment Advisory Commission, which is looking sort of at the cost reports that the providers file, and, in fact, this, in some respects is much better data than what the private sector will have.

You do not want to know what VHA response rate to particular items on their surveys are, okay? As opposed to, if you are a Medicare-participating hospital you have to turn in your clinical report.

This is the kind of thing, I think, where we need to move incredibly sort of forward in terms of improving the comprehensiveness and the quality of the data we have, but it is not that we have been sort of completely sort of static here or sort of - or have ignored the problems. We really - we know a lot.

There is a large group sort of, over at GAO. There is a large group sort of in the Office of the Actuary. There is a group sort of at MEDPAC. That are all doing these kinds of analyses of exactly what I think you are talking about.

And the idea of using proprietary data, these organizations don't necessarily want to share sort of the data at the level at which we would like to have to be able to assess their quality and be able to be confident about them.

One of the things that I worry about here - and this is a democracy, and, as Russell said, we would like to sort of have the information out there.

Information can be used for good and bad purposes - okay? - and I think one of the things that people within federal agencies probably would think about in terms of release of data -- is this going to potentially cause harm, because it is not very good data? Would they release data where there was only a 30-percent response rate on their survey? That would be astounding.

When I was at GAO, our response rates, the requirements were that we would be in the 60, 70, 80 percent range before we would use a number. That is not true of data that are coming out of private surveys. Private surveys, they are happy to be able to say, We did a survey. Here are the results. Okay? And, then, they can turn around and sell those because there is a market for them.

DR. SCHRAG: But the point is is that the states - I completely agree with you that the private data - The AMA data are terrible. Absolutely. Nobody should use it. The states have good data. It is just not accessible.

DR. SCANLON: You are right. It is not accessible, and it is potentially not uniform. I mean, that is the other key thing about starting down the path of saying we are going to go to the states. We've got the 50 states plus the District of Columbia. If we are talking about Medicaid, we've also got five territories that run programs. Try and assemble a consistent database from those. It is incredibly challenging.

MS. TUREK: That is easy compared to TANF, or, in some states, the data is collected at the county level.

MS. BOUSHEY: Yes, or childcare subsidies.

MS. TUREK: What?

MS. BOUSHEY: Childcare issues, county level.

MS. TUREK: Thank you all very much. I imagine after our last session we can continue talking about users' needs forever.

Anyway, for now, everybody go and have a good lunch.

(Whereupon, the workshop recessed for lunch.)


A F T E R N O O N S E S S I O N (1:18 P.M.)

Agenda Item: A Broader Perspective on the Role of Linkages

Fritz Scheuren, Vice President for Statistics, NORC and 2005 President of ASA

DR. SCHEUREN: I am going to talk about the past because I am an advocate of Deming, and Deming says you should only talk about something you know about. Well, I don't know about the future, and I probably will know it when I see it, but then it'll be the past. So I will talk a little bit about the past.

This is a very interesting subject, record linkage. Deeply connected to health, quite expanded as these days were on other topics. Very deeply important subject, and so let's see how people thought about it 40, 50, 60 years ago or more. I call it the Once and Future King. So you may have seen that line before. I don't know who - Somebody may have used that.

The Book of Life. This is a concept that we are putting together a book of people's lives. That is what we do with linkage. It could be contemporaneously this way, but it also could be this way, and all of those variations in dimensions have occurred in these meetings. Great idea.

Started out - Dunn, I think is the one who used this phrase. Started out right around the ‘30s and ‘40s that this phrase began to appear, and just as we had the ability to do the kind of large-scale record linkage that we all now do.

I am going to put this one up. This is the Chalk River Nuclear Power Plant in Canada. We were talking about Gil Beebe a little while ago, with Nancy. Gil was an advocate of record linkage, because he's an epidemiologist. This is really where things started, and Howard Newcombe, who did an awful lot of the fundamental work on linkage, worked at Chalk River and it was all epidemiological.

Social Security did a massive amount of linkages, epidemiological linkages, to look for various carcinogens in various industrial processes. A massive amount of work done. Great stuff. Not talked about anymore. Great stuff. Still can be done, I believe.

Most of our problems of that kind have been seen and acted on. We don't have the kind of problems they have in China when they do a cancer map in China. They find that most of the cancer problems in China have to do with differences in the way food is prepared in China. So we don't have that problem. We all eat McDonald's, which is to say we all have a higher level, but -

The theory showed up with Ivan Fellegi. Howard and Ivan are Canadians. It is not an accident that I am going to talk about Canada. An awful lot of the best work on record linkage has been done at Statistics Canada. Fantastic tradition. Ivan is still Chief Statistician at Statistics Canada. Great man.

And we ran a conference in the names of these two people here in Washington in 1997 on linkage, an international conference.

We had done an earlier conference, which I think is the - one of the things that Jean associates with me. I was organizing both these conferences, was in ‘85 - which is more a local conference on record linkage.

Both of those are at the FCSM website, but I think they are huge PDFs. You have to really be serious if you want them, but there are people here - Listen, at this time of the day, on the second day, I figured there's six people here. I was wrong. Now, don't all leave. Wait for him.

This wonderful piece of work by Marks - that is the same Marks. Karol Krotki and Bill Seltzer. Bill Seltzer is still with us. He was the Chief Statistician at the UN for a long time. Now, he does a lot of important work in other areas, including human rights.

And here is a tremendous book, Bishop, Fienberg and Holland. These are all contingency-table books. They are really valuable, and they are worth knowing about in order to understand the error patterns in the data source you are using. If you do not understand the error patterns - non-sampling error patterns - you really haven't grasped the whole thing.

When you are doing linkage, you actually really can improve considerably the quality of your data in many, many ways, and that has been talked about quite well here, but there is no free lunch, not that there are many economists in the room. Gene took his badge and put it away today. He is not an economist today, right, Gene?

One of the things you can do with these systems is, if you have three or more, you can do multiple systems estimation. If you have two, you are doing dual systems estimation. This is the traditional pattern that has been used around the world in evaluating censuses. It is very model dependent, and the bureau has moved carefully away from that towards multiple systems, okay? Very carefully. I wish they went a little bit faster, but, anyway, very important work.
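For readers unfamiliar with dual systems estimation, the textbook two-list form is the Lincoln-Petersen estimator: if the first list has n1 records, the second has n2, and m records are linked in both, the estimated population is n1 times n2 divided by m. The sketch below uses made-up counts and a hypothetical function name. It rests on the assumption that the two lists capture people independently, which is the kind of model dependence the speaker refers to; with three or more lists that assumption can be relaxed using log-linear models, which are not shown here.

```python
def dual_system_estimate(n_first: int, n_second: int, n_matched: int) -> float:
    """Lincoln-Petersen estimate of total population size from two linked lists."""
    if n_matched <= 0:
        raise ValueError("need at least one matched record to form the estimate")
    return n_first * n_second / n_matched


# Made-up counts: 900 people on the first list, 800 on the second, 720 linked in both.
n_hat = dual_system_estimate(900, 800, 720)

# People seen on at least one list, and the estimated number missed by both.
seen_on_either = 900 + 800 - 720
missed_by_both = n_hat - seen_on_either

print(n_hat, seen_on_either, missed_by_both)
```

The quality of the estimate depends directly on the quality of the linkage, since false matches inflate m and missed matches deflate it.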

What about content - A lot of work has been talked about here about that. I am going to give some names to you. This is Mitsuo Ono, who used to be the head of the Income Branch at the Census Bureau. He did some very important early work on matching income-tax returns to the CPS, okay. I think it was the 1970 CPS. Yes. I think it was - Maybe it was ‘72. I thought it was the ‘70 census.

Dorothy Rice. If you don't know Dorothy Rice, you better leave right now. Just leave right now. Major hero of mine. She used to be at Social Security. That is when I used to work for her, and then she was the head of the National Center for Health Statistics. She is a tremendous individual. A great force.

I interviewed her a year ago. The article appears in the September issue of AMSTAT(?) a year ago. Not this - well, it is two years ago now. This is September. Two years ago. Worth reading. I interviewed her, but listen to what she said. Don't pay attention to what I said.

Joe Steinberg is the person who started the linkages at Social Security that we have been talking about the last few days, including getting the Social Security number question on the October 1962 Current Population Survey. That is when it was first put on the survey. Of course, it is not on there anymore. It was taken off. He did a great deal of work, very good work. I came on after him and tried to finish what he did. He went on to be Assistant Commissioner at the BLS.

And Ben Bridges was my boss at Social Security. I needed to put his name on here because he used to invite me to his house a lot. So good guy. Really good guy. Well, he's a really good guy, and he had the patience to read everything I wrote and fix it. Since then, it has been bad.

Then we ended up producing a whole series of products called the Interagency Data Linkage Series. Very dated now, but full of mathematical and statistical ideas that are still valuable and have not been recaptured anywhere else. I am sort of proud of that, pleased with it.

There are some other things that happened as a result of that work.

I am going a little too slowly. I can tell that from the three people in the back who have woken up now and said, Where is the next speaker.

One of the key champions of augmenting survey data with administrative records is Gene Steuerle who is here. Okay?

Another one is Howard - Howard Iams. I hope you listened to what he said yesterday. Dead on. Dead on. Dead on.

And Julia Lane. I happened to come in - I know Julia from other worlds, when she tried to recapture the essence of the Continuous Work History Sample state by state. Amazing individual. She gave a great talk yesterday.

Two more things. Gene Rogot, whom you don't know. He was at the National -

PARTICIPANT: (Off mike).

DR. SCHEUREN: Pardon me?

PARTICIPANT: (Off mike).

DR. SCHEUREN: Yes, he convinced people, NCHS and the Census Bureau, to match the CPS's to the National Death Index. I asked yesterday if that was continuing. It is continuing. Okay? That is a really interesting process, and that has been published. You published it, didn't you? It was published. A Million Deaths. It was a publication a few years ago.

It is a marvelous piece of work, because you can look at social-economic differentials in mortality with that, which is my interest in mortality, by the way, social-economic differentials. I am not going to go down that road today, because that is just too much fun.

And Joe Pechman, a lot of people know Joe here, at least Gene does. The idea of - matching data when you can't do it exactly is a valuable idea. It is a heuristic that I urge you to use with great caution. I have written considerably - I know. I have written considerably on its weaknesses. Okay? It is not always weak, but if you are desperate, do it. If you are not desperate, wait for the real thing.

Let's talk about optimizing of systems. That is - What we have been doing is taking the existing systems and making them better, but what if we were optimizing systems, what would we do?

Well, this is not a real long list. What we want to do is we want to prevent the survey errors from occurring - all right? - to begin with, if we can. Very hard in a day when non-response - item and unit non-response are so high. We want to build in a detection system so we know there is an error and we need to fix them - repair them, and, of course, one of the best ways to do those - the last step is to replace them with data from a better source. So the linkage is an enormously important quality-improvement step.

Now, it was said by somebody this morning the quality of the thing you link to may not be perfect, okay? It certainly wouldn't be, anything I have ever done.

One of the things that's going on is that there is a tradeoff between response variance and response bias. Administrative records are typically biased. They are not measuring the right thing, and if you think it is something that it isn't, you've got a bias, even if it was perfect - okay? - but they do get rid of a lot of the response variance, which is very characteristic of surveys.

If you have ever done any linkage yourselves and you have compared essentially equivalent - never exactly - equivalent concept from an administrative source or an operating source in a more general world, which is the one I am in now, with a survey, you see this enormous variation in the survey results. Rounding errors. All kinds of things going on in the data. Maybe the right signal is in there, but an awful lot of noise.
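The tradeoff between response variance and response bias can be made concrete with the usual decomposition of mean squared error into squared bias plus variance. The sketch below is an illustration only: the numbers are made up, the function name is hypothetical, and it treats each list as repeated reports of one true value, which is a simplification of how these quantities are defined in survey methodology.

```python
from statistics import mean, pvariance


def error_summary(reports, true_value):
    """Summarize bias, variance, and mean squared error of reports around a true value."""
    bias = mean(reports) - true_value
    variance = pvariance(reports)
    return {"bias": bias, "variance": variance, "mse": bias ** 2 + variance}


true_income = 40000

# Made-up survey reports: rounded and noisy, but centered on the true value.
survey_reports = [30000, 35000, 40000, 45000, 50000]

# Made-up administrative values: stable, but measuring a slightly different concept.
admin_reports = [38000, 38000, 38000, 38000, 38000]

survey = error_summary(survey_reports, true_income)
admin = error_summary(admin_reports, true_income)

print(survey)
print(admin)
```

In this made-up case the administrative source has the smaller total error despite its bias, but the comparison can go either way depending on how far the administrative concept is from the one being measured.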

Playing with Matches is a book. This is a plug. Three people in the back, the one who has fallen asleep, you know, you should wake up now, because you have to buy this book when it comes out. It'll come out next year. Okay?

It is about data quality. It talks about all the traditional ways we have looked at data quality. Everyone in this room who has done - handled data has used these techniques in various ways, and it talks about linkage, all of the aspects of linkage, most of which have not been talked about today, but, fundamentally, linkage, in my opinion, is really to replace one data source with a better one. Okay?

If you want to study error patterns, that is good. That was done a lot. I don't think that got us very far, frankly. What got us far was replacing bad data with good data or better data. There is no good data.

Okay. A couple of more slides. I guess I got four. Three. I'll make it three.

Privacy and confidentiality. One of the big problems if you are in an administrative agency or a statistical agency is you really don't understand the language the same way. There is a conflict of principles, really, between the two.

If you are at the Census Bureau or NCHS, you are a part of a culture of confidentiality. If you are at the IRS or Social Security - Social Security is a bit on both sides - you have the culture of privacy. You focus on the privacy of the person. You are - that person's data is sacred to you.

Those two values do have an intersection, but it is sometimes very hard to find, depending on the setting you are in.

I think more work like this meeting, more joint work, designed work for joint goals can help deal with some of this, but it has existed for all the decades that I know about and have heard about in all of these different processes, and I don't think it is going to go away any time soon.

I want to make a comment about - I am using an industry coding example. The statistical agencies say to the administrative agencies, We can't give you back the data after we clean it, okay? Well, because we would violate the trust we made, and so I think, though the statistical agencies need to look at the point of intervention where they get the data, and look at whether they could get the data at a different point and thereby aid the administrative agency - and the industry coding example, which I am not going to cover, is a perfect example of a great deal of waste with Census coding things and the IRS coding things and the BLS coding things and the states coding things - okay? - that are none of them done well. Okay? All of them might be done better, if we were to fix the system.

Legal and bureaucratic. Lot of discussion about law and practice links. You are all discouraged by it. Get a lawyer in the room. I mean, get some lawyers here. Really - need to do this again, and experiment, and I believe in the need to do continuous measurement of what is going on.

That is one of the reasons I really like this group. I didn't realize that you don't meet as much as you might, and, of course, I hope - I mean, there's great ideas here these last two days. I hope you are stealing each other's practice, best practice. Don't steal each other's worst practice. Steal each other's best practice. I don't think I have to tell you that, but, sometimes, you are not necessarily sure you know where it is.

One of the things that NCHS does, which I absolutely think is fantastic, is they have an IRB. All right? Every agency here who does linkage should have an IRB. If they don't, there is a real issue there. Okay? I really, really think that is what we should do, and I am not going to name names, because I know some of the agencies here who don't, but they should. It is very important that you be held accountable by your peers, okay. Should be held accountable by other stakeholders, too, but by your peers, because your peers can help you fix some problems.

I have been subjected to IRBs in lots of settings, in private settings. Somebody was talking about using the National Survey of America's Families this morning, which is a survey I worked on for a couple of years. Doesn't do any linkage, of course. One of the problems with it.

Let's talk about learning linkages. Let's think about - it as a learning system. We have been doing it forever, but we haven't thought about it as a learning system. We could have. Just didn't.

We need to continue these conferences. Fundamentally, keep talking. Keep listening. Collect and publish a summary. I don't mean 100 pages. Forget that. Two pages. Five key points. Okay? Four contacts with people. Okay? Really.

If you got into somebody's remarks the last two days, talk to them on the phone. Get going, okay? Only do the things you are interested in. Don't do anything else. Doesn't matter. There's enough interest in this room so a lot of good things will happen.

I have said that.

I want to see diagnostics developed for linkage. I am a big fan of diagnostics. I have learned a lot about diagnostics from regression. Many of you who are economists do regression, log-linear, logistic regression, if you are in epidemiology world or standard regression if you are in some of the other worlds in this room.

Build diagnostics. We need to do this. This is very important. Not hard. Fun, actually, and then you can get rid of some of the other errors in here and get some of the misleading things out.
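The speaker does not say which diagnostics he has in mind. One simple pair, offered here only as an assumption about what such diagnostics might look like, compares the links a matching procedure declares against a set of links known to be true (for example, from a clerically reviewed sample) and reports the false match rate and the missed match rate. All names and record identifiers below are hypothetical.

```python
from typing import Dict, Set, Tuple

Link = Tuple[str, str]  # (survey record id, administrative record id)


def linkage_diagnostics(declared: Set[Link], truth: Set[Link]) -> Dict[str, float]:
    """Compare declared links with known true links and report two error rates."""
    if not declared or not truth:
        raise ValueError("both link sets must be non-empty")
    correct = len(declared & truth)
    return {
        # Share of declared links that are wrong.
        "false_match_rate": (len(declared) - correct) / len(declared),
        # Share of true links that the procedure failed to find.
        "missed_match_rate": (len(truth) - correct) / len(truth),
    }


# Hypothetical reviewed sample: one wrong link (s3) and one link never made (s5).
declared = {("s1", "a1"), ("s2", "a2"), ("s3", "a9"), ("s4", "a4")}
truth = {("s1", "a1"), ("s2", "a2"), ("s3", "a3"), ("s4", "a4"), ("s5", "a5")}

diag = linkage_diagnostics(declared, truth)
print(diag)
```

Tracking these two rates over time, or by subgroup, is one way to see where a linkage is introducing error into later analyses.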

And here is one that you won't agree with. I put match in first, because I wanted you to find people at equivalent levels and swap staff. Two, three months. Right? As long as you can stand it in another agency. It is fundamental, really fundamental. We are not learning fast enough. This is a shame. All these smart people and we are not - We are all living in our various stovepipes - okay? - smoking something. No. No. No. Wrong generation. You never did that, did you?

Okay. What's happened here? Somebody has taken over here my computer. They say, End of show, here. That is what they told me, End of show. I am almost done.

Thanks for the memories. I have gone back to something I used to do, and some of it I still do, and best of fun on our road. I am saying - including myself - our road ahead.

Thank you.

DR. STEINWACHS: Thank you very much.

DR. SCHEUREN: You are welcome. Sorry for too much past.

DR. STEINWACHS: So, Mike, did Fritz set the stage for you?

Agenda Item: Michael Davern, Assistant Professor, University of Minnesota

DR. DAVERN: He certainly did. Yes, it is very hard to follow Fritz, of course. Now, everybody can really go back to sleep back there. Right.

He had most of what I had to say here. I'll give some examples, I suppose, or some of what I had to say. I don't really know the names or the history quite as well. I almost left the room, unfortunately, when he said, If you don't know this person, why don't you leave the room?

DR. SCHEUREN: Dorothy Rice?

DR. DAVERN: Yes, I know -

DR. SCHEUREN: You know Dorothy.

DR. DAVERN: So, first of all, I would like to thank Joan for inviting me here to be a part of this. I think it is really important, and I am looking forward to giving you my thoughts after a couple of days here or at the end of it, and I get to be the last speaker, so everybody is eagerly anticipating the last slide. So I have it duly marked, so you'll know when it is coming.

Basically, administrative data and survey data are really kind of collected for different purposes, right? We all know this. Survey data are collected for research, for the most part, as a research file; administrative data are collected to administer a program, and we want to put these two things that are sort of at odds together to do really good health research, health-outcomes research.

So I am going to do some ramblings and musings. I don't really know much about administrative data, other than what I have learned from people who are in this room. I know Dave Ball was here yesterday. He taught me an awful lot about administrative data on the Medicaid side, and so if I get something wrong, feel free to correct me, but I am just going to give you my impressions of what is going on.

So I am going to stick with what I know, which is survey data. I know survey data fairly well and have been working with it for a long time. Then I am going to see how administrative data is sort of like survey data, in some ways, and see if that can be a useful exercise. People have already brought up a lot of the issues I am going to talk about.

But here is what I have: I have several concerns with survey data for health research, and then I am wondering how administrative data compare on these issues, and then I have issues in merging the two sets of data or matching the two sets of data, and then work left to do to fulfill the great potential, I think, of these merged and matched data sets.

I won't spend much time on the data stewardship, privacy, confidentiality stuff. It has been covered well, I think quite well, elsewhere, and it is extremely important.

So I am going to start at the end just in case I don't make it to the end. This is what I want people to understand from at least my point of view.

There is great potential for health research to be done with these linked survey and administrative data files.

Survey micro data are in the public domain, and that is really important. Heather talked about it quite a bit here today, and the importance of having that out in the public domain for policy research.

Having it in the public domain is also important because the strengths, and especially the limitations, of these data are extremely well known. Sometimes that is thought to be a weakness of the survey data, that we know a lot about it and we know a lot about its limitations.

We don't, unfortunately, have that same kind of information about the administrative data, because, obviously, it is on - it is not in the public domain, and researchers can't do research on its quality.

So because these data are not in the public domain, it is really imperative that the limitations be thoroughly investigated by the people who are entrusted with these data, more so - You know, a lot of it is going on, but if we are going to put these data sets together, we really need to have this information.

And there needs to be documentation and research on these linked files, in other words, metadata - you know, data - information on the actual data elements themselves, how they were collected, how they got in the data file, where they came from, and a lot of information on the process of how that information was produced. That needs to be put out into the public domain, so if I am reviewing an article for a journal and someone is using this linked-data file, I know how that variable was produced.

If I don't know how that variable was produced in the administrative data file, it makes it hard for me to review an article and know if that correlation or regression coefficient they found is actual or just something that was created as a part of the administrative process.

Certainly, NCHS, Census, NCI, Social Security, AHRQ, everybody here has these agreements in place and are the logical people to start producing this work, both the documentation and the research on the data itself. I think that is really key.

We need to have that research done. We need to get all this information out into the public domain, and these are the people who can do the work at the moment, because it has taken a couple of years for our - the project I have with Census to really hit the ground running, just because of all the agreements that have to be in place, and it is impossible for researchers on the outside to do this kind of work.

Survey data have extremely well-known limitations. Okay? Just to give you some of the highlights - or low lights, as you may see - survey data concerns that we have are sample frame coverage error. We have talked about that quite a bit, and Fritz brought it back up again. We have sampling error and variance estimation. You have non-response error, both item non-response and unit non-response. It is becoming worse, certainly, with both of those.

We have measurement error, things like collecting data from mixed modes. There are a lot of people who study whether a piece of information collected through a self-administered questionnaire is different than if it was collected through an interviewer.

You have data processing, imputation and editing, and there is always need for better documentation and metadata on the survey data side of things. There is no doubt about that.

And all these things, I think, are extremely well known about surveys. Survey data are dirty, messy and not for the timid, and I highly recommend that - you know.

So when I am talking about administrative data, as a survey researcher, I know that survey data are messy. I have made a living off of pointing out to people that the survey data are messy. It is something I publish quite a bit on. So, in general, I think that knowing the survey data are messy is a good thing.

And so how are the administrative data unlike or like survey data with respect to these main issues that we are concerned with survey data?

And there is, of course, a great variety of administrative data, and I am kind of throwing it all in a bucket here at the moment.

Sample frame and frame coverage, not really a problem, obviously. Administrative data, you know, cover the entire enrolled population.

Certainly need survey data, as pointed out, to know about the unenrolled and potentially enrollable populations and take-up rates and all that kind of stuff. So the survey data provides you with that, but there isn't really much of a problem from an administrative data point of view as far as a frame of the population being covered.

It is important to note, though, that I did work - I was working with the Veterans Administration in Minneapolis on a study that they were doing of Post-Traumatic Stress Disorder, and I was doing a non-response analysis on their survey, where they had sent out a questionnaire to people about PTSD, and what we found was that we had a response bias when we looked at the administrative data, that it seemed that people who didn't have PTSD were much less likely to respond. So we thought this was something quite interesting, and then it turned out, when we dove into it, that it was the quality of the contact information that was really producing this.

People who have PTSD received cash payments for having PTSD, and, as a result, the quality of their contact information was very, very good. We had good addresses. We had good phone numbers. We didn't on the others, and that was what explained the difference. There was very little difference after we controlled for that factor.

So when you are using administrative data as survey sampling frames, as another way to mix the two, we need to be careful of those kinds of things like contact information.
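The non-response analysis described above can be sketched as a comparison of crude and stratified response rates. This is only an illustration under stated assumptions: the counts are made up, and the field names `ptsd`, `good_contact`, and `responded` are hypothetical, not taken from the Veterans Administration study. The data are built so that response depends only on contact quality, which is enough to reproduce an apparent PTSD response bias that disappears within contact-quality strata.

```python
from collections import defaultdict

def response_rates(records, keys):
    """Response rate for each combination of the given keys."""
    tally = defaultdict(lambda: [0, 0])  # group -> [respondents, sampled]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        tally[group][0] += int(rec["responded"])
        tally[group][1] += 1
    return {group: resp / total for group, (resp, total) in tally.items()}

def cell(ptsd, good_contact, sampled, responded):
    """Build `sampled` hypothetical records; the first `responded` of them responded."""
    return [
        {"ptsd": ptsd, "good_contact": good_contact, "responded": i < responded}
        for i in range(sampled)
    ]

# Made-up counts: response depends only on contact quality (60% vs 20%),
# but the PTSD group has far more good addresses and phone numbers.
sample = (
    cell(True, True, 80, 48) + cell(True, False, 20, 4)
    + cell(False, True, 40, 24) + cell(False, False, 60, 12)
)

crude = response_rates(sample, ["ptsd"])
stratified = response_rates(sample, ["ptsd", "good_contact"])

for group, rate in sorted(crude.items()):
    print("ptsd=%s: %.2f" % (group[0], rate))
for group, rate in sorted(stratified.items()):
    print("ptsd=%s good_contact=%s: %.2f" % (group[0], group[1], rate))
```

In this toy data the crude rates differ by PTSD status, while the rates within each contact-quality stratum are identical, which is the pattern the speaker reports finding after controlling for contact information.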

Sampling error, of course, is not much of a problem. It could be if you are drawing samples from the administrative records to use for research, but it is not a big deal.

Non-response error, or missing data, is a bigger deal. Certainly, I know best the Medicaid data that I have been working with. Item non-response on those is a major issue, largely because, I think, item non-response here isn't the same as it is in surveys. The mechanism that produces it in surveys is someone doesn't give you a piece of information or they don't know. So they refuse or they don't know, you know that.

I think it is maybe more systematically missing in administrative data if it is missing for some reason, and that is an important thing to keep in mind when you are using these data for research purposes. It is probably more likely to be not missing at random versus missing at random in kind of the statistical-ese of things.

Age, program codes, race and ethnicity, we have been back and forth with - The CMS people have been wonderfully open about problems with their data with us, and, as a result of that collaboration, we have learned a lot about the data, but it is important to know that this is - it seems to largely be systematically missing. You know, some states are missing race and ethnicity information, and others are - have very good race and ethnicity information, or, at least, it is filled - the variable values are filled in.

So some of this data is missing systematically, the TANF flag by county, and we found out - we wanted to use a TANF flag on the MSIS, the Medicaid Statistical Information System, and found out that it wasn't really all that good, because it was systematically missing. Race, ethnicity by state were systematically missing in the MSIS, and some states had much more missing data than others.
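One simple diagnostic for the state-by-state pattern described above is an item-missingness rate computed per state. The sketch below is illustrative only: the records, the state labels "AA" and "BB", the field name `race`, and the codes treated as missing are all assumptions, not the actual MSIS layout.

```python
from collections import defaultdict

MISSING = (None, "", "unknown")  # assumed codes for "not reported"

def missing_rate_by_group(records, group_key, field):
    """Share of records in each group whose `field` is missing."""
    tally = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for rec in records:
        counts = tally[rec[group_key]]
        counts[0] += int(rec.get(field) in MISSING)
        counts[1] += 1
    return {group: miss / total for group, (miss, total) in tally.items()}

def flag_groups(rates, threshold):
    """Groups whose missingness rate exceeds the chosen threshold."""
    return sorted(group for group, rate in rates.items() if rate > threshold)

# Hypothetical enrollment records for two made-up states.
enrollment = [
    {"state": "AA", "race": None},
    {"state": "AA", "race": ""},
    {"state": "AA", "race": "unknown"},
    {"state": "AA", "race": "White"},
    {"state": "BB", "race": "Black"},
    {"state": "BB", "race": "White"},
    {"state": "BB", "race": None},
    {"state": "BB", "race": "Asian"},
]

rates = missing_rate_by_group(enrollment, "state", "race")
flagged = flag_groups(rates, threshold=0.5)  # the 0.5 cutoff is arbitrary
print(rates, flagged)
```

A large gap in rates between groups is a sign that the data are missing systematically rather than at random, so simple imputation across states may not be appropriate.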

Identifying data can also - ID data can also be missing systematically, which is really important for doing linking, obviously, and we need to really do a good job and figure out where these data are missing in systematic ways, and it can be a large source of sample loss for the merged data if ID data are missing systematically.

So administrative data have important information for health research that is missing. I mean, that is the bottom line of the missing error, and I think it tends to be missing more systematically, perhaps, than in surveys, and so maybe some of the techniques, like imputation or things, are not as possible with the administrative data.

There are certainly measurement issues with administrative data. Certainly, administrative data are, as we heard from Social Security yesterday, are the standard for knowing whether someone is enrolled in a program and how much someone received in benefits. There is no doubt about that. That is right on.

However, there's other administrative data that is desired for research that is on these files that may not be as linked to the program as - and may not be as well measured, and there's probably a lot of error associated with these things, as Fritz brought up and other people have talked about.

Administrative data can be collected through many modes during more than one wave of interviewing with several instruments used, and it is all kind of mushed back together, okay?

You have interview - and the survey researchers will tell you all this stuff matters - okay - that they do a lot of research into interviewer effects and into self-administered questionnaires versus interviewer-administered questionnaires, and so you have all this stuff going on.

You have people completely filling out, where the interviewer actually fills it out completely for the enrollee and then submits it and just has them sign it, and, certainly, I have done that for tax information for people who don't speak English. I helped out quite a bit in doing that over the years, and so I fill out the form completely for them and just have them sign it and send it in, after they have given me whatever W2s that they have.

So you have all these kinds of things that are going on, and so it is important to try to track that, as best we can, to try to figure out what the source of the information is and create this metadata and do the analysis on this kind of stuff.

And interviewers have a wide variety of training and skills. For example, you can have a tax - If you have an accountant fill out your taxes or you do it yourself or other kinds of things, there may be quality-of-data issues involved with that, and so it is important to be able to - You know, when you are beginning to link this stuff, it is important to know about that when you are trying to use these data for research purposes.

Medicaid enrollment data can be drawn from a wide variety of sources, including county level, state level. You know, it is coming from all over the place, and you have no idea how that variable got to where it is when it is on the MSIS. It has gone through a lot of hands before it is into that MSIS, and it is really important for people to understand that that is an incredibly different situation than when the Census Bureau goes out and collects a survey with an English version of the instrument and a Spanish version of the instrument and so those are things that could be going on that are being drawn from all over.

Administrative data forms, the forms you actually fill out, are generally not as user-friendly. I am always frustrated by them. That is one thing that I think administrative data people could learn from survey people: how to get a form that is actually user-friendly and is laid out in an easy way for people to see the race and ethnicity information, so they can circle more than one or fill in the box for more than one, so that it'll be comparable to the survey data and things of that nature.

So research is really needed into the mode effects and longitudinal panel conditioning. As data is collected over time and things change, instrumentation effects and all that kind of stuff can certainly creep into the data and it is something that I think we really need to go in and take a look at. Survey research has a long history of this kind of work.

I know administrative data has done it, but the thing about survey data is you have these wonderful journals, outlets that get out to the public. You have Public Opinion Quarterly, Journal of Official Statistics, and this work gets out there. You have all these - and, certainly, American Statistical Association, JSM meetings - you have all this historical record of the problems with these surveys that has been created, and it would be nice to see - Certainly, some of that work is done on the administrative side, but seeing this kind of work, looking at these kinds of questions, I don't see very often in the administrative data, and it would be interesting to see, looking at like interviewer and mode effects and if there's reasons to suspect that the data may not be consistent.

Also, it is important to remember that people have different motivations for filling out administrative data than survey data. Okay? I think that is really key. You might want to have one income for your tax record. You might report another one to your CPS interviewer. You might report another one to the Medicaid agency, so you can get enrolled in Medicaid. There's a variety of these things. You can think of a creative caseworker in Medicaid being just probably as good as a tax accountant at hiding income and knowing how to put the family formation together, and so these are things we should be thinking about as far as motivations of people for filling out these data when we begin to look and see and cross classify by this stuff.

Also, if there is data that is not accepted in some administrative data systems, do data-entry folks just enter it and pass by that screen, even though they didn't ask or collect that information? So you are always curious about that. So it is always good to check out if you can source out who put in that piece of data and how it got into that system. It is really a key thing to be able to do research on.

So that is it, and, also, data editing and imputation. This is something that is absolutely essential, I think. When we are putting together these linked-data files, we need to have incredible metadata on them for researchers to be able to use them well. There is very little documentation in the public domain regarding the collection, editing, imputation procedures of administrative data and enrollment data relative to survey data, and I think that that really needs to - If we are going to link these things up and create these files, the first thing we need to do, for researchers to use them effectively, is to write the documentation. I know it is not anybody's idea of fun.

In what I am doing, a research project with the Census Bureau, we want to get to the answer, but we have put together a huge team of 20-30 researchers who have put a lot of effort into a project over two years, and all that will sort of be left by the wayside, all the knowledge we collected along the way, and we'll just get a research paper out of it that gives the technical results of what we are looking at, but we won't have created the metadata, I think, at the end of it. This is what is typically left. I mean, as researchers, we just want to get to the results, and we want to pass along - and not do this really tough part of documenting and writing this stuff up.

So putting these linked-data files together also means we have to create the metadata to go with them. I think that is essential and needs to be done, and all this kind of research needs to be taken - take place so we can do that.

So, basically, how does administrative data compare to survey data for research purposes? Survey data, micro data and research into critical sources of error are all in the public domain. Survey data are very strong, because there are so many known problems. Okay. I have already talked about that, but I do think it is an extremely important point that is often missed by researchers, and similar research needs to be done on administrative data.

Certainly, the quality of administrative data will vary greatly from centralized data collections or more centralized, like SSA, IRS or Medicare to Medicaid or state-based programs, and so, certainly, I have been throwing them all into one pile, but I expect that there will be great variation with respect to some of those things.

The issues with the linked-data files that we have certainly been dealing with over the last two years in the project that Ron described yesterday with Census are universe issues and measurement error on both sides, the administrative data side and the survey data side, and it is essential to understand the differences and concordance between these data sources.

The universe issues, when there is missing linking information, that is not good, right? So we need to really figure out why it is missing and who is missing and how it could impact our analysis.

Do we have differential sample loss - missing ID - because someone refuses to give their Social Security number on the CPS, for example, or could not - we couldn't find their Social Security number, so it couldn't be validated? So we have differential sample loss. There's two real sources there. Ron showed it was about 27 percent of the cases, I recall, total that couldn't be linked.

Administrative data has missing linking key information, and it is differential. He showed - I'll show you here in a second that it was differential, and it was systematic, and so there needs to be - when we build a common universe, we have to do it carefully and figure out what was going on.

As you can see, here is a systematic. The red and black aren't good for ID information.

This is one of those maps, I believe, that Deb was talking about earlier, right?

DR. SCHRAG: Yes.

DR. DAVERN: So here we are, and I actually didn't know that was the name of them either, but I have one in my presentation.

As you see, California and Montana, if you were doing an analysis, you would have systematically missing data that could impact your analysis, depending on how you are using those data. So it is key that you find out who is missing and why when you are working with these linked-data files.

Developing these linked universes, there's all kinds of reasons. This is a slide Ron had, too. There's not a valid record. They refused to have their data linked, and then you have the big group of - Most people are in the big group in both the MSIS universe and the CPS sampling-frame universe, but there is also not a valid record. There's people in group quarters. There's people who have died before the CPS interviewer gets to them, but they were enrolled in Medicaid that year, and those kinds of things. Enrolled in more than one state is another issue. Also, on the CPS side, you have births, people who were born not in the calendar year that the data were collected for, but are included in the CPS interview.

So you have measurement error here going on. There's conceptual differences. For example, a person can be on Medicaid, but not receiving full benefits, so is the person really insured? That is an important question we need to ask, and, in some cases, we determine that, yes, they kind of look like they are getting a full range of benefits, some of these people who are on partial benefits. Others aren't. So you do have conceptual differences when you are linking these files to think about. So they are on the MSIS, but do they actually have health insurance as we think of it?

You have misreporting in surveys - person is on Medicaid, but reports some other type of coverage or reports that they are uninsured, and, certainly, Ron showed that yesterday.

You have misclassification of administrative data. Race data are often missing from the Medicaid file, and are important for, of course, disparities research, and when they are there, they may not be collected systematically in every state the same way.

You also have systematically missing variables, as I have talked about.

So the potential for the merged data is great. We have talked about a lot of these things. You have improving the accuracy of survey data using enrollment data. You can improve the accuracy of sample frames.

One thing we haven't talked about is the Census Master Address Files, greatly improved by the delivery-sequence file, and, then, their relationship with the U.S. Postal Service, and so those are great ways to improve Census' sampling frames or anybody's sampling frame is by looking to the administrative data.

Using merged data to create small area estimates was covered. Incredible potential, I think, there.

Improved administrative data race and ethnicity information needs to be done, especially for health disparities work.

There is great benefit to using information on imputation models and editing from these linked-data sets that both the administrative data side and the survey data side can use. So survey data can do better imputation and so can the administrative data. Even if they can't get the merged or linked data back, they can learn a lot about their own data and the patterns of missing data.

So this stuff will greatly improve health policy simulation and health research, and engage our errors. Don't be afraid of them. Engage. Go out there. Do the research and document them and try to get the best stuff out there.

And just to wrap this all up, there's a couple of - We talked a lot about problems. Other people have brought these up with recency and those kinds of things.

Certainly, these agreements all do their duty, and they restrict access, and, as a result, we are dealing with old data, in a lot of respects. So that is one of the limitations of this stuff. Hopefully, that will pick up and be a little bit more timely.

Data are not in the public domain, and ability to conduct research into quality of administrative data for research purposes is limited for people like me, who like to do data-quality analysis.

So it is imperative that the agencies entrusted with those data really do a good job of looking at data quality for research purposes, not for administrative data purposes, which are two different things.

So - and be careful, of course, about reaching conclusions based on asymmetrical verification. Jill Colon(?) from AHRQ had this at the session we were in at the Joint Statistical Meetings about a month ago.

The example here is that we compare Medicaid enrollees who are linked in the CPS to the MSIS. We know that they have Medicaid, but, in the CPS, they don't report that they have Medicaid or they report that they are uninsured.

We had 15 percent of the CPS data reporting that they were uninsured - which is a problem. They actually had health-insurance coverage, according to the MSIS in the past year.

Well, if you multiply that times the 40 million, simply, who are on MSIS, you come up with six million, and you say, Well, obviously, there's not 46 million uninsured in the United States. There's 40 million. That is a dangerous conclusion for a couple of reasons.

What is going on here is it doesn't allow us to verify - the linked-data files don't allow us to verify if uninsured people report that they have coverage. So we have only verified - we are only able to verify one piece of the puzzle. We are not able to verify the others.

So we know if someone on Medicaid said they were uninsured. We don't know if someone who is insured says that they were - or someone who is uninsured said they were insured, and I think it is likely to happen, given that there's 10 questions that ask what type of health-insurance coverage you have, you eventually give in and say, Oh, sure, and especially by the time they give you the last one, which says, Are you sure that you are uninsured? So there's likely some of that going on on the other side.
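The arithmetic behind the one-sided adjustment, and why it is fragile, can be written out as a short sketch. The 46 million, 40 million, and 15 percent figures are the ones quoted above; the function names and the offsetting values (uninsured people who report coverage) are hypothetical, since the linked files cannot measure that side of the error.

```python
def one_sided_adjustment(survey_uninsured, enrolled, misreport_rate):
    """Subtract enrollees who report being uninsured (the only verified error)."""
    return survey_uninsured - enrolled * misreport_rate

def two_sided_adjustment(survey_uninsured, enrolled, misreport_rate,
                         uninsured_reporting_coverage):
    """Also add back uninsured people who report coverage (not verifiable here)."""
    return (one_sided_adjustment(survey_uninsured, enrolled, misreport_rate)
            + uninsured_reporting_coverage)

# Figures quoted in the remarks, in millions of people.
SURVEY_UNINSURED = 46.0
MEDICAID_ENROLLED = 40.0
ENROLLEE_MISREPORT_RATE = 0.15

naive = one_sided_adjustment(SURVEY_UNINSURED, MEDICAID_ENROLLED,
                             ENROLLEE_MISREPORT_RATE)
print("one-sided estimate: %.1f million" % naive)

# Hypothetical sizes for the unverified error in the other direction.
for offset in (0.0, 2.0, 4.0):
    estimate = two_sided_adjustment(SURVEY_UNINSURED, MEDICAID_ENROLLED,
                                    ENROLLEE_MISREPORT_RATE, offset)
    print("offset %.1f million -> %.1f million uninsured" % (offset, estimate))
```

The one-sided estimate only holds if the offsetting error is zero, which is exactly the quantity the linked data cannot verify.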

So it is important not to jump to simple conclusions when you are doing these analyses and when you are working with these linked-data files, and taking a look at the sample loss is really going to be key.

So I am going to start back to where I finished, which is I think the strength of the survey data is that it is in the public domain. There's lots of researchers taking a look at it for research purposes. We really know its limitations.

This is not true for the administrative data. So it is really imperative that people who are entrusted with these linked-data files put together both the documentation and the research on them, so that researchers can reasonably understand the data that other researchers are using to inform the debate, as well as the research that are used - you know, the data that they are using.

There are certainly all kinds of standards out there. The Data Documentation Initiative is one that I am familiar with, the DDI standards for survey data. Perhaps something similar for administrative data would be a good thing. Research into sample losses is really key on these linked-data files, and understanding the measurement error, so -

And that is it.

DR. STEINWACHS: Thank you very, very much.

Agenda Item: Wrap-Up

DR. STEINWACHS: We have time, now, both for questions and comments to the speakers as well as to go into a broader discussion.

I wanted to invite people who are sitting back in the audience, we have seats up here, and I would very much welcome you to join, and that way, also, in making comments, you have a speaker in front of you, instead of having to look for a wandering microphone. So please come up and join us.

Comments and questions to the speakers?

Is anyone ready to take the quiz to see if they know all of Fritz's friends?

MR. DENBALY: I have a question for Mike.

I am just checking to see if I understood you correctly. If administrative data are used to guide the agency to run their program, why should the evaluation of the data be for research purposes or did I misunderstand you?

DR. DAVERN: I think that the administrative data have been evaluated quite well for their programmatic purposes, and so the programmatic data, I think, is very good and solid.

There's all kinds of other information that gets carried on these files, like - that gets collected at enrollment time or other things that I think should be evaluated, and those are the things researchers are really interested in a lot of times.

The health-disparities researchers want to know about the race and ethnicity data, but the people who administer Medicaid don't necessarily care all that much about that data.

As a matter of fact, when you enroll, some states, on their enrollment forms, say, This is a completely optional question. Like I think New York has one, and so they have incredibly missing data on their Medicaid files on race, ethnicity.

Other states just have a blank that says, Race, fill it in, and how that gets recoded into the MSIS, I have no idea.

So those kinds of things are, as far as the quality of - So those are the things I am talking about for research purposes.

The administrative data are very good at administering programs. There is no doubt about it. They have been properly evaluated, but I do think it is something different to think about from a research perspective - from a health research perspective. They need to be evaluated for that purpose as well.

MS. GENSER: Hi. Jenny Genser, again.

I wanted to give a case study on some admin data that I work very much with, and that is the Food Stamp Quality Control Data, which I have been working with for about 15 years. The Food Stamp Quality Control Data is administrative data. It is a sample of Food Stamp recipients that we use to measure payment error, and we have found consistently that the data that are required to determine whether a person is receiving the correct benefit are generally quite accurate, but other variables that we have collected - such as race, ethnicity, citizenship status - may not necessarily be as adequate, because they don't affect the QC error determination.

And our office has worked a lot with the program to make sure that these data are higher quality, because we use it so much for our analysis, for cost estimation, for researchers, and, in fact, it is data that is in the public domain with documentation.

An example with citizenship status: a few years ago we were finding a lot of people whose race was coded as Native American, who lived in Oklahoma, and whose citizenship was naturalized citizen, which we knew didn't make sense.

So that is just a case example, if you are interested in looking at quality of admin data: the Food and Nutrition Service has done a lot of work with the Food Stamp Quality Control Data.
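A cross-field consistency check of the kind described above can be sketched as a small set of edit rules applied to each record. The field names, code values, and the rule itself are assumptions made for illustration; they are not the actual Food Stamp Quality Control file layout or the Food and Nutrition Service's edit procedures.

```python
def native_american_naturalized(rec):
    """Implausible combination: race coded Native American, citizenship naturalized."""
    return rec["race"] == "Native American" and rec["citizenship"] == "naturalized"

# Hypothetical rule set; more cross-field checks could be added here.
EDIT_RULES = {"native_american_naturalized": native_american_naturalized}

def failed_edits(records, rules):
    """Return (record index, rule name) for every rule a record trips."""
    return [
        (index, name)
        for index, rec in enumerate(records)
        for name, rule in rules.items()
        if rule(rec)
    ]

# Made-up records for illustration only.
qc_sample = [
    {"race": "Native American", "state": "OK", "citizenship": "naturalized"},
    {"race": "Native American", "state": "OK", "citizenship": "born in U.S."},
    {"race": "Asian", "state": "CA", "citizenship": "naturalized"},
]

failures = failed_edits(qc_sample, EDIT_RULES)
print(failures)
```

Flagged records are candidates for review rather than automatic correction, since the rule cannot say which of the two fields was miscoded.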

Now, in terms of these overall Food Stamp data, that is run and administered by states and counties. So we don't know that quality.

DR. BREEN: Just to build on that, when I was talking with the Social Security Administration Quality Control person a number of years ago about the possibility of using that data and matching it with the SEER registry data, she was - We were discussing - I thought it would be good to use IRS data to get the tax information for information on income and then the Social Security information to get information on earnings and that that would give us a pretty decent picture of people's economic well being, which we had nothing of on the SEER data.

So we were discussing that and she had been there a long time and mentioned some of the issues, but one of the things she said was that before Watergate, when their data were routinely analyzed, especially by people at the Bureau of Economic Analysis, she said, Our data was in much better shape, because the researchers would analyze the data and they would come back to us and they would say, Well, you've got a problem here because this data or that data is not very accurate.

So, in fact, the administrative data - and I want to just put this on record for the committee in terms of our recommendation and cost to the public of not doing this - the administrative data gets better by researchers outside the agency using it, and it gets better even for administrative purposes.

And I don't know if Deb wants to mention this, because I know she has been using, and Gerry, too, the SEER Medicare data, which has improved over the years as a result of researchers working on this data. It is better for research. It is better for administration.

MR. LOCALIO: Michael, I commend you for your comments about metadata. That is not often discussed.

I just want to let people know that it is not just an issue for the type of data we have been talking about here, whether it is administrative data from various agencies - It is a terrible problem with - and clinical data that you get on people's health.

Different organizations collect the same data in very different ways, and it means different things, and if you think you can use a nationwide health-information network and aggregate data that way and everything'll be defined the same, you have much to learn.

So I think - I wanted to comment that the metadata issue, the lack of documentation, the lack of understanding of how things are collected, generated, is a real problem, no matter what the source of data is, survey, administrative data from agencies or whether it is clinical data.

MS. TUREK: I've got a question, I guess, more or less.

There really are kind of two types of state data. One is the data that the state collects - chooses to collect itself to serve some function it is doing, and the second is the kind of data we feds tell the states that they have to collect to get benefits from a program like TANF.

Many years ago, I used to work with school-district data, and Department of Education used to ask for variables. They didn't get a lot of what they wanted for political reasons, and, I mean, the guy who was the head of the Council of Chief State School Officers told me that any time the feds tried to collect data to evaluate them, they were going to be up on the Hill getting that data out of the system.

So I would suspect that the data that a state collects because it wants to and it is serving some kind of state function, like the Vital Statistics, probably has a better chance of being better than Medicaid data, and it might even have a more rational set of variables, and I would like to hear somebody who knows more about this than me talk about the differences.

DR. STEINWACHS: Anyone care to respond to that?

Makes sense to me, but be nice to have -

MS. TUREK: Have you got any ideas, Jennifer?

MS. MADANS: Well, I guess I would agree with the general premise, but maybe Vital Statistics isn't the greatest example, because states vary very, very much in how they use the information, other than the registration information.

I think so much of what Mike was saying that the - if you look generically at the vital registration system, the registration information is very good, because that is really why it is collected, and there is a lot of emphasis on that, and that is what states use.

Some states use the other information that tends to be more of an interest to researchers, partners and to some states, because they have more programs that deal with that, and so there is variability in the quality of that.

I think the bottom line is don't assume any data are good until you have done an evaluation and then publish that evaluation.

So I completely agree on the metadata and spending the time to do the evaluations.

MR. IAMS: I am Howard Iams from Social Security.

I have worked with AFDC Quality Control Data and Social Security data, and my conclusion is if it is used to administer the program to calculate a benefit, it is usually pretty good, and if it is not, you are using it at risk.

And the QC system that I worked in for six years - three years - states would code whatever they wanted in the fields they didn't care about. Regardless of what the coding form said you were supposed to put there, they would collect some data item they wanted and just stick it in the field.

And at Social Security, a lot of care is given in calculating eligibility and a benefit, but some of the other information is not going to be as carefully done.

In terms of mortality, in Social Security, we have several sources that are reporting someone is dead, and when we pay a small amount for a death certificate, and the funeral director is sending in the report to get the money that we pay, because it is signed over to the funeral director, that is pretty accurate.

When the states are sending in reports, which we pay for as well, they are not quite as accurate, and, actually, it costs us money when we stop paying that small amount of money that goes to the funeral director, because someone doesn't report a death and we keep paying benefits, and Congress, to save money, will come along and eliminate that benefit. Why pay $250 to report a death? And it will - They did it back in the '80s, and it cost a lot of money, and they went back and started paying it again.

MR. RILEY: I guess one - situation with the Medicare data, on the claims data, the conventional wisdom is that if it doesn't - if their data are not reported on the claims to set a proper payment rate, then you probably shouldn't trust it, but, on the other hand, data that are reported for payment purposes can be gamed as well.

The provider has an incentive to try and upcode and do things like that that might increase their payment amount. So you have different incentives at work as to how seriously people take the information and if they are trying to game it to some advantage, so that you definitely need to consider the incentives of the person who is providing the administrative data when you go to use it.

MS. TUREK: I am going to ask a very naive question, then. If, in fact, we can only trust the payment data, what do we gain from putting the administrative data in if we don't know if it is very accurate? I can see the address files, but on the MSIS data, the enrollment, yes, but is that all we should use off of these administrative data files is the enrollment data?

MR. RILEY: Well, there has been work to try and validate some. I mean, you have different levels of accuracy for various elements that appear on the claims and enrollment files.

For example, there has been work to look at diagnostic information that is reported on physician claims, and that is not used directly for payment purposes, but I think the consensus of the validation studies on that variable is that it tends to be accurate enough to be useful, but not anything you'd want to make a life-or-death decision on.

So, I mean, there's levels of -

MS. MADANS: The hospital diagnosis -

MR. RILEY: That is used for payment purposes. Again, we have noticed, over time, there is a certain creep.

DR. SCANLON: I was just going to agree with his point that, I mean, it really comes down to the quality of the data at the individual variable or field level. I mean, because - and this is my conventional wisdom, too, for years, that if it wasn't for payment, it was potentially sort of unreliable, and, then, as you pointed out, we have created incentives for how you report things for payment, to the point where I would say it is more than conventional wisdom. We have an empirical test.

I mean, the introduction of the prospective payment system for hospitals in 1983, when it suddenly became sort of such significance for payment, we saw this dramatic shift in diagnosing reporting, and the story was, well, they had become better at it, in terms of capturing all the diagnoses that sort of were associated with an individual, or, potentially, as Gerry says, they recognized the value in sort of reporting sort of things that maybe were marginal to begin with.

In terms of if this - we have this problem, I think, this is kind of, I guess, my sort of where I come out is what is the purpose - I mean, what is the use we are going to apply this to, and sort of how much data error can we tolerate? Because, in some respects, we are worse off operating with no information. We have to think about sort of there's errors sort of within this information, but we still may sort of improve our decision making if we use that data and we take that error into account.

Policymaking sort of always has to recognize that it is operating on a set of information that is flawed, but it needs to still move forward.

DR. STEUERLE: I have a comment, then a question.

The comment is, Fritz, they say is what age is one's memories turn to myths. So I appreciate the fact that your memory of my involvement had turned to a myth as to what I had actually achieved, but thank you anyway. It was a nice myth.

My question had to do with incentives, which is where we were going just a minute ago.

As an economist, any time I see something where we feel like we are doing something incompletely or not as well as we can, I am always brought back to the question of what are the incentives of the system, and I wonder if the two of you - either of you - have some recommendations on how we actually change the incentives of our various systems to reach some of the goals that you mentioned, and I'll mention one, but you could go into other areas.

My sense is that when it comes to data, we of social science have it totally upside down. If Madam Curie tried to publish an article or had tried to publish an article in any of our social-science journals, she would be immediately rejected because all she did was gather some data and run some simple correlations.

And my sense is that the people within our statistical agencies who would gather data, put it together, document it, do all the things that we have talked about receive almost no rewards in our academic settings, even though one could argue that they are the ones doing the basic research, and those of us out there who are using the data sets are really the ones doing the applied research, which actually gets the - especially in academia or in our journals - tends to get the credit.

So I guess that is one area to go, but I guess within the agencies, too, I am just curious whether either of you have given any thought to are there recommendations that we could think of as a committee that says to HHS or other agencies says, Here's at least a couple of areas where you really need to think about incentives of the system?

I think one you mentioned was bringing in outsiders occasionally to constantly review the data sets as they are developed or however we do it for whatever reasons, because they are doing research and that is their incentives or because -

But I am just curious whether - just thinking about incentives, do you have any recommendations?

DR. SCHEUREN: Well, you heard yesterday from Ed Sondeck(?) about two things he thought ought to be done, and I'll repeat them again, if I got them right.

Despite the fact that I have aged considerably between yesterday and today, Gene.

The first of them was this notion of having a small fund to get analysis done.

And the second one - which could be done by an outsider - and the second one -

DR. STEUERLE: Could you just expand on a small fund for whom to do what?

DR. SCHEUREN: Well, let me repeat - Let me -

DR. STEUERLE: (Off mike).

DR. SCHEUREN: He was talking about for NCHS, right? That he would be given a small amount of money that he could spend at his discretion - okay? - to bring in outsiders to work on the data, and he would also be given resources to have internal analysts work on the data, which is what I think you were alluding to just a moment ago.

That would be a fundamental, and it would be earmarked, and it might not come from the agency. The problem with it coming from the agency is it gets in the base, and then they squeeze you again. So it has to come from an NSF or some place outside that, outside the base. Otherwise, it, in the end, doesn't matter more than a year or two.

That is a very good idea. I thought it was an excellent point. He made it yesterday afternoon, and I would like to see that followed up. Talk to Ed and develop it more. You asked me to develop it, but you were sitting next to him when he said it, so - I know you don't remember it, but - I am teasing you now, Gene.

You don't remember a lot of the good things you did, Gene. That is the problem, because you are such an humble guy, and I am a student, so I try to write down what you say, so that I can repeat it or recall it at various points, and I hope I did a pretty good job of recalling it.

I think that - I want to make a point.

DR. STEUERLE: You said you had two things -

DR. SCHEUREN: And staff, and increase the staff to do analysis. Earmarked staff to do analysis -

DR. STEUERLE: Earmarked - (off mike).

DR. SCHEUREN: That's right. They have to be earmarked, because you get a crisis - And if you are a producing organization, there are always crises. I mean, it is just unbelievable. There is always a crisis, because the systems are so structured that they are always at the point of a landslide, if you know your complexity theory. They are always just at the edge of one more thing wrong - okay? - and the whole thing goes. Okay? And it is really true. Right, Ron? I mean, it happened to me a lot.

I have a story about one of the censuses in which the IRS was supposed to produce a piece of information - it was an industry(?) code - and it wasn't being used for administrative purposes, and so one really smart service center director who was trying to save money, says, We won't key that.

Well, we didn't catch it for a while. That was a big mistake. Big, big mistake. Shame on us. Shame on me, in fact, because I was supposed to have seen it done, provided some oversight, and it was too late when we found out about it. Too late. Too bad.

We didn't have the right systems to monitor it, and everyone was trying to save every penny they could and beat the other guy in the next service center over, and bad mistake. Bad mistake made. It really wrecked the Census Bureau. I apologize again publicly. I apologized to many people at the bureau. I'll apologize again publicly to the bureau people here.

I want to make another point that was underlying something that my good colleague said about metadata and about a subset of metadata called paradata(?), which is about the process of the metadata.

Actually have written about this and given talks on this, and I think it is fundamental to making the linkages learning systems. If you don't do this - okay? - then, you are gone.

Now, there is really good software out there to do this. I don't know of any statistical agency in the U.S. Federal Government that is using it. The Canadians are doing it, okay? Some private-sector firms - I think RTI is doing it. We are doing it at Newark(?), although not in a very large way. It is being done at Brigham Young University. They run exit polls since '82 or '84, and they do a wonderful job of documenting things historically.

I mean, the real problem with documentation is when the person who did it isn't there anymore, if you can't understand it anymore, if you don't have the Rosetta Stone somehow, it is gone. Very important, and when Howard talks about these long-term systems where you add years and years of data - okay? - and the initial starting point - however good it may be, wherever it came from - the Census Bureau or somewhere else - is not well documented, you really have a serious problem, and, moreover, if you are documenting these administrative systems well - okay - if you are documenting them well, which means you have to have money to do that and time, then, you will discover errors quicker.

If we had in place the system that I would be recommending today, we probably wouldn't have made that mistake for you those years ago. We'd have made some other mistakes for you, but not that one.

MS. GENZER: I wanted to go back to the Food Stamp quality-control example, because Howard Iams was talking about problems which I have seen, too. If it is not connected to the eligibility benefit, it is not as accurate as one would like, but agencies don't have to stand helplessly and wring their hands because you can improve the accuracy, even of those data.

We have a contractor, Mathematica, who, each year, takes the quality-control data and edits it, and does data assessments.

What our agency does, then, is meet with the program staff, interagency, and says, Look, Houston, we have a problem. We've got naturalized citizens who are Native Americans in Oklahoma. Does that make sense to you? It doesn't to us.
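The kind of cross-field edit check described here can be sketched in a few lines. This is only a minimal illustration, not the agency's or the contractor's actual procedure; the record layout, field names, and code values below are hypothetical and are not taken from the Food Stamp Quality Control file documentation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical fields; the real QC file layout is not shown in the source.
    case_id: str
    state: str
    race: str
    citizenship: str


def flag_implausible(records):
    """Return case IDs whose race and citizenship codes look inconsistent.

    Assumed rule, for illustration only: a person coded as Native American
    should not normally also be coded as a naturalized citizen.
    """
    flagged = []
    for r in records:
        if r.race == "native_american" and r.citizenship == "naturalized":
            flagged.append(r.case_id)
    return flagged


# Made-up sample records, not real data.
sample = [
    Record("A1", "OK", "native_american", "born_citizen"),
    Record("A2", "OK", "native_american", "naturalized"),
    Record("A3", "TX", "white", "naturalized"),
]

print(flag_implausible(sample))
```

A flagged case is a prompt for follow-up with the state that coded it, in the spirit of the meetings described here, rather than proof that the record is wrong.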

So, then, what we did - and this was in '01, '02, '03 time period - we had meetings with the regions and then with the states, talking about these are all the areas that we have problems with. We were not getting good information about vehicles.

Well, one of the reasons, we found out, was that states that have exempted all vehicles or exempted one per household from the vehicle test from the Food Stamp Resource Test, they weren't collecting vehicle data, if they didn't need to.

We realized in our office that, since we couldn't rely on that data, we said, Okay. We'll drop this vehicle in exchange for having an assurance that the states will make some improvement on the quality of the immigration data, which we have to have to do our analyses and our reform simulations.

So you can work with the states who are putting together the data to get higher quality admin database. So there are steps that you can do to improve the quality of your admin data.

So our agency tries very hard to make sure it is available, and part of the reason is that it is in the public domain, and if you could go home tomorrow and download the Food Stamp Quality Control Data from the Internet, since 1996, so -

DR. SCHEUREN: Compliments to you.

MS. GENZER: Yes. And the documentation, too.

DR. SCHEUREN: Yes.

DR. DAVERN: I just wanted to - The incentive question was really interesting, and, certainly, what has helped us get this project underway was a combination of things.

First, it was the Assistant Secretary for Planning and Evaluation, Mike O'Grady, was very interested in our project with Census, so that got the HHS side of people interested, and then Robert Wood Johnson kicked in some money, but - and so it was money and it was power, basically, both, that brought that together, and brought together a team of a lot of people working on this particular project. So that was the incentives there.

But I do think that the best way to improve the quality of the data is to use it for research, and as soon as it starts getting used for research, these things are going to get done. So the more - Certainly, through access - Because the data-collection agencies, the administration data-collection - the administrative agencies are going to - They don't like to make mistakes, and if you can show them or document the mistake, they are going to fix it. That is what -

DR. SCHEUREN: They don't like to be embarrassed publicly.

DR. DAVERN: Well, right. Well, they don't like to be embarrassed publicly, and if you work with them and you are reasonable about it and understand that mistakes happen and you go in and you improve the system. That is what we are trying to do. You constantly want to work on improving the system.

So, in some way, whether it be through - Certainly, NIH could fund researchers to do this stuff, perhaps giving subcontracts to Census or NCHS. You have to actually work with the data, because we can't see it or if it can be done at a research data center where we can actually get in and have access to it, and a Title 13 benefit, for example, would be producing metadata.

So if you go in and work and do the research with this stuff, you'll provide it, too. You'll have to provide a piece of that metadata.

DR. SCHEUREN: Good. I like that. That is very good.

DR. STEINWACHS: Very good.

Other things?

MR. PREVOST: Well, I was musing. There is probably smoke coming out of my ears, this late on the second day of a conference.

But one of the things I was thinking about - and, often, as researchers, we don't think in this way - is that as you are looking at data from either your own organization or another organization, you compare them and you say, Hum, something is different here. It is not working, and somebody is - quote/unquote - wrong, whatever that means.

When you do find that things aren't matching up and you suggest a change to one of those organizations, we need to think about not just a statistical environment, but we need to think about how can we measure that agency's or that entity's return on investment, okay?

If we can start showing that, yes, when Researcher X accessed File Y and completed this research that they were able to reduce the operating costs or to improve the measures quantitatively that an agency is producing, I think it would be a huge benefit to be able to continue this type of research and to be continuing this type of linkage activity.

DR. SCHEUREN: It is important, though, to have this candid tradeoff which the Food Stamp example was beautiful. I am going to give up something. I have two bad things. I am going to keep one and make it better and give up the other one. That is the right -

I mean, sometimes, you can improve the operating efficiency of an organization by simply focusing on a weakness, and that certainly happens, but, often, it is weak because you are just not spending resources on it.

MR. PREVOST: Well, yes, and just to add to that, I mean, how often have you worked with an agency and they say, I don't know why we are collecting this piece of information. It has been on the form for the last 25 years, and you finally say, I really don't need this. You have automatically improved their efficiency, but you need to capture that, so that when you go back again or to another agency, you can start using these benchmarks to show them that, yes, it is effective to do this work.

MS. MADANS: Kind of expanding on what Ed was talking about yesterday and what has been brought up, I think it is true that there is not a lot of glory in doing some of this basic work, telling people what is wrong, and it doesn't make you popular, and it certainly doesn't get you articles in JAMA.

But I think a lot of the agencies are trying to incorporate more methodologic program, and what they need to hear is that that is useful, and useful to users, useful to the people who are advising the Secretary.

I mean, there has to be some reinforcement, because, in order to do it, if you are not going to get new money - which, of course, we would all rather have new money - then, you are going to be not doing something else, and so you also then don't want to get bitten by saying, Well, why are you not doing X? Well, because we are doing quality stuff, and I think we are moving a long way to providing metadata, especially the Web. It is much easier to do it.

We used to try to do these reports. They were hard to do. Well, now, we have put all the metadata, all the quality stuff up on the Website, it is so much easier.

But there has to be some acknowledgment from someone that this is an important thing to do.

We would like - I think - what Ed was really talking about was a grant program, which is a nice thing to do, because you can bring the outside - the community in.

We like to have IPAs come, because, then, they can have access to things that they can't have other places. We have a senior fellowship through ASA. These are the kind of things that we would like to have external input on, but it is a cost, and if there is no changing that incentive, if somebody is not saying, This is a good thing to do, people are going to say, Well, it is not considered important. It is more important for me to go out and -

It is more expensive to figure out that I don't need that item that I have been collecting for 25 years and no one is looking at it. There is a cost to doing that. I might as well just keep collecting it and everyone will be happy.

DR. SCHEUREN: There might be a stakeholder, like the Census Bureau, that needed it, too.

MS. MADANS: Well, I'll tell you, every time we take something off that no one has been using for 25 years, somehow -

DR. SCHEUREN: Somebody needs it.

MS. MADANS: - the person who wanted it, has my home phone number.

DR. STEINWACHS: They may be resurrected from the graveyard, right, Fritz? Some of those people coming back.

DR. SCHEUREN: That is another project I worked on, yes. We won't do that now.

DR. STEINWACHS: We have talked a little bit about what might be possible areas in which the National Committee on Vital and Health Statistics could make recommendations, and, as you know, those recommendations are made to the Secretary of DHHS, but those recommendation letters also go more broadly, in terms of being shared, and so some of those recommendation letters in the past have made it clear that the value and the idea that DHHS might take the leadership of bringing together multiagency groups to try and address specific issues, and so in areas of consolidated health informatics and some other things, there have really been multiagency activities that have gone outside of DHHS.

So I just wanted to have a chance here, if there are other areas that you would like to identify where you think the capacity of this committee to make recommendations would be valuable.

Deb.

DR. SCHRAG: I just want to bring up the issue of death certificates. So mortality is a great outcome, and we all love mortality, and it is measured pretty well.

DR. STEINWACHS: I don't like that outcome whatsoever.

DR. SCHRAG: Okay. Well, I mean, dead is usually dead, but we all pick up the newspaper, and, not to mention journals, and look at issues of cause-specific mortality, which is sort of one of the key vital statistics, and death certificates in most states basically have proximate cause, contributing cause and underlying cause, and it is one of the most misused statistics are cause-specific attribution. It is an area where there is total anarchy, very little guidance, very little in terms of metadata files, very little in terms of training the individual - I mean, this is something that resides at the level of the individual providers.

I will speak as a physician. I code death-certificate data. I started doing it the first week I graduated from medical school. Those are the people who code death certificates for deaths in hospitals.

You are called to see a dead person. You don't know anything about them. It is two o'clock in the morning. You know what the easiest thing to do is? It is to write down, Proximate Cause: Heart Attack. Contributing Cause: Edema, pulmonary embolism. Underlying Cause: Cancer, or - based on what floor the patient - It is done terribly. It is done terribly all over the United States.

It wouldn't take that much - There are not even documents. There is no training for physicians.

MS. MADANS: There is. There is. There's a lot of -

DR. SCHRAG: There is documentation, but there is no training at the level of the people who are filling them out.

MS. MADANS: We should talk. There is training. We develop a huge amount of training.

I am agreeing with you. There is a lot to be done and there is a huge literature on death certificates and the problems with death certificates and cause of death, especially in the -

DR. SCHEUREN: Your mike is not on.

MS. MADANS: Yes, it is on.

DR. SCHEUREN: It is? Okay.

MS. MADANS: I mean, I think it is a good example where there is information. We try to - This is a state issue, because states are in charge of death and birth and marriages and divorces, but it is how do you get - when the source of the data - you have no control over - How do you improve quality? And that is happening in all of these administrative systems, because you have no control of the person who is actually filling it out, and, I think, in birth certificates and death certificates, there actually is a lot of work with coroners. We go to conferences. We do the training, but if it is not in medical schools, the medical schools don't want to hear about it. They are too busy to worry about death.

It is very hard to get that dialogue going, and unless - There has to be kind of this joint agreement that this is worth time and effort, and how you do that, I don't know.

DR. SCHRAG: And I guess - I think - Yes, it may be at coroners, but it is not at medical schools or medical providers, and I think the way to get them to pay attention is to incentivize them by providing them with data back.

I don't think it has to be monetary incentive, but they actually care about the data, and if they see what the data can do for them, then, all of a sudden, I think that you would see even more in terms of input into training, because it is an absence. Maybe some states are really good at it.

DR. STEUERLE: This last discussion, let me say, I do think that is an area, not just the subcommittee, but the full committee has recently decided we may need to engage. We did have some discussion on vital statistics and even the issue of whether - this is also an issue for us - is whether we need something like, I am thinking like the multistate tax commission, where it is not tax, but whether the state-level attempts to develop protocols and standards and stuff like that are adequate.

I don't know enough about it to say what could be done, but I think it is something we should probably proceed to do. Although, I am somewhat a follower of Woody Allen, who said, I don't mind dying. I just don't want to be there when it happens.

The question I have is, with respect to ways of - I mean, getting back a little bit to incentives. I am thinking of a couple of gaps, and I don't know fully how to approach it, but let me base one on experience like Fritz and I have had at IRS, and that is with respect to data that is in one agency that is not related to their mission that no other agency really has a strong incentive to get at.

So I am thinking this may sound minor to people here, but IRS has data on health savings accounts.

Now, I don't know how much effort they are or are not putting into it. There are people, for health reasons, who have a very different interest in developing data on health-savings accounts and whether people who use health-savings accounts get the same healthcare as others, which would be very different than IRS's concern, which might be whether people are paying their taxes adequately or IRS's data on the use of deductions. I don't know whether deductions, which are probably concentrated on things like nursing-home care, would be related to somebody doing research on nursing-home care.

Certainly has, indirectly, data on - to the extent it is listed on employers - payment for health insurance.

So there are these data sources with someplace like IRS, but my sense is there is very little incentive for HHS to say, Gee, we'll provide 10 percent of our health research funding to IRS to go do this data, and yet it could be within IRS there is some really, really valuable information, and I used IRS, because I used it as an example, but I am sure this is true across a lot of agencies. It could be that adding a health variable to one of these educational longitudinal studies could really help us to understand the extent to which education outcomes are good or bad.

To the extent you can talk the agency into doing it, it is one thing. To the extent you actually provide funding to other - interagency funding is another, and I know there's been minor examples of BEA, through - a lot of pressures funding, say, IRS to get data, but that is because IRS has the core data to develop national income statistics. So you are at that level, sometimes you can get interagency funding.

So I was just curious - So that is one area where I wonder whether we need to examine incentives.

And the other one has to do with whether we do come back and do self-criticism enough.

You know, you mentioned several times the improved quality of Food Stamp data, but I know for certain debates I am in, the quality of the data is very misleading. So I am in this debate all the time of whether income credit has higher error rates than Food Stamps, and there's always these - Well, Food Stamp error rates are down 5 or 10 percent.

Well, the reason Food Stamp error rates are 5 or 10 percent, they have a monthly income-accounting system. They essentially have abandoned that and said, Well, we are going to assume that, for compliance purposes, if you reported your income right one month, it is right for the next six months.

It is also accurate mainly if you are talking about people who don't have earnings, so people who aren't in the workforce at all, sometimes your earnings - your accuracy of your earnings record is correct, but it is very inaccurate if you get to the question about the three-million missing people in the Census who are often in these Food Stamp households, depending how you measure households.

So there is this argument that - this is a big policy debate - the earned-income credit is a much worse measure than Food Stamps, but I think actually - at least I look at the data, I don't think it is, but for the health data, there is probably the same thing. How do you get outside researchers to come in and be critical - I mean, the agency has very little incentive to have people come in and be directly critical about the way they are doing things, and I am just curious the extent to which we need to raise that issue as well.

So I give two examples, to summarize. One is how do we get outsiders to come in and do critiques? How do we pay outsiders to come in and do the necessary critiques, when the outsiders don't have a strong incentive system?

And, secondly, how do we get agencies to do more cross funding, where it may be very useful - another agency has the data they really like?

DR. SCANLON: I wanted to just sort of comment, because of an experience with the - potential experience with the IRS that never sort of happened, and kind of with the principle that - recognizing that the unique role of the IRS as the tax collector should be taken into account, and that one needs to sort of approach that question, because that is a controversial role, in terms of the public.

In HIPAA, one of the features of HIPAA was the medical savings account, which was the predecessor to the health savings account, and it was actually the thing that held up HIPAA for a long time, and there was finally a compromise allowing for a demonstration program to go forward with 750,000 sort of people that were going to enroll and an evaluation to be done or be contracted for sort of by GAO.

And when we thought about that evaluation, we realized that the only way we were going to identify sort of the people easily that identified people with a medical savings account was to go to the IRS, and the issue was sort of what were the sort of patterns of use among sort of people with medical savings accounts versus other parts of the population, because the people that oppose medical savings accounts say that people are going to forego sort of important needed services because of the high deductibles and we are going to get sort of poor health sort of as a result.

We rejected that sort of as an approach because we couldn't imagine sort of having it leaked that the idea of, Hello, I am from here - from the federal government to ask you about your health. I got your name from the IRS, and that is the kind of thing, I think, that we - it was not a good application.

Now, I think there are other applications that are potentially good. I mean, it is a question of whether, within the context of information that the IRS has, there is more that can be gathered in terms of useful insight for healthcare, but, at the same time, there's limitations on what we can do, and I think what we need to do as we move forward is to think about sort of how do we give advice about how you maximize sort of the benefit without incurring the risk.

We were worried about a backlash in that regard. This was 1996, okay? The climate was even sort of more sort of, I guess, fractionated, sort of then than it is today. There was a very sort of strong sort of political sort of atmosphere that - and we were worried sort of about what would be the response to that.

And so I think that we need to sort of find - I mean, as we give advice to the Secretary, find sort of the directions where there's some safety in the path, I mean, because we can sort of make some progress forward, but if there is going to be a response that is too negative, we are going to be set back for too long and it takes too long to recover from those setbacks.

MR. PREVOST: Just an idea, and perhaps this is coming from someone who has been head down in the trenches for too long.

As I look around, one of the things - I love coming to these conferences or conferences like this because I find out about all sorts of data that I never even knew existed before, and I find that as a concern as a person who had operated as a statistician - now, I am just a manager - but - and had to solve real-world problems and really didn't know that these resources existed.

And, furthermore, are the data available? What can be done with that data, and is there some possibility that there could be an interagency working group that could be looking at both data availability and linkages and measuring the quality of those linkages at the working-person level, not a bunch of us managers around, but those people who are statisticians who have to solve real-world problems that each agency, frankly, has?

And the incentive here is that they would be working together to - as a problem comes up with Agency A, maybe you've got an interagency group that could be focusing on that, looking at specific files, so that each agency that is participating in it could see that as a shared resource upon which that they could tap.

Now, this may be a utopian view of the world, but it is just a thought, something for you guys to ponder.

MS. OBENSKI: I guess Ron and I have been working together too long, because I was kind of converging in the same direction, and that is, I guess, that I have never actually been to a conference like this, and I think it has been one of the most remarkable things that I have experienced, in terms of making progress in record linkage.

And I guess where I was thinking was trying to use kind of like the Medicaid undercount models, like what are the big problems you are trying to solve that could warrant record linkage, and then building on what Ron said, bring the right groups together, and something like what we did in the SHAY DACK(?) or the Medicaid undercount model, because that is where the incentives are.

The incentives are what is a big problem and what are the pieces of it and what are the different views of the different agencies and what do they have to gain from it. Because we've got states involved that are willing to be involved in this project. We've got ASPI(?). We've got CMS, and, as I said yesterday, what is remarkable is everybody is coming at it from a different - for a different reason, a different agenda.

Ours is improved statistics. Mike's is improving the CPS, so he has better data to do his research, and so I think that that would be a tremendous outcome of this group.

DR. STEINWACHS: I have recorded it. It has been fully recorded.

Other ideas, suggestions?

Sally, you were mentioning states, too, and I think one of the themes that has come up over and over again is are there better ways to work with states than we are doing now or maybe more integrated ways that bring multiple agencies together to work with the states?

MS. OBENSKI: I think that our experience, albeit limited, is that I think that whoever brought up the question about incentives, I do think that there appears - and this is just an observation - my observation, not the Census Bureau's - is that there seems to be, from what we have experienced in working both with state folks and with federal folks on different projects, is that there are competing agendas in terms of what the states' incentives are in administering their program and what the federal incentives are, and that is very, very important to understand how to bridge that, because big federal programs that allocate to the states need the state's help in administering the program, but that is just an observation that I think needs to be addressed before we are really going to make these two pieces fit.

DR. SCANLON: I have been here thinking that we potentially need some kind of process for reconciliation.

I mean, I am guilty sort of, I think, over the last two days of being kind of the nay sayer in terms of quality of information and sort of saying we have to worry about the quality of information, and I'll tell Deb that, almost on a weekly basis, I do say, Don't let sort of the perfect be the enemy of the good, and I still believe that even though what I have said here over the last couple of days.

But it seems to me that sort of with respect to quality of data, with respect to privacy, there is an issue of the tradeoff between that and the social benefit we might get sort of from sort of moving forward, but I guess I am concerned that we don't have - necessarily have a situation where the decision maker is weighing those two things.

The decision maker may be approaching this from sort of one perspective, okay? Their job is that, under statute, they are to protect the privacy of sort of all of the respondents that are in this data set, and they are moving toward the point where the probability of disclosure is .000001 - okay? - and is that sort of where we want to be from a social perspective? And the answer is, potentially, not.

It would not be unreasonable, potentially for a government. It would not be sort of an arbitrary and capricious thing for government to do to increase the percentage of disclosure to sort of have three fewer zeros sort of in front of that one.

And the question is how are we going to get there? How do we get there in terms of weighing sort of these benefits versus the risks or the - of either sort of privacy or of poor sort of data sort of leading to erroneous conclusions?

And I don't know whether the IRB model, which was suggested here, whether there is some variant of it that we could think about sort of in government or whether we should think about somebody who does - in some respects, serves as the arbitrator, hears both sides of this and comes to some conclusion as to what is the appropriate sort of tradeoff, because all the things that we do with respect to statutes, all the things we do with respect to guidelines are still going to have subjectivity in them. Somebody is going to have to make an interpretation saying, This is reasonably consistent with that statute, reasonably consistent with that guideline. It is never going to be black and white. It is going to be something where you can say, Okay. Clear case. The clear case is not to do anything, which we know has incredible social -

So that in trying to atone for my negativity about sort of quality, I wanted to sort of offer that as a process way of dealing with moving forward for the future.

DR. STEINWACHS: You have atoned well.

MR. DENBALY: In terms of process, and in the context of everything that we have been talking here, suggestions that are unmade, working groups are being put together, and perhaps one project can be picked up and used as examples.

One that I have in mind is one that Ron listed as the first one on his slide, high-valued future research project, and that is the connection of NHANES(?) to WIC and Food Stamp data, in the context of health.

I think what we eat is probably one of the most important things. We need to understand why we eat, where we eat it, how we decide how much to eat and so on, and, in particular, we are spending over $20 billion on Food Stamp and other programs as such. We need to understand the consequences of these programs and even the administration of these programs. So studies, as such are highly policy relevant.

So we have this group in here and a lot of the issues that we have been dealing with is what you have been talking about, states and providing the data and linking up with NHANES is very tough.

So I am suggesting that this group is definitely needed to give leadership to addressing some of these issues that we are - in a broader view, that we are dealing with, and perhaps this group and groups as such can work together to bring it to the table to say, Here is the kind of problems that we are dealing with. What should we do with it? How do we address it?

MR. IAMS: My hope would be that you would do something that would make it possible to use better data in policy analysis and evaluation by permitting a greater exchange of linked survey data across agencies with less pain and suffering or prohibition.

For example, Gene Steuerle yesterday said, Gee, if you had earnings records tied to Medicare records, you could look at something connected with Medicare expenditures and lifetime benefits and lifetime earnings.

That would not be possible to happen in the current legal context. Our administrative data - Well, maybe it could, but I doubt it, because you wouldn't have any authorization for the Internal Revenue Service to permit this to happen.

My ultimate goal would be if it is for statistical purposes in a safe, secure environment that is not going to violate confidentiality that any exchange should be possible with government data or government-linked survey data to other agencies.

We have some linked survey data that, if ASPI had it, their decision making would be a heck of a lot better than it is now. Rather than making it up on the back of an envelope, you would have something that is closer to relationships that exist. I won't claim that it is perfection, but it'll be closer.

The Title 13 constraint of Census data linkage is very, very limiting, in terms of this kind of thing. A straight policy analysis on something connected with Medicare or whatever might have nothing to do with data quality. It is a question of having information connected with a decision on what the agency supports or doesn't support.

That kind of exchange in matching up data that currently don't ever meet each other isn't going to happen without legislative changes that make the statistical purpose and a legitimate or the policy analysis of statistical data in a secure environment a legitimate activity that the federal agencies can exchange and share data amongst each other, and I don't know the prospects of that.

Brian Harris-Kojetin agreed, when I mentioned it to him this morning, that we really need legislation that allows data that is tied up in one agency to be shared for statistical purposes at another.

Of course, you'd want to make sure that it wasn't going to be used for administering benefits or taking sanctions or anything of that nature. So you need some sort of protections like CIPSEA has offered. I don't know if CIPSEA is the proper place. One person pointed out if you opened CIPSEA up, you could have bad things happen to CIPSEA.

But we - I think the exchange of linked data and linking more things would lead to wiser policy analyses and wiser decisions, and the agencies need to be able to share more than they can share.

MS. TUREK: (Off mike) - the statistical enclave ASPI would not qualify, because we are basically a policy office in the Office of the Secretary, but we are probably among the most intensive data users in HHS, and we have the broadest data needs, because we really are analyzing all the federal programs, and if they do open this up, I hope they can figure out a way for us to have access to the data, too, clearly, under controls, but none of the sharing agreements I have ever seen would include us.

DR. STEINWACHS: Joan, we have to send you to a secure data center and not let you out.

MS. TUREK: (Off mike).

MR. IAMS: Well, but Joan couldn't come and use data at my secure data center. I am three blocks away from her. We would have to have a special agreement and you'd have to figure out some Title 13 purpose with Census to permit that to happen.

So it is not just sending someone to a secure data center. If they are not from the appropriate agency with - the appropriate legislative requirements, they can't -

MS. TUREK: (Off mike) - our travel budget.

DR. STEINWACHS: I'll get Ron in just a second.

I thought you were saying, Howard, that if we could change the legislation, it would be, in a sense, to create something where someone could, like Joan, go to a secure data center, do something for ASPI, which otherwise couldn't be done if you were saying transmit the data to ASPI, and so on, and that was - Okay - Ron.

MR. PREVOST: Yes, thank you.

I think I mentioned this in the first day. I just want to reiterate it. I mean, absent legislation being conducted, I think one of the things that would help all of us is if there were standardized agreements that - particularly - I am going to suggest OMB. I don't know. They may - they wouldn't - I think the appropriate person had blessed that said these are the way that we are going to share data between federal agencies, and they had standardized components and all the lawyers understand what these standardized components are, and so we don't spend years and years and years working on agreements between the agencies.

If it was just getting rid of all the - I am not a lawyer. So there is language I look at and I go, The word is the --. What is the question? But they interpret it differently. Okay?

So in looking at this, if we had these standardized agreements with a blessing and a set of procedures that says, Yes, if Agency A and Agency B want to share data and they both believe they have the right to share data, how can we cut the time that it takes to do this?

If we could do it in two months, rather than two years, it would be a huge advantage to the entire federal government.

MS. MADANS: We have kind of had two - several parallel conversations going on the two days, one of which is about access and confidentiality and the other is kind of the ease of the linkage.

I think we absolutely need to do what you just said, because it is - It won't solve the problem that Howard brought up. I mean, if the legislation says you can't do it, then you can't do it. So we need to fix that, and we need to figure out how to do it quicker, but a lot of these things are done with the understanding that there will be very good confidentiality protection.

And Fritz brought up the IRB, and I can tell you that our IRB, which is very well schooled in statistical uses - because that is basically all we do - much of what they allow us to do, without getting very explicit consent - I mean, really going through every possible bad thing that could happen to you - is because we will protect the confidentiality of the data.

And so while many people really are pushing for more access, the more you expand that access or there is the perception of that access, the less you probably are going to be able to do in terms of the linkage, and it shouldn't be an either/or, but, at some point, it will be, and maybe we have been - We have always dealt with it is either public use or it is confidential. There is nothing in between. You know, it is either in the data center or it is on the web, and we are changing that. I think we are trying to think through - especially now that we have the sworn agent - what kinds of risk, what kinds of - this is kind of the - what would you call it? - the portfolio idea.

But it is going to take a lot to work that out, and while we are figuring out how to do these standardized agreements, it would be nice to be able to figure out what are the primary criteria, how you determine what kind of data you can put where for what kind of access, and, for certain things, I am sorry, you are going to have to come to a data center, and when - I think it was Heather, when she said she remembers the punch cards and going to the computer center at three o'clock in the morning, because it was cheaper and all those good old days, that we are now in - we have been in a position where data access has been very easy. We went through a period where - just gave you the CD-ROM. We put it on the Web.

I have a feeling that was a very, very nice golden age in terms of data access. That is not going to be the main access route in the future, that there is going to be more of a range, and we can certainly make people's life easier, but users, I think, also have to kind of get a reality check that linked IRS data with our DNA information is not going to be easily accessible to users, that we are going to make you jump through lots of hoops before we give you that, and I think that is our responsibility.

DR. BREEN: I think, though, that - you mentioned we had been talking on a number of parallel tracks, and I think one is providing data access to users outside the federal government, but the other is that even - we can't even provide access within the federal government in a timely manner.

So I think that we need to think about both of those things, and, then, I think this notion of the portfolio that Julia had yesterday is a really, really good idea, and a lot of people have mentioned that, that there can be various hoops that you hop through depending on how much detail you want, and for exploratory analysis, and some of the original stuff, maybe a public-use data set or some very basic information is just fine, and it is only subsequent to that, when you are - you have found you've got enough information to test your hypotheses and you can write your grant with that, that then you can move forward with the rest, and maybe we need to relax some of our standards about how much information you need to provide in that grant, too.

And one other thing I wanted to mention was I think it was Gene who said, Well, can't people come in to the federal government and kind of take a look around and maybe examine the culture and make some suggestions, and, certainly, they can, through IPAs, and I know at NCI and at NIH, generally, people come in and do program evaluations. They come in a team of people almost like they would in an - Well, like they would in an academic department for accreditation purposes to evaluate what is going on and make suggestions on what are the strengths and weaknesses and where you might want to be in five years or 10 years or something like that.

Plus, on a smaller scale, we have had people come in under IPAs - anthropologists, management specialists - to evaluate what is going on and to make suggestions.

So all of these things are possible, and I think there are a lot of creative ways that we can think about using and building on what we already have as we are trying to get legislation changed, because legislation change is a long process, and so I hate to put all our eggs in that basket.

MS. GENZER: I wanted to go back to the data-center issue.

I think it is very important if we are moving more to data centers that the data centers be staffed adequately so that once a person has access to it that they can then get the data that they need.

For instance, if you need to - if you are granted access to certain variables, that the agency that is providing the data at the data center has the staff to be able to get that certain data to you that you have been authorized to use, so that you are not cooling your heels for an unspecified period of time, and that is especially important if you are a federal agency working with a contract and your contractor has a specific schedule and you don't want that schedule to be way off whack so that the staff on the contractor that you have available for your contract is now busy doing competing contracts, because they were expected to be finished.

So I wanted to mention that.

MS. TUREK: I was thinking about what Heather said this morning - A lot of things we are doing, two months is too long. The issue - we need the answer in a week.

So if we have to go to data centers, there needs to be some kind of agreement that we can get much faster turnaround than two months, which means we would have to negotiate a more general agreement that would allow us to do certain classes of studies rather than a particular project.

I mean, it is in the nature of the policy arena that - two months it is like the long run. We are all dead. I mean, frequently, the issue - and this is something the Secretary is interested in, because he is the one who is being given the information - or the White House or the Hill.

So we go up and say to him, Because we have to go to data centers, we can no longer do this kind of analysis, it wouldn't sit too well.

MS. MADANS: So global access.

MS. TUREK: I mean, I think you can - for certain kinds of studies, rather than each study very specifically deciding that you are allowed to do certain classes of studies.

MS. MADANS: We had a call from CBO last week, was exactly this, and that is what we are doing. They are going to have a generic kind of topic-specific, and when they need to run something, they just have to make sure somebody is there. Unfortunately, our data center is not busy every day, so there is not someone always there to welcome you, but as long as we are there, they can come and do the analysis. So I think that is the kind of thing you have to kind of build into it.

I don't know - We have different authorizing legislation. I don't know if Census can do that, but ours does allow us to be more generic in terms of what the project is.

MS. TUREK: If I understand with Census, it has to serve a Census purpose. So you have to find some way that the results can be used that are in line with your mission, not so - I mean, if Census wanted something and the best place to get the data was the Census, would it be covered by Title 13 or would they be told, You wrote this law. Too bad.

MR. PREVOST: Well, I am certainly not the one to speak for every word of our mission, but, I mean, certainly, what our job is is to disseminate data. I mean, it does no good to go out and be the collector of data if you cannot provide it to the people who you are supposed to be delivering it to, and that is why we have the Research Data Center Network.

And in doing this, and as we have said around the table here, that in using the data, you can find out where the warts are and you can tell the folks what they can do to improve the information; that is, if you are working with our data at a research data center, is a Title 13 purpose, to research and to suggest improvements to the data that we have at hand.

MR. RILEY: We have talked about some of the administrative data sets and the fact that, I guess, the lack of documentation on some of them is a barrier to people using linked data sets, particularly if they are used to, say, analyzing survey data and they find the prospect of adding administrative data sort of daunting, and CMS has started the Resdac Project to help new users of Medicare data. Resdac is a contractor to CMS that helps people, not only use just Medicare data, but they help SEER Medicare data users and so forth, and they have taken that on as part of their responsibility.

So that might serve as a model for other agencies that have complex administrative data sets, and it might help increase the demand for data linked to those data sets and might be something to be considered.

DR. STEINWACHS: Is there something up on the Web that talks about what they do? I was just looking for a document.

MR. RILEY: Resdac has its own Website. It is www.resdac. - it is at the University of Minnesota.

DR. STEINWACHS: So it is probably -

DR. DAVERN: I think it just changed, because all the Websites changed, but you can just get there - resdac.org, I think, gets you there.

MR. RILEY: I think it ends in edu, but I think there might be something about Minnesota in between resdac and -

PARTICIPANT: Google it.

PARTICIPANT: Google it. That's -

MR. RILEY: Yes, Google it. Google resdac, you'll get it.

DR. STEINWACHS: Yes, we've all got consensus, Google is the way to find it.

I thought I would start bringing us to close, because I know we have lost some people, and I really appreciate all of you who have stayed throughout this, because this is really what we had pictured as a real chance to have a dialogue that was among the committee, the agencies, the users and the producers and providers of very critical information that supports both the nation's mission and, from our point of view, is critical information to understanding health outcomes and ways to intervene to improve that.

Just so you know what we are doing, we will be looking at what we have heard and what we have gotten from you, in terms of our capacity to use that to make specific recommendations to the Secretary and, as that moves ahead, we'll be happy to come back and share those products once they are actually approved by the committee that would be going to the Secretary. They get posted at the Website at the time that they are actually signed and sent to the Secretary.

There has also been a suggestion here, which we are going to take back, that maybe there ought to be subsequent meetings, so that we may be coming back to you talking about what kinds of topics.

There has been a lot of discussion about state data and might have some of the same people around the table here, but people from the states or others. There might be other issues, so that we are considering are there next steps that are important for us to take in learning and gathering information, and those will probably be more focused.

I think this has been a very open and, to me, a very, very productive - I sort of thought I knew what was going on, and this convinced me, after about the first five minutes, I didn't, and I was here to learn a lot and have really, really enjoyed it.

I also want to thank Cynthia. Cynthia has disappeared, but Cynthia may hear me someplace that, without her, we wouldn't physically be here, and there was a lot of rushing to get this all together.

And very much, Joan, I appreciate that you now - I now know you know everyone and - Well, everyone that is important anyway, and without you, this wouldn't have happened and -

MS. TUREK: And I want to say thank you to everybody for agreeing to be part of it. I think it was the participants who really made it, and, although you hear me complain a lot, I think all of you are really great, special people and you do wonderful work.

DR. STEINWACHS: This has been recorded. So, now, you can play it back anytime you want. Joan has certified. Joan has said -

I also, again, from the committee's point of view, want to thank Gene and Nancy, because it was really their leadership that got us going as a committee on this, and, then, Jim Scanlon said, Hey, ASPI is very interested in this. Joan is working on this. Brought us together as a team, and so the rest of us have benefitted from all of that, and certainly benefitted from everything that you have come here and shared with us. So -

Also, so you know, the audio tape - I think I mentioned this - will be posted on the Website, once it is done. There will be a written copy of the transcript of this, and that our plan is to post the slides on the Website, too, so they are available, so that when people listen to the tape, they can see the slides as well, as we were going to copy them all for the committee members, because many - made many points in there, and we don't want to forget those, and we want to capture those.

I don't know if there are any other comments by committee members, but, if not, thank you all very, very much.

(Whereupon, the workshop adjourned at 3:25 p.m.)