NIH Clinical Center Radio
Transcript

NIH CLINICAL CENTER BTRIS SERIES PODCAST
Episode 2009-003
Time:  01:12:14

BIOMEDICAL TRANSLATIONAL RESEARCH INFORMATION SYSTEMS:
STANFORD UNIVERSITY’S “STRIDE” PROGRAM.

Presented by Dr. Henry Lowe, Associate Professor of Medicine (Biomedical Informatics), and director of the Center for Clinical Informatics and Senior Associate Dean for Information Resources and Technology, Stanford University School of Medicine.

ANNOUNCER:  Discussing the use of information systems in translational research – this is the NIH Clinical Center Biomedical Translational Research Informatics Seminar Series.
 
(Music establishes, goes under VO)
 
ANNOUNCER:  Greetings and welcome to the NIH Clinical Center Biomedical Translational Research Informatics Seminar Series.  On today’s episode, we feature Dr. Henry Lowe, Associate Professor of Medicine (Biomedical Informatics), director of the Center for Clinical Informatics, and Senior Associate Dean for Information Resources and Technology, Stanford University School of Medicine.

Dr. Lowe will discuss a program at Stanford which is similar to the BTRIS Program at the NIH Clinical Center. If you would like to see a closed-captioned videocast of today's subject, log on to videocast.nih.gov and click the "Past Events" link.  Now, we take you to the Lipsett Amphitheater in the NIH Clinical Center in Bethesda, Maryland, where Dr. Jim Cimino, Director of the Laboratory for Informatics Development, will introduce today’s speaker.
 
(Music fades)
CIMINO: Good afternoon. Thanks for coming. We have a couple more stragglers coming in. Thank you for coming. My name is Jim Cimino. I’m the chief of the Laboratory for Informatics Development. Our lab is the sponsor for this more or less monthly seminar series on translational informatics. This is the third of our series now, and I’m very delighted to introduce a friend and colleague, Henry Lowe from Stanford University.

Henry and I go way back. We were postdoc fellows at Massachusetts General Hospital back in the mid '80s, and spent months programming together. He taught me everything I know about Macintoshes, which obviously isn't much now. Since then he's been at the University of Pittsburgh in the informatics department there, and for the last 8 years he's been at Stanford University, where he's a senior associate dean for information resources and technology. He's going to talk to us about the project he's doing, which is analogous to the BTRIS project that we're doing here.

LOWE: Thank you, Jim. It's a pleasure to be here. I’d like to thank NIH for inviting me to speak.

So I’m going to talk today about a project at Stanford University Medical Center called the STRIDE Project. There are different facets to this project. I’m probably going to spend most of the time talking about the clinical data warehouse component of STRIDE. STRIDE stands for Stanford Translational Research Integrated Database Environment.

Just a little bit of context here. Stanford University Medical Center is on the Stanford campus, which is about 45, 50 miles south of San Francisco. It's a beautiful campus. It's a complex, research-intensive milieu. Stanford University School of Medicine has the highest NIH funding per faculty member of any university in the United States. We also have two high-acuity hospitals, a children's hospital and Stanford Hospital and Clinics. We have an NCI-designated cancer center and we have a CTSA grant. We have 3 separate IT organizations in the Medical Center. In addition to my senior associate dean role I’m the CIO at the medical school also, but these groups all work very closely together because our users expect an integrated and coherent environment to meet their needs. And I think there are a lot of parallels, from what I heard from Jim this morning, with NIH's needs, specifically around research-focused data, and probably a lot of similarities between our environment and many academic medical centers in the United States. Our experiences over the last several years hopefully will be helpful to others just starting on this particular path.

We have, as I mentioned, 2 hospitals that are part of Stanford University. Lucile Packard Children’s Hospital implemented an electronic health record. Stanford Hospital and Clinics is the adult hospital. Both of these hospitals do both inpatient and outpatient care. The adult hospital moved from IDX to Epic. They're about 3 years behind the pediatric hospital in terms of implementation. One of the advantages of both of these (…), we were able to take advantage of a process that had gathered up and, to some extent, cleaned up legacy clinical data. So in our clinical data warehouse we have data that goes back to 1995 at Stanford University Medical Center. And so early on in this project, I decided that I had to have a single sentence, and I’m told this is grammatically correct.

If you just wanted to take away one slide, this is basically my answer to the question, what is STRIDE? STRIDE is intended to be a secure, HIPAA-compliant repository of clinical and research information, in the form of electronic data and imaging data, linked using national and international data representation standards, designed specifically to support Stanford Medical Center. That's what STRIDE is, and that's our motivation in doing this project. So to give you an overview, there are 4 components to STRIDE at Stanford.

Think of each of these as essentially a service that we provide. In addition to this being an informatics research and development project, it's an effort to deliver services to the research community. The first service is to provide efficient access to Stanford University Medical Center clinical information for research purposes. We're going to focus on this in some detail during the talk, so I won't dwell on it here. The second goal for the project was to develop, on the same platform as the clinical data warehouse and using the same technology, a service that offers secure data management to researchers. I’ll mention this a little bit in the talk. It won't be the main focus. This is an area where we've had some success. We have also designed into the system the ability to do enterprise-level biospecimen management. I’ll have a slide or two on that.

This is really addressing an issue most academic medical centers face, which is a multitude of biospecimen banks scattered across the enterprise, and a great difficulty for researchers in discovering exactly what biospecimens are available at the institution. We have, we think, a solution for that particular problem. And then the fourth component is a very important one, which is the interface between all of these ideas and technologies and services and the community that we want to serve. We have done this by offering a free informatics consultation service to any faculty, staff or student at Stanford University as a whole. Our CTSA is a Stanford University CTSA, not a medical center CTSA.

So we offer this to the entire community. Using a web-based request, people can make an appointment and come and meet with informatics experts, and we provide a variety of different kinds of consultations, sometimes mixed, around how to do research data management, how to get access to clinical data for research purposes, assistance with privacy and consent issues and security issues, and also putting together an IRB protocol for something that has an informatics component to it. We have been doing this for a while. I arrived at Stanford in 2001 from the University of Pittsburgh. I report directly to the dean of the medical school, Philip, who spent decades at the NIH Clinical Center. And Phil really gets it. He's a strong proponent of translational research.

When we came to Stanford, one of the activities that he started was developing a new strategic plan for the Medical Center. This was titled translating discoveries. It was released in January 2002. In part, my recruitment was related to his realization that informatics was going to be a really important player in solving the problem of how you improve the overall efficiency of translational research. And we spent the first year or so of this project doing what you're supposed to do in a research project which is looking at the literature, trying to understand what had been written to date. And in particular trying to get a sense of what the research community needed in order to be more effective at clinical and translational research.

Around that time, there was an influential report from the National Academy of Sciences that was published in JAMA. It was about barriers to effective translational research, and in there were a couple of barriers that seemed to me to be specifically addressable using informatics. And so we started to think about, with obviously limited resources, what we might be able to do to tackle those barriers. The major barrier was really the question of how researchers can have more direct access to clinical data, both for hypothesis generation, for exploratory research and for actual research projects. And in 2004, having conceived the name STRIDE for this project, the first thing we did was to submit a research protocol to the Stanford IRB. I’ll come back to that.

This is how we have positioned our project at Stanford. Our entire data repository project is a protocol approved by the IRB and renewed on a yearly basis. In 2004, the (…) for clinical informatics that I direct was created to engage in applied informatics research, development and service delivery focused on meeting the Medical Center's clinical and research missions. In 2004, after a couple of years of negotiations, we partnered with our 2 hospitals to jointly launch the STRIDE project. There was a legal agreement that was created and signed by the partners, and it specifically stated that the purpose of the agreement was to make available all patient information obtained in an electronic form. I’m going to come back to this to talk about the model that we've used to gain access to clinical information while still protecting the privacy and security of the data.

So we partner and continue to partner very effectively with both hospitals, and they look to us as their enterprise-level partner for meeting the needs of researchers. I think this is an interesting model, because certainly at the other institutions I have been at, at Harvard and at Pitt, at least at the times I was there, there was a problem in that researchers were depending upon the people who operate clinical systems to meet their informatics needs and their daily needs vis-a-vis research. They're very different things. Those folks are increasingly overstretched just to operate the clinical systems effectively. They understand clinical care but don't understand clinical research. Our model was to say the research mission is based in the medical school component of the Medical Center. We think we understand or have a good handle on how research operates and how it should operate. Give us this responsibility. We'll take it over lock, stock and barrel. Give us the clinical data that you gather as part of patient care and we will take responsibility, in partnership with you, for delivering that information in a compliant way back to researchers. That model has worked very, very well for us. And I put this up to give a sense, and I was telling Jim about this earlier, that specifically around the clinical data warehouse component, this is a multi-year, probably multi-decade activity. This is not a research activity that you can do in one or two years. Data warehouses in industry have been around for decades, and the people who understand those will tell you that it takes, you know, a minimum of 5 years to get to the point where you are really able to deliver anything useful. So I put the rest of the dates up to give you a sense that it's now April 2009 and the inception of this project was sometime in 2003, so we're about 6 years out. And I’d say that it's really in the last 12 months that we've started to deliver services that our community begins to value and can see the benefit of.
And obviously, in the long run, having the clinical and translational research community sort of partner with you and see the advantage of what you're offering is the critical thing you need to happen.

This is the notion of, if you build it, will they come? Well, informatics is scattered with projects over the last couple of decades where it was built and they did not come. It is very, very important that they come, and it has to be through a carrot, not a stick. I think that one of the issues that we face is not dissimilar from where clinical systems were maybe 20 years ago, when people were just beginning to get a sense that it probably made more sense to develop clinical systems at the enterprise level and to have them be standards based. I remember being in institutions where each little department would have their own clinical system. There was no data interchange. That would be, in most cases today, something that would not be acceptable. We're kind of back there as far as clinical and translational research information systems go. Most institutions don't have any, or every researcher has their own.

We're embarking on, I think, a long path to have the systems be accepted as enterprise-level solutions. We need to understand what the carrot will be for the researchers in order for that to be successful. In 2008 Stanford received the CTSA grant, and one of the great things about the CTSA program is that it's providing what I would call an institutional or enterprise-level framework for doing this kind of work. It gives it a level of credibility that it didn't have before. And we now have at Stanford, as in all the CTSA programs, an informatics program that is engaged with the research community and beginning to understand the needs and beginning to deliver services. I think that was very important to us. And of course I think the other great thing about the CTSA is that it's provided a national community at the informatics level, and I know in many other domains, where we're interacting in a very useful and meaningful way with all, what is it, 39, 38 institutions.

This is great, because most of what I’m going to say today is exactly the same set of problems that people at other institutions have to face. At least some of the things that we've done may be useful. And there is nothing as depressing as seeing the wheel being invented over and over and over again. So let me go on to the next slide. If you wanted to get more information about what we're doing, this is our website, http://STRIDE.Stanford.edu. We have more information about the project at the site.

This is a slide that tries to give an overview of how we think about STRIDE at the 50,000-foot level. And the little box at the bottom is really critical. This is the foundation. We could not have built STRIDE without prior investment, and this is obviously not an issue at the NIH but it could be an issue at other academic medical centers.
In general, as I said earlier, the research mission at most academic medical centers is either a secondary mission for the clinical enterprise or not really on their radar at all. It tends to be something that's identified more with the academic environment, maybe the medical school that's part of that academic health center. Until recently, most medical schools didn't really invest a lot of money in information technology. And in order to do this, in large part it's around 2 issues: protecting the data, and developing an enterprise system that has, you know, close to 24 by 7 availability. You've got to have the investment in core information technology infrastructure. You've got to have a good network, got to have security infrastructure, and the ability to create a software engineering team, if you're developing your own system, that's well integrated with the IT organization.

Another hat I wear, I’m the CIO for the medical school, so the Center for Clinical Informatics, which is where STRIDE was developed, is an integral part of a much larger organization, and the folks who operate our network and our data center and the people who handle data security all report to me. It gave me the advantage of having the maximum amount of control over these things. I wanted to emphasize that it's really very important.

When we were negotiating with our hospitals, the whole issue of how well we could protect the data was the single most important issue we had to persuade them about. And so I want to make that point. Essentially the model with STRIDE, as you'll see, is that it's a large database into which we migrate clinical data from our 2 electronic health records. We also operate it as a platform upon which research data management applications can exist, so there is research data that goes into STRIDE also. And then at the next level up, we provide these four utilities: a clinical data warehouse, a virtual biospecimen bank, and a series of clinical research registries, each being specifically designed for a research group that needed that kind of solution. And interestingly, in the course of doing this, because we're merging all the demographic data from both clinical systems, we also have what is the only (…) wide master person index.

Each of the hospitals has a master person index for their electronic health record, specific to their patient population. I would also say an integrated data repository, and therefore an integrated master person index, is particularly useful if you want to, for example, study a disease that begins in childhood, and there are many diseases like that. Diabetes, cystic fibrosis.

So we have the ability inside the clinical data warehouse to go across the lifetime of patients with specific diseases like that, and an integrated master person index is important in how you do that. This is kind of an architecture stack for the techies that are out there. I use it as a kind of memory aid. STRIDE is built on the Oracle, currently 11g, database platform. And our data model is based on the HL7 Version 3 RIM. We have taken some components of the RIM. We decided very early on that one of the things we wanted to do with STRIDE was to make it as standards based as possible. When you start thinking about data models for a data warehouse, one of the problems is you don't really know what the thing is going to look like when you're finished. And I mentioned earlier this is potentially a decades-long project. I was talking to Jim about the MARS system at the University of Pittsburgh, a data warehouse used for patient care and research which has been in operation since the mid to late '80s and is still going strong and still being built. And so we decided we needed a standards-based data model. We used the Version 3 RIM and, like many people doing this kind of work, we are using an EAV (entity-attribute-value) model for the way we actually represent the data inside the database.
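
To make the EAV idea concrete, here is a minimal sketch in Python. It is an illustration only and not STRIDE's schema, which the talk does not show; the `Observation` fields, the patient keys, and the attribute strings are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One EAV row: a single fact about a single entity (hypothetical shape)."""
    entity: str       # e.g. a patient key from a master person index
    attribute: str    # a coded concept naming what was observed
    value: str        # the observed value, stored generically as text
    observed_on: str  # ISO date string


# Made-up sample rows; the attribute strings are placeholders, not a real vocabulary.
rows = [
    Observation("patient-1", "diagnosis", "ICD9:250.00", "2001-03-14"),
    Observation("patient-1", "lab:hba1c", "8.1", "2001-03-14"),
    Observation("patient-2", "lab:hba1c", "5.4", "2003-07-02"),
]


def pivot(observations):
    """Group EAV rows into one attribute -> value dict per entity."""
    by_entity = defaultdict(dict)
    for obs in observations:
        by_entity[obs.entity][obs.attribute] = obs.value
    return dict(by_entity)


def entities_with(observations, attribute):
    """Return the sorted entity keys that have at least one row for the attribute."""
    return sorted({obs.entity for obs in observations if obs.attribute == attribute})
```

The appeal of this layout for a warehouse whose final shape is not known in advance is that a new kind of data becomes a new attribute value rather than a new column.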

We have a semantic layer in the system. SNOMED, RxNorm; we use the NLM extensively as a resource. For example, we have some mappings between the different (…). There is a master person index, as I mentioned, in the system, which crosses the pediatric and adult patient communities. Then we have the 3 service areas, so to speak: the clinical data warehouse, the biospecimen repository and the research databases. Then it's a fairly typical kind of architecture, with an application server, up through a very complex and effective access and security layer which allows us to do very fine-grained access control down to the level of specific individuals. The system also has full HIPAA-compliant auditing capabilities.

We have a series of applications that we built on top of that. Another view of STRIDE is to look at the things we offer. This might be more of a user's view. You can see that we have, for example, in our clinical data warehouse a set of services. We have the consultation service. We have the things that are related to our virtual biospecimen bank. We have a slew of research data management applications that we're operating. And you'll see we have REDCap now, the Vanderbilt system, a very nice research data management system which we think will probably be better than STRIDE for small research databases.

We're part of the REDCap consortium and in the process of implementing REDCap. We're specifically interested in looking at how to create some interoperability between REDCap and STRIDE itself. I’ll mention at this point that one of the things we have discovered in the last 5 years is that our fairly complex and expensive-to-build research data management applications that sit on the STRIDE platform are probably too expensive and too much overkill for many of the research data management needs that people come to us with. They come and say, you know, I’m going to be doing a research project for a year, 2 or 300 patients. I have maybe 20 different data elements. We think that REDCap is going to be a much better way to provide that kind of service to the community, probably for free, than for us to, you know, have engineers build an application specifically to handle that kind of research data management need.

So let me talk about STRIDE and compliance, for want of a better word. When we thought about doing this, one of our principal concerns was to ensure we do this in a way that maximizes the protection of the security and privacy of the data. We realized we were asking our hospitals to give us a copy of all the clinical data that they had available to them back to 1995 and moving forward into the future. Obviously, from their perspective, this presents to some extent a replication of all of the security and privacy concerns they have around their own clinical systems.

So it was very evident to us, in thinking about how to get from where we were to where we are now, that we would have to come up with a model for doing this that would work for the research community but also persuade our partners in the hospitals that we could protect this data, and that they were not creating a risk situation for themselves or their patients. So the first thing we said was, the way to do this is we are going to populate our clinical data warehouse. And it's really just a case of moving the data from the clinical systems into STRIDE. And that system will be as secure as the clinical systems are. But we're not going to take responsibility for releasing that data to researchers. We're going to rely on the IRB at Stanford to manage that component, to manage data release.

So we said we'll make STRIDE an IRB research protocol and go and get permission from the IRB to create this, and it will be renewed every year. The IRB granted waivers of informed consent and HIPAA authorization to the project, which said we could move the data, if you like, from all the clinical systems into STRIDE so that it was integrated in one place. Obviously it would be impossible for us to go back to all the patients and ask for their permission to do that. But we weren't, in doing this, bypassing the patients' right to grant consent or authorization, because we were not going to release any of the data without IRB approval. And so the notion was that we would obtain fully identified data. This is a real distinction between STRIDE and some of the other systems out there: all fully identified data. Our approach is that it's relatively easy to de-identify data but it's very hard to re-identify de-identified data. And we saw lots of issues that would arise with using entirely de-identified data, some of them ethical.

If you're doing research and make a discovery based upon some patient dataset and you don't have the ability to notify the patient that there might be a benefit, we were concerned there would be ethical issues related to that. So in essence what happens is the data is transferred into STRIDE, and then if any researcher wants any of that data for research purposes, they have to go through the IRB, and the IRB grants them the data and sets the conditions around the use of the data. They come to us. They go to the privacy officer, who actually looks at that documentation, and then we get a request for that data, which we deliver back to the researcher. This model went through several months of both internal and external legal review and was signed off, and we have been operating under that model since 2004. So how does this work? Well, we have our two hospitals, and the key point of this slide, as you'll see, is the green boxes. These are essentially all the places in this process where the IRB controls what goes on. In other words, they could stop this at any point or they can decide how much of this they want to happen. So the first place where the IRB weighs in is on the process of moving data from the hospital systems into the STRIDE clinical data warehouse. That's something, as I mentioned, governed by an IRB protocol and reviewed and renewed on a yearly basis.

We have 3 services that I’m going to talk about: patient cohort discovery, what we call chart review or clinical data review, and a data extraction service. For cohort discovery, you can obviously do anonymous cohort discovery. You'll see screen shots of an application that we have developed to do that without exposing PHI, so there is no IRB approval required for that. HIPAA allows you to do chart review prior to getting IRB approval or getting patient consent or authorization, as long as you don't use the data for anything and you don't take it away, so we have a chart review application that we've built in, controlled by the privacy officer. And then the data extraction piece, this is where you actually get data out of STRIDE, and at that point you have to get IRB approval for that data.
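
As a rough illustration of why anonymous cohort discovery exposes no PHI, the sketch below answers a cohort question with a count only. It is not the STRIDE cohort tool; the record shape, the sample patients, and the function names are invented for the example.

```python
from typing import Callable, Iterable

# Hypothetical record shape: identifiers live in the record, but the
# query function below never returns them.
patients = [
    {"mrn": "A1", "name": "Jane Doe", "icd9": {"250.00"}},
    {"mrn": "B2", "name": "John Roe", "icd9": {"277.00"}},
    {"mrn": "C3", "name": "Ann Poe", "icd9": {"250.00", "401.9"}},
]


def cohort_count(records: Iterable[dict], criteria: Callable[[dict], bool]) -> int:
    """Return only the size of the matching cohort, never the records themselves."""
    return sum(1 for record in records if criteria(record))


def has_diabetes_code(record: dict) -> bool:
    """Example criterion: the record carries ICD-9 code 250.00."""
    return "250.00" in record["icd9"]


diabetes_cohort_size = cohort_count(patients, has_diabetes_code)
```

A researcher learns that a cohort of a given size exists, which is enough to decide whether an IRB application for the underlying data is worth preparing.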

We also can move clinical data from the clinical data warehouse into STRIDE research databases. Again, the IRB needs to approve that. That's another use of clinical data. I should make the point, which may not be obvious to the non-technical folks here, that this is all inside a single database. These are not separate systems. It is all the same architecture, so it's relatively easy for us to move data, if there is IRB approval, from the clinical data warehouse into STRIDE databases. And then, as I mentioned, in creating these databases we have basically 2 kinds of applications that we can provide researchers: research data management applications, and applications built on the biospecimen data management platform. Again, they both require IRB approval.

This is a highly regulated process, and one where we feel that we're being very compliant with all of the rules and regulations, and to date, thankfully, we haven't had any issues with this and it has seemed to work very well. And obviously, you know, you could do data extraction from the clinical data warehouse and manually put it into a database that we create; the IRB has to approve that also. So how does data get out? What data can get out? Here is a schematic presentation of that. You have STRIDE, and a master person index. We have hospital data and research data that's used to populate the master person index. This is all done through HL7 feeds, and so now you've got this integrated master person index. You have automated data feeds. What's not on this slide is that we are now putting in place ETL feeds from systems, specifically from the two electronic health record systems, into STRIDE, because most electronic health records do not have outgoing HL7 feeds. And so we have also got manual entry of clinical data, the more traditional database model, that goes in there. All this gets integrated into STRIDE, and then how does it get out? You can look at data that's completely de-identified using the cohort search tool that I’ll describe in a couple minutes. That does not require IRB approval because there is no PHI exposed.
You can also look at fully identified data as part of, quote, review preparatory to research activities. I’ll talk about that in a minute. And then finally, if you see a dataset that looks like it would be useful for your research project, you go to the IRB and you apply to get that data, and if they approve it, we will give you the data. So there are really no holes in the system. The only way to get data out of the systems that are part of STRIDE is by going specifically to the IRB. And this works very nicely because, with our consultation service, we're encouraging people to come to us when they want data so we can preview how they're going to apply for this. We can make sure they're only asking for the minimum necessary data. Sometimes researchers overstate the amount of data that they need. And also, we're concerned about where the data, if it is identified, will live after it leaves STRIDE. We want to make sure before they go to the IRB that this is all nice and tight. So we look on ourselves, in partnership with the hospitals, as protecting the patient's rights in this model.
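
The release rules described above can be summarized as a small decision function. This is a sketch of the policy as stated in the talk, not of any STRIDE code; the `Service` names and the two boolean flags are assumptions made for illustration.

```python
from enum import Enum


class Service(Enum):
    COHORT_DISCOVERY = "cohort_discovery"  # de-identified counts only
    CHART_REVIEW = "chart_review"          # review preparatory to research
    DATA_EXTRACTION = "data_extraction"    # data actually leaves the warehouse


def is_permitted(service: Service, *, privacy_officer_ok: bool = False,
                 irb_approved: bool = False) -> bool:
    """Apply the access rule for each service as described in the talk."""
    if service is Service.COHORT_DISCOVERY:
        # No PHI is exposed, so no approval is needed.
        return True
    if service is Service.CHART_REVIEW:
        # Identified data may be viewed but not taken away; the privacy
        # officer controls access.
        return privacy_officer_ok
    if service is Service.DATA_EXTRACTION:
        # Nothing leaves without IRB approval.
        return irb_approved
    raise ValueError(f"unknown service: {service!r}")
```

Keeping the rule in one place mirrors the point made in the talk: there is a single path by which identified data can leave, and it runs through the IRB.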

So what I’m going to do now is talk about the three components of STRIDE, but largely the clinical data warehouse. As of April 2009, you can see the counts we have in here of data. We have data on patients going back to 1994. We have a total of 1.3 million patients in the system. Stanford University Medical Center is probably a somewhat smaller than average size medical center in terms of the number of patients that are seen. So in comparison, for example, to what Partners would have in its research data repository, this is a smaller number. We have clinical encounters. This is a data type that we've made up, where we synthesize information from ADT feeds and other sources to try to capture this notion of an interaction between a patient and a system, and this turns out to be very useful data. We have inpatient and outpatient ICD-9 coded diagnoses. We have both ICD-9 and CPT coded procedures. And of course with all of these things, we have the patient, we have the date, the time, and sometimes additional information that comes to us via HL7. We have all the surgical pathology reports back to 1995.
I’ll make an obvious point. Surgical pathology reports are a remarkably valuable data source. In many diseases, and specifically in cancer, they are the gold standard for diagnosis, and so we're specifically interested in surgical pathology reports. We made a big effort to go back to '95 and get all the reports, pediatric and adult, for the Medical Center. We have radiology reports back to 2004. We have all of the other clinical transcriptions in the system, about 4 1/2 million, back to 2000. We have 91 million lab test results back to 2000, and in June of this year we will release a new version of the clinical data warehouse that has all the pharmacy orders back to 2006. I’ll mention this later.

We have data going back further, I think an additional 2 years, that we're looking at integrating into that. That's the pharmacy orders. So the point of this slide I’m about to show you is just to illustrate the fact that nothing in life is simple. And I know, from what Jim was saying, this is basically what you're doing here. The first thing you do when you set up a data repository is try to get all the feeds that the clinical systems are getting. So we've done that. These are all the lab tests, pathology, radiology, transcriptions, a whole variety of other things. All the things going in. Electronic health record systems by and large are not designed to spit anything out in HL7. So you've got a big problem, because most of the really valuable data that's in the electronic health record is actually coming via manual data entry, in clinical documentation and CPOE, that kind of stuff.
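
For readers unfamiliar with what arrives on such a feed, here is a minimal sketch that pulls patient demographics out of a message. It assumes the feeds are HL7 v2-style pipe-delimited messages, which the talk does not specify; the sample message, its application names, and the `parse_pid` helper are made up for illustration.

```python
def parse_pid(message: str) -> dict:
    """Extract a few demographic fields from the PID segment of an HL7 v2-style message."""
    # Segments are separated by carriage returns, fields by "|", components by "^".
    for segment in message.strip().split("\r"):
        fields = segment.split("|")
        if fields[0] == "PID":
            identifier = fields[3].split("^")[0]                 # PID-3, first component
            family, given = (fields[5].split("^") + [""])[:2]    # PID-5, name components
            return {
                "patient_id": identifier,
                "family_name": family,
                "given_name": given,
                "birth_date": fields[7],                         # PID-7, YYYYMMDD
            }
    raise ValueError("no PID segment found")


# A fabricated two-segment message; none of these values come from a real system.
sample = (
    "MSH|^~\\&|LAB|HOSP|WAREHOUSE|SITE|200904011200||ORU^R01|MSG0001|P|2.3\r"
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19700101|F"
)
```

Demographic fields like these are the kind of input a master person index needs in order to link records for the same person across systems.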

And so we set STRIDE up and the first phase was essentially to replicate all the HL7 feeds. We're essentially now getting real-time HL7 feeds. Anything that goes into either electronic health record at Stanford pretty much at the same second goes into STRIDE. But all the data that's being manually entered is not available, so how do you get that in? This is a phase of the project we started just in January of this year. And this is an inevitable phase that anybody faces in having to do this project. So we're basically doing this in a phased approach. The first thing we did was to essentially work with the hospitals so that we could obtain complete access to what are called the CDRs, the clinical data repositories. All electronic health record systems, the big ones, maintain a system that's a copy of the data in the clinical database, usually in a different database format, because electronic health record systems are transaction-based systems that are, you know, somewhat efficient in handling transactions about individual patients. They have a variety of tools that allow you to get information out on specific patients or maybe a small group, but they are not very good at, nor are they designed for, getting information about large groups of patients or searching across the database looking for cohorts.

So we're developing what are called ETL connections, which are extract, transform and load connections, between these CDRs, these clinical data repositories in both Cerner and Epic, and bringing them into STRIDE in a phased way. This brings up a very interesting question about what data you want to have in your data repository. We've looked now in detail at the data models for both of these systems, and there are a vast number of elements. The question is, you know, okay, maybe you want everything, but you're not going to be able to necessarily take everything at once. And so what do you want? So we're in the process right now of beginning to create a sort of priority list. And I’ll say one of the things that's at the top of the list is problem lists. And this is because you may think, well, you have all these diagnoses, but ICD-9 diagnoses are not terribly accurate. There has been a great discussion on the American College mailing list about this. I think it's okay to talk about this. This was an interesting phenomenon where, without the details, a patient who was getting care in an institution that was transferring data to one of the commercial personal health record systems looked at his data and saw a bunch of diagnoses that didn't seem right to him. It turned out, when people dug into this, there is a lot of inaccuracy in ICD-9 coding. Some of it is done to maximize billing. The inpatient stuff is not necessarily done by clinicians, and so on.
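To make the extract, transform and load idea above concrete, here is a minimal Python sketch. It is not STRIDE code: the `problem_list` table, the field names, and the `ehr_a`/`ehr_b` source labels are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

def etl_problem_list(source_rows, warehouse):
    """Extract rows from a source export, normalize them, load them into the warehouse."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS problem_list "
        "(patient_id TEXT, problem TEXT, source_system TEXT)"
    )
    loaded = 0
    for row in source_rows:  # extract: iterate over rows pulled from a source repository
        # transform: trim whitespace and lowercase so both sources look alike
        problem = row.get("problem", "").strip().lower()
        if not problem:
            continue  # skip rows with no usable problem text
        # load: write the normalized row into the warehouse table
        warehouse.execute(
            "INSERT INTO problem_list VALUES (?, ?, ?)",
            (row["patient_id"], problem, row["source_system"]),
        )
        loaded += 1
    warehouse.commit()
    return loaded

# Hypothetical rows standing in for exports from two different source systems.
warehouse = sqlite3.connect(":memory:")
rows = [
    {"patient_id": "p1", "problem": " Asthma ", "source_system": "ehr_a"},
    {"patient_id": "p2", "problem": "", "source_system": "ehr_b"},
    {"patient_id": "p3", "problem": "Type 2 diabetes", "source_system": "ehr_b"},
]
loaded = etl_problem_list(rows, warehouse)
```

A real connection would also have to map each source's data model onto the warehouse's, which is where the prioritization question raised above comes in.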

So I think we have to be very careful about saying that ICD-9 diagnoses that we get out of clinical systems are necessarily the gold standard for understanding what's really happening to the patient. Probably a better source is problem lists. Problem lists exist in most of these electronic health records. One of the things we're looking at, possibly starting this summer, is moving the problem lists into the clinical data warehouse, trying to sort of capture them in some standardized way. So I want to make this point that when you've got the HL7 feeds, that's only the beginning of the solution. It's a moving target. Implementation goes on over many years. So for example, our cancer center recently migrated to clinical documentation in Epic, so there is a change in where the data lives. This is going to continue to happen over the next number of years as both of the systems at Stanford are implemented, and we have to stay on top of what's happening in terms of where clinical data is being captured and understand how to get that into the data warehouse. As I mentioned, we get data from the registries. So let me talk about some of the tools that we've developed that sit on top of STRIDE.

So I was sharing with Jim over lunch, probably boring his pants off. I certainly don't want to talk about Jim's age. I remember well when, if you wanted to get a MEDLINE search, you had to go to a librarian and explain what your area of research and interest was. Usually the person had no idea what you were talking about. It was left up to them to construct the query, and they would say come back in a week or two. In my experience sometimes they did well; sometimes, you know, it was a very unsatisfactory process. And this is kind of where we are right now with access to clinical data for research purposes. You have to go through an analyst. And I suspect it's a very unsatisfying process for researchers. And my personal goal over some extended period of time would be to sort of get us to the PubMed model of access while protecting patient security and privacy.
So the first thing we wanted to do was to provide a way to have researchers directly query the clinical data that we have in our clinical data warehouse. In doing that, we also had to make sure we didn't expose any protected health information, and there are nuances to that I’ll come to. This application here -- you can still hear me on the microphone? I’m going to move out here. This is called the anonymous patient cohort discovery tool. This is a Java application that runs on a Mac and PC. Anybody can access this.
When you start up, you have over here in this left-hand column a series of data elements or criteria that you can drag over, and it allows you to create fairly complex queries, we think, in an intuitive fashion. We've offered training to faculty on this and they tell us, we don't need it, we just figured out how to make it work. So it was probably engineered reasonably well. This is a query, an example. You can drag over age, and then there is a variety of operators that you can apply. There is an encounter date. This is a way of saying you don't want to look at any data before a specific date. You can query by ICD-9 diagnoses, and you have the ability to add criteria by putting in multiple examples. You can query by lab test results, put specific values in or say it's high, low, and so on. You get the general idea.
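The drag-and-drop criteria described above can be thought of as predicates that are combined with AND. The following Python sketch shows that idea only; the `Patient` fields, the helper names, and the sample ICD-9 code strings are illustrative assumptions, not the tool's actual data model.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Patient:
    patient_id: str
    age: int
    icd9_codes: set = field(default_factory=set)
    last_encounter: date = date(2000, 1, 1)
    has_pathology_report: bool = False

# Each helper returns a predicate, one per "criterion" dragged into the query.
def age_between(low, high):
    return lambda p: low <= p.age <= high

def any_icd9(*codes):
    wanted = set(codes)
    return lambda p: bool(wanted & p.icd9_codes)  # true if any listed code is present

def encounter_on_or_after(cutoff):
    return lambda p: p.last_encounter >= cutoff

def has_pathology():
    return lambda p: p.has_pathology_report

def count_cohort(patients, criteria):
    """Count patients that satisfy every criterion (logical AND)."""
    return sum(1 for p in patients if all(c(p) for c in criteria))

# Hypothetical patients; the code strings are only illustrative.
patients = [
    Patient("a", 12, {"556.9"}, date(2008, 3, 1), True),
    Patient("b", 40, {"556.9"}, date(2008, 3, 1), True),
    Patient("c", 9, {"493.90"}, date(2007, 6, 1), True),
    Patient("d", 15, {"556.9", "493.90"}, date(2003, 1, 1), True),
]
criteria = [
    age_between(0, 17),
    any_icd9("556.9"),
    encounter_on_or_after(date(2005, 1, 1)),
    has_pathology(),
]
cohort_size = count_cohort(patients, criteria)
```

Only patient "a" meets all four criteria here, so `cohort_size` is 1.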

You can say I’m only interested in having patients included if they have a pathology report. And this works in real time. This query probably took about 45 seconds. What it does is it comes back to you and says, you know, I think there are about 400 patients in the data repository that may meet your criteria. I’m being tentative here. You'll see in a second we have another tool to help you answer the question. And people love this. I believe that there are a lot of small but important research projects that never get beyond the shower-in-the-morning phase. You're in the shower. You think, I wonder, do we have any patients like this we can look at? But, you know, in many institutions it is going to the librarian, and people don't have the time for that. This allows you 24 by 7 to log into the system and to ask questions, and really all it does is give you a number. It gives you a little bit of demographic information. You can't use the system to triangulate to a small group of individuals; it uses some clever statistical techniques to essentially prevent you doing that. It's a very secure system. It's considered to be very valuable by researchers at Stanford because it gives them direct access to clinical data, letting them answer the trivial question: are there any patients who have these particular characteristics?
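The talk does not say which statistical techniques the tool uses to stop a user narrowing a query down to a few individuals. As a stand-in, the sketch below shows one common and much simpler approach: report only a rounded count, and suppress small counts altogether. The function name, bucket size, and threshold are assumptions chosen for illustration.

```python
def approximate_count(true_count, bucket=10, minimum=10):
    """Return a deliberately imprecise description of a cohort size."""
    if true_count < minimum:
        # Small cells are never reported exactly.
        return f"fewer than {minimum}"
    # Integer arithmetic rounds to the nearest bucket, with halves rounding up.
    rounded = bucket * ((true_count + bucket // 2) // bucket)
    return f"about {rounded}"

big = approximate_count(397)
small = approximate_count(7)
```

With these settings a true count of 397 is reported as "about 400" and a count of 7 as "fewer than 10", which matches the tentative style of answer described above.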

So what do you do then? So where did I get that? Well, this gives you a ballpark figure. So this says there may be 400 patients. And clearly, what you need to do next is look at the data. And this is something that we have discovered to be absolutely true: only the researchers know what they need from the data. There is no way that an analyst can do this. And so people said, you know, this is great, but this next phase that you make me go through where, you know, the analyst extracts the data, I have to go through the whole IRB thing, I get the data on a disk, I look at it, and half is no good, that's not going to fly. We said we're going to need to build a tool for cohort review. So this is essentially when you launch that tool. To get into this tool, because it's partially identified data -- we're de-identifying a lot of data on the fly. De-identification is like pregnancy. You are or you aren't. We don't feel, given the kind of data we have, we can guarantee 100%, so you have to go through a bureaucratic process to get access to this.

So this is the same cohort I showed you. These are pediatric patients, and you can see when you go in, it gives you a report. You can come back to this multiple times, and it keeps a record of what you've done and what the status of the cohort is. It's saying, you know, here is a list of patients. You'll see in a second, you can go into each of the charts and look at the data, and you can decide whether to include or exclude the patient, or that you haven't decided. As you do, the statistics are updated in real time. People need to know the demographic distribution of the data. We found that sometimes the data is there, but it's skewed in such a way, say all males, that the research project is not going to work. It has to be a balance of males and females, or a balance, for example, of particular ethnic groups.

This is a dashboard that allows you -- and basically each user can have as many cohorts in the system as they want. They go in, get a directory, open one up, and they are basically allowed to get the overview. And you can see that here is the status for the patients; this person has only gotten down to number 23 in the list. And you can leave notes that say, for example, why you included or didn't include a patient. And then you have this. So the other thing we found: we were sending people to look at the clinical systems, which is completely the wrong model, because those systems are not well designed for this. People sometimes want to go right into the data, and they want to ask a very specific set of questions that will help them determine whether the data is appropriate for their research project or not. So this is our patient cohort data review tool.

This is the second tab up here. The cohort tab is the screen shot I showed you a moment ago, the dashboard view of all the patients. This is an individual patient. You can see what you have is a list of data elements, and if you want to filter out certain types of data, say you're only interested in looking at the pathology reports, what you do is turn off all of these except for pathology reports, and automatically the other data disappears. This is how to apply filters quickly on the data. It shows summary information; for example, it shows you the ICD-9 diagnoses this patient has and the number of times that the person has each diagnosis, the procedures they had, and so on. And then you have the data elements over here. You can sort by various criteria. Here we're looking at a lab test result. The data is here. We flag it as being abnormally high or low.

And then we have a search feature. This is the thing that makes this very powerful. So if you wanted to search for a phrase like colitis -- in this slide what happened is the person using it turned off most of the data elements, and they've left on only pharmacy, transcriptions, radiology, pathology and labs. When they do the search it will only search the things that are active. What it's done is found everywhere in the patient's chart, the research chart, where this phrase occurs. You can click on it; I blacked out some PHI here. You can see they're highlighted. So this makes it, to date anyway, a very, very, very efficient way to rapidly get into a chart and see if this data will be useful or not. Because you've no idea what criteria an investigator is going to use, you have to make it as flexible as possible.

The whole idea is to make a decision about this patient, and then go down and click one of the boxes at the bottom. You can say exclude, include, or I haven't decided. When you're done, you've got a number of patients for your project, and that gets stored away inside STRIDE. When you come back with your IRB protocol, we will only extract the data on the patients that you've decided you want to be in this particular protocol. So it's essentially a process for including and excluding patients. And so far it seems to work pretty well. And this is obviously extensible: as we add new data elements, we have the ability to put those data elements into this system.
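The include, exclude or undecided workflow above amounts to a small piece of per-cohort state. The Python sketch below shows one way to hold it; the `CohortReview` class and its method names are hypothetical and only illustrate the bookkeeping, not how STRIDE stores it.

```python
from enum import Enum

class Decision(Enum):
    UNDECIDED = "undecided"
    INCLUDE = "include"
    EXCLUDE = "exclude"

class CohortReview:
    """Tracks a reviewer's decision and optional note for each patient in a cohort."""

    def __init__(self, patient_ids):
        # Every patient starts out undecided.
        self.decisions = {pid: Decision.UNDECIDED for pid in patient_ids}
        self.notes = {}

    def decide(self, patient_id, decision, note=""):
        if patient_id not in self.decisions:
            raise KeyError(patient_id)
        self.decisions[patient_id] = decision
        if note:
            self.notes[patient_id] = note  # e.g. why the patient was excluded

    def summary(self):
        """Counts per decision, the kind of status a dashboard would show."""
        counts = {d: 0 for d in Decision}
        for d in self.decisions.values():
            counts[d] += 1
        return counts

    def included(self):
        """Only these patients would be extracted once a protocol is approved."""
        return sorted(pid for pid, d in self.decisions.items() if d is Decision.INCLUDE)

review = CohortReview(["p1", "p2", "p3"])
review.decide("p1", Decision.INCLUDE)
review.decide("p2", Decision.EXCLUDE, note="no pathology report")
```

After these two decisions, `review.included()` contains only "p1", and "p3" remains undecided until the reviewer comes back to it.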

I’ll mention briefly drug representation. This is work we're wrapping up, and we've submitted a paper on this. I’ll give you some highlights. Drug data is obviously very important from a research perspective. We wanted to do this in a way that was standards based. We developed a mechanism for getting drug data from 2 separate electronic health records, Cerner and Epic, using two different drug information systems. So we had to come up with an algorithmic mechanism to map them automatically. That's worked very well for us. Only 95%. We mapped directly into the [indiscernible] and through that we're able to map out to SNOMED, to the pharmacologic classes. When this goes live in June we hope to have the ability to search for patient cohorts using ingredients and pharmacologic classes. You can create a cohort search and you can add in those two additional criteria. And you'll obviously be able to view that data in the chart review tool and then extract the data out for research purposes.

Very quickly, on the research data management side of things we have developed a whole slew of these. This is one to give you an example, a multimedia research data tool used by the entire department of dermatology at Stanford. They have their own separate IRB protocol allowing them to register all their patients into this. They capture clinical photography, built right into STRIDE. No one else can get into this. We can link data in this, with IRB approval, to the biospecimen repository, so when they do skin biopsies, there is an automatic link. The virtual biospecimen bank -- fundamentally, this is an attempt to allow our researchers to ask the question, does Stanford have tissue with these characteristics? It's a hard thing to do in many institutions because tissue is considered a very highly valuable resource. Most research groups have their own tissue banks. Sometimes they're not interested in sharing information about their tissue.

We said look, we'll develop an application that's very sophisticated that allows you to adopt a standards based approach to registration, distribution, tracking, searching and reporting of tissues. We'll give it to you for free. In return, what you agree to do is have your tissue appear in the STRIDE specimen locator. We have about 43,000 biospecimens in the system right now. These are 2 screen shots of the application that we give to people, a very sophisticated application. More formally, from a research community perspective, you get to use this application. So this allows you to go in, with completely de-identified information, and it allows you to create a cohort query for tissue. You basically see here that it's run; the researcher has said, well, I’m interested in tissue from these 2 sites, the diagnosis was cancer, the histopathologic diagnosis was adenocarcinoma, primary tumor, in men between 35 and 70. There is a universal search. It comes back and says there are 30 samples, 20 patients, here is where it is. If you click on that, you get an automated request form that comes up. That automatically sends the request to the right person.

From then on, it's sociology, between you and your maker. We're kind of brokers. We call this the matchmaking application, the idea being you define the criteria, and we get out of the way and hope it all works. This is working very well. So finally, I want to wrap up by talking about what I think is a very important and difficult to solve problem, the issue of integrating document based data. Large amounts of clinical data in the patient record are trapped in narrative text documents. Now, historically a lot of that data is dictated. Some institutions continue to dictate. There is a current trend toward getting clinicians to click boxes and type into clinical documentation, but fundamentally, in many cases the text is still narrative free text. I think if you look at the limitations of some of the coded data that we have in the systems, you realize this is very valuable data.

We need to figure out how to extract and standardize it. And there are many different kinds of documents that exist. And we have a project that's been going on as part of STRIDE for a couple of years called ChartIndex, basically a natural language processing system that can automatically do some of this. In the clinical data warehouse we have about 7 1/2 million full text clinical documents. All are indexed using Oracle Text as part of the Oracle suite. And you can search across all the documents for words or phrases in the clinical documents as part of the cohort tool. But it has a lot of limitations. It doesn't support searching inside a section, looking for a phrase, for example, in the findings section of a radiology document. Maybe more importantly, it doesn't handle negation detection. So you go in and put in pneumonia, and it will come back and say I've got radiology reports, but I can guarantee the majority will say no evidence of pneumonia. So that's not helpful. So ChartIndex, we've written a number of papers on this. My colleague who was my Ph.D. student, Yang, this was his Ph.D. thesis; he is now at Kaiser. We developed a system that basically can take a variety of clinical documents, parse them into a CDA (Clinical Document Architecture) compliant XML format, and use a statistical -- to extract out standardized terms for all the noun phrases. We have been doing this with radiology reports and just completed a project evaluating it against surgical pathology reports, and hopefully we'll have a paper on that this year. We're interested now in applying this to discharge summaries, and we think this will be a very, very interesting way to sort of automatically structure data for research purposes, so that we can kind of capture the content of clinical documents in a standardized way. It also handles negation detection nicely. It's about 95% accurate at detecting negation, so it marks up the noun phrases, which eliminates a lot of stuff.
Two quick slides at the end. You can look at these online.
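To show why negation matters for the pneumonia example above, here is a deliberately simple Python sketch. It is not the ChartIndex method, which the talk describes as statistical; it is a plain trigger-phrase rule that treats a term as negated when a negation cue appears earlier in the same sentence. The trigger list and function names are assumptions for illustration.

```python
import re

# A small, hand-picked list of negation cues; a real system would need far more.
TRIGGER = re.compile(r"\b(no evidence of|no|without|negative for|denies)\b")

def find_mentions(report, term):
    """Return (sentence, negated) for every sentence that mentions the term."""
    results = []
    for sentence in re.split(r"[.!?]\s*", report.lower()):
        position = sentence.find(term.lower())
        if position == -1:
            continue
        # Negated if any cue occurs before the term within this sentence.
        negated = TRIGGER.search(sentence[:position]) is not None
        results.append((sentence.strip(), negated))
    return results

def affirmed(report, term):
    """True only if at least one mention of the term is not negated."""
    return any(not negated for _, negated in find_mentions(report, term))

negative_report = "No evidence of pneumonia. Heart size is normal."
positive_report = "Right lower lobe pneumonia is present."
```

A plain full-text search would return both reports for "pneumonia"; `affirmed` keeps only the second. A rule this crude will miss many phrasings, which is part of why a dedicated system is needed.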

There are lots and lots of informatics opportunities in this. Supporting better temporal logic in cohort searching. Standards based representation of data and clinical documents. How you better integrate clinical data warehouse data into research data management solutions. One area that we're very interested in is the notion of research data alerts. This is the idea that many institutions have: in the emergency room, they want to know about patients arriving with a particular problem. Often the people come in, they're seen, and they're gone, and the question is, is there a way we can detect this in real time and notify recruiters for a clinical trial? We're getting all of the HL7 messages. As soon as that person hits the emergency room, that record is in STRIDE. We want to create a system that allows individual research groups to create their own research alerts, so they can say, if you see patients in these locations with these criteria, send a pager message to this particular pager. Not any PHI. Just send them a message so they can log in and see what the story is. Data and text mining is a very important area of informatics research as you accumulate these large datasets. Then something the CTSA is focusing on: can we kind of multiply the effect here? If we have the data repositories emerging, can we federate searches to search across multiple repositories and ask questions of larger sets of data?
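The research alert idea above can be sketched as rules matched against incoming events, with a notification that carries no patient information. In the Python below, the `AlertRule` fields, the event dictionary keys, the pager identifiers, and the code strings are all hypothetical; a real feed would first have to be parsed from HL7 messages.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    name: str
    locations: frozenset
    diagnosis_codes: frozenset
    pager: str

def match_alerts(event, rules):
    """Return (pager, message) pairs for every rule the event satisfies."""
    messages = []
    for rule in rules:
        in_location = event["location"] in rule.locations
        has_code = bool(rule.diagnosis_codes & set(event["diagnosis_codes"]))
        if in_location and has_code:
            # The message deliberately contains no patient identifiers.
            text = f"Research alert '{rule.name}': a possible match has arrived. Log in to review."
            messages.append((rule.pager, text))
    return messages

rules = [
    AlertRule("stroke trial", frozenset({"ED"}), frozenset({"434.91"}), "pager-101"),
    AlertRule("asthma study", frozenset({"ED", "CLINIC"}), frozenset({"493.90"}), "pager-202"),
]
event = {
    "patient_name": "Jane Example",
    "location": "ED",
    "diagnosis_codes": ["434.91"],
}
alerts = match_alerts(event, rules)
```

Here only the first rule fires, and the pager text names the study but nothing about the patient, in line with the "not any PHI" constraint.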

Maybe you want to use that for recruiting into multicenter clinical trials. And then finally the last slide, the challenges. There are lots of challenges. I mentioned earlier exactly what you put into the research clinical data warehouse versus what you leave in the electronic health record. There is the whole question of how we're going to represent data trapped in clinical documents. There is a real issue of sustainability. When you're developing an enterprise system, and really it's a system that takes five years before it's got anything important to offer, and you're thinking it will be around for 20, 25 years, you have to think hard about how to make this sustainable beyond the initial excitement or the initial grant funding that gets this off the ground. This is the same issue that electronic health records have. You have to be thinking about how to operate and maintain the system for decades. Finally, and this we have certainly learned, this is all fundamentally a question of culture and workflow. Not about technology. Most successful researchers are entrepreneurs. Most institutions say to their researchers, you can get funding for this work, and as long as you're being compliant, get on with your work.

Researchers for decades have kind of adopted the model that they have to do it themselves. We need to turn that around. We need to say to people, we have tools that can make you more efficient and more effective, and you shouldn't be wasting time doing things that informatics expertise can do for you better at the enterprise level, perhaps cheaper. That will be a very hard thing to do. This is the "if you build it, will they come" phenomenon. And I just want to acknowledge we have a great team of people working at Stanford on this project. I’m kind of the architect and spokesperson. All these folks do the work on a daily basis.
I’m a little bit over but it's a lot of material. Thank you.

[applause]

CIMINO: One or two questions real quick before Henry has to leave.

QUESTION: Thank you for the presentation. It looks like STRIDE is ideally suited for data mining and retrospective analyses of patients that came in for standard of care. I’d be interested in hearing about prospective data management. Is this for people who come in and are entered and randomized prospectively? Is this really suited for that kind of prospective data management or not?

LOWE: It is.

QUESTION: Then in real time someone can look at the data as they're being accrued?

LOWE: I mentioned earlier, I deemphasized that completely. I wanted to focus on the access to clinical data. We have developed a variety of very complex data registries for prospective studies in STRIDE. And you know, the model is that we want to, first of all, provide standards based prospective data management, and we want to have the research projects using those applications in STRIDE take advantage of all the other things we can do. So for example, we have a very nice joint replacement registry, and we're integrating data into that so when people have imaging studies pre and post surgery, that data automatically appears. We will be adding in the ability to add in lab test results without having to manually enter them. That's an important part of what we do. I mentioned earlier, for small projects, we think this is too expensive and too complicated. That's why we're in the process of implementing REDCap, which can probably meet the needs of smaller research projects for a smaller amount of time and effort. Any other questions?

QUESTION: You mentioned de-identification several times. Just curious, is there wisdom you could impart on people investigating de-identification given that you have experience with it?

LOWE: I’m not sure I've got any specific wisdom except I think it's a really hard problem. I don't think that there is, commercially or otherwise available, a generalizable solution to this. I think there is literature on this in informatics, and people have done it very well for certain subsets of the data. We worry about the fact that if you stamp something as being de-identified, the IRB applies a different set of rules and regulations to that data. I mean, if somebody goes to the IRB, they can say I’m only asking for de-identified data. They can get an expedited review and the data can be released. Once it's released -- it doesn't happen like this, but that data could be put on the internet. And I certainly have looked at enough of this data to be worried about the nuances. I’ll give you an example. I’m in the middle of Silicon Valley. I was looking for quality assurance purposes at some of the data. I was looking in a clinic note. It was beautifully de-identified except for the fact that it had a snaps in there and it said this patient is the CEO of a startup in Silicon Valley, and it had its URL. I mean, one at Stanford, right? And if you clicked on that URL you would get to their website and you could see who it was. I don't know how a de-identification algorithm can pick that up. There are all kinds of ways you can identify somebody. There is a classic example of identifying John Kennedy, not that this happened, but a woman who has done a lot of this work showed that you can use public records, as she did with Governor Weld; you can often identify people using public data. Unless it's structured data, we won't release it as de-identified. We basically say we'll do the best we can, but we can never be absolutely sure. That's kind of currently the approach we're taking to that. Thank you.

[applause]

CIMINO: Thank you very much, Henry. Thank you for coming.

ANNOUNCER:  You’ve been listening to a presentation on Stanford University’s STRIDE Program, presented by Dr. Henry Lowe, Associate Professor of Medicine (Biomedical Informatics), and director of the Center for Clinical Informatics and Senior Associate Dean for Information Resources and Technology, Stanford University School of Medicine. For more information about the BTRIS Program at the NIH Clinical Center, log on to http://btris.nih.gov. And for more information about the clinical research going on every day at the NIH Clinical Center, log on to http://clinicalcenter.nih.gov. The NIH Clinical Center Biomedical Translational Research Informatics Seminar Series Podcast has been a presentation of the NIH Clinical Center Office of Communications, Patient Recruitment and Public Liaison. From America’s Clinical Research Hospital, in Bethesda, Maryland, I’m Bill Schmalfeldt at the National Institutes of Health, an agency of the United States Department of Health and Human Services.


This page last reviewed on 05/4/09


