Transcribed for:
N.I.H.
NATIONAL INSTITUTES OF HEALTH: NHGRI
BETHESDA, MD

Transcript of:
2001 GENOMICS SHORT COURSE
DR. FRANCIS COLLINS:
"THE HUMAN GENOME PROJECT AND BEYOND"
8-7-2001


Transcript:

Dr. Francis Collins: Well, good morning everybody. It's a delight to have a chance to come and speak with you this morning. I know that you already got started on a variety of parts of this course yesterday, and have a whole plethora of interesting things going on later in the week, and I always look forward to this opportunity to talk with the people who come to this course, because we believe this is one of the more important things that we do.

I want to give you a romp through the status of the Human Genome Project, and I'm going to take advantage of the fact that there are lots of other speakers throughout the course of the week that are going to dig into the details of some of this - which leaves me free to be grandiose, which is something I tend to be anyway. So I'm particularly going to talk about some of the future implications of all this, and I will also not resist the temptation to talk about the ethical, legal, and social issues that are attached to that, because I think those are just as important as some of the basic science issues that you'll be learning about all week long.

And I hope, if I'm reasonably well-controlled in my timing here, that we'll have a chance for some discussion after I go through these PowerPoints, because I'd really like to hear from you about what some of the issues are that are on your mind already, after having started into this.

Let me say that a particularly important thing that I want to bring to your attention is a couple of presentations tomorrow, by Betty Graham and Ron King, that are about our current efforts to try to improve the representation of minorities in genome research. We recognize that if the future is going to be as bright and exciting as I'm going to try to portray it, that means we need all the bright people to work on this project, from whatever background; and we have not been as successful, to be honest, as we wish we had, in recruiting individuals from minority backgrounds to work in this project.

And we have a very bold and ambitious effort underway to try to increase that outreach, and provide more concrete pathways for folks from minority backgrounds to get trained in this area, and become major contributing scientists to the next phase. So Betty and Ron will tell you more about that. I actually asked if they would, if they could try to get to you sometime today, a brand-new report that NHGRI has just come out with, which is basically our blueprint for what we want to do in this area, and give you a chance to look over it this evening perhaps, so that when they come and speak to you tomorrow, you can ask even more explicit questions.

Because I suspect many of you in the room are potential ambassadors in this regard, who could, if all goes well, get the word out there to minority students, that there is a real pathway here of exciting scientific career possibilities; and the Genome Institute stands ready to assist in every way that we can, to try to make those pathways more accessible. And Betty and Ron can tell you more about that tomorrow. But again, I wanted to specifically bring that to your attention, as something you can do for us: which is to try to spread the word here about our interest in those kinds of recruitments.

Well. Let me go through a little description of what we have learned about the genome in the course of the last few years. Again now, particularly, I'm going to talk about where this is all going to take us next.

We have gotten to the point where the Genome Project is no longer an obscure entity - you can find cartoons like this in magazines all over the place on the newsstand, which I guess means we have arrived, and that certainly wasn't true four or five years ago. When I would ask people on airplanes, "Have you heard of the Human Genome Project?" I would usually get a blank stare. And all of that seems to have changed, at least in many people's minds about a year ago, with the announcement in the White House that we had covered 90% of the sequence of the human genome; and even further attention has been flowing since then - particularly back in February with the publication of the analysis of what the sequence has told us.

But I think it's still fair to say that most people are pretty confused about what this is. And the Genome Project gets lumped together with a whole lot of other things that sound scary - like human cloning, and stem cells, and a variety of other issues. And certainly, I'm sure from your perspective as teachers, you are constantly struggling to try to help people understand what these things are, and what they are not.

Well, the reason we're doing the Genome Project - just to be very clear about it - is a medical one. While there are all aspects of this project of other sorts that are also pretty interesting - it's fascinating basic science; it's going to tell us a lot about human history; it undoubtedly will be good for the economy, because it seems to have stimulated an awful lot of biotechnology - but from my perspective, as a physician… and I think speaking for the originators of this project some 15 years ago, the real intention here, is a medical one. And it's a medical one that should be broadly applicable to virtually any disease.

We tend to think of genetic diseases as things like sickle cell anemia, that are caused by a single gene gone awry, and inherited in a very straightforward recessive, or dominant, or X-linked fashion; but that's much too narrow a view for the reality of what genetics has to offer.

Most diseases that fill up our clinics and wards are rather like adult-onset diabetes, here in the middle, where it's a mix of genetic contributors, and environmental contributors, and they are very hard to sort out.

My own research lab, over in Building 9, is working on the genetics of adult-onset diabetes, and I can tell you, it's a very, very tough problem. There are probably a dozen genes involved in susceptibility - no one of those genes is particularly strong in its contribution; there is clearly a big environmental contribution that has to do with diet, and unraveling that kind of complexity - it's really been completely out of bounds for human genetics for all of the past centuries, but we are now on the cusp of being able to do that. And I think that was, in many ways, the promise of the Genome Project, and the promise is being borne out.

Even an infectious disease, like AIDS, can have important revelations about it, derived from the study of genetics. We know, for instance, there are some people who are genetically immune to getting AIDS. Even though they might be exposed to the virus repeatedly, they lack one of the components that allows the virus to get into the cell, and so they're not going to get sick from this particular exposure - and that's true for virtually every infectious disease you look at. If you study the host genes, you begin to figure out that not everybody has the same risk of illness after exposure.

So the intention here of the Genome Project, is to try to unravel the yellow part of these pie charts, and in the process, learn a lot about the red part, too. Because by studying the genetics, you immediately begin to understand things about the environment that you couldn't have, by lumping everybody together.

And how are we doing on that. Well, certainly, for single-gene disorders, we've done extremely well. There are very few single-gene disorders of humankind that have not had their genes identified by now, because the tools of the Genome Project have made that possible. And that's not been true for very long.

My lab spent the better part of the 1980s trying to find just one of those genes - the gene for cystic fibrosis. It took us nine years of very hard work, and a lot of going down blind alleys and burned-out post-docs, and all manner of frustrations; because we didn't have the tools to do this, and many people thought it was crazy to even try. And I would never want to go back to that era, of trying to do things in that fashion.

Nowadays, I can tell you what it took us nine years to do, could be done by a decent graduate student in the space of a couple of weeks, just working by themselves. All you'd have to do is go to the Internet, and see: what are the genes in the area where you know the cystic fibrosis gene would have to be; set up some PCRs with affecteds and unaffecteds; and sure enough, you'd find that three-base-pair deletion in a couple of weeks, if you were doing science in any sort of a reasonable way. And that's rather sobering, to think just how hard it was, and how long it took us, and how recently that was.

What we'd like to do now, is to see that same leap forward - go beyond things like sickle cell anemia, and single-gene disorders - to really unravel the complexities of things like diabetes, or heart disease, or hypertension, or the common cancers, or mental illness. And all of those are potentially feasible now. But we're just beginning to start to see some of those developments occur.

Now of course, if we're going to understand hereditary contributions to disease, this is the molecule that we're going to be most interested in unraveling, and that's what the Human Genome Project has primarily been about; although I think maybe there's been a bit of a misunderstanding in some people's minds, that the Genome Project had only one goal - which was to read out the letters of the human DNA code. That's been its most visible goal. But actually, we've only spent a small fraction of the funds allocated to the Genome Project on that particular goal. And many of the other aspects of what we've been doing are to try to understand what this sequence is all about.

And the Genome Project is by no means over. At the time that we now have most of this information in front of us, we still have a prodigious number of things to do to try to interpret its meaning. In fact, we're trying to figure out exactly: What should we call the Genome Project, anyway. What are the boundaries of this - because it's spilling out into a whole host of other areas.

Anyway, the Genome Project got underway in 1990, in the U.S. It is led by the N.I.H., and specifically by the Genome Institute; but we have a very important partner in the Department of Energy, who has also been carrying out a significant effort in genome research for the entire ten years.

Now in addition, we have international partners that are extremely important. The sequencing of a human genome - this reading out of these 3 billion letters - was actually done by 16 centers around the world, in 6 different countries. I've had the privilege of serving as the project manager for that enterprise, and that has been really quite an interesting experience, to try to oversee efforts involving many different countries; and in some instances, all sorts of complications arise from that, particularly because the project really only could be coordinated because people wanted it to be. I have no authority to tell people in Japan what to do with their sequencing instruments on any given day.

But because we all agreed we wanted to get this done, and wanted to get it done efficiently, and not duplicate efforts, it actually has gone extremely well.

Now the early days of the Genome Project were devoted not to actually trying to sequence human DNA in large numbers of base pairs, because we didn't know how to do that efficiently enough to justify launching into that back in 1990. And so instead, the first several years were largely devoted to building maps - which you can think of as sort of a low-resolution view.

I think we've gotten a little fuzzy in our terminology, by the way. When I say a map, I'm talking about a low-resolution view of the genome, that looks at sort of mile-markers along the road, if you will. When I talk about sequence, I'm talking about reading out all of the letters of the code.

What we did in February was to publish the first sequence, and the analysis of the sequence. Maps of various types had been generated for several years prior to that; and there are other types of maps that we are now going to go on and generate in the future. But the sequence - the reading out of the letters - is now 90%… actually 95% in-hand, although as I'll explain, we have a lot of cleaning up yet to do, with that last 5%.

So in 1996 - as you can see from this curve - we actually did begin the process of trying to read out the letters of the code. Now we had practiced on some simpler organisms over the course of the first six years of the Genome Project - things like E. coli, and yeast, and roundworms - and were getting pretty good at doing large-scale sequencing; but still, in '96 it was a bit daunting to imagine scaling up that effort by a factor of 30 or so, to try to tackle a genome the size of a human.

So we used a three-year pilot effort between '96 and '99 to try to see what we could learn about how hard this was going to be, and whether we were really up to the task. And by March of 1999, when we looked to see how far that had gotten, this is what we found.

If you'd gone to the Internet to see our progress - which you'd be able to track every day, because the Genome Project investigators had agreed from the beginning, that all this data would go into the public domain every 24 hours - and by the way I think that was a very important and rather groundbreaking decision, made back in 1996 - all of these sequencing centers all over the world got together, and… you know, the standard is, you deposit your data when you publish. And it was pretty clear we weren't going to publish a paper about the sequence of the human genome until we had most of it, and that was going to be several years.

But this group decided it would be really not a good use of the time and effort that went into generating all this information, if nobody could have access to it. And there was no justification for the sequencing centers hoarding the information and just studying it themselves - if it was really going to benefit the public, anybody with a good idea ought to be able to see it. And so they decided to put the data up on the Internet every 24 hours, beginning in 1996, and we have adhered to that ever since. And the results of that have been quite profound. This is a sequence database that people use tens of thousands of times a day. And discoveries about disease genes have been happening all along - not waiting for sort of some final publication for people to get started.

So in March of '99, when you went and looked to the Internet - here is a diagram of the human chromosomes - this is what you would have seen, as far as our progress. The areas that are in red or orange are areas that are essentially finished, where the sequence is at high accuracy and there are no gaps; the areas in yellow-green are what we call working draft - which is actually a very good sequence, but it does have gaps in it that have to be closed ultimately; it can be used though, to answer most of the questions that people wanted to ask of the genome, but you can see there's a lot of areas that hadn't been touched. In fact, only about 15% of the sequence was in hand in March of '99.

Well, I remember that month very clearly. We got together at the Houston Genome Center, at Baylor, and the five largest genome centers - which were going to be doing about 85% of the work, so they really had to take the responsibility for filling in most of this territory - surveyed what their capacity was, and tried to figure out: how quickly could we get this chart filled in here. And the decision was to go for broke, and to try to fill in all of the chromosomes with at least working draft sequence, as fast as the machines would allow it to be done. And that meant scaling up, in individual centers, sequencing capacity - something like tenfold in the space of a few months, and that is an extremely challenging problem.

When I was a graduate student, if you sequenced a thousand base pairs of the DNA code, you could probably get a Ph.D. In order to fill in this diagram in the space of a year and a half, we calculated we'd have to sequence a thousand base pairs a second - seven days a week, 24 hours a day, a thousand base pairs a second. And that was really a prodigious task, but people decided: you know, we've learned a lot from this pilot effort; let's try it and see what we can do.
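To get a feel for the scale of that target, here is a minimal back-of-the-envelope sketch. It uses the rate quoted above and the 3.1 billion letter genome size cited later in the talk. The exact day count for "a year and a half" and the interpretation of the result as read redundancy are assumptions made for illustration, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def raw_bases_sequenced(rate_bp_per_second, days):
    """Total bases read at a constant rate, running around the clock."""
    return rate_bp_per_second * SECONDS_PER_DAY * days

RATE_BP_PER_SECOND = 1_000       # "a thousand base pairs a second" (from the talk)
DAYS = 1.5 * 365                 # assumption: "a year and a half" as 547.5 days
GENOME_SIZE_BP = 3_100_000_000   # "3.1 billion" letters (from the talk)

raw_bases = raw_bases_sequenced(RATE_BP_PER_SECOND, DAYS)
# Assumption: dividing by genome size gives the average number of times
# each letter would be read if the reads were spread evenly.
fold_redundancy = raw_bases / GENOME_SIZE_BP

print(f"Raw bases read: {raw_bases:,.0f}")
print(f"Average reads per letter: {fold_redundancy:.1f}")
```

Under these assumptions the total comes to tens of billions of raw bases, many times the length of the genome, since each stretch has to be read repeatedly before the pieces can be assembled with confidence.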

So I'll just show you what happened here. As the months click by, you can see what was beginning to fill in here on the various chromosomes. All of the genome centers divided things up and took responsibility for particular parts - we didn't want any parts of the genome to be left out in the cold - and sure enough, by May of 2000, this diagram had gone to 90% coverage. That's so satisfying, I just have to do it again. [laughter] And this represents the work of probably about 2,500 individuals, working at these 16 centers, in 6 countries, frequently communicating with each other by conference calls, and e-mails, and regular face-to-face meetings, and doing a terrific job - of both maintaining the quality of the data, which is very high; and getting the job done in a very tight timetable, at an affordable cost.

All of this - I should tell you, the Genome Project has benefited by massive improvements in technology all the way along. So the original estimates of what all of this would cost, have never actually been reached. We've done all of this ahead of schedule and under budget, which distinguishes this from many other things that happened in Washington.

So there we were, it was May of 2000, and by June we sort of tallied it all up and agreed that we had reached our 90% goal; and there was this big announcement in the White House in June of 2000, where I got to go and stand next to the President and talk about what this all meant for the future. And I think people were excited, but they were also a little confused, and there were a few cynics who said: "Well, you know, that's fine; but isn't this a bit arbitrary? sort of saying 90% is good enough? And what did you learn from this, anyway?" Well that was, for the scientists, also the right question: "What did we learn from this."

And so between June of 2000 and early 2001, we assembled a group of about 4 dozen of the smartest computational biologists that we could find. Now these are interesting people, who are both equally at home with biology and with writing a program to try to interpret sequence, and they dropped everything else they were doing and agreed to work on this together; and by February of this year, we were able then to publish this paper in Nature - which is the longest paper ever published in Nature - describing what we had learned from the sequence; and there were quite a number of surprises in the sequence that we hadn't expected. Which was good. It would have been sort of upsetting if you had looked at it, and spent months staring at it and said: "Oh. Well, kind of what I thought." We hadn't needed to worry though, about that.

This cover of Nature, by the way, was chosen by us rather specifically, to convey a message. The message is, of course this is DNA - you recognize the double helix - but if you look closely, this is actually a mosaic, where the tiles of the mosaic are made up of the faces of people from all over the world. And they are people from every ethnicity, and culture, and form of dress, and age, and gender that you can think of, and that really was what we wanted to say. This is the people's genome. This is all of our genomes. This is our shared inheritance. Sure, it's about DNA, but it's really about human beings. And I think that message continually is one that seems very important to this International Public Sequencing Consortium.

You may know, of course, that there was a private effort to sequence the human genome going on at a company up the road - Celera Genomics. They published their own paper in Science the same week, and we had agreed to correlate these publications so they came out at the same time. Their sequence data was partly their own, and partly data that was derived from the public databases. Since we were putting it all up there, there was no reason they shouldn't take it. We were happy for them to do so. Their sequence data however, was not made freely available, even at the time of publication, which has caused some considerable consternation in some quarters.

I would say all of this fuss that went on, about: Is there a race here between the company and the international consortium, got a bit overblown. I mean, how can you have a race, when one of the parties is giving away all of their information to the other. And I think it actually got kind of silly. The company was using a different strategy than what the international consortium had, and obviously they had a different business plan, and a different date-of-release plan.

I think the good news is: The sequence got done; it's accessible to anybody who wants to go and query it; it's in the public domain; the company still did fine; everybody ended up okay at the end of this interval. And I'm glad we're past that interval, to be honest, because it tended to overshadow in many ways: Why are we doing this anyway; and what are the scientific reasons to be excited about this enterprise.

Just for fun, because this cover of Nature was generated by a bunch of scientists who do have a sense of humor, we decided we would hide in this picture a Watson and Crick image from 1953, and get everybody to play "Where's Waldo," [laughter] because it seemed like an interesting thing to do. We actually hid a picture of Mendel in here, too. But when Nature, the journal got the image that we'd created, they trimmed the margins, and so Mendel ended up right down there. [laughter] That's his forehead. If you know the famous picture of Mendel, you might be able to tell that that is Mendel's forehead. But most people haven't been able to pick that up. Watson and Crick, on the other hand, survived the cutting of the margins and they are here, and it takes people quite a while to find them. So I'm going to spare you some time here, and show you where they are - right down there, in the midst of the backbone of one of the strands, is this famous photo of them admiring the double helix structure back in 1953, and we certainly felt as if we were standing on their shoulders, as one does in science.

There are a variety of other interesting photos in here. The Queen of England seems to have made an appearance - I'm not quite sure why. I think the Brits snuck that one in there. Our good colleagues in Britain wanted to be sure that their particular contribution was hidden in there somewhere. But I also discovered Senator John Glenn is in here, and I'm not quite sure how he got in either. I guess we're exploring things. Oh, well.

Question: Where are you?
Dr. Francis Collins: I'm not in there. No, no, no. No, we decided we really should not be putting the scientists in here who were part of the work, because where would it end. There were 2,500 of them, and we couldn't very well put them all in, and it wouldn't seem fair to leave people out.

So if you want to go find all the data that's in here - you're going to hear other presentations about this during the week - I'll just tell you this is my favorite site for going to look at the genome sequence, and one that I think could be quite useful in teaching exercises, because it's reasonably friendly to the user in terms of what's there, and your ability to move around in the genome.

This particular site was set up by a graduate student named Jim Kent at the University of California, Santa Cruz. Jim is one of the smartest people I've ever met. He played a significant role in taking all the sequence that the centers had produced, and assembling it into a contiguous stretch of DNA that went from one end of a chromosome to the other; but he also put up this browser, which is really terrific. Because it allows you to move around, and you can blow things up and then narrow it down, you can go directly to the sequence, you can jump around to other species - I don't have time to go through what all these tracks are here, but it would be fun to play with if you have a spare moment - and I think it would be a good educational tool.

This is just one little part of chromosome 7, here actually looking at the oncogene called MET, showing you its exon structure across the sequence - this is quite a long stretch of sequence - but it's also showing you a bunch of other things; like where it matches mouse, and FISH, and a bunch of other interesting features. You'll hear more about the various databases and how to use them during the week.

Well, what did we learn about the genome? I could go on for hours about the things we learned about the genome. Yes, it is lumpy, as Jeff said. If you look to see the distribution of genes in the genome, it's not at all random. There are very crowded urban areas where genes are packed one on top of the other, practically; and other great deserts, where you'll go for millions of base pairs without encountering a single gene. Why is that? It doesn't seem like a very efficient way to package information. But it must be pretty important, because it's been maintained that way by evolution over a long period of time.

One of the things that got a lot of attention, and well it should have, is that the gene count for humans turned out to be considerably lower than expected. We had always been expecting to find something in the neighborhood of 100,000 genes - that was the number we've all been using, right? for the last 15 years, since Wally Gilbert did a little back-of-the-envelope calculation, and decided that was the right number.

Well, it was within an order of magnitude, I guess, but it was not that close to right, because the actual answer seems to be about 30 to 35,000 - roughly a third of what the previous predictions had indicated.

Now let me say, we don't have that number precisely nailed down. And even if the sequence was finished to every last base pair in perfection, we still at the present time wouldn't exactly know how many genes there are. Because our abilities to scan through large amounts of DNA and pick out the genes is not perfect. And so I'm sure we have over-called and under-called in various places; but I'd be surprised if the real, final answer turned out to be more than 50,000, and I suspect this number of 30 to 35,000 is probably about right.

Now that is a surprisingly small number, when you consider all the things that we have to do biologically; and it's also surprisingly small when you look at some other organisms that we have tended to look down our noses at, as being much less complex than we are. So roundworms, for instance, with their 19,000 genes, are getting dangerously close to the same number that we have. Arabidopsis, the mustard weed, whose sequence has recently been derived, has 25,000 genes. A little plant that we don't give a lot of respect to. I gather, for the people who are sequencing rice, that they think rice has 50,000 genes. Oof! More than us.

Well clearly, gene count must not be everything, or else we've really been severely misled in our interpretations of our own complexity; and in fact, the second point here, on my list of cool things, may be part of the answer. That we are able to make more of one gene than people had appreciated - and using this thing called "alternative splicing," where you put together exons in a different combination, you can in fact see that most human genes do use this; and on the average, a human gene makes about three different proteins. And that is more than you would find in worms, or flies, or yeast. So we are using the combinatorics to help us out.
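The idea of putting exons together in different combinations can be illustrated with a small sketch. The gene, the exon names, and the rule that "optional" exons may be freely skipped are all hypothetical simplifications for illustration; real splicing is regulated, and a cell does not produce every possible combination.

```python
from itertools import combinations

def splice_isoforms(exons, optional):
    """List exon combinations that keep the original order and only drop optional exons."""
    skippable = [e for e in exons if e in optional]
    isoforms = []
    for k in range(len(skippable) + 1):
        for skipped in combinations(skippable, k):
            isoforms.append(tuple(e for e in exons if e not in skipped))
    return isoforms

# Hypothetical four-exon gene in which the two middle exons can be skipped.
exons = ["E1", "E2", "E3", "E4"]
isoforms = splice_isoforms(exons, optional={"E2", "E3"})

for isoform in isoforms:
    print("-".join(isoform))
print(f"{len(isoforms)} possible transcripts from one gene")
```

With only two skippable exons, one gene already yields four possible transcripts, which is how a modest gene count can still support a much larger set of proteins.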

There are other aspects of our protein repertory that also seem unique. And when I say "our" protein repertory, I'm probably really talking about the mammalian protein repertory; because we're the only mammal, so far, that we've had the chance to examine. But our proteins do seem architecturally more complicated. They've cobbled together more different kinds of domains per protein than you'd expect to find in a yeast, or a fly, or a worm.

Another thing we looked at - and I actually didn't realize this was going to be possible, and it was one of the things that came about from studying the repeats of the genome - we are able to deduce the mutation rate in males compared to females. Now how do we do that. Well, you look at the Y-chromosome. The Y-chromosome, of course, can only be passed from male to offspring… namely from father to son. So if you see mutations arising on the Y-chromosome, they must have arisen in male meiosis. Whereas mutations that arise on the X-chromosome, or on the autosomes, might have happened in either male or female meiosis.

Because we are tracking this very large family of repetitive sequences, and that repetitive family - actually, you know what the sequence was of a particular family member when it landed on the chromosome, and then it sits there and diverges over millions of years - you can actually use that as a clock to see how rapidly are mutations being acquired. When you look at the Y-clock, compared to the X-clock, or the autosome clocks, you find the Y-clock is running faster. Which means mutations are happening more rapidly in male meiosis than in female meiosis. If you go through the math, it looks as if it's about twofold.

Now there had been suggestions of that, and there are biological reasons to think that might be the case. In spermatogenesis, you have to go through a lot more cell divisions to get to a mature sperm, than you do in oogenesis. And if the mistake rate is sort of dependent upon the number of times you have to copy the DNA, you could see why sperm might have a higher mistake rate. But here it is - the evidence that that's the case. So I guess it won't surprise the women here to learn that men make mistakes in passing their DNA on more often than women do; and men - I'm afraid this means that we're responsible for two-thirds of genetic disease. It has to start somewhere, and two-thirds of the time, it starts in a male passing their DNA on. But we can also take credit for two-thirds of evolutionary progress, since that's the same process, so… [laughter] that should be some compensation and reassurance.
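The step from "about twofold" to "two-thirds" can be written out as a short sketch. It assumes a simple textbook model, not necessarily the calculation used in the published analysis: the Y-chromosome spends all of its time in males, the X-chromosome spends one-third of its time in males and two-thirds in females, and `alpha` (a name introduced here) is the male-to-female mutation rate ratio.

```python
def male_share_of_mutations(alpha):
    """Fraction of new mutations that arise in males, given the male/female rate ratio."""
    return alpha / (1 + alpha)

def y_over_x_clock_ratio(alpha):
    """Expected Y-clock speed relative to the X-clock under the simple model.

    Y rate is proportional to alpha; X rate is proportional to (2 + alpha) / 3.
    """
    return 3 * alpha / (2 + alpha)

def alpha_from_y_over_x(ratio):
    """Invert the model: recover alpha from an observed Y/X clock ratio."""
    if not 0 < ratio < 3:
        raise ValueError("under this model the Y/X ratio must lie between 0 and 3")
    return 2 * ratio / (3 - ratio)

alpha = 2.0  # "about twofold", as stated in the talk
print(f"Male share of new mutations: {male_share_of_mutations(alpha):.3f}")
print(f"Expected Y/X clock ratio:    {y_over_x_clock_ratio(alpha):.2f}")
print(f"Alpha recovered from 1.5:    {alpha_from_y_over_x(1.5):.1f}")
```

With a twofold ratio, two out of every three new mutations come from the male side, which is where the two-thirds figure comes from.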

Another major thing that we were able to learn about, was the so-called "junk" DNA. Fifty percent of our genome is recognizably made up of these bland repeats; various families of them. And those are repeats, which in the past, we have tended to ignore; or even to look down our noses at, as being of no real interest, and called it "selfish" DNA, or "junk" DNA, and said that it was along for the ride, and it was basically an irritant to the molecular biologist and not much else.

Well actually, we've acquired a new respect for this particular component of our genomes, because at least one of those repeats - in fact, the most common one, the one called Alu repeat, has all of the properties that one would expect for a functional element, in terms of the company that it keeps.

We don't understand right now what its function is; but you can infer, from how evolution has held on to it in the areas where the genes are most dense, that it has some function. And it obviously opens up a whole new field now, to try to figure out what that function is. And if that's true of the most common element where we have the most data, it may well be true of others as well. So we should probably eliminate "junk" DNA from our vocabulary; "junk" DNA may simply reflect our own level of ignorance, and at least a lot of this is going to turn out to be important after all.

Well, those are a few of the things that we learned about the genome from this romp through, and I'm sure people will spend the next many decades improving on this analysis. What I hope is, when people go back and read that paper in Nature in February of 2001, they won't laugh too hard. I think they will look at it as sort of the first analysis - sort of like the student's book report after they've first read some classic of literature - and then after a while, others will come along and add substantially to the analysis. And that's fine; they should.

So what do you see if you go and look at the sequence. Again, you could now go to the Internet and read this stuff off - and this is about 2,000 letters of the code - and it is pretty daunting when you consider that's what we've now got. We've got 3.1 billion of these things, and the sequence of course doesn't help you very much in terms of giving you punctuation, or paragraphs, or capital letters, or any of that - it's just A-C-G-and-T - and the effort now very much shifts into a decryption mode of: What does this all mean, and how do we apply it to medicine.

Of course, this might be a typical page from a person, but if I picked a different person, I might see a slightly different sequence. How different? Well, not very much. This would be pretty typical. If I picked a piece of DNA from me, and from one of you - and it wouldn't matter which one of you I picked - I'd see roughly about 2 differences out of 2,000. Point one percent. So I've just changed two letters there - I don't know if those happen to be variable, in reality, but I'm guessing they have some chance to be - and this is about what you would see, in scanning any part of the genome. And most of those variations, of course, are ones that preexisted in our common ancestral pool.
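
The arithmetic behind that "point one percent" can be sketched in a few lines. This is only a back-of-the-envelope illustration using the round figures quoted in the talk (2 differences per 2,000 letters, 3.1 billion letters in the genome); the constant and function names are made up for the example, and it assumes the per-page rate holds uniformly across the genome.

```python
# Round figures quoted in the talk; all names here are hypothetical.
GENOME_SIZE_BP = 3_100_000_000  # "3.1 billion of these things"
DIFFS_PER_PAGE = 2              # "about 2 differences"
PAGE_SIZE_BP = 2_000            # "out of 2,000"


def difference_rate(diffs, page_bp):
    """Fraction of letters that differ between two people."""
    return diffs / page_bp


def expected_genome_differences(genome_bp, diffs, page_bp):
    """Scale the per-page count up to the whole genome (integer arithmetic)."""
    return genome_bp * diffs // page_bp


rate = difference_rate(DIFFS_PER_PAGE, PAGE_SIZE_BP)
total = expected_genome_differences(GENOME_SIZE_BP, DIFFS_PER_PAGE, PAGE_SIZE_BP)

print(f"Per-letter difference rate: {rate:.1%}")
print(f"Expected differences genome-wide: {total:,}")
```

Under those assumptions, two people would differ at roughly three million positions, which is the scale of variation the following paragraphs are talking about.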

This is turning out to be very interesting in terms of what we're learning about our relatedness to each other. The evidence now is pretty compelling from a variety of different directions, that we're all descended from this common ancestral pool of about 10,000 individuals that lived in Africa about 100,000 years ago. And most of the variants - like the ones you're seeing here - were already present in those 10,000 people. So that as humans dispersed around the globe, they carried that variation with them; which is why for variants like these, if you looked in Asia, or in Northern Europe, or in Native Americans, you'd be likely to find these variants in all of those groups, because they were already there in the founding pool.

Now, not to say that you'd find them at exactly the same frequency. Maybe the things marked with the arrows here, you might find at 40% frequency in Asia, and at 10% frequency in Europe, and 50% frequency in Africa. You can imagine the frequencies varying a little bit, just on the basis of drift, as well as founder effects - but only a modest fraction of the variation of the genome is actually going to turn out to be strongly correlated with geographic origin.
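
One way to see how drift alone can pull frequencies apart is a small simulation. The sketch below uses the textbook Wright-Fisher model, which is not something named in the talk; it is swapped in here purely as an illustration. The starting frequency, population size, number of generations, and random seed are all hypothetical, and the region names are just labels, not real population data.

```python
import random


def drift(start_freq, pop_size, generations, rng):
    """Wright-Fisher drift: each generation, redraw all 2N allele copies
    at random from the previous generation's allele frequency."""
    freq = start_freq
    copies = 2 * pop_size
    for _ in range(generations):
        carriers = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = carriers / copies
    return freq


rng = random.Random(2001)   # fixed seed so the run is repeatable
founding_freq = 0.30        # hypothetical frequency in the founding pool
regions = ["Africa", "Asia", "Europe"]  # labels only

# Each region starts from the same founding frequency and drifts independently.
final_freqs = {
    region: drift(founding_freq, pop_size=500, generations=200, rng=rng)
    for region in regions
}

for region, freq in final_freqs.items():
    print(f"{region}: {freq:.2f}")
```

The design point is that every region starts from the same founding frequency and no selection is applied, so any spread in the final numbers comes from random sampling alone. With small populations a variant can also be lost or fixed outright, which is the extreme end of the same process.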

So that is to say, if you looked at the DNA of all of this diverse bunch of folks, most of the variation you find there would be shared amongst all of the geographic background groups; and only a small fraction - something like 7% of it - would have strong geographic correlations. And that turns out to be, I think, useful information in hopefully informing our dialog that's going on right now, about the biological significance of race. And I want to take a small detour here to say a couple of things about that. Because I think it's a very important case of an intersection here between science and society.

People have been of course, trying to come up with statements about ethnicity and race for a long time; and I like this quote a lot, because it sure resonates with me, when I'm talking with people about this. "Most people believe they know it when they see it, but arrive at nothing short of confusion when pressed to define it."

Now there's a reason for that. Scientifically, there really is no justification for drawing sharp boundaries around any particular group, and saying: "That group is different than this group over here." The boundaries are very blurry.

When people draw the history of human populations as a tree - you've probably seen such tree structures - that may be a useful shorthand, but it really isn't right. Because gene-flow doesn't occur down one branch and then never back again, and down some other branch - the way the gene-flow occurs is backwards and forwards. Our history as a human species is really more of a trellis, than it is of a tree.

And so while we use racial and ethnic designations in various ways, science would not support them being rigorous in any sense of the word; and often, in fact what we're talking about, is really a definition that has more to do with social and cultural concepts than it does with biology.

Now what does that mean for the study of disease. One of the major priorities for the N.I.H., and for all of the government health-oriented agencies in the next ten years, is to try to identify the causes for health disparities between groups, and to eliminate them. When you see a difference in frequency of a disease between one group and another, should one assume that that's genetic. Well, [break in audio; change to side B] …ought to be, but we shouldn't make the mistake of assuming that that's necessarily the case. It could also be a difference in diet, or in cultural practices, or socioeconomic status.

So for instance, again, my group studying Type-2 diabetes - we're very interested in trying to understand: Why is that disease twice as common in African Americans as it is in Caucasians. Why is it even more common in Native Americans - particularly the Pima Indians.

It is very likely that there are some genetic undercurrents going on there, but it's also pretty likely that diet is playing some role as well. And trying to take those things apart is going to be a challenge; and I think it means that researchers studying genetics, also really have to study the environment, or we may miss the boat.

So again, this is a bit of a difficult message, I suspect, for you to convey to students; it's hard for me to get this exactly right - both, to say two things are true. First of all, that it is improper to draw precise boundaries around any particular ethnic or racial group and say they're biologically different. Science won't support that. But at the same time, it is correct to say there may be differences in frequencies, of a quantitative sort, of particular disease-susceptibility genes in one group compared to the other, that may play a role in why there are health disparities. Both those statements are true, and yet it is I think fairly challenging to get that message across in a fashion that is convincing to people, and it doesn't seem paradoxical.

Well, let me go on to: Where are we going anyway, with genomics. Now we have this 3.1 billion base pair catalog of information; how are we going to apply that to better health. There's a whole host of ways that we're doing so, and you're going to be hearing about a lot of them this week.

Obviously, we'd like to take that information and study disease, and figure out why some people are at risk for one thing, and others for another. In order to understand the genome and what it means, though, we really have to make comparisons to other sequences, because that gives us often our best insights as to which parts of the genome are really functionally the most important. And so the Genome Project is not done sequencing. We won't be done sequencing for a long time.

We're sequencing the mouse - it's going quite well, we have 95% of that in a draft of a sequence; we're sequencing the rat; we're sequencing the zebra fish - which you're going to hear about later today from Shawn Burgess as a very important model organism for understanding development, because of its transparent embryo and its rapid generation time, and the fact that there's now a whole bunch of terrific mutants that seem to be quite analogous to various human birth defects.

We need to understand how genes turn on or off - this is the DNA chip business, which some of you heard about yesterday from Mike Bittner, and all of you will hear about I guess later in the week from others involved in this course. This is a terrifically exciting new technology - to try to figure out how it is that genes turn on or off, or how a liver cell is different from a muscle cell, and how that goes wrong in particular diseases. And particularly in cancer, this is making remarkable inroads into our understanding of how the genome works.

And then there's this thing called "proteomics," which is a bit of a fuzzy word, recently coined. What people intend to say, I think, when they talk about proteomics, is sort of a global study of all the proteins. Not just one at a time, but let's look at all of them - figure out how they interact, what their structures are, where they are in the cell, maybe what they do - but this is a very open-ended kind of investigation that will go on for a very long time. And just the same, the proteins that do the work in the cell - we want to understand them in a very detailed level, if we can, and that's what proteomics is about.

Well, how is this going to affect medicine. Let me move on to that quickly here. Again, I started out saying the justification for the Genome Project was a medical one - how is that going to play out.

This diagram is a favorite of mine, because I think it does help organize the flow of the ways in which genomic studies are going to find their way into the clinic. Whatever disease you're interested in unraveling, the Genome Project now provides you with the tools to identify the genes that are playing a role in the hereditary susceptibility.

Immediately after doing that, you then have the ability to make predictions about who's at risk, by who's carrying those susceptibility spellings. And that would be diagnostics - and that would be particularly interesting to people, if it gave you a chance to do something for those at high risk. To intervene. To offer them some change in medical surveillance, or lifestyle, or diet, that would reduce their risk of becoming sick. I'll come back to this diagram in a minute, but let's look at an example where that's already the case - you'll hear about others this week.

This is another condition my lab has been studying - hereditary non-polyposis colon cancer. This is a typical pedigree. A grandmother with uterine cancer; a son and a daughter with colon cancer in their 50s - pretty early onset. Families such as this frequently turn out to have a misspelling of one of the genes involved in DNA mismatch repair. The DNA mismatch repair system is sort of like the spellchecker. When you copy DNA, there's a spellchecker there making sure you've got it right. If the spellchecker isn't working, then you can imagine mutations creeping in. And the consequence of that, it turns out, is colon cancer, and sometimes uterine cancer, and sometimes some other cancers as well.

So here's a family where in fact one of those mismatch repair genes is misspelled itself, and it gives you the opportunity - once you know that, and you know that this person, and this person, and this person are all carrying that misspelling - to begin to offer other people in the family some insight into their own risks.

So in this family, these two folks have been tested… so has she, so has she. And some of them are positive. And in fact, this allows the people who have the mutation to know they're at about 60 to 70% risk of getting colon cancer, to get started on a program of colonoscopy beginning at age 35 or 40, to do that faithfully every year - the chances are very good, following that protocol, that the polyp that develops in their colon can be removed before it ever goes down the malignant pathway, and these folks will probably end up being able to live out a pretty normal life. Whereas if nothing is done, they may turn up with already metastatic disease, as unfortunately happened with this guy.

So there's a circumstance where being able to offer diagnostics is attached to a good preventive medicine strategy, and people would be interested in participating in that; but there's going to be a proliferation of those kinds of opportunities in the next ten years, as we unravel the genetics of many diseases.

Pharmacogenomics is I think a very exciting area that's come along rather quickly, and which promises to actually become part of the standard of practice of medicine for at least a few drugs in the not too distant future. So let me give you an example. This is a bit of a tricky one, but let me walk you through it.

What we're looking at here, is people with heart disease - specifically narrowing of the coronary arteries. So these are people who have already had that diagnosis, and they are being followed over a two-year period to see if it progresses; to see if the arteries get even more narrow. They are also being studied at the DNA level for a gene called CETP, that's involved in cholesterol metabolism. And CETP has two different spellings that are here abbreviated B1 and B2. The alleles are B1 and B2.

Now ignore the red bars - just look at the placebo-treated, so this is natural history of the disease here. The B1 folks, the homozygotes for B1, have the worst outcome over two years. They progress the furthest in the course of that two-year observation; whereas the heterozygotes are intermediate, and the homozygotes for the B2 go the slowest. So if you had to pick, you'd probably want to be a B2 homozygote here.

But look what happens with Pravastatin, which is a drug very commonly given in this situation. The B1 people do great on that. Their narrowing is reduced to better than anybody else on the diagram. The heterozygotes maybe get a little bit of benefit; the B2 homozygotes get no benefit at all. This study I gather has been replicated, but not yet published; this one was from The New England Journal a couple of years ago. This is not very far off then, from saying: Gee, maybe before you prescribe Pravastatin to somebody with coronary artery disease, you want to know whether they're a B2B2, because if so, you're wasting their money and their time, and you might ought to pick some other approach.
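
The idea of checking the genotype before writing the prescription amounts to a simple lookup. The sketch below encodes only the qualitative pattern described above for the two CETP spellings (B1 and B2); the table name, function name, and wording of the labels are hypothetical, and it is an illustration of the logic, not a clinical rule.

```python
# Qualitative pattern as described in the talk; names and labels are hypothetical.
PRAVASTATIN_BENEFIT = {
    frozenset(["B1"]): "clear benefit",         # B1/B1 homozygote
    frozenset(["B1", "B2"]): "slight benefit",  # B1/B2 heterozygote
    frozenset(["B2"]): "no benefit",            # B2/B2 homozygote
}


def expected_benefit(allele_a, allele_b):
    """Look up the described response for a pair of CETP alleles.

    Using a frozenset means the order of the two alleles does not matter.
    """
    genotype = frozenset([allele_a, allele_b])
    if genotype not in PRAVASTATIN_BENEFIT:
        raise ValueError(f"Unknown CETP alleles: {allele_a}, {allele_b}")
    return PRAVASTATIN_BENEFIT[genotype]


for pair in [("B1", "B1"), ("B2", "B1"), ("B2", "B2")]:
    print(pair, "->", expected_benefit(*pair))
```

The point of the lookup is simply that the same drug maps to three different expected outcomes depending on which two spellings a person carries.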

That same kind of approach is being studied for many other conditions - particularly asthma - trying to predict which drug is going to work. There are a couple of instances where this has already become the standard of care - for instance in childhood leukemia, the drug 6-mercaptopurine is typically given to kids with ALL. It turns out 1 out of 300 children will have a fatal reaction to that drug. And that's entirely predictable, on the basis of the version of a gene that they carry that metabolizes the drug. And so now, before you give 6MP to a kid, you check and make sure they don't have that particular version; and if they do, you give them a very, very reduced dose.

Of course, where we really want to get to down here, is in the therapeutic outcomes. And again, time is over here on the Y-axis; and each one of these arrows may look very simple, but involves years of research, hundreds of millions of dollars of effort, the best and brightest minds, to actually get down to the bottom of the diagram. But it is encouraging to see that people are making strides in that direction.

For gene therapy, we see now some successes in things like hemophilia - although this has been a very tough road to travel - and I think gene therapy was a bit oversold in its earlier days, and exactly how it's going to play out in terms of the treatment of disease still remains to be determined.

I think though, this pathway - where you use the information about the gene to come up with a drug therapy that you couldn't have thought of otherwise - is looking very promising; and I'll give you one example here, which is the drug recently approved by FDA for a particular type of leukemia - a drug called Gleevec.

So this is used to treat the CML - chronic myeloid leukemia - the type of leukemia that typically shows you in their malignant white cells, the Philadelphia chromosome. And the Philadelphia chromosome, just to remind you, is a rearrangement between chromosomes 9 and 22, to make this little guy here - which as you can see under the microscope, is this little small chromosome - which is partly a 9 and partly 22; but at the point of the breakpoint right there, two genes get brought together that weren't supposed to be together - and that's the gene Bcr which is on 22, and the gene Abl which is on 9; and they make a fusion gene, which in turn produces a fusion protein, and that fusion protein - this weird chimera called Bcr-Abl, is capable of transforming a white cell into a malignant leukemia cell. And apparently it does this, because it has an active site right there that binds ATP, and it then transfers a phosphate to some other protein, and that starts a cascade that results in malignancy, and the consequences are pretty severe. This is a peripheral blood smear of somebody with CML, just loaded up with these malignant white cells.

Well, researchers at Novartis reasoned that if they could block that active site, they might be able to do quite a lot of good for people in this circumstance. So they crystallized the protein, got a three-dimensional structure, and designed a small molecule that would fit right in that pocket, and that's what Gleevec is. So this was sort of drug design based on three-dimensional information.

And when they gave that drug in a Phase 1 trial to 32 patients who had advanced CML - unresponsive to all other therapies, not expected to live more than a few months - 31 of those 32 people went into remission. Which is an unheard-of response in a Phase 1 trial. That has then been replicated; the FDA approved the drug about 3 months ago. It is clear that this is not a cure for some of the people who take it; that relapses are occurring. But still, this is a dramatic improvement over anything that had ever been available for this particular condition. And it does give one a lot of hope, that this same strategy could be played out for disease after disease.

Well, that all sounds fine, but let me not finish this discussion without pointing out that in addition to the medical advances that we would all I think celebrate, there's a whole host of other issues that are raised by this increased knowledge about our genomes that deserves a lot of attention. And I certainly spend a lot of my time working on these: both in terms of trying to encourage research, and in trying to encourage good policymaking on the part of states and the U.S. government, and the world at large. There are several questions that need answers.

Perhaps the most pressing one at the present time, is this business of discrimination. I showed you that pedigree of a family with colon cancer. The people who got tested there were somewhat reluctant to go through the testing - not because they didn't want the information; but because they were afraid if they tested positive, they might lose their insurance, or lose their jobs. And there is currently no effective federal legislative protection to prevent that outcome. And many people actually go through this testing under an assumed name, so that they reduce the likelihood of the information getting into the wrong hands; or they pay out of pocket, even though their insurance company might have been willing to pay, in order to keep the insurance company from knowing the result.

This is crazy. We all have glitches in our DNA - probably 40 or 50 of them, per person. We're all at risk for this kind of misuse of the information, if we don't just put a stop to it. And yet, we have not yet seen that happen on the federal legislative agenda. Still, there is some hope for that. This year there are a couple of bills that have been introduced to the Congress. President Bush came out in his radio address six weeks ago in favor of a federal legislative solution to this, for both employment and health insurance, and so there is some hope that something's going to happen this year. But only with a lot of public pressure. So feel free to exert public pressure.

There's a big question of how all this information is going to get incorporated into people's idea of their health, and the non-medical aspects of who we all are - and clearly, there's a great deal of confusion, and you all as educators are going to be part of the solution.

We have a big challenge for healthcare providers. Many of them are going to be on the front lines of genetic medicine, and they're not necessarily ready. Access is going to be a huge issue. We have not, in this country, made that a very high priority. There are forty million of our citizens who don't have health insurance. And new things that come along tend to be available to those with resources, and not necessarily to those who don't. And that will be true of genetics as well, unless we move swiftly to try to handle some of those inequities.

Will all of this information about variation be used in a way that I hope it will - to reduce prejudice; or will people figure out some way to use this kind of information to hammer on other people they don't like. I would hope that the science in this case so strongly supports the notion of our very great similarity to each other, that that could be a useful part of the dialog.

Will we figure out how to set boundaries; or will we sort of say: Well, whatever science can do, eventually they will do. Do we want to see genetics used not only to cure terrible diseases, but to improve the characteristics of future offspring.

How many of you saw the movie "Gattaca," for instance. Well, if you saw that, you saw sort of a Hollywood version of how this could all run off the track, if people decided to start using it in a global way, to try to choose the optimum characteristics of the next generation. Although the "Gattaca" movie also points out that there is a fatal flaw in that particular paradigm - in that genetics is not going to be that good, really, at predicting things like intelligence or athletic ability - those are so heavily environmentally influenced.

And that raises, then, the question of genetic determinism. Will we get so excited about all of this, in terms of being able to discover genetic contributions to this trait and that illness, that we'll begin to neglect the importance of the environment, and undervalue things that are much broader than what DNA can tell us: namely, things about the human spirit, and our relationship with God. I don't think God is threatened by our study of the genome, but some people seem to want to make it so.

Let me finish by making some predictions about where we might go in the next 30 years, and then I'll be glad to take some questions. And again, these are my own predictions, and they're probably wrong, but you know, what the heck. It probably is worth thinking about: What is a likely set of outcomes based on the current trajectory.

So in 2010 - and these are mostly about medical consequences - I think this business of being able to offer diagnostic tests to all of us, to tell us what we're at risk for will be a reality, for maybe as many as a dozen conditions. You'll be able to find out what's in your genome that potentially places you at risk for future illness. And for many of those conditions, there will be interventions available - medical surveillance, or diet, or drugs - to reduce the risk.

Pharmacogenomics, where you want to know the genotype before you write the prescription - as the couple of examples I told you about - will undoubtedly be standard-of-care for several drugs by 2010. But… will access be inequitable. Will health disparities persist. I guarantee you the access will be inequitable, and disparities will persist, unless we as a country make that a very high priority.
Will we have solved the discrimination problem. We darned well better, and I hope we don't wait until 2010 to do it.

I'll go another ten years - 2020. I think then the therapeutics - the gene therapy and the drug therapy based on genes - will be in full swing, and we'll have gene-based designer drugs for things like diabetes, and Alzheimer's, following the same paradigm that I described for CML. Gene therapy will be a standard-of-care for some conditions - I don't know which ones.

But there will be a big debate underway - in fact, it's already getting underway, so maybe this is actually putting it further off than it should be - about: what are the boundaries here. Do we want to use this kind of technology for non-medical uses - particularly when it comes to the so-called designer-baby scenario.

I think it is undoubtedly going to be the case that people will not necessarily welcome all of this genetic technology with open arms. Look at the way it's being received with genetically modified foods; and there will be a lot of folks saying: "This just isn't natural, and we shouldn't be doing this." And in many instances, I think there are real concerns that need to be addressed; and others - maybe they'll be a bit overblown. The only way to really deal with this effectively is to have an educated public, so that they can size up the concerns and evaluate them.

Well, I'll go another ten years, and then I won't go any further. 2030 - I think we'll have genomics-based health care, with both preventive and therapeutic strategies; we'll have the ability… perhaps by looking at circulating white cells, to look inside the body by gene expression efforts, and find out what's going on, even before symptoms have appeared. You could think of your white cells as your "canary in the coal mine." And they may very well be giving signs that something is awry months or years before you know about it, and that could be a very powerful way of maintaining good health.

But… more complications. If life expectancy increases, as we all hope it will, social security will get in deep trouble, if it wasn't already - I guess that's a good problem to have; and then there will be a broader debate, about: "Gosh, should we go even further here, and not just try to improve a child here and there, but why don't we improve our whole species." And I've got to tell you, that gives me chills. That implies that somebody would be able to decide what's an improvement, and we don't have a good history about that; and undoubtedly, this is something that would not be accessible to all; and we could make some profound mistakes, and end up creating a species that we really didn't like very much, and that placed us in all sorts of risks of a changed environment, and who knows what that would do to our spiritual nature as well. So I hope, personally, that's something we never do.

Well finally, I want to finish with a quote - it's the quote that appears on the last page of that Nature paper - that long Nature paper, if you ever dug through it. When you finally got to the end, you would read this. It's a quote not from a scientist, but from a poet, and it sums up I think very nicely the adventure we're on here, which really is an adventure of exploration. A remarkable time in history, that we're in the midst of. And it's a quote from T.S. Eliot, from The Four Quartets. "We shall not cease from exploration. And the end of all our exploring will be to arrive where we started, and know the place for the first time."

I'm not sure what Eliot was talking about, but it sounds to me like he was talking about us, and the Genome Project. Thank you all very much.

[applause]

[End Collins]