Exploring the Genomic Landscape


Mapping the Terrain

One of the central goals of the Human Genome Project is to produce a detailed "map" of the human genome. But, just as there are topographic maps and political maps and highway maps of the United States, so there are different kinds of genome maps, the variety of which is suggested in Genomic geography. One type, a genetic linkage map, is based on careful analyses of human inheritance patterns. It indicates for each chromosome the whereabouts of genes or other "heritable markers," with distances measured in centimorgans, a measure of recombination frequency. During the formation of sperm and egg cells, a process of genetic recombination -- or "crossing over" -- occurs in which pieces of genetic material are swapped between paired chromosomes. This process of chromosomal scrambling accounts for the differences invariably seen even in siblings (apart from identical twins). Logically, the closer two genes are to each other on a single chromosome, the less likely they are to get split up during genetic recombination. When they are close enough that the chances of being separated are only one in a hundred, they are said to be separated by a distance of one centimorgan.
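The centimorgan arithmetic described above can be sketched in a few lines of Python. The first function applies the simple rule from the text (a one-in-a-hundred chance of separation equals one centimorgan). The second uses the Haldane map function, a standard textbook correction for multiple crossovers that is not part of the original discussion; it is included here only as an assumption, to show why the simple rule is trusted mainly for closely linked markers. The function names are hypothetical.

```python
import math


def recombination_to_centimorgans(recomb_fraction: float) -> float:
    """Simple rule from the text: 1% recombination corresponds to 1 cM.

    Reasonable only for closely linked markers (small fractions).
    """
    return recomb_fraction * 100.0


def haldane_centimorgans(recomb_fraction: float) -> float:
    """Haldane map function (an assumption, not taken from the text).

    Corrects for multiple crossovers between distant markers.
    """
    if not 0.0 <= recomb_fraction < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * recomb_fraction)


# Compare the two estimates for tightly and loosely linked markers.
for r in (0.007, 0.01, 0.10, 0.30):
    simple = recombination_to_centimorgans(r)
    haldane = haldane_centimorgans(r)
    print(f"r = {r:.3f}: simple {simple:6.2f} cM, Haldane {haldane:6.2f} cM")
```

For markers about a centimorgan apart the two estimates nearly coincide; they diverge only as the markers move far apart on the chromosome.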

Genomic geography (27k GIF)

The role of human pedigrees now becomes clear. By studying family trees and tracing the inheritance of diseases and physical traits, or even unique segments of DNA identifiable only in the laboratory, geneticists can begin to pin down the relative positions of these genetic markers. By the end of 1994, a comprehensive map was available that included more than 5800 such markers, including genes implicated in cystic fibrosis, myotonic dystrophy, Huntington disease, Tay-Sachs disease, several cancers, and many other maladies. The average gap between markers was about 0.7 centimorgan.

Other maps are known as physical maps, so called because the distances between features are measured not in genetic terms, but in "real" physical units, typically, numbers of base pairs. A close analogy can thus be drawn between physical maps and the road maps familiar to us all. Indeed, the analogy can be extended further. Just as small-scale road maps may show only large cities and indicate distances only between major features, so a low-resolution physical map includes only a relative sprinkling of chromosomal landmarks. A well-known low-resolution physical map, for example, is the familiar chromosomal map, showing the distinctive staining patterns that can be seen in the light microscope. Further, by a process known as in situ hybridization, specific segments of DNA can be targeted in intact chromosomes by using complementary strands synthesized in the laboratory. These laboratory-made "probes" carry a fluorescent or radioactive label, which can then be detected and thus pinpointed on a specific region of the chromosome. Fishing for genes shows some results of fluorescence in situ hybridization (FISH). Of particular interest are probes known as cDNA (for complementary DNA), which are synthesized by using molecules of messenger RNA as templates. These molecules of cDNA thus hybridize to "expressed" chromosomal regions -- regions that directly dictate the synthesis of proteins. However, a physical map that depended only on in situ hybridization would be a fairly coarse one. Fluorescent tags on intact chromosomes cannot be resolved into separate spots unless they are two to five million base pairs apart.

Fishing for genes (11k GIF)

Fortunately, means are also available to produce physical maps of much higher resolution -- analogous to large-scale county maps that show every village and farm road, and indicate distances at a similar level of detail. Just such a detailed physical map is one that emerges from the use of restriction enzymes -- DNA-cleaving enzymes that serve as highly selective microscopic scalpels (see Tools of the Trade). A typical restriction enzyme known as EcoRI, for example, recognizes the DNA sequence GAATTC and selectively cuts the double helix at that site. One use of these handy tools involves cutting up a selected chromosome into small pieces, then cloning and ordering the resulting fragments. The cloning, or copying, process is a product of recombinant DNA technology, in which the natural reproductive machinery of a "host" organism -- a bacterium or a yeast, for example -- replicates a "parasitic" fragment of human DNA, thus producing the multiple copies needed for further study (see Tools of the Trade). By cloning enough such fragments, each overlapping the next and together spanning long segments (or even the entire length) of the chromosome, workers can eventually produce an ordered library of clones. Each contiguous block of ordered clones is known as a contig (a small one is shown in Genomic geography), and the resulting map is a contig map. If a gene can be localized to a single fragment within a contig map, its physical location is thereby accurately pinned down. Further, these conveniently sized clones become resources for further studies by researchers around the world -- as well as the natural starting points for systematic sequencing efforts.
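As a rough illustration of how a restriction enzyme carves up DNA, the following Python sketch searches a toy sequence for the EcoRI recognition site GAATTC and cuts it into fragments. The sequence is invented, the function names are hypothetical, and the exact cut position (just after the leading G) is a detail supplied here rather than taken from the text. Because GAATTC reads the same on both strands, searching a single strand is enough.

```python
ECORI_SITE = "GAATTC"
# Assumption (not stated in the text): EcoRI cuts just after the leading G.
ECORI_CUT_OFFSET = 1


def find_sites(seq: str, site: str = ECORI_SITE) -> list[int]:
    """Return the 0-based start position of every occurrence of the site."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq: str, site: str = ECORI_SITE,
           cut_offset: int = ECORI_CUT_OFFSET) -> list[str]:
    """Cut the sequence at every recognition site; return fragments in order."""
    seq = seq.upper()
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Invented toy sequence containing two EcoRI sites.
toy = "ATGCGAATTCTTAGGCCGAATTCAAT"
print(find_sites(toy))
print(digest(toy))
```

A restriction map of the kind described above is, in essence, the list of positions returned by find_sites, recorded for much longer stretches of real DNA.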

Two giant steps: Chromosomes 16 and 19

One of the signal achievements of the DOE genome effort so far is the successful physical mapping of chromosomes 16 and 19. The high-resolution chromosome 19 map, constructed at the Lawrence Livermore National Laboratory, is based on restriction fragments cloned in cosmids, synthetic cloning "vectors" modeled after bacteria-infecting viruses known as bacteriophages. Like a phage, a cosmid hijacks the cellular machinery of a bacterium to mass-produce its own genetic material, together with any "foreign" human DNA that has been smuggled into it. The foundation of the chromosome 19 map is a large set of cosmid contigs that were assembled by automated analysis of overlapping but unordered restriction fragments. These contigs span an estimated 54 million base pairs, more than 95 percent of the chromosome, excluding the centromere.

An emerging gene map (30k GIF)

Most of the contigs have been mapped by fluorescence in situ hybridization to visible chromosomal bands. Further, more than 200 cosmids have been more accurately ordered along the chromosome by a high-resolution FISH technique in which the distances between cosmids are determined with a resolution of about 50,000 base pairs. This ordered FISH map, with cosmid reference points separated by an average of 230,000 base pairs, provides the essential framework to which other cosmid contigs can be anchored. Moreover, the EcoRI restriction sites have been mapped on more than 45 million base pairs of the overall cosmid map. Over 450 genes and genetic markers have also been localized on this map, of which nearly 300 have been incorporated into the ordered map. An emerging gene map shows the locations of the mapped genes. Among these genes is the one responsible for the most common form of adult muscular dystrophy (DM), which was identified in 1992 by an international consortium that included Livermore scientists. A second important disease gene (COMP), responsible for a form of dwarfism known as pseudoachondroplasia, has also been identified. And yet another gene, one linked to a form of congenital kidney disease, has been localized to a single contig spanning one million base pairs, but has not yet been precisely pinpointed. About 2000 other genes are likely to be found eventually on chromosome 19.

In a similar effort, the Los Alamos National Laboratory Center for Human Genome Studies has completed a highly integrated map of chromosome 16, a chromosome that contains genes linked to blood disorders, a second form of kidney disease, leukemia, and breast and prostate cancers. A readable display of this integrated map covers a sheet of paper more than 15 feet long; a portion of it, much reduced and showing only some of its central features, is reproduced here as Mapping chromosome 16. The framework for the Los Alamos effort is yet another kind of map, a "cytogenetic breakpoint map" based on 78 lines of cultured cells, each a hybrid that contains mouse chromosomes and a fragment of human chromosome 16. Natural breakpoints in chromosome 16 are thus identified, leading to a breakpoint map that divides the chromosome into segments whose lengths average 1.1 million base pairs. Anchored to this framework are a low-resolution contig map based on YAC clones and a high-resolution contig map based largely on cosmids (for more on YACs, yeast artificial chromosomes, see Tools of the Trade). The low-resolution map, comprising 700 YACs from a library constructed by the Centre d'Etude du Polymorphisme Humain (CEPH), provides practically complete coverage of the chromosome, except the highly repetitive DNA in the centromere region. The high-resolution map comprises some 4000 cosmid clones, assembled into about 500 contigs covering 60 percent of the chromosome. In addition, it includes 250 smaller YAC clones that have been merged with the cosmid contig map. The cosmid contig map is an especially important step forward, since it is a "sequence-ready" map. It is based on bacterial clones that are ideal substrates for DNA sequencing, and further, these clones have been restriction mapped to allow identification of a minimum set of overlapping clones for a large-scale sequencing effort.

Mapping chromosome 16 (44k GIF)

The high- and low-resolution maps have been tied together by sequence-tagged sites (STSs), short but unique stretches of DNA sequence. They have also been integrated into the breakpoint map, and with genetic maps developed at the Adelaide Children's Hospital and by CEPH. The integrated map also includes a transcription map of 1000 sequenced exons (expressed fragments of genes) and more than 600 other markers developed at other laboratories around the world.

Getting down to details: Sequencing the genome

Ultimately, though, these physical maps and the clones they point to are mere stepping stones to the most visible goal of the genome project, the string of three billion characters -- A's, T's, C's, and G's -- representing the sequence of base pairs that defines our species. Included, of course, would be the sequence for every gene, as well as the sequences for stretches of DNA whose functions we don't yet know (but which may be involved in such little-understood processes as orchestrating gene expression in different parts of our bodies, at different times of our lives). Should anyone undertake to print it all out, the result would fill several hundred volumes the size of a big-city phone book.

Only the barest start has been made in taking this dramatic step in the Human Genome Project. Several hundred million base pairs have been sequenced and archived in databases, but the great majority of these are from short "sequence tags" on cloned fragments. Only about 30 million base pairs of human DNA (roughly one percent of the total) have been sequenced in longer stretches, the longest being about 685,000 base pairs long. Even more daunting is the realization that we will eventually need to sequence many parts of the genome many times, thus to reveal differences that indicate various forms of the same gene.

Hence, as with so many human enterprises, the challenge of sequencing the genome is largely one of doing the job cheaper and faster. At the beginning of the project, the cost of sequencing a single base pair was between $2 and $10, and one researcher could produce between 20,000 and 50,000 base pairs of continuous, accurate sequence in a year. Sequencing the genome by the year 2005 would therefore likely cost $10-20 billion and require a dedicated cadre of at least 5000 workers. Clearly, a major effort in technology development was called for -- an effort that would drive the cost well below $1 per base pair and that would allow automation of the sequencing process. From the beginning, therefore, the DOE has emphasized programs to pave the way for expeditious and economical sequencing efforts -- programs to develop new technologies, including new cloning vectors, and to establish suitable resources for sequencing, including clone libraries and libraries of expressed sequences.
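The scale of the problem can be checked with back-of-the-envelope arithmetic. The sketch below uses the per-base cost and per-researcher output quoted above, together with the three-billion-base genome size; the 15-year project span is an assumption introduced here, and the calculation covers a single pass with no repeat sequencing. It is a rough bracket, not a reconstruction of how the published estimates were derived.

```python
GENOME_BP = 3_000_000_000               # roughly three billion base pairs
COST_PER_BP = (2, 10)                   # dollars per base pair, from the text
BP_PER_RESEARCHER_YEAR = (20_000, 50_000)  # from the text
PROJECT_YEARS = 15                      # assumption: a 15-year effort ending in 2005


def single_pass_cost(cost_per_bp: float) -> float:
    """Dollars needed to sequence every base pair exactly once."""
    return GENOME_BP * cost_per_bp


def workers_needed(bp_per_year: float, years: int = PROJECT_YEARS) -> float:
    """Researchers needed to finish one pass within the given number of years."""
    return GENOME_BP / (bp_per_year * years)


low_cost, high_cost = (single_pass_cost(c) for c in COST_PER_BP)
print(f"cost: ${low_cost / 1e9:.0f} to ${high_cost / 1e9:.0f} billion")

most, fewest = (workers_needed(r) for r in BP_PER_RESEARCHER_YEAR)
print(f"workers: {fewest:,.0f} to {most:,.0f}")
```

The figures quoted in the text ($10-20 billion and at least 5000 workers) fall inside these rough brackets.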

Efforts to develop new cloning vectors have been especially productive. YACs remain a classic tool for cloning large fragments of human DNA, but they are not perfect. Some regions of the genome, for example, resist cloning in YACs, and others are prone to rearrangement. New vectors such as bacterial artificial chromosomes (BACs), P1 phages, and P1-derived artificial cloning systems (PACs) have thus been devised to address these problems. These new approaches are critical for ensuring that the entire genome can be faithfully represented in clone libraries, without the danger of deletions, rearrangements, or spurious insertions.

Marked progress is also evident in the development of sequencing technologies, though all of those in widespread current use are still based on methods developed in 1977 by Allan Maxam and Walter Gilbert and by Frederick Sanger and his coworkers (see Tools of the Trade). Both of these methods rely on gel-based electrophoresis systems to separate DNA fragments, and recent advances in commercial systems include increasing the number of gel lanes, decreasing run times, and enhancing the accuracy of base identification. As a result of such improvements, a standard sequencing machine can now turn out raw, unverified sequences of 50,000 to 75,000 bases per day.

Equally important to the sequencing goals of the genome project is a rational system for organizing and distributing the material to be sequenced. The DOE's commitment to such resources dates back to 1984, when it organized the National Laboratory Gene Library Project. Based on cell- and chromosome-sorting technologies developed at Livermore and Los Alamos, libraries of clones were established for each of the human chromosomes, and the individual clones are widely available for mapping and for isolating genes. These clones were invaluable in such notable "gene hunts" as the successful searches for the cystic fibrosis and Huntington disease genes. More recently, as more efficient vectors have become available, complete human DNA libraries have been established using BACs, PACs, and YACs.

Another critical resource is being assembled in an effort known as I.M.A.G.E. (Integrated Molecular Analysis of Genomes and their Expression), cofounded by the Livermore Human Genome Center. The aim is a master set of mapped and sequenced human cDNA, representing the expressed parts of the human genome. By early 1996, I.M.A.G.E. had distributed over 250,000 partial and complete cDNA clones, most of them with one or both ends sequenced to provide unique identifiers. These identifiers, expressed sequence tags (ESTs), are usually 300-500 base pairs each. Twenty-five hundred genes have also been newly mapped as part of this coordinated effort.

Shotguns and transposons

Such advances as these, in both technology development and the assembly of resource libraries, have brought much nearer the day when "production sequencing" can begin. A great deal of variety remains, however, in the approaches available to sequencing the human genome, and it is not yet clear which will prove the most efficient and most cost-effective way to read long stretches of DNA over the next decade. One of the available choices, for example, is between "shotgun" and "directed" strategies. Another is the degree of redundancy -- that is, how many times must a given strand be sequenced to ensure acceptable confidence in the result?

Shotgun sequencing derives its name from the randomly generated DNA fragments that are the objects of scrutiny. Many copies of a single large clone are broken into pieces of perhaps 1500 base pairs, either by restriction enzymes or by physical shearing. Each fragment is then separately cloned, and a convenient portion of it sequenced. A computational assembly process then compares the terminal sequences of the many fragments and, by finding overlaps that indicate neighboring fragments, constructs an ordered library for the parent clone. The members of this ordered library can then be sequenced from end to end to yield a complete sequence for the parent. The statistics involved in taking this approach require that many copies of the original clone be randomly fragmented, if no gaps are to be tolerated in the final sequence. A benefit is that the final sequence is highly reliable; the main disadvantage is that the same sequence must be done many times (in the many overlapping fragments). Nevertheless, shotgun sequencing has been the primary means for generating most of the genomic sequence data in public DNA databases. This includes the longest contiguous fragment of sequenced human DNA, from the human T-cell receptor beta region, of about 685,000 base pairs -- a product of DOE-supported work at the University of Washington.
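The assembly step can be illustrated with a toy example. The following Python sketch uses a simple greedy strategy: it repeatedly merges the two fragments with the longest end-to-start overlap. This is only an illustrative stand-in, not the software used by any of the laboratories mentioned here; the sequences are invented, the function names are hypothetical, and real assemblers must also cope with sequencing errors, repeated sequences, and fragments read from either strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge fragments greedily by longest overlap; return the resulting contigs."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining fragments are separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Invented parent sequence and four overlapping reads, deliberately out of order.
parent = "ATGCGTACGTTAGCCTAGGATC"
reads = ["TAGCCTAGG", "ATGCGTACG", "TAGGATC", "TACGTTAGC"]
print(greedy_assemble(reads))
```

In this toy case the four unordered reads merge back into the single parent sequence; had any stretch of the parent gone unsampled, the result would instead be two or more separate contigs.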

The shotgun strategy is also being used at the Genome Therapeutics Corporation and The Institute for Genomic Research (TIGR), as part of the DOE-supported Microbial Genome Initiative. Genome Therapeutics has sequenced 1.8 million base pairs of Methanobacterium thermoautotrophicum, a bacterium important in energy production and bioremediation, and TIGR has successfully sequenced the complete genomes of three free-living bacteria, Haemophilus influenzae (1,830,137 base pairs; an effort supported mostly by private funds), Mycoplasma genitalium (580,070 base pairs), and Methanococcus jannaschii (1,739,933 base pairs).

The alternative to shotgun sequencing is a directed approach, in which one seeks to sequence the target clone from end to end with a minimum of duplication. The essence of this approach is embodied in a technique known as primer walking. Starting at one end of a single large fragment, one replicates a stretch of DNA -- say, 400 base pairs long -- that can be sequenced in one run. With the sequence for this first segment in hand, the next stretch of DNA, just overlapping the first, is then tackled in the same way. In principle, one can thus "walk" the entire length of the original clone. Unfortunately, this conceptually simple approach has been historically beset with disadvantages, mainly the expense and inconvenience of custom-synthesizing a primer as the necessary starting point for each sequencing step. The widely automated Sanger sequencing method involves a DNA replication step that must be "primed" by a DNA fragment that is complementary to 15 to 20 base pairs of the strand to be sequenced (see Tools of the Trade). Until recently, making these primers was an expensive and time-consuming business, but recent innovations have made primer walking, and similar directed strategies, more and more economically feasible.

Taking a directed approach (50k GIF)

One way to deal with the primer bottleneck, for example, is to use sets of very short fragments to prime the next sequencing step. As an illustration, the four nucleotides (A, T, C, and G) can be ordered in more than 68 billion ways to create an 18-base primer, an imposing set of possibilities. But it is eminently practical to create a library of the 4096 possible 6-base primers. Three of these "6-mers" can be matched to the end of the fragment to be sequenced, thus serving as an 18-base primer. This modular primer technology, developed at the Brookhaven National Laboratory, is currently being applied to Borrelia burgdorferi, the organism that causes Lyme disease; a 34,000-base-pair fragment has already been sequenced.
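The counting argument behind modular primers is easy to verify. This short Python sketch computes the number of possible 18-base and 6-base primers, builds the full library of 6-mers, and splits an invented 18-base priming sequence into three adjacent 6-mers drawn from that library. The example sequence and function name are hypothetical.

```python
from itertools import product

BASES = "ATCG"

n_18mers = 4 ** 18   # every possible 18-base primer
n_6mers = 4 ** 6     # every possible 6-base primer
print(f"{n_18mers:,} possible 18-mers; {n_6mers:,} possible 6-mers")

# The complete 6-mer library is small enough to enumerate outright.
hexamer_library = {"".join(p) for p in product(BASES, repeat=6)}


def split_into_hexamers(primer: str) -> list[str]:
    """Split an 18-base primer into three adjacent 6-mers from the library."""
    if len(primer) != 18:
        raise ValueError("expected an 18-base primer")
    parts = [primer[i:i + 6] for i in range(0, 18, 6)]
    if not all(p in hexamer_library for p in parts):
        raise ValueError("primer contains characters other than A, T, C, G")
    return parts


# Invented 18-base priming sequence.
print(split_into_hexamers("GATTACAGGCTTAACCGT"))
```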

Another directed approach uses a naturally occurring genetic element called a transposon, which insinuates itself more or less randomly in longer DNA strands. This predilection for random insertion and the fact that the transposon's DNA sequence is well known are the keys to the sequencing strategy depicted schematically in Taking a directed approach. The largest clones are broken into smaller subclones (each of about 3000 base pairs), which then become the targets of the transposons. Multiple copies of each subclone are exposed to the transposons, and reaction conditions are controlled to yield, on average, a single insertion in each 3000-base-pair strand. The individual strands are then analyzed to yield, for each, the approximate position of the inserted transposon. By mapping these positions, a "minimum tiling path" can be determined for each subclone -- that is, a set of strands can be identified whose transposon insertions are roughly 300 base pairs apart. In this set of strands, the region around each transposon is then sequenced, using the inserted transposons as starting points. The known transposon sequence allows a single primer to be used for sequencing the full set of overlapping regions.
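Choosing a minimum tiling path can be pictured as a stepping-stone problem: from each chosen insertion, jump to the farthest insertion still within reach. The Python sketch below applies that greedy rule to an invented list of insertion positions in a 3000-base-pair subclone. It assumes that consecutive chosen insertions, and the two ends of the subclone, should be no more than 300 base pairs apart; the actual selection procedure used in the laboratory is not described here, and the function name is hypothetical.

```python
def tiling_path(insert_positions: list[int], subclone_len: int = 3000,
                max_gap: int = 300) -> list[int]:
    """Pick a small set of insertions with no gap larger than max_gap.

    Greedy rule: from the last chosen position, take the farthest
    insertion that is still within max_gap.
    """
    positions = sorted(set(insert_positions))
    path = []
    last = 0  # start of the subclone
    while last + max_gap < subclone_len:
        reachable = [p for p in positions if last < p <= last + max_gap]
        if not reachable:
            raise ValueError(f"no insertion within {max_gap} bp after position {last}")
        last = reachable[-1]
        path.append(last)
    return path


# Invented insertion positions, one per strand of the subclone.
mapped = [150, 280, 300, 450, 590, 610, 880, 900, 1150, 1200, 1490,
          1600, 1790, 1900, 2080, 2200, 2370, 2500, 2660, 2790, 2950]
path = tiling_path(mapped)
print(path)
```

Of the 21 mapped strands in this example, 11 suffice to span the subclone; the rest need not be sequenced at all.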

At the Lawrence Berkeley National Laboratory, this technique has been used to sequence over 1.5 million base pairs of DNA on human chromosomes 5 and 20, as well as over three million base pairs from the fruit fly Drosophila melanogaster. On chromosome 5, interest focuses on a region of three million base pairs that is rich in growth factor and receptor genes; on chromosome 20, Berkeley researchers are interested in a region of about two million base pairs that is implicated in 15 to 20 percent of all primary breast carcinomas. As an example of the kind of output these efforts produce, Sequence data: The final product shows a stretch of sequence data from chromosome 5.

Sequence data: The final product (32k GIF)

Researchers supported by the DOE at the University of Utah are also pursuing the use of directed sequencing. In addition, they have developed a methodology for "multiplex" DNA sequencing, which offers a way of increasing throughput with either shotgun or directed approaches. By attaching a unique identifying sequence to each sequencing sample in a mixture of, say, 50 such samples, the entire mixture can be analyzed in a single electrophoresis lane. The 50 samples can be resolved sequentially by probing, first, for bands containing the first identifier, then for bands containing the second, and so forth. In a similar way, multiplexing can also be used for mapping. The Utah group is now able to map almost 5000 transposons in a single experiment, and they are using multiplexing in concert with a directed sequencing strategy to sequence the 1.8 million base pairs of the thermophilic microbe Pyrococcus furiosus and two important regions of human chromosome 17.

The completed physical maps of chromosomes 16 and 19, with their extensive coverage in many different kinds of cloning vectors, are especially ripe for large-scale sequencing. Los Alamos scientists have therefore begun sequencing chromosome 16, focusing special effort on locating the estimated 3000 expressed genes on that chromosome and using those sites as starting points for directed genomic sequencing. A region of 60,000 base pairs has already been sequenced around the adult polycystic kidney gene, and good starts have been made in mapping other genes. Interestingly, even random sequencing has led to the identification of gene DNA in over 15 percent of the samples, confirming the apparent high density of genes on this chromosome. Between chromosome 16 and the short arm of chromosome 5, another Los Alamos target, the genome center there has produced almost two million base pairs of human DNA sequence.

A parallel effort is under way at Livermore on chromosome 19 and other targeted genomic regions. Using a shotgun approach, researchers there have completed over 1.3 million bases of genomic sequence. Initially, they are attacking two major regions of chromosome 19: one of about two million base pairs, containing several genes involved in DNA repair and replication, and another of approximately one million base pairs, containing a kidney disease gene. The Livermore scientists are making use of the I.M.A.G.E. cDNA resource to sequence the cDNA from these regions, along with the associated segments of the genome. In addition, Livermore scientists have targeted DNA repair gene regions throughout the genome and, in many cases, have done comparative sequencing of these genes in other species, especially the mouse. Such comparative sequencing has identified conserved sequence elements that might act as regulatory regions for these genes and has also assisted in the identification of gene function (see The Mighty Mouse).

How good is good enough?

The goal of most sequencing to date has been to guarantee an error rate below 1 in 10,000, sometimes even 1 in 100,000. However, the difference between one human being and another is more like one base pair in five hundred, so most researchers now agree that one error in a thousand is a more reasonable standard. To assure a higher level of confidence, and perhaps to uncover important individual differences, the most biologically or medically important regions would still be sequenced more exhaustively, but using this lowered standard would greatly reduce the cost of acquiring sequence data for the bulk of human DNA.

With this philosophy in mind, Los Alamos scientists have begun a project to determine the cost and throughput of a low-redundancy sequencing strategy known as sample sequencing (SASE, or "sassy"). Clones are selected from the high-resolution Los Alamos cosmid map, then physically broken into 3000-base-pair subclones -- much as in other sequencing approaches. In contrast to, say, shotgun sequencing, though, only a small random set of the subclones is then selected for sequencing. Sequence fragments already known -- end sequences, sequence-tagged sites, and so forth -- are used as the starting points. The result is sequence coverage for about 70 percent of the original cosmid clone, enough to allow identification of genes and ESTs, thus pinpointing the most critical targets for later, more thorough sequencing efforts. Further, the SASE-derived sequences provide enough information for researchers elsewhere to pursue just such comprehensive efforts, using whole genomic DNA. In addition, the cost of SASE sequencing is only one-tenth the cost of obtaining a complete sequence, and a genomic region can be "sampled" ten times as fast.

As the first major target of SASE analysis, Los Alamos scientists chose a cosmid contig of four million base pairs at the end (the telomere) of the short arm of chromosome 16. By early 1996, over 1.4 million base pairs had been sequenced, and a gene, EST, or suspected coding region had been located on every cosmid sampled.

In addition, Los Alamos is building on the SASE effort by using SASE sequence data as the basis for an efficient primer walking strategy for detailed genomic sequencing. The first application of this strategy, to a telomeric region on the long arm of chromosome 7, proved to be as efficient as typical shotgun sequencing, but it required only two- to threefold redundancy to produce a complete sequence, in contrast to the seven- to tenfold redundancy required in shotgun approaches. The resulting 230,000-base-pair sequence is the second-longest stretch of contiguous human DNA sequence ever produced.
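What the difference in redundancy means in practice can be sketched with simple arithmetic. The Python fragment below computes the raw bases that must be read to cover the 230,000-base-pair region at each level of redundancy quoted above. It also estimates how many bases a purely random shotgun would be expected to miss, using the idealized Poisson model (uncovered fraction equal to e raised to the power of minus the redundancy). That model is an assumption introduced here for illustration, not something stated in the text, and the function names are hypothetical.

```python
import math

CONTIG_BP = 230_000  # length of the chromosome 7 sequence, from the text


def raw_bases(target_bp: int, redundancy: float) -> float:
    """Total bases that must be read to cover the target at a given redundancy."""
    return target_bp * redundancy


def expected_uncovered(target_bp: int, redundancy: float) -> float:
    """Expected unsampled bases under an idealized random (Poisson) model."""
    return target_bp * math.exp(-redundancy)


for label, levels in (("primer walking", (2, 3)), ("shotgun", (7, 10))):
    low, high = (raw_bases(CONTIG_BP, c) for c in levels)
    print(f"{label}: {low:,.0f} to {high:,.0f} raw bases")

# Only the shotgun case is random, so the Poisson estimate applies there.
for c in (7, 10):
    missed = expected_uncovered(CONTIG_BP, c)
    print(f"random {c}-fold coverage: about {missed:.0f} bases expected unsampled")
```

Under the same idealized model, random sampling at only two- to threefold redundancy would leave roughly 5 to 14 percent of the region unread, which suggests why so low a redundancy is practical only when each sequencing step is deliberately placed.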

In a sense, though, even a complete genome sequence -- the ultimate physical map -- is only a start in understanding the human genome. The deepest mystery is how the potential of 100,000 genes is regulated and controlled, how blood cells and brain cells are able to perform their very different functions with the same genetic program, and how these and countless other cell types arise in the first place from a single undifferentiated egg cell. A first step toward solving these subtle mysteries, though, is a more complete physical picture of the master molecules that lie at the heart of it all.

SIDEBAR: Tools of the Trade

SIDEBAR: The Mighty Mouse

To Know Ourselves was prepared at the request of the U.S. Department of Energy, Office of Health and Environmental Research, as an overview of the Human Genome Project.