Bacterial Artificial Chromosomes (BACs)

Enriching BACs for Sequencing
with Sequence Tag Connectors

Several types of chromosome maps supporting biomedical research have been constructed during the Human Genome Program (HGP). In preparation for high-throughput chromosomal sequencing, the most valuable are megabase-scale assemblies of overlapping DNA clones (contigs). Building long contigs, however, has proven a difficult task. The contig maps of chromosome 16 developed at LANL and chromosome 19 developed at LLNL were largely complete in 1995. Other chromosomes were much less ready for high-throughput sequencing.

Obtaining a substantially uniform representation of the genome using recombinant DNA clones was itself a problem until a few years ago. The problem was solved by the DOE-supported development of the more stable and larger recombinant BACs (bacterial artificial chromosomes) by the team of Melvin Simon at CalTech, with later process improvements by the team of Pieter de Jong. To support the contig-building requirements of sequencers, sequence tag connectors (STCs) for the BACs are now being generated.

STC ideas were first used by George Church and evolved in several smaller-scale sequencing projects. Acquiring STC datasets for BACs representing a deep coverage of the whole human genome was advocated in 1995-6 (Venter, J.C., Smith, H.O., and Hood, L.E., Nature 381: 364-366) at sequencing workshops and a BAC resources meeting. A primary utility is illustrated below. The BACs whose STCs overlap an already sequenced region are candidate clones for extension of the sequence.

BAC End Sequencing Extends Contigs. Software tools are helping to position STCs. One tool, provided by the Genome Channel, allows investigators to view the contig positions of more than 15,000 BAC end sequences and their relationships to other clones and predicted genes and exons (gene-coding regions). In the figure, the black bar represents 250 kb of a much longer contig. Below the bar, the long horizontal lines denote BAC clones, of which the first, fifth, and sixth are candidates for extending the seed contig to the left. Above the bar, vertical tick marks indicate exons as predicted by GRAIL software. Exons connected by short horizontal lines represent putative gene models for the contig's forward DNA strand. (From Human Genome News v10n1-2)
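The candidate-selection idea described above can be sketched in a few lines of Python. This is only an illustration, not the method used by the Genome Channel or the STC teams. It assumes that each STC has already been matched to a position and strand on the seed contig, that a BAC end read points inward into the insert, and that every insert has the same nominal length (the 150 kb default is a hypothetical value). The names StcHit and extension_candidates are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class StcHit:
    """One BAC end sequence (STC) matched to the seed contig (hypothetical record)."""
    bac_id: str       # clone name
    contig_pos: int   # position of the match on the seed contig, in bp
    strand: str       # "+" if the end read runs left-to-right on the contig, else "-"


def extension_candidates(hits, contig_len, insert_size=150_000):
    """Return (bac_id, side, overhang_bp) for BACs expected to reach past a contig edge.

    Assumption: an end read points into the insert, so a "-" hit implies the
    insert lies to the left of the match and a "+" hit implies it lies to the right.
    """
    candidates = []
    for h in hits:
        if h.strand == "-":
            side = "left"
            overhang = insert_size - h.contig_pos
        else:
            side = "right"
            overhang = h.contig_pos + insert_size - contig_len
        if overhang > 0:
            candidates.append((h.bac_id, side, overhang))
    # Largest expected extension first
    return sorted(candidates, key=lambda c: -c[2])


hits = [
    StcHit("bacA", 40_000, "-"),
    StcHit("bacB", 200_000, "-"),
    StcHit("bacC", 180_000, "+"),
]
result = extension_candidates(hits, contig_len=250_000)
print(result)
```

With these made-up hits, bacA is expected to extend the contig to the left and bacC to the right, while bacB lies entirely within the sequenced region. Such candidates would still need validation before sequencing.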

Upon receipt of applications to a DOE 1996 HGP research solicitation, Ari Patrinos implemented a fast-track, special review panel for those relevant to the contig problem. Overall the panel found the STC strategies meritorious but recommended pilot projects rather than an immediate genome-scale implementation. Thus STC production protocols could be refined and economics clarified. Projects were initiated at six sites with a total of $5 million in September 1996.

Several months later a workshop and review were held to assess progress. Each team had developed substantive and useful results. A major recommendation emerging from subsequent discussions was that DOE should maintain its support near the current level, about $5M/yr, but that an STC production phase should be implemented only at the sites achieving the highest-quality sequence reads. These reads would enable the more-demanding design of sequence tag sites (STSs) in addition to serving as STCs. STSs support other mapping methodologies, including BAC positioning on radiation hybrid maps, which are an important complement to contig maps.

After a transition phase, high-throughput STC production was implemented only at The Institute for Genomic Research (TIGR), initially under Mark D. Adams, and at the University of Washington, with production managed by Gregory Mahairas of Leroy E. Hood's Department of Molecular Biotechnology (UWMB). A September 1998 site review at the newly opened High Throughput Sequencing Center of the UW Department of Molecular Biotechnology reaffirmed plans for completing the STC projects at TIGR and UWMB by fall 1999.

This timeline has now been shortened, however. In March 1999 the consortium of major sequencing centers announced a short-term objective of generating a draft sequence of the human genome within a year. Availability of the full BAC STC datasets was found crucial to achieving this goal. TIGR and the University of Washington consequently reprogrammed their ongoing projects. With additional support from DOE, STC production was expected to be substantially complete in July 1999.

The UWMB datasets already include restriction fingerprints for BACs of the CalTech library. Extension of fingerprinting to the RPCI BACs is planned. The fingerprints will help sequencing teams validate candidate BACs overlapping their sequenced regions by distinguishing them from those that merely have limited homology within their STCs and probably represent distant chromosomal loci. With the increasing evidence of duplicated regions within the genome, great care is necessary for validating contig extension before commitment to expensive sequencing.
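A fingerprint comparison of this kind might look like the following sketch. It is an illustration under stated assumptions, not the UWMB procedure. It assumes a fingerprint is a list of restriction fragment sizes in base pairs, that two fragments match when their sizes agree within a relative tolerance (the 2% default is a hypothetical value), and that the sequenced region's fingerprint comes from a computed digest of its sequence. The function names are invented for this example.

```python
def shared_fragments(fp_a, fp_b, tol=0.02):
    """Count fragments common to two fingerprints.

    Each fingerprint is a list of fragment sizes in bp. Two fragments match
    when their sizes differ by at most `tol` times the larger size
    (hypothetical tolerance). Each fragment is matched at most once.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol * max(a[i], b[j]):
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def shared_fraction(fp_a, fp_b, tol=0.02):
    """Shared fragments as a fraction of the smaller fingerprint."""
    smaller = min(len(fp_a), len(fp_b))
    return shared_fragments(fp_a, fp_b, tol) / smaller if smaller else 0.0


candidate_bac = [1200, 2500, 4100, 7300, 9800]      # made-up fragment sizes
sequenced_region = [1210, 2500, 5000, 7350, 12000]  # made-up computed digest
n_shared = shared_fragments(candidate_bac, sequenced_region)
frac = shared_fraction(candidate_bac, sequenced_region)
print(n_shared, frac)
```

A candidate whose STC matches the sequenced region but whose fingerprint shares few or no fragments with it is more likely to come from a duplicated or distant locus than from a true overlap.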

For the CalTech BACs there is an expanding correlation with cDNAs of the Unigene collection. This will enable the concurrent sequencing of BACs with the messenger RNAs (as represented by cDNAs) they putatively encode. This research area is complemented by a DOE-initiated series of Workshops on Complete DNA Sequencing addressing international coordination of cDNA sequencing. Recognition and specification of gene-coding segments of chromosomes is greatly aided when both genome and cDNA sequences are available.

Recent reports from the STC teams were presented at the January 1999 DOE HGP Contractors and Grantees Workshop, together with many other reports relevant to genome sequencing. Detailed information and protocols are on-line at The Institute for Genomic Research (TIGR) and the Department of Molecular Biotechnology's Sequencing Center. Both teams, along with the DOE Genome Annotation Consortium, are providing online tools to aid sequencers worldwide in the identification of BACs needed for contig extension.

For major sequencing centers, the BAC libraries are available directly from CalTech and RPCI. Sites that require fewer BACs can obtain them through commercial suppliers or regional resource centers in Europe after identifying contig-extension candidates by comparing the STC database with chromosome seed sequences.

At an October 1998 meeting of sequencing team leaders at the NIH/NHGRI it was recognized that although large validated contigs remain the most desirable inputs, sequencing begun on single interesting BACs also has a substantial role to play in the total HGP effort. The STC datasets will be particularly useful for sequencing so initiated, as they will enable the rapid construction of contigs to guide sequence extension from the initial loci utilized. Use of STC data is integral to the whole human genome shotgun-sequencing strategy of Celera, Inc.

Genome projects for other species in which STC datasets are either in use or planned currently include Arabidopsis thaliana and the mouse.

Please e-mail any comments and suggestions for further related links to Marvin Stodolsky for the DOE Human Genome Task Group.

The paragraphs below correspond to some of the Hot Links in the preceding text.

Review of the STC related applications:
The applications to the HGP solicitation were received in April 1996 and reviewers for the special panel were obtained during May. The panel represented genomics efforts in seven countries and expertise in human and mouse genetics, mapping, sequencing, informatics and management. In preparation for a joint discussion, reviewers received the applications and returned initial critiques by e-mail. Some requests for clarification were forwarded to applicants and their responses returned to the anonymous reviewers. Outstanding differences in reviewer opinions were listed, in preparation for a joint conference call in July 1996. Staff of the DOE, NIH and NSF were listen-in observers, with M. Stodolsky coordinating for the DOE. The individual reviewer critiques were subsequently completed and sent to DOE. Following a final assessment by DOE Human Genome Task Group (HGTG) staff, pilot projects at six sites were initiated, with funds transmitted in September 1996.

Pilot Projects

The BAC libraries were provided by the teams under:

Melvin Simon at the California Institute of Technology (CalTech),
Pieter de Jong at Children's Hospital Oakland Research Institute [formerly at the Roswell Park Cancer Inst. (RPCI)].

A basic technical problem was to purify BAC DNAs economically but with quality high enough to obtain good sequence reads. This task was addressed by:

the CalTech team,
the RPCI team,
Skip Garner and Glen Evans at University of Texas SW Medical Center,
a team under Leroy E. Hood at the University of Washington, Department of Molecular Biotechnology (UWMB),
a team under M.D. Adams at TIGR (The Institute for Genomic Research). Following Adams's transfer to Celera, Inc., STC production at TIGR is now managed by William Nierman and Shaying Zhao.

An analysis of regions with chromosome segment duplications discovered in the laboratory of J. Korenberg at the Cedars-Sinai Medical Center was extended, because duplications can pose troublesome ambiguities to contig map construction. Other problematic cases are described in research by Evan Eichler and colleagues:

Eichler, EE, Lu, F, Shen, Y, Antonucci, R, Doggett, NA, Moyzis, RK, Baldini, A, Gibbs, RA, Nelson, DL. (1996) Duplication of the Xq28 CDM-CTR region to 16p11.1: A novel pericentromeric-directed mechanism for paralogous genome evolution. Hum. Molec. Genet. 5: 899-912.

Eichler, EE, Budarf, ML, Rocchi, M, Deaven, LD, Doggett, NK, Nelson, DL, Mohrenweiser, H. (1997) Interchromosomal duplications of the adrenoleukodystrophy locus: a phenomenon of pericentromeric plasticity. Hum. Molec. Genet. 6: 991-1002.

Eichler, EE. (1998) Masquerading repeats: Paralogous pitfalls of the human genome. Genome Res. 8: 758-762.

Eichler, EE, Hoffman, SM, Gordon, LA, McCready, P, Lamerdin, JE, Mohrenweiser, HW. (1998) Complex beta-satellite repeat structures and the expansion of the zinc-finger gene cluster in 19p12. Genome Res. 8: 791-808.

and by Barbara Trask and colleagues:

Trask, B.J., Friedman, C., Martin-Gallardo, A., Rowen, L., Akinbami, C., Blankenship, J., Collins, C., Giorgi, D., Iadonato, S., Johnson, F., Kuo, W.-L., Massa, H., Morrish, T., Naylor, S., Nguyen, O.T.H., Rouquier, S., Smith, T., Wong, D.J., Youngblom, J., van den Engh, G. (1998) Members of the olfactory receptor gene family are contained in large blocks of DNA duplicated polymorphically near the ends of human chromosomes. Human Molecular Genetics 7: 13-26.

Trask, B.J., Massa, H., Brand-Arpon, V., Chan, K., Friedman, C., Nguyen, O.T., Eichler, E., van den Engh, G., Rouquier, S., Shizuya, H., Giorgi, D. (1998) Large multi-chromosomal duplications encompass many members of the olfactory receptor gene family in the human genome. Human Molecular Genetics 7: (in press).

Pilot projects workshop and review on May 29, 1997

GRANTEES

The Institute for Genomic Research, abstract: Mark Adams, Steve Rounsley, Jenny Kelley and Hamilton O. Smith from Johns Hopkins University

Roswell Park Cancer Institute, abstract: Pieter de Jong, Joseph Catanese

University of Texas Southwestern Medical Center, abstract: Glen A. Evans and Harold R. Garner

University of Washington, Department of Molecular Biotechnology, abstract: Leroy E. Hood, Greg Mahairas, Todd Smith and Keith Zackrone

California Institute of Technology, abstract: Melvin I. Simon and Ung-Jin Kim

Cedars-Sinai Medical Center, abstract: Julie R. Korenberg

REVIEWERS:

Larry L. Deaven, Los Alamos National Laboratory
Trevor L. Hawkins, MIT Whitehead Inst.
Stanley Letovsky, Genome Data Base at Johns Hopkins University
David L. Nelson, Baylor College of Medicine
Michael Palazzolo, Lawrence Berkeley National Laboratory
Richard M. Myers, Stanford University School of Medicine
Lisa Stubbs, Oak Ridge National Laboratory

OBSERVERS/DISCUSSANTS:

Elbert W. Branscomb, DOE Joint Genome Institute
Robert W. Cottingham, Genome Data Base, Johns Hopkins University
Norman Doggett, Los Alamos National Laboratory
Sylvia Spengler, Lawrence Berkeley National Laboratory

U.S. GOVERNMENT STAFF:
DOE/OBER - Marvin Frazier, Daniel W. Drell, Arthur Katz, Marvin Stodolsky, David Thomassen
NIH/NHGRI - Mark S. Guyer, Adam Felsenfeld, Jane Peterson, Jeffery Schloss
NIH/NCI - Carol Dahl

STS versus STC requirements

For a sequence read to be useful as an STC, it need contain only a sequence segment unique to the source genome. For a read to be suitable for STS design, it must include two unique segments within the source genome. These segments must additionally lack features that would hinder priming for or read-through by DNA polymerases used in the polymerase chain reaction (PCR). Thus the requirements for STS design are more stringent than those for STC usage. A higher-quality, longer sequence read is generally necessary to support the more demanding STS design requirements.

A review of progress and plans at the recently opened sequencing facility used by the University of Washington, Department of Molecular Biotechnology was held in September 1998. Leroy E. Hood and Gregory Mahairas gave presentations and production projections (see abstract).

The reviewers were:

Ellson Chen, Applied Biosystems Division of Perkin Elmer, Inc.

David Nelson, Baylor College of Medicine

Robert Robbins, Fred Hutchinson Cancer Research Center

Elbert Branscomb attended as an observer from the DOE Joint Genome Institute.

DOE was represented by Marvin Frazier, Director of the OBER Life Sciences Division.

Validation of contigs
Clones that may overlap each other can be identified by a variety of methodologies, but none are foolproof by themselves. There may be some inadvertent representation of distant genomic loci. Also some clones may be defective due to accidents in construction or subsequent DNA rearrangements. Contig structure can be tentatively validated before sequencing by assessing whether putative overlaps contain such common representative features as restriction sites or STS markers. For members of a candidate contig passing validation tests, the one with minimal overlap of the previously sequenced region is the optimal choice for extending the seed region.
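The selection rule in the last sentence above can be sketched as follows. This is a minimal illustration, assuming each candidate clone has already been scored for the number of restriction sites or STS markers it shares with the seed region and for its estimated overlap in base pairs. The record fields, the function name, and the threshold of two shared markers are hypothetical choices for this example.

```python
def pick_extension_clone(candidates, min_shared_markers=2):
    """Choose the validated clone with the smallest overlap of the seed region.

    `candidates` is a list of dicts with hypothetical keys:
      "bac_id"         - clone name
      "shared_markers" - restriction sites or STS markers shared with the seed
      "overlap_bp"     - estimated overlap with the sequenced region, in bp
    A clone passes validation when it shares at least `min_shared_markers`
    markers (an assumed threshold). Returns None if no clone passes.
    """
    validated = [c for c in candidates if c["shared_markers"] >= min_shared_markers]
    if not validated:
        return None
    return min(validated, key=lambda c: c["overlap_bp"])


clones = [
    {"bac_id": "bacA", "shared_markers": 5, "overlap_bp": 60_000},
    {"bac_id": "bacB", "shared_markers": 3, "overlap_bp": 15_000},
    {"bac_id": "bacC", "shared_markers": 0, "overlap_bp": 2_000},  # fails validation
]
best = pick_extension_clone(clones)
print(best["bac_id"])
```

Here bacC has the smallest apparent overlap but shares no markers with the seed region, so it is rejected as a possible distant locus, and bacB is chosen over bacA because it repeats less of the sequence already in hand.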

In the United States, Genome Systems Inc. and Research Genetics distribute clonal resources and provide screening services. In Europe, similar services are provided by the U.K. Human Genome Mapping Project Resource Centre and the German Resource Center.
