Enriching BACs for Sequencing
with Sequence Tag Connectors
Several types of chromosome maps supporting biomedical research have been constructed
during the Human Genome Program (HGP). In the preparation for high-throughput
chromosomal sequencing, the most valuable are megabase-scale assemblies of overlapping
DNA clones (contigs). Building long contigs, however, has proven a difficult
task. The contig maps of chromosome 16 developed at LANL and
chromosome 19 developed at LLNL
were largely complete in 1995. Other chromosomes were much less ready for high-throughput
sequencing.
Obtaining a substantially uniform representation of the genome using recombinant
DNA clones was itself a problem until a few years ago. The problem was solved
by the DOE-supported development of the more stable and larger recombinant BACs
(bacterial artificial chromosomes) by the team of Melvin Simon at CalTech, with later process
improvements by the team of Pieter de
Jong. To support the contig building requirements of sequencers, sequence
tag connectors (STCs) for the BACs are now being generated.
STC ideas were first used by George Church and evolved in several smaller-scale
sequencing projects. Acquiring STC datasets for BACs representing a deep coverage
of the whole human genome was advocated in 1995-6 (Venter, J.C., Smith, H.O.,
and Hood, L.E., Nature 381: 364-366) at sequencing
workshops and a BAC resources meeting. A primary
utility is illustrated below. The BACs whose STCs overlap an already sequenced
region are candidate clones for extension of the sequence.
BAC End Sequencing Extends Contigs. Software tools are helping to position
STCs. One tool, provided by the Genome Channel, allows investigators to view
the contig positions of more than 15,000 BAC end sequences and their relationships
to other clones and predicted genes and exons (gene-coding regions). In the
figure, the black bar represents 250 kb of a much longer contig. Below the bar,
the long horizontal lines denote BAC clones, of which the first, fifth, and
sixth are candidates for extending the seed contig to the left. Above the bar,
vertical tick marks indicate exons as predicted by GRAIL software. Exons connected
by short horizontal lines represent putative gene models for the contig's forward
DNA strand. (From Human Genome News v10n1-2)
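The candidate-selection idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the method used by any of the STC teams: matching is by exact substring on either strand (real searches use alignment tools that tolerate sequencing errors), and the names `reverse_complement`, `stc_hits_seed`, and `extension_candidates` are hypothetical. The sketch also assumes that a BAC is a useful extension candidate when exactly one of its two end reads falls inside the sequenced seed region, since a clone with both ends inside the seed would not reach beyond it.

```python
from typing import Dict, List, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def stc_hits_seed(stc: str, seed: str) -> bool:
    """True if the STC read occurs exactly in the seed on either strand."""
    stc, seed = stc.upper(), seed.upper()
    return stc in seed or reverse_complement(stc) in seed


def extension_candidates(seed: str, bac_ends: Dict[str, Tuple[str, str]]) -> List[str]:
    """List BACs with exactly one end read (STC) inside the seed sequence.

    Assumption for this sketch: one end inside the seed and one end outside
    suggests the clone overlaps the seed and extends past it.
    """
    candidates = []
    for bac_id, (end_a, end_b) in bac_ends.items():
        hits = [stc_hits_seed(end_a, seed), stc_hits_seed(end_b, seed)]
        if hits.count(True) == 1:
            candidates.append(bac_id)
    return sorted(candidates)


if __name__ == "__main__":
    seed = "ACGTTGCAAGGCTTAACCGGTT"  # toy stand-in for a sequenced region
    bac_ends = {
        "bacA": ("TTGCAAGG", "GGGGGGGGCC"),   # one end inside the seed
        "bacB": ("CCCCCAAAAA", "TTTTTTTGGG"),  # no end inside the seed
        "bacC": ("ACGTTG", "AACCGG"),          # both ends inside the seed
    }
    print(extension_candidates(seed, bac_ends))
```

Candidates found this way would still need validation, for example by fingerprint comparison, because a short match can also arise from a duplicated region elsewhere in the genome.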
Upon receipt of applications to a DOE 1996 HGP research solicitation, Ari
Patrinos implemented a fast-track, special review
panel for those relevant to the contig problem. Overall the panel found
the STC strategies meritorious but recommended pilot projects rather than an
immediate genome scale implementation. Thus STC production protocols could be
refined and economics clarified. Projects were initiated at six
sites with a total of $5 million in September, 1996.
Several months later a workshop
and review was held to assess progress. Each team had developed substantive
and useful results. A major recommendation emerging from subsequent discussions
was that DOE should maintain its support near the current level, about
$5M/yr, but that an STC production phase should be implemented only at the sites
achieving the highest-quality sequence reads. These reads would enable the more-demanding design of sequence tag sites (STSs) in
addition to serving as STCs. STSs support other mapping methodologies,
including BAC positioning on radiation
hybrid maps, which are an important complement to contig maps.
After a transition phase, high-throughput STC production was implemented only
at The Institute for Genomic Research (TIGR), initially under Mark D. Adams, and
at the University of Washington, with production managed by Gregory
Mahairas of Leroy E. Hood's Department of Molecular
Biotechnology (UWMB). A September 1998 site review at the newly opened High Throughput Sequencing Center of
the UW Department of Molecular Biotechnology reaffirmed plans for completing
the STC projects at TIGR and UWMB by fall 1999.
This timeline has now been shortened, however. In March 1999 the consortium
of major sequencing centers announced a short-term objective of generating a
draft sequence of the human genome within a year.
Availability of the full BAC STC datasets was found crucial to achieving this
goal. TIGR and the University of Washington consequently reprogrammed their
ongoing projects. With additional support from DOE, STC production was expected
to be substantially complete in July 1999.
The UWMB datasets already include restriction fingerprints for BACs
of the CalTech library. Extension of fingerprinting to the RPCI BACs is planned.
The fingerprints will help sequencing teams validate
candidate BACs overlapping their sequenced regions by distinguishing them from
those that merely have limited homology within their STCs and probably represent
distant chromosomal loci. With the increasing evidence of duplicated regions
within the genome, great care is necessary for validating contig extension before
commitment to expensive sequencing.
For the CalTech BACs there is an expanding correlation with cDNAs of the
UniGene collection.
This will enable the concurrent sequencing of BACs with the messenger RNAs (as
represented by cDNAs) they putatively encode. This research area is complemented
by a DOE-initiated series of Workshops
on Complete DNA Sequencing addressing international coordination of cDNA
sequencing. Recognition and specification of gene-coding segments of chromosomes
is greatly aided when both genome and cDNA sequences are available.
Recent reports from the STC teams were presented at the January 1999 DOE HGP
Contractors and Grantees Workshop, together with many other reports relevant
to genome sequencing. Detailed information and protocols are on-line at The Institute
for Genomic Research (TIGR) and the Department of Molecular Biotechnology's
Sequencing Center. Both teams, along with the DOE Genome Annotation Consortium, are providing
online tools to help sequencers worldwide identify the BACs needed
for contig extension.
For major sequencing centers, the BAC libraries are available directly from
CalTech and RPCI. Sites that require fewer BACs can first identify
contig-extension candidates by comparing their chromosome seed sequences
against the STC database, and then obtain the clones through commercial
suppliers or regional resource centers in Europe.
At an October 1998 meeting of sequencing team leaders
at the NIH/NHGRI it was recognized that although
large validated contigs remain the most desirable inputs, sequencing begun on
single interesting BACs also has a substantial role to play in the total HGP
effort. The STC datasets will be particularly useful for sequencing so initiated,
as they will enable the rapid construction of contigs to guide sequence extension
from the initial loci utilized. Use of STC data is integral to the whole human
genome shotgun-sequencing strategy of Celera, Inc.
Genome projects for other species in which STC datasets are either in use or
planned currently include Arabidopsis
thaliana and the mouse.
Please e-mail any comments and suggestions for further related links to Marvin Stodolsky for the DOE
Human Genome Task Group.
The paragraphs below correspond to some of the Hot Links in the preceding text.
Review of the STC related applications:
The applications to the HGP solicitation were received in April 1996 and reviewers
for the special panel were obtained during May. The panel represented genomics
efforts in seven countries and expertise in human and mouse genetics, mapping,
sequencing, informatics and management. In preparation for a joint discussion,
reviewers received the applications and returned initial critiques by e-mail.
Some requests for clarification were forwarded to applicants and their responses
returned to the anonymous reviewers. Outstanding differences in reviewer opinions
were listed, in preparation for a joint conference call in July 1996. Staff
of the DOE, NIH and NSF were listen-in observers, with M. Stodolsky coordinating
for the DOE. The individual reviewer critiques were subsequently completed and
sent to DOE. Following a final assessment by DOE Human Genome Task Group (HGTG)
staff, pilot projects at six sites were initiated, with funds transmitted in
September 1996.
Pilot Projects
The BAC libraries were provided by the teams under:
Melvin Simon at the California Institute of Technology (CalTech),
Pieter de Jong at Children's Hospital Oakland Research Institute [formerly at the Roswell Park Cancer Inst. (RPCI)].
A basic technical problem was to purify BAC DNAs economically but with quality
high enough to obtain good sequence reads. This task was addressed by:
the CalTech team,
the RPCI team,
Skip Garner and Glen Evans at University of Texas SW Medical Center,
a team under Leroy E. Hood at the University of Washington, Department of Molecular Biotechnology (UWMB),
a team under M.D. Adams at TIGR (The Institute for Genomic Research). Following Adams's transfer to Celera, Inc., STC production at TIGR is now managed by William Nierman and Shaying Zhao.
An analysis of regions with chromosome
segment duplications discovered in the laboratory of J.
Korenberg at the Cedars Sinai Medical Center was extended, because duplications
can pose troublesome ambiguities to contig map construction. Other
problematic cases are described in research by Evan Eichler and
colleagues:
Eichler, EE, Lu, F, Shen, Y, Antonucci, R, Doggett, NA,
Moyzis, RK, Baldini, A, Gibbs, RA, Nelson, DL. (1996) Duplication of the Xq28
CDM-CTR region to 16p11.1: A novel pericentromeric-directed mechanism for
paralogous genome evolution. Hum. Molec. Genet. 5: 899-912.
Eichler, EE, Budarf, ML, Rocchi, M, Deaven, LD, Doggett, NA, Nelson, DL,
Mohrenweiser, H. (1997) Interchromosomal duplications of the
adrenoleukodystrophy locus: a phenomenon of pericentromeric plasticity.
Hum. Molec. Genet. 6: 991-1002.
Eichler, EE. (1998) Masquerading repeats: Paralogous pitfalls of the human
genome. Genome Res. 8: 758-762.
Eichler, EE, Hoffman, SM, Gordon, LA, McCready, P, Lamerdin, JE,
Mohrenweiser, HW. (1998) Complex beta-satellite repeat structures and the
expansion of the zinc-finger gene cluster in 19p12. Genome Res. 8: 791-808.
and by Barbara Trask and colleagues:
Trask, B.J., Friedman, C., Martin-Gallardo, A., Rowen, L, Akinbami, C.,
Blankenship, J, Collins, C, Giorgi, D., Iadonato, S., Johnson, F., Kuo, W.-L.,
Massa, H., Morrish, T., Naylor, S., Nguyen, O.T.H., Rouquier, S., Smith, T.,
Wong, D.J., Youngblom, J., van den Engh, G. (1998) Members of the olfactory
receptor gene family are contained in large blocks of DNA duplicated polymorphically
near the ends of human chromosomes. Human Molecular Genetics 7: 13-26.
Trask, B.J., Massa, H., Brand-Arpon, V., Chan, K., Friedman C., Nguyen,
O.T., Eichler, E., van den Engh, G., Rouquier, S., Shizuya H., Giorgi, D.
(1998) Large multi-chromosomal duplications encompass many members of the
olfactory receptor gene family in the human genome. Human Molecular Genetics
7: (in press).
Pilot projects workshop and review on May 29, 1997
GRANTEES:
The Institute for Genomic Research: Mark Adams, Steve Rounsley, Jenny Kelley and Hamilton O. Smith from Johns Hopkins University
Roswell Park Cancer Institute: Pieter de Jong, Joseph Catanese
University of Texas Southwestern Medical Center: Glen A. Evans and Harold R. Garner
University of Washington, Department of Molecular Biotechnology: Leroy E. Hood, Greg Mahairas, Todd Smith and Keith Zackrone
California Institute of Technology: Melvin I. Simon and Ung-Jin Kim
Cedars-Sinai Medical Center: Julie R. Korenberg
REVIEWERS:
Larry L. Deaven, Los Alamos National Laboratory
Trevor L. Hawkins, MIT Whitehead Inst.
Stanley Letovsky, Genome Data Base at Johns Hopkins University
David L. Nelson, Baylor College of Medicine
Michael Palazzolo, Lawrence Berkeley National Laboratory
Richard M. Myers, Stanford University School of Medicine
Lisa Stubbs, Oak Ridge National Laboratory
OBSERVERS/DISCUSSANTS:
Elbert W. Branscomb, DOE Joint Genome Institute
Robert W. Cottingham, Genome Data Base, Johns Hopkins University
Norman Doggett, Los Alamos National Laboratory
Sylvia Spengler, Lawrence Berkeley National Laboratory
U.S. GOVERNMENT STAFF:
DOE/OBER - Marvin Frazier, Daniel W. Drell, Arthur Katz, Marvin Stodolsky, David
Thomassen
NIH/NHGRI - Mark S. Guyer, Adam
Felsenfeld, Jane Peterson, Jeffery Schloss
NIH/NCI - Carol Dahl
STS versus STC requirements
For a sequence read to be useful as an STC, it need contain only a sequence
segment unique to the source genome. For a read to be suitable for STS design,
it must include two unique segments within the source genome. These segments
must additionally lack features that would hinder priming for, or read-through
by, DNA polymerases used in the polymerase chain reaction (PCR). Thus the requirements
for STS design are more stringent than those for STC usage. A higher-quality,
longer sequence read is generally necessary to support the more demanding STS
design requirements.
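The distinction between the two requirements can be made concrete with a toy classifier. The sketch below is an assumption-laden illustration rather than an actual STS design procedure: it treats a "unique segment" as a k-mer that occurs exactly once in a reference string, calls a read STC-only if it contains at least one such segment, and STS-capable if it contains two that do not overlap. The PCR-related constraints mentioned above (features that hinder priming or read-through) are not modeled, and the function names and the labels returned are hypothetical.

```python
from collections import Counter
from typing import List


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every k-mer in the reference string."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def unique_segment_starts(read: str, counts: Counter, k: int) -> List[int]:
    """Start positions in the read of k-mers that occur exactly once in the reference."""
    return [i for i in range(len(read) - k + 1) if counts[read[i:i + k]] == 1]


def classify_read(read: str, genome: str, k: int = 4) -> str:
    """Classify a read as 'neither', 'STC-only', or 'STS-capable' (toy criteria)."""
    starts = unique_segment_starts(read, kmer_counts(genome, k), k)
    if not starts:
        return "neither"
    # Two non-overlapping unique segments exist if the first and last
    # unique k-mers start at least k bases apart.
    if starts[-1] - starts[0] >= k:
        return "STS-capable"
    return "STC-only"


if __name__ == "__main__":
    # Toy reference: a unique stretch flanked by repetitive sequence.
    genome = "AAAAAAAAAAAA" + "CGTCTGAGCTTG" + "AAAAAAAAAAAA"
    for read in ("AAAAAA", "CGTC", "CGTCTGAGCTTG"):
        print(read, classify_read(read, genome))
```

In this toy setting the longer read qualifies for the stricter category simply because it spans more unique sequence, which mirrors the point that longer, higher-quality reads are generally needed for STS design.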
A review of progress and plans at the recently opened sequencing
facility used by the University of Washington, Department of Molecular Biotechnology
was held in September 1998. Leroy E. Hood and Gregory Mahairas gave presentations
and production projections (see abstract).
The reviewers were:
Ellson Chen, Applied Biosystems Division of Perkin Elmer, Inc.
David Nelson, Baylor College of Medicine
Robert Robbins, Fred Hutchinson Cancer Research Center
Elbert Branscomb attended as an observer from the DOE
Joint Genome Institute.
DOE was represented by Marvin Frazier, Director of the OBER Life
Sciences Division.
Validation of contigs
Clones that may overlap each other can be identified by a variety of methodologies,
but none is foolproof on its own. There may be some inadvertent representation
of distant genomic loci. Also some clones may be defective due to accidents
in construction or subsequent DNA rearrangements. Contig structure can be tentatively
validated before sequencing by assessing whether putative overlaps contain such
common representative features as restriction sites or STS markers. For members
of a candidate contig passing validation tests, the one with minimal overlap
of the previously sequenced region is the optimal choice for extending the seed
region.
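The selection rule in the paragraph above can be expressed as a short sketch. It is a simplified illustration under stated assumptions, not a description of any sequencing center's pipeline: each candidate clone is represented by an estimated overlap length and a set of feature names (for example STS markers or restriction fragments) observed in the putative overlap, and a candidate counts as validated when it shares at least `min_shared` features with the seed region. The `Candidate` class, the `pick_extension_clone` function, and the `min_shared` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    overlap_bp: int           # estimated overlap with the sequenced seed region
    features: FrozenSet[str]  # features seen in the putative overlap


def pick_extension_clone(
    seed_features: FrozenSet[str],
    candidates: List[Candidate],
    min_shared: int = 2,
) -> Optional[Candidate]:
    """Return the validated candidate with the smallest overlap, or None."""
    validated = [
        c for c in candidates
        if len(c.features & seed_features) >= min_shared
    ]
    if not validated:
        return None
    return min(validated, key=lambda c: c.overlap_bp)


if __name__ == "__main__":
    seed = frozenset({"sts1", "sts2", "sts3"})
    clones = [
        Candidate("cloneA", 40000, frozenset({"sts1", "sts2", "sts3"})),
        Candidate("cloneB", 5000, frozenset({"sts9"})),   # fails validation
        Candidate("cloneC", 12000, frozenset({"sts2", "sts3"})),
    ]
    best = pick_extension_clone(seed, clones)
    print(best.name if best else "no validated candidate")
```

Note that the clone with the smallest overlap overall is rejected here because it shares no features with the seed; minimal overlap is applied only after validation, as the text describes.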
In the United States, Genome
Systems Inc. and Research Genetics
distribute clonal resources and provide screening services. In Europe, similar
services are provided by the U.K. Human
Genome Mapping Project Resource Centre and the German
Resource Center.