Sequencing Abstracts

DOE Human Genome Program
Contractor-Grantee Workshop VIII
February 27-March 2, 2000  Santa Fe, NM



1. Sequence Analysis of Human Chromosome 19

Anne Olsen1, Paul Predki2, Ken Frankel2, Laurie Gordon1, Astrid Terry1, Matt Nolan1, Mark Wagner1, Amy Brower1, Andrea Aerts2, Marnel Bondoc2, Kristen Kadner2, Manesh Shah3, Richard Mural3, Miriam Land3, Denise Schmoyer3, Sergey Petrov3, Doug Hyatt3, Morey Parang3, Jay Snoddy3, Ed Uberbacher3, and the JGI Production Sequencing Team

1Lawrence Livermore National Laboratory and 2Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Production Sequencing Facility, Walnut Creek, CA; 3Genome Annotation Consortium, Oak Ridge National Laboratory, Oak Ridge, TN

olsen2@llnl.gov

Chromosome 19 has an estimated size of ~65 Mb and is the most GC-rich human chromosome. The higher than expected proportion of genes and ESTs mapped to this chromosome suggests that it is exceptionally gene-rich, consistent with its high GC content. Sequencing of chromosome 19 will thus be especially rewarding in terms of gene discovery and elucidation of higher order gene organization. The sequence-ready BAC/cosmid map of chromosome 19 constructed at Lawrence Livermore National Laboratory currently consists of 17 ordered, restriction mapped BAC/cosmid contigs of average size 3.3 Mb spanning 56.3 Mb, or approximately 97% of the estimated 58 Mb comprising the p- and q-arms. A minimally overlapping set of clones (468 cosmids and 290 BACs) spanning the chromosome has been selected for sequencing. About 15 Mb of unique sequence has been finished and submitted to GenBank. Draft sequence (minimum coverage 3X) has been generated for about 68% of the remaining territory with an average depth of 7.7X. Sequence contigs have been ordered and oriented for about 5.2 Mb of the draft sequence. Updated map and sequence data are available from the LLNL web site and the JGI web site. Sequence is being analyzed through the Genome Annotation Pipeline at Oak Ridge National Laboratory. The analysis of 15 Mb of finished genomic sequence yielded 719 gene models predicted by Genscan and 766 gene models predicted by GRAIL-EXP. About two-thirds of the gene models predicted by GRAIL-EXP were aligned with one or more ESTs. Of the predicted proteins, 500 from Genscan and 456 from GRAIL-EXP had homologs with BLAST E-values <1.0e-5. Annotation summaries are available from the ORNL Genome Catalog and Genome Channel at http://genome.ornl.gov. Detailed analyses of specific chromosomal regions will be presented.

Supported by USDOE under Contracts W-7405-Eng-48 (LLNL), DE-AC03-76SF00098 (LBNL) and DE-AC05-96OR22464 (ORNL).


2. Draft Sequencing Procedures for Chromosome 16 Sequencing

Mark O. Mundt, David C. Bruce, Leslie Chasteen, Judith Cohn, Lynne Goodwin, Kristina Kommander, Chris Munk, Robert Sutherland, Norman Doggett, and Larry Deaven

Bioscience Division and DOE Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, NM 87545

mom@telomere.lanl.gov

As the amount of human sequence that is publicly available increases, efficient use of tools to monitor sequencing progress becomes a more important issue. Our current strategy for monitoring chromosome 16 draft data produced by the JGI includes immediate collection of information on 1) sequence marker content using ePCR, 2) BAC end sequence overlap with BLAST, 3) E. coli contamination level, 4) short subclone level, 5) Q20 quality analysis, and 6) confirmation of suspected overlaps based on our maps. These data are measured after the first two plates of forward and reverse sequencing, and decisions are then made for the continuation and desired level of sequence coverage.

Order and orientation of contigs become the main concern as the draft depth is increased. We present Java tools to assist both in detecting assembly and tracking/handling errors and in ordering contigs when possible. These tools take advantage of the paired end relationships that are available from our double-end plasmid sequencing approach. Finally, we demonstrate the importance of proper BAC end orientation in choosing clones to extend sequence as well as in feeding information back to a more accurate mapping process.
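
As an illustration of how paired end relationships can suggest contig adjacency, the following is a minimal sketch in Python (the tools described in the abstract are written in Java and are not reproduced here). It assumes each subclone has a forward and a reverse read that have already been placed in contigs; the subclone names, contig names, and the `min_links` threshold are hypothetical.

```python
from collections import Counter

def count_contig_links(placements):
    """Count subclones whose two end reads fall in different contigs.

    placements maps a subclone name to a pair
    (contig holding the forward read, contig holding the reverse read).
    """
    links = Counter()
    for fwd_contig, rev_contig in placements.values():
        if fwd_contig != rev_contig:
            # An unordered pair: the link does not depend on read direction.
            links[frozenset((fwd_contig, rev_contig))] += 1
    return links

def supported_joins(links, min_links=2):
    """Return contig pairs linked by at least min_links subclones."""
    return sorted(tuple(sorted(pair)) for pair, n in links.items() if n >= min_links)

# Hypothetical read placements for six subclones.
placements = {
    "sub001": ("contig1", "contig1"),
    "sub002": ("contig1", "contig2"),
    "sub003": ("contig2", "contig1"),
    "sub004": ("contig2", "contig3"),
    "sub005": ("contig3", "contig2"),
    "sub006": ("contig1", "contig3"),
}

links = count_contig_links(placements)
joins = supported_joins(links)
print(joins)
```

Requiring more than one linking subclone is one simple way to keep a single chimeric or mistracked clone from suggesting a false join; a real tool would also use read orientation and insert size.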


3. Large-Scale Finishing of Human and Mouse Genomic Sequences

Richard M. Myers, Jeremy Schmutz, Jane Grimwood, the Sequencing Group at Stanford Human Genome Center, and the Joint Genome Institute

The Stanford Human Genome Center and Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5120 and the Joint Genome Institute, 2800 Mitchell Drive, B100, Walnut Creek, CA 94598

http://www.jgi.doe.gov
http://www-shgc.stanford.edu
myers@shgc.stanford.edu

We have begun a new collaboration with the Joint Genome Institute to generate large amounts of finished human and mouse sequence from "draft" sequences produced by the JGI and its associated laboratories. Our goal is to produce about 100 Mb of finished sequence each year, focusing first on finishing human chromosome 19 while also finishing clones from human chromosomes 5 and 16 and syntenic mouse sequences. Our criterion for considering a large-insert clone finished is that it has an estimated base-pair error rate of less than one in 10,000 bp and that the entire sequence is contiguous, with the exception of small, difficult-to-fill gaps of known size in a small fraction of the clones.
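
The error-rate criterion above can be illustrated with a short Python sketch. It assumes that the error rate is estimated from per-base Phred quality values using the standard relation P = 10^(-Q/10); the abstract does not state how the estimate is made, and the function names and example scores are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality value Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def estimated_error_rate(quality_scores):
    """Mean per-base error probability over a consensus sequence."""
    probs = [phred_to_error_probability(q) for q in quality_scores]
    return sum(probs) / len(probs)

def meets_finished_criterion(quality_scores, max_error_rate=1e-4):
    """True if the estimated error rate is below one error in 10,000 bp."""
    return estimated_error_rate(quality_scores) < max_error_rate

# Hypothetical clone: mostly Q50 bases with a few Q20 bases.
mostly_good = [50] * 9990 + [20] * 10
# Hypothetical clone in which every base is Q30.
uniformly_q30 = [30] * 100

print(meets_finished_criterion(mostly_good))    # estimated rate is about 2 in 100,000
print(meets_finished_criterion(uniformly_q30))  # estimated rate is 1 in 1,000
```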

We receive subclones and sequence traces from the JGI and reassemble the data, generally resulting in assemblies with 10-20 contigs per 100 kb for the 6X of shotgun sequence data produced for each clone. We then use a computationally-driven process that requires almost no human decision-making to choose subclones, directed sequencing reactions, and, for a portion of the reactions, oligonucleotide primers. After applying this automated stage, all or almost all of the gaps are filled and the clone is passed to a group of finishers, who reassemble the sequence data and design specialized sequencing reactions to fill remaining gaps and to bring up the quality of the sequence so that the entire clone meets our finished criteria. The final sequence is checked by our informatics group and then submitted to GenBank. We have finished more than 3 Mb of sequence with the JGI since this collaboration began, and expect to have an additional 8 Mb finished by the end of February. We hope to achieve an average of 8-10 Mb of finished sequence per month within the next quarter.


4. A Tale of Three Loci

Lee Rowen, Anup Madan, Shizhen Qin, and Lee Hood

Multimegabase Sequencing Center, University of Washington, Seattle, WA 98195 and Institute for Systems Biology, Seattle, WA

leerowen@u.washington.edu

One of the fascinating results of large-scale sequencing is the revelation of vastly different types of genomic landscapes. Based on our accumulation of finished sequence over long contiguous stretches of the genome, we plan to present data analyses pertaining to three of these landscapes:

A) The human and mouse beta T cell receptor loci, which exemplify the rapid evolution of a multigene family. Here, genes are embedded in long repeats which are added to or deleted from the genome via unequal cross-over. In this regard, human and mouse have undergone somewhat different evolutionary paths.

B) The human and mouse major histocompatibility complex class III regions, which exemplify high gene density (> 15% coding sequence). Here, the orthologous relationship between human and mouse is highly conserved. This landscape raises interesting questions about gene regulation and why it is that genes with apparently unrelated functions might be so closely spaced.

C) The human neurexin III gene on chromosome 14, which exemplifies a large-intron gene that spans over a megabase. Neurexins are noteworthy for their large number of alternative splice forms and their differential expression in neurons, depending supposedly on the alternative splices.

The notion of genomic landscapes provides a framework for thinking beyond individual genes to the organization of the genome as a whole and how this organization bears on different types of function for individual genes and gene families.


5. Human Telomere Mapping and Sequencing

Robert K. Moyzis, Deborah L. Grady, Han-Chang Chi, and Harold C. Riethman

Department of Biological Chemistry, College of Medicine, University of California at Irvine, Irvine, CA 92697 and The Wistar Institute, Philadelphia, PA 19104

rmoyzis@uci.edu

The Human Genome Project has undergone a dramatic shift to the goal of obtaining a "working draft" sequence of human DNA by the end of this year. Such a framework sequence will catalyze gene discovery and functional analysis, and allow finished sequencing to be focused on regions of the highest biomedical priority. Over 80% of human DNA can be rapidly sequenced in the next few years by highly automated, high throughput sequencing centers. However, a significant fraction of the human genome will not be sequenced and/or assembled to completion by such approaches, as demonstrated by the recent sequence of human chromosome 22 (Dunham et al., Nature 402, 489-495, 1999). These are regions that 1) contain a high percentage of repetitive DNA sequences; 2) contain internal tandem duplications, including multigene families; and/or 3) are unstable in all current sequencing vectors. Producing quality DNA sequence of these regions, which faithfully represents genomic DNA, will be a continuing challenge.

Telomeres, the ends of the linear DNA molecules in human chromosomes, exhibit both high levels of repetitive DNA composition and cloning instability. In addition, extensive heterogeneity exists in these regions between various individuals. Half-YAC clones are uniquely suited as starting material for the sequence analysis of human telomeric regions. The inability to clone the extreme end of human chromosomes in bacterial vectors, including BACs, is well known. Due to the lack of appropriate restriction sites in the terminal (TTAGGG)n regions, as well as the necessary size selection involved in BAC library construction, the most terminal BAC clones will be 20-200 kb from the true DNA ends. By functional complementation in yeast, however, the true human telomeric end can be cloned. To date, 43 of the 46 unique human telomeres have been obtained as half-YACs.

Using RARE (RecA-Assisted Restriction Endonuclease) cleavage, 20 of these telomere half-YAC clones (representing the telomeres of human chromosomes 1p, 1q, 2p, 2q, 4p, 6q, 7p, 7q, 8p, 8q, 9p, 12q, 13q, 14q, 17p, 17q, 18p, 18q, 19p, and 21q) have now been confirmed to represent the true telomere. An additional 11 clones (representing the telomeres of 3q, 4q, 9q, 10p, 10q, 11p, 11q, 15q, 16q, 19q, and 20p) are currently being confirmed by RARE cleavage analysis. Given the new goals of the Human Genome Project, we have initiated framework sequencing on these clones, as well as the most terminal BACs identified from our chromosome 5 mapping project (Peterson et al., Genome Res 9, 1250-1267, 1999). These framework telomere sequences will provide a "cap" to the worldwide genome sequencing efforts. Cosmid and plasmid end sequence analysis, combined with extensive restriction enzyme mapping of the original YAC, results in highly ordered framework sequences. To date, framework sequences of 1q, 5p, 9q, 11q, 17p, and 18p have been completed. Following framework sequencing, finished sequencing will be conducted in select regions, with priority given to areas of high biological interest and/or relevance to the JGI, i.e., chromosomes 5, 16, and 19. An important QC/QA aspect of our sequence analysis is the extensive confirmation of the sequence against genomic DNA by PCR-resequencing. Numerous polymorphisms in these regions, including SNPs, VNTRs, and rearrangements, have been identified. Using pooled DNA PCR/sequencing, the population distribution of many of these polymorphisms can be determined rapidly.


6. Targeted cDNA Sequencing

Kimberly Prichard, Susi Wachocki, Mira Dimitrijevic-Bussod, Mark Mundt, Judith Cohn, David Bruce, Cliff Han, Norman Doggett, Christa Prange, and Michael R. Altherr

Los Alamos National Laboratory, Los Alamos, NM 87545

ALTHERR@LANL.GOV

The sequencing of cDNAs that are co-linear to genomic sequencing targets adds considerable value to the information generated from both efforts. Through the use of sequence analysis tools, comparisons of these distinct data sets reveal details of gene organization, splice sites and, because the sequences are derived from different sources, gene-based single nucleotide polymorphisms (SNPs). We intend to exploit gene predictions derived from genomic sequencing data to identify full-length cDNAs (from initiation codon to the polyadenylation site) for complete cDNA sequencing. We will use "overgo" probes to identify cDNAs corresponding to the gene predictions. We have chosen the strategy of cDNA insert concatenation as our sequencing method. To model this effort, we have begun by sequencing the complete inserts of cDNAs from the IMAGE collection previously mapped to chromosomes 5, 16, and 19. Approximately 1,800 clones were identified for this effort. We used the Unigene database to identify cDNAs for which the sequence data was incomplete and to identify the largest predicted member of the clone set. These clones are undergoing concatenation cDNA sequencing. Subsequent analyses are being done to identify those coding sequences for which co-linear genomic sequence exists, to characterize their gene structure and to identify SNPs.

This work was supported by Los Alamos National Laboratory LDRD funds and by US DOE OBER under contract W-7405-ENG-36.


7. Determining Quality of Oligonucleotides Synthesized in a High Throughput Process

Linda S. Thompson, David C. Bruce, Norman A. Doggett, Mark O. Mundt, and Larry L. Deaven

Bioscience Division and DOE Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, NM 87545

thompson@telomere.lanl.gov

LANL obtained from the University of Texas Southwestern Medical Center a liquid chemical dispensing robot (LCDR, or Mermade) to produce large numbers of oligos. The Mermade is capable of making two 96-well plates of oligos per day. The instrument protocol adheres to the standards of DNA synthesis using deprotection, evaporation, resuspension, and quantitation. We also utilize a Biomek 2000 to resuspend the samples, prepare standard dilutions for OD readings on a plate reader, and set up the samples for gel electrophoresis. Quality control consists of running a representative sample of the plate on BioRad 15% TBE-Urea Ready gels, 12 wells. If those samples are confirmed to be the required length without n-x species, i.e. oligos missing one or more bases, and if quantitation turns up no zero values, the plate can be given to the user. A quality control issue at this time is whether one needs to PAGE every oligo made on the Mermade. A recent random test of a plate of Mermade oligos vs. "factory made" oligos showed a success rate of 93% for both plates, indicating that 7% failed or were n-x. UTSW and LANL both report an average success rate of 95%, showing there is little or no difference between oligos made on the Mermade and those purchased through an oligo company.

LANL has also purchased a MALDI mass spectrometer. This instrument can be used to examine the DNA samples made by the Mermade and has been considered as a replacement for the current quality control step in DNA synthesis. The MALDI-MS could be used as a qualitative tool to screen all oligos for the desired length and the presence of n-x species, followed up with PAGE to identify the extent of the n-x in the suspect sequences.


8. Progress of Concatenation cDNA Sequencing at the BCM-Human Genome Sequencing Center

Richard Gibbs

Baylor College of Medicine, Houston, TX

agibbs@bcm.tmc.edu

Concatenation cDNA sequencing (CCS) has been used to complete sequencing of more than 750 clones from the human brain cDNA library (1NIB) and 30 clones representing childhood leukemia. Together these represent a total length of 1.2 megabases of assembled sequence. An additional 390 clones are currently in the sequence pipeline. Statistics from 14 completed projects continue to show that CCS is as efficient as sequencing of single large DNA fragments, with an average of 17 reads and one custom primer to complete each kb of sequence. Methodological improvements, including pooling of clones during growth and the use of Phred and Phrap for assembly, have further simplified CCS.

For 596 different clones from the 1NIB brain library, a similarity search was performed against the public cDNA database. Of these 58% were novel and the remaining 42% (251) had partial matches to known sequences or genes from human or other organisms. Of the latter, 159 clones displayed similarity matches to known proteins. A comparison against the Unigene cDNA dataset revealed that 61% of the cDNAs or submitted cDNAs represented novel contributions, 258 from among a set of 424.

Of 159 clones with partial protein matches, only 32 (20%) had a complete ORF (open reading frame). This indicates a low percentage of cDNA clones representing full length mRNAs from the libraries. To generate better libraries, a postdoctoral student made a trip to Japan where four cDNA libraries (one human infant brain, one mouse brain and two childhood leukemia) were constructed, using the CAP-trapping technology developed by Hayashizaki's group at the RIKEN Institute. Three libraries have been evaluated in detail, acquiring ESTs (expressed sequence tags) from 192 clones of each library. The data show good quality, with little contamination by vector (1.8-2.5%) or ribosomal DNAs (1.3-1.8%) and low redundancy (3.1-4.2%). About half of the ESTs lacked matches with Unigene cDNA sequences. Some 65.0-66.6% of clones possessed the first ATG codon of the encoded protein, indicating very high quality of the libraries. Thus the three analyzed cDNA libraries are suitable for large-scale and full-length sequencing. About 8,000 ESTs have been generated from these cDNAs and potentially novel clones are being selected for subsequent full-length sequencing.


9. Full-Length cDNA Sequencing Using Differential Extension with Nucleotide Subsets (DENS)

O. Chertkov1, C. Naranjo1, D. Zevin-Sonkin3, H. Hovhanissyan3, A. Ghochikyan3, L. Lvovsky3, A. Liberzon3, M.C. Raja2,3, and L.E. Ulanovsky1,2

1Los Alamos National Laboratory, Los Alamos, NM 87545; 2Argonne National Laboratory, Argonne, IL 60439; and 3Weizmann Institute of Science, Rehovot 76100, Israel

levy@anl.gov

Upon moving to LANL, we are setting up a full-length cDNA sequencing facility using our technology termed Differential Extension with Nucleotide Subsets (DENS), which is essentially primer walking without primer synthesis (Raja et al., 1997, NAR 25, pp. 800-805). DENS works by converting a short primer (selected from a pre-synthesized library of 8-mers with 2 degenerate bases each) into a long one on the template at the intended site only. DENS starts with a limited initial extension of the primer (at 20 °C) in the presence of only 2 out of the 4 possible dNTPs. The primer is extended by 5 bases or longer at the intended priming site, which is deliberately selected, as is the two-dNTP set, to maximize the extension length. The subsequent termination (sequencing) reaction at 60 °C then accepts the primer extended at the intended site, but not at alternative sites where the initial extension (if any) is generally much shorter.
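
The selection of a two-dNTP set that maximizes the initial extension can be sketched in Python as follows. This is a simplified illustration, not the dedicated primer-selection software mentioned in the abstract. It assumes that extension proceeds base by base and stops at the first position requiring a dNTP outside the chosen pair, and it takes as input the sequence of the strand to be synthesized immediately downstream of the primer's 3' end. The example sequences are hypothetical.

```python
from itertools import combinations

def extension_length(downstream, dntps):
    """Number of bases added before a base outside the dNTP subset is required.

    downstream is the sequence of the newly synthesized strand, 5'->3',
    starting right after the 3' end of the primer.
    """
    n = 0
    for base in downstream:
        if base not in dntps:
            break
        n += 1
    return n

def best_dntp_pair(downstream):
    """Return (pair, length) for the two-dNTP set giving the longest extension."""
    scored = [
        ("".join(pair), extension_length(downstream, pair))
        for pair in combinations("ACGT", 2)
    ]
    return max(scored, key=lambda item: item[1])

# Hypothetical sequences downstream of the intended and an alternative priming site.
intended_site = "AGGAGACTTC"
alternative_site = "AGCTTGACCA"

pair, length = best_dntp_pair(intended_site)
print(pair, length)                              # AG 6
print(extension_length(alternative_site, pair))  # 2
```

In this toy case the primer gains six bases at the intended site but only two at the alternative site, which is the kind of difference the higher-temperature termination reaction is described as exploiting.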

DENS primer walking seems to be tailor-made for full-length cDNA sequencing, as the absence of the primer synthesis step facilitates closed-loop automation of primer walking with the benefit of unattended operation. Earlier, in a pilot experiment we used DENS for sequencing both strands of four cDNA clones containing inserts of 1.9, 2.3, 3.8 and 4.9 kb. The success rate of the DENS sequencing reactions was 72%, yielding 27,864 base-calls. The median PHRED quality value was 40, corresponding to an error probability of approximately 10^-4. The plotted distribution showed that base-calls with PHRED values less than 20 occurred only 1% of the time. The 8-mer primers for DENS sequencing were selected using our dedicated software.


10. pZIP: A Versatile Vector for Sequencing by Nested Deletions

John J. Dunn

Biology Department, Brookhaven National Laboratory, Upton, NY 11973

jdunn@bnl.gov

We have constructed a low-copy, amplifiable vector that should be particularly useful for cloning and sequencing full-length cDNAs and highly repeated DNAs. This pZIP vector is maintained in Escherichia coli at low copy number by the F replicon and can be amplified 300 fold from an IPTG-inducible phage P1 replicon (repL). A relatively small size of 4.4 kbp was achieved by removing the 2.5-kb sop (stability of plasmid genes) region of F, but the plasmid is stably maintained by selective growth in the presence of kanamycin. A multiple cloning region (MCR) is flanked by sites that allow the biochemical generation of unidirectional nested deletions crossing the cloned DNA. The resulting deletion clones can be ordered by size, and an ordered, overlapping set of sequences can be obtained by priming within the flanking vector sequence to produce the complete sequence of both strands. The correspondence of plasmid lengths with those predicted by the assembled sequence aids in and verifies the correctness of the assembly. The low copy number should allow the cloning of DNAs that might not be stable in higher copy vectors, and amplification provides ample DNA for generating the nested deletions.

Unidirectional nested deletions are produced by cutting the DNA specifically near one end of the cloned DNA to generate an end that is sensitive to digestion by E. coli exonuclease III (ExoIII) and an end that is resistant, or by specifically nicking the appropriate strand. The ends or nick are oriented so that ExoIII will digest one strand across the cloned DNA. The resulting single-strand gaps are converted to double-strand gaps by treatment with S1 nuclease, and the ends are repaired and ligated with T4 DNA polymerase and ligase. ExoIII digests quite synchronously, and treating pooled samples from several different ExoIII digestion times, followed by electroporation, produces a population of clones with a distribution of different deletion end points. ExoIII-resistant ends are produced by intron-encoded endonucleases that cut at very rare sites to produce 4-base 3' overhangs. I-CeuI and I-SceI flank the MCR on one side and I-PspI on the other. ExoIII-sensitive ends can be generated by cutting with a restriction endonuclease at any of several different 8-base or other rare cleavage sites located between the sites cut by the intron-encoded nucleases and the cloning sites in the MCR. The fd origin of replication is also located on the I-PspI side of the MCR, oriented so that the specific nick by the gene 2 protein can be extended across the cloned DNA by ExoIII.

In collaboration with the Joint Genome Institute, we are evaluating the capability of the pZIP vector and the nested-deletion sequencing strategy to close gaps that have resisted closure by standard sequencing strategies in several different regions of human chromosome 19. We hope to demonstrate cloning in the low-copy pZIP of regions that are difficult to clone in standard sequencing vectors, and to determine accurate sequences of highly repeated regions by the nested-deletion strategy.


11. pUC-SV: A New Double Adaptor Plasmid System for Sequencing Complex Genomes

Jonathan L. Longmire, Nancy C. Brown, Larry L. Deaven, and Norman A. Doggett

Bioscience Division and DOE Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, NM 87545

longmire@telomere.lanl.gov

The sequencing of complex genomes requires shotgun cloning (or subcloning) of genomic DNA (or BACs) into vectors that carry smaller inserts and that can serve as templates in sequencing reactions. Such subcloning vectors typically include plasmids or M13. For sequencing purposes at Los Alamos, we have previously used blunt end ligation of inserts into the HincII site of pUC-18. In addition, we have also used the double adaptor approach described by Andersson et al. ([1996] Analytical Biochemistry 236: 107-113) to subclone BAC fragments into pBluescript. Both of these approaches have distinct advantages and disadvantages. For example, blunt end subcloning is technically straightforward but can result in clones with multiple inserts and nonrecombinants even when vector ends are dephosphorylated. Double adaptor subcloning into pBluescript can reduce the frequency of nonrecombinants and clones with multiple inserts. However, the sequencing priming sites are located at a greater distance from the cloning site in Bluescript compared to pUC. Consequently, some of the sequence data that is generated in Bluescript clones is vector readthrough that has to be trimmed prior to assembly of the data.

In order to improve upon existing systems, we have developed a new cloning vector that allows double adaptor shotgun subcloning of large target molecules into pUC-18. The vector pUC-SV was constructed by cloning a 2 kb human DNA insert fragment into the XbaI and PstI sites of pUC-18. The insert serves as a "stuffer" and enables one to easily monitor for complete digestion when the plasmid is being processed to produce subcloning-ready vector. Vector adaptors are ligated to the SacI and SphI ends of the linearized vector. This produces 12 nt overhangs that are complementary to adaptors that are ligated to the repaired ends of the fragmented target DNA. Nonrecombinants and clones with multiple inserts are eliminated because neither the vector adaptors nor the insert adaptors are self-complementary. The pUC-SV vector yields cloning (subcloning) efficiencies greater than 10^5 colonies per microgram target DNA with zero nonrecombinant background. Highly representative subclone libraries can be made using as little as 10 ng of processed target DNA. In addition, the amount of DNA sequence data that is produced using pUC-SV is increased compared to Bluescript due to placement of the priming sites. In the adapted pUC-SV, the primer sites are located 26 nt and 28 nt away from the insert ends (compared to 53 and 64 nt in adapted Bluescript). Thus, 63 nt less data is lost to trimming for every subclone that is processed. This increased data yield becomes very significant when several thousands of subclones are processed.
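
The trimming arithmetic above can be checked with a few lines of Python. The primer-to-insert distances are taken from the text; the sketch assumes both ends of every subclone are sequenced, and the subclone count of 5,000 is an arbitrary illustration of "several thousands".

```python
# Distance (nt) from each sequencing primer site to the insert end.
VECTOR_READTHROUGH_NT = {
    "pUC-SV": (26, 28),
    "Bluescript": (53, 64),
}

def trimmed_per_subclone(vector):
    """Vector readthrough trimmed from the two end reads of one subclone."""
    return sum(VECTOR_READTHROUGH_NT[vector])

def nt_saved(n_subclones):
    """Extra usable sequence gained with pUC-SV relative to adapted Bluescript."""
    per_subclone = trimmed_per_subclone("Bluescript") - trimmed_per_subclone("pUC-SV")
    return per_subclone * n_subclones

print(nt_saved(1))     # 63
print(nt_saved(5000))  # 315000
```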


12. A Fluorescent Sequencing Vector for High-Throughput Clone Selection by Cell Sorting

Juno Choe and Ger van den Engh

Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195-7730

engh@biotech.washington.edu

With the advent of automated high-throughput DNA sequencers, clone selection and preparation is rapidly becoming a major bottleneck in genome sequencing. We are developing an integrated clone-selection / clone-preparation process that has the potential to dramatically increase the speed of sample production. The process makes use of a sequencing vector that contains fluorescent proteins so that insert-containing bacteria can be selected with a cell sorter. The cell sorter will deposit individual bacteria onto a carrier ribbon that moves the samples in a linear procession along processing stations.

We have constructed two vectors that can be used in this process. Most recently we developed a vector containing a tandem of Blue and Green Fluorescent Proteins separated by a cloning site. This vector indicates integration of a cloned insert through a shift in the ratio of the two fluorescent proteins. In the native pBGFP, a fusion protein composed of Blue Fluorescent Protein (BFP2, Clontech) and Green Fluorescent Protein (GFPmut3.1, Clontech) genes is expressed at high levels. The BFP/GFP fusion protein can be excited by UV light in the 350-400 nm range. This causes excitation of the BFP, followed by GFP due to fluorescence resonance energy transfer (FRET). When an insert is successfully ligated into the cloning site in the linker region between BFP and GFP, there is a loss of function of the GFP portion of the protein. In this case, increased BFP fluorescence will be observed with loss of observable green fluorescence. We can quantitate the ratios of these two fluorescent proteins very accurately by flow cytometry. This provides the ability to rapidly sort individual bacteria with high BFP and low GFP content at a rate of up to 10,000 per hour. Amplification of cloned inserts can be achieved by growing bacteria in culture medium and/or PCR amplification.

Under actual test conditions, ligation of fragments between 2.3 and 5.6 kb resulted in clear separation of two populations of E. coli grown in liquid media: BFP/GFP expressing bacteria containing native pBGFP plasmid and BFP expressing bacteria containing pBGFP with cloned insert. Furthermore, bacteria were observed under UV fluorescence microscopy. Two clearly distinguishable phenotypes appearing either green or blue were observed.
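
The ratio-based selection described above can be pictured as a simple sort gate. The Python sketch below is only an illustration: the intensity values and the `min_blue_to_green` threshold are hypothetical and do not come from the abstract, and a real cell sorter applies its gates in instrument hardware or software.

```python
def has_insert(blue, green, min_blue_to_green=2.0):
    """Classify one event as insert-containing if blue dominates green.

    blue and green are fluorescence intensities in arbitrary units.
    The threshold is a hypothetical value chosen for illustration.
    """
    if green <= 0:
        # No green signal at all: accept only if some blue signal is present.
        return blue > 0
    return blue / green >= min_blue_to_green

# Hypothetical (blue, green) measurements for four bacteria.
events = [(900.0, 100.0), (150.0, 800.0), (700.0, 0.0), (300.0, 400.0)]
sorted_events = [event for event in events if has_insert(*event)]
print(sorted_events)
```

Gating on a ratio rather than on blue intensity alone makes the decision less sensitive to cell-to-cell differences in overall expression level.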


13. An Isothermal Amplification System for the Production of DNA Templates for DNA Sequencing

Stanley Tabor and Charles Richardson

Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115

stabor@hms.harvard.edu

We are developing DNA polymerases for use in DNA sequencing and amplification applications. We will describe a very efficient isothermal amplification system which provides an attractive alternative to conventional methods of generating plasmid and BAC templates for DNA sequencing. Amplification of 1 pg to 1 µg of template DNA results in the synthesis of DNA to a final concentration of 0.5 µg/µl in 15 min at 37 °C (corresponding to an amplification of up to several million-fold). Amplification is nonspecific; all sequences present are amplified equally. The reaction requires no exogenous primers. This system is based on the replication apparatus of bacteriophage T7; the principal enzymes required are two forms of T7 DNA polymerase, the T7 helicase/primase, and single-stranded DNA binding protein. The products are linear double-stranded DNA fragments several thousand base pairs in length. When the products are used as templates for capillary-based fluorescent sequencing, the fluorescent signal produced is several fold higher than that from a comparable amount of supercoiled plasmid DNA, and results in 20% more base calls that have a quality score greater than Phred 20. The attractive features of this system for large sequencing projects are its simplicity and the constant, reproducibly high yield of DNA that can be used directly in DNA sequencing reactions without further purification. This nonspecific amplification reaction could also be of use in immortalizing small, precious samples of genomic DNA required for genotype analysis. We are also using this technology to amplify single DNA molecules embedded in agarose. This enables one to construct and amplify DNA libraries in vitro without the need to transform bacterial cells. Finally, we will present an update of our work modifying DNA polymerases to increase their processivity and their use of nucleotide analogs for use in DNA sequencing.


14. Universal Energy-Transfer Cassettes for Facile Construction of Energy-Transfer Fluorescent Labels

Jin Xie1, Lorenzo Berti2, Richard A. Mathies2, and Alexander N. Glazer1

1Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720 and 2Department of Chemistry, University of California, Berkeley, CA 94720

glazer@uclink4.berkeley.edu

Energy-Transfer fluorescent labels are advantageous in DNA sequencing applications because of their improved emission signal strength and spectral purity1. To facilitate the production of ET labels it is useful to develop cassette labeling strategies. In one approach an ET cassette was synthesized as a part of the oligonucleotide primer using sugar-phosphate spacers2. Here, we have designed and synthesized a universal cassette DMT-O-(CH2)6-S-S- (CH2)6T(Rox) SSSSSS-Fam (where DMT is dimethoxytrityl and S is 1',2'-dideoxyribose phosphate) for attachment to appropriate terminator or primer derivatives which can be readily prepared by automated methods. With 488 nm excitation, the emission of this cassette is 10-fold higher than that of free Rox. The thiol group on the universal ET cassette is exposed by reduction, while an amino-derivative of the terminator (or primer) is substituted with a bifunctional NHS-ester-maleimide reagent. The conjugation of the universal ET cassette with maleimide-derivatized terminator or primer is almost quantitative in 2 hours at room temperature. When used in sequencing, cassette-labelled primers gave excellent results.

  • Xie, J., Hung, S.-C., Glazer, A. N. and Mathies, R. A. Energy Transfer Fluorescent Labels for DNA Sequencing and Analysis, in Topics in Fluorescence Spectroscopy, Volume 7: DNA Technology, in press (1999).
  • Ju, J., Glazer, A. N. and Mathies, R. A. Cassette Labeling for Facile Construction of Energy Transfer Fluorescent Primers, Nucleic Acids Research 24, 1144-1148 (1996).

15. Fimer Chemistry for Sequencing off BAC and Genomic DNA Templates

S. Kozyavkin, N. Polouchine, A. Malykh, O. Malykh, and A. Slesarev

Fidelity Systems, Inc., 7961 Cessna Avenue, Gaithersburg, MD 20879-4117

http://www.fidelitysystems.com
fsi1@fidelitysystems.com

Robust sequencing off BAC and genomic templates presents a new challenge in technology development. The problems associated with the use of standard oligonucleotides as primers in genomic cycle sequencing protocols include insufficient specificity of primer annealing, non-specific amplification, low sensitivity and premature truncation at secondary structures in template DNA.

To overcome these problems we have developed a new method to generate combinatorial libraries of chemically modified oligonucleotides (fimers). The method is based on the use of our proprietary monomers containing MOX or SUC reactive moieties. We assessed the effects of modifications on DNA melting, electrophoretic mobility and DNA-protein interaction for individual oligonucleotides and their small libraries. We have developed a rapid procedure for modification, deblocking and purification of fimers in 96-well plate format. Different design strategies for fimers have been tested with ThermoFidelase-2A, -2B and -2C, deaza-dGTP and dGTP in various thermal cycling protocols. We found that fimer design eliminates many restrictions on choosing primer sequence. Our results demonstrate the feasibility of suppressing non-specific PCR amplification and primer-dimer formation after 100-400 cycles, the synergy of chemical and enzymatic tools in sequencing through strong stops and long simple repeats, and the ability to sequence directly off sub-microgram quantities of bacterial genomic templates. The implementation of fimers in high-throughput projects will be presented.

We have achieved contiguity and high total and local quality of base calls starting from 2x - 5x shotgun coverage in draft human BAC projects. The major conclusion is that the workflow for finishing low-coverage projects differs significantly from that for full shotgun projects and has become manageable due to the increased power of sequencing chemistry. For BAC-end sequencing projects we have developed long fimers and ThermoFidelase-2E to accelerate the kinetics of primer annealing to minute quantities of template DNA, together with a 400x(1 min) sequencing protocol. We have increased detection sensitivity to 10 ng BAC and obtained high quality reads from 30 ng BAC. The new protocol is compatible with the yield of BAC DNA from 1-ml cultures in 96-well plate format.

Enhanced reaction chemistry has allowed us to overcome major obstacles in bacterial genomic sequencing associated with high fluorescent background and low signal. We obtained high quality reads from as little as 100-300 ng genomic template. Applications of direct genomic sequencing to the discovery of novel genes and characterization of bacterial populations will be presented.

Supported in part by DOE and NIH (DE-FG02-98ER82557 and 2R44GM55485-02).


16. Chemical Conversion of Boronated PCR Products into Bidirectional Sequencing Fragments

Barbara Ramsay Shaw, Kenneth W. Porter, Ahmad Hasan, Kaizhang He, and Jack Summers

Department of Chemistry, Duke University, Durham, NC 27708-0346

brshaw@chem.duke.edu

We developed an alternate sequencing chemistry which avoids cycle sequencing, allows direct bidirectional genomic sequencing, and permits direct loading of PCR products onto the separating system. The method employs template-directed enzymatic, random incorporation of small amounts of boron-modified nucleotides (i.e., 2'-deoxynucleoside 5'-alpha-[P-borano]-triphosphates) during PCR amplification. The position of the modified nucleotide in each PCR product can be revealed in two ways, either enzymatically (as previously described1) or chemically. Both approaches take advantage of differences in reactivity of the normal and boronated nucleotidic linkages to generate PCR sequencing fragments that terminate at the site of incorporation of the modified nucleotide. By employing labeled PCR primers, the original PCR products can be converted directly into bidirectional sequencing fragments.

In the enzymatic approach, the modification of a phosphate into a boranophosphate internucleotidic linkage prolongs its lifetime toward degradation by nucleases. The sequential hydrolysis by 3'-5' exonuclease III is thereby blocked by a boranophosphate, resulting in fragments that terminate in a nucleoside boranophosphate. However, normal and boranophosphate linkages with a 3'-cytosine are more susceptible to exonuclease degradation than those with other purines and pyrimidines, which reduces band uniformity. A series of base-modified cytosine derivatives was therefore synthesized and tested for nuclease resistance. Substitution at the C-5 position of cytosine by alkyl groups (ethyl and methyl) markedly enhances the resistance of cytidine boranophosphate towards exonuclease III (i.e., 5-ethyl-dC > 5-methyl-dC > dC 5-bromo-dC > 5-iodo-dC). The best analog, 5-ethyl-α-borano-dCTP, not only showed an increased resistance to exonuclease III compared to the α-borano-dCTP used previously in our method, but did so without affecting incorporation and resulted in more even banding patterns2. Analysis with Basefinder software (M. Giddings) takes into account any mobility changes, permitting increased consistency and accuracy. The enzymatic approach may find use in applications where high resolution of longer fragments requires stronger signals at longer read lengths, because the distribution of fragments produced by nuclease digestion is skewed to long fragments.

In the chemical approach, we have examined several methods for generating sequencing fragments, as an alternative to exonuclease chew-back. First, we identified reagents that selectively cleave the backbone of the PCR product at deoxy boranophosphate linkages, while leaving the normal phosphodiester linkages intact. Second, we synthesized a new boranophosphate RNA dimer analogue3 and found conditions under which the ribo boranophosphate linkage is considerably more susceptible to cleavage than a deoxy or normal phosphodiester linkage. We then synthesized diastereomers of ribonucleoside 5'-(α-P-borano)triphosphates4 and showed that one isomer can be incorporated readily into RNA with T7 RNA polymerases, yielding boronated transcripts that are thousands of nucleotides long. We are now examining DNA polymerases that can incorporate the boronated RNA triphosphates into DNA. Also under investigation are agents that can result in colorimetric detection of boranophosphate. Direct sequencing of PCR products by cleavage of boranophosphates should simplify mono- and bidirectional sequencing and provide a simple, direct, and complementary method to cycle sequencing.

  • K.W. Porter, J. D. Briley, and B. R. Shaw, "One-Step PCR Sequencing with Boronated Nucleotides", Nucleic Acids Research 25, 1611-1617 (1997).
  • K. He, A. Hasan, B. Krzyzanowska and B. R. Shaw, "Synthesis and Separation of Diastereomers of Ribonucleoside 5'-(α-P-Borano)triphosphates", Journal of Organic Chemistry 63(17), 5769-5773 (1998).
  • K. He, D. S. Sergueev, Z. A. Sergueeva and B. R. Shaw, "Synthesis of Diuridine 3',5'-Boranophosphate: H-Phosphonate Approach", Tetrahedron Letters 40, 4601-4604 (1999).
  • K. He, K. W. Porter, A. Hasan, J. D. Briley and B. R. Shaw, "Synthesis of 5-Substituted 2'-Deoxycytidine 5'-(α-P-Borano)triphosphates, their Incorporation into DNA and Effects on Exonuclease", Nucleic Acids Research 27, 1788-1794 (1999).


17. Human and Mouse BAC Libraries for Genome Sequencing, Mapping, and Functional Analysis

Kazutoyo Osoegawa, Chung Li Shu, Aaron Mammoser, Joe Catanese, and Pieter J. De Jong

Department of Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263 and Children's Hospital Oakland Research Institute, Oakland, CA 94609

pieter@dejong.med.buffalo.edu

Our earlier 25-fold redundant human BAC library (RPCI-11; EcoRI fragments) has recently been expanded with an additional 7-fold genome redundancy from the same donor using MboI-digested DNA. Insert sizes averaged 173 and 195 kb for the early and late parts of the library, respectively. A 1.5-Mb BAC contig was extensively analyzed to test the human BAC library for clonal integrity and fidelity. The results indicate the absence of chimeric clones and 19 rearranged clones in the contig of 169 BACs. Three murine libraries (each 11-fold genome redundant) have previously been constructed using various digest strategies, two strains (129S6/SvEvTac and C57BL/6J) and either PAC or BAC vectors. The BAC library for the C57BL/6J strain (designated RPCI-23) is most significant in view of the large EcoRI inserts (average 200 kb) and because it was selected as a preferential source for murine genome sequencing. To obtain additional representation for the C57BL/6J strain, we created an additional C57BL/6J BAC library (RPCI-24) from male DNA partially digested with MboI (average insert size about 155 kb). To permit the cloning of sheared DNA, a new vector, pTARBAC6, was constructed with two BstXI sites for cloning. The BstXI sites have non-complementary ends to avoid vector self-ligation. Blunt-ended DNA fragments are ligated to a BstXI linker to create ends complementary with the vector. In pilot experiments, fragments were cloned from HincII partially-digested DNA and DNaseI partially-digested DNA, resulting in average insert sizes around 100 and 60 kb, respectively. To maximize randomness of the BAC cloning process, we are optimizing the cloning of blunt-ended fragments obtained by shearing. Information on current libraries can be found at http://bacpac.med.buffalo.edu.

Supported by grants from the U.S. DOE (#DE-FG03-94ER61883) and NIH (#1R01RG01165).


18. Human and Mouse BAC Ends

Shaying Zhao, Mark D. Adams, Joel Malek, Lily Fu, Bola Akinretoye, Sofiya Shatsman, Maureen Levins, Stephany McGann, Keita Geer, Getahun Tsegaye, Margaret Krol, Peter Choi, Tamara Feldblyum, William Nierman, and Claire Fraser

The Institute for Genomic Research, Rockville, MD 20850

szhao@tigr.org

End sequences from Bacterial Artificial Chromosomes (BACs) provide highly specific sequence markers in large-scale sequencing projects. To date, we have generated >300,000 BAC end sequences (BESs) from >186,000 human BAC clones with the following properties. 1) Over 60% of the clones have BESs from both ends, representing 5X coverage of the human genome by the paired-end clones. 2) The average read length is ~460 bp, providing a total of 141 Mb covering ~4.7% of the genome. 3) The average phred Q20 length is ~400 bp, giving an identity of >99% to the human finished sequences. 4) Over 90% of the BESs faithfully represent the original clones and over 85% of the paired-end clones have both ends tracked correctly. This high quality of data gives BAC end users high confidence in 1) retrieving the right clones from the BAC libraries based on the BAC end sequence matches; and 2) building a minimum tiling path of sequence-ready clones across the genome and building genome assembly scaffolds. Our sequence analyses indicate that BESs from human BAC libraries developed at The California Institute of Technology (CalTech) and Roswell Park Cancer Institute (RPCI) have similar properties. The analyses have highlighted differences in insert size for different segments of the CalTech library. Problems with the fidelity of tracking of sequence data back to physical clones have been observed in some subsets of the overall BES dataset. The annotation results of BESs for the contents of available genomic sequences, sequence tagged sites (STSs), expressed sequence tags (ESTs), protein encoding regions and repeats indicate that this resource will be valuable in many areas of genome research. (human BAC ends URL)
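
As a rough check on the coverage figure in property 2, the arithmetic can be sketched as below. The genome size of 3,000 Mb is an assumption that is not stated in the abstract, and the function name is purely illustrative.

```python
def bes_sequence_fraction(total_bes_mb: float, genome_mb: float) -> float:
    """Return the fraction of the genome represented by BAC end sequence."""
    return total_bes_mb / genome_mb


# 141 Mb of end sequence is the total reported in the abstract.
# 3,000 Mb is an ASSUMED haploid human genome size, not a value from the abstract.
fraction = bes_sequence_fraction(141.0, 3000.0)
print(f"{fraction:.1%}")
```

Under that assumed genome size the result agrees with the ~4.7% quoted in the text.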

We have been funded to end sequence the mouse BACs from the RPCI-23 library within the next year. To date, we have over 25,000 mouse BESs with a quality similar to our human ends. In addition, all end sequencing is being conducted on the ABI 3700 sequencers to eliminate the lane tracking errors experienced on the ABI 377 sequencers. We expect that our mouse ends will have 1) an average read length of 500 bp; 2) an average of 400 phred Q20 bases; 3) over 90% of the clones having paired ends; and 4) a clone tracking accuracy of 99%. The mouse resource will have an even higher quality than the current human ends.


19. Library Strategy for Genome Sequencing Projects

William C. Nierman

The Institute for Genomic Research, Rockville, MD 20850

wnierman@tigr.org

Microbial genome sequencing projects at TIGR have been conducted using two-ended clone sequencing primarily from a small 1.5 to 2 kb insert size plasmid library supplemented with sequence reads from both ends of several hundred 15 - 20 kb lambda clones. We have recently implemented a new shotgun library strategy which incorporates sequence reads from both a small 2 kb insert size library and a larger 10 kb insert size plasmid library. This strategy has resulted in a dramatic decline in the number of gaps at the end of the random phase of sequencing for which there is no clone coverage, greatly simplifying the process of closure of the genome. Data from several TIGR sequencing projects will be provided to document this conclusion.

BAC-based projects for organisms such as the human and mouse are undertaken to minimize the assembly and closure problems of large repeat-rich genomes. The BAC libraries supporting these projects were constructed using partial restriction digests to fragment the genomic DNA prior to ligation to the BAC vector. Due to the non-random distribution of restriction sites for any enzyme in genomic DNA, libraries thus constructed always have over-representation and under-representation of some regions and no coverage of some small fraction of the genome. These regions of no coverage are revealed as gaps in the BAC contig maps produced by analysis of restriction fingerprints of the BAC clones.

In order to develop a resource for providing clone coverage across these gaps in the BAC contig maps for the human and mouse sequencing efforts, we are constructing BAC libraries from randomly sheared genomic DNA. The targeted insert sizes are 50 and 100 kb. The human libraries are being constructed with donor DNA collected in strict accordance with appropriate informed consent at Celera Genomics (Hamilton Smith). The mouse libraries are being constructed from male C57BL/6J DNA provided by The Jackson Laboratory with appropriate animal committee review at both The Jackson Laboratory and at TIGR. All libraries are being constructed with very narrow insert size cuts to facilitate easy detection of consequential deletions by clone insert size determinations.


146. The Need for a Simple Sequence Annotation Standard

Lincoln Stein1, Sean Eddy2, Robin Dowell3

1Cold Spring Harbor Laboratory; 2Department of Genetics, Washington University; 3Biomedical Engineering, Washington University

lstein@cshl.org

The pace of human genomic sequencing has outstripped the ability of sequencing centers to annotate and understand the sequence prior to submitting it to the archival databases. Multiple third-party groups have stepped into the breach and are currently annotating the human sequence with a combination of computational and experimental methods. Their analytic tools, data models, and visualization methods are diverse, and it is self-evident that this diversity enhances, rather than diminishes, the value of their work.

The main risk of third-party annotation is that it may fracture knowledge about the genome. Instead of having a convenient one-stop source for genomic annotation, such as Entrez, researchers may have to check multiple Web sites for information about a particular region of interest, download the data in several different formats, and perform a manual integration in order to get the whole picture. Clearly, this is undesirable.

There are several possible approaches to this problem. One is for each of the annotation centers to submit their annotations to a centralized database, such as GenBank. However, this option raises a number of political and technical problems, not the least of which is the long-held tradition of GenBank and its sister databases of allowing only the sequence submitter to modify or comment on a GenBank entry. Another option would be a system which uses Web links to point from the GenBank entry to one or more annotation Web sites. Such a system is available now in the form of the NCBI LinkOut service. However, while this makes it easier for researchers to find third-party annotation sites, it does not solve the problem of data integration.

The solution that we advocate allows sequence annotation to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. A single server is designated the "reference server." It serves essential structural information about the genome: the physical map which relates one entry to another (where an "entry" is an arbitrary segment of the sequence, such as a sequenced BAC or a contig), the DNA sequence for each entry, and the standard authorship information. Multiple sites then act as third-party "annotation servers." Using a web browser-like application, researchers can interrogate one or more annotation servers to retrieve features in a region of interest. The servers return the results using a standard data format, allowing the sequence browser to integrate the annotations and display them in graphical or tabular form. No attempt is made to automatically resolve contradictions between different third-party annotations. Indeed, it is the ability to facilitate comparison among different centers' annotations that distinguishes this proposal. We currently have a working prototype of this system based on ACeDB servers and CGI scripts, and are now generalizing this architecture to support other client and server combinations.
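
The client-side integration described above can be sketched roughly as follows. This is a minimal illustration in Python, not the ACeDB/CGI prototype: the Feature record and the make_server and integrate functions are hypothetical names, and in-memory lists stand in for remote annotation servers.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Feature:
    source: str  # annotation server that reported the feature
    entry: str   # reference-server entry, e.g. a sequenced BAC or contig
    kind: str    # feature type, e.g. "exon"
    start: int
    end: int


# Hypothetical interface: a server answers (entry, start, end) region queries.
AnnotationServer = Callable[[str, int, int], List[Feature]]


def make_server(name: str,
                records: Sequence[Tuple[str, str, int, int]]) -> AnnotationServer:
    """Build an in-memory stand-in for a remote annotation server."""
    def query(entry: str, start: int, end: int) -> List[Feature]:
        # Return every feature on the entry that overlaps the queried region.
        return [Feature(name, e, kind, s, t)
                for (e, kind, s, t) in records
                if e == entry and s <= end and t >= start]
    return query


def integrate(servers: Sequence[AnnotationServer],
              entry: str, start: int, end: int) -> List[Feature]:
    """Merge results from all servers; conflicting calls are kept side by side."""
    merged: List[Feature] = []
    for server in servers:
        merged.extend(server(entry, start, end))
    return sorted(merged, key=lambda f: (f.start, f.end, f.source))


center_a = make_server("center_a", [("BAC1", "exon", 100, 250),
                                    ("BAC1", "exon", 900, 1100)])
center_b = make_server("center_b", [("BAC1", "exon", 120, 250),
                                    ("BAC2", "exon", 10, 50)])

features = integrate([center_a, center_b], "BAC1", 1, 500)
for f in features:
    print(f.source, f.kind, f.start, f.end)
```

The two centers disagree on where the first exon starts, and the merged list keeps both calls so that a viewer can display them together, in keeping with the decision not to resolve contradictions automatically.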

The key development that is necessary for a successful distributed annotation system is the adoption of a standard format to describe sequence features. While almost any one of the existing standards could be adapted for this purpose, certain characteristics are very desirable.

1. Handling of multiple levels of relative coordinates
In the ideal world, the genome would be finished to the base pair, and we would be able to unambiguously refer to an annotation based on its position from the top of the chromosome. This will not happen for a very long time. For the foreseeable future, the genome will consist of multiple segments of high confidence, related to one another by mapping information of lower confidence. In order to deal with annotations in this dynamic and changeable environment, the format must be able to deal with relative coordinates in which annotations are related to arbitrary hierarchical landmarks. For example, a "clone end" annotation may be related to the start of a contig, an "mRNA" annotation may be related to the clone end, and an "exon" annotation may be related to the start of the mRNA.

2. Easily generated and parsed
Experience has shown that it is difficult to convince groups to adopt complex and sophisticated data formats. For this reason, a "lowest common denominator" format is desirable, even if it sacrifices some of the expressiveness of the more sophisticated formats. A human-readable format, such as tab-delimited tables, XML, or even ".ace" format is also desirable.

3. Extensibility
Any format must be extensible to allow for new types of annotations. Specifically, we feel that it is desirable to create a category of annotation that has to do with the availability of experimental data concerning the region of interest. For example, the format should allow a researcher to note the presence of RNAi results overlapping the region of interest. The format should also provide a mechanism for pointing the researcher to a location where he or she can get more information about a selected annotation. In the ACeDB-based system, each annotation contains a pointer into an ACeDB entry somewhere on the Internet. This entry is in turn linked to related biological and experimental information.

4. Functional groupings of annotations
To further enhance the extensibility of the format, it is desirable to group specific annotations into functional categories rather than maintaining an unsorted "laundry list" of feature types. For example, splice sites, polyA signals, introns and exons are all annotations having to do with a generic "mRNA" category, while clone ends, primer pairs, and hybridization probes are "structural" features. Grouping annotations into conceptual categories makes the data more manageable, and facilitates formulating biologically relevant queries on the annotation servers.
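
Several of the characteristics listed above (relative coordinates, a simple tab-delimited layout, and functional categories) can be illustrated together in a short Python sketch. The six-column layout (name, type, category, parent landmark, start, end) and all names below are hypothetical and are not GFF, GAME, or any existing standard; coordinates are assumed to be 1-based and relative to the start of the parent landmark.

```python
import csv
import io
from collections import defaultdict

# Hypothetical records: name, type, category, parent landmark, start, end.
RECORDS = (
    "contig1\tcontig\tstructural\t\t1\t200000\n"
    "cloneA.end\tclone_end\tstructural\tcontig1\t5001\t5500\n"
    "mrna1\tmRNA\tmRNA\tcloneA.end\t101\t3100\n"
    "exon1\texon\tmRNA\tmrna1\t1\t250\n"
)


def parse(text):
    """Parse the tab-delimited records into a dictionary keyed by feature name."""
    features = {}
    for name, kind, category, parent, start, end in csv.reader(
            io.StringIO(text), delimiter="\t"):
        features[name] = {"type": kind, "category": category,
                          "parent": parent or None,
                          "start": int(start), "end": int(end)}
    return features


def absolute_start(features, name):
    """Walk up the landmark hierarchy, converting 1-based relative offsets."""
    feature = features[name]
    if feature["parent"] is None:
        return feature["start"]
    return absolute_start(features, feature["parent"]) + feature["start"] - 1


def by_category(features):
    """Group feature names into their functional categories."""
    groups = defaultdict(list)
    for name, feature in features.items():
        groups[feature["category"]].append(name)
    return dict(groups)


features = parse(RECORDS)
print(absolute_start(features, "exon1"))
print(by_category(features))
```

Because each record stores only an offset from its parent landmark, repositioning the clone end on the contig moves the mRNA and its exon without any change to their own records.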

Even if the community does not undertake a formal third-party annotation system for the human genome, the value of a standardized format for the interchange of annotations is immeasurable. Several suitable formats are under development. One is GAME, an XML-based format developed by Suzanna Lewis of the Berkeley Drosophila Genome Project and members of the BioXML interest group. Another is GFF, a tab-delimited format developed by Richard Durbin, Tim Hubbard, and others at the Sanger Centre. Both formats are supported by a handful of Java-based sequence annotation viewers. For our part, we are using an XMLized version of GFF for our ACeDB-based distributed annotation system.

We urge this group to consider the need for a standard format for genome sequence annotation, and to consider architectures that will allow genomic annotations to be developed and interchanged across database and institutional boundaries.


156. Time-Resolved Sequence Analysis on High Density Fiberoptic DNA Probe Arrays

David R. Walt and Jane Ferguson

Tufts University, Department of Chemistry, 62 Talbot Avenue, Medford, MA 02155; Mark Chee, Illumina Inc.

david.walt@tufts.edu

An optical imaging fiber consists of a coherent bundle of individual fibers. Each individual fiber has a light conducting inner core that can be chemically etched at a different rate from its surrounding cladding. By treating the polished end of an optical fiber with acid, an array of microwells is generated. Using a simple, one step procedure, the individual wells in the etched fiber can be filled with microspheres slightly smaller in diameter than the well. To prepare a genosensor array, oligonucleotide probe sequences are attached to individual sets of microspheres. The resulting bead populations containing the sequences of interest are then pooled and randomly distributed into the etched fiber array. The position of each bead in the array is then ascertained in several ways. In the first method, each bead can be optically encoded with a unique combination of dyes to allow identification of the probe sequence contained on that bead. Alternatively, we can decode the probe by hybridizing the array to pools of labeled decoding sequences. In most array-based approaches to nucleic acid sequence analysis, each assay at each probe site aims to obtain information from a single sequence. Here, our aim is to develop more efficient methods of analyzing sequences for variation by using each site in the array to analyze multiple positions in a target sequence. We are investigating the simultaneous analysis of multiple positions in a target sequence. This approach has the potential to provide a new, highly parallel method of comparing a sample sequence to a previously determined reference sequence.


The online presentation of this publication is a special feature of the Human Genome Project Information Web site.