Informatics

abstracts from the
DOE Human Genome Program Contractor-Grantee Workshop VI
November 9-13, 1997, Santa Fe, NM


Informatics at the Center for Applied Genomics

Donn Davy, Tom Cloutier, Colin Collins, Manfred Zorn, Joe Gray
University of California at San Francisco/Lawrence Berkeley National Laboratory
DFDavy@lbl.gov

The Resource for Molecular Cytogenetics/Center for Applied Genomics, in collaboration with the University of California at San Francisco Cancer Center, has molecularly characterized and sequenced an interval of chromosome 20 band 13.2 found amplified in 10% of primary breast tumors and correlated with poor prognosis in node-negative patients. The goal is to determine the complete genomic organization of the one-megabase interval spanning this amplicon. In a follow-on project, the group will select DNA clones from all known oncogenes for use in a DNA chip, to be used as a diagnostic tool. Collaborators conduct their work in San Francisco, Berkeley, Vancouver, Toronto and Finland: sites so geographically distant as to make travel between them inconvenient and inefficient. Data concerning physical map assembly, genomic sequencing, Northern hybridization, tumor mapping, public database searches and extensive sequence annotation efforts, including graphical maps and detailed data on clones, loci, exons, sequences and cDNAs, must be made equally available to all collaborators independent of location. The computing technology brought to bear on this problem combines a standard World-Wide Web server engine with a tailored ACEDB database and a number of supporting CGI scripts connecting Unix file system documents with data retrieved from the database and with related data both local and at remote sites on the WWW. The ACEDB-style database (CAGdb) is accessible in three ways: as a normal X-window application, as an aceclient/aceserver application, and via a set of interface scripts from the CAG Web page. Additional WWW services include the display of Genotator (an LBNL-developed sequence annotation visualization tool) images, XGRAIL (an ORNL gene-prediction tool) images, a set of HTML pages of annotated sequence, a second set of HTML pages providing first-pass results of dbEST and non-redundant nucleotide public database searches, the Northern hybridization images, two maps graphically representing the region of interest, and connections to related databases. For tracking progress on the oncogene project, a second web server and database were deployed using off-the-shelf Java-language Rapid Application Development (RAD) tools. It is used to report progress in selecting targets, developing primers, and extracting clones, and to keep track of collaborators' suggestions for further oncogene chip targets.


The Genome Channel and Genome Annotation Consortium

Ed Uberbacher, Richard Mural, Manesh Shah, Ying Xu, Sheryl Martin, Sergey Petrov, Jay Snoddy, Morey Parang - Oak Ridge National Laboratory
Manfred Zorn, Sylvia Spengler, Donn Davy - Lawrence Berkeley National Laboratory
Terry Gaasterland - Argonne National Laboratory
Peter Schad - The National Center for Genome Resources
Stan Letovsky, Bob Cottingham - The Genome Database
David Haussler - University of California Santa Cruz
Pavel Pevzner - University of Southern California
Chris Overton - University of Pennsylvania
ube@ornl.gov
http://compbio.ornl.gov

Human and model organism sequencing projects will soon be producing data at a rate which will require new methods and infrastructure for users to be able to effectively view and understand the data. A multi-institutional project was recently funded to provide large-scale analytical processing capabilities and we will present the results of several pilot efforts related to this project. The goals of the project are as follows:

  • Provide an environment where annotation can be constructed based on multiple interoperable analysis tools and significant available computing power.
  • Provide an environment where characterization of long genomic sequence regions can be facilitated and analysis can be maintained and updated over time.
  • Provide an interactive graphical environment where predictions, features and evidence from many tools can be combined by users into high-quality annotation and visualized by the community.
  • Provide high-throughput automated analysis methods which can be configured by genome centers for their use in constructing annotation and facilitating data submission.
  • Provide high-quality annotation to large genomic sequence regions which would otherwise go unannotated.
  • Provide the community with the best sequence level view of genomes possible.
The components of this system are a number of services, a broker that oversees the distribution and management of tasks, and a data warehouse, with services implemented using distributed object technology. Multiple gene prediction is accomplished using several gene-finding tools, including the GRAIL-EST system, and gene annotation from databases such as GenBank is also captured. The data warehouse supporting the Genome Channel view is updated daily by automated Internet agents and event triggers which facilitate analysis procedures. Real-time operation of the Genome Channel browser will be demonstrated. A more detailed description of the basic components follows:

The Genome Channel
The Genome Channel provides a graphical user interface to comprehensively browse and query assembled sequence placed in the public domain by the Human Genome Project and sequencing of model organisms. It is a JAVA interface tool which relies on a number of underlying data resources, analysis tools and data retrieval agents to provide an up-to-date view of genomic sequences, as well as computational and experimental annotation. Navigation from a whole chromosome view to contigs provided by sequencing centers allows one to zoom in on regions of interest to see information about clones, markers, ESTs, computationally and experimentally determined genes, the sequence and sequence source information, related homology and functional information, and hyperlinks to numerous underlying primary data resources.

Analysis Methods
Current analysis methods which combine pattern recognition with EST information and protein similarities are capable of accurate and automated analysis of large genomic regions containing complex multiple gene structures. Analysis methods will include the GRAIL EST/Protein homolog system, Procrustes (Pavel Pevzner et al.), and Genie (David Haussler et al.), as well as other tools. The results of multiple tools can be viewed in a common environment, combined mathematically in user-specified ways, and used as the basis for the automated or interactive construction of annotation.

Data Mining Agents
Maintenance of an up-to-date description of genomic regions will be based on the use of data mining agents formulated for particular information resources. These agents will make use of several different technologies, such as OPM (Victor Markowitz), Kleisli (Chris Overton), and database indices to facilitate meaningful information retrieval from remote sources. We expect new information related to each gene or genome region to continue to be discovered and actively linked in a long-term ongoing process. The current state of these links will be maintained in the data warehouse.

The Data Warehouse
The data warehouse provides and maintains a snapshot of the current view or status of the genomes or genomic regions analyzed in the project. The warehouse is designed to facilitate rapid access by users, visualization tools, and analysis systems. The views contained in the warehouse will be constructed and maintained by processes within this project (such as sequence analysis and information retrieval agents), with additional help from central databases like GDB. It will contain and make available a synthesized best view of genomes from multiple underlying sources.

Internet Object Request Broker
Services for data input, analysis, visualization and submission will be facilitated with a distributed underlying Internet architecture using CORBA with an object request broker to manage processes. Compute platforms, analysis servers, databases, etc. will be at a variety of locations and in some cases duplicated depending on need. Specialized computing hardware will be used to facilitate some tasks.

This project is jointly sponsored by the Computational Grand Challenge program of the Office of Computational and Technology Research and the Human Genome Program of the Office of Biological and Environmental Research of the Department of Energy.


GRAIL and GenQuest Sequence Annotation Tools

Ying Xu, Manesh B. Shah, J. Ralph Einstein, Morey Parang, Jay Snoddy, Sergey Petrov, Victor Olman, Ge Zhang, Richard J. Mural and Edward C. Uberbacher
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-1060
GRAILMAIL@ornl.gov

Our goal is to develop and implement an integrated intelligent system which can recognize biologically significant features in DNA sequence and provide insight into the organization and function of regions of genomic DNA. GRAIL is a modular expert system which facilitates the recognition of gene features and provides an environment for the construction of sequence annotation. The last several years have seen a rapid evolution of the technology for analyzing genomic DNA sequences. The current GRAIL systems (including the e-mail, XGRAIL, JAVA-GRAIL and genQuest systems) are perhaps the most widely used, comprehensive, and user-friendly systems available for computational characterization of genomic DNA sequence. In the past two years of the project we have:

  • Developed improved systems for recognition of exons, splice junctions, promoter elements and other features of biological importance, including greater sensitivity for exon prediction (especially in AT rich regions) and robust indel error detection capability.
  • Developed improved and more efficient algorithms for constructing models of the spliced mRNA products of human genes.
  • Developed and implemented methods for the analysis and visualization of sequence features including poly-A addition sites, potential Pol II promoters, CpG islands and repetitive DNA elements.
  • Designed and implemented new methods for detecting potential sequence errors which can be used to "correct" frameshifts, add quality assurance to sequencing operations, and better detect coding regions in low pass sequences such as ESTs.
  • Developed systems for a number of model organisms including mouse, Escherichia coli, Drosophila melanogaster, Arabidopsis thaliana, Saccharomyces cerevisiae and a number of microbial genomes.
  • Implemented methods for the incorporation of protein, EST, and mRNA sequence evidence in the multiple gene modeling process.
  • Constructed a powerful and intuitive graphical user interface and client-server architecture which supports Unix workstations and JAVA Web-based access from many platforms.
  • Improved algorithms and infrastructure in the genQuest server, allowing characterization of newly obtained sequences by homology-based methods using a number of protein, DNA, and motif databases and comparison methods such as FASTA, BLAST, parallel Smith-Waterman, and algorithms which consider potential frameshifts during sequence comparison.
  • Developed an improved "batch" GRAIL client that allows users to analyze groups of short (300-400 bp) sequences for coding character (with frameshift compensation options) and automates database searches of translations of putative coding regions.
  • Provided support for GRAIL use in more than a thousand laboratories, at a rate of over 4000 analysis requests per month.
The imminent wealth of genomic sequence data will present significant new challenges for sequence analysis systems. Our vision for the future entails incorporation of a more sophisticated view of biology into the GRAIL system. Computational systems for genome analysis have thus far focused on generic or textbook-like examples of single isolated genes which can be described fairly simply using the most usual assumptions, and fall far short of the intelligence necessary to interpret complex multiple gene domains. In its next phase the GRAIL project will involve the development of new pattern recognition methods and modeling algorithms for DNA sequence, expert systems for interpretation using experimental evidence and comparative genomics, and interoperation with other tools and databases. More specifically we will focus on several development areas:

(1) Improved accuracy of feature recognition and greatly increased tolerance to sequencing errors,

(2) development of technology to describe the structure and regulation of large, complex genomic regions containing multiple genes,

(3) automated and interactive methods for the incorporation of experimental evidence such as ESTs, mRNAs, and protein sequence homologs in multi-gene domains (GRAIL-EXP),

(4) more comprehensive feature recognition and increased biological sophistication in the areas of expression and regulation,

(5) capabilities for direct comparison of genomes,

(6) a comprehensive suite of microbial genome analysis systems,

(7) infrastructure for use of high-performance computing systems and specialized hardware to facilitate analysis and annotation of large volumes of sequence data,

(8) improved interoperation with other tools, databases and methods for integrating information from multiple sources, particularly within the Genome Annotation Consortium framework and Genome Channel, and

(9) continued community and user support, technology transfer, and educational outreach.

These developments will enable GRAIL to become more comprehensive and biologically sophisticated, and yet remain a user-friendly analysis environment which can be used interactively or in fully automated modes.

GRAIL and genQuest related tools are available as a Motif graphical client (via anonymous ftp from grail.lsd.ornl.gov (134.167.140.9)), through WWW interfaces (http://compbio.ornl.gov/), or by e-mail server (GRAIL at grail@ornl.gov, genQuest at Q@ornl.gov). Communications with the GRAIL staff should be addressed to GRAILMAIL@ornl.gov. (Supported by the Office of Health and Environmental Research, United States Department of Energy, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.)


The Stationary Statistical Properties of Human Coding Sequences

David C. Torney, Clive C. Whittaker, and Gouchun Xie
Los Alamos National Laboratory, Theoretical Division, Los Alamos, New Mexico 87545, and Merck and Company, Inc., West Point, Pennsylvania
dct@lanl.gov

We analyzed approximately 0.5 Mb of in-frame, nonredundant coding sequences from the NFRES database. These were reversibly encoded as binary sequences, using (1, 1) for A, (1, -1) for C, (-1, 1) for G, and (-1, -1) for T. This binary encoding is appropriate for characterizing all stationary statistical features of the data, either in terms of moments or of cumulants.

Moments are expectations of products of digits. Cumulants are constructed from the moments, aiming to remove features that could have been predicted from subsets of the digits. The simplest cumulants are covariances of two digits, vanishing if these are independent. In fact, the cumulants offer a more concise description of the statistical properties of coding sequences than do the moments.
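
As a concrete illustration of the encoding and of the simplest stationary statistics, the following sketch (Python with NumPy; the function names and the toy sequence are ours, not part of the analysis described here) computes the empirical covariance between two digit positions at a given codon separation:

    import numpy as np

    # Two binary digits per base: A=(1,1), C=(1,-1), G=(-1,1), T=(-1,-1)
    ENCODING = {"A": (1, 1), "C": (1, -1), "G": (-1, 1), "T": (-1, -1)}

    def encode(seq):
        """Map a DNA string to its binary-digit sequence (two digits per base)."""
        return np.array([d for base in seq for d in ENCODING[base]], dtype=float)

    def digit_covariance(digits, offset1, offset2, separation):
        """Empirical covariance between digit type offset1 and digit type offset2
        (each 0-5 within a codon's six digits), separated by 'separation' codons."""
        n_codons = len(digits) // 6
        x = digits[offset1 : 6 * (n_codons - separation) : 6]
        y = digits[offset2 + 6 * separation :: 6][: len(x)]
        return np.mean(x * y) - np.mean(x) * np.mean(y)

    # Toy in-frame sequence; the real analysis uses the 0.5 Mb coding set.
    digits = encode("ATGGCCAAAGTTCTGGAA" * 50)
    print(digit_covariance(digits, 0, 0, 3))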

Turning to the covariances of digits from coding sequences: since each codon is represented by six binary digits, there are 36 types of digit pairs to consider separately. Some of these pairs exhibit a nonzero asymptote as the separation between the two digits increases. Most pairs exhibit interesting transients, out to about 50 bases. We will show plots of these 36 types of covariances.

We will also report results for cumulants of larger numbers of digits. For cumulants of three digits, the asymptotes are all effectively zero, as the spacings increase indefinitely, except for those cases in which two digits are derived from one base position. Similarly, the cumulants of four digits that have the largest asymptotic values are the ones in which these four are derived from two base positions. We will show plots for the cumulants corresponding to the nine pairs of codon bases. The cumulants of six digits, corresponding to three bases, appear to have a zero asymptotic value, as the two spacings between the bases increase.

A complete list of the nonzero cumulants--a tabulation of the stationary statistical properties of human coding sequences--is in hand. This tabulation could greatly facilitate the classification of anonymous DNA sequences and provide a natural starting point for the non-stationary analysis of DNA sequences.


WIT/WIT2: A System for Supporting Metabolic Reconstruction and Comparative Analysis of Sequenced Genomes

Ross Overbeek,* Natalia Maltsev,* Gordon Pusch,* and Evgeni Selkov * **
* Mathematics and Computer Science Division, Argonne National Laboratory, and
** Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142292 Pushchino, Moscow region, Russia
maltsev@mcs.anl.gov

The WIT/WIT2 system has been developed to support metabolic reconstruction from sequenced genomes. Specifically, it supports

1. derivation of initial assignments of function to ORFs for an organism,
2. use of these assignments to construct an initial estimate of the metabolic pathways present in the organism,
3. use of consistency analysis to refine the functional assignments, and
4. provision of a framework for presenting and refining the emerging metabolic models for a set of organisms.

WIT2 is a UNIX-based system that is made available to anyone wishing to support Web-based access to genomic sequence data. It includes in the standard distribution a set of integrated genomes from the public archives. For each of the distributed organisms, the user has access to the ORFs, RNAs, contigs, function assignments, and asserted pathways that characterize the current state of the analysis of the genome. The user has the option of adding new public or proprietary genomes, and then analyzing the new genomes based on clustering of ORFs with the distributed genomes. The system supports both shared and non-shared annotation of features and the maintenance of multiple models of the metabolism for each organism. WIT2 comes in two parts: a Web-based system offering access to the data and a set of batch tools that offer extensible query access against the data. The Web-based tools integrate WIT2 with other sources of data on the Web, while the batch tools allow one to do pattern matching, extract regions of sequence for analysis using other tools, look for operons, and so forth.

Currently, the released system offers access to data for the following organisms:

Archaeoglobus fulgidus, Caenorhabditis elegans, Deinococcus radiodurans, Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Methanobacterium thermoautotrophicum, Methanococcus jannaschii, Mycobacterium tuberculosis, Mycoplasma genitalium, Mycoplasma pneumoniae, Rhodobacter capsulatus SB1003, Saccharomyces cerevisiae, Synechocystis sp., and Treponema pallidum.

The release at Argonne National Laboratory is accessible via http://wit.mcs.anl.gov/WIT2/.

Acknowledgment. This work was supported by U.S. Department of Energy under Contract W-31-109-Eng-38.


Internet Release of the Metabolic Pathways Database, MPW

Evgeni Selkov, Jr.,* Yuri Grechkin,* and Evgeni Selkov * **
* Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142292 Pushchino, Moscow region, Russia, and
** Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Ave., MCS-221, Argonne, IL 60439-4844, USA
selkovjr@turtle.stack.net

The Metabolic Pathway Database, MPW [1, 2], a subset of EMP [3, 4], plays a fundamental role in the technology of metabolic reconstructions from sequenced genomes under the PUMA, WIT, and WIT2 systems [5-7]. It is the largest and most comprehensive metabolic database, including some 2,800 pathway diagrams covering primary and secondary metabolism, membrane transport, signal transduction pathways, intracellular traffic, translation, and transcription.

The MPW diagrams were originally encoded and distributed as formatted ASCII text [2]. In the current public release of MPW [8], the encoding is based on the logical structure of the pathways and is represented by the objects commonly used in electronic circuit design. This facilitates drawing and editing the diagrams and makes possible automation of the basic simulation operations such as deriving stoichiometric matrices, rate laws, and, ultimately, dynamic models of metabolic pathways. Individual pathway diagrams, automatically derived from the original ASCII records [1, 9], are stored as SGML instances supplemented by relational indices. An auxiliary database of compound names, encoded as SMILES strings [10], is maintained to unambiguously connect the pathways to the chemical structures of their intermediates. In accordance with the IUPAC nomenclature for chemical compounds, the current release supports superscripts and subscripts, Greek letters, italic fonts, etc.
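
As a minimal illustration of how an auxiliary SMILES database can tie pathway intermediates to chemical structures (the entries and names below are generic examples chosen by us, not actual MPW records):

    # Illustrative compound-name -> SMILES mapping; entries are common examples,
    # not records taken from the MPW auxiliary database described above.
    SMILES = {
        "water":        "O",
        "ethanol":      "CCO",
        "acetaldehyde": "CC=O",
        "pyruvate":     "CC(=O)C([O-])=O",
    }

    def structure_of(compound_name):
        """Resolve a pathway intermediate's name to its SMILES structure, or None."""
        return SMILES.get(compound_name.strip().lower())

    print(structure_of("Pyruvate"))   # CC(=O)C([O-])=O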

Acknowledgment. This work was supported by U.S. Department of Energy under Contract W-31-109-Eng-38, and award OR00033-97CIS001.

References

1. Selkov E., Galimova M., Goryanin I., Gretchkin Y., Ivanova N., Komarov Y., Maltsev N., Mikhailova N., Nenashev V., Overbeek R., Panyushkina E., Pronevitch L., Selkov E., Jr. Nucleic Acids Res., 1997, 25 (1), 37-38

2. http://www.biobase.com/emphome.html/homepage.html/pags/pathways.html

3. Selkov E., Basmanova S., Gaasterland T., Goryanin I., Gretchkin Y., Maltsev N., Nenashev V., Overbeek R., Panyushkina E., Pronevitch L, Selkov E., Jr., Yunus I. Nucleic Acids Res., 1996, 24 (1), 26-29

4. http://www.biobase.com/EMP

5. http://www.mcs.anl.gov/home/compbio/PUMA/Production/ReconstructedMetabolism/reconstruction.html

6. http://www.mcs.anl.gov/home/compbio/WIT/wit.html

7. http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi

8. http://beauty.isdn.mcs.anl.gov/WIT2.pub/CGI/org.cgi

9. http://www.cme.msu.edu/MPW/

10. http://www.daylight.com/dayhtml/smiles/index.html


Metabolic Reconstruction from Sequenced Genomes

Evgeni Selkov,*+ Natalia Maltsev,* and Ross Overbeek*
* Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Ave., MCS-221, Argonne, IL 60439-4844, USA,
+ Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142292 Pushchino, Moscow region, Russia
evgeni@mcs.anl.gov

With the availability of increasing numbers of complete genomes, the possibility of developing accurate models of metabolism for these organisms becomes of central interest. We have initiated a project to "reconstruct the metabolism" of organisms from the sequence data supplemented by available biochemical and phenotypic data, and we have developed initial reconstructions for a number of organisms. These reconstructions for over twenty organisms (some based on incomplete sequence data) have been made available via the WIT and WIT2 systems [1-3].

The reconstructions are based on the collection of metabolic pathways, MPW. This collection now includes over 2,800 diagrams and is continually being enhanced. Each diagram represents a grouping of functional roles, and these groupings provide a resource for analyzing the functional assignments made to ORFs in newly sequenced genomes; when several functions in a pathway have been clearly identified, a more focused analysis of the "missing functions" provides a powerful means of improving function assignments that were originally made without access to an overall understanding of the metabolism of the organism.
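
A minimal sketch of this kind of consistency check, assuming a pathway is represented simply as a set of required functional roles and assignments as a set of roles already attached to ORFs (the role identifiers below are illustrative placeholders, not MPW content):

    def pathway_gaps(pathway_roles, assigned_roles):
        """Compare a pathway's functional roles with roles already assigned to ORFs;
        return the fraction of the pathway present and the 'missing functions'
        that merit a more focused search."""
        present = pathway_roles & assigned_roles
        missing = pathway_roles - assigned_roles
        return len(present) / len(pathway_roles), missing

    # Toy example; the EC numbers are placeholders.
    pathway = {"EC 2.7.1.40", "EC 4.2.1.11", "EC 5.4.2.1"}
    assigned = {"EC 2.7.1.40", "EC 5.4.2.1"}
    completeness, missing = pathway_gaps(pathway, assigned)
    print(completeness, missing)   # 0.666..., {'EC 4.2.1.11'}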

The actual process of metabolic reconstruction involves a number of steps which we describe in detail. We have made the WIT/WIT2 system available to support such efforts, but the actual process is independent of this software and will undoubtedly be adopted by other efforts based on the same overall goal of using sequence data as a foundation to develop an accurate model of the metabolism of an organism.

We are entering a new phase of the project in which substantial benefit can be achieved via our growing understanding of which parts of metabolism are universal, of gene families (allowing more rapid development of initial function assignments made to ORFs), and of insights achieved from one genome that have obvious application in others. What is emerging is not just a large number of isolated metabolic portraits for a set of diverse microbial organisms, but rather an integrated understanding of the evolution of metabolism and the technology for developing higher-level functional models.

Acknowledgment. This work was supported by U.S. Department of Energy under Contract W-31-109-Eng-38.

References

1. http://www.mcs.anl.gov/home/compbio/WIT/wit.html

2. http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi

3. http://beauty.isdn.mcs.anl.gov/WIT2.pub/CGI/org.cgi


From Genomic Sequence to Protein Expression: A Model for Functional Genomics

Tracy J. Wright, Simone Abmayr, Min S. Park, Darrell O. Ricke, Cleo Naranjo, Becky Welsh-Breitinger, Karen Denison and Michael R. Altherr
Genomics Group, Mail Stop M888, Los Alamos National Laboratory, Los Alamos, New Mexico 87545

The goal of Functional Genomics is to provide useful and effective annotation of genomic sequence data. While effective annotation is likely to be defined by the needs of the end user, there are a variety of routine procedures that geneticists employ in the characterization of any gene or hypothetical gene. Included in these collections of procedures are techniques to identify genetic variation (or polymorphism), transcript complexity and distribution, and protein coding capacity. However, the first step in the process is the identification of putative coding segments within a genomic sequence. This can be accomplished by interrogating the expressed sequence tag database (dbEST) with the genomic sequence of interest, or by submitting the genomic sequence to a gene identification program like GRAIL in an effort to identify potential transcriptional domains. Sequences and clones identified in this way should be considered putative genes until their functional significance is substantiated by additional biological data.

We have initiated a process to evaluate the biological function of putative genes identified in a large segment of genomic sequence. The genomic sequence we have focused on represents a portion (~200 kbp) of a 2.2 Mbp sequence contig derived from a gene-rich region on human chromosome 4 (4p16.3). This region is significant because it has been demonstrated to represent the smallest region of overlap for deletions in Wolf-Hirschhorn syndrome (WHS) patients. WHS is a multiple-anomaly dysmorphic malady characterized by mental and developmental defects. Due to the complex and variable expression of this disorder, it is thought that WHS is a contiguous gene syndrome with an undefined number of genes contributing to the phenotype. The analysis of genomic sequence data by BLAST, FASTA and GRAIL identified a number of putative transcription units. Potential coding segments were subjected to full insert sequencing of cDNA (when available) and Northern blot analysis. These initial analyses identified two putative coding segments with a high probability (intron/exon structure and positive Northern results) of representing bona fide genes. In addition, highly similar clones have been isolated from mouse, providing additional support that these represent true coding segments. The human coding segments have been cloned into expression vectors and will ultimately be used to generate antibodies to confirm the existence of the hypothetical proteins.

This work has allowed us to evaluate coding potential and possible function of genes identified by genomic sequencing efforts. It has identified a number of potential bottlenecks and allowed us to conceptually develop a scheme to provide functional annotation for genomic sequence.

Support: DOE Contract W7405-ENG-36 and LANL LDRD funds.

E-mail Contact: ALTHERR@LANL.GOV
Day Phone: (505)665-6144


FAKtory: A Customizable Fragment Assembly System

Eugene W. Myers, Susan J. Larson, Brad W. Traweek, Kedarnath A. Dubhashi
University of Arizona, Tucson, AZ
slarson@cs.arizona.edu

FAKtory supports a large range of sequencing protocols and specialized fragment database information. Customizations include a processing pipeline for clipping, tagging, and vector trimming stages (prescreeners), and a configurable fragment database. Each pipeline stage can run in Automatic, Supervised, or Manual mode depending on the degree of user control desired. The FAKII Fragment Assembler provides high-sensitivity overlap detection, near-perfect multialignments, alternate assemblies, and support of user-specifiable assembly constraints. While initial development on FAKtory emphasized its customizability, recent work provides sophisticated prescreeners, an editor for contig layout manipulation, a Finishing editor, and I/O filter capabilities for easy translation of data formats and linking to post-analysis programs.

A pipeline can include any number of prescreener stages for vector removal, data trimming, and tagging. Each prescreener includes a number of recognizers, which locate, within a fragment, selected intervals, frequencies of specified bases, trace signal characteristics, overlaps with reference sequences, or matches to regular expressions. Because alternate assemblies may be generated for a given data set, FAKtory provides a Layout Edit panel for comparing and interacting with the potential solutions. Portions of contigs can be locked together, split apart, or checked for possible joins to other contigs. Constraints may be added or deleted, and reassembly with additional data generates additional assemblies to consider.

A Finishing editor displays a multialignment with a scrollable canvas of all trace data for the current location. FAKtory allows automatic tabbing to the previous or next unedited problem in the contig, minimizes the number of keystrokes needed in an editing sweep, and allows unlimited undo. All edited regions are shown to indicate finishing progress.


Divide-and-Conquer Multiple Sequence Alignment

Dan Gusfield, Jens Stoye
Department of Computer Science, University of California, Davis, CA 95616, USA
stoye@cs.ucdavis.edu

We present a fast heuristic algorithm for the simultaneous alignment of multiple sequences which provides near-to-optimal results for sufficiently homologous sequences.

The algorithm makes use of the optimal alignments of all pairs of sequences, which give rise to secondary matrices containing the additional charges imposed by forcing the alignment path to run through a particular vertex of the distance matrix. From these "additional-cost" matrices, we compute suitable positions for cutting all of the given sequences simultaneously, thus reducing the problem of aligning a family of sequences, in a divide-and-conquer fashion, to aligning two families of sequences, each of approximately half the original length. This division procedure is reiterated recursively until the subsequences are sufficiently short, at which point they are aligned optimally.
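
A rough sketch of the central quantity, assuming unit edit costs for simplicity (the actual method works with general alignment charges): the additional-cost matrix for a pair of sequences is obtained from forward and backward alignment tables, and a cut point in the second sequence is chosen to minimize the extra charge given a fixed cut near the middle of the first.

    import numpy as np

    def edit_dp(s, t):
        """Standard unit-cost alignment table over prefixes of s and t."""
        n, m = len(s), len(t)
        D = np.zeros((n + 1, m + 1))
        D[:, 0] = np.arange(n + 1)
        D[0, :] = np.arange(m + 1)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                              D[i - 1, j - 1] + (s[i - 1] != t[j - 1]))
        return D

    def additional_cost(s, t):
        """C[i, j]: extra charge imposed by forcing the alignment path of s and t
        through the cut point (i, j), relative to their optimal alignment."""
        F = edit_dp(s, t)                              # prefix (forward) costs
        B = edit_dp(s[::-1], t[::-1])[::-1, ::-1]      # suffix (backward) costs
        return F + B - F[len(s), len(t)]

    # Fix a cut near the middle of the first sequence and pick the cut in the
    # second that adds the least cost; with more sequences, the summed pairwise
    # additional costs are minimized over all cut combinations.
    s1, s2 = "ACCGTGA", "ACGTTGA"
    C = additional_cost(s1, s2)
    i = len(s1) // 2
    print(int(C[i].argmin()), C[i].min())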

This procedure allows us to align simultaneously up to twelve amino acid sequences of the usual length (< 500) within a few minutes. The poster will also present several results concerning running time and memory usage as well as the quality of the obtained alignments.

Acknowledgment:

Part of this work was supported by Dept. of Energy grant DE-FG03-90ER60999 and by the German Academic Exchange Service (DAAD).


Sequence Assembly Validation by Restriction Digest Analysis

Eric C. Rouchka and David J. States
Washington University
ecr@ibc.wustl.edu or states@ibc.wustl.edu

DNA sequence analysis depends on the accurate assembly of fragment reads for the determination of a consensus sequence. Genomic sequences frequently contain repeat elements that may confound the fragment assembly process, and errors in fragment assembly may seriously impact the biological interpretation of the sequence data. Validating the fidelity of sequence assembly by experimental means is desirable. This report examines the use of restriction digest analysis as a method for testing the fidelity of sequence assembly. A dynamic programming algorithm to determine the maximum likelihood alignment of error prone electrophoretic mobility data is derived and used to assess the likelihood of detecting rearrangements in genomic sequencing projects.

Restriction digest fingerprint matching is an established technology for high resolution physical map construction, but the requirements for assembly validation differ from those of fingerprint mapping. Fingerprint matching is a statistical process that is robust to the presence of errors in the data and independent of absolute fragment mass determination. Assembly validation depends on the recognition of a small number of discrepant fragments and is very sensitive to both false positive and false negative errors in the data. Assembly validation relies on the comparison of absolute masses derived from sequence with masses that are experimentally determined, making absolute accuracy as well as experimental precision important. As the size of a sequencing project increases, the difficulties in assembly validation by restriction fingerprinting become more severe. Simulation studies are used to demonstrate that large-scale errors in sequence assembly can escape detection in fingerprint pattern comparison. Alternative technologies for sequence assembly validation are discussed.
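
As an illustration of the flavor of such a comparison (a simplified sketch, not the algorithm derived in this report), the code below aligns predicted against observed fragment masses by dynamic programming, scoring matches under a Gaussian measurement-error model and penalizing missed or spurious bands:

    import numpy as np

    def align_fragments(predicted, observed, sigma=0.02, skip_penalty=8.0):
        """Maximum-log-likelihood alignment of predicted vs. observed fragment
        masses (both sorted, in kb).  Matches are scored under a Gaussian error
        model with relative s.d. sigma; skips model missed or spurious bands.
        Parameter values are illustrative."""
        n, m = len(predicted), len(observed)
        score = np.full((n + 1, m + 1), -np.inf)
        score[0, 0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0:                      # predicted fragment not observed
                    score[i, j] = max(score[i, j], score[i - 1, j] - skip_penalty)
                if j > 0:                      # spurious observed band
                    score[i, j] = max(score[i, j], score[i, j - 1] - skip_penalty)
                if i > 0 and j > 0:            # matched fragment
                    z = (observed[j - 1] - predicted[i - 1]) / (sigma * predicted[i - 1])
                    score[i, j] = max(score[i, j], score[i - 1, j - 1] - 0.5 * z * z)
        return score[n, m]

    print(align_fragments([1.2, 3.5, 6.8], [1.23, 3.4, 6.9]))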


Segmentation Based Analysis of Genomic Sequence

Eric C. Rouchka and David J. States
Washington University
ecr@ibc.wustl.edu or states@ibc.wustl.edu

The human genome is patchy and non-uniform in composition. We have developed a method for analyzing genomic sequence based on segmental variation in sequence composition using a heuristic algorithm employing classic changepoint methods and log-likelihood statistics. Our approach models the genome as composed of linear segments each characterized by its own compositional characteristics. The most informative description of a sequence is that description that maximizes the likelihood of the observed sequence given the model while simultaneously minimizing the cost of specifying the model. A Java interface has been developed to aid in visual inspection of the segmentation results (http://www.ibc.wustl.edu/~ecr/CPG/segment.html). The software has been tested using CpG dinucleotide distribution as a well-established example of biologically significant non-uniform composition.
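
A minimal sketch of the changepoint scoring behind such a segmentation, assuming a simple Bernoulli model over a 0/1 CpG-indicator track and a fixed per-segment model cost (the penalty value and data are illustrative):

    import numpy as np

    def log_lik(k, n):
        """Bernoulli log-likelihood of k CpG 'hits' in n positions at the MLE rate."""
        if k == 0 or k == n:
            return 0.0
        p = k / n
        return k * np.log(p) + (n - k) * np.log(1 - p)

    def best_changepoint(hits, penalty=5.0):
        """Return (gain, position) for the single split of a 0/1 CpG-indicator
        array that most increases the penalized log-likelihood; a gain <= 0
        means the region is better modeled as one homogeneous segment."""
        n, k = len(hits), int(np.sum(hits))
        whole = log_lik(k, n)
        best = (-np.inf, None)
        left = 0
        for i in range(1, n):
            left += hits[i - 1]
            gain = log_lik(left, i) + log_lik(k - left, n - i) - whole - penalty
            if gain > best[0]:
                best = (gain, i)
        return best

    hits = np.array([0] * 300 + [1, 0, 0, 1] * 50)   # toy: CpG-poor run, then CpG-rich run
    print(best_changepoint(hits))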

Regions of DNA rich in CpG dinucleotides, also known as CpG islands, are often found upstream of the transcription start site in both tissue-specific and housekeeping genes. Overall, CpG dinucleotides are observed at a density of 25% of the level expected from base composition alone, partially due to 5-methylcytosine decay. About 56% of human genes have associated CpG-rich islands. Since CpG dinucleotides typically occur with low frequency, CpG islands can be distinguished statistically in the genome. Our model is tested using several sequences obtainable from GenBank, including a 220 kb fragment of the human X chromosome from the filamin (FLN) gene to the glucose-6-phosphate dehydrogenase (G6PD) gene, which has been experimentally studied. Results demonstrate a breakpoint segmentation that is consistent with manual analysis.

In addition to segmenting DNA according to the location of CpG islands, other compositions are explored as well. Among these are C + G content, mononucleotide content, and dinucleotide content. Segmentation according to higher order oligomers is also under consideration.


Swedish and Finnish Quality Based Finishing Tools for a Production Sequencing Facility

Matt P. Nolan, Jane E. Lamerdin, Stephanie A. Stilwagen, Glenda G. Quan, Ami L. Kyle, Anthony V. Carrano
Joint Genome Institute, Lawrence Livermore National Laboratory
nolan1@llnl.gov

Our modified shotgun sequencing effort has three phases. In the random phase we sequence a fixed number of plates, resulting in 80-95% of the cosmid bases meeting our quality-based, double-stranded finish criteria (QbDsFc). During pre-finishing we resequence clones, attempting in one round of forward and reverse reads to meet the QbDsFc for 95% of the bases and close most gaps. During directed closure we close any remaining gaps and complete double-stranding. To reduce finishing costs and speed time to completion for our ~40 kb cosmid clone projects we created software to automate selection of finishing reads. We describe our SaF (Swedish and Finnish) software tools developed to 1) facilitate the specification of clones for resequencing and 2) quantify the state of project contigs with respect to our QbDsFc.

For a project assemblage our SaF tools identify bases not meeting the QbDsFc, then conglomerate these problem bases into problem regions using parameterized filtering and clustering algorithms. They produce reports listing each problem region and a contig summary. Within each problem region, we quantify the problem with respect to our QbDsFc, the phrap consensus quality values and whether or not the consensus sequence overlaps a database of repeats. We can generate a list of sequences (candidates for resequencing) intersecting the problem region. For each sequence, we identify where the read intersects the region. Within the intersection we compute the mean and sigma of the phred basecall quality values, quantify how well it matches the consensus sequence, and compute a value proportional to resequencing suitability.
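
A simplified sketch of the conglomeration step, assuming failing consensus positions have already been identified; the gap and size thresholds stand in for the parameterized filtering and clustering described above:

    def problem_regions(problem_positions, max_gap=10, min_size=1):
        """Cluster consensus positions that fail the finish criteria into problem
        regions: failing bases separated by at most max_gap are merged, and
        regions shorter than min_size are dropped.  Thresholds are illustrative."""
        regions = []
        for pos in sorted(problem_positions):
            if regions and pos - regions[-1][1] <= max_gap:
                regions[-1][1] = pos
            else:
                regions.append([pos, pos])
        return [(a, b) for a, b in regions if b - a + 1 >= min_size]

    print(problem_regions([102, 103, 110, 400, 401, 402]))   # [(102, 110), (400, 402)]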

In our production sequencing we use the SaF tools to fully automate clone selection in the pre-finishing phase, and we require finishers to address each region identified during directed closure. The SaF applications consist of ~8000 lines of C++ and ~1000 lines of Perl. We run the script Swedish-setup (~4000 lines of Perl) to coordinate the conversion of assembly data through multiple formats to construct a CAF file representation of the phrap assemblage, which the SaF applications take as input. One application produces a file used to direct a robotic workstation to rearray clones to prepare chemistries for finishing reads. We expect other new SaF functionality to migrate into our production environment to meet the tenfold increase projected for JGI sequencing in the upcoming year.

Work performed under the auspices of the US DOE by Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48


Informatics to Support Increased Throughput and Quality Assurance in a Production Sequencing Facility

Arthur Kobayashi, David J. Ow, Tom Slezak, Mark C. Wagner, Matt P. Nolan, T. Mimi Yeh, Stephan Trong, Anthony V. Carrano
LLNL HGC; Joint Genome Institute
kobayashi1@llnl.gov

We are developing an integrated informatics infrastructure to support the increasing throughput and quality assurance demands of our production sequencing facility. The LLNL Human Genome Center utilizes a modified shotgun-based approach to sequence human chromosome 19 and targeted gene regions of interest. To date we have finished over 1.5 Mb of high-quality genomic sequence, including a contiguous 1-Mb region. Our laboratory strategies and protocols are described in more detail in other abstracts in these proceedings (e.g., Lamerdin, McCready, et al.).

To support this increase in sequence data volume, we have designed and implemented our informatics system to support our current sample prep and sequencing strategy: bulk generation of dye-primer and dye-terminator reads early in the random phase, followed by increasing automation in the pre-finishing and directed closure phases. We have developed a number of software programs (predominantly Sybperl or Perl/Tk) which are integrated through our Sybase relational database.

To support increased throughput, we have developed Sybperl programs to generate HTML WWW forms to create and manage sequencing projects, track samples, and create sample sheets. We have also implemented an automated sample file sorting system that analyzes and distributes sample files from our 14 ABI sequencers, which currently are loaded and run twice each day. Analysis results for E. coli contamination, sequence read length, and percent vector are archived in our relational database for reports, trend analysis, and troubleshooting. We have recently completed a rearraying procedure for the pre-finishing and gap closure phases using laboratory robots, which are also integrated into our database and sample-tracking system.

To support quality assurance procedures, we have developed an automated reporting system which uses archived analysis results as well as project information to generate and distribute a series of reports which include an estimate of sequencing efficiencies and project consensus quality. We also have provided WWW access to project directories, which display current quality plots and assembly information for each project. In addition, we have developed a suite of tools which can be used for analyzing specific projects for sequencing efficiencies, read length, etc. We plan to continue to develop and expand our informatics capabilities as we continue to significantly scale up our throughput over the next several years.

This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.


Software Tools for Data Analysis in Automated DNA Sequencing

Michael C. Giddings, Jessica Severin, Michael Westphall, and Lloyd M. Smith
University of Wisconsin-Madison Chemistry Dept., 1101 University Ave., Madison, WI 53703
giddings@whitewater.chem.wisc.edu
http://smithlab.chem.wisc.edu

A crucial component of the automated DNA sequencing process is the analysis software. The software analysis can be roughly divided into four primary steps: gel analysis, base calling, assembly, and finishing. In each of these steps, the software is responsible for the difficult task of accurately analyzing large quantities of complex data and deriving information that is useful to end users of the data.

We have developed software for gel analysis and base calling utilizing a cross-platform, modular, object-oriented architecture. The core of this software is BaseFinder, a framework for trace processing, analysis, and base-calling. BaseFinder is highly extensible, allowing the addition of trace analysis and processing modules without recompilation. Powerful scripting capabilities combined with modularity allow the user to customize BaseFinder to virtually any type of trace processing. It currently runs on Windows/NT (Microsoft) and OpenStep/Mach (NeXT), with ongoing work to port it to additional operating systems, including Solaris (Sun), Rhapsody (Apple), and Linux (GNU/freeware). The Solaris port is in progress and is expected to be completed shortly.

Base calling is currently performed by an adaptive, iterative algorithm which can be quickly tuned to work on data from a large variety of sequencing instruments. The base calling module, when applied in combination with appropriate pre-processing of the data, provides fast, accurate base labeling along with confidence measures on the bases called. It has been extensively tested with data from a number of machines, including the ABI 373A with and without the "stretch" (long gel) upgrade, our in-house Horizontal Ultrathin Gel Electrophoresis (HUGE) system, our vertical slab-gel scanning system, and a capillary sequencing system.

We are also working on a project to incorporate these programs and software developed elsewhere, along with a relational database engine, utilizing distributed object technology. This system has the promise of providing a robust means of dealing with the large quantities of complex data that must be handled in a genome sequencing center, while providing a simple and consistent view of the entire process to a human operator.


Treasures and Artifacts from Human Genome Sequence Analysis

Darrell O. Ricke and Larry L. Deaven
Los Alamos National Laboratory, Center for Human Genome Studies, Los Alamos, New Mexico 87545
ricke@telomere.lanl.gov

A very interesting aspect of the Human Genome Project (HGP) is the analysis of the sequences being generated. Sequence analysis of genomic sequence data is yielding treasures and interesting artifacts. Human genomic sequence analysis finds significant similarities to (1) known genes, (2) known repetitive sequences (Alu, L1, etc.), (3) human and mammalian ESTs, (4) trapped exons, (5) gene orthologs and paralogs (both at the DNA and protein level), (6) tRNAs, (7) snRNAs, (8) rRNAs, (9) CpG DNA sequences, etc. Multiple significant protein sequence similarities have been observed between translated human genomic regions (i.e., exons) and protein sequences from Drosophila, yeast, and C. elegans. Similarities with less than 90% identity between a human genomic sequence and a human EST sequence usually represent two members of a multigene family or a possible pseudogene. Surprisingly, human genomic DNA includes multiple regions with significant similarities to mitochondrial DNA. Analysis of over 6 megabases of human genomic sequence data has detected 244 EST clusters, 21 known genes, 55 novel genes, and 35 trapped exons. These and many other treasures await us as the HGP progresses.

Identification of the treasures in the genomic sequence data is confounded by the multiple artifacts that are encountered. Artifacts that need to be ignored include: low-complexity similarities, similarities to unannotated repeats, similarities to unannotated contaminating sequences (vector, yeast, and E. coli), exon predictions within repetitive sequences, etc. Many EST, intron, and genomic sequences contain novel repeats that have not been previously described or annotated. One hallmark of a novel repeat in an EST is the lack of sequence similarity between the rest of the EST sequence and the genomic sequence surrounding the region of similarity. One particularly interesting artifact is the presence of a human Alu repeat in a bacterial database sequence (pva). Most likely this represents a bacterial sequence contaminated with human sequence. However, horizontal DNA transfer cannot be ruled out. Other interesting examples include similarities between human DNA and plant DNA database sequences.

http://www.jgi.doe.gov and https://www.chgs.lanl.gov


Annotating and Masking Repetitive Sequences with RepeatMasker

Arian F.A. Smit and Phil Green
Human Genome Center, Department of Medicine, University of Washington, Seattle, WA 98195

Probably over half of mammalian genomic DNA is composed of sequences that can still be recognized as derived from transposable elements. These interspersed elements, as well as simple repeats and low-complexity DNA, need to be masked before performing database searches, since interesting sequence similarities may be overlooked amidst the many spurious matches to these repeats. Furthermore, recognition of interspersed repeats can dramatically improve gene prediction and interspecies sequence comparison and is useful as a finishing tool in genomic sequencing. RepeatMasker is a widely used program to annotate and mask repetitive DNA sequences. Here, we report on recent improvements in the program and databases as well as development currently in progress. We also present examples of the usefulness of RepeatMasker in database searches, gene prediction and evolutionary studies. The program can be run locally on any computer with sufficient memory, or can be accessed via the web at http://ftp.genome.washington.edu/cgi-bin/RepeatMasker or by e-mail at


Why Is Basecalling Hard to Do Well? Sources of Variability in DNA Sequencing Traces and Their Consequences

David O. Nelson
Human Genome Center
Lawrence Livermore National Laboratory
daven@llnl.gov

Terence P. Speed
Statistics Department
University of California, Berkeley

"Basecalling" is the process of inferring the sequence of a segment of DNA from data arising from electrophoresis traces. Assigning a realistic, probabilistic quality measure to the inferred bases in a DNA sequence requires knowledge of the variability of the data, given the underlying sequence. Treating the sequencing process as an attempt to decode a message in a digital communications system provides us with a framework for analyzing that variability, quantifying it, assigning it to various features of the system, and assessing the performance consequences.

We have examined traces of known sequences from locally produced ABI 377 data to assess the extent of variability in DNA traces. We present data showing the variability of interpeak distance and peak width as a function of base number. We also present data showing the relationship between base number and the amount of information in the signal available for basecalling. We show that, as the number of bases called increases, the primary stumbling block to accurate basecalling is similar to the problem of "synchronization" in a digital communications system. We examine the relationship between the probabilistic quality scores produced by Phil Green's basecaller "phred" and the local resolution of the signal, defined for a consecutive pair of peaks as the ratio of the interarrival time between the two peaks to the sum of the scale parameters for the peaks. We derive a simple linear model for the probability that the resolution is less than one. Our data show that beyond a few hundred bases phred quality scores appear to be driven mainly by the probability that the resolution of the signal is less than one. These statistical characteristics of the underlying signal help to explain the abruptness of the well-known "phase transition" from high-quality to low-quality decisions seen in DNA sequence data.
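
For concreteness, a small sketch of the resolution measure as defined above (the peak centers and scale parameters are made-up values, not measured data):

    def local_resolution(peak_times, peak_scales):
        """Resolution for each consecutive pair of peaks: interarrival time divided
        by the sum of the two peaks' scale parameters (widths).  Values below 1
        indicate peaks that are no longer well separated."""
        return [(t2 - t1) / (s1 + s2)
                for t1, t2, s1, s2 in zip(peak_times, peak_times[1:],
                                          peak_scales, peak_scales[1:])]

    times  = [100.0, 112.0, 123.0, 131.0]   # illustrative peak centers (scans)
    scales = [5.0, 5.5, 6.0, 6.5]           # illustrative peak widths
    print([round(r, 2) for r in local_resolution(times, scales)])   # [1.14, 0.96, 0.64]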


A Statistical Model for Basecalling

Lei Li, David O. Nelson, and Terence P. Speed
Department of Statistics, University of California, Berkeley
lilei@stat.berkeley.edu, daven@llnl.gov, terry@stat.berkeley.edu

Base-calling is that part of automated DNA sequencing which takes the time-varying signal of four fluorescence intensities and produces an estimate of the underlying DNA sequence which gave rise to that signal.

We approach the problem of base-calling from a statistical perspective, hoping to make use of a statistical model to call bases, and also to attach suitable measures of uncertainty to the bases we call. We model the automated Sanger sequencing process by the following steps. First, the underlying sequence of bases is encoded by a hidden Markov model into a virtual signal, consisting of four spike trains. Second, each component of the virtual signal is distorted by a slowly-changing point spread function, to represent the diffusion effect of electrophoresis. Third, mobility shifts are applied to the components separately, with the result being the (model) time varying concentrations of the four bases. Finally, these dye concentrations are converted into fluorescence intensities by the application of an instrument-dependent cross-talk matrix.
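
A toy forward simulation in the spirit of this model (the Gaussian point-spread function, per-dye shifts, and symmetric cross-talk matrix below are our own simplifying choices, not the fitted model):

    import numpy as np

    def simulate_traces(spike_times, psf_width=4.0, shifts=(0, 1, 2, 3),
                        crosstalk=None, length=400):
        """Toy forward model: for each of the four dyes, place unit spikes at the
        given scan positions, blur with a Gaussian point-spread function, apply a
        per-dye mobility shift, then mix the four dye signals through a cross-talk
        matrix to obtain the observed fluorescence intensities."""
        t = np.arange(length)
        dye_signals = np.zeros((4, length))
        for d in range(4):
            for s in spike_times[d]:
                dye_signals[d] += np.exp(-0.5 * ((t - s - shifts[d]) / psf_width) ** 2)
        if crosstalk is None:
            crosstalk = np.eye(4) + 0.15 * (np.ones((4, 4)) - np.eye(4))
        return crosstalk @ dye_signals        # shape (4, length): observed channels

    traces = simulate_traces({0: [50, 170], 1: [90], 2: [130, 210], 3: [250]})
    print(traces.shape)   # (4, 400)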

Signals simulated according to this model exhibit many, but certainly not all of the features found in real sequencing traces. We hope to have captured the important ones. Our base-calling strategy is to invert the process described in the model. Specifically, we combine algorithms for color separation, mobility adjustment and deconvolution with a hidden Markov model decoder. The output of this analysis will be a "best" estimate of the sequence of bases that gave rise to the data. In addition, for each base called, the analysis will provide the marginal probability of any alternative base call at that position.

We will report our progress in the design of the hidden Markov model, and the deconvolution and color-separation steps.


An Expert System for Base Calling in Four-Color DNA Sequencing by Capillary and Slab Gel Electrophoresis

Arthur W. Miller and Barry L. Karger
Barnett Institute, Northeastern University, Boston, MA 02115
miller@ccs.neu.edu

An important consideration in fluorescent DNA sequencing strategies is to extract as much information as possible from each run. In previous work in which we described the sequencing of more than 1000 bases per run by capillary electrophoresis (Carrilho et al., Anal. Chem. 1996, 68, 3305-3313), one characteristic contributing to the extended read length was the use of sophisticated base calling algorithms. We have recently developed a new base calling method that shows promise to increase read lengths for both capillary and slab gel (e.g., ABI) electrophoresis by decreasing the number of errors at long migration times. For example, errors between 800 and 1100 bases in the above cited paper are reduced by 40% relative to the graph-theoretic method employed at that time. Accuracy in this late region is better by the new procedure than by any other one we have tested to date. Data processing with the new system begins by determining the dye spectra from the raw data file in order to perform color separation. This step is followed by baseline subtraction, and bases are subsequently assigned in a moving time window, roughly ten bases wide. Using a set of empirical rules, each channel in the window is divided into the smallest sections likely to contain at least one call, and then a second set of rules is applied to determine the final calls. Confidences on the base assignments are supplied by statistical correlation of the variables used in the rules, such as peak height and width, with errors made on known sequence. The system has been applied to four-dye separations using four-channel or full-spectral detection. Results on a large body of data will be shown, along with comparisons to several widely used base callers.

This work is being supported by D. O. E. grant #DE-FG02-90ER60895.


Web-Based Tools for the Analysis and Display of DNA Trace Data

Judith D. Cohn, Mark O. Mundt, A. Christine Munk, Larry L. Deaven and Darrell O. Ricke
Los Alamos National Laboratory, Los Alamos, New Mexico 87545
cohn@lanl.gov

The Human Genome Project worldwide is currently in transition towards a new era of vastly increased production of DNA sequences. Anticipating the requirements of increased sequence production, the Center for Human Genome Studies (CHGS) at Los Alamos began work in 1996 on a new suite of integrated, web-based tools for processing and evaluating sequence data. Major development goals were: 1) to limit the need for human intervention in the analysis process; 2) to design highly modular software, which could run on multiple platforms while accessing a single, centralized (though possibly distributed) database; and 3) to build user-friendly GUI applications. In order to meet these goals, we chose the Java programming language as our primary development environment. Currently, a number of pieces are in everyday use and will be described here as well as in other posters/presentations by our Informatics Team.

Working applications for the analysis and display of DNA trace data from automated DNA sequencing instruments include Trace Viewer, Base Caller, Sequence Trimmer and Production Statistics. While these applications have been designed to work with data from a variety of sources, they have been implemented thus far for use with data from ABI sequencing instruments stored in a central flat-file database. We are in the process of moving the data to an Informix object-relational database.

The Trace Viewer displays trace data from a single sample file at multiple resolutions and includes such features as semantic zooming and coordinated scrolling. At present, it is possible to view ABI, phred and our own base calls along with associated quality numbers and output from the Sequence Trimmer.

The CHGS Base Caller was designed to replace our use of the ABI base caller. At the time, we did not have access to the phred base caller. Designing our own base caller, however, opened up the possibility of achieving additional goals: 1) more accurate base calling under a variety of conditions (e.g., dye-primer versus dye-terminator chemistry, new dye sets, new polymerases, new sequencing instrumentation, etc.); 2) a robust set of quality numbers at each base position, which can be used to fine-tune analysis further down the pipeline, e.g., assembly or primer picking. To date these quality numbers include an overall quality assessment (similar to phred), a quality number for each base, and a gap statistic.

The Sequence Trimmer performs two automated functions, which together define a "clear" region for each trace sequence. The first function is to designate a "high quality" region. This is determined using the overall quality scores produced by our base caller and can be modified by manipulating one or more parameters. The second function is to trim vector sequence. Vector trimming has been optimized to recognize even very poor quality sequence if it appears at the appropriate position in the sequence. The vector trimming function is also used to confirm the status of gel control sequences.
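
A minimal sketch of the quality-based half of such trimming, assuming per-base quality values like those described above are available (the window size and threshold stand in for the adjustable parameters; vector trimming is not shown):

    def clear_region(qualities, window=20, min_mean=20.0):
        """Return (start, end) of the longest stretch covered by windows whose
        mean per-base quality meets min_mean; (0, 0) if no window qualifies."""
        good = [False] * len(qualities)
        for i in range(len(qualities) - window + 1):
            if sum(qualities[i:i + window]) / float(window) >= min_mean:
                for j in range(i, i + window):
                    good[j] = True
        best, start = (0, 0), None
        for i, flag in enumerate(good + [False]):
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                if i - start > best[1] - best[0]:
                    best = (start, i)
                start = None
        return best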


Joint Genome Institute's (JGI) Informatics Plans and Needs

Darrell O. Ricke, Tom Slezak, Sam Pitluck, and Elbert Branscomb
Joint Genome Institute
ricke@telomere.lanl.gov

The DOE-funded human genome centers at LANL, LBNL, and LLNL are joining together to form the DOE Joint Genome Institute (JGI). Under the JGI, the informatics teams at LANL, LBNL, and LLNL are working together to focus on meeting JGI needs. The JGI informatics needs are considerable. JGI projects include scaling up both shotgun and transposon-based sequencing, scaling up sequence-ready physical map production, new functional genomics projects, systems and data integration, and building software systems for the new JGI Production Sequencing Facility (PSF). To illustrate this complexity, functional genomics projects will generate annotation information for JGI-sequenced regions of the human genome in the form of (1) human cDNA sequences, (2) mouse cDNA sequences, (3) sequences for targeted syntenic mouse genomic regions, and (4) gene expression data. To manage data and software distributed across four sites, integrated databases and WWW software projects are being designed and planned. While these integrated systems are being put into place, existing software and systems will continue to be used and in some instances enhanced to meet production scale-up needs. A summary of plans, status of major projects, and informatics needs will be presented.

See URL: http://www.jgi.doe.gov


Restriction Map Display on the World Wide Web

Mark C. Wagner, Thomas R. Slezak, Arthur Kobayashi, David J. Ow, Linda K. Ashworth, Laurie A. Gordon, Anne S. Olsen, Anthony V. Carrano
Joint Genome Institute, Lawrence Livermore National Laboratory, Livermore, CA 94550
wagner5@llnl.gov
http://www.jgi.doe.gov

The amount of data generated by a Genome Center precludes anything but a graphical interface to the database. The graphical user interface used by the Human Genome Project at Lawrence Livermore National Laboratory predated the World Wide Web, being written for an environment consisting solely of Unix workstations, and was sufficient to meet the needs of our local researchers and more sophisticated collaborators.

However, current collaborative agreements require immediate public release of data, and this has necessitated a shift from supporting a specific hardware platform to serving a much wider audience. The advent of Web-based technologies (particularly Java) has made it possible to write interfaces to our database for public use, without the necessity of dealing with software upgrades and hardware dependency problems. We have written a WWW version of the Restriction Map display of our Genome Browser software which will enable access to our database from anywhere on the Internet, and from any type of system supporting Web browsers.

This software will serve the clone resources task of the Joint Genome Institute (JGI) in addition to the Genome Center at LLNL. It is our intention that this software will become a prototype for a larger package that will include other types of displays for data generated by the JGI.

This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.


The BCM Search Launcher -- Providing Enhanced Sequence Analysis Search Services

Kim C. Worley, Pamela A. Culpepper, Daniel B. Davison
Department of Molecular and Human Genetics and Department of Cell Biology, Baylor College of Medicine, Houston, TX
kworley@bcm.tmc.edu

We provide Genome Program investigators access to a variety of enhanced sequence analysis search tools via the BCM Search Launcher. The BCM Search Launcher is an enhanced, integrated, and easy-to-use interface that organizes sequence analysis servers on the WWW by function, and provides a single point of entry for related searches. This organization makes it easier for individual researchers to access a wide variety of sequence analysis tools. The Search Launcher extends the functionality of other WWW services by adding hypertext links that provide easy access to Medline abstracts, links to related sequences, and other information which can be extremely helpful when analyzing database search results.

For frequent users of sequence analysis services, the BCM Search Launcher Batch Client provides access to all of the searches available from the BCM Search Launcher web pages in a convenient drag-and-drop (on the Macintosh) or command line (Unix) interface. The BCM Search Launcher Batch Client is a Unix and Macintosh application that automatically 1) reads in sequences from one or more input files, 2) runs a specified search in the background for each sequence, and 3) stores each of the search output files as individual documents directly on a user's system. The HTML-formatted result files can be browsed at any later date, or retrieved sequences can be used directly in further sequence analysis. For users who wish to perform a particular search on a number of sequences at a time, the batch client provides complete access to the Search Launcher with the convenience of batch submission and background operation, greatly simplifying and expediting the search process.
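
A minimal sketch of such a batch client in Python (the endpoint URL, parameter names, and output layout are placeholders, not the real BCM Search Launcher interface):

    import urllib.parse
    import urllib.request
    from pathlib import Path

    SEARCH_URL = "http://search.example.org/cgi-bin/launch"   # placeholder endpoint

    def read_fasta(path):
        """Yield (name, sequence) pairs from a FASTA-formatted file."""
        name, parts = None, []
        for line in Path(path).read_text().splitlines():
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(parts)
                name, parts = (line[1:].split() or ["unnamed"])[0], []
            elif line.strip():
                parts.append(line.strip())
        if name is not None:
            yield name, "".join(parts)

    def run_batch(fasta_file, search="blastp", out_dir="results"):
        """Submit each sequence to the search service and save the HTML reply."""
        Path(out_dir).mkdir(exist_ok=True)
        for name, seq in read_fasta(fasta_file):
            data = urllib.parse.urlencode({"search": search, "sequence": seq}).encode()
            with urllib.request.urlopen(SEARCH_URL, data=data) as reply:
                Path(out_dir, name + ".html").write_bytes(reply.read())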

One of the tools unique to the Search Launcher is BEAUTY, our Blast Enhanced Alignment Utility. BEAUTY makes it much easier to identify weak, but functionally significant matches in BLAST protein database searches. BEAUTY generates an alignment display showing the relative locations of annotated domains and the local BLAST hits in each matched sequence, greatly facilitating the analysis of BLAST search results. Recent improvements make BEAUTY searches available for DNA queries (BEAUTY-X) and for gapped alignment searches (using WU-BLAST2). In addition, new releases of the Annotated Domains database used with BEAUTY are produced for each full GenBank release. These up-to-date versions of the database present annotation information for many more sequences than previous editions. From the Search Launcher, users can submit sequences to the NCBI's BLAST network server to search the non-redundant, daily-updated database, and have their search results returned with BEAUTY displays added.

Our future development will focus on the analysis of large scale genomic sequences to support the efforts of the Genome Annotation Collaboratory.

This research is supported by a grant from the U.S. Department of Energy Office of Health and Environmental Research (DE-FG03-95ER62097/ A000), and grants from the National Human Genome Research Institute, National Institutes of Health (1F32-HG00133-03, 1R01-HG00973-03).


SubmitData Data Submission Framework

David Demirjian, Sushil Nachnani, Manfred Zorn
Lawrence Berkeley National Laboratory
sknachnani@lbl.gov

SubmitData is a data translation and submission framework. It is being built to provide a common user representation to public genome databases having different internal formats. SubmitData parses the schema of a particular database and provides the user with a forms-like display to enter his or her data. The user can define a number of different types of variables to incorporate data files generated by another program, e.g. Excel. SubmitData will replace the variables with actual fields read in from these data files when submitting a transaction. Thus the user can define a template consisting of fields having a constant value along with fields with variables defined as their value. This template can then be used to process a number of data files over a period of time. SubmitData also takes care of translating the transaction into a format expected by the specific database.
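
A minimal sketch of the template-plus-variables idea (the field and variable names, and the CSV input, are hypothetical):

    import csv

    def expand_template(template, rows):
        """One transaction per data row: constant fields pass through unchanged,
        and values written as "$name" are replaced from the row's columns."""
        for row in rows:
            yield {field: (row[value[1:]] if str(value).startswith("$") else value)
                   for field, value in template.items()}

    # hypothetical template: two constant fields and two variables
    template = {"submitter": "LBNL", "organism": "Homo sapiens",
                "clone_name": "$clone", "insert_size": "$size"}

    with open("clones.csv", newline="") as handle:        # e.g. a table exported from Excel
        for transaction in expand_template(template, csv.DictReader(handle)):
            print(transaction)                            # a printer class would reformat this for the target database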

The framework, implemented initially in Smalltalk and subsequently in the Java programming language, contains five main categories of classes: data representation, user interface, parser/builder, printer and batch processing. The data representation objects are the unchanging internal common representation of objects and fields in the various databases. The user interface objects serve as the views of the data objects in a forms-like display. The parser/builder classes are responsible for parsing the schemas of public databases and building the definitions for the common data objects. The printer set of classes translates the internal data object representation into a format required by a particular public database for submitting a transaction. The batch process objects allow the user to define different types of variables and incorporate data files for submitting batch transactions.


Graphical Ad hoc Query Interface for Federated Genome Databases

Dong-Guk Shin,1 Lung-Yung Chu,1 Wally Grajewski,1 Joseph Leone,2 Thomas Barnes,2 and Rich Landers2
1 Computer Science & Eng., University of Connecticut, Storrs, CT 06269-3155
2 CyberConnect EZ, LLC, Storrs, CT 06268
shin@cse.uconn.edu

We have been developing the Graphical SQL Query Editor capable of aiding genome scientists in learning and/or examining third-party database schemas in a relatively short time and assisting them in rapidly producing correct SQL queries. Specifically, our goal has been to allow a user to form an SQL query within a 5 - 10 minute time frame despite a lack of familiarity with the schemas of the public federated genome databases. Using the SQL Editor, genome scientists can construct queries targeting not only a single database but also distributed queries targeting multiple databases. Currently the SQL Editor interlinks GDB, GSDB, EGAD and SST at TIGR, MGD at Jackson Laboratory, and CHR 12, the chromosome 12 database at Yale University.

The SQL Editor is a client program written in Java which can be downloaded as an applet via the Internet or an Intranet. Database schema information is imported by clicking on the button representing a given database. For example, clicking on the EGAD button causes the EGAD table list to be imported, on-the-fly, from the remote server at TIGR. Similarly, the list of attributes for each table can also be imported, on-the-fly, with a single mouse click. Users express SQL queries by browsing imported database schema information, choosing the items of interest, and graphically specifying SQL selection clauses, restriction conditions, and join links. These join links are capable of linking tables in different databases, and are the means of constructing distributed queries.

One unique feature of this interface is the SchemaViewer. Through the SchemaViewer, the system can suggest to the user semantically correct ways of joining tables. This feature, which we call join path discovery, can even discover join paths that cross database boundaries (i.e. distributed join paths). Join path discovery is supported by a meta-data repository, which is constructed for each participating genome database in the federation. SchemaViewer's Selection, Zooming and Abstraction features allow the user to browse and choose among multiple system-discovered join paths. Once a join path is chosen, it can be exported to the main SQL Editor window for refinement into a complete query. The SQL Editor also includes other features designed to enhance the user's query formulation.
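
As a rough illustration of join path discovery (not the actual SchemaViewer implementation), the following Python sketch searches a small, hypothetical meta-data repository of known join links, including links that cross database boundaries:

    from collections import deque

    # hypothetical meta-data: each entry records a known join between two
    # tables, written as "database.table", together with the join attribute
    JOINS = {
        ("GDB.locus", "GDB.map_element"): "locus_id",
        ("GDB.map_element", "GSDB.sequence"): "accession",
        ("GSDB.sequence", "EGAD.transcript"): "sequence_id",
    }

    def join_paths(start, goal):
        """Breadth-first search for join paths, distributed ones included."""
        graph = {}
        for (a, b), key in JOINS.items():
            graph.setdefault(a, []).append(b)
            graph.setdefault(b, []).append(a)
        queue = deque([[start]])
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                yield path
                continue
            for nxt in graph.get(path[-1], []):
                if nxt not in path:
                    queue.append(path + [nxt])

    for path in join_paths("GDB.locus", "EGAD.transcript"):
        print(" -> ".join(path))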

Acknowledgements

(1) The author's work was supported in part by the NIH/NHGRI Grant No. HG00772-03.

(2) The author's work was supported in part by the DOE SBIR Phase II Grant No. DE-FG02-95ER81906.


Towards a Comprehensive Conceptual Consensus of the Expressed Human Genome: A Novel Error Analytical Approach to EST Consensus Databases

Robert Miller, John Burke, Alan Christoffels and Winston Hide
South African National Bioinformatics Institute. The University of the Western Cape, Private Bag X17, Cape Town, South Africa
winhide@sanbi.ac.za
http://www.sanbi.ac.za

Human gene data is currently mostly available in fragments in the form of expressed sequence tags (ESTs) and only a relatively minor fraction of all human genes are completely sequenced. The fragmentary but redundant nature of ESTs provides a large but complex and error-prone resource for gene discovery. We have developed a novel set of highly portable tools to manufacture and process a database of publicly available assembled consensi of Human ESTs and alignments. The database represents an easily distributable, core information resource upon which a comprehensive knowledgebase can be built. We maximise the coverage of accurate high quality consensus sequences of human genes, perform error compensation and analysis, and provide a measure of error so that we can derive the best possible estimate of the makeup of the human gene set from the data as it becomes available. The system has been designed to derive extended consensus representation of the expressed human genome.

The database differs markedly from indices such as TIGR Gene Index (1), and also databases of clusters of ESTs such as UniGene (2), because it does not discard noisy information. Instead, all possible information is used for clustering using d2-cluster (Burke, Davison, Hide, in prep), a global word based clustering methodology. The "dirty" information is carefully checked for useful constituent subsequences. As a result, extended gene consensi are manufactured containing both high quality and poor quality regions; these are duly annotated. The database has relational access to annotated alignments via the Genome Sequence Database (3). Alignment of the clusters can be performed on a system with a specific consensus builder, using a combination of two error analysis systems: DRAW and CONTIGPROC. Each entry contains all fragments and isoforms of the gene in serial association, separated by spacers. Thus the maximum possible consensus for each gene exists, providing a useful reagent for functional analysis. Comparison of gene sequence, aligned clusters and "alternative splice" frequencies from STACK now allows a more comprehensive understanding of the nature of expressed genes to be performed. We are in the process of discovering what is artifact, and what is genome biology.
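
As a rough illustration of word-based clustering in the spirit of d2-cluster (the real algorithm, scoring, and thresholds differ), the following Python sketch scores pairs of ESTs by differences in their k-word counts and then groups them by single linkage:

    from itertools import combinations

    def word_counts(seq, k=6):
        counts = {}
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            counts[word] = counts.get(word, 0) + 1
        return counts

    def d2_score(a, b, k=6):
        """Sum of squared differences of k-word counts (a simple d2-style distance)."""
        ca, cb = word_counts(a, k), word_counts(b, k)
        return sum((ca.get(w, 0) - cb.get(w, 0)) ** 2 for w in set(ca) | set(cb))

    def cluster(seqs, threshold):
        """Single-linkage clustering via union-find: sequences whose score
        falls below the threshold end up in the same cluster."""
        parent = list(range(len(seqs)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j in combinations(range(len(seqs)), 2):
            if d2_score(seqs[i], seqs[j]) < threshold:
                parent[find(i)] = find(j)
        groups = {}
        for i in range(len(seqs)):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())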

References

(1) http://www.tigr.org/tdb/tgi/hgi/

(2) http://www.ncbi.nlm.nih.gov/UniGene/index.html

(3) http://www.ncgr.org/


Query and Display of Comprehensive Maps in GDB

Stanley Letovsky, Robert Cottingham and GDB Staff
Genome Database, Johns Hopkins University, Baltimore MD
letovsky@gdb.org

GDB, the human Genome Database (http://www.gdb.org), stores whole chromosome and regional maps of the human genome from a variety of sources. Mapping methodologies represented in the database include linkage, radiation hybrid, content contig, and various forms of cytogenetic mapping. An important class of queries against GDB looks for markers in some region of interest, possibly combined with additional restrictions on the types of markers or their properties. It is useful for such positional queries to be able to search all maps, regardless of type, source, or whether or not they happen to contain the markers used to specify the region of interest. We do this by combining the various maps of each chromosome into a single comprehensive map; positional queries are then expressed against the coordinates of the comprehensive map.

The comprehensive maps are generated by a novel integration algorithm that constructs nonlinear warpings of maps in order to bring common markers into correspondence. The comprehensive maps produced by this algorithm are a significant improvement over the linear deformation method used previously for this purpose. The improvement translates into an increased accuracy of positional querying.
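
A minimal sketch of the warping idea (not GDB's actual integration algorithm): coordinates from a source map are interpolated piecewise-linearly between markers it shares with the comprehensive map; the anchor values below are hypothetical.

    def warp(position, anchors):
        """Map a source-map coordinate onto the comprehensive map.

        anchors: (source_pos, comprehensive_pos) pairs for shared markers,
        sorted by source position; between anchors the mapping is linear."""
        (x0, y0), (xn, yn) = anchors[0], anchors[-1]
        if position <= x0:
            return y0 + (position - x0)
        for (xa, ya), (xb, yb) in zip(anchors, anchors[1:]):
            if xa <= position <= xb:
                t = (position - xa) / float(xb - xa)
                return ya + t * (yb - ya)
        return yn + (position - xn)

    anchors = [(0.0, 0.0), (10.0, 14.0), (25.0, 30.0)]    # shared markers
    print(warp(17.5, anchors))                            # lands at 22.0 on the comprehensive map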

The comprehensive maps can also be viewed using our Mapview display program, which is accessible from the Web. A recent software enhancement allows the results of queries on any type of mapped object (gene, clone, amplimer, etc.) to be displayed as dynamically generated maps, as well as in tabular form. These map displays are based on the comprehensive maps. Queries which can be expressed in this fashion include "find genes in region", "find sequenced clones in region", "find polymorphic amplimers in region", and so on. As the content of GDB is extended in various areas, it becomes possible to ask other queries using this same basic capability. For example, we are currently adding the capability to store gene function and expression data; when this is in place it will be possible to ask for a map of genes that are highly expressed in a given tissue, or genes that have a certain function.


GSDB - New Capabilities and Unique Data Sets

A. Farmer, C. Harger, S. Hoisie, P. Hraber, D. Kiphart, L. Krakowski, M. McLeod, J. Schwertfeger, A. Siepel, G. Singh, M. Skupski, D. Stamper, P. Steadman, N. Thayer, R. Thompson, P. Wargo, M. Waugh, J.J. Zhuang, and P.A. Schad

The Genome Sequence DataBase (GSDB), at the National Center for Genome Resources (NCGR), is moving forward to provide the research community with new data access capabilities and unique data sets. GSDB has improved sequence data access by creating a web-based program (Excerpt) that allows researchers to extract desired portions of sequences from the database; designing and implementing a GSDB flatfile that displays all of the data which can be represented in text format; and improving a web-based query tool (Maestro). In addition, major improvements have been made to the software suite that imports data from the IC databases into GSDB.

GSDB has created unique data sets by constructing alignments to high-profile sequences (such as the complete E. coli genome with over 5000 alignments to other E. coli sequences in the database); and by constructing and maintaining discontiguous sequences that represent sequence-based chromosome maps (such as human chromosome X, which comprises over 1400 sequence markers and several larger genomic sequences). In addition, there are ongoing projects to improve the quality of data within the database, such as the identification and subsequent removal of vector contamination from the 5' and 3' ends of sequences.

These tools and data are available to the public from the GSDB web site (http://www.ncgr.org/).

GSDB has been gaining momentum over the past six months and will continue to develop new and innovative ways to access data in the database. GSDB will also continue to create unique data sets to provide to the public. Planned enhancements include the development of a web-based sequence viewer to replace Annotator, improvements to Excerpt and Maestro, collaborations with JGI and ORNL, and the incorporation of unique data sets such as SANBI's STACK data.


Database Transformations for Biological Applications*

G. Christian Overton, Susan B. Davidson, Peter Buneman
Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
Tel: (215) 898-3490; Fax (215) 898-0587
{susan,peter,coverton}@central.cis.upenn.edu

The overall goal of the Kleisli project is to develop a suite of tools to perform data source transformation and integration. The central components under development are the high-level query language and system, CPL/Kleisli, a general schema description language and a transformation constraint language, TSL. (Development of TSL is funded through an NSF grant.) The two main tools--Morphase and CPL--complement one another; Morphase is a heavier-weight system designed to transform an entire database or dataset according to user specifications and constraints written in TSL, a language which allows one to naturally express a broad class of such operations. Conversely, CPL is of most use precisely when there are multiple distributed source databases and transforming each in its entirety to a uniform representation would be infeasible. In such situations, CPL permits complex queries over the distributed heterogeneous data sources, performing integration "on the fly", while simultaneously allowing powerful structural transformations to be done. CPL's extensible query optimizer ensures that this all takes place in a timely fashion.

Together, the tools can be used either to perform data integration by providing dynamic user-defined views, or to create specialized data warehouses. This layer can then be used underneath data mining, OLAP or other decision making tools.
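
As a rough analogy to this on-the-fly integration (illustrative Python, not CPL syntax; both source wrappers are hypothetical), a query that joins two remote sources without materializing either might look like this:

    # hypothetical wrappers for two remote sources
    def gdb_markers(chromosome):
        return [{"marker": "D22S264", "band": "q12", "accession": "U56789"}]

    def gsdb_lengths(accessions):
        return {"U56789": 38000}

    def mapped_sequences(chromosome, min_length):
        """Integrate "on the fly": join markers from one source with sequence
        lengths from another, keeping only sufficiently long entries."""
        markers = gdb_markers(chromosome)
        lengths = gsdb_lengths([m["accession"] for m in markers])
        return [dict(m, length=lengths[m["accession"]])
                for m in markers
                if lengths.get(m["accession"], 0) >= min_length]

    print(mapped_sequences("22", 20000))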

Recent developments include:

  • Coupling CPL to OPM-Based Databases. CPL/Kleisli can now query against GDB 6.0's Object Broker Server (documented at http://www.gdb.org/ob/top.html) using the OPM query language. Having reimplemented our existing Web-based queries to take advantage of GDB 6.0's richer OPM-based schema, we envision as a next step embedding some or all of OPM's query translator in the CPL optimizer in a modular fashion. Since OPM queries are ultimately translated to one or more relational SQL queries by the OPM tools anyway, we can perform this translation at an earlier stage of the query process, producing SQL to which CPL can apply its maximal subquery migration capabilities, pushing to the servers operations which would otherwise have to be performed locally. Furthermore, opening up the query in this manner allows CPL to apply other RDBMS-specific techniques, such as semijoins, essentially leveraging the work already done on relational multidatabase optimization to OPM queries.
  • Complex Object Libraries. As a first step towards migrating CPL onto more ubiquitous and widely-accepted language platforms we have implemented a prototype set of complex object manipulation libraries in each of the C++, Perl, and Java(TM) languages. These libraries allow application programs written in any of the target languages to dispatch queries to a central CPL server (in the first stage of migration the server will still run under ML) and receive query results in an abstract representation appropriate to the specific host language, with an API which is uniform across all of them. The Perl and C++ complex object libraries are currently being used in an automated annotation testbed, GAIA (http://agave.humgen.upenn.edu/gaia/). This approach allows CPL queries to be seamlessly integrated as part of the annotation process.
  • Local Storage Management. The customizable optimizer of CPL employs a multi-pass source-to-source rewriting strategy to attempt to minimize both the response time and intermediate storage consumption of queries. However, to be able to efficiently store and retrieve ever-larger intermediate query results, preferably along with one or more indices, we are currently exploring the use of a more sophisticated local backing store using either the freely-available SHORE system, the relational Sybase database system, or perhaps OPM.
  • User Interfaces. As a first step towards developing a simple and powerful user interface to CPL, we have created a number of Web-based stereotypic queries. To date, we have focused on the implementation and optimization of queries which were chosen to test system performance in formulating and executing distributed queries while providing functionality to a growing user community. Eight major classes of queries are available, with new ones being added by request from the user community. These queries, shown below with the databases accessed indicated in parentheses, demonstrate how Kleisli can be used to integrate data stored in disparate formats and physical locations. More details on the queries can be found at the Kleisli homepage, http://agave.humgen.upenn.edu/cpl/cplhome.html
  • "Complete Genome" query, e.g., "Return all complete mitochondrial genomes larger than 20kb." (GSDB.)
  • EST Location query, e.g., "Find the location of a mapped EST." (dbEST or GenBank or GSDB and GDB.)
  • Gene/Location query; e.g., "Find protein kinase genes on human chromosome 4." (GDB.)
  • Sequence/Size query; e.g., "Find mapped sequences longer than 100,000 base pairs on human chromosome 17." (GDB and GSDB.)
  • Mapped EST query; e.g., "Find ESTs mapped to chromosome 4 between q21.1 and q21.2." (GDB and GSDB.)
  • Primate alu query; e.g., "Find primate genomic sequences with alu elements located inside a gene domain." (BLAST and GSDB.)
  • BLAST Sequence/Feature query; e.g., "Find sequence entries with homologs of my sequence inside an mRNA region." (BLAST and GSDB.)
  • Human genome map search; e.g., "Find human sequence entries on human chromosome 22 overlapping q12." (GDB, GSDB and ASN.1 GenBank.)
*Supported by a grant from the Department of Energy, 94ER61923.

Recent Kleisli References


"Querying an Object-Oriented Database Using CPL," S.B. Davidson, C. Hara and L. Popa. Proceedings of the Brazilian Symposium on Databases (October 1997).

"WOL: A Language for Database Transformations and Constraints," S.B. Davidson and A. Kosky. Proceedings of the International Conference of Data Engineering, April 1997 (Glasgow, Scotland).

"BioKleisli: A Digital Library for Biomedical Researchers," S.B. Davidson, C. Overton, V. Tannen and L. Wong. Journal of Digital Libraries 1:1 (November 1996).

"A Data Transformation System for Biological Data Sources," P. Buneman, S.B. Davidson, K. Hart, C. Overton and L. Wong. Proceedings of VLDB, Sept. 1995 (Zurich, Switzerland).

"Challenges in Integrating Biological Data Sources," S.B. Davidson, C. Overton and P. Buneman. J. Computational Biology 2 (1995), pp 557-572.

"Semantics of Database Transformations," A. Kosky, S.B. Davidson and P. Buneman. Semantics of Databases, edited by L. Libkin and B. Thalheim.

"Transforming Databases with Recursive Data Structures," A. Kosky. PhD Thesis, December 1995.


Exploring Heterogeneous Biological Databases with the OPM Multidatabase Tools

Victor M. Markowitz, I-Min A. Chen, Anthony Kosky, Ernest Szeto
Lawrence Berkeley National Laboratory, Berkeley, CA 94720

The Object-Protocol Model (OPM) data management tools provide support for rapid development, documentation, and flexible exploration of scientific databases. Several archival molecular biology databases have been designed and implemented using the OPM tools, including the Genome Database (GDB) and the Resource Center Primary Database (RZPD) of the German Human Genome Project, while other databases, such as the Genome Sequence Database (GSDB) and NCBI's GenBank, have been retrofitted with semantically enhanced views using the OPM tools.

The multidatabase OPM tools provide powerful facilities for: (1) assembling (federating) heterogeneous databases into a multidatabase system, while documenting their structure and inter-database links; (2) processing ad-hoc multidatabase queries via uniform OPM interfaces; and (3) assisting scientists in specifying and interpreting multidatabase queries. Incorporating a database into a multidatabase system involves constructing one or more OPM views of the database and entering information about the database and its views into a Multidatabase Directory. This Directory records information necessary for accessing and formulating queries over the component databases, including: general information required for accessing the database; structural information on the schemas of each database; and information on known links between databases, including semantic descriptions of the links, and data manipulations necessary in order to traverse such links.

Queries over an OPM-based multidatabase system are expressed in an extension of the OPM query language that includes additional constructs necessary for accessing multiple databases. Multidatabase queries are processed by generating queries over individual databases, and combining the results using a local query processor. A Java graphical multidatabase query construction tool provides support for dynamic construction and automatic generation of Web query forms that can then either be used for further specifying conditions, or can be saved and customized. The OPM multidatabase tools have been applied to several projects, including the construction of a federation of molecular biology databases involving GDB, GSDB, and GenBank, and are currently used for setting up a federation of biocollections databases and a federation of several databases involved in the German Human Genome Project.
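
As a rough illustration of this processing scheme (not the actual OPM implementation), the following Python sketch fans a multidatabase query out to per-database connectors and combines the partial results with a local join; all names, queries, and rows are hypothetical:

    def run_multidatabase_query(subqueries, connectors, combine):
        """Evaluate each single-database subquery through its connector,
        then combine the partial results with a local query processor."""
        partial = {db: connectors[db](query) for db, query in subqueries.items()}
        return combine(partial)

    # hypothetical connectors returning rows as lists of dicts
    def gdb_connector(query):
        return [{"locus": "D20S120", "accession": "U12345"}]

    def gsdb_connector(query):
        return [{"accession": "U12345", "length": 40211}]

    def join_on_accession(partial):
        by_acc = {row["accession"]: row for row in partial["GSDB"]}
        return [dict(row, **by_acc[row["accession"]])
                for row in partial["GDB"] if row["accession"] in by_acc]

    print(run_multidatabase_query(
        {"GDB": "select locus, accession from map_element",
         "GSDB": "select accession, length from sequence"},
        {"GDB": gdb_connector, "GSDB": gsdb_connector},
        join_on_accession))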

The main problem we are currently addressing consists of identifying, documenting, and formally defining a comprehensive set of relevant inter-database links that drive multidatabase queries and applications.

This work is supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098.


The OPM Data Management Toolkits Anno 1997

Victor M. Markowitz, I-Min A. Chen, Anthony Kosky, Ernest Szeto, William Barber, Thodoros Topaloglou
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
VMMarkowitz@lbl.gov

The Object-Protocol Model (OPM) data management tools provide support for rapid development, documentation, and exploration of scientific databases. These tools are based on OPM, an object data model that is similar to the ODMG standard, but also has additional constructs for modeling scientific data. Databases designed in OPM can be implemented with commercial relational DBMSs, using the OPM Database Development Toolkit that includes OPM schema translators for generating complete DBMS database definitions from OPM schemas, a Java-based OPM schema editor and browser, tools for specifying and maintaining multiple OPM views, and OPM schema publishing tools for documenting OPM databases in a variety of formats and notations. OPM schemas can also be retrofitted on top of existing relational databases or structured files defined using notations such as the ASN.1 data exchange format. Native or retrofitted OPM databases can be queried using the OPM Database Query Toolkit that includes OPM query language translators that interpret queries expressed in the ODMG-compliant OPM query language (OPM-QL) and translate them into the languages supported by the underlying DBMS. A Web-based OPM query interface allows graphical construction of ad-hoc OPM queries and can be used for generating Web query forms.

An OPM Multidatabase Toolkit contains tools that support: (1) assembling heterogeneous databases into an OPM based multidatabase system, while documenting their schemas and inter-database links; (2) processing ad-hoc multidatabase queries via uniform OPM interfaces; and (3) assisting scientists in specifying and interpreting multidatabase queries.

Several archival molecular biology databases have been designed and implemented using the OPM tools, including the Genome Database (GDB) and the Resource Center Primary Database (RZPD) of the German Human Genome Project, while other biological databases, such as the Genome Sequence Database (GSDB) and NCBI's Genbank, have been retrofitted with semantically enhanced OPM views. The OPM multidatabase tools have been applied for constructing a molecular biology database federation and for providing a common interface on top of a variety of different biological databases.

Current OPM work includes further development of the OPM multidatabase tools and extending the OPM toolkit to support complex data types, such as DNA sequences and 3-dimensional crystallographic data, irrespective of the underlying DBMS facilities.

Examples, documentation, and papers regarding the OPM toolkits are available at http://gizmo.lbl.gov/opm.html.

This work is supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098.


Quality Control in Sequence Assembly Analysis

Mark O. Mundt, Judith D. Cohn, Tracy L. Ricke, P. Scott White, Larry L. Deaven and Darrell O. Ricke
Los Alamos National Laboratory, Life Sciences Division and Center for Human Genome Studies, Los Alamos, New Mexico 87545
mom@telomere.lanl.gov

As participants in the Human Genome Project produce more sequence data at higher rates, there is less chance that any given base pair will be manually edited by interactive means. Moreover, assembly decisions made by automatic programs must be scrutinized in a more efficient manner. To further complicate matters, natural variation among individual human sequences is at least five- to ten-fold higher than the desired quality standard for finished sequence. On average, we expect to see approximately one polymorphism in every 1000 base pairs, while the community's sequencing error standard is now projected at one in every 10,000 base pairs. Assessment of quality and variation must therefore be efficiently automated.

At the Center for Human Genome Studies in Los Alamos, we are using our own base caller and quality values to derive input for the TIGR assembly program. Qualitative comparisons are then made between individual cosmid clone assemblies and ones using overlapping clones. Phred and phrap results are also contrasted by means of algorithms operating on a novel, standard assembly format. Differences in assemblies are now automatically detected, and our future objectives would include using quantitative measures and adding repeat profiles to make decisions favoring the results which best meet the goals of the project. These tools along with graphical analysis displays are being integrated into a web-based Java system.

In addition to quality assessment, we are concurrently searching for single nucleotide polymorphisms (SNPs). Because CpG dinucleotides have a higher mutation rate relative to other sequences (17- to 25-fold higher), we are presently targeting CpG islands in our 7q telomeric sequence for resequencing. For this automated process, we have developed software that targets regions of high densities of CpG dinucleotides, and we use Whitehead Institute's Primer3 software to design PCR and sequencing primers. These are used to amplify and sequence genomic DNA templates from several diverse individuals in search of SNPs. Low quality regions of sequence are targeted in a similar manner, and the multiple assembly information is used to target weak joins.
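
A minimal sketch of the CpG-targeting step (window size, step, and density threshold are illustrative, not necessarily the production values); the regions returned would then be handed to Primer3 for primer design:

    def cpg_rich_regions(seq, window=200, step=50, min_density=0.06):
        """Windows whose CpG-dinucleotide density meets a threshold;
        overlapping qualifying windows are merged into one target region."""
        seq = seq.upper()
        regions = []
        for start in range(0, max(1, len(seq) - window + 1), step):
            chunk = seq[start:start + window]
            density = chunk.count("CG") / float(len(chunk))
            if density >= min_density:
                if regions and start <= regions[-1][1]:
                    regions[-1] = (regions[-1][0], start + window)
                else:
                    regions.append((start, start + window))
        return regions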

In conclusion, an automated system for designing PCR and resequencing primers, targeting specific regions to assess sequence quality and natural variation, is being implemented and optimized.


Automating the Detection of Human DNA Variations

Scott L. Taylor, Mark J. Rieder, and Deborah A. Nickerson
Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195
debnick@u.washington.edu

Fluorescence-based sequencing is playing an increasingly important role in efforts to identify DNA polymorphisms/mutations in genes of biological and medical interest. We have developed a computer program known as PolyPhred that automatically detects the presence of single nucleotide substitutions in their heterozygous state using fluorescence-based sequencing of PCR products. The operation of PolyPhred will be described as well as its integration with the Phred base-calling program, the Phrap assembly program, and the Consed viewing program. Additionally, we will illustrate how Consed can be leveraged to display a set of highly annotated reference sequences that greatly simplifies the analysis of DNA variations with respect to existing information on gene structure, PCR primers, and previously known DNA polymorphisms or mutations. Lastly, we will document the ease and speed of performing high quality and accurate fluorescence-based resequencing on long tracts of mitochondrial and nuclear DNA as well as the application of these new tools to automatically find and view DNA variations within these sequences.
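
A minimal sketch of the kind of evidence such heterozygote detection weighs at a single position (the thresholds are illustrative, not PolyPhred's):

    def candidate_heterozygote(primary_area, secondary_area, neighbour_mean,
                               drop=0.65, second=0.30):
        """Flag a position when the called peak has dropped relative to its
        neighbours and a second dye shows a substantial peak at the same spot."""
        return (primary_area < drop * neighbour_mean
                and secondary_area > second * neighbour_mean)

    print(candidate_heterozygote(primary_area=480, secondary_area=350, neighbour_mean=900))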


Fast and More Accurate Distance-Based Phylogenetic Construction

William J. Bruno, Aaron L. Halpern, Nicholas D. Socci
Los Alamos National Laboratory
billb@lanl.gov; http://www.t10.lanl.gov/billb/

Phylogenetic reconstruction is relevant to genomic analysis whenever profile methods are used to assess gene function, because accurate profile construction depends on historical relationships [Bruno, 1996]. The Neighbor-Joining algorithm of Saitou and Nei [1987] has the advantage of being fast enough for constructing a profile from hundreds of sequences, but it does not do the best job of constructing the correct tree, and in fact we show it to be biased.

Gascuel recently [1997] introduced the BIONJ method, which improves the Neighbor-Joining method by using a weighted average to compute the new distances. BIONJ is still biased, and performs exactly the same as Neighbor-Joining in the case of 4 taxa.

We introduce a new, weighted neighbor-joining method called Weighbor. This method uses weights that accurately reflect the exponential dependence of variances and covariances on distance. The weights are used both in determining which pair is joined and in computing new distances. As a result, the method is not biased, and gives better results than Neighbor-Joining, even in the case of 4 taxa.

The current implementation of Weighbor has a computational complexity of N^4, but an N^3 version, comparable in speed to Neighbor-Joining, is underway.
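
For concreteness, the classical pair-selection criterion of Saitou and Nei [1987] is sketched below; Weighbor replaces the unweighted sums in this criterion (and in the distance updates) with weights that model how the variances and covariances of distance estimates grow with distance. The sketch is illustrative only and is not the Weighbor code.

    import numpy as np

    def nj_pair(D):
        """Pair (i, j) that classical Neighbor-Joining would join next:
        the minimum of Q(i, j) = (n - 2) * D[i, j] - sum_k D[i, k] - sum_k D[j, k]."""
        n = D.shape[0]
        totals = D.sum(axis=1)
        best, best_q = None, float("inf")
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * D[i, j] - totals[i] - totals[j]
                if q < best_q:
                    best, best_q = (i, j), q
        return best

    D = np.array([[0.0, 5.0, 9.0, 9.0],
                  [5.0, 0.0, 10.0, 10.0],
                  [9.0, 10.0, 0.0, 8.0],
                  [9.0, 10.0, 8.0, 0.0]])
    print(nj_pair(D))    # joins taxa 0 and 1 first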

Bruno, W. J., "Modeling Residue Usage in Aligned Protein Sequences via Maximum Likelihood," Mol. Biol. Evol. 13:1368-75 (1996).

Gascuel, O., "BIONJ: an Improved Version of the NJ Algorithm Based on a Simple Model of Sequence Data," Mol. Biol. Evol. 14:685-95 (1997).

Saitou, N. and M. Nei, "The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees," Mol. Biol. Evol. 4:406-25 (1987).


Determining the Important Physical-Chemical Parameters Within Various Local Environments of Proteins

Jeffrey M. Koshi and William J. Bruno
Los Alamos National Laboratory, Theoretical Biology and Biophysics
jkoshi@lanl.gov

In this work the importance of various physical-chemical parameters of the amino acids is analyzed in a position-specific manner for various protein secondary structure and surface accessibility classes. This is done on the basis of a previously published maximum likelihood method that finds the optimal site-specific residue frequencies for a set of aligned homologs. These vectors representing the site-specific residue frequencies can be broken up into subsets based on position within various secondary structures and surface accessibility. An analysis of the eigenvectors (principal component analysis, PCA) of the resulting covariation matrix shows the most important physical-chemical parameters for a given subset of data. Preliminary results indicate the importance of hydrophobicity, agreeing with the conclusions of many other researchers, and we are investigating the importance of size, charge, and aromatic rings at specific positions within various secondary structures.
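
A minimal sketch of the eigenvector analysis, with random stand-in data in place of the actual maximum-likelihood frequency estimates:

    import numpy as np

    # rows: alignment positions in one structural/accessibility class;
    # columns: frequencies of the 20 amino acids (stand-in random data)
    freq = np.random.dirichlet(np.ones(20), size=200)

    centered = freq - freq.mean(axis=0)
    cov = np.cov(centered, rowvar=False)            # 20 x 20 covariation matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    leading = eigvecs[:, np.argmax(eigvals)]        # loadings of the first principal component

    # correlate the leading loadings with a physical-chemical scale, e.g. hydrophobicity
    hydrophobicity = np.random.normal(size=20)      # stand-in for a published scale
    print(np.corrcoef(leading, hydrophobicity)[0, 1])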

Work is also in progress to expand the model of evolution used in the above work. Rather than looking only at amino acid frequencies, we are implementing a method that uses a limited amino acid mutation matrix optimized for each position in a set of aligned homologs. This more realistic model of evolution is still computationally feasible and should improve the performance of the model in phylogenetic tree reconstruction, homolog detection, analysis of the importance of physical-chemical parameters, or any other application.


Improving Software Usability and Accessibility

Ryan Carroll and Ruth Ann Manning
ApoCom Inc., 1020 Commerce Park Drive, Suite F, Oak Ridge TN 37830-8026
carroll@apocom.com

Over recent years the number of software tools being produced to assist in research related to the Human Genome Project has increased rapidly. However, most of these tools are currently being under-utilized because either they are not available on a wide variety of computer platforms or they have an interface that cannot be learned without a significant time investment. An additional shortcoming that is inherent in most of the available sequence analysis tools is the use of opaque pattern analysis systems such as statistical, syntactic, Markovian, or artificial neural network methods. Although such systems can be very accurate, the reasoning methods they use cannot be described intuitively. This means that the user has very little information (typically only a single numerical score) to assist in appraising the value of system predictions. ApoCom is addressing these problems by implementing comprehensive Java and CORBA interfaces through which it will make accessible a large assortment of computational and database related tools. ApoCom is also developing a fuzzy logic-based gene hunting algorithm.


bioWidgets: Visualization Componentry for Genomics

Steve Fischer, Jonathan Crabtree, Mark Gibson and G. Christian Overton (PI)
Department of Genetics, University of Pennsylvania; Philadelphia, PA 19104
{sfischer,crabtree,gibson,coverton}@cbil.humgen.upenn.edu

bioWidgets is a software package for the rapid development and deployment of graphical user interfaces (GUIs) designed for the scientific visualization of molecular, cellular and genomics information. The overarching philosophy behind bioWidgets is componentry: that is, the creation of adaptable, reusable software, deployed in modules that are easily incorporated in a variety of applications, and in such a way as to promote interaction between those applications. This is in sharp distinction to the common practice of developing dedicated applications. The bioWidgets project additionally focuses on the development of specific applications based on bioWidget componentry, including displays for chromosomes, maps, and nucleic acid and peptide sequences.

The current set of bioWidgets has been implemented in Java with the goal in mind of delivering local applications and distributed applets via Intranet/Internet environments as required. The immediate focus is on developing interfaces for information stored in distributed, heterogeneous databases such as GDB, GSDB, Entrez, and ACeDB. The issues we are addressing are database access, reflecting database schemas in bioWidgets, and performance. We are also directing our efforts into creating a consortium of bioWidget developers and end-users. This organization will create standards for and encourage the development of bioWidget components. Primary participants in the consortium include Gerry Rubin (UC Berkeley), Nat Goodman (Jackson Labs), Stan Letovsky (GDB) and Tom Flores (EBI).

Current progress includes the development of an Inter-widget Communication package. This package consists of a set of Java(tm) classes and interfaces, and is based on the Java(tm) JDK 1.1 Delegation Event Model. It defines a set of inter-widget events that control: (1) mutual selection of elements in multiple widgets and (2) coordinated scrolling and zooming of multiple widgets. We have used this package in our implementation which integrates our genome viewer and our sequence viewer.

We have also developed an object-oriented data specification for the Sequence and Map widgets. This specification will more generally apply to widgets that display sequence and sequence annotation. With the increasing use of object-oriented databases, object brokers and standardized remote object formats (CORBA and Java's RMI), application developers using our bioWidgets will have the capability to serve data objects directly to the widgets. Our object-oriented specification is written using Java(tm) interfaces, and will likely migrate to CORBA IDL. The interfaces we define are as generic as possible, with the goal of allowing diverse applications to use them.

The Inter-widget communication model, component-based design and object oriented data format are intended in the near future to be used with the Java Beans(tm) technology. This will allow users of bioWidgets to incorporate the widgets into an application simply by interacting with an application builder tool rather than writing actual integration code.

The current implementation includes widgets which display Sequence, Map, Blast results, Chromosomes and Sequence alignments.

For more details, please see http://agave.humgen.upenn.edu/bioWidgetsJava/.

*Supported by a grant from the Department of Energy, 92ER61371.


Research on Data and Workflow Management for Laboratory Informatics

Nathan Goodman
The Jackson Laboratory
nat@jax.org

We have been pursuing a strategy for laboratory informatics based on three main ideas: (1) component-based systems; (2) workflow management; and (3) domain-specific data management. The workflow and data management software we have developed pursuant to this strategy are called LabFlow and LabBase, respectively. LabFlow provides an object-oriented framework for describing workflows, an engine for executing these, and a variety of tools for monitoring and controlling the executions. LabBase is implemented as middleware running on top of commercial relational database management systems (presently Sybase and Oracle). It provides a data definition language for succinctly defining laboratory databases, and operations for conveniently storing and retrieving data in such databases.

Both LabFlow and LabBase are implemented in Perl5 and are designed to be used conveniently by Perl programs. The total quantity of code is modest, comprising about 10,000 lines of Perl5. The software is freely available and redistributable, but be forewarned that this is research software and is incomplete in many ways.
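
A minimal sketch of the workflow idea in Python (LabFlow and LabBase themselves are Perl5 software; all names below are illustrative):

    class Step:
        """One laboratory processing step: a name plus an action that takes the
        current record (a dict) and returns the fields it adds or updates."""
        def __init__(self, name, action):
            self.name, self.action = name, action

    def run_workflow(steps, record):
        """Apply each step in order and keep a history of the steps executed,
        so that progress through the workflow can be monitored."""
        record = dict(record, history=[])
        for step in steps:
            record.update(step.action(record))
            record["history"].append(step.name)
        return record

    steps = [
        Step("base_call", lambda r: {"bases": "ACGTTTGA", "quality": [30, 32, 28, 31, 30, 29, 27, 25]}),
        Step("trim", lambda r: {"clear_range": (0, len(r["bases"]))}),
    ]
    print(run_workflow(steps, {"sample": "A01"}))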


Method of Differentiation Between Similar Protein Folds

I. Dubchak, I. Muchnik, S. Spengler, M. Zorn
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
ildubchak@lbl.gov

Predicting the protein fold and implied function for a target sequence whose structure is unknown is a problem of significant interest. The information derived from such a prediction is substantial, because members of a fold class share similar three-dimensional (3D) structures and functions. Prediction is especially complex when one particular fold must be distinguished from other highly similar three-dimensional folds.

For the development of the prediction technique we used a particular example: the separation of 4-helical cytokines (the fold of interest, FI) from similar folds. We applied a method based on global descriptors of a protein in terms of the biochemical and structural properties of the constituent amino acids [1]. Neural networks were used to combine these descriptors in a specific way to discriminate members of the FI. The following steps were necessary:

1. Defining the neighborhood of the FI, i.e. spatially close protein folds which are hard to distinguish from the FI.

2. Collecting the non-redundant set of proteins to represent a wide range of available members of protein folds in the context of the comprehensive Structural Classification of Proteins (SCOP).

3. Selecting several attributes which represent various groups of physico-chemical and structural properties.

4. Analyzing the prediction accuracy achieved by each parameter set in order to choose the most efficient sets for separation of the FI from a particular neighbor.

5. Voting among predictions made by different sets of parameters to achieve higher reliability of prediction.
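
A minimal sketch of the voting step (the parameter sets and classifier outputs shown are hypothetical):

    def vote(predictions):
        """Majority vote over per-parameter-set predictions for one protein.

        predictions: dict mapping a parameter set to True/False
        ("does this protein belong to the fold of interest?")."""
        yes = sum(1 for p in predictions.values() if p)
        return yes > len(predictions) / 2.0

    print(vote({"composition": True, "hydrophobicity": True, "polarity": False}))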

The developed procedure is simple and efficient. Further improvement of the prediction method depends greatly on the possibility of automating steps 1-3 and on the growth of existing databases.

1. I. Dubchak, I. Muchnik, S. R. Holbrook (1995). PNAS 92, 8700-8704.

2. Murzin, A. G., S. E. Brenner, T. Hubbard and C. Chothia. (1995). J. Molec. Biol. , 247: 536-540.


Embedding HMMs: A Method for Recognizing Protein Homologs in DNA

David Kulp, David Haussler
Baskin Center for Computer Engineering and Information Sciences, University of California, Santa Cruz, CA 95064
kulp@cse.ucsc.edu or haussler@cse.ucsc.edu

Gene-finding involves the development of algorithms for annotating anonymous DNA using statistical and database information. Our research is concerned with the application of hidden Markov models (HMMs) in gene-finding, particularly the use of HMMs for effectively modelling gene structure and DNA-to-protein homology. We present a gene-finding method in which a linear hidden Markov model (HMM) of a protein family is embedded in a generalized hidden Markov model (GHMM) of gene structure.

The trend in gene-finding strategies recently has been the inclusion of homology models as additional evidence in the identification of coding regions. The method we describe uses a linear HMM representing a protein family, but the emitting states of the model correspond to individual nucleic acids. In this way, mutations, insertions, and deletions between the coding DNA and the protein family can be modelled at both the nucleotide and amino acid level. In addition, by embedding a linear HMM within the GHMM we can identify the alignment of the DNA to the protein family across splice junctions. The resulting gene-finder is both robust to noise and sensitive to remote homologs.

The approach extends the successful development of the Genie system for gene identification (Reese). The Genie system uses a generalized hidden Markov model (GHMM) to represent the high-level structure of genes, i.e., individual states represent variable length regions, such as exons and introns. The transitions among the states correspond to gene structure. For example, probabilities are given for the likelihood of transitioning from an intron to a final exon or to an internal exon. For each state we establish a probabilistic state model, and these state models may be arbitrarily complex; for example, a state model of an intron may combine discriminant models for the flanking splice regions, a model for intronic repeats, and a model for non-coding DNA. Until now, these different state models were independent, but we describe a novel means of modelling dependencies between states.

For this work, we wish to represent the dependency of the coding nucleotides flanking an intron and to use this dependency to better predict the coding sequence. In essence, when we reach an intron in the current gene being analyzed, we want to note what position we might be in a protein that is homologous to the one that the current gene codes for, and then try to find the end of the intron by looking for the beginning of an exon that codes for subsequent amino acids similar to those in the homolog. One effective method for doing this is the "spliced alignment" of Gelfand, Mironov, and Pevzner (Gelfand, et al). Another is the "dynamite" system developed by Ewen Birney (Birney).

In our previous work, we used a simple scheme for this called "adjacency constraints" (Kulp et al). Here we propose using an HMM for the particular protein family that contains the gene being analyzed to carry information across an intron. Our approach is similar in spirit to that of Gelfand et al., but we use an HMM for an entire protein family rather than a single homolog, our alignment strategy differs, and we combine homolog identification with additional statistical information provided by the gene structure GHMM. The approach is similar to Birney's approach in the use of the protein HMM, but we combine the protein HMM with a full gene structure GHMM, rather than a small, single-letter-per-state HMM of the classical kind.

The linear HMM used to accommodate DNA-protein alignments has several interesting features. First, the traditional protein sequence alignment HMM (Krogh, 1994) is translated into a DNA model by replacing each insertion and match state with three match/insert/delete state triples. Each of the match/insert/delete triples represents a nucleotide in a codon. Second, the distributions over the four-letter nucleotide database use simple Markov chains to estimate the probabilities of each base. Thus, the second position in a codon uses a first-order Markov chain and the third position uses a second-order Markov chain, similar to the work of Krogh (Krogh, 1997a). Third, the complete state of the dynamic program at any position can be saved and recalled later, essentially allowing for arbitrary insertions of non-coding nucleotides without disrupting the alignment.
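
A minimal sketch of the position-specific emission idea (all probabilities are illustrative, not trained values): the first codon position uses unconditional base frequencies, the second is conditioned on the first base, and the third on the first two.

    P1 = {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2}    # zeroth-order, position 1
    P2 = {("A", "T"): 0.4}                           # P(base2 = T | base1 = A), etc.
    P3 = {("A", "T", "G"): 0.5}                      # P(base3 = G | base1 = A, base2 = T), etc.

    def codon_emission_prob(codon, default=0.25):
        """Probability of emitting one codon from a match-state triple."""
        b1, b2, b3 = codon
        return (P1.get(b1, default)
                * P2.get((b1, b2), default)
                * P3.get((b1, b2, b3), default))

    print(codon_emission_prob("ATG"))    # 0.3 * 0.4 * 0.5 = 0.06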

E. Birney. Pairwise and searchwise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucl. Acids Res., 24(14):2730-2739, 1996.

M. S. Gelfand, A. A. Mironov, and P. A. Pevzner. Gene recognition via spliced sequence alignment. PNAS, 93(17):9061-9066, 1996.

A. Krogh. Two methods for improving performance of an HMM and their application for gene finding. In T. Gaasterland, P. Karp, K. Karplus, C. Ouzounis, C. Sander, and A. Valencia, editors, Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, pages 179-186, Menlo Park, CA, 1997. AAAI Press.

A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235:1501-1531, February 1994.

D. Kulp, D. Haussler, M. Reese, and F. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA. In ISMB-96, St. Louis, June 1996. AAAI Press.

M. G. Reese, F. H. Eeckman, D. Kulp, and D. Haussler. Improved splice site detection in genie. In M. Waterman, editor, Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB), Santa Fe, New Mexico, 1997. ACM Press, New York.


Data Visualization for Distributed Bioinformatics

Gregg Helt, Suzannah Lewis, Nomi Harris, and Gerald M. Rubin
Berkeley Drosophila Genome Project

A significant challenge for genome centers is to make the data being generated available to biologists in a way they can use. We are addressing this problem by creating reusable tools for distributed data visualization over the Internet.

Using Java(TM), we have developed a Drosophila genome browser that incorporates a three-tiered graphical view of genomic maps: a physical map, a sequence map, and a DNA display. Annotated biological features are displayed on the physical and sequence maps, and the different views are interconnected. The applet is linked to several databases to retrieve features and to display hyperlinked textual data on selected features. Different types of analysis can be performed and the results displayed on the maps, and the code to do so is dynamically loaded when needed. Our genome browser is built on top of extensible, reusable graphical components specifically designed for bioinformatics. Other groups can reuse this work in various ways: genome centers can reuse large parts of the genome browser with minor modifications, bioinformatics groups working on sequence analysis can reuse components to build front ends for analysis programs, and biology labs can reuse components to publish results as dynamic Web documents. We are participating in the bioWidget consortium to arrive at standard APIs for these kinds of graphical bioinformatics components.

Over the past year we have refined our initial design and reimplemented several of the underlying widgets to make them significantly more powerful, stable, and easy to reuse. The genome browser is also being extended to create a genomic annotation tool. While this tool is being used first to aid in the human curation of sequences at the Drosophila genome center, we intend it to evolve into a distributed annotation tool the entire community of Drosophila biologists can use.


Gene Hunting Without Sequencing Genomic Clones: The "Twenty Questions" Game with Genes

Guorong Xu, Sing-Hoi Sze, Cheng-Pin Liu, Pavel A. Pevzner and Norman Arnheim
Molecular Biology Program, Department of Computer Science Department of Mathematics, University of Southern California, Los Angeles, CA 90089-1340

We propose a new experimental and computational protocol, ExonPCR, which is able to identify exon-intron boundaries in a cDNA even in the absence of any genomic clones. ExonPCR can bypass the isolation, characterization and DNA sequencing of subclones to determine exon-intron boundaries: a major effort in the process of positional cloning. Given a cDNA sequence, ExonPCR uses a series of "adaptive" steps to analyze the PCR products from cDNA and genomic DNA thereby revealing the approximate positions of "hidden" exon boundaries in the cDNA. The nucleotide sequence of adjacent intronic regions is determined by ligation-mediated PCR. Primers adjacent to the "hidden" exon boundaries are used to amplify genomic DNA followed by limited DNA sequencing of the PCR product. The method was successfully tested on the 3 kb hMSH2 cDNA with 16 known exons and the 9 kb PRDII-BF1 cDNA with an unknown number of exons. We subsequently developed the ExonPCR algorithm and software to direct the experimental protocol using a strategy which is analogous to that used in the game "Twenty Questions". Using ExonPCR, the search for disease-causing mutations can be initiated almost immediately after cDNA clones in a genetically mapped region become available. This approach would be most valuable in gene discovery strategies that focus initially on cDNA isolation.
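
A toy sketch of the adaptive, "Twenty Questions" localization (the real protocol designs nested primer pairs and interprets PCR product sizes; the oracle below is purely hypothetical):

    def locate_boundaries(start, end, contains_intron, resolution=100):
        """Recursively halve cDNA intervals until each interval known to span
        an intron is smaller than the desired resolution.

        contains_intron(a, b) answers whether the genomic PCR product spanning
        cDNA positions a..b is longer than the cDNA predicts (an intron intervenes)."""
        if not contains_intron(start, end):
            return []
        if end - start <= resolution:
            return [(start, end)]
        mid = (start + end) // 2
        return (locate_boundaries(start, mid, contains_intron, resolution)
                + locate_boundaries(mid, end, contains_intron, resolution))

    oracle = lambda a, b: a <= 1234 < b      # toy cDNA with one hidden boundary near position 1234
    print(locate_boundaries(0, 3000, oracle))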


The Metabolic Pathway Database and Metabolic Reconstruction from Sequenced Genomes

Evgeni Selkov,1,2 Milyausha Galimova,2 Igor Goryanin,2 Yuri Grechkin,2 Natalia Ivanova,2 Yuri Komarov,2 Niels Larsen,3 Natalia Maltsev,2 Natalia Mikhailova,2 Valery Nenashev,2 Ross Overbeek,1 Lyudmila Pronevich,2 Gordon Pusch,1 and Evgeni Selkov, Jr.2
1 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL
2 Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142292 Pushchino, Moscow region, Russia
3 Center of Microbial Ecology, Michigan State University

For the past three years we have actively worked on the development of functional models for organisms based on available sequence data, biochemical data, and known phenotypic data. We call this process metabolic reconstruction. Each reconstruction involves at least three distinct stages of analysis:

1. An initial estimate of function assignments for coding regions is generated from similarity data with the aid of FastA, Blast, TMpred, and ProSite. Further insights are often achieved by analyzing clusters of proteins from distinct organisms (an attempt to characterize functional correspondences between genomes) and by examining physical position of genes on the chromosome.
2. The output of the first stage will inevitably include many equivocal and often even incorrect assignments. These can frequently be resolved once a detailed estimate of metabolism has been determined. The basic process involves fitting a model of metabolism to the available sequence data, which removes many ambiguities, reveals wrong assignments, and leads to conjectures that must ultimately be resolved experimentally. The key resource required to support this activity is an encoding of known pathways. While a comprehensive collection does not yet exist, the Metabolic Pathway Database, MPW (http://beauty.isdn.mcs.anl.gov/MPW), now contains over 2800 distinct pathways and their variants known to exist in different species.
3. The third stage of the analysis involves the prediction of missing functions (functions that are implied by the metabolic model, but for which no sequences have been identified) and the formulation of experiments that can be used to validate the model and remove remaining ambiguities.
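
A minimal sketch of the missing-function step (stage 3), assuming hypothetical stage-1 assignments and identifying pathway steps by EC number:

    # the enzymatic steps of one encoded pathway variant (truncated, illustrative)
    GLYCOLYSIS_STEPS = ["2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"]

    def missing_functions(pathway_steps, assignments):
        """Pathway steps implied by the metabolic model for which no coding
        region has yet been assigned a matching function."""
        assigned = {ec for ec_list in assignments.values() for ec in ec_list}
        return [ec for ec in pathway_steps if ec not in assigned]

    # hypothetical stage-1 output: ORF identifier -> EC numbers from similarity searches
    assignments = {"orf0012": ["2.7.1.1"], "orf0345": ["2.7.1.11"], "orf0797": ["1.1.1.1"]}
    print(missing_functions(GLYCOLYSIS_STEPS, assignments))    # ['5.3.1.9', '4.1.2.13']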

We have developed this approach in the process of analyzing a number of sequenced (and partially sequenced) genomes. These have included a number of bacterial genomes, two archaeal genomes, and two eukaryotic genomes. In some sense none of these is complete because we continually find reasons to extend and refine these models. On the other hand, a wealth of insight is emerging as our initial efforts produce extensive and detailed models for the metabolism of these organisms. The understanding gained by developing these models sets the stage for a more thorough analysis of specific subsystems by mathematical analysis and modeling.