U.S. Department of Energy

Human Genome 1993 Program Report: Informatics

Date Published: March 1994



Human Genome Management Information System
Oak Ridge National Laboratory
1060 Commerce Park, MS 6480
Oak Ridge, TN 37830
615-576-6669, Fax: 615-574-9888
Internet: bkq@ornl.gov


Informatics: Part 1

Projects New in FY 1993

*Method for Direct Sequencing of Diploid Genomes on Oligonucleotide Arrays: Theoretical Analysis and Computer Modeling


Alexander B. Chetverin, A. R. Rubinov, M. S. Gelfand, S. A. Spirin,(1) M. E. Ivanov,(2) R. F. Nakipov, and O. I. Razgulyaev
Viral RNA Biochemistry Group; Institute of Protein Research and (1)Institute of Mathematical Problems in Biology; Russian Academy of Sciences; 142292 Pushchino, Moscow Region, Russia
(2)A. N. Belozersky Institute of Physical and Chemical Biology; Moscow State University; Moscow, Russia
Internet: chetveri@cgi.nks.sn

Projects to sequence several large genomes, particularly the human genome, make the automation of nucleic acid sequencing one of the most important problems of biochemistry and molecular biology. Recently several groups in Great Britain, Yugoslavia, Russia, and the United States suggested a new, easily automatable method for DNA sequencing by hybridization (SBH) with oligonucleotides, compiling the complete list of oligonucleotides of a fixed length (k-tuples) that occur in a DNA fragment (fragment vocabulary) and subsequently reconstructing the fragment sequence by linking maximally overlapping k-tuples from the vocabulary.
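The reconstruction step described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names are hypothetical, and the sketch assumes an error-free vocabulary in which every (k-1)-mer of the fragment occurs only once, so that each k-tuple has a single possible successor.

```python
def vocabulary(fragment, k):
    """Return the set of all k-tuples that occur in the fragment."""
    return {fragment[i:i + k] for i in range(len(fragment) - k + 1)}


def reconstruct(vocab):
    """Chain k-tuples that overlap by k-1 bases into one sequence.

    Assumption: no two k-tuples share the same (k-1)-base prefix.
    Returns None when the vocabulary is ambiguous under that assumption.
    """
    by_prefix = {}
    for word in vocab:
        if word[:-1] in by_prefix:
            return None  # two k-tuples share a prefix: branching point
        by_prefix[word[:-1]] = word
    suffixes = {word[1:] for word in vocab}
    # The first k-tuple is the only one whose prefix is nobody's suffix.
    starts = [word for word in vocab if word[:-1] not in suffixes]
    if len(starts) != 1:
        return None
    sequence = starts[0]
    overlap = len(sequence) - 1
    for _ in range(len(vocab) - 1):
        successor = by_prefix.get(sequence[-overlap:])
        if successor is None:
            return None
        sequence += successor[-1]  # extend by the one new base
    return sequence


if __name__ == "__main__":
    words = vocabulary("ATGCGTAC", 4)
    print(sorted(words))
    print(reconstruct(words))
```

Repeats longer than k-1 bases break the uniqueness assumption; in that case the sketch returns None rather than guessing, which is one reason the choice of oligonucleotide size matters.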

In its current form SBH does not accomplish complete automation of genome sequencing, since the need still exists for random cloning of millions of fragments. To overcome this problem, A. B. Chetverin and F. R. Kramer recently suggested a method for direct sequencing of large (including diploid) genomes on oligonucleotide arrays.

Unlike existing strategies for total genome sequencing, the suggested approach would not require preliminary fragment cloning and chromosome mapping and could be completely automated. Substantial reductions in time and cost would result; preliminary estimates show that the cost of human genome sequencing could be reduced 10- to 100-fold. Furthermore, total sequencing of an individual diploid genome can be considered with this method, as opposed to sequencing a haploid set of fragments arbitrarily compiled from different individual genomes. One of the stages (i.e., fragment sorting and pool reconstruction, see below) can be used for direct sequencing of the total pool of cellular RNA, with obvious advantages for genetics, developmental studies, and medicine.

In this project we plan to develop, implement, and test computer algorithms to solve the problem of reconstructing fragment sequences in a pool. This is important for the following reasons. First, computer processing of biochemical data is an integral and essential part of the method. Second, practical implementation of the method should be preceded by intensive computer modeling and theoretical analysis to determine such optimal biochemical parameters as oligonucleotide size, mean and maximal values of fragment length and pool size measured as their combined length, and signal-to-noise (s/n) ratio in hybridization procedures. Such a preliminary analysis would substantially increase reliability and decrease the cost of collecting biochemical data. Finally, some of the mathematical problems that arise are novel, and they have independent value for discrete mathematics and computer science.

Accurate Restoration of DNA Sequences


Gary A. Churchill
Biometrics Unit; Cornell University; Ithaca, NY 14853-7801
607/255-5488, Fax: -4698, Internet: gary@amanita.biom.cornell.edu

This project will develop statistical methods and algorithms to detect potential errors in DNA sequence data and compute summary measures of sequence quality. Methods will be based on a stochastic model of sequencing errors that will apply to any technology that generates overlapping linear sequence fragments, including gel and capillary electrophoresis. New high-speed sequencing technologies are expected to be developed that may produce errors in raw fragment sequences at a much higher rate than existing methods. The problem of assembling overlapping sequence fragments will also be addressed.

Our long-term goal is to develop an efficient and fully automated quality-control procedure that should be an integral part of any large-scale sequencing project. This automated procedure should eventually replace much of the expensive human effort expended in rechecking raw sequence data. Model-based statistical methods can help improve the efficiency of existing sequencing technology (for example, by allowing for longer gel runs) and will be an essential component of any system that can produce high-quality finished sequence at the target rate of 1 Mb/d.
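As an illustration of how a stochastic error model can yield a summary measure of sequence quality, the following Python sketch computes the posterior probability of each base at one aligned position. It is not the project's model: it assumes independent substitution errors at a single hypothetical rate (error_rate), a uniform prior over the four bases, and reads that are already aligned.

```python
def consensus_posterior(calls, error_rate=0.05):
    """Posterior probability of each true base given aligned base calls.

    Assumptions (illustrative only): each call is wrong independently with
    probability error_rate, a wrong call is equally likely to be any of the
    three other bases, and the prior over A, C, G, T is uniform.
    """
    weights = {}
    for base in "ACGT":
        weight = 1.0
        for call in calls:
            weight *= (1.0 - error_rate) if call == base else error_rate / 3.0
        weights[base] = weight
    total = sum(weights.values())
    return {base: weight / total for base, weight in weights.items()}


if __name__ == "__main__":
    # Four overlapping fragments cover this position; one disagrees.
    posterior = consensus_posterior("AAAC")
    best = max(posterior, key=posterior.get)
    print(best, round(posterior[best], 4))
```

Positions whose best posterior falls below a chosen cutoff could be flagged for rechecking, which is the kind of automated screening the abstract envisions.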

Computer-Aided Genome Map Assembly with SIGMA (System for Integrated Genome Map Assembly)


Michael J. Cinkosky, Michael A. Bridgers, William M. Barber, Mohamad Ijadi, and James W. Fickett
Theoretical Biology and Biophysics Group; Los Alamos National Laboratory; Los Alamos, NM 87545
505/665-0840, Fax: -3493, Internet: michael@t10.lanl.gov

SIGMA (System for Integrated Genome Map Assembly), a recently released graphical genome map editor, supports the following:

  • graphical, mouse-based genome map editing;

  • integration of data from many different types of physical and linkage experiments and at all appropriate resolution levels, from banded ideograms to restriction fragments;

  • creation of multiple "views" on a single map;

  • creation by users of new classes of map objects on demand; and

  • workgroup map building through the use of client-server database management system technology.

In addition, SIGMA enables users to store map-based data as part of the map itself. This feature:

  • keeps underlying data as part of the map, allowing users to know the real support for any given map;

  • automatically evaluates the map against underlying data, pointing out places where the two disagree; and

  • creates a platform for automatic map assembly algorithms being developed by many groups worldwide.

The software, documentation, and a number of sample maps in SIGMA format, including current Genome Data Base maps, are available by anonymous FTP from atlas.lanl.gov. Additional information may be obtained by sending a message containing only the word sigma-info to bioserve@t10.lanl.gov.

Informatics for the Sequencing by Hybridization Project


Aleksandar Milosavljevic and Radomir Crkvenjakov
Center for Mechanistic Biology and Biotechnology; Argonne National Laboratory; Argonne, IL 60439-4833
708/252-3161, Fax: -3387, Internet: crk@everest.anl.gov

Methods for the design and analysis of massive hybridization experiments have been developed on a solid theoretical basis. Algorithmic information theory and minimal length encoding are being used to design methods for comparing partial sequence data and databases of sequenced DNA. Preliminary experiments led to the first identification of 55 gamma-actin cDNA clones based on their hybridizations to 110 heptamer probes. In addition, clustering methods are being developed and applied to discover groups of similar clones in cDNA and genomic libraries through comparison of their hybridization signatures. Clone clustering revealed transcriptional structure in human brain cDNA libraries. A preliminary experiment demonstrated the potential of the clustering method to reduce by 10 times the sequencing effort needed to cover a 12-kb segment of human genomic DNA. Mutual information and other concepts from information theory are being applied to interpret measurements optimally and approach design of massive hybridization experiments.
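Mutual information, mentioned above as a tool for interpreting measurements, can be computed from paired observations as in the following Python sketch. The setting is an assumption made for illustration: probe signals have been discretized (here to 0/1) and each clone carries a class label, so a probe whose signal shares more information with the label is more useful for telling the classes apart. The function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, between two paired sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


if __name__ == "__main__":
    clone_class = ["actin", "actin", "other", "other"]
    informative_probe = [1, 1, 0, 0]    # signal tracks the class exactly
    uninformative_probe = [1, 0, 1, 0]  # signal is independent of the class
    print(mutual_information(informative_probe, clone_class))
    print(mutual_information(uninformative_probe, clone_class))
```

Ranking candidate probes by such a score is one simple way to approach experiment design; the estimate is only as reliable as the counts behind it, so small samples would need correction.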

A relational database that contains a complete record of massive hybridization experiments has been developed using Sybase client-server technology. A complete suite of programs for experiment design, entry of experimental data, and data analysis has been built into the UNIX C-shell in order to facilitate the writing of C-shell scripts for these functions. The programs are written in C++ and interfaced with the database by using Sybase db-library. A new level of experiment design and debugging is being developed to facilitate a complete experiment design in the computer; this design will then be automatically converted into robotic instructions for such routine laboratory operations as microtiter plate manipulation and dot-blot filter printing. Preliminary research is being performed using the Quintus Prolog system to interface the database of hybridization experiments with databases of annotated DNA sequences. The logic programming level will enable the rapid design of computational experiments as well as a uniform data representation level, both of which are necessary for machine discovery of biologically relevant patterns.

Sequencing by Hybridization Algorithms and Computational Tools


Radoje Drmanac, Ivan Labat, and Nick Stavropoulos
Integral Genetics Group; Center for Mechanistic Biology and Biotechnology; Argonne National Laboratory; Argonne, IL 60439
708/252-3175, Fax: -3387, Internet: rade@everest.bim.anl.gov

Sequencing by hybridization (SBH) requires sophisticated computational procedures for data acquisition and evaluation and for DNA screening, mapping, and sequencing applications. We have been developing algorithms and programs and performing simulations to demonstrate SBH applications beyond the straightforward sequencing of short DNA fragments. For example, we demonstrated a 10- to 50-fold increase in SBH efficiency by using overlapped and similar sequences in the assembly process. Furthermore, we showed that partial sequences obtained by 100 to 1000 probes are sufficient for gene identification and recognition of overlapped and similar sequences.

Recently we have started to produce large sets of real hybridization data that define practical requirements and serve as a final check for the necessary programs. Several computational tools have been developed to enable use of the data. The programs are based on heuristic rules and resemble expert systems resistant to common experimental imprecision. All the programs use hybridization intensities without conversion to 0/1 (binary) form.

Acquisition of hybridization data from filters containing 31,000 dots (1 dot/mm²) is the first step. An image-analysis program (DOTS) has been developed (J. Jarvis and R. Drmanac) that automatically defines filter position and reports hybridization intensity for each dot. Programs for evaluating and normalizing data (SCORES), identifying groups of similar clones (CLUSTERS), and ordering clones (CORD) are in the final phase of development. The programs are written in C for UNIX platforms with an X-Windows interface. Through SCORES and CLUSTERS, 20,000 cDNA clones have been sorted into 13,000 groups. Further, an algorithm has been developed for matching hybridization signatures with known sequences by simulating the expected probe scores for known sequences. The CORD program defines contigs of 1- to 2-kb clones hybridized by 200 probes and provides sequence-ready maps. The maps allow clone selection for complete sequencing by fewer than 2 reads/bp. In various simulation experiments, CORD has shown a tolerance to more hybridization errors than observed in our experiments and to the high abundance of Alu repeats found in human sequences.
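The idea of matching a hybridization signature against known sequences by simulating expected probe scores can be sketched in Python as follows. This is not the group's program: the function names, the probe set, and the sequences are hypothetical. The sketch makes the simplifying assumption that a probe's expected score is 1 when the probe occurs in the sequence and 0 otherwise. Observed intensities are left unconverted and compared with the expected scores by Pearson correlation.

```python
from math import sqrt


def expected_scores(sequence, probes):
    """Simulated score per probe: 1.0 if the probe occurs in the sequence."""
    return [1.0 if probe in sequence else 0.0 for probe in probes]


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def best_match(observed, known, probes):
    """Name of the known sequence whose simulated signature fits best."""
    return max(
        known,
        key=lambda name: pearson(observed, expected_scores(known[name], probes)),
    )


if __name__ == "__main__":
    probes = ["ACG", "TTT", "GGA", "CAT"]           # hypothetical probe set
    known = {"geneA": "AACGTTTC", "geneB": "GGACATGG"}
    observed = [0.9, 0.7, 0.1, 0.2]                 # raw intensities for one clone
    print(best_match(observed, known, probes))
```

A real signature would involve hundreds of probes and would also need to account for reverse complements and mismatched hybridization, which this sketch omits.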

Key programs remain to be developed for assembling sequence by (1) combining single-pass gel sequences having as much as a 20% error rate with hybridization data from 3000 probes and (2) integrating hybridization data from similar sequences. A program is also needed for using partial sequence data to identify genes and other genome elements.

*Computer Analysis of Functional Regions of the Human Genome


Nikolay A. Kolchanov
Laboratory of Theoretical Molecular Genetics; Institute of Cytology and Genetics; Siberian Branch of the Russian Academy of Sciences; Novosibirsk 630090, Russia
+7-3832/353-335, Fax: /356-558, Internet: kol@cgi.nsk.su

This work is directed toward computer analysis of the structural-functional organization and evolution of functional regions of human and mammalian genomes. Attention will be focused on genes coding for proteins, 5' regulatory regions, 3' flanking regions, and repeat sequences.

We plan to study the distribution of short oligonucleotides adjacent to transcribed regions. Our specific aim is to identify statistically significant oligonucleotide patterns in the transcription starting region of RNA polymerase II. Particular emphasis will be on oligonucleotide distributions that can be approximated by linear trends within promoter regions (200 to 400 bp).
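A positional oligonucleotide distribution and its linear trend can be computed as in the following Python sketch. It is an illustration only, under stated assumptions: the promoter sequences are taken to be aligned on the transcription start and of equal length, the function names are hypothetical, and the trend is an ordinary least-squares line through the per-position frequencies.

```python
def positional_frequency(sequences, oligo):
    """Fraction of sequences in which the oligo starts at each position."""
    width = len(oligo)
    positions = len(sequences[0]) - width + 1
    return [
        sum(seq[i:i + width] == oligo for seq in sequences) / len(sequences)
        for i in range(positions)
    ]


def linear_trend(values):
    """Least-squares (slope, intercept) of values against position 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    promoters = ["TATAA", "CTATA"]  # toy aligned sequences, not real data
    freq = positional_frequency(promoters, "TATA")
    print(freq)
    print(linear_trend(freq))
```

Testing whether a fitted slope is statistically significant would require a null model for the sequences, which the sketch does not attempt.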

Evolution of the distribution pattern of short nucleotides adjacent to transcription start regions (200 to 300 bp) will also be studied. For this purpose, we intend to perform computer modeling of primate and rodent genes. Simulation results will be compared with real data, and the most adequate evolutionary models of promoter regions in neutral and adaptive variants will be chosen. A computer method will be designed to recognize starting points of transcription by RNA polymerase in eukaryotic genes. Linear-discriminant analysis of many features relevant to oligonucleotide distribution in the transcription start region will be performed for this purpose.

Contextual features of genes coding for proteins will be analyzed. Analysis will take into account the exon-intron structure in these genes. We intend to identify contextual features specific to each given functional region by analyzing exons, introns, and their boundaries containing donor and acceptor splicing sites. These features will be used to develop computer methods for recognizing exons, introns, and whole genes in nucleotide sequences. Methods will include traditional approaches based on discriminant analysis and methods of classification theory and recursive contextual systems.

Computer analysis will be performed to determine molecular mechanisms of mutation emergence in the coding parts of genes in human and other vertebrate genomes. Our goal is to determine the role of polynucleotide context and estimate the contribution of different mutagenesis mechanisms to mutation emergence. Mechanisms to be studied include template chain dislocation, gene conversion, heteroduplex repair, and effects of specific and nonspecific signals of polynucleotide context on mutation emergence.

Polyadenylation sites in the 3' ends of genes transcribed by RNA polymerase II will be analyzed. These sites, key signals in 3' flanking regions of a given group of genes, determine features of 3'-end processing of pre-mRNAs. We intend to analyze specific features of nucleotide context to determine the location of polyadenylation sites in genomic DNA. Based on these results, a method will be developed for recognizing polyadenylation sites in sequences of human and other vertebrate genomes. We intend to identify specific contextual features of polyadenylation sites that determine the magnitude of their functional activities. Methods will be developed for estimating functional activities of given sites derived from sequences.

We will also study the insertion-site contextual features of Alu repeats in the primate genome. These features include homology and complementarity between Alu sequences and insertion regions as well as distribution patterns of short oligonucleotides within insertion sites of Alu repeats. The number of potential insertion points of Alu into human and primate genomes will be estimated.

21Bdb: A Database for Human DNA Sequence Information


Suzanna Lewis, John McCarthy, Edward Theil, Arun Aggarwal, Donn Davy, Sam Pitluck, and Michael Palazzolo
Human Genome Computing Group; Lawrence Berkeley Laboratory; Berkeley, CA 94720
McCarthy: 510/486-5307, Fax: -4004, Internet: jlmccarthy@lbl.gov

21Bdb is a variant of ACEDB, a suite of database and display software originally developed by Richard Durbin and Jean Thierry-Mieg to meet the needs of the Caenorhabditis elegans project. 21Bdb includes all the functionality of ACEDB and extends those capabilities to meet new requirements of the Lawrence Berkeley Laboratory (LBL) chromosome 21 sequencing project. 21Bdb is being used to maintain and provide information for laboratory personnel and the chromosome 21 research community.

Three aspects of ACEDB that have been especially useful for LBL are schema design, data presentation, and collaboration. First, ACEDB makes continuous refinement of the database schema relatively easy, so the schema can match ongoing research needs and permit timely responses to rapidly changing laboratory requirements. Second, ACEDB already includes numerous graphical displays for genomic data and an independent simple graphics library that is completely portable across many platforms. This feature enables us to formulate our own customized data displays quickly, as we have already done for flyDB (developed for the Drosophila physical mapping project). Finally, LBL staff have established a very productive ongoing collaboration with the original developers of ACEDB to extend ACEDB via the Internet. Many LBL enhancements have already been incorporated into the standard ACEDB distribution. ACEDB will also be used by the new Sanger Sequencing Center in Cambridge, England, thus making available more opportunities for collaboration on human genome sequencing tools.

LBL is exploiting a directed sequencing technique on its chromosome 21 project. The implications for laboratory data management are significant. By definition, a directed strategy requires that biologists know the complete heritage of each DNA sequence and its position in relationship to other DNA sequences. This knowledge simplifies and makes more tractable the sequence-assembly process. The database must record all subclones derived from each P1 clone, the P1 subclone map, the transposon-insertion map of each subclone, descriptions of all transposon-inserted priming sites derived from each subclone, and sequencing status and results for every priming site. Recorded data are available both graphically and computationally.
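The record-keeping requirements listed above imply a nested data structure, which the following Python sketch illustrates. It is not the 21Bdb or ACEDB schema: every class, field, and status value here is hypothetical and chosen only to show how the heritage of each sequence (P1 clone, subclone, priming site) could be kept together with its map position and sequencing status.

```python
from dataclasses import dataclass, field


@dataclass
class PrimingSite:
    position: int            # transposon insertion position in the subclone (bp)
    status: str = "pending"  # hypothetical states: "pending", "done"
    sequence: str = ""       # sequencing result, empty until available


@dataclass
class Subclone:
    name: str
    map_position: int        # position of the subclone in the P1 subclone map
    priming_sites: list = field(default_factory=list)


@dataclass
class P1Clone:
    name: str
    subclones: list = field(default_factory=list)

    def unsequenced_sites(self):
        """List (subclone name, site position) pairs still awaiting sequence."""
        return [
            (sub.name, site.position)
            for sub in self.subclones
            for site in sub.priming_sites
            if site.status != "done"
        ]


if __name__ == "__main__":
    sub = Subclone("sub-01", map_position=12000)
    sub.priming_sites.append(PrimingSite(350, status="done", sequence="ACGT"))
    sub.priming_sites.append(PrimingSite(900))
    clone = P1Clone("P1-example", subclones=[sub])
    print(clone.unsequenced_sites())
```

Because each priming site is reachable only through its subclone and clone, the position of every sequence relative to the others is known before assembly begins, which is the property a directed strategy relies on.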

The physical map and chromosome 21 sequence data generated by this project will be available to the community through the 21Bdb database. These data include not only the LBL P1 physical map but also corresponding linkages to the Genethon yeast artificial chromosome map via sequence tagged sites derived at both laboratories. The database also incorporates high-resolution maps of individual P1s made before sequencing and the sequence data itself. Collaborative work on providing public access is under way. Emphasis is on a graphical presentation that looks and feels natural for biologists.

Multiple Alignment and Homolog Sequence Database Compilation


Hwa A. Lim
Florida State University; Tallahassee, FL 32306
904/644-1010, Fax: -0098, Internet: hlim@scri.fsu.edu

In the last few years, especially with the prolific growth of sequence data since the Human Genome Project began, the problem of sequence alignment has been a main research subject in theoretical and computational molecular biology. Although different algorithms have been put forward, they are all based on very simple evolutionary models. The underlying evolutionary mechanisms are not studied in detail and therefore are not adequately considered in existing alignment algorithms. Most efforts have been concentrated on accuracy and speed improvement, but the objective function has never been considered carefully.

The first goal of this project is to develop an alignment procedure capable of perceiving and considering features of evolutionary pattern for query sequence samples. To achieve this, the following approaches will be pursued: (1) alignment of many sequence families contained in databases, (2) study of various alignment parameter values on the basis of available three-dimensional structure alignments and evolutionary models, and (3) alignment of computer-simulated sequences. This study should also improve understanding of mechanisms of sequence evolution. The statistically based approach of diagonal fragment analysis will be used as a basic alignment technique.
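The abstract does not define diagonal fragment analysis. One common reading, assumed here, is that candidate alignment segments are ungapped matching runs along the diagonals of the sequence comparison matrix. The Python sketch below finds such runs; the function name and the min_length cutoff are hypothetical, and the statistical scoring of fragments that the project relies on is omitted.

```python
def diagonal_runs(a, b, min_length=3):
    """Find maximal ungapped identical runs along diagonals of the a-by-b matrix.

    Returns a list of (start_in_a, start_in_b, length) tuples for runs of at
    least min_length matching positions.
    """
    runs = []
    # A diagonal is identified by offset = (index in b) - (index in a).
    for offset in range(-(len(a) - 1), len(b)):
        i = max(0, -offset)
        j = i + offset
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_length:
                    runs.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_length:  # run that reaches the end of the diagonal
            runs.append((i - run, j - run, run))
    return runs


if __name__ == "__main__":
    print(diagonal_runs("GATTACA", "CCATTAGG"))
```

An alignment procedure would then select a consistent, non-crossing subset of these fragments; how fragments are weighted is exactly where an evolutionary model would enter.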

The second goal is to compile probable sequence families in a homolog sequence database useful in various applications. To this end, a massively parallel machine such as the Connection Machine allows investigators to perform the task within a reasonable time.

The third goal of this project is to apply databases developed for human genome sequence investigations. This will include investigation of human multigene families and comparative analysis of human sequences with other species, leading to a better characterization of their functional roles.

Construction of an Integrated Database to Support Genetic Sequence Analysis


Ross Overbeek and Patrick Gillevet(1)
Mathematics and Computer Science Division; Argonne National Laboratory; Argonne, IL 60439
708/252-7856, Fax: -5986, Internet: overbeek@mcs.anl.gov
(1)National Center for Human Genome Research; National Institutes of Health; Bethesda, MD 20855
301/402-2540 or -2534, Fax: -2120, Internet: gillevet@uranus.nchgr.nih.gov

This 3-year project will attempt to create an integrated database to support comparative analyses of genomes. The database will initially include data from GenBank®, SwissProtein Data Bank, Swiss Enzymes Data Bank, EcoSeq Database, ProSite Dictionary of Protein Sites and Patterns, the compilation of compounds distributed by Peter Karp, a representation of the more significant metabolic pathways, genetic maps for a number of bacterial genomes, and additional data relating to the specific sequencing project at Harvard University. This first version will be extended as rapidly as feasible to include data relating to the physical structure of proteins (largely from the Brookhaven Protein Data Bank), manually and automatically generated alignments, phylogeny (most notably the tree and supporting data distributed by the Ribosomal Database Project), and other curated databases such as CarBank and Genome Data Base.

The immediate goal is to compile a database of information about Mycoplasma capricolum, and the long-range goal is to identify functional components of the organism, relating genes to their roles in pathways and explicating regulatory mechanisms for pathways. These sweeping goals require the establishment of a framework for comparative analysis.

Although the initial use of the system will be for comparative analysis of microorganisms and their genetic properties, we plan to develop and distribute a tool that will include data on all organisms and have widespread usefulness within the community. This tool will emerge from connecting (1) an X-Windows-based system devoted to integration of analysis tools and (2) the Argonne GenoBase project focused on integrating existing databases.

Since the advent of the genome project, substantial advances have been made in the availability of genomic data. The number of databases for retention of DNA sequence is increasing, and specialized databases for peptides, enzymes, motifs, alignments, metabolic pathways, and two-dimensional protein gels are also now available. Quality and variety will continue to improve rapidly, necessitating a system to integrate the growing collection of heterogeneous databases.

Chromosome 21 Physical Mapping and Analysis


Stewart Scherer
Human Genome Center; Lawrence Berkeley Laboratory; Berkeley, CA 94720
510/486-4856 or -5468, Fax: -6816, Internet: stew@genome.lbl.gov or stew@lenti.med.umn.edu

Large-scale sequencing of human DNA will be initiated in the next few years. Although moderate-resolution physical maps of human chromosomes are now being assembled with yeast artificial chromosome (YAC) clones, these clones clearly will not be useful sources of templates for DNA sequencing. To provide such a source, the Lawrence Berkeley Laboratory Human Genome Center is constructing a contig of P1 clones from chromosome 21q22.3 that will also be useful in efforts to saturate the region for cDNAs and genetic markers. We focused on this gene-rich 3-Mb region because it is known to be involved in Down's syndrome.

As part of this work, we are enhancing and correcting the existing chromosome 21 YAC map based on sequence tagged sites. We will attempt to understand why certain YAC clones are unstable and whether these regions are better propagated in bacterial hosts.

Analysis of P1-sized genomic sequences with existing computer software is ill-suited to the high-throughput DNA sequencing anticipated in the genome project. We have developed a series of algorithms for rapid comparison of new sequences with statistical models of underlying genome structure. The output is presented graphically so the user is directed rapidly to regions of unusual sequence organization. We intend to integrate our own novel algorithms into a single package with sequence-analysis procedures developed by others. This package will highlight interesting features detected by these programs and provide a graphical overview of the largest contemplated sequences.
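Comparing a new sequence with a statistical model to locate unusual regions can be illustrated by the following Python sketch. It does not reproduce the algorithms described above: it assumes the simplest possible model (independent bases with fixed background frequencies), and the function names, window size, and threshold are hypothetical. Each window is scored by its mean log-likelihood per base, and windows that the model explains poorly are reported.

```python
from math import log


def window_scores(sequence, background, window):
    """Mean log-likelihood per base of each window under the background model."""
    return [
        sum(log(background[base]) for base in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


def unusual_windows(sequence, background, window, threshold):
    """Start positions of windows scoring below the threshold."""
    scores = window_scores(sequence, background, window)
    return [i for i, score in enumerate(scores) if score < threshold]


if __name__ == "__main__":
    # Toy AT-rich background; frequencies are illustrative, not measured.
    background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
    sequence = "ATATATGCGCGCATATAT"
    print(unusual_windows(sequence, background, window=4, threshold=-2.0))
```

Plotting the window scores along the sequence would give the kind of graphical overview the abstract describes; higher-order models (for example, Markov chains over oligonucleotides) would capture more of the genome's structure at the cost of more parameters.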

