Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999  Oakland, CA


83. GRAIL-EXP: Multiple Gene Modeling Using Pattern Recognition and Homology 

Ying Xu, Manesh Shah, Doug Hyatt, Richard Mural, Edward C. Uberbacher, and the Genome Annotation Consortium 
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 
ube@ornl.gov 
http://compbio.ornl.gov/gac 

GRAIL-EXP is a multiple gene-modeling system that combines EST homology analysis with pattern recognition to construct accurate gene models. We believe that the current improvements in GRAIL-EXP represent fundamental advances in gene-modeling accuracy and computational performance. The system is being used extensively by the Genome Annotation Consortium to provide comprehensive genome-wide annotation of genomic DNA sequence from human, mouse, and other model organisms, as well as several microbial organisms. In this context, GRAIL-EXP analyzes long stretches of human and mouse DNA sequence (contigs that span tens of thousands to more than a million bases) to correctly identify and characterize the large numbers of genes found in such sequences. 

Computational methods for gene identification in human genomic sequences typically consist of two phases: coding-region recognition and gene modeling. Although several effective methods for coding-region recognition are available, parsing the recognized coding regions into appropriate gene structures remains a difficult problem. GRAIL-EXP addresses the problem of multiple gene identification by using a set of biological heuristics together with information derived from sequence homology with available EST and mRNA sequences. 

GRAIL-EXP uses GRAIL to predict exons in a sequence. GRAIL evaluates all possible exon candidates in a DNA sequence and groups the high-scoring candidates into overlapping clusters. Candidates containing repetitive DNA elements are filtered out based on BLAST alignments of the exon candidates against a repetitive DNA database. 
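
The grouping step can be pictured with a short Python sketch. It is an illustration only, not the GRAIL implementation: the `Candidate` record, the `min_score` cutoff, and the rule that a candidate joins a cluster when it overlaps the cluster's current extent are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical exon candidate; coordinates are inclusive."""
    start: int
    end: int
    score: float


def cluster_candidates(cands: List[Candidate], min_score: float) -> List[List[Candidate]]:
    """Keep candidates scoring at least min_score and group overlapping ones."""
    kept = sorted((c for c in cands if c.score >= min_score), key=lambda c: c.start)
    clusters: List[List[Candidate]] = []
    cluster_end = -1  # right-most coordinate covered by the current cluster
    for c in kept:
        if clusters and c.start <= cluster_end:
            # Overlaps the current cluster: add it and extend the cluster.
            clusters[-1].append(c)
            cluster_end = max(cluster_end, c.end)
        else:
            # No overlap: start a new cluster.
            clusters.append([c])
            cluster_end = c.end
    return clusters


cands = [
    Candidate(100, 200, 0.9),
    Candidate(150, 260, 0.8),
    Candidate(400, 500, 0.7),
    Candidate(120, 180, 0.2),  # dropped by the assumed score cutoff
]
clusters = cluster_candidates(cands, min_score=0.5)
```

Here the first two candidates fall into one cluster and the third forms its own. Repeat filtering would then be applied to the candidates before gene modeling.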

In the next phase, the system uses BLAST to identify all EST and mRNA sequences (obtained from GenBank dbEST and TIGR's human transcript sequence database) that align to the candidate exons with a sufficiently high BLAST score. From each matched database entry, the system also extracts information useful for the subsequent gene-modeling phase. The result is a set of alignments for each exon candidate. 
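
As a rough illustration of how the alignments might be organized per exon candidate, the Python sketch below filters hits by score and groups them by exon. The `Hit` record, its field names, and the `min_score` threshold are hypothetical; the abstract does not specify the actual data layout or cutoff.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical summary of one BLAST alignment."""
    exon_id: str
    est_id: str
    score: float


def alignments_per_exon(hits: List[Hit], min_score: float) -> Dict[str, List[Hit]]:
    """Group sufficiently high-scoring hits by exon candidate, best hit first."""
    by_exon: Dict[str, List[Hit]] = defaultdict(list)
    for h in hits:
        if h.score >= min_score:
            by_exon[h.exon_id].append(h)
    for exon_hits in by_exon.values():
        exon_hits.sort(key=lambda h: h.score, reverse=True)
    return dict(by_exon)


hits = [
    Hit("exon1", "EST_A", 180.0),
    Hit("exon1", "EST_B", 240.0),
    Hit("exon2", "EST_A", 95.0),
    Hit("exon2", "EST_C", 30.0),  # below the assumed threshold
]
grouped = alignments_per_exon(hits, min_score=50.0)
```

An EST that appears under more than one exon (here `EST_A`) is the kind of shared evidence the gene-modeling phase rewards.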

In the gene-modeling phase, an optimal gene model is constructed from the predicted exon candidates and the alignment information using dynamic programming. A set of nodes is created, one for each exon candidate or aligned-EST-sequence pair. Each node is assigned a score based on its GRAIL score and the BLAST score for that alignment. The best-scoring gene model ending at each node is calculated using a recursive algorithm. Each exon is examined in three possible roles: as the initial, middle, or terminating exon of a gene model. At each step, the algorithm assigns penalties and rewards based on reading-frame mismatch, the existence of an in-frame stop codon, and a terminating exon that does not end in a stop codon. A node that uses the same EST as the previous node is assigned a reward that significantly outweighs the penalties. This ensures that an EST that matches multiple exon candidates will have overriding influence on the gene model. 
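
The recurrence can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the GRAIL-EXP algorithm: it is written iteratively rather than recursively, it omits the three exon roles and the stop-codon checks, and the `Node` fields, `SAME_EST_REWARD`, and `FRAME_PENALTY` values are invented for the example. It keeps only the two ideas named above: a frame-mismatch penalty and a same-EST reward large enough to outweigh it.

```python
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    """Hypothetical node: one exon candidate, optionally paired with an EST."""
    start: int
    end: int
    score: float        # combined GRAIL/BLAST score (combination rule assumed)
    est: Optional[str]  # identifier of the aligned EST, or None
    phase_in: int       # codon phase at the exon start (0, 1, or 2)
    phase_out: int      # phase the next exon must start in


SAME_EST_REWARD = 10.0  # assumed value, chosen to outweigh the penalty
FRAME_PENALTY = 2.0     # assumed value


def link_score(prev: Node, cur: Node) -> float:
    """Reward or penalty for placing cur directly after prev."""
    s = 0.0
    if prev.est is not None and prev.est == cur.est:
        s += SAME_EST_REWARD
    if prev.phase_out != cur.phase_in:
        s -= FRAME_PENALTY
    return s


def best_gene_model(nodes: List[Node]) -> Tuple[float, List[Node]]:
    """Return the best-scoring chain of non-overlapping nodes."""
    order = sorted(nodes, key=lambda n: (n.start, n.end))
    if not order:
        return 0.0, []
    best = [n.score for n in order]  # best[i]: best model ending at node i
    back: List[Optional[int]] = [None] * len(order)
    for i, cur in enumerate(order):
        for j in range(i):
            prev = order[j]
            if prev.end < cur.start:  # prev lies strictly upstream of cur
                cand = best[j] + link_score(prev, cur) + cur.score
                if cand > best[i]:
                    best[i], back[i] = cand, j
    i: Optional[int] = max(range(len(order)), key=best.__getitem__)
    total = best[i]
    path: List[Node] = []
    while i is not None:  # follow back-pointers to recover the model
        path.append(order[i])
        i = back[i]
    path.reverse()
    return total, path


nodes = [
    Node(100, 200, 5.0, "EST1", 0, 1),
    Node(300, 400, 4.0, None, 1, 0),    # higher score, frame-compatible
    Node(310, 390, 1.0, "EST1", 2, 0),  # lower score, frame mismatch, same EST
    Node(500, 600, 3.0, "EST1", 0, 0),
]
total, model = best_gene_model(nodes)
```

In this toy input the middle node supported by `EST1` is chosen over the higher-scoring, frame-compatible alternative, because the same-EST reward dominates the frame penalty.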

If the optimal gene model incorporates one or more matched ESTs, the system determines whether any regions of those ESTs are not covered by the exons of the gene model. If so, the system tries to locate the missing EST fragments in the appropriate intervals of the genomic sequence. If a fragment is located, that region is added to the gene model as an exon. 
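
The first half of this step, finding the parts of an EST that no model exon accounts for, can be sketched in Python. The function name, the half-open EST coordinates, and the `min_len` cutoff are assumptions for illustration; the subsequent search for each fragment in the genomic sequence is not shown.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in EST coordinates


def uncovered_est_regions(est_length: int, covered: List[Interval],
                          min_len: int = 1) -> List[Interval]:
    """Return EST regions of at least min_len bases not covered by any exon."""
    gaps: List[Interval] = []
    pos = 0  # first EST position not yet known to be covered
    for start, end in sorted(covered):
        if start - pos >= min_len:
            gaps.append((pos, start))
        pos = max(pos, end)
    if est_length - pos >= min_len:
        gaps.append((pos, est_length))
    return gaps


# A 500-base EST whose alignments to model exons cover bases 0-120 and 200-350.
missing = uncovered_est_regions(500, [(0, 120), (200, 350)], min_len=20)
```

Each returned interval would then be searched for between the flanking exons of the gene model.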

GRAIL-EXP, a complex system with several logical components and numerous subcomponents, has been designed and implemented as a modular system. This modularity makes it convenient to distribute the analysis tasks across multiple computers to achieve higher throughput. The system currently runs on a cluster of 10 DEC Alpha workstations and can analyze around 1 Mb of genomic sequence in about 15 minutes. Work is under way to achieve significant speedup by porting the most computationally intensive modules to the Paragon supercomputer in the Center for Computational Sciences at ORNL and to similar platforms at Lawrence Berkeley National Laboratory. A Java-based graphical user interface has been developed to provide an interactive environment for the analysis of user-supplied DNA sequences. The system will be made available to the genome community via the public GRAIL server at ORNL in early 1999. 


 