Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999  Oakland, CA


82. Genome Channel Analysis Engine: A System for Automated Analysis of Genome Channel Data 

Manesh Shah, Morey Parang, Doug Hyatt, Michael Galloway, Richard Mural, Kim Worley(1), Edward C. Uberbacher, and the Genome Annotation Consortium 
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee and (1)Baylor College of Medicine, Houston, Texas 
x9m@ornl.gov 
http://compbio.ornl.gov/gac 

The Genome Channel Annotation Toolkit currently incorporates several exon- and gene-prediction programs, along with other feature-recognition systems and database homology search systems. Most of these analysis systems were developed by Genome Annotation Consortium collaborators; others were obtained from researchers who make their code available but are not currently consortium members. The exon- and gene-recognition systems include GRAIL, GRAIL-EXP, Genscan, and Genie. Feature-recognition systems include the GRAIL suite of tools for CpG islands, polyA sites, simple repeats, and repetitive DNA elements. Database homology systems include NCBI BLAST and Beauty postprocessing. 

The Genome Channel Analysis Engine is an automated system that facilitates the analysis of contig sequences contained in the Genome Channel repository. It schedules and distributes the various processing tasks across several networked computer systems in a concurrent, pipelined mode to make the best use of available computer resources and to maximize throughput. Scheduling is organized into analysis epochs, or cycles. At the start of each cycle, a data-refresh procedure compiles a list of all new and updated contig sequences that the sequence data-retrieval engine has placed in the Genome Channel staging area since the previous cycle's data refresh. The same procedure also checks the source ftp sites for updated versions of all databases required by the analysis tools and updates the local copies as necessary. 
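The data-refresh step described above can be illustrated with a minimal sketch. It is not the consortium's implementation (the abstract says the engine is built from Perl scripts and C programs); the `Contig` record, the integer `version` field, and the `contigs_to_analyze` function are all hypothetical names, and the sketch assumes each contig carries a version that increases when the sequence is updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    """Hypothetical record for a contig in the staging area."""
    contig_id: str
    version: int  # assumed to increase whenever the sequence is updated


def contigs_to_analyze(staging, analyzed_versions):
    """Return contigs that are new, or newer than the version last analyzed.

    staging: iterable of Contig records currently in the staging area.
    analyzed_versions: dict mapping contig_id to the version analyzed
        in a previous cycle.
    """
    pending = []
    for contig in staging:
        last = analyzed_versions.get(contig.contig_id)
        if last is None or contig.version > last:
            pending.append(contig)
    return pending


if __name__ == "__main__":
    staging = [Contig("c1", 1), Contig("c2", 3), Contig("c3", 1)]
    analyzed = {"c1": 1, "c2": 2}
    # c1 is unchanged, c2 was updated, c3 is new.
    print([c.contig_id for c in contigs_to_analyze(staging, analyzed)])
```

Comparing against a record of what was last analyzed, rather than against timestamps alone, keeps the selection correct even if a cycle is skipped or rerun.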

A master process then starts servers on the available machines, including a PVM process for the GRAIL analysis modules. Using a combination of Perl scripts and C programs, the analysis engine automatically runs the analysis tools on new contigs. The master process distributes tasks as required to servers running on other machines. Some tasks are performed in parallel, including the GRAIL analysis tools (PVM) and the GRAIL-EXP BLAST search (MPI). In addition, once protein translations have been obtained for the predicted genes and exons in a contig, they are immediately piped to a Beauty postprocessing server for a detailed homology search. The ultimate goal of this scheduling and distribution scheme is to reduce the time required to process 100 Mb of data to under 24 hours. With currently available resources, the analysis engine can process 100 Mb in about 72 hours, so reaching the goal requires roughly a threefold speedup. 
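The pipelined distribution described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python thread pools on one machine in place of the PVM and MPI servers the abstract describes, and `predict_genes`, `homology_search`, `analyze_contig`, and `run_epoch` are hypothetical placeholders, not the actual GRAIL or Beauty interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_genes(contig_id):
    """Placeholder for gene prediction; returns made-up protein labels."""
    return [f"{contig_id}_protein_{i}" for i in range(2)]


def homology_search(protein):
    """Placeholder for the homology search on one protein translation."""
    return f"hits_for_{protein}"


def analyze_contig(contig_id, search_pool):
    """Predict genes, then hand each translation to the search pool at once."""
    proteins = predict_genes(contig_id)
    futures = [search_pool.submit(homology_search, p) for p in proteins]
    return contig_id, [f.result() for f in futures]


def run_epoch(contig_ids, workers=4):
    """Master loop: distribute contigs to workers and collect the results."""
    # Two separate pools, so prediction tasks waiting on search results
    # cannot occupy every worker and stall the searches they wait for.
    with ThreadPoolExecutor(max_workers=workers) as predict_pool, \
         ThreadPoolExecutor(max_workers=workers) as search_pool:
        futures = [
            predict_pool.submit(analyze_contig, cid, search_pool)
            for cid in contig_ids
        ]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    for contig_id, hits in run_epoch(["c1", "c2"]).items():
        print(contig_id, hits)
```

Because the searches for one contig start as soon as its translations exist, they overlap with gene prediction on other contigs, which is the sense of "pipelined" used in the text.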

Analysis processing is currently being performed at ORNL and is also being deployed at Lawrence Berkeley National Laboratory (LBNL). The computational infrastructure at ORNL consists of a cluster of 15 DEC Alpha workstations, 2 Sun HPC 450 UltraSparc servers, and a 200 GB Network Appliance RAID disk storage unit. These resources are barely adequate for handling the current rate of growth of sequence data. We are pursuing several strategies to deal with the anticipated rate of sequence generation and the consequent growth in compute and storage requirements. Some of the most compute-intensive tasks are being ported to the Paragon supercomputer at ORNL and to the supercomputers at LBNL. We also plan to evaluate the High Performance Storage System (HPSS) at ORNL for data storage. 


 