Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999  Oakland, CA


78. The Genome Annotation Collaboration: An Overview 

Jay R. Snoddy, Morey Parang, Sergey Petrov, Richard Mural, Manesh Shah, Ying Xu, Sheryl Martin, Phil LoCascio, Kim Worley(1), Manfred Zorn(2), Sylvia Spengler(2), Donn Davy(2), Chris Overton(3), Edward C. Uberbacher, and the Genome Annotation Consortium 
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee;  
(1) Baylor College of Medicine, Houston, Texas; (2) Lawrence Berkeley National Laboratory, Berkeley, California; and (3) University of Pennsylvania, Philadelphia, Pennsylvania 
ube@ornl.gov 
http://compbio.ornl.gov/gac 

The Genome Annotation Consortium is organizing software and database development projects toward a common goal of providing as much value-added annotation as possible on a genome sequence framework. The consortium is applying computational analysis modules and information technologies to the output of genome sequencers. We have developed a prototype system and process that will be presented at the Oakland workshop. We are also interested in forging new collaborations to add value to the genome sequence and annotation framework. Desired collaborations should improve the analysis process or the underlying technologies that are required for this analysis. This basic annotation process includes the following steps: 

1. Acquisition of genome sequence data and other data that can be readily attached to genome sequences; 
2. Assembly of sequence data into a consensus genome sequence framework; 
3. Genome-scale analysis of sequence and other data to predict genes, gene products, or other features, and integration of existing experimental information onto that genome sequence framework; and 
4. Large-scale analysis of genome-based catalogs of genes and proteins to add homologous, functional, phylogenetic, and other types of relationships among the genes and proteins. 

The outputs of our desired process include: 

1. An assembled genome sequence framework; 
2. Genes and features attached to that framework; 
3. Catalogs of genes and proteins encoded by genomes; and 
4. Links among genes, proteins, genome maps, homologous relationships, phylogenetic trees, and other relationships for computational and experimental genome data. 

Our current prototype is being applied to the human sequence output of all the large-scale genome sequencing centers. We are adding mouse and microbial genome sequences to our prototype (see abstract of Larimer et al. for microbial analysis). As part of the initial prototype, we have established a data-acquisition component that retrieves data from genome center web sites and GenBank. These acquired data include, for example, clone-contig overlap information that is not always present in the GenBank/EMBL/DDBJ entry. We have established a sequence-assembly component that creates a consensus genome sequence framework by assembling the different clone sequences. In addition, we acquire other experimental observations that can be linked to that genome sequence framework during annotation (e.g., ESTs, STSs, cDNAs). 
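
The assembly step described above can be illustrated with a toy example. The following Python sketch is not the consortium's assembly component; it assumes that clone sequences are already ordered along a tiling path, that adjacent clones share an exact suffix-prefix overlap, and that sequencing errors can be ignored. The function names `overlap_length` and `merge_clones` are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_clones(clones: list[str], min_overlap: int = 3) -> str:
    """Merge clone sequences, given in tiling-path order, into a single
    framework sequence by collapsing exact overlaps between neighbors.

    Assumption (illustrative only): overlaps are exact matches. Real
    clone data would need alignment that tolerates mismatches and gaps.
    """
    if not clones:
        return ""
    framework = clones[0]
    for clone in clones[1:]:
        size = overlap_length(framework, clone, min_overlap)
        if size == 0:
            raise ValueError("adjacent clones do not overlap")
        # Append only the part of the clone that extends the framework.
        framework += clone[size:]
    return framework


if __name__ == "__main__":
    print(merge_clones(["ACGTACGG", "ACGGTTCA", "TTCAGG"]))
```

Externally supplied clone-contig overlap information, such as that retrieved from genome center web sites, could replace the exact-match search in `overlap_length` when it is available.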

We have developed a number of analysis modules, including GRAIL-EXP modules (see abstract of Xu et al.). We have integrated these analysis modules in a data-analysis process that creates a comprehensive genome-wide analysis (see abstract of Shah et al.). This comprehensive analysis process will be updated to ensure that new data can be added to the genome sequence framework. We have made progress in adding navigation and summary reports (see abstract of Snoddy et al.). 

We also have made progress on the difficult issue of storing and managing these diverse experimental and computational data (see abstract of Petrov et al.). We have produced different catalogs of genes and proteins, including (1) GenBank-annotated genes, (2) Genscan-predicted genes, and (3) GRAIL-EXP-predicted genes (including a subset of genes that have some EST evidence for expression). We have produced a Java-based interface (the Genome Channel Browser v. 2.0) and an HTML-based data-access method. These interfaces, other planned interfaces, and other progress will be presented at the Oakland meeting. 
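
As a rough illustration of how such catalogs might be organized, the Python sketch below groups gene models by the source that produced them and extracts the subset with EST support. It is a minimal sketch under stated assumptions, not the consortium's data model: the `GeneModel` fields, the `est_hits` counter, and the example identifiers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """One annotated or predicted gene placed on the sequence framework.
    All field names are illustrative assumptions."""
    gene_id: str
    source: str      # e.g. "GenBank", "Genscan", or "GRAIL-EXP"
    contig: str      # identifier of the framework contig
    start: int       # 1-based start coordinate on the contig
    end: int         # inclusive end coordinate on the contig
    est_hits: int = 0  # number of ESTs supporting the model


def build_catalogs(models: list[GeneModel]) -> dict[str, list[GeneModel]]:
    """Group gene models into one catalog per source."""
    catalogs: dict[str, list[GeneModel]] = defaultdict(list)
    for model in models:
        catalogs[model.source].append(model)
    return dict(catalogs)


def with_est_evidence(models: list[GeneModel]) -> list[GeneModel]:
    """Return the subset of models supported by at least one EST."""
    return [model for model in models if model.est_hits > 0]


if __name__ == "__main__":
    models = [
        GeneModel("g1", "GenBank", "contigA", 100, 900),
        GeneModel("g2", "Genscan", "contigA", 120, 950),
        GeneModel("g3", "GRAIL-EXP", "contigA", 110, 920, est_hits=2),
        GeneModel("g4", "GRAIL-EXP", "contigB", 40, 600),
    ]
    catalogs = build_catalogs(models)
    for source, entries in catalogs.items():
        print(source, [m.gene_id for m in entries])
    print([m.gene_id for m in with_est_evidence(catalogs["GRAIL-EXP"])])
```

Keeping the source on every record lets the same coordinates on the framework carry competing models from different predictors side by side.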

The analysis modules used in the comprehensive genome-analysis processes also will be available as public servers (see abstract of LoCascio et al.). These servers would permit users to analyze their new data or subsets of public data. Some of these analysis modules also will be portable and could be applied at a number of sites beyond the consortium member sites, including genome centers. We expect that our data-analysis process and computational infrastructure will also foster other genome-based, large-scale computational biology, including prediction of protein structure and modeling of biological systems. 


 