Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999  Oakland, CA


80. Genome Annotation Data Management and Data Administration: Developing Summary Results for User Navigation, Genome Research, Improved Data Processing, and Quality Metrics 

Jay R. Snoddy, Miriam Land, Sheryl Martin, Morey Parang, Inna Volker, Denise Schmoyer, Manesh Shah, Sergey Petrov, Edward C. Uberbacher, and the Genome Annotation Consortium 
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 
v8v@ornl.gov 
http://compbio.ornl.gov/gac 

We are constructing summary reports of the genome annotation data, together with the underlying data management required to generate them. The goal is to create these reports from a robust, queryable, and scalable data management system (see abstract of Petrov et al.). Some of the summary reports will be available online as HTML documents. The summaries are intended to help in four primary areas.

  • User Navigation: Some reports are being constructed to help users understand what annotation data the Genome Annotation Consortium (GAC) makes available and to navigate easily among those data. Query interfaces are also needed and will be constructed (see also abstract of Parang et al.).

  • Research: Some reports and queries will allow GAC and external researchers to analyze the characteristics of the developing genome sequence framework and the annotation attached to it. Because the analysis is applied consistently across all available genome data, it permits a comprehensive view of the currently available genome annotation.

  • Data Processing: Some reports can help ensure that data flow smoothly through the several automated and manual steps of the annotation process. These steps currently include processing by different analysis modules (see abstract of Shah et al.). Data acquisition, cleansing, and curation must also be supported; these steps should be as automated as possible, but some manual data administration will remain. Reports will be needed to assist GAC personnel in acquiring new data, and summary reports will be needed to help GAC collaborators flag potential inconsistencies in the underlying data for cleansing and curation. Reports could also be generated so that GAC collaborators can obtain the sets of annotation results suitable for research or for integration into a subsequent analysis process.

  • Quality: Some reports can provide the Genome Annotation Consortium with metrics that assist in quality control, quality assurance, and quality improvement of data analysis, data administration, and processing. These reports can help us measure, monitor, and improve annotation quality and the integration of the overall system.

A set of summary reports is being created for each genome, chromosome, sequence contig, and clone. These reports will carry hyperlinks to the underlying data and to further details.
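As an illustration only, the per-level summaries described above could be produced by counting annotation features at each level of the genome, chromosome, contig, and clone hierarchy. The sketch below is not the GAC implementation; the record layout, the `rollup` function, and all field values are hypothetical.

```python
from collections import Counter

# Hierarchy levels named in the text, from coarsest to finest.
LEVELS = ("genome", "chromosome", "contig", "clone")

def rollup(features):
    """Count features at every level of the hierarchy.

    `features` is an iterable of dicts with one key per level
    (a hypothetical record layout chosen for this sketch).
    Returns {level: Counter mapping a path tuple to a count}.
    """
    counts = {level: Counter() for level in LEVELS}
    for feature in features:
        for depth, level in enumerate(LEVELS, start=1):
            # The key is the path from the genome down to this level.
            path = tuple(feature[name] for name in LEVELS[:depth])
            counts[level][path] += 1
    return counts

# Toy records; every identifier here is made up for the example.
toy_features = [
    {"genome": "g1", "chromosome": "c1", "contig": "k1", "clone": "a"},
    {"genome": "g1", "chromosome": "c1", "contig": "k1", "clone": "b"},
    {"genome": "g1", "chromosome": "c2", "contig": "k2", "clone": "c"},
]
summary = rollup(toy_features)
```

Keying each count by its full path keeps identically named contigs or clones on different chromosomes from being merged, and the same path could serve as the target of a hyperlink to the underlying records.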

Several general observations can be made from the current snapshot of the data; details from a later snapshot will be presented at the Oakland workshop. There are 7 to 10 times more predicted gene models (counting both GRAIL-EXP and Genscan gene models) than gene models annotated in GenBank. The majority of the predicted GRAIL-EXP genes have one or more ESTs that are used in the gene modeling. A third to a half of the gene models that predict putative protein sequences have a reasonable BLAST hit to known proteins in Swiss-Prot (BLAST with an E-value <= 1.0e-4). By this BLAST hit criterion, about 3 times more predicted genes appear to have good homolog candidates than there are annotated genes in the GenBank archival record.
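The counts behind observations of this kind could be derived along the following lines. This is a minimal sketch, not the consortium's pipeline: the `GeneModel` record, its fields, and the toy values are assumptions made for illustration, and only the E-value cutoff of 1.0e-4 comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

E_VALUE_CUTOFF = 1.0e-4  # threshold stated in the text

@dataclass
class GeneModel:
    model_id: str                        # hypothetical identifier
    source: str                          # "GRAIL-EXP", "Genscan", or "GenBank"
    best_evalue: Optional[float] = None  # best Swiss-Prot hit; None if no hit

def has_reasonable_hit(model, cutoff=E_VALUE_CUTOFF):
    """True when the best BLAST E-value is at or below the cutoff."""
    return model.best_evalue is not None and model.best_evalue <= cutoff

def summarize(models):
    """Count predicted and annotated models and predicted models with a hit."""
    predicted = [m for m in models if m.source != "GenBank"]
    annotated = [m for m in models if m.source == "GenBank"]
    with_hit = [m for m in predicted if has_reasonable_hit(m)]
    return {
        "predicted": len(predicted),
        "annotated": len(annotated),
        "predicted_with_hit": len(with_hit),
        "fraction_with_hit": len(with_hit) / len(predicted) if predicted else 0.0,
    }

# Toy data only; these are not results from the annotation database.
toy_models = [
    GeneModel("m1", "GRAIL-EXP", 1.0e-10),
    GeneModel("m2", "Genscan", 1.0e-4),
    GeneModel("m3", "Genscan", 0.5),
    GeneModel("m4", "GRAIL-EXP", None),
    GeneModel("m5", "GenBank", None),
]
report = summarize(toy_models)
```

Treating a missing hit as `None` rather than as a large E-value keeps "no hit" distinguishable from "weak hit" in any later report.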

By the time of the Oakland workshop, we hope to display several online summary reports that demonstrate the current state of genome annotation for genomes, chromosomes, contigs, and sequenced clones. These should show users the results of the different but integrated data-management and processing steps that we employ in genome annotation. We welcome suggestions for other reports or queries that others may find useful.


 