Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999  Oakland, CA


81. Data Management for Genome Analysis and Annotation: Engineering a Fundamental Infrastructure for Data that Supports Collaboration in Genome Annotation 

Sergey Petrov, Jay R. Snoddy, Michael D. Galloway, Sheryl Martin, Miriam Land, Morey Parang, Tom Rowan, Denise D. Schmoyer, Manesh Shah, Inna E. Vokler, Edward C. Uberbacher, and the Genome Annotation Consortium 
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 
ptv@ornl.gov 
http://compbio.ornl.gov/gac 

The GenomeDataWarehouse (GDW) is a heterogeneous information system created to store and support data from multiple sources. GDW is currently being developed and populated with data. Its purpose is to provide data management for highly diverse and distributed sets of data. One goal is to create and support a new form of on-line analytical processing (OLAP) suited to the complexity of genome research data. Another goal is to provide data management that encourages the groups in the Genome Annotation Consortium to collaborate in adding experimental and computational data to a developing genome sequence framework. The data management must supply data to the data-analysis modules that ORNL and other collaborators are creating, and must support the linkages among the underlying experimental data and the computational data produced by those modules. The data warehouse will support several user interfaces, some described in other abstracts and some still under development; there will not be one monolithic interface to GDW. 

Conceptually, GDW consists of three parts: archival data sources, the kernel, and data marts. Currently, GDW is based on two Sybase servers, an SRS server, and data files, all running on networked Sun workstations. One Sybase server is dedicated to a copy of the Genome Database (GDB), and the second to the kernel databases and a developing Genome Channel database. The archival data sources include sets of data from community databases (e.g., GenBank, SwissProt, Prosite, and GDB). The GDW kernel is a set of databases that store identification data on biologically meaningful objects and their cross-references, regardless of object origin, structure, or representation. The data marts are precompiled data sets, each reflecting the logic of a particular interface. 

For archival data sources, we occasionally must internalize and manage community data on ORNL computers for a variety of performance, update, and querying reasons. These archival data are attached to the evolving genome sequence framework. We are using an SRS server (a product of the EMBL/EBI) to archive and maintain many of these data. Our implementation of the SRS server at ORNL provides access to 31 community flatfile databases. We are evaluating SRS and other mechanisms for serving the annotation data that we are creating. 

The data warehouse kernel provides the underlying mechanism for managing data in this complex area, where we cannot enforce transactional control over, or the integrity of, some underlying archival data. Given the difficulties and constraints in technology and available resources, our warehouse does not require integration of all data from different sources and does not completely enforce global integrity; data are stored "as is". At the same time, a mechanism is needed to provide cross-references between information on the same objects in different data marts and to record the relationships that originate from the data sources and from our analyses. The kernel consists of several databases storing the IDs of objects found in archival data sources, their classifications, and their relationships, including cross-references. The structure of the kernel databases does not depend on the structure of the objects and relationships found in the original sources. All databases in the kernel have an almost identical logical structure; data were divided among several databases only to improve kernel performance. Each database represents relationships between objects and their classes in a meta-closed way: every class and every relationship is itself represented as an object. Information expressible in the database can therefore include relationships between classes and relationships, as well as classifications of relationships. The flexibility of the chosen data representation allows us to include new data sources on the fly and to represent new classes of objects and new relationships found in genomic data. This approach comes at the cost of performance, but these databases are not meant to be routinely accessed by users. 
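
The meta-closed representation described above can be illustrated with a small sketch. This is not the GDW kernel schema, which is implemented in Sybase and is not specified in this abstract; the class name `Kernel`, its methods, and all identifiers such as `GenBank:EXAMPLE0001` are hypothetical. The sketch only shows the idea that classes, relationship kinds, and relationship instances share one ID space with ordinary data objects, so a relationship can itself be classified.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Kernel:
    """Toy meta-closed store: every class, relationship kind, and
    relationship instance is an object in the same ID space.
    (Hypothetical illustration, not the actual GDW kernel schema.)"""
    labels: dict = field(default_factory=dict)  # object ID -> label
    links: dict = field(default_factory=dict)   # relationship ID -> (subject, kind, target)
    _ids: count = field(default_factory=lambda: count(1))

    def add_object(self, label):
        """Register any object (data object, class, or relationship kind)."""
        oid = next(self._ids)
        self.labels[oid] = label
        return oid

    def relate(self, subject, kind, target):
        """Record a relationship; it receives its own object ID, so it can
        later be the subject or target of other relationships."""
        rid = self.add_object(
            f"{self.labels[subject]} -{self.labels[kind]}-> {self.labels[target]}"
        )
        self.links[rid] = (subject, kind, target)
        return rid

    def targets(self, subject, kind):
        """Return IDs of all objects related to `subject` by `kind`."""
        return [t for (s, k, t) in self.links.values() if s == subject and k == kind]


kernel = Kernel()

# Relationship kinds and classes are plain objects.
is_a = kernel.add_object("is_a")
xref = kernel.add_object("cross_reference")
gene_class = kernel.add_object("Gene")
rel_kind_class = kernel.add_object("RelationshipKind")
curated_class = kernel.add_object("CuratedLink")

# Classify a relationship kind (a relationship between a relationship and a class).
kernel.relate(xref, is_a, rel_kind_class)

# Hypothetical IDs taken from two archival sources.
genbank_entry = kernel.add_object("GenBank:EXAMPLE0001")
swissprot_entry = kernel.add_object("SwissProt:EXAMPLE1")
kernel.relate(genbank_entry, is_a, gene_class)

# A cross-reference is a relationship instance, and it can be classified too.
link = kernel.relate(genbank_entry, xref, swissprot_entry)
kernel.relate(link, is_a, curated_class)
```

Because nothing in this structure depends on the shape of the source records, a new source or a new kind of relationship needs only new objects, not a schema change. The linear scan in `targets` also hints at the performance cost noted above.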

The data marts are the read-only data sources that users routinely access to analyze and navigate the data. Data marts are compiled under the control of the GDW kernel. Each data mart reflects the internal logic of the user interfaces and software systems that focus on one aspect of genome data. Our current Genome Channel data system is the first example of such a data mart. The Genome Channel data mart organizes data around genome structures and features on the assembled genome sequence framework. In the future, we will develop other data marts, including a gene and protein catalog. Although this catalog may contain data similar to parts of Genome Channel, its data system and interface will be organized around genes, proteins, and the relationships among them (including what is known and what we can predict about homology, phylogenetic trees, protein families, and function). 
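
As a rough illustration of compiling a read-only data mart, the following sketch precomputes a view organized by position on the sequence framework, in the spirit of the Genome Channel data mart. The record layout, the identifiers, and the function `compile_genome_mart` are assumptions made for this example; the abstract does not describe the actual compilation procedure.

```python
from types import MappingProxyType

# Hypothetical archival records keyed by source-qualified IDs.
archive = {
    "GenBank:EX0001": {"kind": "gene_feature", "contig": "contig_A", "start": 1200, "end": 3400},
    "GenBank:EX0002": {"kind": "gene_feature", "contig": "contig_A", "start": 200, "end": 900},
    "SwissProt:EXP1": {"kind": "protein", "name": "example protein"},
}

# Hypothetical cross-references of the kind the kernel would supply.
xrefs = [("GenBank:EX0001", "SwissProt:EXP1")]


def compile_genome_mart(archive, xrefs):
    """Precompile features per contig, sorted by position, with their
    cross-references attached. The result is read-only."""
    linked = {}
    for source_id, target_id in xrefs:
        linked.setdefault(source_id, []).append(target_id)

    mart = {}
    for object_id, record in archive.items():
        if record["kind"] != "gene_feature":
            continue  # this mart is organized around features on the sequence
        row = (record["start"], record["end"], object_id,
               tuple(linked.get(object_id, ())))
        mart.setdefault(record["contig"], []).append(row)

    # Tuples plus a mapping proxy prevent users of the mart from modifying it.
    return MappingProxyType(
        {contig: tuple(sorted(rows)) for contig, rows in mart.items()}
    )


mart = compile_genome_mart(archive, xrefs)
for start, end, object_id, links in mart["contig_A"]:
    print(start, end, object_id, links)
```

A gene and protein catalog would be compiled from the same archival records and cross-references but keyed differently, for example by gene or protein family rather than by contig.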

We are developing several interfaces that will give users access to data marts, along with mechanisms to query and navigate among data marts and other data sources. We are trying to give the user some flexibility in altering the view of the data. We are also exploring the application of the Internet standard XML as a method of expressing some annotation data; this should give the user considerable flexibility in altering data presentation at the client browser without going back to the data mart and without altering the underlying content. We anticipate that a number of initial HTML prototypes will be available by the time of the Oakland meeting, and we hope to acquire more feedback on our efforts. 
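
The separation of content from presentation that XML allows can be sketched as follows. The element and attribute names (`annotations`, `feature`, `start`, `end`, and so on) are invented for this example and are not a format defined by the Genome Annotation Consortium. Python stands in for client-side code: the annotation is received once, and two different views are derived from it without a second request to the data mart.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation fragment as a client might receive it.
ANNOTATION_XML = """
<annotations contig="contig_A">
  <feature id="f1" type="gene" start="200" end="900" strand="+"/>
  <feature id="f2" type="exon" start="250" end="400" strand="+"/>
  <feature id="f3" type="gene" start="1200" end="3400" strand="-"/>
</annotations>
"""

root = ET.fromstring(ANNOTATION_XML)


def as_table(root, feature_type=None):
    """View 1: tab-separated rows, optionally filtered by feature type."""
    rows = []
    for feature in root.iter("feature"):
        if feature_type and feature.get("type") != feature_type:
            continue
        rows.append("{id}\t{type}\t{start}\t{end}\t{strand}".format(**feature.attrib))
    return rows


def as_lengths(root):
    """View 2: feature lengths in bases (inclusive coordinates assumed)."""
    return {
        f.get("id"): int(f.get("end")) - int(f.get("start")) + 1
        for f in root.iter("feature")
    }


# Both views come from the same parsed content.
print("\n".join(as_table(root, feature_type="gene")))
print(as_lengths(root))
```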


 