Genome Informatics Section 

DOE Human Genome Program Contractor-Grantee Workshop VII 
January 12-16, 1999, Oakland, CA


84. High-Performance Computing Servers 

Phil LoCascio, Doug Hyatt, Manesh Shah, Al Geist, Bill Shelton, Ray Flannery, Jay Snoddy, Edward Uberbacher, and the Genome Annotation Consortium 
Life Sciences Division and Computer Sciences and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 
locasciop@ornl.gov 
http://compbio.ornl.gov/gac 

Advances and fundamental changes in experimental genomics, brought about by an avalanche of genome sequences, will challenge the current methods used to analyze biological data computationally. Within the Genome Annotation Consortium (GAC) project, we are constructing a computational infrastructure to meet these new demands for processing sequence and other biological data, both for genome centers and for the biological community at large. To cope with an anticipated 20-fold increase in data, we have been developing the necessary high-performance computing tools to address this scaling challenge. As part of the Department of Energy's Grand Challenge in computational genomics, we have developed a number of applications that form part of a high-performance toolkit for the analysis of sequence data.

The initial tools in the toolkit are high-performance biological application servers: BLAST codes (versions of BLASTN, BLASTP, and BLASTX) and codes for sequence assembly, gene modeling (e.g., GRAIL-EXP), multiple sequence alignment, protein classification, protein threading, and phylogeny reconstruction (for both gene trees and species trees).

The tools and servers will be transparent to the user yet able to manage the large amounts of processing and data produced in the various stages of enriching experimental biological information with computational analysis. The goal of this high-performance toolkit is not only to provide one-stop access to a genome sequence-data framework and interoperable tools but also to run its codes on platforms where hardware limitations do not greatly constrain the kinds of questions the GAC and our users can ask.

The system's logical structure can be thought of as having three overall components: client, administrator, and server. All components share a common infrastructure consisting of a naming service and a query agent, with the administrator exercising policy control over agent behavior and the namespace profile.
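
To make this concrete, the sketch below models the shared namespace as a small registry that the query agent consults, with the administrator's policy reduced to a per-service enable flag. This is a minimal illustration in C; the names, types, and registry layout are invented for the sketch and are not the project's actual interfaces.

    /* Illustrative model of the common infrastructure: a namespace of
       services that the query agent resolves, subject to administrator
       policy.  All names here are invented for this sketch. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        char name[32];   /* service name in the namespace, e.g. "blastn" */
        int  tid;        /* task id of a registered server instance      */
        int  enabled;    /* administrator policy: may the agent use it?  */
    } ServiceEntry;

    static ServiceEntry registry[] = {
        { "blastn",    101, 1 },
        { "grail-exp", 102, 1 },
    };

    /* The query agent resolves a service name to a live instance. */
    static int agent_resolve(const char *name)
    {
        for (unsigned i = 0; i < sizeof registry / sizeof registry[0]; i++)
            if (registry[i].enabled && strcmp(registry[i].name, name) == 0)
                return registry[i].tid;
        return -1;       /* no enabled instance in the namespace */
    }

    int main(void)
    {
        printf("blastn resolves to tid %d\n", agent_resolve("blastn"));
        return 0;
    }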

At the level of atomic transactions, clients and servers behave as expected: clients issue requests, and servers respond. At a higher level of transaction detail, a much more complex model of operation is possible, in which clients can be operated from within servers and servers can be directed to propagate replies. This nested transaction model is very powerful for developing decoupled calculation and query facilities, and the complexity is completely transparent to the user because all transactions are controlled by a query agent.
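
A minimal sketch of this nesting, with the transport reduced to a stub, is shown below: the handler for the outer transaction itself acts as a client, issuing a sub-request through the agent and folding the sub-reply into its own response. The function names are hypothetical.

    /* Nested transaction sketch: a server that is also a client.
       agent_request() stands in for the query-agent transport. */
    #include <stdio.h>

    static const char *agent_request(const char *service, const char *payload)
    {
        printf("agent: routing '%s' to %s\n", payload, service);
        return "alignment-hits";                /* canned sub-reply */
    }

    /* Outer transaction: answer a gene-modeling request by nesting an
       alignment transaction inside it. */
    static const char *handle_request(const char *sequence)
    {
        const char *hits = agent_request("blastn", sequence);  /* inner */
        printf("server: merging %s into the outer reply\n", hits);
        return "gene-model";
    }

    int main(void)
    {
        printf("client receives: %s\n", handle_request("ATGGCC..."));
        return 0;
    }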

The GRAIL-EXP gene-recognition application will be deployed as a server that uses this model. The application obtains alignment services from BLAST servers running elsewhere. Internally, the GRAIL-EXP server is composed of a number of independent components that interact as a nested set of transactions. The ability to assign different resources to different components is essential to maintaining a credible load-balancing scheme.
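
One simple way to picture per-component resource assignment is a separate instance pool for each internal component, with work handed out round-robin within a pool. The sketch below assumes that scheme; the actual load-balancing policy, pool contents, and names are not specified here and may differ.

    /* Round-robin assignment within a per-component instance pool. */
    #include <stdio.h>

    typedef struct {
        const char *component;   /* e.g. the internal alignment stage */
        int tids[8];             /* instances assigned to this stage  */
        int count;
        int next;                /* round-robin cursor                */
    } Pool;

    static int pool_pick(Pool *p)
    {
        int tid = p->tids[p->next];
        p->next = (p->next + 1) % p->count;
        return tid;
    }

    int main(void)
    {
        Pool align = { "alignment", { 201, 202, 203 }, 3, 0 };
        for (int i = 0; i < 5; i++)
            printf("request %d -> instance %d\n", i, pool_pick(&align));
        return 0;
    }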

Because the query infrastructure is logically decoupled, the model both scales well and tolerates faults. In tests running multiple instances of GRAIL-EXP and BLAST, we have demonstrated that removing any dependent service does not cause loss of data; instead, as processing power is removed, service degrades gracefully as long as some instance of the service remains available.
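
The graceful degradation can be sketched as a dispatch loop that simply skips instances that have disappeared: requests keep succeeding, only with less parallelism, until the last instance is gone. The liveness test below is a stub standing in for the real infrastructure.

    /* Fault-tolerant dispatch sketch: use the first surviving instance. */
    #include <stdio.h>

    #define NINST 3
    static int alive[NINST] = { 0, 1, 1 };   /* instance 0 was removed */

    static int dispatch(void)
    {
        for (int i = 0; i < NINST; i++)
            if (alive[i])
                return i;        /* some instantiation still available */
        return -1;               /* total loss of the service */
    }

    int main(void)
    {
        int inst = dispatch();
        if (inst >= 0)
            printf("request served by instance %d\n", inst);
        else
            printf("no instance of the service remains\n");
        return 0;
    }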

The overall software engineering design has been constructed carefully to provide nonspecific use of distributed resources through the neutral application programming interface (NAPI) layer. NAPI encapsulates the functionality required for distributed operation while utilizing the currently available resources. The underlying infrastructure is built on PVM (Parallel Virtual Machine) for robust heterogeneous operation and, optionally where available, MPI (Message Passing Interface) for homogeneous application development. Other infrastructure ports (e.g., Java RMI) can be accommodated, but the focus for now is the design of the high-level component model's functionality and semantics.
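
Beneath a neutral layer such as NAPI, a PVM-based transaction reduces to the usual spawn/pack/send/receive pattern. The fragment below uses standard PVM 3 calls; the "blast_server" executable name, the message tags, and the buffer size are placeholders invented for the example, not the project's actual names.

    /* Minimal PVM 3 client: spawn one server task, send it a query
       string, and block for the reply. */
    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int child;
        char query[] = "ATGGCC...";
        char result[4096];

        pvm_mytid();                         /* enroll this task in PVM */
        if (pvm_spawn("blast_server", NULL, PvmTaskDefault,
                      "", 1, &child) < 1) {
            fprintf(stderr, "spawn failed\n");
            pvm_exit();
            return 1;
        }

        pvm_initsend(PvmDataDefault);        /* pack and send the query */
        pvm_pkstr(query);
        pvm_send(child, 1);

        pvm_recv(child, 2);                  /* wait for the reply */
        pvm_upkstr(result);
        printf("reply: %s\n", result);

        pvm_exit();                          /* leave the virtual machine */
        return 0;
    }

An MPI build of the same transaction would express the exchange with MPI_Send and MPI_Recv over a communicator, which is precisely the kind of substitution a neutral layer is meant to hide.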

Located at Oak Ridge National Laboratory within both the Center for Computational Sciences and the Computational Biosciences section, the development testbed consists of three supercomputers (Intel Paragons), several SGI SMP machines, and a DEC Alpha workstation cluster. We are rapidly approaching alpha-stage deployment testing; once performance and stability have been tested, we can deploy the framework to NERSC, other high-performance computing sites, and other collaborators.


 