Los Alamos National Laboratory

Los Alamos-led team to sequence entire NT biological database on GreenGene distributed supercomputer

Contact: Kevin N. Roark, knroark@lanl.gov, (505) 665-9202 (04-193)

LOS ALAMOS, N.M., November 18, 2005 — Award-winning Los Alamos National Laboratory-developed software is helping researchers here and elsewhere better understand a database of biological information and enable a plethora of biological studies from organism "barcoding" to gene function and evolution.

The software, mpiBLAST, coupled with a supercomputer assembled over a high-speed network and distributed across the country just for this purpose, will make the biological information stored in large databases more useful for researchers by enabling a Google-like indexing structure that tracks relationships among the sequences in these databases. Such an indexing structure can increase search speed by a factor of 100 while reducing the size of the database by up to 20-fold.

mpiBLAST, an open-source project led by Los Alamos researcher Wu Feng, is being tapped to harvest the NT biological database in order to create the Google-like indexing structure. Los Alamos researchers announced at this week's Supercomputing 2005 Conference that they will lead a large-scale nationwide effort to sequence-search the entire NT database.

The NT biological database is akin to a "biological dictionary organized as a flat file." When biologists need to know if a particular genomic sequence has already been catalogued, they look through this dictionary for that sequence. If they can't find it, they add the new information to the end of the file, making the unordered file larger and larger.

The idea is that the database would be far more useful if it were organized with a structure that can be searched in a non-linear manner. Feng and other scientists intend to give this huge database that structure by using the "GreenGene" supercomputer to sequence-search the entire database against itself.
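The contrast between scanning a flat file and consulting a precomputed relationship index can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy sequences, and above all the shared k-mer similarity test, which is a simple stand-in and not the BLAST algorithm or the indexing structure the project intends to build.

```python
def kmers(seq, k=3):
    """Return the set of length-k substrings of seq (toy similarity profile)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def linear_lookup(flat_file, query):
    """Scan the unordered flat file from the top until the query is found."""
    for position, seq in enumerate(flat_file):
        if seq == query:
            return position
    return None  # not catalogued; a biologist would append it to the file


def build_relationship_index(flat_file, k=3, min_shared=2):
    """Compare every sequence against every other one (database against itself).

    Two sequences count as related when they share at least min_shared k-mers.
    This threshold test is a hypothetical stand-in for a real alignment score.
    """
    profiles = [kmers(seq, k) for seq in flat_file]
    related = {i: [] for i in range(len(flat_file))}
    for i in range(len(flat_file)):
        for j in range(i + 1, len(flat_file)):
            if len(profiles[i] & profiles[j]) >= min_shared:
                related[i].append(j)
                related[j].append(i)
    return related


# Toy "database": four short, made-up nucleotide sequences.
flat_file = ["ACGTACGT", "ACGTTCGT", "GGGCCCAA", "TTGGGCCC"]

index = build_relationship_index(flat_file)
print(linear_lookup(flat_file, "GGGCCCAA"))  # position found by linear scan
print(index)                                 # precomputed relationships
```

The all-against-all loop is the expensive step, which is why it is done once up front; afterwards, the relatives of any catalogued sequence can be read from the index without rescanning the file.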

"If this endeavor to sequence-search the entire NT database succeeds, the result of this experiment will provide critical information to the biology community, including insightful evolutionary, structural, and functional relationships between every sequence and family in the NT database," notes Feng of Los Alamos' Advanced Computing Laboratory, principal investigator for the project. In all, the large-scale experiment is expected to generate 100 terabytes of output, enough to fill roughly 2,000 iPods, he added.

mpiBLAST, as distributed by Los Alamos National Laboratory (http://mpiblast.lanl.gov or http://www.mpiblast.org), won an R&D 100 Award in 2004. It is a search tool that enables biologists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences then enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. On a 128-processor supercomputing cluster, mpiBLAST can deliver a 305-fold speed-up, cutting the search time of a representative 300-kilobyte query file from nearly 24 hours to only 5 minutes. A version of mpiBLAST with parallelized I/O, called mpiBLAST-pio, reduces the search time further and allows the code to scale to larger system configurations.
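The quoted figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above (24 hours, 305-fold, 128 processors); the function and variable names are illustrative, and "nearly 24 hours" is treated as exactly 24 hours, so the result is an upper bound.

```python
# Figures quoted in the release; everything else is derived from them.
SERIAL_HOURS = 24.0  # "nearly 24 hours" on one processor (taken as an upper bound)
SPEEDUP = 305.0      # reported speed-up
PROCESSORS = 128     # cluster size


def parallel_minutes(serial_hours, speedup):
    """Search time in minutes after applying the reported speed-up."""
    return serial_hours * 60.0 / speedup


def efficiency(speedup, processors):
    """Speed-up per processor; a value above 1.0 means superlinear scaling."""
    return speedup / processors


minutes = parallel_minutes(SERIAL_HOURS, SPEEDUP)
eff = efficiency(SPEEDUP, PROCESSORS)
print(f"{minutes:.1f} minutes, efficiency {eff:.2f}")
# prints "4.7 minutes, efficiency 2.38"
```

The result is consistent with the "only 5 minutes" quoted above, and the efficiency above 1.0 shows that the reported speed-up exceeds the processor count, that is, it is superlinear.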

Led by Los Alamos National Laboratory, the nationwide team collaborating on this endeavor includes industrial participants from Intel, Panta Systems and Foundry Networks. Academic and government participants are from North Carolina State University, Oak Ridge National Laboratory, Utah and Virginia Tech universities. In addition, the team will use the high-speed experimental facilities of the National LambdaRail(tm) (NLR) network to connect the 2,200-processor System X supercomputer at Virginia Tech to the SC|05 Supercomputing showroom floor, creating the distributed, heterogeneous supercomputer dubbed "GreenGene."

For more information about this project, officially entitled "mpiBLAST on the GreenGene Distributed Supercomputer: Sequencing the NT Database Against the NT Database (An NT-Complete Problem)," see http://sc05.supercomputing.org/schedule/event_detail.php?evid=5303 online.

Los Alamos National Laboratory, a multidisciplinary research institution engaged in strategic science on behalf of national security, is operated by Los Alamos National Security, LLC, a team composed of Bechtel National, the University of California, The Babcock & Wilcox Company, and the Washington Division of URS for the Department of Energy's National Nuclear Security Administration.

Los Alamos enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.

