Research Abstracts
DOE Microbial Genome Program Report

Section 2: Functional and Computational Analysis
 

Pangenomic Microbial Comparisons by Subtractive Hybridization

Peter Agron, Lyndsay Radnedge, Evan Skowronski, Madison Macht, Jessica Wollard, Sylvia Chin, Aubree Hubbell, Marilyn Seymour, Christina Nocerino, and Gary Andersen

Biology and Biotechnology Research Program; Lawrence Livermore National Laboratory; 7000 East Ave.; Livermore, CA 94550
Andersen: 925/423-2525, Fax: /422-2282, andersen2@llnl.gov

Sequencing of whole genomes is reshaping microbiology. However, as more sequence information is generated, there will be increased sequence redundancy between closely related species or strains. Over time, obtaining new sequence information by whole-genome sequencing with current technology will become increasingly less cost-efficient. We are exploring the use of suppression subtractive hybridization (SSH) of total DNA as a means of focusing sequencing efforts on unique regions when a reference strain of known sequence is compared to a different isolate of the same species or genus. To rigorously examine this approach, two sequenced strains of Helicobacter pylori (J99 and 26695) were used as a model system to allow rapid determination and mapping of difference products based on sequencing alone.

Using high-throughput SSH methods, difference products can be rapidly cloned, sequenced, and then mapped by comparing the data to the H. pylori genome database. To increase the likelihood of amplifying difference products from any given region, several restriction enzymes were used in separate SSH experiments. We have obtained data from 2123 clones that reveal 427 (20%) unique sequences. Control subtractions with an Escherichia coli strain containing the transposon Tn5 against its isogenic parent showed a 270-fold enrichment for Tn5 sequences, demonstrating that SSH is highly effective. Current efforts are focused on (1) mapping difference products onto the relevant genome using the cross-match algorithm and Percent Identity Plots, (2) assessing coverage of difference regions by subtracted clones, (3) assessing the redundancy of this coverage, and (4) determining the reproducibility of SSH.


The Genome of the Extremely Radioresistant Bacterium Deinococcus radiodurans: Comparative Genomics

Kira S. Makarova,1,2 Eugene V. Koonin,3 L. Aravind,2 Kenneth W. Minton,1 Roman L. Tatusov,2 Y. I. Wolf,2 Owen White,3 and Michael J. Daly1

1Uniformed Services University of the Health Sciences; 4301 Jones Bridge Rd.; Bethesda, MD 20814-4799
301/295-3750, Fax: -1640, mdaly@mxb.usuhs.mil
2National Center for Biotechnology Information; National Institutes of Health; Bethesda, MD 20814
3The Institute for Genomic Research; Rockville, MD 20850

Extremophiles are nearly always defined with singular characteristics that allow existence within a singular extreme environment. The bacterium Deinococcus radiodurans qualifies as a polyextremophile, showing remarkable resistance to a range of damage caused by ionizing radiation, desiccation, ultraviolet radiation, oxidizing agents, and electrophilic mutagens. D. radiodurans is most famous for its extreme resistance to ionizing radiation; it not only can grow continuously in the presence of chronic radiation (6000 rad/hour), but it can survive acute exposures to gamma radiation that exceed 1.5 Mrad without lethality or induced mutation. These characteristics were the impetus for sequencing its genome and the ongoing development of its use for bioremediation of radioactive wastes.

Although it is known that these myriad resistance phenotypes stem from its efficient DNA repair processes, the mechanisms underlying this repair remain poorly understood. In this work we present an extensive comparative sequence analysis of the Deinococcus genome. Deinococcus is the first representative with a completely sequenced genome from a bacterial branch of extremophiles, the Thermus-Deinococcus group. Phylogenetic tree analysis, combined with the identification of several synapomorphies between Thermus and Deinococcus, supports the view that it is a very ancient branch localized in the vicinity of the bacterial tree root. Distinctive features of the Deinococcus genome, as well as features shared with other free-living bacteria, were revealed by comparing its proteome to a collection of clusters of orthologous groups of proteins (called COGs). Analysis of paralogs in Deinococcus has revealed some unique protein families. In addition, specific expansions of several protein families including phosphatases, proteases, acyl transferases, and MutT pyrophosphohydrolases were detected. Genes that potentially affect DNA repair and recombination were investigated in detail.

Some proteins appear to have been transferred horizontally from eukaryotes and are not present in other bacteria. For example, three proteins homologous to plant desiccation-resistance proteins were identified; these are particularly interesting because of the positive correlation between resistance to desiccation and resistance to radiation. Further, the D. radiodurans genome is very rich in repetitive sequences, namely IS-like transposons and small intergenic repeats. In combination, these observations suggest that several different biological mechanisms contribute to the multiple DNA repair-dependent phenotypes of this organism. The genetic mechanisms underlying the extreme radiation resistance of the organism are now being characterized experimentally using a newly developed system for analyzing gene expression patterns in D. radiodurans.


Protein Expression in Methanococcus jannaschii and Pyrococcus furiosus

Carol S. Giometti, S. L. Tollaksen, H. Lim,1 J. Yates,1 J. Holden,2 A. Lal Menon,2 G. Schut,2 M. W. W. Adams,2 C. Reich,3 and G. Olsen3

Center for Mechanistic Biology and Biotechnology; Argonne National Laboratory; 9700 S. Cass Ave.; Argonne, IL 60439
630/252-3839, Fax: -5517, csgiometti@anl.gov

1University of Washington; Seattle, WA 98195
2University of Georgia; Athens, GA 30602
3University of Illinois; Urbana, IL 61801

Complete genome sequences are now available for both Methanococcus jannaschii and Pyrococcus furiosus. The open reading frame (ORF) sequences from these completed genomes can be used to predict the proteins synthesized, but laboratory methods are needed to verify those predictions. Two-dimensional gel electrophoresis (2DE), coupled with mass spectrometry of peptides isolated from the gels, is being used to determine the constitutive expression of proteins from these two archaea and to explore the regulation of expression of nonconstitutive proteins. The most abundant proteins (i.e., those easily detectable by staining with Coomassie Blue R-250) have been isolated and analyzed from cells grown in minimal nutrient media. Using a combination of matrix-assisted laser desorption ionization (MALDI) and tandem mass spectrometry, 100 proteins expressed by M. jannaschii and 50 proteins expressed by P. furiosus have been related to specific ORFs in the respective genome sequences. The molecular weights and isoelectric points determined by protein positions in the 2DE patterns are compared with the ORF-predicted molecular weights and isoelectric points for each microbe. Numerous instances have been observed of multiple proteins with different molecular weights or isoelectric points being associated with the same ORF. Possible reasons for such multiplicity include the incomplete unfolding of these highly stable proteins prior to electrophoresis, the nondissociation of subunits, posttranslational modifications such as phosphorylation (multiple proteins with the same identity but different isoelectric points), or peptide cleavage (multiple proteins with the same identity but different molecular weights). Preliminary experiments to change the protein expression of these organisms by altering growth conditions have revealed significant quantitative changes in a small number of proteins visible in 2DE patterns.
Correlation of proteins expressed with specific ORFs is now focused on proteins showing quantitative changes in expression and on less abundant proteins. The observed protein abundances and changes in abundance from these proteomic studies could be useful for validation of protein-expression predictions based on ORFs.
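The comparison of ORF-predicted and gel-observed molecular weights described above can be illustrated with a minimal sketch. It is not the authors' software: the function names are hypothetical, the residue masses are the commonly tabulated average values, and the calculation ignores posttranslational modifications and signal-peptide cleavage. Isoelectric-point prediction is omitted because it depends on the choice of pKa set.

```python
# Approximate average residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water per chain accounts for the free N- and C-termini


def molecular_weight(protein):
    """Predicted average molecular weight (Da) of an unmodified protein."""
    seq = protein.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def relative_deviation(observed_da, predicted_da):
    """Signed fractional difference between gel-observed and predicted mass."""
    return (observed_da - predicted_da) / predicted_da


if __name__ == "__main__":
    predicted = molecular_weight("MKTAYIAKQR")  # made-up peptide, not real data
    print(f"predicted mass: {predicted:.1f} Da")
    print(f"deviation of a 1300 Da spot: {relative_deviation(1300, predicted):+.1%}")
```

A large negative deviation for one of several spots assigned to the same ORF would be consistent with peptide cleavage, while a large positive one might point to undissociated subunits; the sketch only quantifies the difference and does not decide between these explanations.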


Microbial Genome Annotation and Display

Frank Larimer, Doug Hyatt, Miriam Land, Richard Mural, Morey Parang, Manesh Shah, Jay Snoddy, and Edward Uberbacher

Computational Biosciences; Life Sciences Division; Oak Ridge National Laboratory; 1060 Commerce Park Dr.; Oak Ridge, TN 37830
865/574-1253, Fax: /241-1965, larimerfw@ornl.gov
http://compbio.ornl.gov and http://genome.ornl.gov/microbial/

Once the genome of an organism has been sequenced, portions that define features of biological importance must be identified and annotated. When the newly identified gene has a close relative already in DNA or protein sequence databases, gene finding in microorganisms is relatively straightforward. The genes tend to be simple, uninterrupted open reading frames (ORFs) that can be translated and compared with the database. 

The discovery of new genes without close relatives is more problematic. Although identifying gene-like ORFs is easy, it is very difficult to determine which represent real genes and which are merely statistical artifacts of the sequence. This is a serious problem in organisms with a high G+C content, where random ORFs can be abundant due to a lack of stop codons.
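The link between G+C content and spurious ORFs can be made concrete with a small back-of-the-envelope model. The sketch below is an illustration only and assumes the standard stop codons (TAA, TAG, TGA), independent bases, and equal frequencies of G and C and of A and T; real genomes deviate from all of these assumptions, and the function names are hypothetical.

```python
def stop_codon_probability(gc_content):
    """Probability that a random codon is TAA, TAG, or TGA, assuming
    independent bases with P(G) = P(C) and P(A) = P(T)."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    at = (1.0 - gc_content) / 2.0  # probability of A (and of T)
    gc = gc_content / 2.0          # probability of G (and of C)
    # P(TAA) = at^3; P(TAG) = P(TGA) = at^2 * gc
    return at ** 3 + 2.0 * at * at * gc


def prob_open_run(gc_content, n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1.0 - stop_codon_probability(gc_content)) ** n_codons


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        p = stop_codon_probability(gc)
        print(f"G+C={gc:.0%}  P(stop)={p:.4f}  "
              f"mean codons to first stop={1 / p:.1f}  "
              f"P(100 codons open)={prob_open_run(gc, 100):.4f}")
```

Because all three stop codons are A+T-rich, the per-codon stop probability falls as G+C content rises, so long stop-free stretches become more likely by chance alone under this model.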

A second issue in modeling microbial genes is accurate prediction of the start codon, which is complicated further by the use of minor start codons in addition to the universal AUG. An accurate accounting and description of genes in microbial genomes is essential in determining the existence of functional metabolic pathways and other aspects of whole-organism function. Compared to simpler gene-prediction methods using ORFs or single-coding measures, recently developed gene-finding systems show excellent performance in predicting coding genes and start sites, even for the shortest microbial genes. Such highly accurate systems are effective across the phylogenetic spectrum of organisms as an essential baseline of analysis from which much biological insight can be obtained.

Microbial genome sequencing is progressing rapidly. Apart from the twenty-odd published genomes, more than 100 are being sequenced, with plans to sequence hundreds or thousands more. Since every new genome informs those that preceded it, updating genome annotation is necessary to keep these resources relevant; and consistent procedures, tools, and methodology must be applied. The unique functions of each individual organism need to be documented as functions are placed in a recognized, consistent scheme.

We are now representing all completed microbial genomes in the Genome Channel and the Genome Catalog, providing comprehensive sequence-based views of genomes from a full genome display to the nucleotide sequence level. We have developed tools for comparative multiple-genome analysis that provide automated, regularly updated, comprehensive annotation of microbial genomes using consistent methodology for gene calling and feature recognition. The visual genome browser represents around 51,000 microbial GRAIL and 45,000 GenBank gene models. Precomputed BEAUTY searches are provided for all gene models, with links to original source material and additional search engines. Comprehensive representation of microbial genomes will require deeper annotation of structural features, including operon and regulon organization, promoter and ribosome binding-site recognition, repressor and activator binding-site calling, transcription terminators, and other functional elements. Sensor development is in progress to provide access to these features. Linkage and integration of the gene-protein-function catalog to phylogenetic, structural, and metabolic relationships also will be developed.

A draft analysis pipeline has been constructed to provide annotation for the Microbial Genome Program of the Joint Genome Institute. The first two draft sequences in the pipeline, with many more to come, are the Nitrosomonas europaea and Prochlorococcus marinus genomes. Multiple gene callers (Generation, Glimmer, and Critica) are used to generate a candidate gene model set. The conceptual translations of these gene models generate similarity-search results and protein family relationships; from these results, a metabolic framework is constructed and functional roles are assigned. Simple and complex repeats, tRNA genes, and other structural RNA genes also are identified. Annotation summaries are available through the JGI microbial genomics Web site; in addition, draft results are being integrated into the interactive display schemes of the Genome Channel and Catalog.


WIT2: An Integrated System for Genetic Sequence Analysis and Metabolic Reconstruction

Ross Overbeek,1,2 Gordon Pusch,1,2 Mark D'Souza,1 Evgeni Selkov Jr.,1,2 Evgeni Selkov,1,2 and Natalia Maltsev1

1Mathematics and Computer Science Division; Argonne National Laboratory, MCS-221; 9700 S. Cass Ave.; Argonne, IL 60439
2Integrated Genomics Inc.
Maltsev: 630/252-5195, Fax: -5986, maltsev@mcs.anl.gov
http://wit.mcs.anl.gov/WIT2

The WIT2 system was designed and implemented to support genetic sequence and comparative analysis of sequenced genomes as well as metabolic reconstructions from the sequence data. It now contains data from 38 distinct genomes. WIT2 provides access to thoroughly annotated genomes within a framework of metabolic reconstructions connected to the sequence data; protein alignments and phylogenetic trees; and data on gene clusters, potential operons, and functional domains. We believe that the parallel analysis of a large number of phylogenetically diverse genomes can add a great deal to our understanding of the higher-level functional subsystems and physiology of the organisms. The unique features of WIT2 include the following: (1) WIT2 is based on the unique EMP-MPW collection of enzymes and metabolic pathways developed by E. Selkov and colleagues; this collection contains extensive information on enzymology and metabolism of different organisms. (2) WIT2 allows researchers to perform interactive genetic sequence analysis within a framework of metabolic reconstructions and to maintain user models of the organism's functionality. (3) WIT2 provides access to a set of Web-based and original batch tools that offer extensible query access against the data. (4) WIT2 supports both shared and nonshared annotation of features and the maintenance of multiple models of the metabolism for each organism. (5) WIT2 supports metabolic reconstructions from expressed sequence tag data.


Microbial Proteomics at Pacific Northwest National Laboratory

Richard D. Smith, Ljiljana Pasa-Tolic, Mary S. Lipton, Pamela K. Jensen, Gordon A. Anderson, and Timothy D. Veenstra

Environmental Molecular Sciences Laboratory, MS K8-98; Pacific Northwest National Laboratory; P.O. Box 999; Richland, WA 99352
509/376-0723, Fax: -7722, dick.smith@pnl.gov

Bacterial strains such as Shewanella putrefaciens MR-1 are key organisms in the bioremediation of metals due to their ability to enzymatically reduce and precipitate a diverse range of heavy metals and radionuclides. Additionally, Deinococcus radiodurans is an attractive candidate for bioremediation because of its unique ability to survive exceedingly high doses of ionizing radiation. An improved understanding of their enzymatic pathways is important in refining the unique capabilities of these organisms for bioremediation. As a first step, an organism's proteome must be characterized completely. The proteome is the name given to the dynamic array of proteins expressed by a genome. A single genome can exhibit many different proteomes depending on the stage in the cell cycle; cell differentiation; response to such environmental conditions as nutrients, temperature, and stress; and the manifestation of disease states. Although the availability of full genomic reference sequences provides a set of road maps of possibilities and the measurement of expressed RNAs tells us what might happen, the proteome is the key that tells us what really happens. Therefore, the study of proteomes under well-defined conditions can provide a better understanding of complex biological processes, requiring faster and more sensitive capabilities for the characterization of microbial protein constituents.

We currently are developing technologies that integrate and refine protein separation and digestion processes with advanced Fourier transform ion cyclotron resonance (FTICR) mass spectrometric methods. In some of these studies, the cell's protein complement will be digested with a protease and the resulting peptides will be analyzed by capillary liquid chromatography-mass spectrometry (LC-MS). The use of tandem mass spectrometry (MS/MS) provides additional sequence information that, when combined with the mass of the parent peptide, can be used to search existing databases. This results in peptide identification, which in turn is used to identify the parent protein. Additionally, we are extending this mass spectrometric technology to allow precise quantitation of changes in the protein complement upon perturbation of the microbial environment. This technology, based on the use of stable-isotope labeling, allows the creation of "comparative displays" for the expression of many proteins simultaneously. Two versions of each protein are generated and simultaneously analyzed to study changes in expression (i.e., repression or induction) for hundreds to thousands of proteins. These combined technologies are planned to be developed and demonstrated in a D. radiodurans pilot project that also would follow changes in the proteome after exposure to ionizing radiation.
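The step from identified peptide to parent protein can be sketched as a lookup against an in-silico digest of the predicted proteome. The abstract does not name the protease; the sketch below assumes trypsin with the commonly used rule of cleaving after K or R unless the next residue is P, and assumes complete digestion with no missed cleavages. The function names and the two example sequences are hypothetical.

```python
def tryptic_peptides(protein):
    """Split a protein after each K or R that is not followed by P."""
    seq = protein.strip().upper()
    peptides = []
    start = 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R
    return peptides


def build_peptide_index(proteins):
    """Map each predicted peptide to the set of protein IDs that contain it.

    `proteins` is a dict of {protein_id: amino-acid sequence}.
    """
    index = {}
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            index.setdefault(peptide, set()).add(protein_id)
    return index


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    proteome = {"orf_0001": "MAKRPGKLLR", "orf_0002": "MAKTTWK"}
    index = build_peptide_index(proteome)
    for peptide in ("RPGK", "MAK"):
        print(peptide, "->", sorted(index.get(peptide, ())))
```

A peptide that maps to a single entry identifies its parent protein directly, whereas a shared peptide (here `MAK`) is ambiguous on its own, which is one reason the additional sequence information from tandem mass spectrometry is useful.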


Protein Domain Dissection and Functional Identification

Temple F. Smith, Sophia Zarakhovich, and Hongxian He

BioMolecular Engineering Research Center; College of Engineering; Boston University; 36 Cummington Street; Boston, MA 02215
617/353-7123, tsmith@darwin.bu.edu

Using various multialignment and conserved pattern tools (e.g., PSI-BLAST, BLOCKS, Pfam, and PIMA II), protein domains as "evolutionary modules" generally can be identified. Using a set of 20 completely sequenced microbial genomes (including yeast), we have generated over 1300 profiles representing diagnostic sequence domains. The majority either cover the entire length of the proteins matching the profile or locate a sequence region clearly identifiable in multiple distinct domain contexts. We are addressing the relationship between such sequence domains and structural domains as well as problems involved in associating these domains to a given biochemical function and the cellular role played by that function.

In collaboration with Julio Collado Vides (CIFN, Mexico), we are investigating the potential for coordinate regulation among neighboring genes in various biochemical pathways. We began with sets of genes in Escherichia coli or some other bacteria or archaea organized in operons. Next, each operon set is being examined in yeast and Caenorhabditis elegans for shared regulatory sequences. Initial work led to the identification of two different types of eukaryotic operon-equivalent organizations in yeast and to our 1998 publication in Microbial and Comparative Genomics.


Genome Sequencing

Carl R. Woese and Gary J. Olsen

Department of Microbiology; University of Illinois; B103 Chemical and Life Sciences Laboratory; 601 S. Goodwin Ave.; Urbana, IL 61801
carl@ninja.life.uiuc.edu, gary@phylo.life.uiuc.edu

We prepared a sequencing-quality genomic DNA library for Methanococcus maripaludis, an organism that was being considered for sequencing as part of DOE's Microbial Genome Program (MGP). We have done some partial sequencing of clones from this library as part of a project to use comparative analysis to elucidate the differences between related high- and low-temperature proteins (this sequencing was partially supported by funding from the National Aeronautics and Space Administration).

We also prepared a sequencing-quality genomic DNA library for Giardia lamblia, a eukaryotic microorganism. This permitted Mitchell Sogin (Marine Biology Laboratory) to generate preliminary genome sequencing data for a successful grant application to the National Institutes of Health.

The sequence data resulting from our participation in MGP have stimulated additional research by our group and others. More specifically:

1. We continue to make new gene identifications through comparative analyses of sequenced genomes.

2. We have experimentally verified the function of some novel RNA methylase genes.

3. We have collaborated in the experimental identification of a novel, archaeal S-adenosyl methionine synthetase.

4. We have cloned and expressed RNA polymerase genes and transcription-initiation factors from archaea and have experimentally identified new protein-protein interactions in the transcription apparatus.

5. We have supplied 27 research groups with genomic DNA and cell mass from organisms sequenced as part of the MGP.

6. We are contributing ideas formulated as part of an MGP proposal to a successful collaboration with Carol Giometti (Argonne National Laboratory) to study the proteomes of Methanococcus jannaschii and Pyrococcus furiosus.

7. We have worked with the research group of Ross Overbeek (Argonne National Laboratory) on the development of his WIT and WIT2 environments for genome analysis and comparison and have used the WIT2 system to help with our analyses.


A Pilot Study to Develop and Demonstrate a High-Throughput New Approach to Characterizing Total Cellular Proteins Expressed by Deinococcus radiodurans R1

Kwong-Kwok Wong, Richard D. Smith, Ljiljana Pasa-Tolic, and Owen White1

Pacific Northwest National Laboratory; P.O. Box 999; Richland, WA 99352
509/376-5097, Fax: -6767, kk.wong@pnl.gov

1The Institute for Genomic Research; Rockville, MD 20850
www.tigr.org

Deinococcus radiodurans, with its exceptional radiation resistance, was once thought to grow within nuclear reactors, but further studies now suggest that the deinococci are soil microorganisms. Besides its resistance to radiation, D. radiodurans also has extreme resistance to cellular and genetic damage that occurs in other organisms after exposure to many genotoxic chemicals, oxidative damage, high levels of UV radiation, and desiccation. Thus, D. radiodurans is a potential candidate to be engineered for degradation of hazardous chemicals at mixed-waste sites, and it is important to understand at the molecular level how the bacterium can adapt to such stressful environments. The Institute for Genomic Research has completely sequenced the D. radiodurans genome, enabling further functional analysis of putative genes encoded by the bacterium.

In a pilot study, we have established a "2-D virtual gel" method and demonstrated that this new methodology is applicable to characterizing proteins expressed by D. radiodurans. Although numerous facets of the technology need significant refinement, we have generated preliminary results that are a major step beyond any "proteome" measurements made to date in terms of speed and sensitivity. In a single capillary isoelectric focusing (CIEF) separation with online FTICR mass spectrometry, we have detected at least 800 different proteins (based on the number of discrete molecular weight species above 5 kDa). This single experiment (requiring less than 30 min) uses about 250 ng of total protein, about 20 to 30 times less than that of a typical 2-D polyacrylamide gel electrophoresis experiment. This corresponds to low femtomole quantities for the average detected protein (with some proteins being detected at levels well into the attomole range). The potential exists to greatly improve the methodology's sensitivity, thereby opening up the detection of very low copy number regulatory proteins.
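The "low femtomole" figure can be checked with simple arithmetic. The sketch below is only an order-of-magnitude estimate: it assumes the 250 ng is spread evenly by mass over the 800 detected species and takes 30 kDa as a representative protein mass. Neither assumption comes from the abstract, and the function name is hypothetical.

```python
def average_femtomoles(total_protein_ng, n_proteins, mean_mass_da):
    """Average amount per protein (fmol), assuming the total mass is
    divided equally among n_proteins of identical molecular mass."""
    grams_each = total_protein_ng * 1e-9 / n_proteins
    moles_each = grams_each / mean_mass_da  # Da is numerically g/mol
    return moles_each * 1e15


if __name__ == "__main__":
    fmol = average_femtomoles(total_protein_ng=250, n_proteins=800,
                              mean_mass_da=30_000)  # 30 kDa is an assumption
    print(f"about {fmol:.1f} fmol per detected protein")
    # Sample load implied for a typical 2-D gel at 20 to 30 times more protein.
    print(f"2-D gel load: {250 * 20 / 1000:.1f} to {250 * 30 / 1000:.1f} micrograms")
```

Under these assumptions the average comes to roughly 10 fmol per protein, which is in line with the stated range; since real abundances are far from uniform, the least abundant detected species would sit well below that average.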

Related to these efforts, we also have developed a general targeted mutagenesis method based on D. radiodurans genomic information to define gene function. Using the targeted mutagenesis method, we have shown that both catalase (katA) and superoxide dismutase (sodA) genes are required for extreme radiation resistance. We are applying the 2-D virtual gel method to analyze proteins expressed by different mutants.

Characterization of expressed proteins by 2-D virtual gel and further targeted mutagenesis analysis will provide a link to the function of the genomic data's predicted open reading frames (ORFs) and is expected to identify new small genes in the size range at which identifying ORFs is problematic. The resulting information can identify genes of interest and facilitate detailed biochemical and genetic experiments to gain a global understanding of the organism for energy, environmental, and industrial applications. The developed 2-D virtual gel method will be applicable to any sequenced organism. This project was funded initially as a pilot study for 2 years, but we expect research to continue well beyond that period.

The online presentation of this 2000 publication is a special feature of the Human Genome Project Information Web site.