Research Abstracts
DOE Microbial Genome Program
Report
Section 2: Functional and Computational
Analysis
Pangenomic Microbial Comparisons by Subtractive Hybridization
Peter Agron, Lyndsay Radnedge, Evan Skowronski, Madison Macht, Jessica
Wollard, Sylvia Chin, Aubree Hubbell, Marilyn Seymour, Christina Nocerino,
and Gary Andersen
Biology and Biotechnology Research Program; Lawrence Livermore National
Laboratory; 7000 East Ave.; Livermore, CA 94550
Andersen: 925/423-2525, Fax: /422-2282, andersen2@llnl.gov
Sequencing of whole genomes is reshaping microbiology. However, as more
sequence information is generated, there will be increased sequence redundancy
between closely related species or strains. Over time, obtaining new
sequence information by whole-genome sequencing with current technology
will become increasingly less cost-efficient. We
are exploring the use of suppression subtractive hybridization (SSH) of
total DNA as a means of focusing sequencing efforts on unique regions when
a reference strain of known sequence is compared to a different isolate
of the same species or genus. To rigorously examine this approach, two
sequenced strains of Helicobacter pylori (J99 and 26695)
were used as a model system to allow rapid determination and mapping of
difference products based on sequencing alone.
Using high-throughput SSH methods, difference products can be rapidly
cloned, sequenced, and then mapped by comparing the data to the H. pylori
genome database. To increase the likelihood of amplifying difference products
from any given region, several restriction enzymes were used in separate
SSH experiments. We have obtained data from 2123 clones that reveal 427
(20%) unique sequences. Control subtractions with an Escherichia coli
strain containing the transposon Tn5 against its isogenic parent showed
a 270-fold enrichment for Tn5 sequences, demonstrating that SSH is highly
effective. Current efforts are focused on (1) mapping difference products
onto the relevant genome using the cross-match algorithm and Percent Identity
Plots, (2) assessing coverage of difference regions by subtracted clones,
(3) assessing the redundancy of this coverage, and (4) determining the
reproducibility of SSH.
The Genome of the Extremely Radioresistant Bacterium
Deinococcus radiodurans: Comparative Genomics
Kira S. Makarova,1,2 Eugene V. Koonin,3 L. Aravind,2
Kenneth W. Minton,1 Roman L. Tatusov,2 Y. I. Wolf,2
Owen White,3 and Michael J. Daly1
1Uniformed Services University of the Health Sciences; 4301
Jones Bridge Rd.; Bethesda, MD 20814-4799
301/295-3750, Fax: -1640, mdaly@mxb.usuhs.mil
2National Center for Biotechnology Information; National
Institutes of Health; Bethesda, MD 20814
3The Institute for Genomic Research; Rockville, MD 20850
Extremophiles are nearly always defined with singular characteristics
that allow existence within a singular extreme environment. The bacterium
Deinococcus radiodurans qualifies as a polyextremophile, showing
remarkable resistance to a range of damage caused by ionizing radiation,
desiccation, ultraviolet radiation, oxidizing agents, and electrophilic
mutagens. D. radiodurans is most famous for its extreme resistance
to ionizing radiation; it not only can grow continuously in the presence
of chronic radiation (6000 rad/hour), but it can survive acute exposures
to gamma radiation that exceed 1.5 Mrad without lethality or induced mutation.
These characteristics were the impetus for sequencing its genome and the
ongoing development of its use for bioremediation of radioactive wastes.
Although it is known that these myriad resistance phenotypes stem from
its efficient DNA repair processes, the mechanisms underlying this repair
remain poorly understood. In this work we present an extensive comparative
sequence analysis of the Deinococcus genome. Deinococcus
is the first representative with a completely sequenced genome from a bacterial
branch of extremophiles, the Thermus-Deinococcus group. Phylogenetic
tree analysis, combined with the identification of several synapomorphies
between Thermus and Deinococcus, supports the view that it is a very
ancient branch localized in the vicinity of the bacterial tree root. Distinctive
features of the Deinococcus genome, as well as features shared
with other free-living bacteria, were revealed by comparing its proteome
to a collection of clusters of orthologous groups of proteins (called COGs).
Analysis of paralogs in Deinococcus has revealed some unique protein
families. In addition, specific expansions of several protein families
including phosphatases, proteases, acyl transferases, and MutT pyrophosphohydrolases
were detected. Genes that potentially affect DNA repair and recombination
were investigated in detail.
Some proteins appear to have been transferred horizontally from eukaryotes
and are not present in other bacteria. For example, three proteins homologous
to plant desiccation-resistance proteins were identified; these are particularly
interesting because of the positive correlation between resistance to desiccation
and resistance to radiation. Further, the D. radiodurans genome is very rich in
repetitive sequences, namely IS-like transposons and small intergenic repeats.
In combination, these observations suggest that several different biological
mechanisms contribute to the multiple DNA repair-dependent phenotypes of
this organism. The genetic mechanisms underlying the extreme radiation
resistance of the organism are now being characterized experimentally using
a newly developed system for analyzing gene expression patterns in D. radiodurans.
Protein Expression in Methanococcus jannaschii
and Pyrococcus furiosus
Carol S. Giometti, S. L. Tollaksen, H. Lim,1 J. Yates,1
J. Holden,2 A. Lal Menon,2 G. Schut,2 M.
W. W. Adams,2 C. Reich,3 and G. Olsen3
Center for Mechanistic Biology and Biotechnology; Argonne National Laboratory;
9700 S. Cass Ave.; Argonne, IL 60439
630/252-3839, Fax: -5517, csgiometti@anl.gov
1University of Washington; Seattle, WA 98195
2University of Georgia; Athens, GA 30602
3University of Illinois; Urbana, IL 61801
Complete genome sequences are now available for both Methanococcus
jannaschii and Pyrococcus furiosus. The open reading frame
(ORF) sequences from these completed genomes can be used to predict the
proteins synthesized, but laboratory methods are needed to verify those
predictions. Two-dimensional gel electrophoresis (2DE), coupled with mass
spectrometry of peptides isolated from the gels, is being used to determine
the constitutive expression of proteins from these two archaea and to explore
the regulation of expression of nonconstitutive proteins. The most abundant
proteins (i.e., those easily detectable by staining with Coomassie Blue
R250) have been isolated and analyzed from cells grown in minimal nutrient
media. Using a combination of matrix-assisted laser desorption ionization
(MALDI) and tandem mass spectrometry, 100 proteins expressed by M. jannaschii
and 50 proteins expressed by P. furiosus have been related to specific
ORFs in the respective genome sequences. The molecular weights and isoelectric
points determined by protein positions in the 2DE patterns are compared
with the ORF-predicted molecular weights and isoelectric points for each
microbe. Numerous instances have been observed of multiple proteins with
different molecular weights or isoelectric points being associated with
the same ORF. Possible reasons for such multiplicity include the incomplete
unfolding of these highly stable proteins prior to electrophoresis, the
nondissociation of subunits, posttranslational modifications such as phosphorylation
(multiple proteins with the same identity but different isoelectric points),
or peptide cleavage (multiple proteins with the same identity but different
molecular weights). Preliminary experiments to change the protein expression
of these organisms by altering growth conditions have revealed significant
quantitative changes in a small number of proteins visible in 2DE patterns.
Correlation of proteins expressed with specific ORFs is now focused on
proteins showing quantitative changes in expression and on less abundant
proteins. The observed protein abundances and changes in abundance from
these proteomic studies could be useful for validation of protein-expression
predictions based on ORFs.
Microbial Genome Annotation and Display
Frank Larimer, Doug Hyatt, Miriam Land, Richard Mural, Morey Parang,
Manesh Shah, Jay Snoddy, and Edward Uberbacher
Computational Biosciences; Life Sciences Division; Oak Ridge National
Laboratory; 1060 Commerce Park Dr.; Oak Ridge, TN 37830
865/574-1253, Fax: /241-1965, larimerfw@ornl.gov
http://compbio.ornl.gov and
http://genome.ornl.gov/microbial/
Once the genome of an organism has been sequenced, portions that define
features of biological importance must be identified and annotated. When
the newly identified gene has a close relative already in DNA or protein
sequence databases, gene finding in microorganisms is relatively straightforward.
The genes tend to be simple, uninterrupted open reading frames (ORFs) that
can be translated and compared with the database.
The discovery of new genes without close relatives is more problematic.
Although identifying gene-like ORFs is easy, it is very difficult to determine
which represent real genes and which are merely statistical artifacts of
the sequence. This is a serious problem in organisms with a high G+C content,
where random ORFs can be abundant due to a lack of stop codons (the stop
codons UAA, UAG, and UGA are rich in A and U).
A second issue in modeling microbial genes is accurate prediction of
the start codon, which is complicated further by the use of minor start
codons in addition to the universal AUG. An accurate accounting and description
of genes in microbial genomes is essential in determining the existence
of functional metabolic pathways and other aspects of whole-organism function.
Compared to simpler gene-prediction methods using ORFs or single-coding
measures, recently developed gene-finding systems show excellent performance
in predicting coding genes and start sites, even for the shortest microbial
genes. Such highly accurate systems are effective across the phylogenetic
spectrum of organisms as an essential baseline of analysis from which much
biological insight can be obtained.
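As a rough illustration of the simple ORF-based baseline mentioned above (not
any of the gene-finding systems discussed in this abstract), the sketch below
scans the three forward reading frames of a DNA string for start-to-stop ORFs.
The function name find_orfs, the min_codons parameter, and the choice of GTG
and TTG as minor start codons are assumptions made for illustration only; the
sequence is written in the DNA alphabet, so AUG appears as ATG.

```python
# Naive forward-strand ORF scan. All names are illustrative assumptions,
# not part of any tool named in the text.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# ATG plus two minor start codons often seen in bacteria (assumed set).
START_CODONS = {"ATG", "GTG", "TTG"}


def find_orfs(seq, min_codons=3):
    """Return a list of (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the first base of the start codon, end is the
    index just past the stop codon, and min_codons counts the stop codon.
    Only the first start codon after the previous stop is used in each frame.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("ATGAAATAA"))
```

Because a scan like this accepts every start-to-stop stretch, it reports many
spurious ORFs in sequences where stop codons are rare, which is why it serves
only as a baseline and why choosing among candidate start codons remains a
separate problem.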
Microbial genome sequencing is progressing rapidly. Apart from the twenty-odd
published genomes, more than 100 are being sequenced, with plans to sequence
hundreds or thousands more. Since every new genome informs those that preceded
it, updating genome annotation is necessary to keep these resources relevant;
and consistent procedures, tools, and methodology must be applied. The
unique functions of each individual organism need to be documented as functions
are placed in a recognized, consistent scheme.
We are now representing all completed microbial genomes in the Genome
Channel and the Genome Catalog, providing comprehensive sequence-based views
of genomes from a full genome display to the nucleotide sequence level.
We have developed tools for comparative multiple-genome analysis that provide
automated, regularly updated, comprehensive annotation of microbial genomes
using consistent methodology for gene calling and feature recognition.
The visual genome browser represents around 51,000 microbial GRAIL and
45,000 GenBank gene models. Precomputed BEAUTY searches are provided for
all gene models, with links to original source material and additional
search engines. Comprehensive representation of microbial genomes will
require deeper annotation of structural features, including operon and
regulon organization, promoter and ribosome binding-site recognition, repressor
and activator binding-site calling, transcription terminators, and other
functional elements. Sensor development is in progress to provide access
to these features. Linkage and integration of the gene-protein-function
catalog to phylogenetic, structural, and metabolic relationships also will
be developed.
A draft analysis pipeline has been constructed to provide annotation
for the Microbial Genome Program of the Joint Genome Institute. The first
two draft sequences in the pipeline, with many more to come, are the Nitrosomonas
europaea and Prochlorococcus marinus genomes. Multiple gene
callers (Generation, Glimmer, and Critica) are used to generate a candidate
gene model set. The conceptual translations of these gene models generate
similarity-search results and protein family relationships; from these
results, a metabolic framework is constructed and functional roles are
assigned. Simple and complex repeats, tRNA genes, and other structural
RNA genes also are identified. Annotation summaries are available through
the JGI microbial genomics Web site; in addition, draft results are being
integrated into the interactive display schemes of the Genome Channel and
Catalog.
WIT2: An Integrated System for Genetic Sequence Analysis
and Metabolic Reconstruction
Ross Overbeek,1,2 Gordon Pusch,1,2 Mark D'Souza,1
Evgeni Selkov Jr.,1,2 Evgeni Selkov,1,2 and Natalia
Maltsev1
1Mathematics and Computer Science Division; Argonne National
Laboratory, MCS-221; 9700 S. Cass Ave.; Argonne, IL 60439
2Integrated Genomics Inc.
Maltsev: 630/252-5195, Fax: -5986, maltsev@mcs.anl.gov
http://wit.mcs.anl.gov/WIT2
The WIT2 system was designed and implemented to support genetic sequence
and comparative analysis of sequenced genomes as well as metabolic reconstructions
from the sequence data. It now contains data from 38 distinct genomes. WIT2
provides access to thoroughly annotated genomes within a framework of metabolic
reconstructions connected to the sequence data; protein alignments and
phylogenetic trees; and data on gene clusters, potential operons, and functional
domains. We believe that the parallel analysis of a large number of phylogenetically
diverse genomes can add a great deal to our understanding of the higher-level
functional subsystems and physiology of the organisms. The unique features
of WIT2 include the following: (1) WIT2 is based on the unique EMP-MPW
collection of enzymes and metabolic pathways developed by E. Selkov and
colleagues; this collection contains extensive information on enzymology
and metabolism of different organisms. (2) WIT2 allows researchers to perform
interactive genetic sequence analysis within a framework of metabolic reconstructions
and to maintain user models of the organism's functionality. (3) WIT2 provides
access to a set of Web-based and original batch tools that offer extensible
query access against the data. (4) WIT2 supports both shared and nonshared
annotation of features and the maintenance of multiple models of the metabolism
for each organism. (5) WIT2 supports metabolic reconstructions from expressed
sequence tag data.
Microbial Proteomics at Pacific Northwest National Laboratory
Richard D. Smith, Ljiljana Pasa-Tolic, Mary S. Lipton, Pamela K. Jensen,
Gordon A. Anderson, and Timothy D. Veenstra
Environmental Molecular Sciences Laboratory, MS K8-98; Pacific Northwest
National Laboratory; P.O. Box 999; Richland, WA 99352
509/376-0723, Fax: -7722, dick.smith@pnl.gov
Bacterial strains such as Shewanella putrefaciens MR-1 are key
organisms in the bioremediation of metals due to their ability to enzymatically
reduce and precipitate a diverse range of heavy metals and radionuclides.
Additionally, Deinococcus radiodurans is an attractive candidate
for bioremediation because of its unique ability to survive exceedingly
high doses of ionizing radiation. An improved understanding
of their enzymatic pathways is important for refining the unique capabilities
of these organisms for bioremediation. As a first step, an organism's proteome
must be characterized completely. The proteome is the name given to the
dynamic array of proteins expressed by a genome. A single genome can exhibit
many different proteomes depending on the stage in the cell cycle; cell
differentiation; response to such environmental conditions as nutrients,
temperature, and stress; and the manifestation of disease states. Although
the availability of full genomic reference sequences provides a set of
road maps of possibilities and the measurement of expressed RNAs tells
us what might happen, the proteome is the key that tells us what really
happens. Therefore, the study of proteomes under well-defined conditions
can provide a better understanding of complex biological processes, requiring
faster and more sensitive capabilities for the characterization of microbial
protein constituents.
We currently are developing technologies that integrate and refine protein
separation and digestion processes with advanced Fourier transform ion
cyclotron resonance (FTICR) mass spectrometric methods. In some of these
studies, the cell's protein complement will be digested with a protease
and the resulting peptides will be analyzed by capillary liquid chromatography-mass
spectrometry (LC-MS). The use of tandem mass spectrometry (MS/MS) provides
additional sequence information that, when combined with the mass of the
parent peptide, can be used to search existing databases. This results
in peptide identification, which in turn is used to identify the parent
protein. Additionally, we are extending this mass spectrometric technology
to allow precise quantitation of changes in the protein complement upon
perturbation of the microbial environment. This technology, based on the
use of stable-isotope labeling, allows the creation of "comparative displays"
for the expression of many proteins simultaneously. Two versions of each
protein are generated and simultaneously analyzed to study changes in expression
(i.e., repression or induction) for hundreds to thousands of proteins.
We plan to develop and demonstrate these combined technologies
in a D. radiodurans pilot project that also would follow changes
in the proteome after exposure to ionizing radiation.
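The digest-then-search idea described above can be sketched as follows. This
is a minimal illustration under stated assumptions, not the laboratory's
actual software: trypsin is assumed as the protease (the text says only "a
protease"), the cleavage rule is the common simplification "after K or R, but
not before P," and the names tryptic_peptides, peptide_mass, match_mass, and
tol_ppm are hypothetical. Residue masses are standard monoisotopic values.

```python
# In-silico digestion and parent-mass lookup (illustrative sketch only).
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_peptides(protein):
    """Split after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def match_mass(observed, database, tol_ppm=10.0):
    """Return (protein_id, peptide) pairs whose mass is within tol_ppm."""
    hits = []
    for protein_id, sequence in database.items():
        for pep in tryptic_peptides(sequence):
            if abs(peptide_mass(pep) - observed) / observed * 1e6 <= tol_ppm:
                hits.append((protein_id, pep))
    return hits


if __name__ == "__main__":
    toy_db = {"p1": "AKGGR", "p2": "MMMK"}  # made-up sequences
    print(match_mass(peptide_mass("GGR"), toy_db))
```

In practice a parent mass alone often matches several peptides, which is why
the additional sequence information from tandem mass spectrometry is combined
with it before a protein is assigned.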
Protein Domain Dissection and Functional Identification
Temple F. Smith, Sophia Zarakhovich, and Hongxian He
BioMolecular Engineering Research Center; College of Engineering; Boston
University; 36 Cummington Street; Boston, MA 02215
617/353-7123, tsmith@darwin.bu.edu
Using various multialignment and conserved pattern tools (e.g., PSI-BLAST,
BLOCKS, Pfam, and pimaII), protein domains as "evolutionary modules" generally
can be identified. Using a set of 20 completely sequenced microbial genomes
(including yeast), we have generated over 1300 profiles representing diagnostic
sequence domains. The majority either cover the entire length of the proteins
matching the profile or locate a sequence region clearly identifiable in
multiple distinct domain contexts. We are addressing the relationship between
such sequence domains and structural domains as well as problems involved
in associating these domains to a given biochemical function and the cellular
role played by that function.
In collaboration with Julio Collado-Vides (CIFN, Mexico), we are investigating
the potential for coordinate regulation among neighboring genes in various
biochemical pathways. We began with sets of genes in Escherichia coli
or some other bacteria or archaea organized in operons. Next, each operon
set is being examined in yeast and Caenorhabditis elegans for shared
regulatory sequences. Initial work led to the identification of two different
types of eukaryotic operon-equivalent organizations in yeast and to our
1998 publication in Microbial and Comparative Genomics.
Genome Sequencing
Carl R. Woese and Gary J. Olsen
Department of Microbiology; University of Illinois; B103 Chemical and
Life Sciences Laboratory; 601 S. Goodwin Ave.; Urbana, IL 61801
carl@ninja.life.uiuc.edu,
gary@phylo.life.uiuc.edu
We prepared a sequencing-quality genomic DNA library for Methanococcus
maripaludis, an organism that was being considered for sequencing as
part of DOE's Microbial Genome Program (MGP). We have done some partial
sequencing of clones from this library as part of a project to use comparative
analysis to elucidate the differences between related high- and low-temperature
proteins (this sequencing was partially supported by funding from the National
Aeronautics and Space Administration).
We also prepared a sequencing-quality genomic DNA library for Giardia
lamblia, a eukaryotic microorganism. This permitted Mitchell Sogin
(Marine Biological Laboratory) to generate preliminary genome sequencing data
for a successful grant application to the National Institutes of Health.
The sequence data resulting from our participation in MGP have stimulated
additional research by our group and others. More specifically:
1. We continue to make new gene identifications through comparative
analyses of sequenced genomes.
2. We have experimentally verified the function of some novel RNA methylase
genes.
3. We have collaborated in the experimental identification of a novel,
archaeal S-adenosyl methionine synthetase.
4. We have cloned and expressed RNA polymerase genes and transcription-initiation
factors from archaea and have experimentally identified new protein-protein
interactions in the transcription apparatus.
5. We have supplied 27 research groups with genomic DNA and cell mass
from organisms sequenced as part of the MGP.
6. We are contributing ideas formulated as part of an MGP proposal to
a successful collaboration with Carol Giometti (Argonne National Laboratory)
to study the proteomes of Methanococcus jannaschii and Pyrococcus
furiosus.
7. We have worked with the research group of Ross Overbeek (Argonne
National Laboratory) on the development of his WIT and WIT2 environments
for genome analysis and comparison and have used the WIT2 system to help
with our analyses.
A Pilot Study to Develop and Demonstrate a High-Throughput
New Approach to Characterizing Total Cellular Proteins Expressed by Deinococcus
radiodurans R1
Kwong-Kwok Wong, Richard D. Smith, Ljiljana Pasa-Tolic, and Owen
White1
Pacific Northwest National Laboratory; P.O. Box 999; Richland, WA 99352
509/376-5097, Fax: -6767, kk.wong@pnl.gov
1The Institute for Genomic Research; Rockville, MD 20850
www.tigr.org
Deinococcus radiodurans, with its exceptional radiation resistance,
was once thought to grow within nuclear reactors, but further studies now
suggest that the deinococci are soil microorganisms. Besides its resistance
to radiation, D. radiodurans also has extreme resistance to cellular
and genetic damage that occurs in other organisms after exposure to many
genotoxic chemicals, oxidative damage, high levels of UV radiation, and
desiccation. Thus, D. radiodurans is a potential candidate to be
engineered for degradation of hazardous chemicals at mixed-waste sites,
and it is important to understand at the molecular level how the bacterium
can adapt to such stressful environments. The Institute for Genomic Research
has completely sequenced the D. radiodurans genome, enabling further
functional analysis of putative genes encoded by the bacterium.
In a pilot study, we have established a "2-D virtual gel" method and
demonstrated that this new methodology is applicable to characterizing
proteins expressed by D. radiodurans. Although numerous facets of
the technology need significant refinement, we have generated preliminary
results that are a major step beyond any "proteome" measurements made to
date in terms of speed and sensitivity. In a single capillary isoelectric
focusing (CIEF) separation with online FTICR mass spectrometry, we have
detected at least 800 different proteins (based on the number of discrete
molecular weight species above 5 kDa). This single experiment (requiring
less than 30 min) uses about 250 ng of total protein, about 20 to 30 times
less than that of a typical 2-D polyacrylamide gel electrophoresis experiment.
This corresponds to low femtomole quantities for the average detected protein
(with some proteins being detected at levels well into the attomole range).
The potential exists to greatly improve the methodology's sensitivity,
thereby opening up the detection of very low copy number regulatory proteins.
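The sensitivity figures above can be checked with a short back-of-envelope
calculation. The sketch assumes, purely for illustration, that the 250 ng is
spread evenly over the 800 detected species and that the mean molecular weight
is 30 kDa; neither number comes from the abstract, and the function names are
hypothetical.

```python
# Back-of-envelope check of the quoted sensitivity (illustrative only).

def mean_fmol_per_protein(total_ng, n_proteins, mean_mw_da):
    """Average amount per protein in femtomoles, assuming an even split."""
    grams_each = total_ng * 1e-9 / n_proteins
    moles_each = grams_each / mean_mw_da  # Da is numerically g/mol
    return moles_each * 1e15


def gel_load_range_ug(total_ng, fold_low=20, fold_high=30):
    """Implied load for a typical 2-D gel, in micrograms."""
    return (total_ng * fold_low / 1000.0, total_ng * fold_high / 1000.0)


if __name__ == "__main__":
    # 30,000 Da is an assumed mean molecular weight, not a reported value.
    print(round(mean_fmol_per_protein(250, 800, 30_000), 1), "fmol")
    print(gel_load_range_ug(250), "micrograms")
```

Under these assumptions the average works out to roughly 10 fmol per protein,
consistent with the "low femtomole" description, and the implied load for a
conventional 2-D gel is about 5 to 7.5 micrograms.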
Related to these efforts, we also have developed a general targeted
mutagenesis method based on D. radiodurans genomic information to
define gene function. Using the targeted mutagenesis method, we have shown
that both catalase (katA) and superoxide dismutase (sodA)
genes are required for extreme radiation resistance. We are applying the
2-D virtual gel method to analyze proteins expressed by different mutants.
Characterization of expressed proteins by 2-D virtual gel and further
targeted mutagenesis analysis will provide a link to the function of the
genomic data's predicted open reading frames (ORFs) and is expected to
identify new small genes in the size range at which identifying ORFs is
problematic. The resulting information can identify genes of interest and
facilitate detailed biochemical and genetic experiments to gain a global
understanding of the organism for energy, environmental, and industrial
applications. The developed 2-D virtual gel method will be applicable to
any sequenced organism. This project was funded initially as a pilot study
for 2 years, but we expect research to continue well beyond that period.