Gene-Based Approach to Metagenomics
Provides Environmental Fingerprints

Assembling the genome of a single organism from sequenced reads is not a trivial task. The task becomes essentially impossible when the sample contains reads from hundreds or even thousands of microbial species, most of which cannot be cultured. As Murphy's Law would have it, this is the situation in a typical environmental sample. To get around this difficulty, researchers from JGI, the European Molecular Biology Laboratory, UC Berkeley, and Diversa Corporation have investigated a new gene-based approach to analyzing sequences from environmental samples, with no assembly required. Their work has shown that useful genomic information can be obtained by analyzing the proteins encoded by a community's DNA rather than by identifying the individual species.

Curves for the various samples show the number of orthologous groups seen in each per megabase of sequencing. Parentheses indicate the lower bound of the total number of orthologous groups for each sample.

The researchers characterized sequences from four complex environmental samples: one from a Minnesota farm soil and three from deep-sea whale carcasses ("whale falls"). They began by sequencing small-subunit rRNA libraries from each sample, built with primers for bacteria, archaea, and eukaryotes, to determine the species diversity. The soil sample was estimated to contain more than 3000 bacterial ribotypes, and the whale fall samples were estimated to contain 25-150 ribotypes each. Based on the variation in diversity among the samples, the scientists chose to shotgun sequence 100 Mb of genomic DNA from the soil and 25 Mb from each whale fall. As expected, the genomic sequences obtained were resistant to assembly, with less than 1% of the soil library reads overlapping. Getting enough sequence to assemble even the most commonly occurring organism in each whale fall would likely require sequencing between 100 and 700 Mb.
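The scale of that last estimate can be illustrated with a back-of-the-envelope calculation. The sketch below is not the authors' method. It assumes reads are sampled in proportion to each organism's share of the community DNA, and the genome size, abundance, and target coverage are hypothetical values chosen only for illustration.

```python
import math

def required_sequencing_mb(genome_mb, abundance_fraction, target_coverage):
    """Total community sequence (Mb) needed to reach target_coverage on one genome.

    Assumes reads are drawn in proportion to the organism's share of community DNA.
    """
    if not 0 < abundance_fraction <= 1:
        raise ValueError("abundance_fraction must be in (0, 1]")
    return target_coverage * genome_mb / abundance_fraction

def expected_fraction_covered(coverage):
    """Lander-Waterman expectation: fraction of a genome hit by at least one read."""
    return 1.0 - math.exp(-coverage)

# Hypothetical dominant organism: a 4 Mb genome making up 10% of the community DNA.
total_mb = required_sequencing_mb(genome_mb=4.0, abundance_fraction=0.10, target_coverage=8.0)
print(f"{total_mb:.0f} Mb of community sequence for 8x coverage")
print(f"expected genome fraction covered at 8x: {expected_fraction_covered(8.0):.4f}")
```

Under these assumed numbers the requirement falls within the 100 to 700 Mb range quoted above, and it grows quickly as the organism's share of the community shrinks.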

To determine whether useful information could be gleaned from the unassembled or partially assembled sequence fragments, which they dubbed environmental gene tags (EGTs), the researchers compared the soil and whale fall sequences to assembled genome sequences from an acid mine drainage community and three surface samples from the Sargasso Sea. Gene content comparisons were consistent with the range of species diversity in the samples. Automated annotation showed that 90% of the EGTs contained putative genes, and more than a third of them contained two or more open reading frames, making nearest-neighbor analysis a possibility.

Three-way analyses of environmental samples at four functional levels. Dot placement reflects the relative abundance for each item in the three environments.

The next test was to determine whether the proteins encoded by the EGTs were representative of all the proteins encoded in the samples. About half of the EGTs showed homology to proteins in an extended Clusters of Orthologous Groups (COG) database. Plots of the numbers of COG hits with increasing sequencing depth showed saturation after only modest amounts of sequencing. Thus, the unassembled sequences appeared to hold enough information to obtain a reasonable snapshot of the protein functions encoded by the community.
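A saturation plot of this kind can be sketched as an accumulation curve: walk through the reads in random order and record how many distinct orthologous groups have been seen at each increment of sequence. The code below is a minimal illustration rather than the published analysis; the function name, the read lengths, and the group labels (`G00` to `G49`) are hypothetical, and the toy input stands in for real database hits.

```python
import random

def accumulation_curve(reads, step_bp=1_000_000, seed=0):
    """Return [(cumulative_bp, distinct_groups_seen)] recorded every step_bp.

    reads: list of (read_length_bp, group_id or None); None means no database hit.
    """
    shuffled = list(reads)
    # Shuffle so the curve does not depend on the order of the input file.
    random.Random(seed).shuffle(shuffled)
    seen, total_bp, next_mark, curve = set(), 0, step_bp, []
    for length, group in shuffled:
        total_bp += length
        if group is not None:
            seen.add(group)
        # Record a point each time the running total crosses a step boundary.
        while total_bp >= next_mark:
            curve.append((next_mark, len(seen)))
            next_mark += step_bp
    return curve

# Toy data: 5000 reads of 700 bp; about half hit one of 50 hypothetical groups.
rng = random.Random(1)
toy_reads = [
    (700, f"G{rng.randrange(50):02d}" if rng.random() < 0.5 else None)
    for _ in range(5000)
]
curve = accumulation_curve(toy_reads, step_bp=500_000)
for bp, n_groups in curve:
    print(f"{bp / 1e6:.1f} Mb: {n_groups} groups")
```

A curve that flattens early suggests that additional sequencing would reveal few new groups, which is the sense of "saturation" used above.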

The next challenge was to determine whether the EGT fingerprints of samples from similar ecological niches but geographically disparate sources were more similar to each other than to those from other niches. Comparisons were drawn on four levels of protein function: the level of individual genes, the level of operons (neighboring genes), the level of higher-order cellular processes (KEGG pathways), and the level of functional roles (broad functional categories from COG). Two-way clustering analyses at all four levels showed clear distinctions between the samples from different environments. In three-way analyses where whale fall, Sargasso Sea, and soil samples each defined one group, the EGTs from each group clustered together at all four levels. What's more, the differences reflected known differences between the environmental niches. For example, the operon analysis showed differences in the transport of ions and inorganic molecules that reflected known abundances of the compounds in the environments sampled. The EGT data thus appear to be useful in determining an environmental fingerprint for a given sample, which could be used to predict features of the sample environment, such as which energy sources are being used or what levels of pollutants are present.
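One common way to draw a three-way comparison is a ternary plot, in which each item (a gene, operon, pathway, or functional category) becomes a dot whose position reflects its share across the three environments. The sketch below shows the standard barycentric mapping. Whether the authors plotted their data exactly this way is an assumption, as are the function name and the example abundances, which are taken to be already normalized for sample size.

```python
import math

def ternary_xy(a, b, c):
    """Map three non-negative abundances to a point in an equilateral triangle.

    Corners: a -> (0, 0), b -> (1, 0), c -> (0.5, sqrt(3)/2).
    """
    if min(a, b, c) < 0:
        raise ValueError("abundances must be non-negative")
    total = a + b + c
    if total <= 0:
        raise ValueError("at least one abundance must be positive")
    fb, fc = b / total, c / total
    # Weighted average of the corner coordinates; corner a sits at the origin.
    x = fb + 0.5 * fc
    y = (math.sqrt(3) / 2) * fc
    return x, y

# Hypothetical item found equally often in all three environments: lands at the centroid.
print(ternary_xy(1.0, 1.0, 1.0))
# Hypothetical item found in only one environment: lands on that environment's corner.
print(ternary_xy(0.0, 0.0, 5.0))
```

Items enriched in one environment fall near that environment's corner, while items shared evenly fall near the center of the triangle.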

An intriguing additional finding from the functional investigations was that many of the most overrepresented genes in the samples were ones whose functions have yet to be characterized. This surprising result suggests that EGT analyses will lead to insights into as-yet-unknown genes and processes that relate to the interactions of microbes with their environment.

Authors

S.G. Tringe, A. Kobayashi, A.A. Salamov, and E.M. Rubin (JGI); C. von Mering (European Molecular Biology Laboratory); K. Chen (UC Berkeley); and H.W. Chang, M. Podar, J.M. Short, and E.J. Mathur (Diversa Corp.).

Publication

"Comparative Metagenomics of Microbial Communities," Science 308: 554-557 (2005).

Funding

This research was funded by the U.S. Department of Energy Office of Biological and Environmental Research, the National Institutes of Health, and the National Science Foundation.