The data points in an integrated toxicogenomics experiment with microarray,
proteomics, and metabolomics data are virtually innumerable. With thousands--or
tens of thousands--of data points for each sample in each type of analysis,
the complexity and the sheer amount of data multiply fast. As teams of
statisticians, bioinformaticists, and biologists work to interpret this
complexity, they must also ensure that each data point is valid and can
be integrated with other data from the same or different types of experiments.
Underlying this detailed exercise are two expansive goals. One is to identify
markers of toxic exposures or disease. The other is to understand the biological
processes underlying disease. The latter is called biological inference--the
highly iterative process of inferring cause-and-effect relationships from
toxicogenomics data using computational methods grounded in mathematics.
Efforts to identify markers of exposure are concerned primarily with
discerning patterns in output from microarray, proteomics, and metabolomics
technology. These patterns can be characterized as molecular fingerprints
and can be extremely useful in diagnosing levels of exposure, even though
researchers may not understand exactly why particular patterns appear.
In contrast, biological inference is concerned with an understanding of
how patterns in genomics data actually translate into the details of gene
transcription, protein creation, and metabolism.
By studying associations among the expression of genes, proteins, and
metabolites, researchers try to identify genes of influence, many of which
act as hubs of metabolic networks, affecting many other genes. Transient
hubs, those that act briefly as a cell changes state, can be especially
difficult to find and analyze. A goal of special importance to toxicogenomics
is to distinguish endogenous pathways--those involved in the cell’s
normal chores of metabolism and reproduction--from exogenous pathways triggered
by exposures to drugs or toxicants. The ultimate goal is to follow a pathway
from the expression of genes through the creation and modification of proteins
and metabolites, as well as all the associated gene-gene, protein-protein,
and metabolite-metabolite interactions in between.
Anchor Management
Inferring biological pathways requires research teams to mine and interpret
vast quantities of genomics data. Interpretation strategies include grouping
or clustering data to find patterns, as well as the use of statistical methods
to filter data on genes with the strongest signals or those expressed in
concert. One key technique in biological inference is phenotypic anchoring,
using known biological information to interpret signals, or uncharacterized
data, from genomics experiments. These signals indicate the presence of
molecules (such as mRNA or proteins in a given range of molecular weights)
and can take various forms, depending on the type of technology used. For
example, in microarray experiments signals take the form of fluorescence
generated by the bonding of strands of mRNA to the microarray slide. In
some studies, these genomics data are compared to data from traditional
toxicology tests performed on the same samples. In others, the team integrates
what is already known or suspected about biological pathways based on past
studies.
Examples of research using phenotypic anchoring can be found in acetaminophen
studies sponsored by the NIEHS National Center for Toxicogenomics (NCT).
In this research, microarray data were compared to those from traditional
toxicology tests, which both aided in interpretation of the microarray
data and led to the development of new knowledge and understanding about
the toxicity of this commonly used drug.
For example, in one study published in the July 2004 issue of Toxicological
Sciences, data from microarrays and traditional toxicity tests confirmed
previous results from other labs showing that toxic doses of acetaminophen
deplete adenosine triphosphate (ATP; molecules that store cellular energy)
and damage mitochondria, the organelles that produce ATP. In addition,
microarray data revealed other exposure effects that traditional tests
had missed. For example, liver cells begin to express genes consistent
with cellular energy loss at doses too low to cause the kind of cell
damage that can be detected by histopathology and other traditional methods.
The microarray data also provided information on a possible new signature
of acetaminophen toxicity involving the metallothionein gene and several
others, which may be involved in the liver’s antioxidant defense
system. “We didn’t previously know that those genes were involved
in acetaminophen toxicity, but it fits into the biological story of cell
defense mechanisms,” says Alexandra Heinloth, a research scientist
with the NCT and lead author of the 2004 paper.
A different type of phenotypic anchoring was used in a study of pathways
linked to inflammation. Microarray experiments with mouse strains that
exhibited both high and low levels of response to inhaled lipopolysaccharide
compounds (which trigger immune responses) identified about 500 genes that
were responsive in at least one of the strains. Researchers from Duke University,
The Institute for Genomic Research, and George Washington University, including
John Quackenbush, now a professor in biostatistics and computational biology
at the Dana-Farber Cancer Institute and the Harvard School of Public Health,
used two independent methods to filter the results of these microarray
experiments, to prioritize genes for future study.
In the first method, the team identified 30 genes whose expression levels
best distinguished the low- and high-responding mice. In the second method,
they used quantitative trait locus (QTL) analysis to find regions genetically
linked to the strength of the lipopolysaccharide-induced response. When
the researchers compared their 500-gene list to the QTL regions, they found
a set of 28 that were both differentially expressed and genetically linked
to the observed phenotypes. There was no overlap between the genes identified
by these two methods.
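In code, this kind of two-filter prioritization amounts to intersecting gene
lists. The sketch below illustrates the idea in Python with invented gene
names, positions, and QTL intervals; it is not the study's actual data or
pipeline.

```python
# Hypothetical sketch: keep genes that are both differentially expressed and
# located inside QTL intervals linked to the phenotype. Gene names,
# coordinates, and intervals are invented placeholders.

differentially_expressed = {"Tlr4", "Il6", "Cxcl1", "Socs3"}   # from expression filtering
qtl_intervals = {"chr4": [(60_000_000, 75_000_000)]}           # (start, end) per chromosome

gene_locations = {
    "Tlr4": ("chr4", 66_830_000),
    "Il6": ("chr5", 30_010_000),
    "Cxcl1": ("chr5", 90_800_000),
    "Socs3": ("chr11", 117_960_000),
}

def in_qtl(gene):
    """Return True if the gene's position falls inside any QTL interval."""
    chrom, pos = gene_locations[gene]
    return any(start <= pos <= end for start, end in qtl_intervals.get(chrom, []))

high_priority = sorted(g for g in differentially_expressed if in_qtl(g))
print(high_priority)    # genes passing both filters, e.g. ['Tlr4']
```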
In a report of the study published in the June 2004 issue of Genomics,
the researchers acknowledge that they may miss genes with important roles
with these filtering methods. They argue, however, that their approaches
provided an objective way to obtain a small number of high-priority genes
for future functional studies.
Beyond Microarrays
Although researchers aim to eventually link pathways from expression
to metabolism, genomics research thus far has focused on microarrays because
this technology is more standardized and far more widely available than
methods for analyzing proteins (primarily mass spectrometry) and metabolites
(mass spectrometry and nuclear magnetic resonance). Although array-like
assays for proteins have been developed, some using antibodies as tags,
these technologies are still relatively exploratory and limited in scope,
says Terry Speed, a professor in the Department of Statistics at the University
of California, Berkeley.
But microarray data have serious limitations when it comes to biological
inference. Although they can show associations, such data can rarely indicate
cause and effect. “What people can find across
many arrays are patterns suggesting coregulation,” says Speed. “If
you look across hundreds of arrays and find that expression of two genes
[moves] up and down together, that’s highly suggestive of interactive
behavior. It’s not so much biological modeling as it is finding associations
that are suggestive of biological interactions.” One
of the greatest limitations of microarray data reflects the underlying
biology: the expression of mRNAs doesn’t always translate into proteins
because silencing RNAs and other mechanisms can block the translation process.
One way to bridge the gap between gene expression and protein creation
is to assess the proteins in a cell through proteomics analysis. Another
is to gain a better understanding of the “transcriptome” (also
called the “RNAome”)--the full set of transcripts (strands of RNA) in the
cell, together with the regulatory elements that control their expression,
stability, and translation. “To make genotype-phenotype correlations,
you need to have a complete catalogue of transcripts that are expressed
at each locus where such genotype-phenotype correlations are to be made,” says
Thomas Gingeras, vice president of biological science at Affymetrix.
Gingeras and other researchers at Affymetrix and the National Cancer
Institute have studied the RNAome with microarrays containing probes for
the entire nonrepetitive sequences, not just the coding regions, of 10
human chromosomes. Data from these arrays have demonstrated that RNA activity
is extraordinarily varied and complex. Although researchers have been able
to identify the roles of many sequences, such as ribosomal and protein-coding
RNAs, they have had to classify a substantial number of the newly discovered
transcripts as TUFs (transcripts of unknown function).
The genomic arrangement of TUFs is equally complicated. “Curiously enough,
many of these transcripts are sitting in the middle of genes, overlapped
on both the sense and antisense strands of coding sequences,” says
Gingeras. The sense strand of DNA matches the sequence of the transcribed
mRNA, while the complementary antisense strand serves as the template for
transcription. The authors speculate that the RNAs correlating to antisense
strands may be cRNA copies, created in somewhat the same way as the cDNA
copies of RNA used in microarrays.
Another challenge in all types of genomics experiments is detecting signals
from molecules expressed at low levels. Quantities of mRNAs and proteins
in a sample can vary by a ratio of 1 million to 1. Some of these low-expression
molecules could be critical triggers to biological cascades, but may be
lost in the signal-to-noise ratio.
Proteomics technologies are making progress in detecting low-expression
molecules through more sophisticated sorting technologies such as SELDI-TOF
(surface-enhanced laser desorption/ionization time-of-flight mass spectrometry,
which captures only a selected subset of proteins from a given sample for
analysis). To detect low-expression transcripts and quantify the number
of mRNAs in a given sample, researchers studying gene expression turn to
methods such as RT-PCR (which can involve the use of controls and fluorescent
markers to quantify the amount of a molecule produced during polymerase
chain reaction) and SAGE (which involves marking each transcript with a
unique tag and then linking and sequencing the combined transcripts to
count the number of times each tag occurs).
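The counting step behind SAGE is simple to illustrate. The Python sketch
below tallies invented sequence tags; in a real experiment each tag would be
extracted from the concatenated, sequenced transcripts.

```python
# Minimal sketch of SAGE-style tag counting: each short sequence tag stands
# in for one transcript, so counting tag occurrences estimates abundance.
# The tag sequences below are invented for illustration.
from collections import Counter

observed_tags = [
    "CATGGCTAAGCCA", "CATGTTACGGATC", "CATGGCTAAGCCA",
    "CATGAAGGCTTCA", "CATGGCTAAGCCA", "CATGTTACGGATC",
]

tag_counts = Counter(observed_tags)
for tag, count in tag_counts.most_common():
    print(f"{tag}\t{count}")   # higher counts suggest more abundant transcripts
```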
Analysis Issues
As great as the challenges are in developing technology to detect and
identify transcripts, proteins, and metabolites, the difficulties in analyzing
the resulting data may be even greater. As research teams plan their experiments,
they must choose from a bewildering and ever-changing assortment of statistical
methods for data analysis. Speed says no one protocol will work for all
experiments: “Usually there will be one method that will be preferred
and often several that will be acceptable. All of them have their strengths
and weaknesses.”
Part of the difficulty relates to the current nature of genomics experiments.
Whereas traditional statistics methods are based on the assumption that
a study will have far more samples than data points per sample, genomics
experiments usually involve the inverse situation: a few dozen samples,
with tens of thousands of data points per sample. New statistical methods
are being developed to deal with the peculiarities of genomics data. Simultaneously,
other teams are working to ensure that the data to be analyzed are valid
by addressing issues such as differences in experimental technologies and
laboratory procedures, and revisiting the effects of sample size.
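One widely used example of such a method is false discovery rate control,
which adjusts for the thousands of simultaneous per-gene tests a microarray
generates. The Python sketch below applies the Benjamini-Hochberg procedure
to simulated p-values; the numbers are purely illustrative.

```python
# Sketch: control the false discovery rate across thousands of per-gene tests,
# one common answer to the "many genes, few samples" problem. The p-values
# here are simulated for illustration.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
p_values = np.concatenate([
    rng.uniform(0, 0.001, 50),    # 50 genes with real effects
    rng.uniform(0, 1, 9950),      # 9,950 null genes
])

rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} genes called significant at 5% FDR")
```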
Recent studies have shown that increased standardization of microarray
platforms has greatly reduced the influence of platform type on results.
In a study published in the May 2005 issue of Nature Methods, Quackenbush
and colleagues found that differences in microarray platforms (oligonucleotide
versus spotted cDNA) did indeed affect results. However, these differences
were eclipsed by a very high correlation between the platforms in expression
changes caused by varying exposures to angiotensin II, a potent peptide
that causes blood vessels to constrict. “The question we asked was,
does the biology or the platform dominate?” says Quackenbush. “For
more than ninety percent of genes for which we could make a reasonable
comparison, we found that biology dominated platform.”
The bad news is that variations in protocols (for RNA labeling, hybridization,
and microarray processing), statistical methods for data acquisition and
normalization, and other “lab effects” can still significantly
impact microarray data. This was the finding of two other studies also
published in the May 2005 issue of Nature Methods, one led by Rafael
Irizarry, an associate professor in the Department of Biostatistics at
The Johns Hopkins University, and another by researchers with the NIEHS
Toxicogenomics Research Consortium (TRC). Both studies compared the results
of microarray analysis of identical samples performed at multiple laboratories,
and both found that, with care, results can be comparable across labs.
However, “you have to pay close attention to how you do things. You
have to standardize your protocols from lab to lab,” says Katherine
Kerr, a coauthor of the TRC study and the director of the Bioinformatics
and Biostatistics Facility Core at the University of Washington-NIEHS Center
for Ecogenetics and Environmental Health.
The design for most microarray experiments calls for three to five samples
per treatment condition. However, some statisticians say that sample size
must be increased to generate the statistical power necessary to infer
biological pathways. “Sample sizes in most
of the current systems biology experiments are not adequate to infer the
kinds of complex networks that are the goal of such studies,” says
Gary Churchill, a staff scientist at the nonprofit Jackson Laboratory in
Bar Harbor, Maine.
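A rough sense of the sample-size problem can be had from a standard power
calculation. The Python sketch below asks how many replicates per group a
simple two-sample t-test would need to detect a twofold expression change;
the effect size, variance, and significance threshold are assumptions chosen
for illustration, not figures from Churchill or the Collaborative Cross.

```python
# Rough sketch: how many replicates per group does a two-sample t-test need
# to detect a twofold expression change with 80% power? The effect size and
# variance below are illustrative assumptions.
import numpy as np
from statsmodels.stats.power import TTestIndPower

log2_fold_change = 1.0          # a twofold change on the log2 scale
sd_log2_expression = 0.7        # assumed per-gene standard deviation
effect_size = log2_fold_change / sd_log2_expression

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.001,                # stringent threshold to offset thousands of tests
    power=0.8,
)
print(f"~{int(np.ceil(n_per_group))} samples per group")
```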
Churchill is a cofounder of the Collaborative Cross, a project to develop
a panel of 1,000 new and genetically diverse mouse strains. The mice are
descendants of just eight parent strains, minimizing the need for genotyping,
and are being bred for maximum genetic variation, allowing for a plethora
of diverse yet controlled strains. Not all studies will require 1,000 mice,
Churchill says, although some will. The resource is being generated to
cover a wide range of needs.
Interpretation Is Key
Before teams can begin analyzing data, they must be confident that they
are interpreting the raw signals accurately. For microarray data, this
involves summarizing the fluorescence data for each spot--which are generated
by the individual mRNAs linked to the probes--into a single value for each
gene. “The hardest challenge is removing the component of intensity
that is due to background noise,” says Irizarry. Some background
noise--extraneous signals that can be confused with the signals being observed--can
be caused by mismatches in the attachment of mRNA to the chip.
Another potential issue, says Irizarry, is that some of the 25-base
probes used on oligonucleotide chips can be “stickier” than
others--that is, more likely to attract mRNA. “If one gene is represented
by a sequence that’s sticky, it will collect more than a probe that
is less sticky,” he says. As a result, results from such a probe
may reflect the chemistry of cDNA-mRNA bonding more than the biology of
the sample.
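In simplified form, turning probe-level fluorescence into a gene-level value
involves estimating and removing background, transforming to a log scale,
and summarizing across the probes in a probe set. The Python sketch below
uses invented intensities and a single chip-wide background value; production
methods such as RMA model background and probe effects far more carefully.

```python
# Simplified sketch of turning probe-level fluorescence into one value per
# gene: subtract an estimated background, log-transform, and take the median
# across the probes in each probe set. The intensities are invented.
import numpy as np

probe_intensities = {                     # raw intensities per gene's probe set
    "geneA": np.array([820.0, 940.0, 760.0, 1100.0]),
    "geneB": np.array([130.0, 95.0, 160.0, 140.0]),
}
background = 80.0                         # assumed chip-wide background estimate

gene_values = {
    gene: float(np.median(np.log2(np.clip(vals - background, 1, None))))
    for gene, vals in probe_intensities.items()
}
print(gene_values)
```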
Once values have been established for each gene in an array, the signals
need to be normalized across the array. Equalizing fluorescence signals
in two-color arrays is the most basic type of normalization. If equal amounts
of two samples were hybridized to an array, then the total fluorescent
signal from each sample should also be equivalent. If one is uniformly
higher, a statistician can adjust the fluorescence values to better represent
the relationship between the samples. Several more-involved processes may
also be used to normalize data.
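The simplest version of this adjustment, total-intensity normalization, can
be written in a few lines. The Python sketch below scales one channel of a
hypothetical two-color array so that its summed signal matches the other
channel before computing log ratios; the spot intensities are invented.

```python
# Minimal sketch of total-intensity normalization for a two-color array:
# scale the red channel so its summed signal matches the green channel,
# then work with log ratios. Intensities are invented placeholders.
import numpy as np

green = np.array([1200.0, 430.0, 8800.0, 150.0, 2600.0])   # sample 1 signal per spot
red = np.array([1500.0, 600.0, 9900.0, 240.0, 3300.0])     # sample 2 signal per spot

scale = green.sum() / red.sum()          # global scaling factor
red_normalized = red * scale

log_ratios = np.log2(red_normalized / green)
print(log_ratios)                        # ~0 for unchanged genes after normalization
```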
According to the TRC study, normalization procedures seem to increase
the accuracy of microarray data. But many unanswered questions remain about
normalization formulas, as well as about the algorithms used to analyze
genomics data, says Kerr. “The data look better when we’re
done with [normalization],” she says. “But we don’t know
if we’ve really made a correction.”
After normalization, researchers can take a basic count of changes in
gene expression. Churchill calls this “making lists.” He explains: “You
measure microarray data from normal tissue and from diseased tissue and
. . . generate lists of genes that are up or down. The problem now is how
we take those lists and turn them into biological sense.”
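A minimal version of list making might combine a per-gene statistical test
with a fold-change cutoff. The Python sketch below does this on a randomly
generated expression matrix; the thresholds and simulated data are
illustrative only.

```python
# Sketch of "making lists": flag genes whose expression differs between
# normal and diseased samples, using a per-gene t-test plus a fold-change
# cutoff. The expression matrix is randomly generated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes = 1000
normal = rng.normal(8.0, 1.0, size=(n_genes, 4))      # log2 expression, 4 replicates
diseased = rng.normal(8.0, 1.0, size=(n_genes, 4))
diseased[:25] += 2.0                                   # simulate 25 up-regulated genes

t_stat, p_values = stats.ttest_ind(diseased, normal, axis=1)
fold_change = diseased.mean(axis=1) - normal.mean(axis=1)   # difference of log2 means

up = np.where((p_values < 0.01) & (fold_change > 1))[0]
down = np.where((p_values < 0.01) & (fold_change < -1))[0]
print(f"{len(up)} genes up, {len(down)} genes down")
```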
Turning Data into Sense
The next step for many teams is to shorten the list. They may focus on
genes with the strongest or most closely correlated changes in expression.
Often this process is informed by preexisting information about gene function.
But care must be taken. Setting filters that are too tight can cause an
analysis to ignore genes of importance, while setting parameters too broadly
can cause false positives.
That is why different groups can have difficulty replicating results
on microarray experiments, says Greg Carr, a research fellow working in
product safety at Procter and Gamble. If the statistical power--or
probability of detecting genes of significance--is set relatively
low, say 10%, one lab may pick up on some of the low-power genes and a
second lab may pick up on others, but no one lab is likely to detect them
all, he says.
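A toy simulation makes Carr's point concrete: if each lab has only a 10%
chance of detecting any given affected gene, two labs will each recover a
different small subset, with little overlap. The numbers in the Python sketch
below are assumptions for illustration.

```python
# Toy simulation: at ~10% per-gene power, two labs running the same experiment
# detect mostly different subsets of the truly affected genes.
import numpy as np

rng = np.random.default_rng(3)
n_true = 200          # genes truly affected by the exposure
power = 0.10          # assumed probability each lab detects any given gene

lab1 = rng.random(n_true) < power
lab2 = rng.random(n_true) < power

print("lab 1 detects:", lab1.sum())
print("lab 2 detects:", lab2.sum())
print("detected by both:", (lab1 & lab2).sum())
```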
Another approach sometimes used with or instead of data filtering is
to cluster, or group, data according to similarities in expression patterns.
Methods include hierarchical or “Eisen” clustering, which produces
a tree-like structure; k-means clustering, which produces line graphs;
and principal components analysis, which produces a three-dimensional
plot that can be rotated. Clustering methods are usually
exploratory and, Speed says, don’t provide an answer to a well-defined
question; they can only show associations among genes. However, they can
provide effective ways to organize data. Scientists then have to infer
cause and effect.
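The sketch below shows, in Python, roughly what these three exploratory
approaches look like on a random gene-by-sample matrix--hierarchical
clustering with a tree cut, k-means grouping, and a principal components
projection. Real analyses would start from normalized data and tune the
parameters; everything here is a placeholder.

```python
# Exploratory sketch of the three approaches mentioned above, applied to a
# random log-expression matrix (genes x samples). Data are placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
expression = rng.normal(size=(200, 12))            # 200 genes, 12 samples

# Hierarchical ("Eisen"-style) clustering: build a tree, then cut it.
tree = linkage(expression, method="average", metric="correlation")
hier_labels = fcluster(tree, t=4, criterion="maxclust")

# k-means clustering: group genes with similar expression profiles.
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(expression)

# Principal components analysis: project genes into a low-dimensional view.
coords = PCA(n_components=3).fit_transform(expression)

print(hier_labels[:10], km_labels[:10], coords.shape)
```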
Various types of modeling algorithms can aid in this inference process.
One of the simplest classes of models, Boolean networks, can capture multivariate
gene relationships inferred from measurement data. Ilya Shmulevich,
an associate professor at the Institute for Systems Biology in Seattle,
has worked to increase the flexibility of Boolean network modeling through
the development of probabilistic Boolean networks. These networks allow
for multiple functional possibilities for each gene, mimicking underlying
biological and measurement uncertainty, says Shmulevich. He and his colleagues
have applied probabilistic Boolean network analysis to gene expression
data from studies of melanomas and gliomas.
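A toy example conveys the idea behind a probabilistic Boolean network: each
gene's next state is computed by one of several candidate rules, chosen at
random according to assigned probabilities. The genes, rules, and
probabilities in the Python sketch below are invented for illustration and
are not drawn from Shmulevich's melanoma or glioma analyses.

```python
# Toy sketch of a probabilistic Boolean network: each gene's next state comes
# from one of several candidate Boolean rules, sampled with fixed
# probabilities. Genes and rules are invented placeholders.
import random

# Candidate update rules per gene: (probability, function of current state).
rules = {
    "A": [(1.0, lambda s: not s["C"])],
    "B": [(0.7, lambda s: s["A"] and s["C"]),
          (0.3, lambda s: s["A"])],
    "C": [(0.6, lambda s: s["B"]),
          (0.4, lambda s: s["A"] or s["B"])],
}

def step(state):
    """Advance the network one time step, sampling a rule for each gene."""
    new_state = {}
    for gene, candidates in rules.items():
        probs, funcs = zip(*[(p, f) for p, f in candidates])
        func = random.choices(funcs, weights=probs, k=1)[0]
        new_state[gene] = bool(func(state))
    return new_state

state = {"A": True, "B": False, "C": False}
for _ in range(5):
    state = step(state)
    print(state)
```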
When using most modeling methods, “you have to put in a rather
limited set of genes and then learn something about that set,” says
Speed. The art of assigning genes to modeling programs and deciding how
to filter the results of other genomics data draws heavily on preexisting
knowledge about the systems in general. Researchers comb through the scientific
literature or databases of genetics, proteomics, and metabolomics data.
However, database mining can only take researchers so far. Only 60-65%
of human genes have been adequately annotated as to function, says Raymond
Tennant, director of the NCT. This makes it difficult to infer the function
of genes that haven’t yet been annotated.
The Chemical Effects in Biological Systems Knowledgebase, set to become
publicly available in late 2005, will provide access to microarray data
for about 140 reference compounds, and comprehensive data sets on about
10 hepatotoxicants, including acetaminophen. “The database will also
include reference information on the biological effects of chemicals and
other agents, and pathways related to their mechanism of action,” says
Michael Waters, NCT assistant director for database development.
Speaking the Language of Inference
Beyond having data in hand, biological inference requires a wide range
of skills and expertise. “There’s a need for transdisciplinary
efforts--notice I said transdisciplinary rather than interdisciplinary,” says
Kenneth Ramos, chair of the Department of Biochemistry and Molecular Biology
at the University of Louisville and EHP’s toxicogenomics editor. “We
need people who speak more than one language, a new type of scientist.
We’ve become so specialized that it is difficult to cross disciplines
with fluidity. I think that research in biological inference will demand
that ability. Classically trained biologists will need to reengage their
appreciation and understanding of mathematics in order to begin to tackle
some of these questions.”
Biologists may also need to have a better understanding of, and possibly
training in, statistics. “Taking Stats 101 isn’t necessarily
going to imbue the concepts you need,” says Churchill. “Statistics
isn’t about the formulas, how to crunch numbers. It’s about
the concepts. It’s about how to quantify uncertainty, about how to
take data and turn it into knowledge.”
“It would be beautiful if you had that [statistical] training when
you start in the field,” says Heinloth. “But if you include
enough statisticians and bioinformaticists on your team as equal members,
you can have that covered.” In Heinloth’s research group, statisticians
are involved in experiments from the beginning. “There’s hardly
any decision in a study that’s made by only one person,” she
says. However, she notes that this collaborative process is not “science
by committee.” As experiments are designed and implemented, team
members weigh in only in their areas of expertise.
As researchers delve into this enormous quantity of data, they are confronted
with the limits of human cognitive ability. It is not possible for a single
individual to fully comprehend the astonishing complexity of metabolism
in even a single cell type. So as researchers develop new technologies
and new statistical tools for genomics research, and for inferring the
ramifications of the data they uncover, they also need to find new ways
to approach the limits of their own understanding.
“We need to simplify to understand complexity, to start off with
the exploration and understanding of simple systems. If you try to look
at the complexity first off, you’ll never really unravel it,” says
Ramos. There’s a lot of potential for error in starting with
simple systems, he adds. But it’s a way to start, to gradually build
up to more complex models.
“As we accumulate more data, we’re understanding how limited
our understanding is and how much more there is to discover,” says
Churchill. “That can be discouraging from a diagnostic sense, but
it’s wonderful to have a new universe opening up in front of you.”
Kris Freeman