Toxicogenomics: An EHP Section

Toxicogenomics is one of the newest fields of study that will play a major role in future research breakthroughs in environmental health. In January, when we expanded the coverage of this new field in Environmental Health Perspectives (EHP) by initiating a new quarterly section of the journal, the change was welcomed enthusiastically by the EHP readership. However, to clarify that articles in the Toxicogenomics Section are part of EHP and not a separate publication, we have instituted several changes: All articles will now carry an EHP citation, and each year all sections will have the same volume number with pages numbered consecutively. Articles will be abstracted immediately by abstracting services and will enjoy the same high impact factor as all other EHP articles. So that all articles in the Toxicogenomics Section have the same citation, we are re-publishing articles from the premier issue. Note that the Digital Object Identifier (DOI) code portion of the original citation remains unchanged.

We will consider articles for publication in the Toxicogenomics Section of EHP from the related disciplines of pharmacogenomics, proteomics, metabonomics, bioinformatics, molecular epidemiology, translational aspects of genomic research, and molecular medicine. The section has full color capabilities and features online publication of extensive data sets and supplementary materials. As with other EHP articles, accepted research articles will be published within 24 hours. These articles are completely citable using the DOI code that is managed by CrossRef, a licensee of the International DOI Foundation. The EHP Toxicogenomics-in-Press articles can be found on our website (http://ehp.niehs.nih.gov/txg/).

Please join us as we explore the interactions between genes and the environment and the complexity of the biological circuitry involved in the cellular response to stressful environments. We will nurture the field by maintaining the highest standards of excellence. The launching of this new section of EHP marks a critical stage in the evolution of the field of toxicogenomics.

Kenneth S. Ramos
Toxicogenomics Editor, EHP
University of Louisville Health Sciences Center
Louisville, Kentucky
E-mail: ksramo01@gwise.louisville.edu

Thomas J. Goehl
Editor-in-Chief, EHP
Research Triangle Park, North Carolina
E-mail: goehl@niehs.nih.gov

Pluralitas non est ponenda sine necessitate.

William of Ockham (ca. 1280-1349)

Model Selection in Genomics

With the discovery of DNA, the completion of genome sequencing of a number of organisms, and the advent of powerful high-throughput measurement technologies such as microarrays, it is now commonly said that biology has gone through a revolution. But I also have heard it said that biology is only about to go through a scientific revolution, much as physics did in the 17th century. In messianic hopes, people foretell the coming of the Newton of biology, but it is up to us, the scientific community, to set the stage for that to happen.

Both views are valid, each in its own sense. The discovery of DNA and the more recent development of powerful new technologies have certainly revolutionized our understanding of the inner workings of life and allowed us to probe deep into the machinery of living organisms, much as the Copernican system and Galileo's telescope helped revolutionize astronomy. It was Sir Isaac Newton, however, who placed science on a solid footing by formalizing existing knowledge in terms of mathematical models and universal laws. In some sense, this was the real scientific revolution because it permitted prediction of physical phenomena in a general setting, as opposed to simply describing individual observations. The difference is profound. Whereas a mathematical equation can adequately describe a given set of observations, it may be missing the needed universality for making predictions. Kepler's equations pertained to planets in our solar system. Newton's laws could be used to predict what would happen to two arbitrary bodies anywhere in the universe. The universality of a scientific theory coupled with mathematical modeling allows us to make testable predictions. This ability will have a profound effect on the field of biology.

The hallmarks of a great scientific theory are universality and simplicity. Newton's law of gravity is a case in point. The fact that the force of attraction between any two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them is both universal and simple. These issues are especially important today in the rapidly evolving field of genomics, where formal mathematical and computational methods are becoming indispensable. So what should be our guiding principles, our beacons of scientific inquiry? One such fundamental principle underpinning all scientific investigation is Ockham's razor, also called the "law of parsimony."

Consider the following, seemingly straightforward problem. We are presented with a set of data, represented as pairs of numbers (x, y). In each pair, the first number (x) is an independent variable and the second number (y) is a dependent variable. The problem is to choose whether to fit a line (of the form y = a + bx) or a parabolic function (of the form y = a + bx + cx²). The knee-jerk response might be as follows: Let's fit the parabolic function, since the linear function is clearly a special case of it, just by letting c = 0; thus, the parabola will always fit our data set at least as well. After all, if it so happens that our data points are arranged on a line, the estimation of parameters (a, b, and c) will simply reveal that c is indeed equal to zero and the parabolic function will reduce to a linear one. Thus, it would seem, three "adjustable" parameters are better than two. Of course, such reasoning could be taken ad absurdum if we had freedom to choose as many parameters as we like. Thus, there must be a tradeoff. Although three parameters can fit the observed data no worse than two, the model becomes more complex, and so we sacrifice simplicity. But why is that bad?
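The nesting argument can be checked numerically. The following is a minimal sketch in Python and is not part of the original editorial; the helper names (`fit_polynomial`, `sum_squared_error`) and the data values are assumptions chosen for illustration. It fits both forms by ordinary least squares and compares the error on the very data used for fitting.

```python
# Illustrative sketch: least-squares fits of a line and a parabola.
# Function names and data values are assumptions made for this example.

def fit_polynomial(xs, ys, degree):
    """Return least-squares coefficients [a, b, c, ...] of y = a + b*x + c*x**2 + ..."""
    n = degree + 1
    # Normal equations (X^T X) w = X^T y, where X[i][j] = xs[i] ** j.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (rhs[i] - tail) / A[i][i]
    return coeffs


def sum_squared_error(xs, ys, coeffs):
    """Sum of squared residuals of the fitted polynomial on (xs, ys)."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(c * x ** k for k, c in enumerate(coeffs))
        total += (y - prediction) ** 2
    return total


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
exact_ys = [1.0 + 2.0 * x for x in xs]      # points lying exactly on a line
noisy_ys = [1.3, 2.6, 5.4, 6.7, 9.2, 10.8]  # roughly linear, made-up noise

# On exact line data, the fitted c comes out as zero up to rounding.
print("parabola on exact line:", fit_polynomial(xs, exact_ys, 2))

# On noisy data, the parabola's error on the fitted data is never larger.
for degree in (1, 2):
    coeffs = fit_polynomial(xs, noisy_ys, degree)
    print(degree, sum_squared_error(xs, noisy_ys, coeffs))
```

The comparison here is made only on the data used for fitting, which is exactly why the extra parameter looks like a free improvement.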

To give a general answer, by making a model overly complex, we forfeit predictive accuracy. A complex model may be able to describe the observed data very well, but will it accurately predict future instances? For example, if the data contain random fluctuations or noise, an excessively complex model will "overfit" the data along with the noise and will tend to provide a poor fit to future (unseen) data. The chief goal of model selection is to find the right balance between simplicity and goodness-of-fit.
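A deliberately tiny, hand-checkable sketch of this effect follows. It is an illustration under stated assumptions rather than anything taken from the editorial: the true relationship is assumed to be y = x, the three observed values carry made-up noise of +0.3, -0.3, and +0.3, and the two "future" points are assumed noise-free. The helper names are hypothetical.

```python
# Illustrative sketch: a parabola that fits three noisy points exactly
# predicts unseen points worse than a line. All values are made up.

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a, b, c, ...] via the normal equations."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            rhs[row] -= factor * rhs[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(A[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (rhs[i] - tail) / A[i][i]
    return coeffs


def sum_squared_error(xs, ys, coeffs):
    """Sum of squared prediction errors of the polynomial on (xs, ys)."""
    return sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )


# Observed (seen) data: true y = x, plus assumed noise.
seen_x = [-1.0, 0.0, 1.0]
seen_y = [-0.7, -0.3, 1.3]

# Future (unseen) data: assumed noise-free values of y = x.
unseen_x = [-0.5, 0.5]
unseen_y = [-0.5, 0.5]

results = {}
for degree in (1, 2):
    coeffs = fit_polynomial(seen_x, seen_y, degree)
    results[degree] = (
        sum_squared_error(seen_x, seen_y, coeffs),
        sum_squared_error(unseen_x, unseen_y, coeffs),
    )
    print("degree", degree, "seen/unseen error:", results[degree])
```

Worked by hand, the line is y = 0.1 + x, with squared error of about 0.24 on the seen points and 0.02 on the unseen ones. The parabola, y = -0.3 + x + 0.6x², passes through all three seen points (error of essentially zero) but has an error of about 0.045 on the unseen points: it has fitted the noise.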

Consider gene expression-based cancer classification. The basic idea is simple: Take a number of tumor samples of a known type, measure the expression of thousands of genes for each one, and on the basis of these observations, construct a classifier (model) that will predict the tumor type when presented with an unknown sample. A fundamental question is "What type of classifier should we choose?" This is a crucial step in model selection (in machine learning, the model class is called the "hypothesis space"). The next step--actually selecting a particular classifier from the model class (i.e., selecting a particular hypothesis)--is fairly well understood, as it involves the estimation of parameters.

As discussed, it would be unwise to devise an overly complex classifier, consisting of hundreds or thousands of parameters, especially in light of the rather small sample sizes (numbers of tumors) available, which are typically below 100. Such a classifier may have extremely small or even no error on the seen data but may exhibit very high error on unseen data. Hence, its predictive accuracy would be very poor.
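The gap between seen and unseen error can be illustrated with simulated data. The sketch below is an assumption-laden illustration, not a method named in the editorial: a 1-nearest-neighbor rule stands in for an overly flexible classifier because it effectively memorizes every training profile, and the "expression profiles" are pure random noise with arbitrary class labels, so there is nothing real to learn. The sample and gene counts are made up.

```python
# Illustrative sketch: a classifier that memorizes its training samples has
# no error on seen data even when the data are pure noise.
import random


def nearest_neighbor_label(sample, train_samples, train_labels):
    """Return the label of the training sample closest to `sample`."""
    def squared_distance(other):
        return sum((u - v) ** 2 for u, v in zip(sample, other))

    best = min(
        range(len(train_samples)),
        key=lambda i: squared_distance(train_samples[i]),
    )
    return train_labels[best]


def error_rate(samples, labels, train_samples, train_labels):
    """Fraction of `samples` whose predicted label differs from the given label."""
    wrong = sum(
        nearest_neighbor_label(s, train_samples, train_labels) != y
        for s, y in zip(samples, labels)
    )
    return wrong / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
N_GENES = 200           # assumed number of measured genes
N_TUMORS = 40           # assumed number of tumors, well below 100


def noise_profile():
    """A simulated expression profile that carries no class information."""
    return [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]


seen_samples = [noise_profile() for _ in range(N_TUMORS)]
seen_labels = [i % 2 for i in range(N_TUMORS)]      # arbitrary labels
unseen_samples = [noise_profile() for _ in range(N_TUMORS)]
unseen_labels = [i % 2 for i in range(N_TUMORS)]

seen_error = error_rate(seen_samples, seen_labels, seen_samples, seen_labels)
unseen_error = error_rate(unseen_samples, unseen_labels, seen_samples, seen_labels)
print("error on seen data:  ", seen_error)
print("error on unseen data:", unseen_error)
```

Each seen sample is its own nearest neighbor, so the error on seen data is zero by construction. Because the labels are unrelated to the profiles, the error on unseen data should hover around chance, although the exact figure depends on the random draw.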

So, suitable criteria or methods are needed that would help us strike the right balance between simplicity and goodness-of-fit, such that predictive accuracy can be maximized. Fortunately, recent statistical literature is replete with various approaches, such as the Bayesian information criterion, Akaike's information criterion, the minimum description length principle, and cross-validation methods.
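As a rough sketch of how three of these approaches might be applied to the line-versus-parabola problem, the Python below scores both models on a small made-up data set. Several assumptions are baked in and are not from the editorial: the errors are taken to be Gaussian, the criteria are written in their common least-squares form up to an additive constant (AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln n), and k counts only the polynomial coefficients. The function names are hypothetical.

```python
# Illustrative sketch: AIC, BIC, and leave-one-out cross-validation for
# polynomial models. Data values and function names are assumptions.
import math


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a, b, c, ...] via the normal equations."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            rhs[row] -= factor * rhs[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(A[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (rhs[i] - tail) / A[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def score_model(xs, ys, degree):
    """Return RSS, AIC, BIC, and leave-one-out mean squared error."""
    n, k = len(xs), degree + 1
    coeffs = fit_polynomial(xs, ys, degree)
    rss = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    aic = n * math.log(rss / n) + 2 * k
    bic = n * math.log(rss / n) + k * math.log(n)
    # Leave-one-out: refit without point i, then predict point i.
    squared_errors = []
    for i in range(n):
        held_in_x = xs[:i] + xs[i + 1:]
        held_in_y = ys[:i] + ys[i + 1:]
        fold_coeffs = fit_polynomial(held_in_x, held_in_y, degree)
        squared_errors.append((ys[i] - predict(fold_coeffs, xs[i])) ** 2)
    return {"rss": rss, "aic": aic, "bic": bic, "loocv": sum(squared_errors) / n}


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.2, 2.7, 5.3, 6.8, 9.1, 11.2, 12.7, 15.3]  # roughly linear, made up

scores = {degree: score_model(xs, ys, degree) for degree in (1, 2)}
for degree, s in scores.items():
    print(degree, {name: round(value, 3) for name, value in s.items()})
```

For each criterion, the model with the lower score is preferred. The residual sum of squares alone can never favor the line, whereas the penalty terms and the held-out errors can; which model wins on a given data set is an empirical matter and is not guaranteed in advance.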

In the field of toxicogenomics, issues related to prediction and model selection are of vital importance. For example, toxicogenomic biomarkers should reliably predict toxic effects to help us develop safer drugs and chemicals and understand molecular mechanisms of pathogenesis. Models of genetic networks and gene expression-based classifiers are expected to predict consistently a cell's response to a stressful challenge and to classify unknown compounds. A keen awareness of Ockham's razor will help guide us on our quest to understand the nature of living systems and their behavior under various environmental conditions.

Ilya Shmulevich
Cancer Genomics Laboratory
The University of Texas M. D. Anderson Cancer Center
Houston, Texas, USA
E-mail: is@ieee.org

Ilya Shmulevich is an assistant professor at the Cancer Genomics Laboratory at The University of Texas M. D. Anderson Cancer Center. He is an associate editor of the Toxicogenomics Section of Environmental Health Perspectives. His research interests include computational genomics, systems biology, nonlinear signal and image processing, and computational learning theory.

On the 50th Anniversary of Solving the Structure of DNA

As biochemistry students at Aberdeen University in Scotland, our class studied and strategized together to prepare for our final honors degree exams, and in the British tradition, the results of those final exams would, alone, determine our final grade after four years of undergraduate study. During that final academic year (1973-1974), the 20th anniversary of the famous Watson and Crick publication (Watson and Crick 1953) was being loudly celebrated in the scientific literature. Our class predicted that questions about DNA structure and function would be heavily represented, if not overrepresented, in the final exams. We were right. Thirty years later it is an unexpected pleasure to be invited to join the chorus, indeed the symphony, celebrating the golden anniversary of the DNA double helix and the sequencing of a complete human genome and to reflect upon how deciphering the structure of DNA was fundamental to the fields of mutagenesis and genetic toxicology and more recently to the emerging field of toxicogenomics.

I have studied various aspects of mutagenesis and genetic toxicology for nearly three decades, and upon looking back at the history of genetics and molecular biology (wherein Watson and Crick obviously played a pivotal role), it becomes immediately apparent that with each insight into the structure and function of DNA came an accompanying insight into how DNA structure and function can go awry. While Watson and Crick's discovery of the complementary nature of the bases inside the DNA double helix immediately suggested a mechanism by which DNA could replicate, it did not suggest how this molecule ultimately dictates the nature of all proteins present in the cell (Watson and Crick 1953). Indeed, even with an immediate insight into how DNA might replicate, it was 5 years (1958) until the beautiful Meselson and Stahl experiment (Meselson and Stahl 1958) demonstrated semiconservative DNA replication, as predicted by Watson and Crick. It was to take 13 years (1966) before the genetic code was finally cracked, and during those 13 years there emerged a reasonably complete picture of how DNA, mRNA, tRNA, and ribosomes collaborate to produce proteins of genetically predetermined sequence.

After the Watson and Crick paper in 1953, along with every experiment that produced an ever more detailed molecular picture of how DNA replicates and of how DNA makes RNA makes proteins, there came immediate insights into how each of these processes can go wrong. For example, until we understood the workings of triplet codons and the genetic code, we could not understand (at the molecular level) how changes in the DNA sequence might ultimately produce missense, nonsense, frameshift, and other mutations. A detailed understanding of DNA chemistry also led to an exploration of how chemical and physical agents could alter that chemistry. From this followed the concept that damage to DNA might lead to permanent sequence changes and thus to different kinds of mutation. This is not to say that damage to cells had not already been shown to cause mutations. Indeed, Muller demonstrated in 1927 that X-rays could induce heritable mutations in Drosophila melanogaster, and for this he won the 1946 Nobel Prize in Physiology or Medicine (Muller 1927). But this discovery was 25 years before Hershey and Chase (1952) finally convinced the scientific world that genes reside in DNA, and 26 years before the structure of DNA was solved (Watson and Crick 1953). Thus, although the fields of mutagenesis and genetic toxicology have a history long before the structure of DNA was discovered, it has only been since 1953 that a molecular picture could be drawn of how toxic agents might interact with DNA to produce the biological end points of mutation and cytotoxicity. Moreover, the 1953 publication of Watson and Crick launched exquisitely detailed characterization of how DNA is faithfully replicated, and from this came an understanding of the role that DNA polymerases and such processes as recombination must play in the generation of DNA sequence changes. Parallel to these fundamental revelations were the observations that all organisms are equipped with a battery of genes that produce proteins whose primary roles are to prevent or repair chemical and physical damage to DNA; such activities protect against mutation and cell death induced by DNA-damaging agents, and studies of these activities eventually evolved into the field of genetic toxicology.

Genetic toxicology has been approached in two ways: a) with questions specifically aimed at understanding the molecular processes that influence the induction of DNA damage, and the toxic effects of such DNA damage; and b) with more general questions about the genes that influence the susceptibility of cells to toxic agents. The difference between these two approaches lies in the fact that the first is concerned only with toxicity resulting from genetic damage, and the second is concerned with genes that influence the toxicity of an agent, whether or not that toxicity emanates from damaged DNA. Both of these approaches to genetic toxicology are now evolving into the field of toxicogenomics.

With the dawning of the new millennium came one of the finest achievements in the history of biological research, namely, the sequencing of a complete human genome. Surely this was one of the most profound achievements to flow from the 1953 discovery of the structure of DNA. The working draft of this roughly 3.2 billion base pair sequence, the technological advances that were developed because of it, and the rapid electronic publication of the sequence as it was generated changed forever the ways in which biological and health-related research is being conducted. It is now possible, in principle, to address questions about all human genes in a massively parallel way, that is, questions related to the entire human genome, hence the term "genomics." The National Institute of Environmental Health Sciences (NIEHS) was very quick to realize the awesome potential of being able to interrogate the role of each and every gene in protecting humans against the detrimental health effects of exposure to environmental agents. The prescience of the NIEHS led to the launch of two major extramural research initiatives that have fostered the application of genomics to the environmental health sciences, namely, the Environmental Genome Project and the Toxicogenomics Research Consortium.

Several years ago the NIEHS established the Environmental Genome Project (http://www.niehs.nih.gov/envgenom/home.htm) to identify all common DNA variants, mainly single nucleotide polymorphisms (SNPs), for more than 500 human genes known (or likely) to influence cellular responses to toxic environmental agents. In the long term we will have an inventory of common SNPs for every gene in the human genome, but in the short term the Environmental Genome Project will provide us with focused information for genes already known to influence the biological consequences of exposure to toxic environmental agents. It is not difficult to imagine that it will soon be possible to screen individuals to determine their constellation of SNPs in these 500 or so genes deemed relevant to environmentally induced disease. This foray into genomic scale analysis will provide an important first step toward our being able to predict the response of an individual upon exposure to toxic environmental agents. However, it is quite clear that being able to identify the gene variants present in an organism is simply not enough. Genomic analyses must stretch far beyond the DNA to include RNA and protein; after all, DNA makes RNA makes protein. It is clear that we need to know the temporal aspects of how the environmentally relevant genes are expressed (in each cell type), as well as how their expressed products (RNA and protein) are modified and localized in the cell. We also must be able to predict how such expression, modification, and localization will change over time when individuals are exposed to environmental agents. Finally, armed with all this knowledge we must learn how to integrate the information into a systems biology view that not only is descriptive but also is predictive of the phenotype of cells, tissues, and ultimately people. We have not yet grasped how to do this, but we will have achieved one of the most exciting and powerful insights into biology when we find the ways.

The field of toxicogenomics has thus emerged to address these genomic-scale questions; moreover, the National Center for Toxicogenomics at the NIEHS recently established the Toxicogenomics Research Consortium (http://www.niehs.nih.gov/nct/trc.htm) to help launch and foster the development of the field. At the very least, transcriptional profiling using DNA microarrays and proteomic analysis using mass spectrometry represent the current major thrusts in toxicogenomics. In addition, the development of genomic approaches to systematically assess how each gene influences the phenotypic response of cells to environmental agents is well under way for model organisms such as Saccharomyces cerevisiae, and such "genomic phenotyping" is now being initiated for mammalian cells. It seems likely that within the next few years, libraries of small inhibitory RNAi constructs will be available for the systematic knock down of expression for each and every human gene in each of many different human cell types. It is inevitable that the fields of genomics and systems biology will mature as more efficient and sophisticated technologies emerge for quantitatively measuring global gene expression, global RNA and protein modification, and the dynamic trafficking and localization of cellular molecules. And just as genetic toxicology co-evolved with the fields of genetics and molecular biology, so will toxicogenomics co-evolve with the fields of genomics and systems biology.

The future test of toxicogenomics will be in our ability to accurately predict human susceptibility to the adverse effects of environmental agents. Perhaps, long before the golden anniversary of sequencing the human genome, it will be possible to determine individualized risk from environmental agents as part of a routine annual checkup. But before a timeline for this can even be envisioned, we must first learn to apply quantitative molecular assessments, engineering principles, and the informatics tools necessary to conduct successful predictive toxicology in model cellular systems.

Leona D. Samson
Biological Engineering Division and Center for Environmental Health Sciences
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA
E-mail: lsamson@mit.edu

Leona Samson is professor of biological engineering and toxicology at the Massachusetts Institute of Technology (MIT), director of the MIT Center for Environmental Health Sciences, and a member of the Executive Steering Committee for a new Initiative at MIT in Computational and Systems Biology (CSBi). She is also an associate editor for the Toxicogenomics Section of Environmental Health Perspectives.