Los Alamos National Laboratory

Science 1663

The Legacy of GenBank

The DNA Sequence Database That Set a Precedent

This year the life sciences research community celebrated the 25th anniversary of GenBank, the computerized database originally founded at Los Alamos to contain the information encoded in the genes of all life on Earth. Here, GenBank’s legacy is discussed by early members of the GenBank team: Gerald Myers, eventual founder of the genetic database for the AIDS virus, a GenBank offshoot; Christian Burks, now president of the Ontario Genomics Institute; and Chang-Shung Tung, current leader of the Los Alamos Theoretical Biology and Biophysics group.

Walter Goad, founder of GenBank, studies a DNA sequence.

1663: Los Alamos is primarily a physics lab, so how did GenBank come to be established here?

Tung: In the 1960s leading Los Alamos theoreticians, including the mathematician Stan Ulam and physicists Jim Tuck, George Bell, and Walter Goad, became fascinated by the revolution in biology—the ability to manipulate DNA and to understand how it controls an organism's development and replication. Goad, who later founded GenBank, liked to point out that biology was unlike anything known in physics because a single molecular change in DNA, a mutation, could be faithfully cloned millions of times in an organism, and then one could actually examine the mutation's consequences with the tools of physics and chemistry. These scientists met weekly for over a decade, and when DNA-manipulation tools made it possible to determine the sequence of building blocks in a DNA molecule, they became interested in using mathematical analysis to study the patterns of information contained in those sequences.

1663: So how do you decipher that information?

Tung: The information is determined by the order in which the basic building blocks, namely the four nucleotide bases adenine (A), thymine (T), cytosine (C), and guanine (G), are strung along a strand of DNA. Based on mathematics and the rule of parsimony, theoretical physicist George Gamow proposed that every three consecutive bases in a protein-coding gene form a three-letter word specifying one of the 20 possible amino acids that make up a protein. Gamow's basic concept was correct, but scientists took 10 years to crack the genetic code, the table that tells you which triplet of bases (called a codon) stands for which amino acid. The code was cracked through test tube experiments using synthetic pieces of DNA.
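
The triplet scheme Tung describes can be sketched in a few lines of Python. This is only an illustration, not anything GenBank itself ran: the table below is the standard genetic code written in the conventional T, C, A, G ordering, with `*` marking the three stop codons, and the names `CODON_TABLE` and `translate` are hypothetical helpers introduced here.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# Each letter is a one-letter amino acid symbol; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 possible codons to its amino acid (or stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read a coding sequence three bases at a time until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon: translation ends here
            break
        protein.append(aa)
    return "".join(protein)


# ATG -> M (methionine), GCC -> A (alanine), TGA -> stop
print(translate("ATGGCCTGA"))
```

With four bases there are 4 x 4 x 4 = 64 possible triplets, which is more than enough to cover 20 amino acids plus a stop signal, so several codons map to the same amino acid.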

Myers: The first really interesting published sequence was not for DNA but for an RNA molecule known as transfer RNA (tRNA), which carries a single amino acid and takes part in protein synthesis. It took a year to determine the exact sequence of the tRNA's 75 bases, but the result led to an understanding of the role tRNA plays in the creation of proteins. The completed sequence revealed an exposed loop of three bases identical to a codon of a protein-coding gene. It was then clear that tRNA was the adaptor molecule that Francis Crick, a decade earlier, had predicted must exist to serve as a chemical bridge between a codon in a gene and the corresponding amino acid. Discovering that triplet of bases on the tRNA molecule, the carrier of amino acids, showed how the genetic code was implemented in the cell.

The growth of sequence data in GenBank.

Burks: That first sequence was published in March 1965, and it took almost a year to crank it out. A decade or so later, sequencing really took off when Fred Sanger in England and Allan Maxam and Walter Gilbert at Harvard published much more rapid sequencing methods for DNA. Academic groups began producing sequences hundreds and thousands of bases long, and computers became essential for sequence storage and analysis.

Tung: People from throughout biology immediately saw how DNA sequences could be used to pinpoint and track genetic changes. Data began to accumulate very rapidly. In 1979, a meeting was organized at Rockefeller University to discuss how these sequence data could be collected and managed for public dissemination. Mike Waterman and Temple Smith, who attended that meeting, reported on it and convinced several Los Alamos people, including Walter Goad, to think about developing a data bank for DNA sequence information.

Burks: By then, George Bell and Walter Goad had started the Theoretical Division's new Theoretical Biology and Biophysics group. They, along with Charles DeLisi, who later helped start the Human Genome Project, were dedicated to bringing the mathematical and computational prowess of theoretical physics to bear on molecular biology. Walter wrote the proposal for "The Los Alamos Sequence Library," which got funding from Laboratory-Directed Research and Development, the discretionary research program at Los Alamos.

Francis Crick, James Watson, Maurice Wilkins, and Rosalind Franklin’s double-helix DNA structure.

Other groups around the country were also interested in starting a database, but Los Alamos moved quickly and in 1982, in partnership with BBN Laboratories, a Cambridge, Massachusetts, engineering company, won a competitive bid for a public sequence database to be sponsored by the National Institutes of Health (NIH). That's when the name changed to GenBank.

1663: Was there something special about the Los Alamos proposal?

Burks: On one hand, some thought the NIH should never trust such a database to "those weapons mongers" in Los Alamos. On the other hand, Los Alamos had people with very bright, very agile minds who didn't care about disciplinary boundaries and were quick to get things going. None of those people came from molecular biology and DNA sequencing, but they all came with an incredible endowment of intelligence and experience. In addition, Goad proposed a remarkably community-focused strategy for providing access to the sequence data. His model was wide open: talk to anyone, take suggestions from anyone, share data with anyone. This strategy is still reflected in GenBank's ongoing productive collaborations with two other such databases—the EMBL Data Library in Europe and the DDBJ in Japan.

1663: Were Los Alamos computers a factor in the NIH choice?

Burks: Goad certainly pitched our computational resources, but building the database was primarily a word-processing task, so over the first two years, the project migrated off the big mainframe computers and onto a single personal computer in Walter's office. The project later became one of the first at Los Alamos to adopt Sun Microsystems workstations and the Unix operating system. We were probably the first group worldwide to adopt Sybase's relational database management system for molecular biological data.

Myers: In those early days, all the sequences appeared in published papers first, so GenBank hired typists to enter them into the database manually. When a new sequence of special interest to one of us was published, we'd elbow a typist to put it at the front of the line because we were so eager to get it into GenBank and start the analysis.

Christian Burks, president of the Ontario Genomics Institute

Burks: Very soon after GenBank began, personal computers and portable electronic media were taking hold, and it became possible for authors to submit their sequence data electronically, eliminating retyping. We knew that using electronic media would be essential to GenBank's keeping up with the ever-increasing rate of new published sequences, so we lobbied the NIH for increased resources to build up the infrastructure. GenBank's second 5-year contract quintupled the project budget. This allowed us to implement the electronic publishing paradigm, including developing the computing infrastructure to monitor the flow of electronic data.

In addition, we lobbied the journals to require authors to electronically submit their data directly to Los Alamos before publication. It was a radical proposal that stirred up a heated debate about everything from the autonomy of journalists to the civil liberties of scientists. But the interest in making the data quickly available carried the day, and in a couple of years, most journals went from saying "never in our lifetime" to making electronic submission a requirement. That set a precedent for the Human Genome Project as it was getting off the ground in the late 1980s.

Myers: Data accuracy was a key issue, however, and people began to worry about totally fallacious sequences getting into GenBank. Once in there, they'd be very, very difficult to get out. We actually had such a case. A viral sequence that turned out to be from a monkey was thought to be a new form of the human AIDS virus. It stayed that way in GenBank's human category for a decade before being corrected. Taking in the influx of new sequence data was like drinking from a fire hose all the time; once stuff got into GenBank, we had no time to go back and review it. Also, sequence analysis was not part of the charter for GenBank, which was focused first on being a complete, current archive. That fact ultimately led to specialized databases that could curate the data in a more leisurely manner.

1663: GenBank started a specialized database for the AIDS virus. Did that include sequence analysis?

Gerald Myers, founder of the HIV database

Myers: Yes. In 1986, soon after an isolate of the AIDS virus was first sequenced, GenBank was funded by NIH to start a combined sequence database and analysis center for HIV. We thought the project would run about a year, but the virus mutated very rapidly within a single individual, so we soon learned we could expect a flood of widely varying viral sequences from around the world. NIH tripled our funding, and the project is still ongoing.

1663: What have you learned?

Myers: Our initial focus was on molecular epidemiology, tracking AIDS outbreaks through the sequenced viruses rather than through people. We helped assess the virus's average rate of mutation. We got involved in the France-United States dispute about who discovered the AIDS virus, and we helped the Centers for Disease Control track unexpected transmission pathways, like a dentist's transmission of HIV to his patients. The viral genes were rapidly identified from the sequences, the biology of the virus became apparent very quickly, and sequence analysis helped with the development of a drug cocktail. But the virus mutates so rapidly that vaccine development has been all but impossible.

Tung: There might be new hope on the horizon. Bette Korber, a scientist in the Theoretical Biology group and the present leader of the HIV database, together with her team, is using the entire set of sequence information to derive three new vaccines: the "consensus," the "best natural," and the "mosaics." All three were developed to target viral strains across the globe. Extensive animal tests are underway, and the results look quite promising. Small-scale human trials are in the initial stages of planning. These vaccines might finally deal a lethal blow to the AIDS virus. Bette won the 2004 E. O. Lawrence Award in life sciences for her work on this front.

1663: Are there other exciting developments on the horizon?

Chang-Shung Tung, leader of the Theoretical Biology and Biophysics group at Los Alamos

Tung: AIDS researchers are planning to use new machines that in one run—only hours—can sequence the DNA from 100,000 different viral particles found in a single human being. The result will reveal the diversity of the virus within an individual. Continued sampling and sequencing can then be used to track how that diversity changes under medical treatments.

Burks: A related development is metagenomics, an approach started by Craig Venter to sequence, en masse, the DNA from a broad spectrum of organisms in an environmental sample—say, a liter of seawater or a few tablespoons of soil—and use computational analysis to tie sequences to separate genes and species. It's nearly impossible to isolate and cultivate the species individually before sequencing. Information from such work can eventually be applied to developing new industrial enzymes or harnessing bacteria for environmental cleanup.

The Ontario Genomics Institute is funding the development of DNA barcodes. A barcode is a short stretch of DNA, 700 nucleotides long, that is found in a mitochondrial gene in every animal on Earth. By sequencing 10 examples of that stretch from each of a half-million species, we can build up a database of barcodes that would enable nearly automatic species identification for most animals. This will have a tremendous impact on regulatory or forensic proceedings in which exact knowledge of the species involved can make a difference in the outcome.
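
To make the barcode idea concrete, here is a minimal Python sketch of looking up an unknown sample in a reference library. Everything in it is an assumption made for illustration: the 12-base sequences are invented toy data (not real barcodes), the species labels are placeholders, and matching by simple mismatch count stands in for whatever alignment method a real barcode database would use. The last lines work out the database size implied by the figures quoted above.

```python
def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def identify(query, library):
    """Return (species, mismatch_count) for the closest reference barcode."""
    best = min(library, key=lambda name: mismatches(query, library[name]))
    return best, mismatches(query, library[best])


# Toy reference library: invented sequences, placeholder species names.
library = {
    "species_A": "ACGTACGTACGT",
    "species_B": "ACGTTCGAACGA",
    "species_C": "TTGTACCTAGGT",
}

# An "unknown" sample that differs from species_A at one position.
print(identify("ACGTACGTACGA", library))

# Size of the database described in the interview:
# 10 examples x 500,000 species x 700 nucleotides per barcode.
total_bases = 10 * 500_000 * 700
print(total_bases)
```

By this arithmetic the full barcode library would hold 3.5 billion nucleotides.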

Myers: If we consider that the human body contains about 10 bacterial cells for every "human" cell and that the entire human genome contains an overwhelming number of sequences from viruses, we start to see a human being as a community of microbes. DNA barcoding might begin to show human diversity not just at the genomic level but also at the level of the microbes that a body contains. Both our metabolism and our mental state may depend on the organisms we're carrying around in addition to the genes we've inherited.

Burks: We're also finding surprising variations in the structure of the human genome. For example, the number of copies of a given gene can vary widely from individual to individual. If those regions code for a particular protein, they may lead to different amounts of that protein in different individuals, and the "extra" copies could be a signal that the protein is evolving new functional capabilities in that individual.

Myers: Those protein differences could have an impact on human behavior. There are multiple copies of regulatory elements, the promoter genes that are involved with controlling serotonin production. As that copy number increases, the chances for depression increase. That's an excellent example of how multiple copies of regulatory elements affect particular traits.

1663: How fast have sequence data been accumulating?

Burks: The sequence database has been doubling every 18 months since 1979. I remember the official memorandum announcing that tea and cookies would be served to celebrate entry of the first 100,000 nucleotides into the Los Alamos Sequence Library. We hit a million bases in March 1982 and had a wild celebration. Now the number is about 200 billion bases.
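
The figures Burks quotes can be checked against the 18-month doubling claim with a short calculation. This is a rough sketch under one stated assumption: "now" is taken to be about 26 years after March 1982, a span guessed from the page's 2008-09 copyright notice rather than given in the interview. The function name `doubling_time_months` is a hypothetical helper.

```python
import math


def doubling_time_months(start_bases, end_bases, years):
    """Average doubling time implied by growth between two data points."""
    doublings = math.log2(end_bases / start_bases)
    return years * 12 / doublings


# From the interview: one million bases in March 1982, about 200 billion "now".
# ASSUMPTION: "now" is roughly 26 years after March 1982.
start = 1_000_000
end = 200_000_000_000
months = doubling_time_months(start, end, years=26)

print(f"{math.log2(end / start):.1f} doublings")
print(f"about {months:.1f} months per doubling")
```

Growing from a million to 200 billion bases takes between 17 and 18 doublings, which over roughly 26 years works out to a little under 18 months per doubling, in line with the rate Burks cites.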

1663: Are databases prepared to handle and support analysis of metagenomic data?

Burks: GenBank, which has been run by the NIH since 1992, now has a section called Whole Genome Shotgun that is specifically designed to archive the information from metagenomic sequencing. With metagenomic data, you don't necessarily know right away what organism the individual sequences came from or even what gene an individual sequence is associated with. That makes metagenomic data more of a challenge to organize and annotate than traditional sequence data.

1663: Any last words?

Burks: In looking back I would say that Los Alamos has had a tremendous impact on the world through GenBank. Los Alamos got this endeavor off the ground through scientific freedom within the Lab, interdisciplinary freedom, and a strong, competitive student and postdoctoral program that attracted bright new minds to Los Alamos to explore new frontiers. It's been incredibly enabling for the whole world.

Tung: Los Alamos is holding a colloquium on August 5 to honor this proud legacy, and Mike Waterman, Bette Korber, and Gerry will be speaking.

—Necia Grant Cooper and Eileen Patterson

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

© Copyright 2008-09 Los Alamos National Security, LLC. All rights reserved.