Learning to Read the Book of Life

Though scientists have decoded almost the entire human genome — not to mention the genomes of myriad other species — there’s a long way to go to make sense of all the data. Which genes, for example, make brain cells different from blood cells? Are there quick genetic tests that can readily distinguish relatively harmless bacteria from close relatives that could be used as terror weapons? How can what we know about genes help us understand what goes awry in conditions like cancer?


A new technique developed and patented at Brookhaven National Laboratory with funding from the Office of Biological and Environmental Research within the U.S. Department of Energy’s Office of Science holds enormous promise for answering such questions. Called “genome signature tagging,” the technique relies on the tools of molecular biology — gene chopping and splicing — to break the problem of identifying genes into manageable pieces, and powerful computer algorithms to speed comparisons.

“We are just beginning to decipher the meaning of the series of nucleotide bases that make up the source code for running the machinery of cells,” says Brookhaven biologist John Dunn. “It’s as if we have in our hands a giant book of life, but we are barely beginning to learn how to read it. Our technique gives us a new way to index the code.”

Of course, scientists have known since the mid-1950s that sequences of the nucleotide bases adenine, thymine, guanine, and cytosine (known by the code letters A, T, G, and C) direct living things to make proteins. In the 1960s, they worked out the specific sets of three sequential letters that code for each amino acid, the protein “building blocks” used by the cells’ construction machinery.
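
To make the three-letter code concrete, here is a minimal Python sketch of how a stretch of DNA letters can be read, three at a time, into amino acids. It uses the standard genetic code with one-letter amino acid abbreviations; the names CODON_TABLE and translate are illustrative choices for this example, not part of any tool described in this article, and the sketch assumes the sequence is a clean coding strand read from its first letter.

```python
# Illustrative sketch only: reads a DNA coding strand three letters at a time.
# CODON_TABLE and translate are hypothetical names chosen for this example.

BASES = "TCAG"
# Standard genetic code, one-letter amino acid symbols, "*" marks a stop signal.
# The string is ordered to match codons generated in T, C, A, G order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna):
    """Translate a DNA coding strand into a string of amino acid letters.

    Reading starts at the first letter and stops at the first stop codon
    or when fewer than three letters remain.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGA"))  # ATG -> M, GCC -> A, TGA -> stop
```

Reading the genome is far harder than this lookup suggests, because the table says nothing about where genes begin, which ones are switched on in a given cell, or how they differ between organisms.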

But finding quick ways to sequence the code has been a big challenge, as has determining which genes are at work within different types of cells, how turning genes on or off can trigger cancer, and where closely related species differ.

Breaking down the problem

The problem has been one of sheer magnitude: The human genome, for example, consists of some three billion nucleotides and tens of thousands of genes. Sequencing entire genomes to find the subtle differences between, say, someone with cancer and someone without would be highly specific and informative, but also extremely labor-intensive and costly.

Even using genome sequencing to identify bacterial species, which have much smaller genomes, is a daunting task when you consider that thousands of species reside in a mere teaspoonful of soil.

So scientists have been searching for ways to identify key segments of genetic code that are short enough to sequence rapidly yet still yield enough information to meet their research goals.