From the Genome to the Proteome
Cells
are the fundamental working units of every living system. All the
instructions needed to direct their activities are contained within
the chemical DNA (deoxyribonucleic acid).
DNA
from all organisms is made up of the same chemical and physical
components. The DNA sequence is the particular side-by-side arrangement
of bases along the DNA strand (e.g., ATTCCGGA). This order spells
out the exact instructions required to create a particular organism
with its own unique traits.
The genome
is an organism’s complete set of DNA. Genomes vary widely in size:
the smallest known genome for a free-living organism (a bacterium)
contains about 600,000 DNA base pairs, while human and mouse genomes
have some 3 billion. Except for mature red blood cells, all human
cells contain a complete genome.
DNA in the human genome
is arranged into 24 distinct chromosomes--physically
separate molecules that range in length from about 50 million to
250 million base pairs. A few types of major chromosomal abnormalities,
including missing or extra copies or gross breaks and rejoinings
(translocations), can be detected by microscopic examination. Most
changes in DNA, however, are more subtle and require a closer analysis
of the DNA molecule to find perhaps single-base differences.
Each chromosome contains
many genes, the basic physical
and functional units of heredity. Genes are specific sequences of
bases that encode instructions on how to make proteins. Genes comprise
only about 2% of the human genome; the remainder consists of noncoding
regions, whose functions may include providing chromosomal structural
integrity and regulating where, when, and in what quantity proteins
are made. The human genome is estimated to contain 20,000-25,000
genes.
Although genes get a
lot of attention, it’s the proteins
that perform most life functions and even make up the majority of
cellular structures. Proteins are large, complex molecules made
up of smaller subunits called amino acids. Chemical properties that
distinguish the 20 different amino acids cause the protein chains
to fold up into specific three-dimensional structures that define
their particular functions in the cell.
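As a rough illustration of how a gene's base sequence spells out a chain of amino acids, the sketch below reads a DNA coding sequence three bases at a time and looks up each triplet in the standard genetic code. The function name `translate` and the example sequences are hypothetical, and the sketch assumes the input is already a clean coding strand that starts at its first triplet; real genes also contain noncoding stretches that this toy ignores.

```python
# Standard genetic code in compact form: codons are enumerated with the
# bases in the order T, C, A, G at each of the three positions.
# "*" marks a stop signal rather than an amino acid.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence into one-letter amino acid codes.

    Assumption: `dna` starts in frame; translation stops at the first
    stop codon or when fewer than three bases remain.
    """
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 9-base example: ATG (M), GCC (A), TAA (stop).
print(translate("ATGGCCTAA"))
```

The table contains 64 triplets but only 20 distinct amino acids, so several triplets map to the same amino acid.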
The constellation of
all proteins in a cell is called its proteome.
Unlike the relatively unchanging genome, the dynamic proteome changes
from minute to minute in response to tens of thousands of intra-
and extracellular environmental signals. A protein’s chemistry and
behavior are specified by the gene sequence and by the number and
identities of other proteins made in the same cell at the same time
and with which it associates and reacts. Studies to explore protein
structure and activities, known as proteomics, will be the focus
of much research for decades to come and will help elucidate the
molecular basis of health and disease.
How is genome sequencing done?
A PDF illustration of the process, along with a step-by-step illustrated
guide to how sequencing is done, is available courtesy of the Department
of Energy's Joint Genome Institute.
- Chromosomes, which range in size from 50 million to 250 million bases,
must first be broken into much shorter pieces (subcloning step).
- Each short piece is used as a template to generate a set of fragments
that differ in length from each other by a single base that will be
identified in a later step (template preparation and sequencing
reaction steps).
See a
figure depicting the sequencing reaction.
- The fragments in a set are separated by gel electrophoresis (separation
step).
New fluorescent dyes allow all four sets of fragments (one set per base)
to be separated in a single lane on the gel.
See an example of an electropherogram using fluorescent dyes.
- The final base at the end of each fragment is identified (base-calling
step). This process recreates the original sequence of As, Ts, Cs,
and Gs for each short piece generated in the first step.
Current electrophoresis limits are about 500 to 700 bases sequenced
per read. Automated sequencers analyze the resulting electropherograms
and the output is a four-color chromatogram showing peaks that represent
each of the four DNA bases.
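The logic of the steps above can be mimicked in a few lines. This is a toy model, not a description of any real sequencer: it assumes an idealized reaction that yields exactly one fragment of every possible length, it treats each fragment as a direct copy of the template rather than a complementary strand, and the function names are hypothetical.

```python
import random


def make_fragments(template):
    """Sequencing reaction (idealized): one fragment for every length,
    so neighbouring fragments differ by a single base."""
    return [template[:length] for length in range(1, len(template) + 1)]


def separate_by_length(fragments):
    """Separation step: electrophoresis orders fragments by size."""
    return sorted(fragments, key=len)


def call_bases(ordered_fragments):
    """Base-calling step: read the final base of each fragment in turn."""
    return "".join(fragment[-1] for fragment in ordered_fragments)


template = "ATTCCGGA"           # hypothetical short piece from subcloning
fragments = make_fragments(template)
random.shuffle(fragments)       # fragments start out mixed together
recovered = call_bases(separate_by_length(fragments))
print(recovered == template)    # True
```

Because every fragment has a distinct length, sorting by length is enough to restore the order in which the final bases must be read.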
After the bases are "read," computers are used to assemble the short
sequences (in blocks of about 500 bases each, called the read length)
into long continuous stretches that are analyzed for errors, gene-coding
regions, and other characteristics.
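One simple way to picture the assembly step is greedy overlap merging: repeatedly join the two reads whose ends overlap the most. The sketch below is an illustrative stand-in, not the method used by the Human Genome Project's assembly software; the function names, the `min_len` threshold, and the example reads are all hypothetical, and the approach is known to go wrong on repetitive sequence.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if the best overlap is shorter than `min_len`)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise, largest overlap first, until none overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads cannot be joined
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical overlapping reads from one longer stretch.
print(greedy_assemble(["ATTCCGGA", "CGGATTAC", "TTACGGCA"]))
```

Real reads are hundreds of bases long and contain errors, so production assemblers rely on approximate matching and far more careful handling of repeats than this sketch attempts.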
Researchers then go to considerable trouble to "finish" the raw sequence
produced by automated sequencers.
Finished sequence is submitted to major public sequence databases,
such as GenBank. Human Genome Project sequence
data are thus made freely available to anyone around the world.
For more on genome sequencing, see the Sequencing
Fact Sheet.
What We've Learned So Far
What Does the Draft Human Genome Sequence Tell Us?
By the Numbers
- The human genome contains 3164.7 million chemical nucleotide bases
(A, C, T, and G).
- The average gene consists of 3000 bases, but sizes vary greatly,
with the largest known human gene being dystrophin at 2.4 million bases.
- The total number of genes is estimated at 30,000, much lower
than previous estimates of 80,000 to 140,000, which had been based on
extrapolations from gene-rich areas rather than on a composite of gene-rich
and gene-poor areas.
- Almost all (99.9%) nucleotide bases are exactly the same in all people.
- The functions are unknown for over 50% of discovered genes.
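A little arithmetic puts the figures above in proportion. The sketch below only combines numbers quoted in this list (genome size, average gene length, the 30,000-gene estimate, and 99.9% identity); the variable names are hypothetical, and the results are back-of-the-envelope values, not measurements.

```python
GENOME_BASES = 3_164_700_000   # 3164.7 million bases
AVERAGE_GENE_BASES = 3_000     # average gene size quoted above
ESTIMATED_GENES = 30_000       # draft-sequence gene estimate quoted above

# Bases covered by genes if every gene were of average size.
gene_bases = ESTIMATED_GENES * AVERAGE_GENE_BASES
gene_fraction = gene_bases / GENOME_BASES

# 99.9% identical means about 1 base in 1,000 differs between two people.
differing_bases = GENOME_BASES // 1_000

print(f"bases in genes:  {gene_bases:,} ({gene_fraction:.1%} of the genome)")
print(f"differing bases: about {differing_bases:,}")
```

On these figures, genes of average size would span roughly 90 million bases, a few percent of the genome, and two people would differ at roughly 3 million positions.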
The Wheat from the Chaff
- Less than 2% of the genome codes for proteins.
- Repeated sequences that do not code for proteins ("junk DNA") make
up at least 50% of the human genome.
- Repetitive sequences are thought to have no direct functions, but
they shed light on chromosome structure and dynamics. Over time, these
repeats reshape the genome by rearranging it, creating entirely new
genes, and modifying and reshuffling existing genes.
- During the past 50 million years, a dramatic decrease seems to have
occurred in the rate of accumulation of repeats in the human genome.
How It's Arranged
- The human genome's gene-dense "urban centers" are predominantly composed
of the DNA building blocks G and C.
- In contrast, the gene-poor "deserts" are rich in the DNA building
blocks A and T. GC- and AT-rich regions usually can be seen through
a microscope as light and dark bands on chromosomes.
- Genes appear to be concentrated in random areas along the genome,
with vast expanses of noncoding DNA between.
- Stretches of up to 30,000 C and G bases repeating over and over often
occur adjacent to gene-rich areas, forming a barrier between the genes
and the "junk DNA." These CpG islands are believed to help regulate
gene activity.
- Chromosome 1 has the most genes (2968), and the Y chromosome has
the fewest (231).
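The contrast between GC-rich and AT-rich regions can be measured directly by computing the fraction of G and C bases in successive windows along a sequence. The following is a minimal sketch; the function names, the window size, and the toy sequence are assumptions chosen for illustration, and real analyses use windows of thousands of bases.

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=None):
    """GC fraction for each window of length `window`, moving `step` bases
    at a time (non-overlapping windows by default)."""
    step = step or window
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: a GC-rich stretch followed by an AT-rich stretch.
for start, fraction in windowed_gc("GGCCGCGCATATTAAT", window=8):
    print(start, fraction)
```

Plotting such window values along a chromosome is one way to visualize the "urban centers" and "deserts" described above.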
How the Human Compares with Other Organisms
- Unlike the human's seemingly random distribution of gene-rich areas,
many other organisms' genomes are more uniform, with genes evenly spaced
throughout.
- Humans have on average three times as many kinds of proteins as the
fly or worm because of mRNA transcript "alternative splicing" and chemical
modifications to the proteins. This process can yield different protein
products from the same gene.
- Humans share most of the same protein families with worms, flies,
and plants, but the number of gene family members has expanded in humans,
especially in proteins involved in development and immunity.
- The human genome has a much greater portion (50%) of repeat sequences
than the mustard weed (11%), the worm (7%), and the fly (3%).
- Although humans appear to have stopped accumulating repeated DNA
over 50 million years ago, there seems to be no such decline in rodents.
This may account for some of the fundamental differences between hominids
and rodents, although gene estimates are similar in these species. Scientists
have proposed many theories to explain evolutionary contrasts between
humans and other organisms, including those of life span, litter sizes,
inbreeding, and genetic drift.
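The point about alternative splicing in the list above, that one gene can yield several protein products, can be made concrete with a toy model. The sketch assumes the simplest form of splicing, in which the first and last exons are always kept and each internal exon is either included or skipped; the function name and the exon sequences are hypothetical.

```python
from itertools import combinations


def splice_isoforms(exons):
    """List every transcript obtainable by skipping internal exons.

    Assumption: the first and last exons are always retained and exon
    order never changes; `exons` must hold at least two entries.
    """
    first, *middle, last = exons
    isoforms = []
    for kept_count in range(len(middle) + 1):
        for kept in combinations(middle, kept_count):
            isoforms.append("".join((first, *kept, last)))
    return isoforms


# One hypothetical gene with two optional internal exons.
for transcript in splice_isoforms(["ATG", "GCC", "TTT", "TAA"]):
    print(transcript)
```

With n optional exons this model gives 2**n transcripts from a single gene, which hints at how the number of proteins can outgrow the number of genes.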
Variations and Mutations
- Scientists have identified about 1.4 million locations where single-base
DNA differences (SNPs) occur in humans. This information promises to
revolutionize the processes of finding chromosomal locations for disease-associated
sequences and tracing human history.
- The ratio of germline (sperm or egg cell) mutations is 2:1 in males
vs females. Researchers point to several reasons for the higher mutation
rate in the male germline, including the greater number of cell divisions
required for sperm formation than for eggs.
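In its simplest form, spotting single-base differences means comparing two sequences position by position. The sketch below assumes the two sequences are already aligned and of equal length, which real data rarely are without an alignment step first; the function name and example sequences are hypothetical.

```python
def find_single_base_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumption: both sequences are pre-aligned and the same length;
    positions are counted from 0.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Two hypothetical sequences that differ at one position.
print(find_single_base_differences("ATTCCGGA", "ATTCTGGA"))
```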
Applications, Future Challenges
Deriving meaningful knowledge from the DNA sequence will define research
through the coming decades to inform our understanding of biological systems.
This enormous task will require the expertise and creativity of tens of
thousands of scientists from varied disciplines in both the public and private
sectors worldwide.
The draft sequence already is having an impact on finding genes associated
with disease. A number of genes have been pinpointed and associated with
breast cancer, muscle disease, deafness, and blindness. Additionally, finding
the DNA sequences underlying such common diseases as cardiovascular disease,
diabetes, arthritis, and cancers is being aided by the human variation
maps (SNPs) generated in the HGP in cooperation with the private sector.
These genes and SNPs provide focused targets for the development of effective
new therapies.
One of the greatest impacts of having the sequence may well be in enabling
an entirely new approach to biological research. In the past, researchers
studied one or a few genes at a time. With whole-genome sequences and
new high-throughput technologies, they can approach questions systematically
and on a grand scale. They can study all the genes in a genome, for example,
or all the transcripts in a particular tissue or organ or tumor, or how
tens of thousands of genes and proteins work together in interconnected
networks to orchestrate the chemistry of life.
The Next Step: Functional Genomics
The words of Winston Churchill, spoken in 1942 after 3 years of war,
capture well the HGP era: "Now this is not the end. It is not even the
beginning of the end. But it is, perhaps, the end of the beginning."
The avalanche of genome data grows daily. The new challenge will be to
use this vast reservoir of data to explore how DNA and proteins work with
each other and the environment to create complex, dynamic living systems.
Systematic studies of function on a grand scale, known as functional genomics, will
be the focus of biological explorations in this century and beyond. These
explorations will encompass studies in transcriptomics, proteomics, structural
genomics, new experimental methodologies, and comparative genomics.
- Transcriptomics involves large-scale analysis of messenger
RNAs transcribed from active genes to follow when, where, and under
what conditions genes are expressed.
- Studying protein expression and function--or proteomics--can
bring researchers closer to what's actually happening in the cell than
gene-expression studies. This capability has applications to drug design.
- Structural genomics initiatives are being launched worldwide
to generate the 3-D structures of one or more proteins from each protein
family, thus offering clues to function and biological targets for drug
design.
- Experimental methods for understanding the function of DNA sequences
and the proteins they encode include knockout studies to inactivate
genes in living organisms and monitor any changes that could reveal
their functions.
- Comparative genomics, the side-by-side analysis of DNA sequence patterns
of humans and well-studied model organisms, has become one of the
most powerful strategies for identifying human genes and interpreting
their function.