Building a Genomic Information Infrastructure at NCBI
A major challenge of the Human Genome Project is to organize, analyze,
and interpret the flood of data emerging from sequencing projects
worldwide. NCBI's Web site strives to offer an integrated, one-stop,
genomic information resource for data that promise to provide new
insights into human biology and new approaches for combating disease.
The Human Genome Project
The Human Genome Project is a publicly financed international
research effort whose goal is to decipher the human genetic code
and to provide these data freely and rapidly to the public. On June
26, 2000, members of the Human Genome Project announced that they
had succeeded in sequencing a "working draft" of the human
genome. An article published in the February 15, 2001 issue of the
journal Nature outlines the strategies and methodologies used
by this group to generate the draft sequence. Sequencing of the
human genome represents a scientific milestone, and the data
are of immediate use in many important ways. To help researchers
further understand and use the information encoded in this "human
blueprint", the National Center for Biotechnology Information
(NCBI) provides worldwide access to these data through its public
Web site (http://www.ncbi.nlm.nih.gov).
The Big Picture:
Integrating Vast Quantities of Disparate Data
Sequencing of the human genome signifies the beginning of an exciting
new era of science. As an international leader in the field of computational
biology and bioinformatics, the NCBI is playing an active
and collaborative role in further deciphering the human genome.
NCBI investigators have designed and developed, as well as manage
and operate, a number of unique and powerful public databases essential
to the Project. For example, GenBank
is the NIH sequence database maintained by the NCBI that stores
the sequence data generated by the centers involved in the Human
Genome Project. GenBank is one of three databases that make
up the International Nucleotide Sequence Database collaboration.
NCBI's partners in this effort include the European Bioinformatics
Institute in the United Kingdom and the National Institute of Genetics
in Japan. All three institutions work together to make the sequence
data generated by the Human Genome Project rapidly and freely accessible
to scientific communities worldwide.
NCBI investigators are also developing and enhancing software tools
that will enable gene discovery. These tools, also freely accessible
to the public, are being used by NCBI to assemble, annotate, and
analyze the human genomic sequence, as well as the genomic sequences
of other model organisms. These sophisticated tools allow researchers
to store, organize, analyze, and integrate
vast quantities of diverse data, such as DNA and protein sequences,
gene and chromosome maps, and protein structures. Information derived
from these studies has allowed researchers to make new connections
between seemingly disparate data and to shape more biologically
meaningful views of these data.
Assembling the Human Genome
Anyone with a computer and an Internet connection can now explore
the draft sequence of the human genome. A companion site has been
designed to jump-start an individual who wants to make use of this
information but is not sure where or how to start.
NCBI released its first assembled view of the human genomic
sequence. This assembly is based not only on the finished
and draft sequences deposited by the Human Genome Sequencing
Centers in GenBank but also on sequences contributed to GenBank
by individual scientists from around the world. Hence, this resource
is truly an "international public sequencing effort". Assembling
the sequences is an ongoing process that involves many different
steps before the data may be merged into segments of contiguous
DNA. NCBI continues to improve the genome assembly by incorporating
new data, filling in existing gaps, and increasing overall accuracy.
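The idea of merging overlapping sequences into segments of contiguous DNA can be illustrated with a toy example. The sketch below is not NCBI's assembly procedure; it is a minimal greedy overlap merge written for illustration, and the function names (overlap, greedy_assemble) and the min_len parameter are assumptions. It assumes error-free fragments on a single strand, which real sequencing data do not satisfy.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Hypothetical helper for illustration only: no sequencing errors,
    no reverse complements, no handling of repeats or contained reads.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            # No remaining pair overlaps; what is left are separate contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

Fragments that share no sufficient overlap stay as separate contigs, which is the toy analogue of a gap that later data may fill.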
Annotating the Human Genome
A team of NCBI scientists is also engaged in the process of annotating,
or labeling, the biologically important areas of the genome. Annotation
permits researchers to analyze the data in a systematic, comprehensive,
and consistent manner. There are two tasks involved in annotation.
The first is the correct placement of known genes into the
proper genomic context, and the second is the prediction of previously
unknown genes based on the assembled genomic sequence.
Aligning Known Genes
In the first task, messenger RNAs (mRNAs) from the NCBI RefSeq
collection (a non-redundant set of reference sequences, including
genomic contigs, mRNAs of known genes, and proteins) are placed on
the genome primarily by sequence alignment using tools developed
at NCBI. Computer modeling is used to compensate for and
overcome various problems associated with aligning the genomic and
mRNA sequences.
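One reason placing an mRNA on genomic sequence is harder than a plain string search is that the mRNA lacks introns, so it matches the genome as a series of separated blocks. The sketch below is a simplified illustration of that idea, not the tools developed at NCBI: the function name place_mrna and the min_exon parameter are assumptions, and the approach (greedy exact matching, left to right) ignores mismatches, sequencing errors, and splice-site signals that real alignment tools must model.

```python
def place_mrna(genome: str, mrna: str, min_exon: int = 4) -> list[tuple[int, int]]:
    """Return 0-based, half-open genome intervals for exon-like blocks.

    Greedy toy method: take the longest prefix of the unplaced mRNA that
    occurs exactly in the genome downstream of the previous block.
    """
    blocks = []
    g_pos = 0  # search only downstream of the previous block
    m_pos = 0  # index of the next unplaced mRNA base
    while m_pos < len(mrna):
        best_start, best_len = -1, 0
        for length in range(len(mrna) - m_pos, min_exon - 1, -1):
            start = genome.find(mrna[m_pos:m_pos + length], g_pos)
            if start != -1:
                best_start, best_len = start, length
                break
        if best_len == 0:
            raise ValueError(f"cannot place mRNA from offset {m_pos}")
        blocks.append((best_start, best_start + best_len))
        g_pos = best_start + best_len
        m_pos += best_len
    return blocks


# Made-up sequences: two exons separated by an intron-like spacer.
genome = "TTTT" + "ATGGCC" + "GTAAGTCCCAG" + "CATTACA" + "TT"
mrna = "ATGGCC" + "CATTACA"
print(place_mrna(genome, mrna))
```

Because the greedy step extends each block as far as the sequences agree, block boundaries can drift when an intron happens to begin with the same base as the next exon. That ambiguity is one of the alignment problems that more careful modeling has to resolve.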
With Map Viewer, one may visualize genes and genomic markers within
the context of additional data.
The human genome is also being annotated with additional biological
features. Examples include markers for sequence variation such as
SNPs, or single nucleotide polymorphisms, and genomic position
landmarks such as sequence tagged sites (STSs). These features may
be viewed using the NCBI Map Viewer, an online tool that allows you
to view an organism's complete genome, as well as integrated maps
for each chromosome and sequence data for a region of particular
interest.
Predicting Novel Genes
The whole genomes of over 800 organisms can now be found on
NCBI's Entrez Genomes Web site, representing both completely
sequenced organisms and organisms for which sequencing
is in progress.
Various computational approaches are also being used by NCBI investigators
to accomplish the second task, predicting novel genes. Alignment
with small snippets of expressed genes called Expressed Sequence
Tags (ESTs) identifies new genes to be placed on the DNA sequence
and also provides information on alternative gene splicing. Use
of protein similarity analyses and gene prediction programs developed
at NCBI identifies additional predicted genes.
Comparative genomics, or the study of similar genes in different
species, is another powerful tool for predicting and identifying
new information. The genomic sequence of the mouse will be
particularly helpful in this regard, because mammals share many basic
biological functions. Gene sequences in the mouse and human often
code for similar proteins that carry out comparable biological functions.
Comparing the genomic sequences from other model organisms, such
as those from the rat, zebrafish, fruit fly, and yeast, will also
facilitate gene annotation.
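To make "similar proteins" concrete, one simple measure used when comparing sequences across species is percent identity over an alignment. The sketch below assumes the two sequences have already been aligned to equal length, with "-" marking gaps; the function name percent_identity and the example sequences are made up for illustration and do not come from any NCBI tool or database.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of aligned, non-gap positions where two sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    return 100.0 * matches / compared


# Hypothetical aligned protein fragments, not real human or mouse sequences.
print(percent_identity("MKTAYIAK", "MKSAYLAK"))
```

Percent identity is only a rough indicator: it does not distinguish conservative substitutions from disruptive ones, and the value depends on how the alignment was produced.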
Guides to Inherited Diseases
The Online Mendelian Inheritance in Man database, or OMIM, is a catalog
of inherited human disorders and their causal mutations, authored and
edited by Dr. Victor A. McKusick and developed for the Web by NCBI.
OMIM entries are often linked to a reference mRNA sequence from RefSeq,
facilitating the alignment of an mRNA to a gene sequence on the working
draft.
From OMIM, one can link to NCBI's Genes and Disease Web page, a site
designed to introduce users to the relationship between genetic factors
and human disease. Genes and Disease provides information for more than
70 genetic diseases, with links to related databases and allied resources.
Literature Databases
To validate the findings generated through computer-based comparative
analysis, it is essential to consider the results of wet-bench biology
reported in the scientific literature. Therefore, the integration
of scientific data with the literature is a necessary step for creating
a unified information resource in the life sciences. To this end,
individuals are provided with a direct link from numerous NCBI resources
to PubMed,
NCBI's literature retrieval system. PubMed provides Web-based access
to over 11 million citations, abstracts, and indexing terms for journal
articles in the biomedical sciences. It also includes links to full-text journals.
PubMed Central
(PMC), a digital archive of life sciences journal literature, was
launched in January 2001 and offers a new model for electronic scientific
communication and data retrieval. The value of PubMed Central, in
addition to its role as an archive, lies in what can be done when
data from diverse sources are stored in a common format in a single
repository. PMC currently provides free and unrestricted access
to the full text of life sciences journals.
Model Organisms for Biomedical Research
The public mouse sequencing effort is also proceeding rapidly.
The desire to accumulate mouse genome sequences builds on the completion
in June 2000 of the working draft version of the human sequence.
The ultimate goals of the project include the construction of a
physical map and a high quality, finished sequence of the mouse,
as these data will provide an essential tool to identify and study
the function of human genes. The mouse genome sequence will also
increase the ability of scientists to use the mouse as a model system
to study and understand human disease.
All sequence data generated from this project are rapidly deposited
in GenBank. Data are available from NCBI's
Trace Archive database. The mouse reads are being
compared to the human genome, and homologous reads have been laid
out along the human draft sequence. This has resulted in the creation of the
Human-Mouse Homology Maps,
a table comparing genes in homologous segments of DNA from human and mouse,
sorted by position in each genome. Mouse data are also being accumulated
in the RefSeq
database, and investigators have begun to assemble the dataset to
generate larger contigs. The mouse reads are of immediate
use for both human and mouse genetics, and there are already examples
of mouse genes that have been cloned using the available public
information.
The mapping and sequencing of the genomes of all model organisms
are critical to the effort to characterize, sequence, and interpret
the human genome. Therefore, NCBI is also working toward the development
and expansion of resources to facilitate biomedical research using
other model organisms, including the rat, S. cerevisiae (baker's yeast),
C. elegans (nematode), D. melanogaster (fruit fly), and
Arabidopsis thaliana (thale cress).
Building an Information Infrastructure
The genomic information resources developed and disseminated
by NCBI investigators have contributed significantly to the advancement
of the basic sciences and serve as a wellspring of new methods and
approaches for applied research activities. The value of these integrated
resources will continue to grow, because NCBI has made a long-term commitment
to meet the challenge of designing, developing, disseminating, and
managing the tools and technologies enabling the gene discoveries
that will significantly impact health in the 21st century.
Potential Applications and the Future of Medicine
SNPs are ideal elements for constructing a genomic map to aid
in analyzing the human genome, especially because they have
a significant influence on disease processes.
Analysis of the draft human genomic sequence has already led to
the identification of genes for cystic fibrosis, breast cancer,
hereditary deafness, hereditary skeletal disorders, and a form of
diabetes, just to name a few. The draft sequence has also been used
to identify an enormous number of SNPs, or single base variations in the
genetic code that play a significant role in the disease process.
These discoveries, as well as future discoveries, will have a profound
impact on the future conduct of biomedical research. The translation
of basic science advances into the clinical arena promises to revolutionize
the practice of medicine. In the coming years, clinicians will be
able to help their patients in ways they never thought possible.
Physicians will be able to rapidly diagnose existing genetic diseases;
pre-determine genetic risk for developing a disease; design novel
therapeutic agents for the treatment and prevention of disease,
rather than the treatment of the underlying symptoms; and prescribe
a medical intervention based on a person's genetic information,
reducing the chance of an allergic, or otherwise detrimental, drug
reaction.
Our Interactive Web Site
NCBI's Human Genome site pulls together a suite of its key resources
available for human genome research. Through this interactive Web
site, researchers may:
Revised April 9, 2003.