An Introduction to NCBI's Genome Resource

Building a Genomic Information Infrastructure at NCBI

A major challenge of the Human Genome Project is to organize, analyze, and interpret the flood of data emerging from sequencing projects worldwide. NCBI's Web site strives to offer an integrated, one-stop, genomic information resource for data that promise to provide new insights into human biology and new approaches for combating disease.

The Human Genome Project

The Human Genome Project is a publicly financed international research effort whose goal is to decipher the human genetic code and to provide these data freely and rapidly to the public. On June 26, 2000, members of the Human Genome Project announced that they had succeeded in sequencing a "working draft" of the human genome. An article published in the February 15, 2001 issue of the journal Nature outlines the strategies and methodologies used by this group to generate the draft sequence. Sequencing of the human genome represents a scientific milestone, and the data are of immediate use in many important ways. To help researchers understand and use the information encoded in this "human blueprint", the National Center for Biotechnology Information (NCBI) provides worldwide access to these data through its public Web site (http://www.ncbi.nlm.nih.gov).

The Big Picture:
Integrating Vast Quantities of Disparate Data

Sequencing of the human genome signifies the beginning of an exciting new era of science. As an international leader in the field of computational biology and bioinformatics, the NCBI is playing an active and collaborative role in further deciphering the human genome. NCBI investigators have designed and developed, as well as manage and operate, a number of unique and powerful public databases essential to the Project. For example, GenBank is the NIH sequence database maintained by the NCBI that stores the sequence data generated by the centers involved in the Human Genome Project. GenBank is one of three databases that make up the International Nucleotide Sequence Database Collaboration. NCBI's partners in this effort include the European Bioinformatics Institute in the United Kingdom and the National Institute of Genetics in Japan. All three institutions work together to make the sequence data generated by the Human Genome Project rapidly and freely accessible to scientific communities worldwide.

NCBI investigators are also developing and enhancing software tools that will enable gene discovery. These tools, also freely accessible to the public, are being used by NCBI to assemble, annotate, and analyze the human genomic sequence, as well as the genomic sequences of other model organisms. These sophisticated tools allow researchers to store, organize, analyze, and integrate vast quantities of diverse data, such as DNA and protein sequences, gene and chromosome maps, and protein structures. Information derived from these studies has allowed researchers to make new connections between seemingly disparate data and to shape more biologically meaningful views of these data.

Assembling the Human Genome

Anyone with a computer and an Internet connection can now explore the draft sequence of the human genome. A companion site has been designed to jump-start an individual who wants to make use of this information but is not sure where or how to start.

NCBI has released its first assembled view of the human genomic sequence. This assembly is based not only on the finished and draft sequences deposited in GenBank by the Human Genome Sequencing Centers but also on sequences contributed to GenBank by individual scientists from around the world. Hence, this resource is truly an "international public sequencing effort". Assembly is an ongoing process: the sequences pass through many different steps before they can be merged into segments of contiguous DNA (contigs). NCBI continues to improve the genome assembly by incorporating new data, filling in existing gaps, and increasing overall accuracy.
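The idea of merging overlapping sequences into contiguous segments can be illustrated with a deliberately simple sketch. This is not NCBI's assembly pipeline: it is a toy greedy merge on exact overlaps, and the function names (overlap_length, merge_fragments) and the min_overlap parameter are hypothetical choices made for this illustration. Real assemblers must also handle sequencing errors, repeats, and both DNA strands.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_fragments(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Greedily merge the pair with the largest exact overlap until none remain."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)  # (overlap size, index of left, index of right)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # no remaining pair overlaps enough to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


if __name__ == "__main__":
    # Three fragments chain into one contig; the fourth shares no overlap.
    print(merge_fragments(["ATGCGT", "CGTTAC", "TACGGA", "GGGCCC"]))
```

Fragments that overlap nothing are returned unchanged, which loosely mirrors the gaps that remain in a draft assembly until new data arrive.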

Annotating the Human Genome

A team of NCBI scientists is also engaged in the process of annotating, or labeling, the biologically important areas of the genome. Annotation permits researchers to analyze the data in a systematic, comprehensive, and consistent manner. There are two tasks involved in annotation. The first is the correct placement of known genes into the proper genomic context, and the second is the prediction of previously unknown genes based on the assembled genomic sequence.

Aligning Known Genes

In the first task, messenger RNAs (mRNAs) from the NCBI RefSeq collection (a non-redundant set of reference sequences, including genomic contigs, mRNAs of known genes, and proteins) are placed on the genome primarily by sequence alignment using tools developed at NCBI. Computer modeling is used to compensate for and overcome various problems associated with aligning the genomic and mRNA sequences.
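One reason that placing an mRNA on genomic sequence is not a simple string search is that the mRNA is contiguous, whereas the matching genomic DNA is interrupted by introns. The following minimal sketch shows the idea with exact matching only. It is not the alignment method used at NCBI; the function name place_mrna and the min_exon parameter are hypothetical, and real tools tolerate mismatches and model splice sites.

```python
from typing import Optional


def place_mrna(genome: str, mrna: str,
               min_exon: int = 4) -> Optional[list[tuple[int, int]]]:
    """Return half-open (start, end) genome blocks covering the mRNA in order,
    or None if the mRNA cannot be placed with exact matches."""
    blocks = []
    g_pos = 0  # search the genome only to the right of the previous block
    m_pos = 0  # index of the next unplaced mRNA base
    while m_pos < len(mrna):
        placed = False
        # Try the longest remaining stretch of mRNA first, then shorter ones.
        for size in range(len(mrna) - m_pos, min_exon - 1, -1):
            start = genome.find(mrna[m_pos:m_pos + size], g_pos)
            if start != -1:
                blocks.append((start, start + size))
                g_pos = start + size
                m_pos += size
                placed = True
                break
        if not placed:
            return None
    return blocks


if __name__ == "__main__":
    exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "CTTGACA"
    genome = "TTTT" + exon1 + intron + exon2 + "CCCC"
    # The two blocks correspond to the two exons, separated by the intron.
    print(place_mrna(genome, exon1 + exon2))
```

Because the search is greedy and exact, a base at the start of an intron that happens to match the next exon would shift a block boundary. This is one small example of the alignment problems that more careful modeling has to resolve.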

With Map Viewer, one may visualize genes and genomic markers within the context of additional data.

The human genome is also being annotated with additional biological features. Examples include markers for sequence variation such as SNPs, or single nucleotide polymorphisms, and genomic position landmarks such as sequence tagged sites (STSs). These features may be viewed using the NCBI Map Viewer, an online tool that allows you to view an organism's complete genome, as well as integrated maps for each chromosome and sequence data for a region of particular interest.

Predicting Novel Genes

The whole genomes of over 800 organisms can now be found on NCBI's Entrez Genomes Web site, representing both completely sequenced organisms and organisms for which sequencing is in progress.

Various computational approaches are also being used by NCBI investigators to accomplish the second task, predicting novel genes. Alignment with small snippets of expressed genes called Expressed Sequence Tags (ESTs) identifies new genes to be placed on the DNA sequence and also provides information on alternative gene splicing. Use of protein similarity analyses and gene prediction programs developed at NCBI identifies additional predicted genes.

Comparative genomics, or the study of similar genes in different species, is another powerful tool for predicting and identifying new information. The genomic sequence of the mouse will be particularly helpful in this regard, because mammals share many basic biological functions. Gene sequences in the mouse and human often code for similar proteins that carry out comparable biological functions. Comparing the genomic sequences from other model organisms, such as those from the rat, zebrafish, fruit fly, and yeast, will also facilitate gene annotation.

Guides to Inherited Diseases

The Online Mendelian Inheritance in Man database, or OMIM, is a catalog of inherited human disorders and their causal mutations, authored and edited by Dr. Victor A. McKusick and developed for the Web by NCBI. OMIM entries are often linked to a reference mRNA sequence from RefSeq, facilitating the alignment of an mRNA to a gene sequence on the working draft.

From OMIM, one can link to NCBI's Genes and Disease Web page, a site designed to introduce users to the relationship between genetic factors and human disease. Genes and Disease provides information on more than 70 genetic diseases, with links to related databases and allied resources.

Literature Databases

To validate the findings generated through computer-based comparative analysis, it is essential to consider the results of wet-bench biology reported in the scientific literature. Therefore, the integration of scientific data with the literature is a necessary step for creating a unified information resource in the life sciences. To this end, individuals are provided with a direct link from numerous NCBI resources to PubMed, NCBI's literature retrieval system. PubMed provides Web-based access to over 11 million citations, abstracts, and indexing terms for journal articles in the biomedical sciences. It also includes links to full-text journals.

PubMed Central (PMC), a digital archive of life sciences journal literature, was launched in January 2001 and offers a new model for electronic scientific communication and data retrieval. The value of PubMed Central, in addition to its role as an archive, lies in what can be done when data from diverse sources are stored in a common format in a single repository. PMC currently provides free and unrestricted access to the full text of life sciences journals.

Model Organisms for Biomedical Research

The public mouse sequencing effort is also proceeding rapidly. The desire to accumulate mouse genome sequences builds on the completion in June 2000 of the working draft version of the human sequence. The ultimate goals of the project include the construction of a physical map and a high quality, finished sequence of the mouse, as these data will provide an essential tool to identify and study the function of human genes. The mouse genome sequence will also increase the ability of scientists to use the mouse as a model system to study and understand human disease.

All sequence data generated from this project are rapidly deposited in GenBank. Data are available from NCBI's Trace Archive database. The mouse reads are being compared to the human genome, and homologous reads have been laid out along the human draft sequence. This has resulted in the creation of the Human-Mouse Homology Maps, a table comparing genes in homologous segments of DNA from human and mouse, sorted by position in each genome. Mouse data are also being accumulated in the RefSeq database, and investigators have begun to assemble the dataset to generate larger contigs. The mouse reads are of immediate use for both human and mouse genetics, and there are already examples of mouse genes that have been cloned using the available public information.

The mapping and sequencing of the genomes of all model organisms are critical to the effort to characterize, sequence, and interpret the human genome. Therefore, NCBI is also working toward the development and expansion of resources to facilitate biomedical research using other model organisms, including the rat, S. cerevisiae (baker's yeast), C. elegans (nematode), D. melanogaster (fruit fly), and Arabidopsis thaliana (thale cress).

Building an Information Infrastructure

The genomic information resources developed and disseminated by NCBI investigators have contributed significantly to the advancement of the basic sciences and serve as a wellspring of new methods and approaches for applied research activities. The value of these integrated resources will continue to grow, because NCBI has made a long-term commitment to meet the challenge of designing, developing, disseminating, and managing the tools and technologies enabling the gene discoveries that will significantly impact health in the 21st century.

Potential Applications and the Future of Medicine

SNPs are ideal elements for constructing a genomic map to aid in analyzing the human genome, especially because they have a significant influence on disease processes.

Analysis of the draft human genomic sequence has already led to the identification of genes for cystic fibrosis, breast cancer, hereditary deafness, hereditary skeletal disorders, and a form of diabetes, just to name a few. The draft sequence has also been used to identify an enormous number of SNPs, or single base variations in the genetic code that play a significant role in the disease process.

These discoveries, as well as future discoveries, will have a profound impact on the future conduct of biomedical research. The translation of basic science advances into the clinical arena promises to revolutionize the practice of medicine. In the coming years, clinicians will be able to help their patients in ways they never thought possible. Physicians will be able to rapidly diagnose existing genetic diseases; predetermine genetic risk for developing a disease; design novel therapeutic agents that treat and prevent the disease itself, rather than only its symptoms; and prescribe a medical intervention based on a person’s genetic information, reducing the chance of an allergic, or otherwise detrimental, drug reaction.

Our Interactive Web Site

NCBI's Human Genome site pulls together a suite of its key resources available for human genome research. Through this interactive Web site, researchers may:

Access the draft human genomic DNA sequences generated by the Sequencing Centers involved in the Project
View and explore NCBI's assembled and annotated version of the human genome, either chromosome by chromosome or by searching for biologically important regions of the genomic sequence
Apply one of NCBI's sophisticated software tools to further analyze a portion of the genomic sequence that may be of particular interest

Revised April 9, 2003.