Gene Guide


The U.S. Department of Energy Biological and Environmental Research program funds this site.

Gene and Protein Database Guide
Resources for learning about genes and the proteins they encode

Descriptions of Resources

Molecular Biology Basics

Learning about Genes and Their Products

Nucleotide Sequence Databases

Protein Sequence Databases

Sequence Similarity Searching

Gene-Mutation Resources

Protein Structure Resources

Helpful Terms, Tutorials, and Examples

Glossary of Bioinformatics Terms - A quick guide to some common terms used in bioinformatics resources

Introductory Bioinformatics Tutorials - Step-by-step instructions for first-time users

Gene Profiles - Case studies as examples of the kinds of information you can find using these resources

Molecular Biology Basics

In this guide we have provided descriptions of and links to bioinformatics resources that can help you learn more about genes and proteins associated with genetic disorders or traits. Most were designed for students and professionals in the life sciences, so a certain level of familiarity with genetics and molecular biology is assumed. For those of you looking for an introduction to the science behind the Human Genome Project, we have included links to basic genetics and molecular biology resources that can be reviewed before you attempt to use the bioinformatics resources.

Genomics and Its Impact on Science and Society: The Human Genome Project and Beyond - Department of Energy (DOE) publication. Defines basic genetic terms and overviews human genome mapping and sequencing, model organism research, informatics, and the impact of the Human Genome Project.

The Science Behind the Human Genome Project - Web pages that define some basic genetics concepts and explain how the Human Genome Project was implemented.

Genome Glossary - DOE Human Genome Program glossary of genetics terms. Can be searched or browsed alphabetically; links to other life science glossaries.

To Know Ourselves - DOE publication that explains the agency's involvement with the Human Genome Project. Includes sections on human genome physical mapping and sequencing; technological developments in laboratory instrumentation; management of genomic data; and the project's ethical, legal and social implications.

Return to Top

Molecular Biology Basics Gene Resources Nucleotide Sequences Protein Sequences
Sequence Similarity Searching Gene-Mutations Protein Structures

Learning about Genes and Their Products

These resources give a general overview of a gene, along with some of the following information: official symbol, locus, associated disorders or traits, mode of inheritance, name and function of gene product, and links to additional gene-specific resources.

Online Mendelian Inheritance in Man (OMIM)

Overview: Created and edited by Victor A. McKusick, MD and his colleagues at Johns Hopkins School of Medicine, OMIM is a large, searchable, up-to-date database of human genes, genetic traits, and disorders. In addition to summarizing what is known about a particular gene, trait, or disorder, each record also contains reference material and links to other NCBI resources such as literature citations in MEDLINE and related sequence records. OMIM is intended for use by genetics researchers, advanced life-science students, and healthcare professionals concerned with genetic disorders. The database is updated daily. Three different interfaces are provided for exploring particular genes and genetic conditions: Search, Gene Map, and Morbid Map. An article on OMIM is available from the NCBI Bookshelf.

Search Tips: Browse an alphabetical listing of genetic disorders featured in OMIM with Morbid Map. View a listing of genes organized by cytogenetic location using OMIM's Gene Map. Search for information about specific genes, traits, or disorders using OMIM's Search options. For step-by-step instructions describing how to search OMIM, see our OMIM Search Tutorial. Additional information about searching OMIM is available from the Help and FAQs pages.

Information Provided: Some of the types of information provided in OMIM records include: genes that have been linked to disorders, the official symbol for a gene, key mutations in genes that result in disease, functions of genes and the proteins they encode, as well as descriptions of genetic conditions and how they are inherited. Links to citations in Medline, related OMIM entries and entries in other NCBI databases also are included. The amount of information included in each entry depends upon how much researchers know about a particular gene or condition.

NCBI LocusLink

Overview: LocusLink is an NCBI database that serves as a single query interface to gene-specific information from a wide variety of bioinformatics sources. LocusLink includes descriptive information about genetic loci in human, cow, fruit fly, mouse, rat, nematode, zebrafish, and human immunodeficiency virus type 1 genomes.

Search Tips: Users can query LocusLink by typing keywords (such as disease or protein name, gene symbol, accession numbers, or other database ID numbers) into the search box at the top of the main page. Query options include truncation of terms using the asterisk (*) as a wild card, field restriction, and use of Boolean operators (and, or, not) that are not case sensitive. Grouping phrases by parentheses or quotation marks is not supported.

LocusLink has a system of controlled terms that can be used to retrieve only those records with a particular feature. One of the controlled terms is disease_known, which will return only loci associated with a known disorder. See the Query Tips in the Help file for a complete listing and more detailed descriptions of controlled terms. For more on searching LocusLink, see the Help file. LocusLink also provides a FAQ page.

Information Provided: Each LocusLink report may include the following types of information: links to gene-specific entries in other databases, official gene nomenclature, LocusID (identification number assigned to the gene by LocusLink), overview of protein function, alternate symbols and aliases, phenotypes or expressed characteristics associated with the gene, other database ID numbers, similar genes from other genomes, links to cytogenetic maps, links to sequence records, and links to other related information sources.

GeneCards

Overview: Developed at the Weizmann Institute of Science in Israel, GeneCards is a database of human genes, their products, and their involvement in hereditary disorders. GeneCards automatically extracts gene-specific information from a variety of Web-based bioinformatics resources and integrates the data into each entry. The database was designed for scientists who want to use one interface to access multiple databases for information about human genes that have been assigned approved symbols.

Search Tips: Users can search GeneCards by keyword or gene symbol/alias using the search box on the home page. Keywords can be single or multiple terms, GenBank accession number, chromosome number, or gene locus. Truncation using an asterisk (*) as a wild card at the beginning or end of a search term is supported. The Boolean operators AND and OR can be used to connect terms. The Boolean operator NOT is not supported. Examples are provided for keyword searching. Users may also browse a complete listing of genes or a subset of disease genes featured in GeneCards. To learn more about searching GeneCards, check out Quick Start, Guided Tour and The GeneCards Guidance System.

Information Provided: Each GeneCard may include the following information: official gene name and symbol, synonyms or alternative names, ID numbers assigned to the gene in other databases, chromosomal location, chromosome map showing where the gene is found, domains and protein families associated with the gene's protein product, links to sequence records, expression patterns in human tissues, links to similar genes in other organisms, SNPs and variants, disorders and mutations, links to citations in Medline, and links to other related resources. Each GeneCard also links to sources used to create the entry. GeneCards encourages feedback from its users and provides a form for submitting comments and suggestions.

Nucleotide Sequence Databases

International Nucleotide Sequence Database Collaboration

This collaboration is a coordinated effort among three key sequence repository centers: GenBank at the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL), and the DNA DataBank of Japan (DDBJ). Sequence data is exchanged daily among these three organizations. Although record formats and search systems may differ, information contained in each record (accession number, sequence data, annotations) will be the same for all three databases.

NCBI GenBank and Entrez Nucleotide

Overview: GenBank is an NCBI database that serves as an archive for all publicly available DNA sequences from more than 100,000 different organisms. Submitting scientists retain complete editorial control over their sequences, so they decide on gene symbols (which may not be the official ones) and what additional information to include. Scientists contact NCBI if they wish to make any modifications to their sequence records. As an archival database, GenBank can include redundant entries, even hundreds of records for the same gene, and some entries may contain errors in their sequence data. To address some problems associated with this archival database, NCBI developed RefSeq, a curated, nonredundant source of sequence data for genomic DNA, mRNA transcripts, and proteins of major research organisms. Unlike GenBank records, RefSeq records are created, reviewed, and updated by NCBI staff. Each RefSeq entry features a distinct accession number format: a two-character prefix that describes the sequence type, followed by an underscore and a number. For more information about RefSeq, see RefSeq FAQs.
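As a rough illustration of the RefSeq accession format described above, the following Python sketch checks whether a string has the two-character prefix and underscore, then looks up the sequence type. The function name describe_accession is hypothetical, and the prefix table is a small, commonly cited subset rather than NCBI's complete list.

```python
import re

# A small subset of RefSeq prefixes; not a complete list.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "predicted mRNA (model)",
    "XP": "predicted protein (model)",
}

# Two uppercase letters, an underscore, digits, and an optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_accession(accession):
    """Return the sequence type for a RefSeq-style accession, or None."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None  # Not in the RefSeq style (e.g., a GenBank accession).
    prefix = match.group(1)
    return REFSEQ_PREFIXES.get(prefix, "unrecognized RefSeq-style prefix")


if __name__ == "__main__":
    for acc in ["NM_000518", "NP_000509.1", "U01317"]:
        print(acc, "->", describe_accession(acc))
```

A check like this only inspects the format of the string; it does not confirm that the record exists in the database.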

Search Tips: There are a few different ways for accessing sequence records at NCBI: text-searching with Entrez Nucleotide, BLAST searching, or linking to sequence records from databases and tools such as LocusLink, OMIM, or Map Viewer. Entrez Nucleotide is a part of NCBI's Entrez search and retrieval system that can be used to search several linked databases, such as sequence databases, structure databases, OMIM, genome assemblies, and biomedical literature. With all Entrez databases, users can refine search strategies using fields available in Limits and Preview/Index, browse Index terms of a particular field, combine searches using History, and store selected records from different searches on a Clipboard. Some search-refining techniques available from the Limits page are to exclude certain types of sequences (e.g., ESTs) and limit the search by date or particular database (e.g., search only RefSeq). Boolean Operators AND, OR, and NOT must be in upper case. Phrase searching using double quotes and truncation using the asterisk (*) as a wild card also are supported. For more information about searching this and other NCBI Entrez databases, see Entrez Help Document. For step-by-step instructions on finding and interpreting sequence records, see our tutorial Accessing records in NCBI sequence databases.
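The query conventions described above (upper-case Boolean operators, double-quoted phrases, and the asterisk wild card) can be sketched in a few lines of Python. The helpers entrez_term and combine are hypothetical names used only for illustration, and the [Organism] field tag is an assumed example of a field restriction. The code only assembles a query string; it does not contact NCBI.

```python
def entrez_term(text, field=None):
    """Format one search term; multi-word text becomes a quoted phrase."""
    term = f'"{text}"' if " " in text else text
    # An optional field tag in square brackets restricts the term to a field.
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join terms with a Boolean operator, forcing it to upper case."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {op} ".join(terms)


if __name__ == "__main__":
    query = combine(
        "and",
        entrez_term("cystic fibrosis"),           # phrase search
        entrez_term("Homo sapiens", "Organism"),  # field restriction
    )
    print(query)
    print(entrez_term("kinas*"))  # truncation with the asterisk wild card
```

The resulting string can be pasted into the Entrez search box.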

Information Provided: Each record returned in a search will include the nucleotide sequence and annotations such as accession numbers, keywords, source organism, and citations for references. Sequence records also may contain the translated amino acid sequence. For more detailed descriptions of types of information in each sequence record, check the Sample GenBank Record provided by NCBI.

Protein Sequence Databases

Entrez Proteins

Overview: Part of the National Center for Biotechnology Information (NCBI) Entrez system, this database includes sequence data compiled from a variety of sources, including Swiss-Prot, Protein Information Resource (PIR), Protein Data Bank (PDB), and Protein Research Foundation (PRF) in Japan. Some protein sequences were created from translations of coding regions in DNA sequences stored in GenBank and RefSeq.

Search Tips: As with other Entrez databases, users can refine search strategies using fields available in Limits, preview the number of search results for a query, browse Index terms of a particular field, combine searches using History, and store selected records from different searches on Clipboard. Some of the indexed fields that can be used to narrow a search include accession number, gene name, molecular weight, organism, properties, protein name, and sequence length. Users also can specify that only one particular database be searched (e.g., retrieve protein sequences from Swiss-Prot only). Boolean Operators AND, OR, and NOT must be in upper case. Phrase searching using double quotes and truncation using the asterisk (*) as a wild card also are supported. For more information about searching this and other NCBI Entrez databases, see the Entrez Help Document. For step-by-step instructions on finding and interpreting sequence records, see our tutorial on accessing sequence records.

Information Provided: Search results displayed using the default view will include locus name (a unique name assigned to each record), sequence length, protein description (definition), accession number, database source, keywords, organism, citations to references, comments concerning protein function or associated traits or disorders, information about sequence regions of biological significance, and the amino acid sequence. For detailed descriptions about fields presented in each NCBI sequence record, see the GenBank sample record.

Swiss-Prot/TrEMBL

Overview: The protein sequence databases Swiss-Prot and TrEMBL were developed by groups at the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI). Swiss-Prot uses three key criteria: high level of annotation, minimal redundancy, and high level of integration with other databases. Swiss-Prot includes as much information as possible in its annotations, and external experts review current literature and provide comments and updates on different protein groups. Swiss-Prot's depth of annotation, however, requires considerable time and effort. To keep a current database of protein sequences, a supplement called TrEMBL (Translation of EMBL) was developed. Translations of nucleotide sequences from EMBL (European Molecular Biology Laboratory) databases are computer-annotated and stored in TrEMBL until sequences can be fully annotated and integrated into Swiss-Prot.

Search Tips: Swiss-Prot sequence records can be accessed through the NCBI Entrez Proteins database. If users choose to access the Swiss-Prot/TrEMBL Web site for sequence searching, they can query the database using a variety of methods: quick search on the main page (Boolean operators not supported), Sequence Retrieval System (SRS), full-text search (Boolean operators, phrase searching, and wild cards supported), and advanced search. Forms for searching by accession number or ID, description (entry name, gene name, species, organelle), author, or citation also are provided. To learn more about searching Swiss-Prot see the Swiss-Prot Documentation section which includes a downloadable PDF version of the user manual.

Information Provided: Swiss-Prot entries are described as containing two types of data: core data (consisting of sequence, bibliographic references, and description of the protein's biological origin) and the annotation. Detailed annotations in each entry describe protein function, post-translational modification (e.g., addition of sugars or phosphate groups after mRNA translation), domain and binding sites, secondary structure, quaternary structure (e.g., homodimer, heterodimer), disorders associated with altered protein forms or amounts, variants, and similarities to other proteins.

Protein Information Resource - Protein Sequence Database (PIR-PSD)

Overview: Established in 1984, Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation associated with Georgetown University Medical Center. In collaboration with Munich Information Center for Protein Sequences (MIPS) in Germany and the Japan International Protein Information Database (JIPID), PIR has developed the PIR-International Protein Sequence Database (PSD). Its mission is to be "the most comprehensive and expertly annotated protein sequence database in the public domain" with the primary objective of achieving "properties of Comprehensiveness, Timeliness, Non-Redundancy, Quality Annotation, and Full Classification."

Search Tips: PIR sequence records can be accessed through the NCBI Entrez Proteins database. If users choose to go to the PIR-PSD Web site, the following search options are provided: search by unique identifier or accession number, basic text search, and advanced text search. For basic text searches, the Boolean operators AND, OR, and NOT are not supported, and a space between terms is interpreted as "and." Advanced searches allow users to refine a strategy with fields such as Title, Species, Author, Keyword, and Gene Name. In advanced search, search terms are case sensitive and must be at least three characters long. Boolean operators OR and NOT are supported. A space between words is interpreted as "and," so users searching for a phrase must put a character between multiple terms (e.g., enter homo-sapiens to search for "homo sapiens"). For more on searching PIR-PSD, see Help Searching PIR Databases, Sample Entry, Demo Search, and FAQs.

Information Provided: Each record includes protein name; classification and origin; literature references; protein features such as domains and motifs; primary sequence data; and links to related entries in other databases. Users have the option to create submission forms for similarity searching in PIR and NCBI databases. At the top of each record are links to annotation and sequence data within the record and a link to a composition table that summarizes total amino acid composition expressed as percentages. At the bottom of the record are direct links to Protein Data Bank (PDB) structures and sequence similarity alignments associated with the protein.

Resources for Sequence Similarity Searching

Scientists frequently perform sequence-similarity searching to see if a gene or protein from one organism has a similar counterpart in another organism. For example, to determine the function and biological importance of a new human protein, scientists often identify a similar mouse protein and then use that protein as a model for studying the human protein.

As we know from molecular biology's "central dogma," the order of nucleotides in a gene's DNA sequence determines the order of amino acids in a protein sequence. Each set of three nucleotides (called a codon) in the DNA sequence encodes a particular amino acid. See the Table of Standard Genetic Code to see which codons are associated with which amino acids.

Since more than one codon can encode the same amino acid, there is a considerable amount of variability in the nucleotide sequence that could translate into the same amino acid sequence. The genetic code's degenerate nature is the reason that similarity searching using amino acid sequences generally is more informative than using nucleotide sequences.
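The effect of the code's degeneracy can be seen in a short Python sketch. It builds the standard genetic code table, translates two made-up DNA sequences, and compares their similarity at the nucleotide and amino acid levels. The sequences and the helper names translate and identity are illustrative assumptions, not data from any database.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, reading codons from the first base."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of positions that match between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    seq1 = "ATGCTTAGCCGT"  # made-up example
    seq2 = "ATGTTGTCAAGA"  # different codons chosen for the same amino acids
    print(translate(seq1), translate(seq2))
    print("nucleotide identity:", identity(seq1, seq2))
    print("protein identity:", identity(translate(seq1), translate(seq2)))
```

In this example the two DNA sequences match at fewer than half of their positions, yet both translate to the same four amino acids. A nucleotide-level comparison would understate how closely related the encoded proteins are.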

Users who are new to sequence-similarity searching should check out NCBI's Introduction to Similarity Page, Homology - General Rules, and BLAST Guide's Glossary.

NCBI BLAST

Overview: BLAST (Basic Local Alignment Search Tool) is a set of programs designed to perform similarity searches on all available sequence data. BLAST uses an algorithm developed by the National Center for Biotechnology Information (NCBI) that seeks out local alignment (alignment of some portion of two sequences) as opposed to global alignment (alignment of two sequences over their entire length). By searching for local alignments, BLAST can identify regions of similarity in two sequences. Some similarity searches offered by NCBI include comparing an amino acid sequence to a protein sequence database (blastp), comparing a nucleotide query sequence to a nucleotide sequence database (blastn), and comparing a nucleotide sequence translated in all reading frames to a protein sequence database (blastx).
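To make the idea of local alignment concrete, here is a minimal Python sketch of the Smith-Waterman dynamic programming method, a classic exact algorithm for local alignment. It is not the BLAST algorithm, which uses faster heuristics, and the scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score for two sequences."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can start anywhere.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Two made-up sequences that share only the internal region GATTACA.
    print(local_alignment_score("AAAAGATTACATTTT", "CCCGATTACAGGG"))
```

Because the score is reset to zero whenever it would become negative, the dissimilar ends of the two sequences do not penalize the shared region. A global alignment would have to account for those ends as well.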

Search Tips: From the main BLAST page, users can choose among several NCBI services. For service descriptions, click on the question mark to the right of each section title or see the Description of BLAST Services. Clicking on the desired BLAST search option will lead to a search page with a box for entering the query sequence. Accepted input includes a sequence in FASTA format (a single-line description followed by sequence data), bare sequence (sequence data without the single-line description), and identifier. The identifier may be an accession number or GenBank ID (GI number), but must be entered as a single word without any spaces between characters. For more information about input, see NCBI's Search Format page. Each search or format option on the search page links to Help documentation with more detailed descriptions of each option. For more on how to use BLAST, see our Sequence Similarity Searching tutorial and NCBI's step-by-step BLAST GUIDE, Query Tutorial for new users, BLAST Tutorial, and BLAST Help.
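The difference between FASTA-format input and a bare sequence can be illustrated with a short Python sketch. In FASTA format the single-line description begins with a greater-than sign (>). The function name parse_query is hypothetical, and the sketch handles only one sequence at a time.

```python
def parse_query(text):
    """Split query text into (description, sequence).

    The description is None when the input is a bare sequence.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty query")
    description = None
    if lines[0].startswith(">"):
        # FASTA format: the first line is the single-line description.
        description = lines[0][1:].strip()
        lines = lines[1:]
    # Join the remaining lines and drop any internal spaces.
    sequence = "".join("".join(line.split()) for line in lines).upper()
    return description, sequence


if __name__ == "__main__":
    print(parse_query(">demo sequence\nATGC\nTTAA"))
    print(parse_query("atgc ttaa"))
```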

Information Provided: After submitting a BLAST request, users are presented with a Formatting BLAST page that displays the query statement, domain information, request for ID number, and format options. After desired format options are selected, pressing the Format button will pull up the Results of BLAST page. Using pair-wise alignment (the default alignment view) in format options, the Results page will display an image map graphically depicting retrieved database sequences (subject sequences) aligned with query sequence (depicted as the numbered line at the top). Passing the mouse over each line below the query sequence will display a description of that sequence in the text box. Clicking on each line will jump down to the corresponding pairwise alignment between the query sequence and a particular subject sequence. Below the image map is a list of sequences producing significant alignments. Accession number or identifier for each alignment links to a sequence record. The score links to the corresponding pairwise alignment at the bottom of the Results page. The blue L seen in some results links to a related entry in LocusLink. See the Sequence Similarity Searching tutorial for more on interpreting BLAST results.

PIR FASTA Similarity Search

Overview: The FASTA Similarity Search tool is part of the Protein Information Resource (PIR) collection of protein databases and bioinformatics tools. This similarity-search tool uses the FASTA algorithm, which compares a query sequence to those in the Protein Sequence Database and other PIR databases.

Search Tips: Users can query the database by inserting the single-letter amino acid code into the query box or by entering the valid PIR-PSD entry code for a particular protein of interest. See the Demo Search for an example.

Information Provided: Query results are presented in a table that lists more-similar sequences at the top and less-similar sequences toward the bottom. Clicking on ID number for a result will pull up the database entry for that protein, and clicking on the colored bar on the right will link to pairwise alignment between the submitted sequence and the subject sequence retrieved from the database.

Gene Mutation Resources

Genes carry instructions for building proteins, molecules that do most of the body's work. Certain variations in a gene's nucleotide sequence can affect the resulting protein's function by altering amino acid sequence and protein structure. The inability of some variant proteins to function properly can cause genetic disorders or other distinctive phenotypes.

Online Mendelian Inheritance in Man (OMIM): Allelic Variants

Overview: OMIM records for many genes include an Allelic Variants section that summarizes published research concerning selected allelic variants or mutations, many of which cause disorders. Some criteria for selecting allelic variants for inclusion in OMIM are first mutation discovered, high population frequency, distinctive phenotype, and unusual disease-causing mechanism. Each variant is assigned a ten-digit number made up of the gene's six-digit OMIM number, followed by a period and four digits unique to the variant. For more information about this database, see the OMIM entry above in the Learning about Genes and Their Products section.
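The allelic variant numbering scheme described above can be sketched in Python as a simple pattern check that separates the six-digit OMIM number from the four-digit variant suffix. The function name split_variant_number is hypothetical, and the identifier in the example is used only to show the format.

```python
import re

# Six digits (the gene's OMIM number), a period, then four digits (the variant).
VARIANT_PATTERN = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_number(number):
    """Return (omim_number, variant_suffix), or None if the format differs."""
    match = VARIANT_PATTERN.match(number.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(split_variant_number("141900.0243"))
    print(split_variant_number("141900"))  # gene number only, no variant part
```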

HGVbase

Overview: The Human Genome Variation database (HGVbase) is a database of annotated records for known sequence variations in the human genome. This database was designed as a tool to help scientists understand how common genome sequence variations, such as single nucleotide polymorphisms, result in complex phenotypes such as disease susceptibility and reactions to drugs. Each HGVbase record features data extracted from publicly available genome databases or published literature that has been subjected to manual review and enhanced with annotations. HGVbase shares data with NCBI's dbSNP, and currently incorporates about 40% of dbSNP's records into its database. HGVbase is funded by the Karolinska Institute Center for Genomics and Bioinformatics in Sweden, the European Bioinformatics Institute, and the European Molecular Biology Laboratory (EMBL).

Search Tips: HGVbase provides text search and sequence search options for its users. In addition to the quick search box available on the HGVbase home page, there are links to four different search tools: Text Search, Text+ Search, Sequence Search, and Regional Search. The Text and Text+ search forms allow users to search for records by text strings that can be targeted to particular fields of a record. The Regional Search lets users search for SNPs by chromosomal location.

Since some characterized genes may lack standardized names, HGVbase recommends sequence searching over text-based searching. To search by sequence, simply paste DNA or RNA sequence data (in any format) into the Sequence Search form and click "Run." For more information about searching HGVbase see the "How to search" page available from the navigation menu on the left or click the "Help" link in the upper right corner of each search form.

Information Provided: Some features included in each record are the variant type, accession numbers that link to sequences that contain the variant, portions of the sequence that flank the variant, alleles or possible nucleotides at the site of the polymorphism, associated gene names and symbols, the region of the gene where the variant is found (e.g., exon or intron), and citations to source literature. For more information about the various fields of each HGVbase record see the Data Structure Record.

Human Gene Mutation Database

Overview: Human Gene Mutation Database (HGMD) is a collection of published gene lesions associated with human hereditary disorders. This database is maintained by the Institute for Medical Genetics at University of Wales College of Medicine. HGMD collaborates with Celera Genomics and is supported by Genome Database (GDB) and several biotechnology companies. The home page links to a useful overview of mutation nomenclature.

Search Tips: HGMD provides a simple search interface for querying its database by disease, gene name, and gene symbol. All punctuation marks (e.g., slashes, plus signs, double quotes, commas, and dashes) are ignored. Truncation using an asterisk (*) is supported. For more information on using HGMD, see the Help file.

Information Provided: Each search will pull up a list of gene symbols corresponding to search terms. Clicking on a gene symbol will access a record summarizing mutations and phenotypes and the number of entries associated with each mutation type and phenotype. Clicking on a mutation type will show the accession number, location, and associated phenotype and link to a reference citation for each mutation. The record for each gene also links to a mutation map, the gene's cDNA sequence, and gene-specific records in other databases.

NCBI dbSNP

Overview: One of the most common types of DNA sequence variation is the single nucleotide polymorphism (SNP), in which a single nucleotide base (A, C, T, or G) is substituted for another. NCBI's Database of Single Nucleotide Polymorphisms (dbSNP) serves as a public repository for sequence variations such as small-scale insertions or deletions, polymorphic repetitive elements, and microsatellite variation, in addition to SNPs. Data can come from any part of a genome in any species. Sequence variations are submitted to the database by members of the scientific community. This database is separate from GenBank but is cross-linked to records in other NCBI resources such as GenBank, LocusLink, and PubMed.

For more about SNPs and why they are important to biomedical research, see the SNP Fact Sheet and NCBI's SNPs: Variation on a Theme.

Search Tips: Users can search dbSNP directly or access the database through other NCBI resources. One way to access SNP data mapped to a particular gene is to use NCBI LocusLink. Once you have found a gene's LocusLink record, clicking on the purple V or VAR link (if available) will open a list of SNPs mapped to that locus. Records in NCBI's sequence databases also may link to SNP data.

To search dbSNP directly, use Entrez SNP or dbSNP's Easy Search Form. dbSNP also provides a BLAST search option that compares the query sequence with sequence data contained in each SNP record. The BLAST option will generate a list of SNPs that can be found within the query sequence. See the Entrez SNP main page for descriptions of the different fields that can be used for searching the database.

NCBI will soon feature a quick how-to guide called GETTING STARTED. This guide should help novice users learn how to use and design search strategies for dbSNP. To learn more about dbSNP, see the FAQs page.

Information Provided: From LocusLink, after clicking on the purple V or VAR link, the SNPs linked from LocusLink page will open. This page provides Gene Model information with links to associated contig, mRNA, and protein sequence records. Each SNP is included in the graphic gene model and color-coded based on where the SNP is located (intron, exon, or untranslated region) and whether the change is synonymous or nonsynonymous. For each SNP that occurs in an exon, the associated nucleotide, codon position, and amino acid residue are given.
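The distinction between synonymous and nonsynonymous changes can be illustrated with a short Python sketch that substitutes one base in a codon and compares the amino acids before and after. This is only a sketch under the standard genetic code, not the method dbSNP uses; the function name classify_snp and the example codons are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-base substitution within one codon.

    position is 0, 1, or 2. Returns (kind, reference_aa, variant_aa).
    """
    codon = codon.upper()
    if len(codon) != 3 or position not in (0, 1, 2):
        raise ValueError("expected a three-base codon and a position of 0, 1, or 2")
    variant = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa, var_aa = CODON_TABLE[codon], CODON_TABLE[variant]
    kind = "synonymous" if ref_aa == var_aa else "nonsynonymous"
    return kind, ref_aa, var_aa


if __name__ == "__main__":
    print(classify_snp("CTT", 2, "C"))  # CTT -> CTC, both leucine
    print(classify_snp("GAG", 1, "T"))  # GAG -> GTG, glutamate to valine
```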

Each SNP is assigned an identification number called a cluster id or rs number. The record for each cluster id is referred to as a cluster report and includes source organism, variation type (e.g., SNP (single nucleotide polymorphism) or DIP (deletion/insertion polymorphism)), the nucleotide sequence flanking the SNP in FASTA format, a LocusLink Analysis map depicting where the SNP is found within the gene, and links to other NCBI resources related to the particular SNP. Submitter records for each cluster provide one or more links to more detailed descriptions for each SNP submission.

Human Genome Variation Society: Variation Databases and Related Sites

Overview: This Web site is a collection of different types of mutation databases such as locus-specific, disease-centered, national and ethnic, and nonhuman. Locus-specific databases are arranged alphabetically by gene symbol. Links to other related databases and educational resources also are provided.

Genome Web: Human Mutation Databases

Overview: This resource, from the UK Human Genome Mapping Project Resource Centre, is a collection of links to general mutation and locus-specific databases. A brief description of each database is found below the list of links.

Protein Structure Resources

Databases described in this section can provide a better understanding of what a gene's protein product looks like. For some well-studied proteins, users also may find structures of mutant forms that can be compared with structures of nonmutated or wild-type proteins.

A good, basic introduction to protein structures, X-ray crystallography, and nuclear magnetic resonance spectroscopy (NMR) can be found in the National Institute of General Medical Sciences (NIGMS) 2001 publication The Structures of Life (67 pp.). A free copy can be ordered from the NIGMS Publication List or downloaded as a PDF file (requires Adobe Acrobat Reader).

For more information:

Nature of 3-D Structural Data: The Protein Data Bank's brief introduction to X-ray crystallography and NMR

Crystallography 101: Tutorial by Dr. Bernhard Rupp at Lawrence Livermore National Laboratory

The Basics of NMR: Online text book by Dr. Joseph P. Hornak, professor of Chemistry and Imaging Science at the Rochester Institute of Technology

Protein Data Bank

Overview: Protein Data Bank (PDB) is an international archive of 3-D structural information for biological macromolecules. PDB is managed by the Research Collaboratory for Structural Bioinformatics (RCSB), a nonprofit consortium involving Rutgers, the State University of New Jersey; National Institute of Standards and Technology (NIST); and San Diego Supercomputer Center at the University of California, San Diego.

Search Tips: Users can query the archive by PDB ID or keyword using the search box on the main page. Other query options include SearchLite (keyword search form with examples), SearchFields (an advanced search option with customizable fields), and Status Search (used to find structures being processed by PDB). To learn more about searching PDB, take the Query Tutorial or examine the User Guides.

Information Provided: Each structure record includes a summary, structure viewing options, download and display options, links to records of structural neighbors, geometry, links to other protein information sources, and details about the structure's sequence. For step-by-step instructions on interacting with 3-D structures, see Examining a Protein's Structure.

Entrez Structure

Overview: The National Center for Biotechnology Information (NCBI) database of three-dimensional molecular structure is called the Molecular Modeling Database (MMDB). The database is searchable via NCBI's Entrez retrieval system. Structure data is derived from X-ray crystallography and Nuclear Magnetic Resonance (NMR) structure determinations from Protein Data Bank (PDB). This database is considerably smaller than Entrez's nucleotide and protein sequence databases. If a structure for a known sequence is not included, the structure of a protein homolog may be available for examination.

Search Tips: Users can use the query interface to search by keyword, or access structure records directly through links in PubMed citations and nucleotide and protein sequence records. Links to instructions for searching by keyword, protein sequence, and nucleotide sequence are on the main search page. As in other Entrez databases, users can refine searches using fields available in Limits, preview query results and browse index terms in Preview/Index, combine searches using History, and store selected records from different searches on Clipboard. Some indexed fields that can be used to narrow a search include accession number, substance name, author name, journal name, organism, properties, and text word. Boolean operators AND, OR, and NOT must be in upper case. Phrase searching using double quotes and truncation using the asterisk (*) as a wild card also are supported. For more information about searching this and other NCBI Entrez databases, see the Entrez Help Document.
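The query conventions described above (upper-case Boolean operators, double quotes around phrases, and the asterisk as a wild card) can be sketched as a small Python helper that assembles a search string to paste into the query box. The function build_query is hypothetical and is not part of Entrez; it only formats text and does not contact NCBI.

```python
def build_query(terms, operator="AND"):
    """Hypothetical helper: join search terms into one query string.

    Multi-word terms are wrapped in double quotes for phrase searching.
    The Boolean operator is forced to upper case, as the search tips
    require. A trailing asterisk on a term is left as is so that it can
    act as a truncation wild card.
    """
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    formatted = [f'"{term}"' if " " in term else term for term in terms]
    return f" {op} ".join(formatted)


# Example terms chosen for illustration only.
print(build_query(["hemoglobin", "sickle cell", "mutat*"]))
print(build_query(["lysozyme", "insulin"], operator="or"))
```

The helper accepts a lower-case operator and converts it, which guards against the most common mistake the search tips warn about.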

Information Provided: Each structure record or summary includes MMDB and PDB identifiers, links to protein and nucleotide sequences and related MEDLINE documents, taxonomy assignments, structure authors, date the structure was deposited into PDB, PDB classification and macromolecular content, links to sequence and structure neighbors, and structure-viewing options. Entries in MMDB are cross-linked to bibliographic information, sequence database entries, and NCBI taxonomy. To view a structure, users must download NCBI's free 3D structure viewer Cn3D, which is supported by Windows, Macintosh, and UNIX platforms. To learn more about using this viewer, see NCBI's Cn3D Tutorial, Help, and FAQs.


Last modified: October 7, 2003

For feedback and comments about this site, contact the site designer, Jennifer Bownas of HGMIS. To order a poster, click here.

The online presentation of this poster is a special feature of the U.S. Department of Energy (DOE) Human Genome Project Information Web site. The DOE Biological and Environmental Research program of the Office of Science funds this site.