Gene and Protein Database Guide
Resources for learning about genes and the proteins they encode
Molecular Biology Basics
In this guide we
have provided descriptions and links to bioinformatics resources that
can help you learn more about genes and proteins associated with genetic
disorders or traits. Most were designed for students and professionals
in the life sciences, so a certain level of familiarity with genetics
and molecular biology is assumed. For those of you looking for an introduction
to the science behind the Human Genome Project, we have included links
to basic genetics and molecular biology resources that can be reviewed
before you attempt to use the bioinformatics resources.
Genomics
and Its Impact on Science and Society: The Human Genome Project and
Beyond - Department of Energy (DOE) publication. Defines basic genetic
terms and overviews human genome mapping and sequencing, model organism
research, informatics, and the impact of the Human Genome Project.
The
Science Behind the Human Genome Project - Web pages that define
some basic genetics concepts and explain how the Human Genome Project
was implemented.
Genome
Glossary - DOE Human Genome Program glossary of genetics terms.
Can be searched or browsed alphabetically; links to other life science
glossaries.
To Know Ourselves - DOE publication that explains the agency's involvement
with the Human Genome Project. Includes sections on human genome physical
mapping and sequencing; technological developments in laboratory instrumentation;
management of genomic data; and the project's ethical, legal and social
implications.
Molecular Biology Basics
Gene Resources
Nucleotide Sequences
Protein Sequences
Sequence Similarity Searching
Gene Mutations
Protein Structures
Learning about Genes and Their Products
These resources give
a general overview of a gene, along with some of the following information:
official symbol, locus, associated disorders or traits, mode of inheritance,
name and function of gene product, and links to additional gene-specific
resources.
Online Mendelian Inheritance in Man (OMIM)
Overview:
Created and edited by Victor A. McKusick, MD and his colleagues at Johns
Hopkins School of Medicine, OMIM is a large, searchable, up-to-date
database of human genes, genetic traits, and disorders. In addition
to summarizing what is known about a particular gene, trait, or disorder,
each record also contains reference material and links to other NCBI
resources such as literature citations in MEDLINE and related sequence
records. OMIM is intended for use by genetics researchers, advanced
life-science students, and healthcare professionals concerned with genetic
disorders. The database is updated daily. Three different interfaces
are provided for exploring particular genes and genetic conditions:
Search, Gene Map, and Morbid Map. An article
on OMIM is available from the NCBI Bookshelf.
Search Tips:
Browse an alphabetical listing of genetic disorders featured in OMIM
with Morbid Map. View a listing of genes organized by cytogenetic location
using OMIM's Gene Map. Search for information about specific genes,
traits, or disorders using OMIM's Search options. For step-by-step instructions
describing how to search OMIM, see our OMIM
Search Tutorial. Additional information about searching OMIM is
available from the Help
and FAQs
pages.
Information
Provided: Some of the types of information provided in OMIM records
include: genes that have been linked to disorders, the official symbol
for a gene, key mutations in genes that result in disease, functions
of genes and the proteins they encode, as well as descriptions of genetic
conditions and how they are inherited. Links to citations in MEDLINE,
related OMIM entries, and entries in other NCBI databases also are included.
The amount of information included in each entry depends upon how much
researchers know about a particular gene or condition.
NCBI LocusLink
Overview:
LocusLink is an NCBI database that serves as a single query interface
to gene-specific information from a wide variety of bioinformatics sources.
LocusLink includes descriptive information about genetic loci in human,
cow, fruit fly, mouse, rat, nematode, zebrafish, and human immunodeficiency
virus type 1 genomes.
Search Tips:
Users can query LocusLink by typing keywords (such as disease or protein
name, gene symbol, accession numbers, or other database ID numbers)
into the search box at the top of the main page. Query options include
truncation of terms using the asterisk (*) as a wild card, field restriction,
and use of Boolean operators (and, or, not) that are not case sensitive.
Grouping phrases by parentheses or quotation marks is not supported.
LocusLink has a
system of controlled terms that can be used to retrieve only those records
with a particular feature. One of the controlled terms is disease_known,
which will return only loci associated with a known disorder. See the
Query
Tips in the Help file for a complete listing and more detailed descriptions
of controlled terms. For
more on searching LocusLink, see the Help
file. LocusLink also provides a FAQ
page.
Information
Provided: Each LocusLink report may include the following types
of information: links to gene-specific entries in other databases, official
gene nomenclature, LocusID (identification number assigned to the gene
by LocusLink), overview of protein function, alternate symbols and aliases,
phenotypes or expressed characteristics associated with the gene, other
database ID numbers, similar genes from other genomes, links to cytogenetic
maps, links to sequence records, and links to other related information
sources.
GeneCards
Overview:
Developed at the Weizmann
Institute of Science in Israel, GeneCards is a database of human
genes, their products, and their involvement in hereditary disorders.
GeneCards automatically extracts gene-specific information from a variety
of Web-based bioinformatics resources and integrates the data into each
entry. The database was designed for scientists who want to use one
interface to access multiple databases for information about human genes
that have been assigned approved symbols.
Search Tips:
Users can search GeneCards by keyword or gene symbol/alias using the
search box on the home page. Keywords can be single or multiple terms,
GenBank accession number, chromosome number, or gene locus. Truncation
using an asterisk (*) as a wild card at the beginning or end of a search
term is supported. The Boolean operators AND and OR can be used to connect
terms. The Boolean operator NOT is not supported. Examples
are provided for keyword searching. Users may also browse a complete
listing of genes or a subset of disease
genes featured in GeneCards. To learn more about searching GeneCards,
check out Quick
Start, Guided
Tour and The
GeneCards Guidance System.
Information
Provided: Each GeneCard may include the following information: official
gene name and symbol, synonyms or alternative names, ID numbers assigned
to the gene in other databases, chromosomal location, chromosome map
showing where the gene is found, domains and protein families associated
with the gene's protein product, links to sequence records, expression
patterns in human tissues, links to similar genes in other organisms,
SNPs and variants, disorders and mutations, links to citations in Medline,
and links to other related resources. Each GeneCard also links to sources
used to create the entry. GeneCards encourages feedback from its users
and provides a form
for submitting comments and suggestions.
Nucleotide Sequence Databases
International Nucleotide Sequence Database Collaboration
This collaboration is a coordinated effort among three key sequence
repository centers: GenBank at the National Center for Biotechnology
Information (NCBI), the European Molecular Biology Laboratory (EMBL), and
the DNA Data Bank of Japan (DDBJ). Sequence data are exchanged daily among
these three organizations. Although record formats and search systems may
differ, information contained in each record (accession number, sequence
data, annotations) will be the same for all three databases.
NCBI GenBank and Entrez Nucleotide
Overview:
GenBank
is an NCBI database that serves as an archive for all publicly available
DNA sequences from more than 100,000 different organisms. Submitting
scientists retain complete editorial control over their sequences, so
they decide on gene symbols (which may not be the official ones) and
what additional information to include. Scientists contact NCBI if they
wish to make any modifications to their sequence records. As an archival
database, GenBank can include redundant entries, even hundreds of records
for the same gene, and some entries may contain errors in their sequence
data. To address some problems associated with this archival database,
NCBI developed the nonredundant RefSeq. RefSeq
is a curated, nonredundant source of sequence data for genomic DNA,
mRNA transcripts, and proteins of major research organisms. Unlike GenBank
records, RefSeq records are created, reviewed, and updated by NCBI staff.
Each RefSeq entry has a distinct accession number that begins with two
characters followed by an underscore; the two-character prefix indicates
the sequence type. For more information about RefSeq, see RefSeq FAQs.
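The prefix convention described above can be checked programmatically. The following Python sketch is illustrative only: the function name is hypothetical, the prefix table lists just a few commonly seen RefSeq prefixes rather than the full set, and the accession strings in the usage lines are shown only for their format.

```python
import re

# A partial table of RefSeq prefixes; NCBI defines more than are listed here.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "predicted mRNA (model)",
    "XP": "predicted protein (model)",
}

# Two uppercase letters, an underscore, digits, and an optional ".version".
ACCESSION_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_refseq_accession(accession):
    """Return the sequence type implied by a RefSeq-style accession.

    Returns None when the string does not follow the
    two-characters-plus-underscore pattern (e.g., a GenBank accession).
    """
    match = ACCESSION_PATTERN.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    return REFSEQ_PREFIXES.get(prefix, "RefSeq type not in this partial table")


print(describe_refseq_accession("NM_000546"))  # RefSeq-style mRNA accession
print(describe_refseq_accession("U12345"))     # not RefSeq-style, prints None
```

A check like this is a quick way to tell curated RefSeq records apart from archival GenBank records in a mixed list of accession numbers.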
Search Tips:
There are a few different ways to access sequence records at NCBI:
text-searching with Entrez
Nucleotide, BLAST searching, or linking to sequence records from
databases and tools such as LocusLink,
OMIM,
or Map Viewer. Entrez Nucleotide is a part of NCBI's Entrez
search and retrieval system that can be used to search several linked
databases, such as sequence databases, structure databases, OMIM, genome
assemblies, and biomedical literature. With all Entrez databases, users
can refine search strategies using fields available in Limits and Preview/Index,
browse Index terms of a particular field, combine searches using History,
and store selected records from different searches on a Clipboard. Some
search-refining techniques available from the Limits page are to exclude
certain types of sequences (e.g., ESTs) and limit the search by date
or particular database (e.g., search only RefSeq). Boolean operators
AND, OR, and NOT must be in upper case. Phrase searching using double
quotes and truncation using the asterisk (*) as a wild card also are
supported. For more information about searching this and other NCBI
Entrez databases, see Entrez
Help Document. For step-by-step instructions on finding and interpreting
sequence records, see our tutorial Accessing
records in NCBI sequence databases.
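The query rules above (upper-case Boolean operators, double quotes for phrases, and the asterisk for truncation) can be applied when assembling a search string. This is a minimal Python sketch; the helper names are hypothetical, the search terms are illustrative, and the code only builds text to paste into the search box. It does not contact NCBI.

```python
def format_term(term):
    """Quote multi-word terms so they are searched as a phrase."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_query(required, excluded=()):
    """Join required terms with AND and append each excluded term with NOT.

    Operators are written in upper case, as Entrez requires.
    """
    if not required:
        raise ValueError("at least one required term is needed")
    query = " AND ".join(format_term(t) for t in required)
    for term in excluded:
        query += " NOT " + format_term(term)
    return query


# A phrase, a plain term, and an exclusion.
print(build_query(["BRCA1", "homo sapiens"], excluded=["EST"]))

# Truncation: "kinase*" is intended to match kinase, kinases, and so on.
print(build_query(["cystic fibrosis", "kinase*"]))
```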
Information
Provided: Each record returned in a search will include the nucleotide
sequence and annotations such as accession numbers, keywords, source
organism, and citations for references. Sequence records also may contain
the translated amino acid sequence. For more detailed descriptions of
types of information in each sequence record, check the Sample
GenBank Record provided by NCBI.
Protein Sequence Databases
Entrez Proteins
Overview:
Part of the National Center for Biotechnology Information (NCBI) Entrez
system, this database includes sequence data compiled from a variety of
sources, including Swiss-Prot, Protein Information Resource (PIR), Protein
Data Bank (PDB), and Protein Research Foundation (PRF) in Japan. Some
protein sequences were created from translations of coding regions in
DNA sequences stored in GenBank and RefSeq.
Search Tips:
As with other Entrez databases, users can refine search strategies using
fields available in Limits, preview the number of search results for
a query, browse Index terms of a particular field, combine searches
using History, and store selected records from different searches on
Clipboard. Some of the indexed fields that can be used to narrow a search
include accession number, gene name, molecular weight, organism, properties,
protein name, and sequence length. Users also can specify that only
one particular database be searched (e.g., retrieve protein sequences
from Swiss-Prot only). Boolean operators AND, OR, and NOT must be in
upper case. Phrase searching using double quotes and truncation using
the asterisk (*) as a wild card also are supported. For more information
about searching this and other NCBI Entrez databases, see the Entrez
Help Document. For step-by-step instructions on finding and interpreting
sequence records, see our tutorial on accessing
sequence records.
Information
Provided: Search results displayed using the default view will include
locus name (a unique name assigned to each record), sequence length,
protein description (definition), accession number, database source,
keywords, organism, citations to references, comments concerning protein
function or associated traits or disorders, information about sequence
regions of biological significance, and the amino acid sequence. For
detailed descriptions about fields presented in each NCBI sequence record,
see the GenBank
sample record.
Swiss-Prot/TrEMBL
Overview:
The protein sequence databases Swiss-Prot and TrEMBL were developed
by groups at the Swiss
Institute of Bioinformatics (SIB) and the European
Bioinformatics Institute (EBI). Swiss-Prot uses three key criteria:
high level of annotation, minimal redundancy, and high level of integration
with other databases. Swiss-Prot includes as much information as possible
in its annotations, and external experts review current literature and
provide comments and updates on different protein groups. Swiss-Prot's
depth of annotation, however, requires considerable time and effort.
To keep a current database of protein sequences, a supplement called TrEMBL
(Translation of EMBL) was developed. Translations of nucleotide sequences
from EMBL (European Molecular Biology Laboratory) databases are computer
annotated and stored in TrEMBL until sequences can be fully annotated
and integrated into Swiss-Prot.
Search Tips:
Swiss-Prot sequence records can be accessed through the NCBI Entrez
Proteins database. If users choose to access the Swiss-Prot/TrEMBL Web
site for sequence searching, they can query the database using a variety
of methods: quick search on the main page (Boolean operators not supported),
Sequence Retrieval
System (SRS), full-text
search (Boolean operators, phrase searching, and wild cards supported),
and advanced
search. Forms for searching by accession number or ID, description
(entry name, gene name, species, organelle), author, or citation also
are provided. To learn more about searching Swiss-Prot, see the Swiss-Prot
Documentation section, which includes a downloadable PDF version
of the user manual.
Information
Provided: Swiss-Prot entries are described as containing two types
of data: core data (consisting of sequence, bibliographic references,
and description of the protein's biological origin) and the annotation.
Detailed annotations in each entry describe protein function, post-translational
modification (e.g., addition of sugars or phosphate groups after mRNA
translation), domain and binding sites, secondary structure, quaternary
structure (e.g., homodimer, heterodimer), disorders associated with
altered protein forms or amounts, variants, and similarities to other
proteins.
Protein Information Resource - Protein Sequence Database (PIR-PSD)
Overview:
Established in 1984, Protein
Information Resource (PIR) is a division of the National Biomedical
Research Foundation associated with Georgetown University Medical Center.
In collaboration with Munich
Information Center for Protein Sequences (MIPS) in Germany and the
Japan International Protein Information Database (JIPID), PIR has developed
the PIR-International Protein Sequence Database (PSD). Its mission is
to be "the most comprehensive and expertly annotated protein sequence
database in the public domain" with the primary objective of achieving
"properties of Comprehensiveness, Timeliness, Non-Redundancy, Quality
Annotation, and Full Classification."
Search Tips:
PIR sequence records can be accessed through the NCBI Entrez Proteins
database. If users choose to go to the PIR-PSD Web site, the following
search options are provided: search by unique identifier or accession
number, basic text search, and advanced text search. For basic text
searches, the Boolean operators AND, OR, and NOT are not supported,
and a space between terms is interpreted as "and." Advanced searches
allow users to refine a strategy with fields such as Title, Species,
Author, Keyword, and Gene Name. In advanced search, search terms are
case sensitive and must be at least three characters long. Boolean operators
OR and NOT are supported. A space between words is interpreted as "and,"
so users searching for a phrase must put a character between multiple
terms (e.g., enter homo-sapiens to search for "homo sapiens"). For more
on searching PIR-PSD, see Help
Searching PIR Databases, Sample
Entry, Demo
Search, and FAQs.
Information
Provided: Each record includes protein name; classification and
origin; literature references; protein features such as domains and
motifs; primary sequence data; and links to related entries in other
databases. Users have the option to create submission forms for similarity
searching in PIR and NCBI databases. At the top of each record are links
to annotation and sequence data within the record and a link to a composition
table that summarizes total amino acid composition expressed as percentages.
At the bottom of the record are direct links to Protein Data Bank (PDB)
structures and sequence similarity alignments associated with the protein.
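A composition table of the kind mentioned above is simple to reproduce for any protein sequence. The sketch below is an illustration, not PIR's own method: the function name is hypothetical and the short sequence in the usage line is made up for the example.

```python
from collections import Counter


def amino_acid_composition(sequence):
    """Return each residue's share of the sequence as a percentage.

    The input is a string of single-letter amino acid codes; whitespace
    and other non-letter characters are ignored.
    """
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    total = len(residues)
    if total == 0:
        return {}
    counts = Counter(residues)
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}


# Print one line per residue, most abundant first.
composition = amino_acid_composition("MKKA")
for residue, percent in sorted(composition.items(), key=lambda kv: -kv[1]):
    print(f"{residue}  {percent:.1f}%")
```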
Resources for Sequence Similarity Searching
Scientists frequently
perform sequence-similarity searching to see if a gene or protein from
one organism has a similar counterpart in another organism. For example,
to determine the function and biological importance of a new human protein,
scientists often identify a similar mouse protein and then use that protein
as a model for studying the human protein.
As we know from molecular
biology's "central dogma," the order of nucleotides in a gene's
DNA sequence determines the order of amino acids in a protein sequence.
Each set of three nucleotides (called a codon) in the DNA sequence encodes
a particular amino acid. See the Table
of Standard Genetic Code to find which codons are associated with which
amino acids.
Since more than one
codon can encode the same amino acid, there is a considerable amount of
variability in the nucleotide sequence that could translate into the same
amino acid sequence. The genetic code's degenerate nature is the reason
that similarity searching using amino acid sequences generally is more
informative than using nucleotide sequences.
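To make the degeneracy of the genetic code concrete, the following Python sketch translates two different DNA sequences that encode the same short peptide. The codon table is the standard genetic code written in the compact TCAG ordering; the function name and both example sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


seq_a = "ATGCTTTCTCGT"
seq_b = "ATGTTAAGCAGA"

# Count positions where the two nucleotide sequences differ.
differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)

print(translate(seq_a), translate(seq_b))
print(f"{differences} of {len(seq_a)} nucleotides differ")
```

The two sequences differ at more than half of their nucleotide positions yet translate to the same amino acids, which is why a protein-level comparison can detect a relationship that a nucleotide-level comparison might miss.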
Users who are new
to sequence-similarity searching should check out NCBI's Introduction
to Similarity Page, Homology
- General Rules, and BLAST
Guide's Glossary.
NCBI BLAST
Overview:
BLAST (Basic Local Alignment Search Tool) is a set of programs designed
to perform similarity searches on all available sequence data. BLAST
uses an algorithm developed by the National Center for Biotechnology
Information (NCBI) that seeks out local alignment (alignment of some
portion of two sequences) as opposed to global alignment (alignment
of two sequences over their entire length). By searching for local alignments,
BLAST can identify regions of similarity in two sequences. Some similarity
searches offered by NCBI include comparing an amino acid sequence to
a protein sequence database (blastp), comparing a nucleotide query sequence
to a nucleotide sequence database (blastn), and comparing a nucleotide
sequence translated in all reading frames to a protein sequence database
(blastx).
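The difference between local and global alignment can be seen in a small scoring example. Note that the sketch below is not BLAST: BLAST uses a faster heuristic, whereas this code uses the classic Smith-Waterman (local) and Needleman-Wunsch (global) dynamic-programming recurrences. The scoring values and the two sequences are arbitrary choices for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman score: scores never drop below zero, so the
    alignment may start and end anywhere inside the two sequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences must be aligned end to end,
    so unrelated flanking regions are penalized."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


# Both sequences share the region ACGTAC but have unrelated flanks.
seq_a = "GGGGACGTACGG"
seq_b = "TTACGTACTT"
print("local: ", local_alignment_score(seq_a, seq_b))
print("global:", global_alignment_score(seq_a, seq_b))
```

The local score reflects only the shared region, while the global score is pulled down by the mismatched ends. This is the reason a local method is well suited to finding a conserved domain inside otherwise dissimilar sequences.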
Search Tips:
From the main BLAST page, users can choose among several NCBI services.
For service descriptions, click on the question mark to the right of
each section title or see the Description
of BLAST Services. Clicking on the desired BLAST search option will
lead to a search page with a box for entering the query sequence. Accepted
input includes a sequence in FASTA format (a single-line description
followed by sequence data), bare sequence (sequence data without the
single-line description), and identifier. The identifier may be an accession
number or GenInfo Identifier (GI number), but must be entered as a single word
without any spaces between characters. For more information about input,
see NCBI's
Search Format page. Each search or format option on the search page
links to Help documentation with more detailed descriptions of each
option. For more on how to use BLAST, see our Sequence
Similarity Searching tutorial and NCBI's step-by-step BLAST
GUIDE, Query
Tutorial for new users,
BLAST Tutorial, and BLAST
Help.
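The two sequence input styles described above, FASTA and bare sequence, can be told apart by the first line. In FASTA format the single-line description begins with a greater-than sign (>). The Python sketch below is a simplified illustration with a hypothetical function name; it does not handle identifiers such as accession numbers, and the example input is made up.

```python
def parse_query(text):
    """Split pasted query text into a description and a sequence.

    FASTA input starts with a '>' description line. Anything else is
    treated as a bare sequence with no description.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("query is empty")
    if lines[0].startswith(">"):
        description = lines[0][1:].strip()
        sequence_lines = lines[1:]
    else:
        description = None
        sequence_lines = lines
    # Sequence data may be wrapped over several lines; join them.
    sequence = "".join(sequence_lines).replace(" ", "").upper()
    return description, sequence


fasta_text = """>query1 example sequence
ACGTACGT
TTGACC"""

print(parse_query(fasta_text))
print(parse_query("acgtacgt\nttgacc"))
```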
Information
Provided: After submitting a BLAST request, users are presented
with a Formatting BLAST page that displays the query statement,
domain information, request for ID number, and format options. After
desired format options are selected, pressing the Format button
will pull up the Results of BLAST page. Using pairwise alignment
(the default alignment view) in format options, the Results page
will display an image map graphically depicting retrieved database sequences
(subject sequences) aligned with query sequence (depicted as the numbered
line at the top). Passing the mouse over each line below the query sequence
will display a description of that sequence in the text box. Clicking
on each line will jump down to the corresponding pairwise alignment
between the query sequence and a particular subject sequence. Below
the image map is a list of sequences producing significant alignments.
The accession number or identifier for each alignment links to a sequence
record. The score links to the corresponding pairwise alignment at the
bottom of the Results page. The blue L seen in some results
links to a related entry in LocusLink. See the Sequence
Similarity Searching tutorial for more on interpreting BLAST results.
PIR FASTA Similarity Search
Overview:
The FASTA Similarity Search tool is part of the Protein
Information Resource (PIR) collection of protein databases and bioinformatics
tools. This similarity-search tool uses the FASTA algorithm, which compares
a query sequence to those in the Protein Sequence Database and other
PIR databases.
Search Tips:
Users can query the database by inserting the single-letter amino acid
code into the query box or by entering the valid PIR-PSD entry code
for a particular protein of interest. See the Demo
Search for an example.
Information
Provided: Query results are presented in a table that lists more-similar
sequences at the top and less-similar sequences toward the bottom. Clicking
on the ID number for a result will pull up the database entry for that protein,
and clicking on the colored bar on the right will link to the pairwise alignment
between the submitted sequence and the subject sequence retrieved from
the database.
Gene Mutation Resources
Genes carry instructions
for building proteins, molecules that do most of the body's work. Certain
variations in a gene's nucleotide sequence can affect the resulting protein's
function by altering amino acid sequence and protein structure. The inability
of some variant proteins to function properly can cause genetic disorders
or other distinctive phenotypes.
Online Mendelian Inheritance in Man (OMIM): Allelic Variants
Overview:
OMIM records for many genes include an Allelic Variants section that
summarizes published research concerning selected allelic variants or
mutations, many of which cause disorders. Some criteria for selecting
allelic variants for inclusion in OMIM are first mutation discovered,
high population frequency, distinctive phenotype, and unusual disease-causing
mechanism. Each variant is assigned a ten-digit number made up of the
gene's six-digit OMIM number, followed by a period and four digits unique
to the variant. For more information about this database, see the OMIM
entry above in the Learning about Genes and Their
Products section.
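The numbering scheme for allelic variants (six-digit OMIM number, a period, then four digits) lends itself to a simple format check. In this Python sketch the function name is hypothetical and the number in the usage line is a placeholder chosen only to show the format. It is not meant to refer to a real entry.

```python
import re

# Six digits for the gene's OMIM number, a period, four digits for the variant.
VARIANT_PATTERN = re.compile(r"^(\d{6})\.(\d{4})$")


def split_allelic_variant(number):
    """Split an allelic variant number into (gene OMIM number, variant digits)."""
    match = VARIANT_PATTERN.match(number.strip())
    if match is None:
        raise ValueError(f"not in NNNNNN.NNNN form: {number!r}")
    return match.group(1), match.group(2)


gene_number, variant_digits = split_allelic_variant("123456.0001")
print("gene entry:", gene_number)
print("variant:   ", variant_digits)
```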
HGVbase
Overview:
The Human Genome Variation database (HGVbase) is a database of annotated
records for known sequence variations in the human genome. This database
was designed as a tool to help scientists understand how common genome
sequence variations, such as single nucleotide polymorphisms, result
in complex phenotypes such as disease susceptibility and reactions to
drugs. Each HGVbase record features data extracted from publicly available
genome databases or published literature that has been subjected to
manual review and enhanced with annotations. HGVbase shares data with
NCBI's dbSNP,
and currently incorporates about 40% of dbSNP's records into its database.
HGVbase is funded by the Karolinska Institute Center for Genomics and
Bioinformatics in Sweden, the European Bioinformatics Institute, and
the European Molecular Biology Laboratory (EMBL).
Search Tips:
HGVbase provides text search and sequence search options for its users.
In addition to the quick search box available on the HGVbase home page,
there are links to four different search tools: Text Search, Text+ Search,
Sequence Search, and Regional Search. The Text and Text+ search forms
allow users to search for records by text strings that can be targeted
to particular fields of a record. The Regional Search lets users search
for SNPs by chromosomal location.
Since some characterized
genes may lack standardized names, HGVbase recommends sequence searching
over text-based searching. To search by sequence, simply paste DNA or
RNA sequence data (in any format) into the Sequence Search form and
click "Run." For more information about searching HGVbase see the "How
to search" page available from the navigation menu on the left or click
the "Help" link in the upper right corner of each search form.
Information
Provided: Some features included in each record are the variant
type, accession numbers that link to sequences that contain the variant,
portions of the sequence that flank the variant, alleles or possible
nucleotides at the site of the polymorphism, associated gene names and
symbols, the region of the gene where the variant is found (e.g., exon
or intron), and citations to source literature. For more information
about the various fields of each HGVbase record see the Data
Structure Record.
Human Gene Mutation Database
Overview:
Human Gene Mutation Database (HGMD) is a collection of published gene
lesions associated with human hereditary disorders. This database is
maintained by the Institute for Medical Genetics at University of Wales
College of Medicine. HGMD collaborates with Celera Genomics and is supported
by Genome Database (GDB) and several biotechnology companies. The home
page links to a useful overview of mutation
nomenclature.
Search Tips:
HGMD provides a simple search
interface for querying its database by disease, gene name, and gene
symbol. All punctuation marks (e.g., slashes, plus signs, double quotes,
commas, and dashes) are ignored. Truncation using an asterisk (*) is
supported. For more information on using HGMD, see the Help
file.
Information
Provided: Each search will pull up a list of gene symbols corresponding
to search terms. Clicking on a gene symbol will access a record summarizing
mutations and phenotypes and the number of entries associated with each
mutation type and phenotype. Clicking on a mutation type will show the
accession number, location, and associated phenotype and link to a reference
citation for each mutation. The record for each gene also links to a
mutation map, the gene's cDNA sequence, and gene-specific records in
other databases.
NCBI dbSNP
Overview:
One of the most common types of DNA sequence variation is the single
nucleotide polymorphism (SNP), in which a single nucleotide base (A,
C, T, or G) is substituted for another. NCBI's Database of Single Nucleotide
Polymorphism (dbSNP) serves as a public repository for sequence variations
such as small-scale insertions or deletions, polymorphic repetitive
elements, and microsatellite variation, in addition to SNPs. Data can
come from any part of a genome in any species. Sequence variations are
submitted to the database by members of the scientific community. This
database is separate from GenBank but is cross-linked to records in
other NCBI resources such as GenBank, LocusLink, and PubMed.
For more about
SNPs and why they are important to biomedical research, see the SNP
Fact Sheet and NCBI's SNPs:
Variation on a Theme.
Search Tips:
Users can search dbSNP directly or access the database through other
NCBI resources. One way to access SNP data mapped to a particular gene
is to use NCBI LocusLink. Once you have found a gene's LocusLink record,
clicking on the purple V or VAR link (if available) will
open a list of SNPs mapped to that locus. Records in NCBI's sequence
databases also may link to SNP data.
To search dbSNP
directly, use Entrez
SNP or dbSNP's Easy
Search Form. dbSNP also provides a BLAST
search option that compares the query sequence with sequence data
contained in each SNP record. The BLAST option will generate a list
of SNPs that can be found within the query sequence. See the Entrez
SNP main page for descriptions of the different fields that can
be used for searching the database.
NCBI will soon
feature a quick how-to guide called GETTING STARTED. This guide should
help novice users learn how to use and design search strategies for
dbSNP. To learn more about dbSNP, see the FAQs
page.
Information
Provided: From LocusLink, after clicking on the purple V
or VAR link, the SNP's linked from LocusLink page will
open. This page provides Gene Model information with links to associated
contig, mRNA, and protein sequence records. Each SNP is included in
the graphic gene model and color-coded based on where the SNP is located
(intron, exon, or untranslated region) and whether the change is synonymous
or nonsynonymous. For each SNP that occurs in an exon, the associated
nucleotide, codon position, and amino acid residue are given.
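Whether an exonic SNP is synonymous or nonsynonymous follows directly from the genetic code: the change is synonymous when the original and altered codons encode the same amino acid. The Python sketch below illustrates the idea using the standard genetic code. It is not dbSNP's own procedure, and the function name and example codons are chosen for illustration.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-base change within one codon.

    position is the codon position of the SNP (1, 2, or 3). Returns a
    tuple of (classification, original residue, altered residue), where
    residues are single-letter codes and '*' means a stop codon.
    """
    codon = codon.upper()
    if len(codon) != 3 or position not in (1, 2, 3):
        raise ValueError("expected a three-base codon and a position of 1 to 3")
    variant = codon[:position - 1] + new_base.upper() + codon[position:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[variant]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return kind, ref_aa, alt_aa


print(classify_snp("CTT", 3, "C"))  # third-position change
print(classify_snp("GAG", 2, "T"))  # second-position change
```

Changes at the third codon position are often synonymous because many amino acids are encoded by codons that differ only in that position.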
Each SNP is assigned
an identification number called a cluster id or rs number. The record
for each cluster id is referred to as a cluster report and includes
source organism, variation type (e.g., SNP (single nucleotide polymorphism)
or DIP (deletion/insertion polymorphism)), the nucleotide sequence flanking
the SNP in FASTA format, a LocusLink Analysis map depicting where the
SNP is found within the gene, and links to other NCBI resources related
to the particular SNP. Submitter records for each cluster provide one
or more links to more detailed descriptions for each SNP submission.
Human Genome Variation Society: Variation Databases and Related Sites
Overview:
This Web site is a collection of different types of mutation databases,
such as locus-specific, disease-centered, national and ethnic, and nonhuman
databases. Locus-specific databases are arranged alphabetically by gene
symbol. Links to other related databases and educational resources also
are provided.
Genome Web: Human Mutation Databases
Overview:
This resource, from the UK Human Genome Mapping Project Resource Centre,
is a collection of links to general mutation and locus-specific databases.
A brief description of each database is found below the list of links.
Protein
Structure Resources
Databases described
in this section can provide a better understanding of what a gene's protein
product looks like. For some well-studied proteins, users also may find
structures of mutant forms that can be compared with structures of nonmutated
or wild-type proteins.
A good, basic introduction
to protein structures, X-ray crystallography, and nuclear magnetic resonance
spectroscopy (NMR) can be found in the National Institute of General Medical
Sciences (NIGMS) 2001 publication The Structures of Life (67 pp.). A free
copy can be ordered from the NIGMS
Publication List or downloaded as a PDF
file (requires Adobe
Acrobat Reader).
For more information:
Nature
of 3-D Structural Data: The Protein Data Bank's brief introduction
to X-ray crystallography and NMR
Crystallography
101: Tutorial by Dr. Bernhard Rupp at Lawrence Livermore National
Laboratory
The
Basics of NMR: Online textbook by Dr. Joseph P. Hornak, professor
of Chemistry and Imaging Science at the Rochester Institute of Technology
Protein
Data Bank
Overview:
Protein Data Bank (PDB) is an international archive of 3D structural
information for biological macromolecules. PDB is managed by the Research
Collaboratory for Structural Bioinformatics (RCSB), a nonprofit consortium
involving Rutgers, the State University of New Jersey; National Institute
of Standards and Technology (NIST); and San Diego Supercomputer Center
at the University of California, San Diego.
Search Tips:
Users can query the archive by PDB ID or keyword using the search box
on the main page. Other query options include SearchLite
(keyword search form with examples), SearchFields
(an advanced search option with customizable fields), and Status
Search (used to find structures being processed by PDB). To learn
more about searching PDB, take the Query
Tutorial or examine the User
Guides.
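As a rough illustration of querying by PDB ID, the Python sketch below checks that a string has the classic four-character ID form (a digit followed by three letters or digits) and builds a link to the corresponding coordinate file. The helper names are hypothetical, and the URL pattern is an assumption based on the current RCSB file server; it is not described in this guide and may change.

```python
import re

# Classic PDB IDs are four characters: one digit, then three letters or digits.
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def normalize_pdb_id(text):
    """Return the ID in upper case, or raise ValueError if it is malformed."""
    pdb_id = text.strip().upper()
    if not PDB_ID_PATTERN.match(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {text!r}")
    return pdb_id


def pdb_file_url(text):
    """Build a download link for a structure file (assumed URL pattern)."""
    return f"https://files.rcsb.org/download/{normalize_pdb_id(text)}.pdb"
```

With the example ID "4hhb", pdb_file_url("4hhb") returns "https://files.rcsb.org/download/4HHB.pdb", while a keyword such as "hemoglobin" raises ValueError and should go through the keyword search instead.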
Information
Provided: Each structure record includes a summary, structure viewing
options, download and display options, links to records of structural
neighbors, geometry, links to other protein information sources, and
details about the structure's sequence. For step-by-step instructions
on interacting with 3-D structures, see Examining
a Protein's Structure.
Entrez
Structure
Overview:
The National Center for Biotechnology Information (NCBI) database of
three-dimensional molecular structure is called the Molecular Modeling
Database (MMDB). The database is searchable via NCBI's Entrez retrieval
system. Structure data are derived from X-ray crystallography and nuclear
magnetic resonance (NMR) structure determinations deposited in the Protein Data
Bank (PDB). This database is considerably smaller than Entrez's nucleotide
and protein sequence databases. If a structure for a known sequence
is not included, the structure of a protein homolog may be available
for examination.
Search Tips:
Users can use the query interface to search by keyword, or access structure
records directly through links in PubMed citations and nucleotide and
protein sequence records. Links to instructions for searching by keyword,
protein sequence, and nucleotide sequence are on the main search page.
As in other Entrez databases, users can refine searches using fields
available in Limits, preview query results and browse index terms in
Preview/Index, combine searches using History, and store selected records
from different searches on Clipboard. Some indexed fields that can be
used to narrow a search include accession number, substance name, author
name, journal name, organism, properties, and text word. The Boolean operators
AND, OR, and NOT must be in uppercase. Phrase searching using double
quotes and truncation using the asterisk (*) as a wild card also are
supported. For more information about searching this and other NCBI
Entrez databases, see the Entrez
Help Document.
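The search rules above (uppercase Boolean operators, double-quoted phrases, asterisk truncation, and field restriction) can be sketched in Python as a small query builder. The helper names are hypothetical. The [Organism] field tag and the use of NCBI's E-utilities esearch address with db=structure are assumptions that go beyond this guide, so check them against the Entrez Help Document before relying on them. The code only builds a URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed address of NCBI's E-utilities search service (not named in this guide).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, field=None):
    """Quote multiword phrases and attach an optional [Field] tag."""
    term = f'"{value}"' if " " in value else value
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join terms with a Boolean operator, which Entrez expects in upper case."""
    operator = operator.upper()
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    return "(" + f" {operator} ".join(terms) + ")"


def structure_search_url(term, retmax=20):
    """Build (but do not send) a search request against the structure database."""
    params = {"db": "structure", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


query = combine(
    "AND",
    field_term("tumor suppressor"),            # phrase search in double quotes
    field_term("Homo sapiens", "Organism"),    # restrict to one indexed field
    field_term("kinas*"),                      # truncation with the asterisk
)
url = structure_search_url(query)
```

Here query evaluates to ("tumor suppressor" AND "Homo sapiens"[Organism] AND kinas*), the same string a user could type into the Entrez search box.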
Information
Provided: Each structure record or summary includes MMDB and PDB
identifiers, links to protein and nucleotide sequences and related MEDLINE
documents, taxonomy assignments, structure authors, date the structure
was deposited into PDB, PDB classification and macromolecular content,
links to sequence and structure neighbors, and structure-viewing options.
Entries in MMDB are cross-linked to bibliographic information, sequence
database entries, and NCBI taxonomy. To view a structure, users must
download NCBI's free 3D structure viewer Cn3D,
which runs on Windows, Macintosh, and UNIX platforms. To learn
more about using this viewer, see NCBI's Cn3D
Tutorial, Help,
and FAQs.
|