Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
MOLECULAR MODELING: A METHOD FOR UNRAVELING PROTEIN STRUCTURE
AND FUNCTION
|
|
Proteins form our bodies
and help direct their many systems.
|
Proteins are fundamental components of all living cells.
They exhibit an enormous amount of chemical and structural diversity,
enabling them to carry out an extraordinarily diverse range of biological
functions. Proteins help us digest our food, fight infections, control
body chemistry, and in general, keep our bodies functioning smoothly.
Scientists know that the critical feature of a protein is its ability
to adopt the right shape for carrying out a particular function.
But sometimes a protein twists into the wrong shape or has a missing
part, preventing it from doing its job. Many diseases, such as Alzheimer's
and "mad cow", are now known to result from proteins that have adopted
an incorrect structure.
Identifying a protein's shape, or structure, is key to understanding
its biological function and its role in health and disease. Illuminating
a protein's structure also paves the way for the development of
new agents and devices to treat a disease. Yet solving the structure
of a protein is no easy feat. It often takes scientists working
in the laboratory months, sometimes years, to experimentally determine
a single structure. Therefore, scientists have begun to turn toward
computers to help predict the structure of a protein based on its
sequence. The challenge lies in developing methods that accurately
and reliably capture the intricate relationship between sequence and structure.
|
|
|
Levels of Protein Structure
Proteins function through
their conformation. |
|
To produce proteins, cellular structures called ribosomes
join together long chains of subunits. A set of 20 different subunits,
called amino acids, can be arranged in any order to form
a polypeptide that can be thousands of amino acids long.
These chains can then loop about each other, or fold, in
a variety of ways, but only one of these ways allows a protein to
function properly. The critical feature of a protein is its ability
to fold into a conformation that creates structural features, such
as surface grooves, ridges, and pockets, which allow it to fulfill
its role in a cell. A protein's conformation is usually described
in terms of levels of structure. Traditionally, proteins are looked
upon as having four distinct levels of structure, with each level
of structure dependent on the one below it. In some proteins, functional
diversity may be further amplified by the addition of new chemical
groups after synthesis is complete.
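To get a feel for how much diversity a set of 20 subunits allows, the short sketch below counts the distinct chains of a given length and checks that a sequence uses only the 20 standard one-letter amino acid codes. The function names are illustrative and are not part of any NCBI tool.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_polypeptide(sequence):
    """Return True if the sequence is non-empty and uses only the 20 standard residues."""
    return len(sequence) > 0 and all(
        residue in AMINO_ACIDS for residue in sequence.upper()
    )


def count_possible_chains(length):
    """Number of distinct chains of this length: 20 choices at every position."""
    return len(AMINO_ACIDS) ** length


print(is_valid_polypeptide("MKTAYIAKQR"))
print(count_possible_chains(3))
print(count_possible_chains(100))
```

Even a short chain of 100 residues has far more possible sequences than could ever be tested one by one, which is part of why predicting the single functional fold is difficult.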
The stringing together of the amino acid chain to form a polypeptide
is referred to as the primary structure. The secondary
structure is generated by the folding of the primary sequence
and refers to the path that the polypeptide backbone of the protein
follows in space. Certain types of secondary structures are relatively
common. Two well-described secondary structures are the alpha
helix and the beta sheet. In the first case, certain
types of bonding between groups located on the same polypeptide
chain cause the backbone to twist into a helix, most often in a
form known as the alpha helix. Beta sheets are formed when a polypeptide
chain bonds with another chain that is running in the opposite direction.
Beta sheets may also be formed between two sections of a single
polypeptide chain that is arranged such that adjacent regions are
in reverse orientation.
The tertiary structure describes the organization in three
dimensions of all of the atoms in the polypeptide. If a protein
consists of only one polypeptide chain, this level then describes
the complete structure.
Multimeric proteins, or proteins that consist of more than
one polypeptide chain, require a higher level of organization. The
quaternary structure defines the conformation assumed by
a multimeric protein. In this case, the individual polypeptide chains
that make up a multimeric protein are often referred to as the protein
subunits. The four levels of protein structure are hierarchical;
that is, each level of the build process is dependent upon the one
below it.
|
|
Proteins can be divided into two general classes based on their tertiary
structure.
- Fibrous proteins have elongated structures, with the polypeptide chains
arranged in long strands. This class of proteins serves as major
structural components of cells, and therefore their role tends to be
static in providing a structural framework.
- Globular proteins have more compact, often irregular structures.
This class of proteins includes most enzymes and most of the proteins
involved in gene expression and regulation.
|
|
|
|
How Do Proteins Acquire Their Correct Conformations?
A protein's primary amino acid sequence is crucial in determining
its final structure. In some cases, amino acid sequence is the sole
determinant, whereas in other cases, additional interactions may be
required before a protein can attain its final conformation. For
example, some proteins require the presence of a cofactor,
a second molecule that is part of the active protein, before
they can attain their final conformation. Multimeric proteins often
require one or more subunits to be present before another
subunit can adopt the proper higher-order structure. In any case,
the entire process is cooperative; that is,
the formation of one region of secondary structure determines the
formation of the next region.
|
|
|
Allosteric Proteins
Allosteric proteins
can change their shape and function depending on the environmental
conditions in which they are found. |
|
Under certain conditions, a protein may have a stable alternate
conformation, or shape, that enables it to carry out a different biological
function. Proteins that exhibit this characteristic are called allosteric.
The interaction of an allosteric protein with a specific cofactor,
or with another protein, may influence the transition of the protein
between shapes. In addition, any change in conformation brought
about by an interaction at one site may lead to an alteration in
the structure, and thus function, at another site. One should bear
in mind, though, that this type of transition affects only the protein's
shape, not the primary amino acid sequence. Allosteric proteins
play an important role in both metabolic and genetic regulation.
|
|
|
Determining Protein Structure
Traditionally, a protein's structure was determined using one of
two techniques: X-ray crystallography or nuclear magnetic
resonance (NMR) spectroscopy.
|
|
|
X-ray Crystallography
For this technique, the molecule under study must first be crystallized,
and the crystals must be single and of very high quality,
a time-consuming and difficult task.
|
Crystals are a solid form of a substance in which the component
molecules are present in an ordered array called a lattice.
The basic building block of a crystal is called a unit cell.
Each unit cell contains exactly one unique set of the crystal's
components, the smallest possible set that is fully representative
of the crystal. Crystals of a complex molecule, like a protein,
produce a complex pattern of X-ray diffraction, or scattering
of X-rays. When the crystal is placed in an X-ray beam, all of the
unit cells present the same face to the beam; therefore, many molecules
are in the same orientation with respect to the incoming X-rays.
The X-ray beam enters the crystal and a number of smaller beams
emerge: each one in a different direction, each one with a different
intensity. If an X-ray detector, such as a piece of film, is placed
on the opposite side of the crystal from the X-ray source, each
diffracted ray, called a reflection, will produce a spot
on the film. However, because only a few reflections can be detected
with any one orientation of the crystal, an important component
of any X-ray diffraction instrument is a device for accurately setting
and changing the orientation of the crystal. The set of diffracted,
emerging beams contains information about the underlying crystal
structure.
If we could use light instead of X-rays, we could set up a system
of lenses to recombine the beams emerging from the crystal and thus
bring into focus an enlarged image of the unit cell and the molecules
therein. But the molecules do not diffract visible light, and X-rays,
unlike light, cannot be focused with lenses. However, the scientific
laws that lenses obey are well understood, and it is possible to
calculate the molecular image with a computer. In effect, the computer
mimics the action of a lens.
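The "computational lens" can be illustrated with a deliberately simplified, one-dimensional sketch. It places two hypothetical point atoms in a unit cell, computes the diffracted waves (structure factors) they would produce, and then sums those waves back into a density profile whose peaks fall at the atom positions. The atom positions and weights are invented, and the sketch assumes that both the amplitude and the phase of each wave are known; in a real experiment only intensities are recorded, so the phases must be recovered by other means.

```python
import cmath
import math


def structure_factors(atoms, h_max):
    """Waves scattered by point atoms, given as (fractional position, weight) pairs."""
    return {
        h: sum(weight * cmath.exp(2j * math.pi * h * x) for x, weight in atoms)
        for h in range(-h_max, h_max + 1)
    }


def density(factors, x):
    """Recombine the waves at fractional position x (the job a lens would do)."""
    return sum(f * cmath.exp(-2j * math.pi * h * x) for h, f in factors.items()).real


# Hypothetical one-dimensional unit cell containing a lighter and a heavier atom.
atoms = [(0.25, 6.0), (0.70, 8.0)]
factors = structure_factors(atoms, h_max=20)

grid = [i / 200 for i in range(200)]
profile = [density(factors, x) for x in grid]
peak_x = grid[max(range(len(grid)), key=profile.__getitem__)]
print(peak_x)  # position of the strongest peak in the reconstructed density
```

Including more reflections (a larger `h_max`) sharpens the peaks, which mirrors how higher-resolution diffraction data give a more detailed molecular image.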
The major drawback associated with this technique is that crystallization
of the proteins is a difficult task. Crystals are formed by slowly
precipitating proteins under conditions that maintain their native
conformation or structure. These exact conditions can only be discovered
by repeated trials that entail varying certain experimental conditions,
one at a time. This is a very time consuming and tedious process.
In some cases, the task of crystallizing a protein borders on the
impossible.
|
|
|
Nuclear Magnetic Resonance (NMR) Spectroscopy
Solution NMR is performed
on a solution of macromolecules while the molecules tumble
and vibrate with thermal motion. |
|
The basic phenomenon of NMR spectroscopy
was discovered in 1945. In this technique, a sample is immersed
in a magnetic field and bombarded with radio waves. These radio
waves encourage the nuclei of the molecule to resonate, or
spin. As the positively charged nucleus spins, the moving charge
creates what is called a magnetic moment. The thermal
motion of the molecule—the movement of the molecule associated
with the temperature of the material—further creates a torque,
or twisting force, that makes the magnetic moment "wobble" like
a child's top. When the radio waves hit the spinning nuclei, they
tilt even more, sometimes flipping over. These resonating nuclei
emit a unique signal that is then picked up on a special radio receiver
and translated using a decoder. This decoder is the Fourier
transform, a mathematical operation that converts the signal recorded
over time into a spectrum of frequencies that a scientist can interpret.
By measuring the frequencies at which different nuclei flip, scientists
can determine molecular structure, as well as many other interesting
properties of the molecule.
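As a rough illustration of the decoding step, the sketch below builds a synthetic, decaying signal containing two frequencies and applies a plain discrete Fourier transform to recover them. The sampling rate, the two frequencies, and the decay constant are arbitrary values chosen for the example; real NMR signals are far richer, and production software uses the fast Fourier transform rather than this direct sum.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Direct discrete Fourier transform; returns magnitudes for the positive frequencies."""
    n = len(signal)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(signal)))
        for k in range(n // 2)
    ]


def find_peaks(spectrum, sample_rate, n_samples, min_fraction=0.25):
    """Frequencies (Hz) of local maxima that reach min_fraction of the tallest peak."""
    threshold = min_fraction * max(spectrum)
    peaks = []
    for k in range(1, len(spectrum) - 1):
        if (spectrum[k] >= threshold
                and spectrum[k] > spectrum[k - 1]
                and spectrum[k] > spectrum[k + 1]):
            peaks.append(k * sample_rate / n_samples)
    return peaks


SAMPLE_RATE = 256  # samples per second (assumed)
N_SAMPLES = 256    # one second of signal

# Two hypothetical nuclei resonating at 30 Hz and 75 Hz, both fading over time.
signal = [
    math.exp(-3.0 * t / SAMPLE_RATE)
    * (math.cos(2 * math.pi * 30 * t / SAMPLE_RATE)
       + 0.5 * math.cos(2 * math.pi * 75 * t / SAMPLE_RATE))
    for t in range(N_SAMPLES)
]

peaks = find_peaks(magnitude_spectrum(signal), SAMPLE_RATE, N_SAMPLES)
print(peaks)
```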
In the past 10 years, NMR has proven to be a powerful alternative
to X-ray crystallography for the determination of molecular structure.
NMR has the advantage over crystallographic techniques in that experiments
are performed in solution as opposed to a crystal lattice. However,
the principles that make NMR possible tend to make this technique
very time consuming and limit the application to small- and medium-sized
molecules.
|
|
|
The Advent of Computational Modeling
Researchers have been working for decades to develop procedures
for predicting protein structure that are not so time consuming
and that are not hindered by size and solubility constraints. To
do this, researchers have turned to computers for help in predicting
protein structure from gene sequences, a concept called homology
modeling. The complete genomes of various organisms, including
humans, have now been decoded and allow researchers to approach
this goal in a logical and organized fashion.
Before we go any further, it is important to define some
common terminology used in this field.
|
Common Terminology Used in Homology Modeling
- Folding motifs are independent folding units, or
particular structures, that recur in many molecules.
- Domains are the building blocks of a protein and
are considered elementary units of molecular function.
- Families are groups of proteins that demonstrate
sequence homology or have similar sequences.
- Superfamilies consist of proteins that have similar
folding motifs but do not exhibit sequence similarity.
|
|
|
|
Some Basic Theory
A computer-generated
image of a protein's structure shows the relative locations
of most, if not all, of the protein's thousands of atoms.
The image also reveals the physical, chemical, and electrical
properties of the protein and provides clues about its
role in the body. |
|
It is theorized that proteins that share a similar sequence generally
share the same basic structure. Therefore, by experimentally determining
the structure for one member of a protein family, called a target,
researchers have a model on which to base the structure of other
proteins within that family. Moving a step further, by selecting
a target from each superfamily, researchers can study the universe
of protein folds in a systematic fashion and outline a set of sequences
associated with each folding motif. Many of these sequences may
not demonstrate a resemblance to one another, but their identification
and assignment to a particular fold is essential for predicting
future protein structures using homology modeling.
The scientific basis for these theories is that a strong conservation
of protein three-dimensional shape across large evolutionary distances
(within single species, between species, and in spite of sequence
variation) has been demonstrated again and again. Although most
scientists choose high-priority structures as their targets, this
theory provides the option to choose any one of the proteins within
a family as the target, rather than trying to achieve experimental
results using a protein that is particularly difficult to work with
using crystallographic or NMR techniques.
|
|
|
Steps for Maximizing Results
Specific tasks must be carried out to maximize results
when determining protein structure using homology modeling.
First, protein sequences must be organized in terms of families, preferably
in a searchable database, and a target must be selected. Protein families
can be identified and organized by comparing protein sequences derived
from completely sequenced genomes. Targets may be selected for families
that do not exhibit apparent sequence homology to proteins with
a known three-dimensional structure.
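The organizing step can be sketched in a few lines, under strong simplifying assumptions: the toy sequences are already aligned and of equal length, family membership is decided by percent identity to the first member of a family, and the 30% cutoff is an arbitrary illustrative value. All protein names and sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions holding the same residue (gaps never match)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(a == b and a != "-" for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def group_into_families(sequences, threshold=30.0):
    """Greedy grouping: join the first family whose representative is similar enough."""
    families = []
    for name, seq in sequences.items():
        for family in families:
            representative = sequences[family[0]]
            if percent_identity(seq, representative) >= threshold:
                family.append(name)
                break
        else:
            families.append([name])
    return families


def families_needing_targets(families, solved):
    """Families in which no member has an experimentally determined structure."""
    return [family for family in families if not any(m in solved for m in family)]


sequences = {
    "protA": "MKTAYIAKQR",
    "protB": "MKTAYLAKQR",
    "protC": "GSHWPLLDEV",
    "protD": "GSHWPLIDEV",
}
families = group_into_families(sequences)
print(families)
print(families_needing_targets(families, solved={"protB"}))
```

Here the second family has no member with a known structure, so one of its proteins would be a candidate target.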
Next, researchers must generate a purified protein for analysis
of the chosen target and then experimentally determine the target's
structure by X-ray crystallography, NMR, or both. Target structures
determined experimentally may then be further analyzed to evaluate
their similarity to other known protein structures and to determine
possible evolutionary relationships that are not identifiable from
protein sequence alone. The target structure will also serve as
a detailed model for determining the structure of other proteins
within that family. In favorable cases, just knowing the structure
of a particular protein may also provide considerable insight into
its possible function.
|
|
|
PDB: The Protein Data Bank
The PDB was the
first "bioinformatics" database ever built and is designed to store
complex three-dimensional data. The PDB was originally developed
and housed at Brookhaven National Laboratory but is now managed
and maintained by the Research Collaboratory for Structural Bioinformatics
(RCSB). The PDB is a collection of all publicly available three-dimensional
structures of proteins, nucleic acids, carbohydrates, and a variety
of other complexes experimentally determined by X-ray crystallography
and NMR.
|
|
PDB is supported by funds from the National
Science Foundation, the Department of Energy, and two units
of the National Institutes of Health: the National Institute
of General Medical Sciences and the National Library of Medicine.
|
|
|
|
Protein Modeling at NCBI
|
|
The Molecular Modeling Database
NCBI's Molecular
Modeling DataBase (MMDB), an integral part of our Entrez
information retrieval system, is a compilation of all of the PDB three-dimensional
structures of biomolecules. The difference between the two databases
is that the MMDB records reorganize and validate the information
stored in the database in a way that enables cross-referencing between
the chemistry and the three-dimensional structure of macromolecules.
By integrating chemical, sequence, and structure information, MMDB
is designed to serve as a resource for structure-based homology
modeling and protein structure prediction.
|
|
MMDB records have value-added information
compared to the original PDB entries, including explicit chemical
graph information, uniformly derived secondary structure definitions,
structure domain information, literature citation matching,
and molecule-based assignment of taxonomy to each biologically
derived protein or nucleic acid chain. |
|
|
|
NCBI has also developed a three-dimensional structure viewer, called
Cn3D,
for easy interactive visualization of molecular structures from
Entrez. Cn3D serves as a visualization tool for sequences and sequence
alignments. What sets Cn3D apart from other software is its ability
to correlate structure and sequence information. For example, using
Cn3D, a scientist can quickly locate the residues in a crystal structure
that correspond to known disease mutations or conserved active site
residues from a family of sequence homologs, or sequences
that share a common ancestor. Cn3D displays structure-structure
alignments along with the corresponding structure-based sequence
alignments to emphasize those regions within a group of
related proteins that are most conserved in structure and sequence.
Cn3D also features custom labeling options, high-quality graphics,
and a variety of file exports that together make Cn3D a powerful
tool for literature annotation.
|
|
|
PDBeast: Taxonomy in MMDB
Taxonomy is the scientific discipline that seeks to catalog
and reconstruct the evolutionary history of life on earth. NCBI's
Structure Group, in collaboration with NCBI's taxonomists, has undertaken
taxonomy annotation for the structure data stored in MMDB.
A semi-automated approach has been implemented in which a human
expert checks, corrects, and validates automatic taxonomic assignments.
The PDBeast
software tool was developed by NCBI for this purpose. It pulls text-descriptions
of "Source Organisms" from either the original PDB-Entries or user-specified
information and looks for matches in the NCBI Taxonomy database
to record taxonomy assignments.
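The semi-automated workflow described above might look roughly like the following sketch. It is not the PDBeast implementation: the lookup table is a tiny hand-made stand-in for the NCBI Taxonomy database, the entry identifiers are hypothetical, and the matching is a simple normalized string comparison. Anything that fails to match is set aside for a human expert.

```python
# Tiny stand-in for the NCBI Taxonomy database: name -> taxonomy ID.
TAXONOMY = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "escherichia coli": 562,
    "saccharomyces cerevisiae": 4932,
}


def normalize(text):
    """Lower-case the text, turn underscores into spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def assign_taxonomy(source_descriptions):
    """Split entries into automatic assignments and entries needing human review."""
    assigned, needs_review = {}, []
    for entry_id, description in source_descriptions.items():
        tax_id = TAXONOMY.get(normalize(description))
        if tax_id is None:
            needs_review.append(entry_id)
        else:
            assigned[entry_id] = tax_id
    return assigned, needs_review


# Hypothetical "Source Organism" text taken from structure records.
sources = {
    "entry1": "HOMO  SAPIENS",
    "entry2": "Escherichia_coli",
    "entry3": "synthetic construct, see remarks",
}
assigned, needs_review = assign_taxonomy(sources)
print(assigned)
print(needs_review)
```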
|
|
The Role of Taxonomy
- Taxonomy provides a vivid picture of the existing
organic diversity of the earth.
- Taxonomy provides much of the information permitting a
reconstruction of the phylogeny of life.
- Taxonomy reveals numerous, interesting evolutionary phenomena.
- Taxonomy supplies classifications that are of great explanatory
value in most branches of biology.
|
|
|
|
COGs: Phylogenetic Classification of Proteins
The database of Clusters
of Orthologous Groups of proteins (COGs) represents an attempt
at the phylogenetic classification of proteins (a scheme
that indicates the evolutionary relationships between organisms) from
complete genomes. Each COG includes proteins that are thought to
be orthologous. Orthologs are genes in different species derived from
a common ancestor and carried on through evolution. COGs may be used to detect
similarities and differences between species for identifying protein families
and predicting new protein functions and to point to potential drug targets in
disease-causing species. The database is accompanied by the COGnitor
program, which assigns new proteins, typically from newly sequenced
genomes, to pre-existing COGs. A Web page containing additional
structural and functional information is now associated with each
COG. These hyperlinked information pages include: systematic classification
of the COG members under the different classification systems; indications
of which COG members (if any) have been characterized genetically
and biochemically; information on the domain architecture of the
proteins constituting the COG and the three-dimensional structure
of the domains if known or predictable; a succinct summary of the
common structural and functional features of the COG members as
well as peculiarities of individual members; and key references.
|
|
|
Detecting New Sequence Similarities: BLAST against MMDB
The journal article
describing the original algorithm used in BLAST has since
become one of the most frequently cited papers of the decade,
with over 10,000 citations. |
|
Comparison, whether of structural features or protein sequences,
lies at the heart of biology. The introduction of BLAST,
or the Basic Local Alignment Search Tool, in 1990 made it easier
to rapidly scan huge databases for overt homologies, or sequence
similarities, and to statistically evaluate the resulting matches.
BLAST works by comparing a user's unknown sequence against the database
of all known sequences to determine likely matches. Sequence similarities
found by BLAST have been critical in several gene discoveries. Hundreds
of major sequencing centers and research institutions around the
country use this software to transmit a query sequence from their
local computer to a BLAST server at the NCBI via the Internet. In
a matter of seconds, the BLAST server compares the user's sequence
with up to a million known sequences and determines the closest
matches.
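The idea of scoring a query against every database sequence and ranking the matches can be sketched as follows. This is not the BLAST algorithm: BLAST uses fast heuristics to approximate local alignment, whereas the sketch computes an exact local alignment score with the slower Smith-Waterman dynamic programming method. The scoring values (match +2, mismatch -1, gap -1) and the two database entries are assumptions made for illustration.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two sequences."""
    previous = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        current = [0]
        for j, s in enumerate(subject, start=1):
            diagonal = previous[j - 1] + (match if q == s else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def rank_database(query, database):
    """Score the query against every entry and list the best matches first."""
    scored = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical two-entry database.
database = {"hit": "TTACGTTT", "miss": "WWWWWWWW"}
ranking = rank_database("ACGT", database)
print(ranking)
```

A real search would also attach a statistical significance estimate to each score, which is the part of BLAST that lets users judge whether a match is likely to be meaningful.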
Not all significant homologies are readily and easily detected,
however. Some of the most interesting are subtle similarities that
do not always rise to statistical significance during a standard
BLAST search. Therefore, NCBI has extended the statistical methodology
used in the original BLAST to address the problem of detecting weak,
yet significant, sequence similarities. PSI-BLAST,
or Position-Specific Iterated BLAST, searches sequence databases
with a profile constructed using BLAST alignments, from which it
then constructs what is called a position-specific score matrix.
For protein analysis, the new Pattern Hit Initiated BLAST,
or PHI-BLAST,
serves to complement the profile-based searching that was previously
introduced with PSI-BLAST. PHI-BLAST further incorporates hypotheses
as to the biological function of a query sequence and restricts
the analysis to a set of protein sequences that is already known
to contain a specific pattern or motif.
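A position-specific score matrix can be sketched from a small alignment as shown below. This is a simplified illustration rather than PSI-BLAST's actual procedure: it assumes a uniform background frequency for all 20 amino acids, a single flat pseudocount, and no sequence weighting. The three-column alignment is invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds score for every residue at every column of an ungapped alignment."""
    n_seqs = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for residue in AMINO_ACIDS:
            observed = column.count(residue) + pseudocount * background
            frequency = observed / (n_seqs + pseudocount)
            scores[residue] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores for a sequence as long as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Hypothetical alignment of four family members.
alignment = ["GKS", "GKT", "GRS", "GKS"]
pssm = build_pssm(alignment)
print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2))
print(score_sequence(pssm, "GKS") > score_sequence(pssm, "AWW"))
```

Because each column has its own scores, a residue that is strictly conserved at one position is rewarded there and penalized elsewhere, which is what lets a profile detect weaker similarities than a single query sequence can.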
|
|
BLAST
now comes in several varieties in addition to those described
above. Specialized BLASTs are also available for human, microbial,
and other genomes, as well as for vector contamination, immunoglobulins,
and tentative human consensus sequences. |
|
|
|
Structure Similarity Searching Using VAST
As just noted, a sequence-sequence similarity program provides
an alignment of two sets of sequences. A structure-structure similarity
program provides a three-dimensional structure superposition. Structure
similarity search services are based on the premise that some measure
can be computed between two structures to assess their similarities,
in much the same way that a BLAST alignment is scored.
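One widely used measure of this kind is the root-mean-square deviation (RMSD) between matched atoms, sketched below. It is offered as a generic example and is not the statistic that any particular NCBI service reports. The sketch assumes the two structures have already been superimposed and that atoms are paired in order; the coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Hypothetical backbone atom positions, in angstroms.
structure_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
# The same shape displaced by 3 along x and 4 along y.
structure_b = [(x + 3.0, y + 4.0, z) for x, y, z in structure_a]

print(rmsd(structure_a, structure_a))
print(rmsd(structure_a, structure_b))
```

The second value is large only because the copy was never superimposed on the original, which shows why finding the best superposition comes before measuring similarity.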
|
|
The detection of structural
similarity in the absence of obvious sequence similarity
is a powerful tool to study remote homologies and protein
evolution. |
|
VAST,
or the Vector Alignment Search Tool,
is a computer algorithm developed at NCBI for use in identifying
similar three-dimensional protein structures. VAST is capable of
detecting structural similarities between proteins stored in MMDB,
even when no sequence similarity is detected.
VAST
Search is NCBI's structure-structure similarity search service
that compares three-dimensional coordinates of newly determined
protein structures to those in the MMDB or PDB databases. VAST Search
creates a list of structure neighbors, or related structures, that
a user can then browse interactively. VAST Search will retrieve
almost all structures with an identical three-dimensional fold,
although it may occasionally miss a few structures or report chance
similarities.
|
|
|
The Conserved Domain Database
The
Conserved Domain Database (CDD) is a collection of sequence
alignments and profiles representing protein domains conserved in
molecular evolution. It includes domains from SMART and Pfam, two
popular Web-based tools for studying sequence domains, as well as
domains contributed by NCBI researchers. CD-Search,
another NCBI search service, can be used to identify
conserved domains in a protein query sequence. CD-Search uses RPS-BLAST
to compare a query sequence against specific matrices that have
been prepared from conserved domain alignments present in CDD. Alignments
are also mapped to known three-dimensional structures and can be
displayed using Cn3D (see above).
|
|
|
Conserved Domain Architecture Retrieval Tool
NCBI's Conserved
Domain Architecture Retrieval Tool (CDART) displays the functional domains
that make up a protein and lists other proteins with similar domain
architectures. CDART determines the domain
architecture of a query protein sequence by comparing it to the CDD, a
database of conserved domain alignments, using RPS-BLAST.
CDART then compares the protein's
domain architecture to that of other proteins in NCBI's non-redundant
sequence database. Related sequences are identified as those proteins
that share one or more similar domains. CDART displays these sequences using a
graphical summary that depicts the types and locations of domains identified
within each sequence. Links to the individual sequences as well as to further information
on their domain architectures are also provided. Because protein domains
may be considered elementary units of molecular function and proteins
related by domain architecture may play similar roles in cellular processes,
CDART serves as a useful tool in comparative sequence analysis.
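The comparison of domain architectures can be pictured with the sketch below, which treats each protein as an ordered list of domain names and reports every database protein that shares at least one domain with the query. It is a toy model and not CDART's implementation: the protein names and architectures are hypothetical, and ranking by the number of shared domains is an assumption made for the example.

```python
def shared_domains(query_architecture, other_architecture):
    """Domains present in both architectures, ignoring order and repeats."""
    return set(query_architecture) & set(other_architecture)


def find_related(query_architecture, database):
    """List proteins sharing one or more domains with the query, most shared first."""
    hits = []
    for name, architecture in database.items():
        common = shared_domains(query_architecture, architecture)
        if common:
            hits.append((len(common), name, sorted(common)))
    # Sort by descending number of shared domains, then by protein name.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits


# Hypothetical query and database, each protein written as its domains in order.
query = ["SH3", "SH2", "kinase"]
database = {
    "protein_1": ["SH2", "kinase"],
    "protein_2": ["SH3"],
    "protein_3": ["PDZ"],
}
related = find_related(query, database)
print(related)
```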
|
|
RPS-BLAST is a "reverse" version of PSI-BLAST, which
is described above. Both RPS-BLAST and PSI-BLAST use similar
methods to derive conserved features of a protein family. However,
RPS-BLAST compares a query sequence against a database of profiles
prepared from ready-made alignments, whereas PSI-BLAST builds
alignments starting from a single protein sequence. The programs
also differ in purpose: RPS-BLAST is used to identify conserved
domains in a query sequence, whereas PSI-BLAST is used to identify
other members of the protein family to which a query sequence
belongs.
|
|
|
|
Application to Biomedicine
Although the information derived from modeling studies is primarily
about molecular function, protein structure data also provide a
wealth of information on the mechanisms underlying that function and
on the evolutionary history of, and relationships between, macromolecules.
NCBI's goals in adding structure data to its Web site are to make
this information easily accessible to the biomedical community worldwide
and to facilitate comparative analysis involving three-dimensional
structure.
|
|
Revised: March 29, 2004.
|
|