Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
SYSTEMATICS AND MOLECULAR PHYLOGENETICS
Classifying Organisms
Have you ever noticed that when you see an insect or a bird, there
is real satisfaction in giving it a name, and an uncomfortable uncertainty
when you can't? Along these same lines, consider the bewildering
number and variety of organisms that live, or have lived, on this
earth. If we did not know what to call these organisms, how could
we communicate ideas about them, let alone the history of life? Thanks
to taxonomy, the field of science that classifies life into groups,
we can discuss just about any organism, from bacteria to man.
Carolus Linnaeus pioneered the grouping of organisms based on scientific
names using Latin. His system of giving an organism a scientific
name of two parts, sometimes more, is called binomial nomenclature,
or "two-word naming". His scheme was based on physical similarities
and differences, referred to as characters. Today, taxonomic
classification is much more complex and takes into account cellular
types and organization, biochemical similarities, and genetic similarities.
Taxonomy is but one aspect of a much larger field called systematics.
Taxonomic Classification
Taxonomic ranks approximate evolutionary distances
among groups of organisms. For example, species belonging to two
different superkingdoms are most distantly related (their common
ancestor diverged in the distant past), with progressively more
exclusive groups indicated by phylum, class and so on, down to
infraspecific ranks, or ranks occurring within a species.
Infraspecific ranks, such as subspecies, varietas, and forma,
denote the closest evolutionary relationship. See the
simplified classification of humans below.
Taxonomists, scientists who classify
living organisms, define a species as any group of closely related
organisms that can produce fertile offspring. Two organisms are
more closely "related" as they approach the level of species,
that is, they have more genes in common. The level of species
can be further divided into smaller segments. A population
is the smallest unit of a species and is made up of organisms of
the same species. Sometimes, the populations of a species differ
gradually in physical form along an environmental or geographic
gradient. Such a gradient is called a cline and can make members
of the same species look different.
Taxonomic Classification of Man (Homo sapiens)
Superkingdom: Eukaryota
Kingdom: Metazoa
Phylum: Chordata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
What Is Phylogenetic Systematics?
Carolus Linnaeus is also credited with pioneering systematics, the
field of science dealing with the diversity of life and the relationship
between life's components. Systematics reaches beyond taxonomy to
elucidate new methods and theories that can be used to classify
species based on similarity of traits and possible mechanisms of
evolution, a change in the gene pool of a population over
time.
Phylogenetic systematics is the field of biology that
identifies and seeks to understand the evolutionary
relationships among the many different kinds of life on earth,
both living (extant) and dead (extinct). Evolutionary
theory states that similarity among individuals or species is attributable
to common descent, or inheritance from a common ancestor. Thus,
the relationships established by phylogenetic systematics often
describe a species' evolutionary history and, hence, its phylogeny,
the historical relationships among lineages or organisms or their parts,
such as their genes.
Charles Darwin was the first to recognize that the
systematic hierarchy represented a rough approximation of evolutionary
history. However, it was not until the 1950s that the German entomologist
Willi Hennig proposed that systematics should reflect the known evolutionary
history of lineages as closely as possible, an approach he called
phylogenetic systematics. The followers of Hennig were disparagingly
referred to as "cladists" by his opponents, because of their emphasis on
recognizing only monophyletic groups (an ancestor plus all of its
descendants), or clades. However, the cladists quickly adopted the term as
a helpful label, and nowadays cladistic approaches to systematics are used
routinely.
Understanding the Evolutionary Process
Genetic Variation: Changes in a Gene Pool
Evolution is not always discrete with clearly defined boundaries
that pinpoint the origin of a new species, nor is it a steady continuum.
Evolution requires genetic variation which results
from changes within a gene pool, the genetic make-up of a
specific population. A gene pool is the combination of all the alleles
—alternative forms of a genetic locus—for all traits that population
may exhibit. Changes in a gene pool can result from mutation—variation
within a particular gene—or from changes in gene frequency—the
proportion of an allele in a given population.
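The idea of an allele frequency can be made concrete with a few lines of Python. This is only a minimal sketch: the function name `allele_frequencies`, the two-letter genotype strings, and the ten-individual population are all hypothetical and chosen for illustration.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return the frequency of each allele in a list of diploid genotypes.

    Each genotype is a two-character string such as "Aa"; each character
    stands for one allele copy at a single locus.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Hypothetical population of 10 individuals scored at one locus.
population = ["AA"] * 3 + ["Aa"] * 4 + ["aa"] * 3
print(allele_frequencies(population))  # {'A': 0.5, 'a': 0.5}
```

A change in these proportions from one generation to the next is what the text describes as a change in the gene pool.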
How Does Genetic Variation Occur?
Every organism possesses a genome that contains all of the biological
information needed to construct and maintain a living example of
that organism. The biological information contained in a genome
is encoded in the nucleotide sequence of its DNA or RNA molecules
and is divided into discrete units called genes. The information
stored in a gene is read by proteins, which attach to the genome
and initiate a series of reactions called gene expression.
Every time a cell divides, it must make a complete copy of its
genome, a process called DNA replication. DNA replication
must be extremely accurate to avoid introducing mutations,
or changes in the nucleotide sequence of a short region of the genome.
Inevitably, some mutations do occur, usually in one of two ways:
either from errors in DNA replication or from damaging effects of
chemical agents or radiation that react with DNA and change the
structure of individual nucleotides. Many of these mutations result
in a change that has no effect on the functioning of the genome,
referred to as silent mutations. Silent mutations include
virtually all changes that happen in the non-coding components of
genes and gene-related sequences.
Mutations in the coding regions of genes are much more important.
Here we must consider the importance of the same mutation in a somatic
cell compared with a germ line cell. A somatic cell is
any cell of an organism other than a reproductive cell, such as a
sperm or egg cell. A germ cell line is any line of cells that gives
rise to gametes and is continuous through the generations. Because
a somatic cell does not pass on copies of its genome to the next
generation, a somatic cell mutation is important only for the organism
in which it occurs and has no potential evolutionary impact. In
fact, most somatic mutations have no significant effect because
there are many other identical cells in the same tissue.
On the other hand, mutations in germ cells can be transmitted to
the next generation and will then be present in all of the cells of
an individual who inherits that mutation. Even so, mutations
within germ line cells may not change the phenotype of the organism
in any significant way. Those mutations that do have an evolutionary
effect can be divided into two categories, loss-of-function mutations
and gain-of-function mutations. A loss-of-function mutation
results in reduced or abolished protein function. Gain-of-function
mutations, which are much less common, confer an abnormal activity
on a protein.
The randomness with which mutations can
occur is an important concept in biology and is a requirement
of the Darwinian view of evolution, which holds that changes in
the characteristics of an organism occur by chance and are not
influenced by the environment in which the organism lives. Beneficial
changes within an organism are then positively selected for, whereas
harmful changes are negatively selected.
The Drivers of Evolution: Selection, Drift, and Founder Effects
As discussed above, new alleles appear in a population because
of mutations that occur in the reproductive cells of an organism.
This means that many genes are polymorphic, that is, two
or more alleles for that gene are present in a population. Each
of these alleles has its own allele or gene frequency, a
measure of how common an allele is in a population. Allele frequencies
vary over time because of two processes, natural selection and random
drift.
Natural Selection
Natural selection is the process whereby one genotype, the
hereditary constitution of an individual, leaves more offspring
than another genotype because of superior life attributes, termed
fitness. Natural selection acts on genetic variation by conferring
a survival advantage to those individuals harboring a particular
mutation that tends to favor a changing environmental condition.
These individuals then reproduce and pass on this "new" gene, altering
their gene pool. Natural selection, therefore, decreases the frequencies
of alleles that reduce the fitness of an organism and increases the
frequency of alleles that improve fitness.
"Natural Selection" is the principle
by which each slight variation, if useful, is preserved.
Charles Darwin
It is important to point out that natural selection does not always
represent progress, only adaptation to changing surroundings. Evolution
attributable to natural selection is devoid of intent: an organism
does not evolve to better itself, only to adapt.
Because environments are always changing, what was once an advantageous
mutation can often become a liability further down the evolutionary line.
Random Drift
The term random drift actually encompasses a number of
distinct processes, sometimes referred to as outcomes. They include
indiscriminate parent sampling, the founder effect,
and fluctuations in the rate of evolutionary processes such as selection,
migration, and mutation. Parent sampling is the process of
determining which organisms of one generation will be the parents
of the next generation. Parent sampling may be discriminate,
that is, with regard to fitness differences, or indiscriminate, without
regard to fitness differences. Discriminate parent sampling is generally
considered natural selection, whereas indiscriminate parent sampling
is considered random drift.
What Is Sampling?
Suppose a population of red and brown
squirrels share a habitat with a color blind predator.
Because the predator is color blind, it cannot tell the two kinds
apart. Even so, the brown squirrels happen to die in greater numbers
than the red squirrels, simply because they are unlucky enough to
come into contact with the predator more often. As a result,
the frequency of brown squirrels in the next generation is
reduced. More red squirrels survive to reproduce, or are sampled,
but it is without regard to any differences in fitness between
the two groups. The physical differences of the groups do
not play a causal role in the differences in reproductive
success.
Now, let's say that the predator is not
color blind and can now see the red squirrels better than the
brown squirrels, resulting in a better survival rate for the
brown squirrels. This would be a case of discriminate parent
sampling, or natural selection.
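The squirrel example can be sketched as a small simulation. The code below is an illustration under stated assumptions, not a model taken from the text: it assumes a fixed population size, non-overlapping generations (a Wright-Fisher-style sampling step), and hypothetical fitness values. With equal fitness values the sampling is indiscriminate (drift only); with unequal values it is discriminate (selection acting together with drift).

```python
import random

def next_generation(freq_red, pop_size, rng, fitness_red=1.0, fitness_brown=1.0):
    """Sample the parents of the next generation and return the new red frequency."""
    weighted_red = freq_red * fitness_red
    weighted_brown = (1.0 - freq_red) * fitness_brown
    # Chance that any one sampled parent is red, given the fitness weights.
    p_red = weighted_red / (weighted_red + weighted_brown)
    red_parents = sum(1 for _ in range(pop_size) if rng.random() < p_red)
    return red_parents / pop_size

def simulate(freq_red, pop_size, generations, seed=None, **fitness):
    """Return the red-squirrel frequency in every generation, starting value included."""
    rng = random.Random(seed)
    history = [freq_red]
    for _ in range(generations):
        freq_red = next_generation(freq_red, pop_size, rng, **fitness)
        history.append(freq_red)
    return history

# Indiscriminate sampling: the color-blind predator (equal fitness).
drift_only = simulate(0.5, pop_size=50, generations=20, seed=1)
# Discriminate sampling: the predator sees red squirrels better (hypothetical values).
with_selection = simulate(0.5, pop_size=50, generations=20, seed=1,
                          fitness_red=0.8, fitness_brown=1.0)
print(drift_only[-1], with_selection[-1])
```

In both runs the frequency wanders from generation to generation; only in the second is the wandering biased by a fitness difference.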
Founder Effect
Another important cause of genetic drift is the founder effect, the
difference between the gene pool of a population as a whole and
that of a newly isolated population of the same species. The founder
effect occurs when populations are started from a small number of
pioneer individuals of one original population. Because of small sample
size, the new population could have a much different genetic ratio
than the original population. An example of the founder effect would
be when a plant population results from a single seed.
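The effect of a small founding sample can be illustrated with a short sketch. The gene pool below (70% allele A, 30% allele a), the number of founder allele copies, and the function names are hypothetical; the seed is fixed only so that a run is repeatable.

```python
import random

def allele_frequency(pool, allele):
    """Fraction of allele copies in the pool that are the given allele."""
    return pool.count(allele) / len(pool)

def found_population(source_pool, n_copies, rng):
    """Draw the allele copies carried by a small group of founders."""
    return rng.sample(source_pool, n_copies)

rng = random.Random(42)                  # fixed seed, illustration only
source = ["A"] * 700 + ["a"] * 300       # hypothetical source gene pool
founders = found_population(source, 6, rng)
print(allele_frequency(source, "A"), allele_frequency(founders, "A"))
```

With only six allele copies, the founders' frequency can take just seven possible values (0, 1/6, ..., 1), so it will rarely match the 0.7 of the source population.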
Thus far, we have discussed natural selection and random drift as
events that occur in isolation from one another. However, in most
populations, the two processes will be occurring at the same time.
Furthermore, there is great debate over whether, in particular instances
and in general, natural selection is more prevalent than random
drift.
Phylogenetic Trees: Presenting Evolutionary Relationships
Systematics describes the pattern of relationships among taxa and
is intended to help us understand the history of all life. But history
is not something we can see—it has happened once and leaves only
clues as to the actual events. Scientists use these clues to build
hypotheses, or models, of life's history. In phylogenetic studies,
the most convenient way of visually presenting evolutionary relationships
among a group of organisms is through illustrations called phylogenetic
trees.
- Node: represents a taxonomic unit. This can be either
an existing species or an ancestor.
- Branch: defines the relationship between the taxa in
terms of descent and ancestry.
- Topology: the branching patterns of the tree.
- Branch length: represents the number of changes that
have occurred in the branch.
- Root: the common ancestor of all taxa.
- Distance scale: scale that represents the number of differences
between organisms or sequences.
- Clade: a group of two or more taxa or DNA sequences
that includes both their common ancestor and all of their descendants.
- Operational Taxonomic Unit (OTU): taxonomic level of sampling
selected by the user to be used in a study, such as individuals, populations,
species, genera, or bacterial strains.
A phylogenetic tree is composed of nodes, each representing
a taxonomic unit (species, populations, individuals), and branches,
which define the relationship between the taxonomic units in terms
of descent and ancestry. Only one branch can connect any two adjacent
nodes. The branching pattern of the tree is called the topology,
and the branch length usually represents the number of changes that
have occurred in the branch. This is called a scaled branch.
Scaled trees are often calibrated to represent the passage of time.
Such trees have a theoretical basis in the particular gene or genes
under analysis. Branches can also be unscaled, which means that the
branch length is not proportional to the number of changes that
have occurred, although the actual number may be indicated numerically
somewhere on the branch. Phylogenetic trees may also be either rooted
or unrooted. In rooted trees, there is a particular node,
called the root, representing a common ancestor, from which
a unique path leads to any other node. An unrooted tree only specifies
the relationship among species, without identifying a common ancestor,
or evolutionary path.
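An unrooted tree can be rooted on any one of its branches, so counting branches gives the number of possible root positions. The sketch below assumes a tree stored as a plain list of (node, node) pairs; the node names and the helper `branch_counts` are hypothetical.

```python
def branch_counts(edges):
    """Count external and internal branches of an unrooted tree.

    `edges` is a list of (node, node) pairs. A branch is external when one
    of its two ends is a tip, that is, a node touched by only one branch.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    external = sum(1 for a, b in edges if degree[a] == 1 or degree[b] == 1)
    return {"external": external, "internal": len(edges) - external}

# Hypothetical unrooted tree: four taxa (A to D) joined through internal nodes e and f.
four_taxa = [("A", "e"), ("B", "e"), ("e", "f"), ("C", "f"), ("D", "f")]
counts = branch_counts(four_taxa)
print(counts)                # {'external': 4, 'internal': 1}
print(sum(counts.values()))  # 5 branches, hence 5 possible root positions
```

In general, a fully bifurcating unrooted tree with n taxa has 2n - 3 branches.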
Figure 1. Possible ways of drawing a tree.
Phylogenetic trees, a convenient way of representing
evolutionary relationships among a group of organisms, can be drawn
in various ways. Branches on phylogenetic trees may be scaled (top
panel) representing the amount of evolutionary change, time, or both, when
there is a molecular clock, or they may be unscaled (middle panel)
and have no direct correspondence with either time or amount of evolutionary
change. Phylogenetic trees may be rooted (top and middle panels) or
unrooted (bottom panels). In the case of unrooted trees, branching
relationships between taxa are specified by the way they are connected
to each other, but the position of the common ancestor is not. For
example, on an unrooted tree with four species, there are five branches
(four external, one internal) on which the tree can be rooted. Rooting
on each of the five branches has different implications for evolutionary
relationships.
Text and figures adapted with permission
from A. Vierstraete, University of Ghent, Belgium.
Methods of Phylogenetic Analysis
Two major groups of analyses exist to examine phylogenetic relationships:
phenetic methods and cladistic methods. It is important
to note that phenetics and cladistics have had an uneasy relationship
over the last 40 years or so. Most of today's evolutionary biologists
favor cladistics, although a strictly cladistic approach may
yield counterintuitive results.
Phenetic Method of Analysis
Phenetics, also known as numerical taxonomy, involves the
use of various measures of overall similarity for the ranking of
species. There is no restriction on the number or type of characters
(data) that can be used, although all data must be first converted
to a numerical value, without any character "weighting". Each organism
is then compared with every other for all characters measured, and
the number of similarities (or differences) is calculated. The organisms
are then clustered in such a way that the most similar are grouped
close together and the more different ones are linked more distantly.
The taxonomic clusters, called phenograms, that result from
such an analysis do not necessarily reflect genetic similarity or
evolutionary relatedness. The lack of evolutionary significance
in phenetics has meant that this system has had little impact on
animal classification, and as a consequence, interest in and use
of phenetics has been declining in recent years.
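The phenetic procedure described above (score characters numerically, compare every organism with every other, then cluster by overall similarity) can be sketched as follows. The clustering rule used here, average linkage, is an assumption made for illustration; the text does not name a specific rule. The character scores are invented.

```python
from itertools import combinations

def mismatch_distance(a, b):
    """Fraction of unweighted characters that differ between two organisms."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster(characters):
    """Join the two most similar clusters repeatedly; return nested tuples."""
    clusters = {name: [name] for name in characters}   # label -> member organisms

    def average_distance(c1, c2):
        pairs = [(m1, m2) for m1 in clusters[c1] for m2 in clusters[c2]]
        total = sum(mismatch_distance(characters[m1], characters[m2])
                    for m1, m2 in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))

# Hypothetical 0/1 scores for eight characters (1 = present, 0 = absent).
characters = {
    "sparrow": [1, 1, 1, 1, 0, 0, 0, 0],
    "robin":   [1, 1, 1, 1, 0, 0, 0, 1],
    "bat":     [1, 1, 0, 0, 1, 1, 0, 0],
    "mouse":   [0, 0, 0, 0, 1, 1, 1, 0],
}
print(cluster(characters))  # (('sparrow', 'robin'), ('bat', 'mouse'))
```

The nested tuples play the role of a phenogram: they record only which organisms are most alike on the chosen characters, not how the organisms are related by descent.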
Cladistic Method of Analysis
An alternative approach to diagramming relationships between taxa
is called cladistics. The basic assumption behind cladistics
is that members of a group share a common evolutionary history.
Thus, they are more closely related to one another than they are
to other groups of organisms. Related groups of organisms are recognized
because they share a set of unique features (apomorphies)
that were not present in distant ancestors but which are shared
by most or all of the organisms within the group. These shared derived
characteristics are called synapomorphies. Therefore, in
contrast to phenetics, cladistic groupings do not depend on whether
organisms share physical traits but depend on their evolutionary relationships.
Indeed, in cladistic analyses two organisms may share numerous characteristics
but still be considered members of different groups.
Cladistic analysis entails a number of assumptions. For example, species
are assumed to arise primarily by bifurcation, or separation, of the
ancestral lineage; species are often considered to become extinct upon
hybridization (crossbreeding); and hybridization is assumed to be rare or
absent. In addition, cladistic groupings must possess the following characteristics:
all species in a grouping must share a common ancestor and all
species derived from a common ancestor must be included in the taxon.
The application of these requirements results in the following terms
being used to describe the different ways in which groupings can
be made:
- A monophyletic grouping is one in which all species share
a common ancestor, and all species derived from that common ancestor
are included. This is the only form of grouping accepted as valid
by cladists.
- A paraphyletic grouping is one in which all species share
a common ancestor, but not all species derived from that common
ancestor are included.
- A polyphyletic grouping is one in which species that
do not share an immediate common ancestor are lumped together,
while excluding other members that would link them.
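The definition of a monophyletic grouping can be checked mechanically on a rooted tree. The sketch below assumes a tree written as nested tuples with taxon names at the tips; the function names are hypothetical, and the example tree uses the generally accepted branching order of the great apes. The function tests monophyly only; it does not distinguish paraphyletic from polyphyletic groupings.

```python
def leaves(tree):
    """Return the set of taxon names at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def is_monophyletic(tree, group):
    """True if some node's descendants are exactly the taxa in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

apes = ((("human", "chimpanzee"), "gorilla"), "orangutan")
print(is_monophyletic(apes, {"human", "chimpanzee"}))               # True
print(is_monophyletic(apes, {"chimpanzee", "gorilla", "orangutan"}))  # False
```

The second grouping fails because it leaves out humans, who descend from the same common ancestor as the three included taxa; on this tree it is a paraphyletic grouping.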
The Origins of Molecular Phylogenetics
Macromolecular data, meaning gene (DNA) and protein sequences,
are accumulating at an increasing rate because of recent advances in
molecular biology. For the evolutionary biologist, the rapid accumulation
of sequence data from whole genomes has been a major advance, because
the very nature of DNA allows it to be used as a "document" of evolutionary
history. Comparisons of the DNA sequences of various genes between
different organisms can tell a scientist a lot about the relationships
of organisms that cannot otherwise be inferred from morphology,
or an organism's outer form and inner structure. Because genomes evolve
by the gradual accumulation of mutations, the amount of nucleotide
sequence difference between a pair of genomes from different organisms
should indicate how recently those two genomes shared a common ancestor.
Two genomes that diverged in the recent past should have fewer differences
than two genomes whose common ancestor is more ancient. Therefore,
by comparing different genomes with each other, it should be possible
to derive evolutionary relationships between them, the major objective
of molecular phylogenetics.
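The simplest measure of "amount of nucleotide sequence difference" is the proportion of aligned sites at which two sequences differ. The sketch below assumes the sequences are already aligned to the same length and that gaps are written as "-"; the three short sequences are invented for illustration.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ (gapped sites skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in compared) / len(compared)

# Hypothetical aligned sequences from three organisms.
x = "ACGTACGTAC"
y = "ACGTACGTAT"
z = "ACGAACCTTT"
print(p_distance(x, y))  # 0.1
print(p_distance(x, z))  # 0.4
```

Under the reasoning in the paragraph above, x and y would be inferred to share a more recent common ancestor than x and z. This raw proportion underestimates the true number of changes when sequences are very divergent, because repeated changes at the same site are counted at most once.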
Molecular phylogenetics attempts to determine the rates
and patterns of change occurring in DNA and proteins and to reconstruct
the evolutionary history of genes and organisms. Two general approaches
may be taken to obtain this information. In the first approach,
scientists use DNA to study the evolution of an organism. In the
second approach, different organisms are used to study the evolution
of DNA. Whatever the approach, the general goal is to infer process
from pattern: the processes of organismal evolution deduced
from patterns of DNA variation and processes of molecular evolution
inferred from the patterns of variations in the DNA itself.
Molecular Phylogenetic Analysis: Fundamental Elements
Nucleotide and protein
sequences can also be used to generate trees. DNA, RNA,
and protein sequences can be considered as phenotypic
traits. The sequences depict the relationship of genes
and usually of the organism in which the genes are found.
As we just discussed, macromolecules, especially gene and protein
sequences, have surpassed morphological and other organismal characters
as the most popular forms of data for phylogenetic analyses. Therefore,
this next section will concentrate only on molecular data.
It is important to point out that a single, all-purpose
recipe does not exist for phylogenetic analysis of molecular
data. Although numerous algorithms, procedures, and computer programs
have been developed, their reliability and practicality are, in
all cases, dependent upon the size and structure of the dataset
under analysis. The merits and shortfalls of these various methods
are subject to much scientific debate, because the danger of generating
incorrect results is greater in computational molecular phylogenetics
than in many other fields of science. Occasionally, the limiting
factor in such analyses is not so much the computational method
used, but the users' understanding of what the method is actually
doing with the data. Therefore, the goal of this section is to demonstrate
to the reader that practical analysis should be thought of both
as a search for a correct model (analysis) as well as a search for
the correct tree (outcome).
Phylogenetic tree-building methods presume particular evolutionary
models. For any given set of data, these models may be violated
because of various occurrences, such as the transfer of genetic
material between organisms. Therefore, when interpreting a given
analysis, a person should always consider the model used and entertain
possible explanations for the results obtained. For example, models
used in molecular phylogenetic analysis methods make "default" assumptions,
including:
- The sequence is correct and originates from the specified source.
- The sequences are homologous—all descended in some way
from a shared ancestral sequence.
- Each position (column) in a sequence alignment is homologous across
all of the sequences in that alignment.
- Each of the multiple sequences included in a common analysis
has a common phylogenetic history with the other sequences.
- The sampling of taxa is adequate to resolve the problem under
study.
- Sequence variation among the samples is representative of the
broader group.
- The sequence variability in the sample contains phylogenetic
signal adequate to resolve the problem under study.
The Four Steps of Phylogenetic Analysis
A straightforward phylogenetic analysis
consists of four steps:
- Alignment—building the data model and extracting a dataset.
- Determining the substitution model—consider sequence
variation.
- Tree building.
- Tree evaluation.
Tree Building: Key Features of DNA-based Phylogenetic Trees
Studies of gene and protein evolution often involve the comparison
of homologs, sequences that have common origins but may or
may not have common activity. In practice, sequences that share more
than an arbitrarily chosen level of similarity, determined by alignment
of matching bases, are treated as homologous. These sequences are
inherited from a common
ancestor that possessed similar structure, although the ancestor
may be difficult to determine because it has been modified through
descent.
Homologs are most commonly defined as orthologs, paralogs,
or xenologs.
Orthologs are homologs produced by speciation—they
represent genes derived from a common ancestor that
diverged because of divergence of the organism. Orthologs
tend to have similar function.
Paralogs are homologs produced by gene duplication
and represent genes derived from a common ancestral
gene that duplicated within an organism and then diverged.
Paralogs tend to have different functions.
Xenologs are homologs resulting from the horizontal
transfer of a gene between two organisms. The function
of xenologs can be variable, depending on how significant
the change in context was for the horizontally moving
gene. In general, though, the function tends to be similar.
A typical gene-based phylogenetic tree is depicted below. This tree
shows the relationship between four homologous genes: A, B, C, and
D. The topology of this tree consists of four external nodes (A, B, C,
and D), each one representing one of the four genes, and two internal
nodes (e and f) representing ancestral genes. The branch lengths indicate
the degree of evolutionary differences between the genes. This particular
tree is unrooted: it is only an illustration of the relationships between
genes A, B, C, and D and does not signify anything about the series of
evolutionary events that led to these genes.
The second panel, below, depicts three rooted trees that can be
drawn from the unrooted tree shown above, each representing the different
evolutionary pathways possible between these four genes. A rooted
tree is often referred to as an inferred tree. This is to emphasize that
this type of illustration depicts only the series of evolutionary
events that are inferred from the data under study and may not be
the same as the true tree or the tree that depicts the actual series
of evolutionary events that occurred.
To distinguish between the pathways, the phylogenetic analysis
must include at least one outgroup, a gene that is less
closely related to A, B, C, and D than these genes are to each other
(panel below). Outgroups enable the root of the tree to be located
and the correct evolutionary pathway to be identified. Let's say
that the four homologous genes used in the previous tree examples
come from human, chimpanzee, gorilla, and orangutan. In this case,
an outgroup could be a gene from another primate, such as baboon,
which is known to have branched away from the four species above
before the common ancestor of the species.
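Rooting with an outgroup can be sketched in code: the root is placed on the branch that joins the outgroup to the rest of the tree. The example assumes an unrooted tree stored as a dictionary of neighbours; the internal node names x, y, and z and the function name are hypothetical, and the topology follows the generally accepted relationships among these primates.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to `outgroup`.

    `adjacency` maps every node to a list of its neighbours, and the
    outgroup must be a tip. Returns nested tuples with tips as strings.
    """
    (attachment,) = adjacency[outgroup]       # a tip has exactly one neighbour

    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:                      # reached a tip
            return node
        return tuple(subtree(child, node) for child in children)

    return (outgroup, subtree(attachment, outgroup))

unrooted = {
    "human": ["x"], "chimpanzee": ["x"], "gorilla": ["y"],
    "orangutan": ["z"], "baboon": ["z"],
    "x": ["human", "chimpanzee", "y"],
    "y": ["x", "gorilla", "z"],
    "z": ["y", "orangutan", "baboon"],
}
print(root_with_outgroup(unrooted, "baboon"))
# ('baboon', ((('human', 'chimpanzee'), 'gorilla'), 'orangutan'))
```

Choosing a different tip as the outgroup would root the same unrooted tree on a different branch and imply a different series of evolutionary events.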
Gene Trees Versus Species Trees—Why Are They Different?
It is often assumed that a gene tree, because it is based on molecular
data, will be a more accurate and less ambiguous representation
of the species tree than that obtainable by morphological comparisons.
This may indeed be the case, but it does not mean that the gene
tree is the same as the species tree. For this to be true, the internal
nodes in both trees would have to be precisely equivalent, and they
are not. An internal node in a gene tree indicates the divergence
of an ancestral gene into two genes with different DNA sequences,
usually resulting from a mutation of one sort or another. An internal
node in a species tree represents what is called a speciation
event, whereby the population of the ancestral species splits
into two groups that are no longer able to interbreed. These two
events, mutation and speciation, do not always occur at the same
time.
Molecular Phylogenetics Terminology
- Monophyletic: two or more DNA sequences that are
derived from a single common ancestral DNA sequence.
- Clade: a group of monophyletic DNA sequences that
make up all of the sequences included in the analysis that
are descended from a particular common ancestral sequence.
- Parsimony: an approach that decides between different
tree topologies by identifying the one that involves the
shortest evolutionary pathway. This is the pathway that
requires the smallest number of nucleotide changes to go
from the ancestral sequence, at the root of the tree, to
all of the present-day sequences that have been compared.
- Molecular Clock Hypothesis: states that nucleotide
substitutions, or amino acid substitutions if proteins are
being compared, occur at a constant rate, so that the degree
of difference between two sequences can be used to assign
a date to the time at which their ancestral sequence diverged.
In practice, the rate of molecular change differs among groups
of organisms, among genes, and even among different parts of the
same gene.
Furthermore, molecular clocks require calibration with fossils
to determine timing of origin of clades, and thus their accuracy
is crucially dependent on the fossil record, or lack thereof, for
the groups under study. Fossil DNA older than about
25,000–50,000 years is virtually empty of phylogenetic signal
except in rare instances, and therefore traditional morphological studies
of extinct and extant organisms remain a crucial component of
phylogenetic analysis.
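The parsimony criterion defined above can be illustrated by counting, for a given rooted tree, the smallest number of nucleotide changes needed to explain an alignment. The sketch below uses Fitch's counting procedure, a standard choice that the text itself does not name; the trees are nested 2-tuples, and the four three-base sequences are invented.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum changes) for one site.

    `tree` is a rooted binary tree of nested 2-tuples with tip names as
    strings; `states` maps each tip name to its nucleotide at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    (left_set, left_cost), (right_set, right_cost) = (fitch(child, states)
                                                      for child in tree)
    common = left_set & right_set
    if common:                                 # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum change counts over every column of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(length))

# Hypothetical aligned sequences for four genes.
alignment = {"A": "AAG", "B": "AAG", "C": "ATG", "D": "CTG"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 2
print(parsimony_score(tree_2, alignment))  # 3
```

Of these two topologies, parsimony prefers tree_1 because it requires fewer nucleotide changes.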
Systematics and NCBI
The Taxonomy Project
The purpose of NCBI's Taxonomy Project is to build a consistent
phylogenetic taxonomy for the NCBI sequence databases. The Taxonomy
Database contains the names and lineages of every organism
represented by at least one nucleotide or protein sequence in the NCBI
genetic databases. As of February 2003, this total is over 250,000 taxa.
For current information, visit NCBI's
Taxonomy
Statistics Web page. The database is recognized
as the standard reference by the international sequence database
collaboration (GenBank, EMBL, DDBJ, and Swiss-Prot).
The Taxonomy
Browser is an NCBI-derived search tool that allows an individual
to search the Taxonomy database. Using the browser, information may be retrieved
on available nucleotide, protein, and structure records for a particular
species or higher taxon. The Taxonomy Browser
can be used to view the taxonomic position or retrieve sequence
and structural data for a particular organism or group of organisms.
Searches may be made on the basis of whole, partial, or phonetically
spelled organism names, and direct links to organisms commonly used in
biological research are also provided. The Entrez Taxonomy system has the
ability to display custom taxonomic trees representing user-defined subsets of
the full NCBI taxonomy.
TaxPlot,
another component of the Taxonomy project, is a research tool for
conducting three-way comparisons of different genomes. Comparisons
are based on the sequences of the proteins encoded in each organism's
genome. To use TaxPlot, one selects a reference genome to which
two other genomes will be compared. The TaxPlot tool then uses a
pre-computed BLAST result to plot a point for each protein predicted
to be encoded in the reference genome.
BLAST: Detecting New Sequence Similarities
Currently, the characters most widely used for phylogenetic analysis
are DNA and protein sequences. DNA sequences may be compared directly,
or for those regions that code for a known protein, translated into
protein sequences. Creating phylogenies from nucleotide or amino
acid sequences first requires aligning the bases so that the differences
between the sequences being studied are easier to spot.
The introduction of NCBI's BLAST,
the Basic Local Alignment Search
Tool, in 1990 made it easier to rapidly scan huge databases
for overt homologies, or sequence similarity, and to statistically
evaluate the resulting matches. BLAST works by comparing a user's
unknown sequence against the database of all known sequences to
determine likely matches. In a matter of seconds, the BLAST server
compares the user's sequence with up to a million known sequences
and determines the closest matches.
Specialized BLASTs are also available for human, mouse, microbial,
and many other genomes. A single BLAST search can compare a sequence of
interest to all other sequences stored in GenBank,
NCBI's nucleotide sequence database. In this step, a researcher
has the option of limiting the search to a specific taxonomic group.
If the full scientific name or relationship of species of interest
is not known, the user can search for such details using NCBI's
Taxonomy Browser,
which provides direct links to some of the organisms commonly used
in molecular research projects, such as the zebrafish, fruit fly,
baker's yeast, nematode, and many more.
BLAST next tallies the differences between sequences and assigns
a "score" based on sequence similarity. The scores assigned in a
BLAST search have a well-defined statistical interpretation, making
real sequence matches easier to distinguish from random background
hits. BLAST also uses a heuristic algorithm that seeks local as
opposed to global alignments and is
therefore able to detect relationships among sequences that share
only isolated regions of similarity. Taxonomy-related BLAST
results are presented in three formats based on the information
found in NCBI's Taxonomy database. The Organism Report sorts
BLAST comparisons, also called hits, by species such that all hits
to a given organism are grouped together. The Lineage Report
provides a view of the relationships between the organisms based
on NCBI's Taxonomy database. The Taxonomy Report provides
in-depth details on the relationship between all the organisms in
the BLAST hit list.
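The notion of a local alignment mentioned above can be illustrated with
the classic Smith-Waterman dynamic programming method, sketched below in
Python. This is not the BLAST algorithm, which relies on a faster
heuristic; it only shows how a local score picks out an isolated region of
similarity. The function name and the scoring values (match 2, mismatch
-1, gap -2) are assumptions chosen for the example.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score (Smith-Waterman).

    Each cell holds the best score of any alignment ending at that
    pair of positions; scores are floored at 0 so that an alignment
    may start anywhere in either sequence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap          # gap in sequence b
            left = score[i][j - 1] + gap        # gap in sequence a
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


# Made-up sequences that share only the region "ACGTACGT".
shared = local_alignment_score("GGGGACGTACGT", "ACGTACGTCCCC")
unrelated = local_alignment_score("AAAA", "CCCC")
print(shared, unrelated)
```

Because the unrelated flanking bases cannot pull the score below zero, the
shared eight-base region is found even though the sequences as a whole do
not align well.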
|
|
|
COGs: Phylogenetic Classification of Proteins
The database of Clusters
of Orthologous Groups of proteins (COGs) represents an attempt
at the phylogenetic classification (a scheme that indicates
evolutionary relationships) of proteins from
complete genomes. Each COG includes proteins that are thought to
be orthologous, or connected through vertical evolutionary
descent. COGs may be used to detect similarities and differences
between species, to identify protein families and predict
new protein functions, and to point to potential drug targets in
disease-causing species. The database is accompanied by the COGnitor
program, which assigns new proteins, typically from newly sequenced
genomes, to pre-existing COGs. A Web page containing additional
structural and functional information is now associated with each
COG. These hyperlinked information pages include: systematic classification
of the COG members under the different classification systems; indications
of which COG members (if any) have been characterized genetically and
biochemically; information on the domain architecture of the proteins
constituting the COG and the three-dimensional structure of the domains
if known or predictable; a succinct summary of the common structural
and functional features of the COG members, as well as peculiarities
of individual members; and key references.
|
|
|
HomoloGene
HomoloGene
is a database of both curated and calculated orthologs and homologs
for the organisms represented in NCBI's UniGene
database. Curated orthologs include gene pairs from the Mouse Genome
Database (MGD) at the Jackson Laboratory, the Zebrafish Information
Network (ZFIN) database at the University of Oregon, and from published reports.
Computed orthologs and homologs are identified from BLAST nucleotide
sequence comparisons between all UniGene clusters for each pair
of organisms. HomoloGene also contains a set of triplet clusters
in which orthologous clusters in two organisms are both orthologous
to the same cluster in a third organism. HomoloGene can be searched
via the Entrez retrieval system.
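The triplet idea can be sketched in a few lines of Python. Given three
sets of pairwise orthologous clusters, the function below keeps only those
combinations in which all three pairings are present. The function name
find_triplets and the cluster identifiers are hypothetical and serve only
to illustrate the definition; they do not describe how HomoloGene is
actually computed.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def find_triplets(ab: Set[Pair], ac: Set[Pair],
                  bc: Set[Pair]) -> List[Tuple[str, str, str]]:
    """Return (a, b, c) triplets that are mutually consistent.

    ab, ac, and bc hold orthologous cluster pairs between organisms
    A and B, A and C, and B and C. A triplet is reported when a pairs
    with b, a pairs with c, and b also pairs with c.
    """
    triplets = []
    for a, b in sorted(ab):
        for a2, c in sorted(ac):
            if a2 == a and (b, c) in bc:
                triplets.append((a, b, c))
    return triplets


# Hypothetical cluster identifiers, for illustration only.
human_mouse = {("Hs.1", "Mm.7"), ("Hs.2", "Mm.9")}
human_rat = {("Hs.1", "Rn.3"), ("Hs.2", "Rn.5")}
mouse_rat = {("Mm.7", "Rn.3")}

triplets = find_triplets(human_mouse, human_rat, mouse_rat)
print(triplets)
```

In this made-up data the second human cluster is dropped because its mouse
and rat partners are not themselves recorded as orthologous.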
|
|
UniGene
is a system for automatically partitioning GenBank sequences
into a non-redundant set of gene-oriented clusters. Each
UniGene cluster contains sequences that represent a unique
gene, as well as related information, such as the tissue
types in which the gene has been expressed and map location.
|
|
|
|
|
Entrez Genome
The whole genomes of over 1,200 organisms can be found in Entrez
Genomes. The genomes represent both completely sequenced organisms
and those for which sequencing is in progress. All three main domains
of life (bacteria, archaea, and eukaryotes) are represented, as well
as many viruses, viroids, plasmids, and eukaryotic organelles. Data
can be accessed hierarchically starting
from either an alphabetical listing or a phylogenetic tree for complete
genomes in each of six principal taxonomic groups. One can follow
the hierarchy to a variety of graphical overviews, including that
of the whole genome of a single organism, a single chromosome, or
even a single gene. At each level, one can access multiple views
of the data, pre-computed summaries, and links to analyses appropriate
for that level. In addition, any gene product (protein) that is
a member of a COG is linked to the COGs database. A summary of COG
functional groups is also presented in tabular and graphical formats
at the genome level.
For complete microbial genomes, pre-computed BLAST neighbors for
protein sequences, including their taxonomic distribution and links
to 3D structures, are given in TaxTables and PDBTables, respectively.
Pairwise sequence alignments are presented graphically and linked
to NCBI's Cn3D
macromolecular viewer that allows the interactive display of three-dimensional
structures and sequence alignments.
|
|
|
PDBeast: Taxonomy in MMDB
NCBI's Structure Group, in collaboration with NCBI taxonomists,
has undertaken taxonomy annotation for the three-dimensional
structure data stored in the Molecular
Modeling Database (MMDB). A semi-automated approach has been
implemented in which a human expert checks, corrects, and validates
automatic taxonomic assignments in MMDB. The PDBeast
software tool was developed by NCBI for this purpose. It pulls text descriptions
of "Source Organisms" from either the original entries or user-specified
information and looks for matches in the NCBI Taxonomy database
to record taxonomy assignments.
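The automatic matching step might look something like the Python sketch
below, which normalizes a free-text source organism description and looks
it up in a small table of taxonomy names. The function names, the
normalization rule, and the three-entry name table are assumptions for
illustration and do not describe PDBeast's actual implementation.
Descriptions that return no match are the kind of case a human expert
would then review.

```python
import re
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Lowercase a name and collapse punctuation and extra spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def match_source_organism(description: str,
                          taxonomy_names: Dict[str, int]) -> Optional[int]:
    """Return the taxonomy ID whose name matches the description.

    Matching is exact after normalization; None means no automatic
    assignment could be made.
    """
    index = {normalize(name): taxid for name, taxid in taxonomy_names.items()}
    return index.get(normalize(description))


# A tiny illustrative name table (scientific name -> NCBI taxonomy ID).
names = {"Homo sapiens": 9606, "Escherichia coli": 562, "Mus musculus": 10090}

human = match_source_organism("HOMO SAPIENS", names)
ecoli = match_source_organism("Escherichia  coli.", names)
unknown = match_source_organism("synthetic construct X", names)
print(human, ecoli, unknown)
```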
|
|
The Molecular Modeling Database (MMDB) is a compilation
of three-dimensional structures of biomolecules obtained
from the Protein Data Bank (PDB). The PDB, managed
and maintained by the Research Collaboratory for Structural
Bioinformatics, is a collection of all publicly available
three-dimensional structures of proteins, nucleic acids,
carbohydrates, and a variety of other complexes experimentally
determined by X-ray crystallography and NMR. The difference
between the two databases is that MMDB records reorganize
and validate the information stored in the PDB
in a way that enables cross-referencing between the
chemistry and the three-dimensional structure of macromolecules.
By integrating chemical, sequence, and structure information,
MMDB is designed to serve as a resource for structure-based
homology modeling and protein structure prediction.
|
|
|
|
|
The Importance of Molecular Phylogenetics
The field of molecular phylogenetics has grown, both in size and
in importance, since its inception in the early 1990s. This growth is attributable mostly
to advances in molecular biology and more rigorous methods for phylogenetic
tree building. The importance of phylogenetics has also been greatly
enhanced by the successful application of tree reconstruction, as
well as other phylogenetic techniques, to more diverse and perplexing
issues in biology. Today, a survey of the scientific literature
will show that molecular biology, genetics, evolution, development,
behavior, epidemiology, ecology, systematics, conservation biology,
and forensics are but a few examples of the many disparate fields
conceptually united by the methods and theories of molecular phylogenetics.
Phylogenies are used essentially the same way in all of these fields,
by drawing inferences either from the structure of the tree or from
the way the character states map onto the tree. Biologists can then
use these clues to build hypotheses and models of important events
in history. Broadly speaking, the relationships established by phylogenetic
trees often describe a species' evolutionary history and, hence,
its phylogeny—the historical relationships among lineages or
organisms or their parts, such as their genes. Phylogenies may be
thought of as a natural and meaningful way to order data, with an
enormous amount of evolutionary information contained within their
branches. Scientists working in these different areas can then use
these phylogenies to study and elucidate the biological processes
occurring at many levels of life's hierarchy.
|
|
|
Revised: April 1, 2004.
|
|