Phylogenetics Factsheet

National Center for Biotechnology Information


Science Primer:

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

SYSTEMATICS AND MOLECULAR PHYLOGENETICS

Classifying Organisms

Have you ever noticed that when you see an insect or a bird, there is real satisfaction in giving it a name, and an uncomfortable uncertainty when you can't? Along these same lines, consider the bewildering number and variety of organisms that live, or have lived, on this earth. If we did not know what to call these organisms, how could we communicate ideas about them, let alone the history of life? Thanks to taxonomy, the field of science that classifies life into groups, we can discuss just about any organism, from bacteria to man.

Carolus Linnaeus pioneered the grouping of organisms based on scientific names using Latin. His system of giving an organism a scientific name of two parts, sometimes more, is called binomial nomenclature, or "two-word naming". His scheme was based on physical similarities and differences, referred to as characters. Today, taxonomic classification is much more complex and takes into account cellular types and organization, biochemical similarities, and genetic similarities. Taxonomy is but one aspect of a much larger field called systematics.

Taxonomic Classification

Taxonomic ranks approximate evolutionary distances among groups of organisms. For example, species belonging to two different superkingdoms are most distantly related (their common ancestor diverged in the distant past), with progressively more exclusive groups indicated by phylum, class and so on, down to infraspecific ranks, or ranks occurring within a species. Infraspecific ranks, such as subspecies, varietas, and forma, denote the closest evolutionary relationship. See the simplified classification of humans below.

Taxonomists, scientists who classify living organisms, define a species as any group of closely related organisms that can produce fertile offspring. Two organisms are more closely "related" as they approach the level of species, that is, they have more genes in common. The level of species can be further divided into smaller segments. A population is the smallest unit of a species and is made up of organisms of the same species. Sometimes, a population will physically alter over time to suit the needs of its environment. This is called a cline and can make members of the same species look different.

Taxonomic Classification of Man
Homo sapiens

		Superkingdom: Eukaryota
		Kingdom: Metazoa
		Phylum: Chordata
		Class: Mammalia
		Order: Primates
		Family: Hominidae
		Genus: Homo
		Species: sapiens

What Is Phylogenetic Systematics?

Carolus Linnaeus is also credited with pioneering systematics, the field of science dealing with the diversity of life and the relationships among its components. Systematics reaches beyond taxonomy to develop methods and theories for classifying species based on similarity of traits and on possible mechanisms of evolution, where evolution is a change in the gene pool of a population over time.

Phylogenetic systematics is the field of biology that deals with identifying and understanding the evolutionary relationships among the many different kinds of life on earth, both living (extant) and dead (extinct). Evolutionary theory states that similarity among individuals or species is attributable to common descent, or inheritance from a common ancestor. Thus, the relationships established by phylogenetic systematics often describe a species' evolutionary history and, hence, its phylogeny: the historical relationships among lineages, organisms, or their parts, such as their genes.

Charles Darwin was the first to recognize that the systematic hierarchy represented a rough approximation of evolutionary history. However, it was not until the 1950s that the German entomologist Willi Hennig proposed that systematics should reflect the known evolutionary history of lineages as closely as possible, an approach he called phylogenetic systematics. Hennig's followers were disparagingly called "cladists" by his opponents because of their emphasis on recognizing only monophyletic groups, or clades, each consisting of an ancestor plus all of its descendants. However, the cladists quickly adopted the term as a helpful label, and nowadays cladistic approaches to systematics are used routinely.

Understanding the Evolutionary Process

Genetic Variation: Changes in a Gene Pool

Evolution is not always discrete, with clearly defined boundaries that pinpoint the origin of a new species, nor is it a steady continuum. Evolution requires genetic variation, which results from changes within a gene pool, the genetic make-up of a specific population. A gene pool is the combination of all the alleles (alternative forms of a genetic locus) for all traits that population may exhibit. Changes in a gene pool can result from mutation (variation within a particular gene) or from changes in gene frequency (the proportion of an allele in a given population).
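As a rough illustration of gene frequency, the proportion of each allele can be counted directly from a list of genotypes. The sketch below assumes diploid individuals and a single locus with two alleles; the population and the function name are hypothetical and chosen only for the example.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return the frequency of each allele in a list of diploid genotypes.

    Each genotype is a 2-tuple of allele labels, for example ("A", "a").
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Hypothetical population: 3 AA, 4 Aa, and 3 aa individuals (20 alleles in all).
population = [("A", "A")] * 3 + [("A", "a")] * 4 + [("a", "a")] * 3
freqs = allele_frequencies(population)
print(freqs)  # {'A': 0.5, 'a': 0.5}
```

Recomputing these frequencies generation after generation is what reveals a change in the gene pool.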

How Does Genetic Variation Occur?

Every organism possesses a genome that contains all of the biological information needed to construct and maintain a living example of that organism. The biological information contained in a genome is encoded in the nucleotide sequence of its DNA or RNA molecules and is divided into discrete units called genes. The information stored in a gene is read by proteins, which attach to the genome and initiate a series of reactions called gene expression.

Every time a cell divides, it must make a complete copy of its genome, a process called DNA replication. DNA replication must be extremely accurate to avoid introducing mutations, or changes in the nucleotide sequence of a short region of the genome. Inevitably, some mutations do occur, usually in one of two ways: from errors in DNA replication, or from the damaging effects of chemical agents or radiation that react with DNA and change the structure of individual nucleotides. Many of these mutations have no effect on the functioning of the genome and are referred to as silent mutations. Silent mutations include virtually all changes that happen in the non-coding components of genes and gene-related sequences.

Mutations in the coding regions of genes are much more important. Here we must consider the importance of the same mutation in a somatic cell compared with a germ line cell. A somatic cell is any cell of an organism other than a reproductive cell (a sperm or egg cell). A germ line is any line of cells that gives rise to gametes and is continuous through the generations. Because a somatic cell does not pass on copies of its genome to the next generation, a somatic cell mutation is important only for the organism in which it occurs and has no potential evolutionary impact. In fact, most somatic mutations have no significant effect because there are many other identical cells in the same tissue.

On the other hand, mutations in germ cells can be transmitted to the next generation and will then be present in all of the cells of an individual who inherits that mutation. Even so, mutations within germ line cells may not change the phenotype of the organism in any significant way. Those mutations that do have an evolutionary effect can be divided into two categories, loss-of-function mutations and gain-of-function mutations. A loss-of-function mutation results in reduced or abolished protein function. Gain-of-function mutations, which are much less common, confer an abnormal activity on a protein.

The randomness with which mutations can occur is an important concept in biology and is a requirement of the Darwinian view of evolution, which holds that changes in the characteristics of an organism occur by chance and are not influenced by the environment in which the organism lives. Beneficial changes within an organism are then positively selected for, whereas harmful changes are negatively selected.

The Drivers of Evolution: Selection, Drift, and Founder Effects

As discussed above, new alleles appear in a population because of mutations that occur in the reproductive cells of an organism. This means that many genes are polymorphic, that is, two or more alleles for that gene are present in a population. Each of these alleles has its own allele frequency, or gene frequency, a measure of how common the allele is in a population. Allele frequencies vary over time because of two processes, natural selection and random drift.

Natural Selection

Natural selection is the process whereby one genotype, the hereditary constitution of an individual, leaves more offspring than another genotype because of superior life attributes, termed fitness. Natural selection acts on genetic variation by conferring a survival advantage to those individuals harboring a particular mutation that tends to favor a changing environmental condition. These individuals then reproduce and pass on this "new" gene, altering their gene pool. Natural selection, therefore, decreases the frequencies of alleles that reduce the fitness of an organism and increases the frequency of alleles that improve fitness.

"Natural Selection" is the principle by which each slight variation, if useful, is preserved.

Charles Darwin

It is important to point out that natural selection does not always represent progress, only adaptation to changing surroundings. Evolution attributable to natural selection is devoid of intent: something does not evolve to better itself, only to adapt. Because environments are always changing, what was once an advantageous mutation can become a liability further down the evolutionary line.

Random Drift

The term random drift actually encompasses a number of distinct processes, sometimes referred to as outcomes. They include indiscriminate parent sampling, the founder effect, and fluctuations in the rate of evolutionary processes such as selection, migration, and mutation. Parent sampling is the process of determining which organisms of one generation will be the parents of the next generation. Parent sampling may be discriminate, that is, with regard to fitness differences, or indiscriminate, without regard to fitness differences. Discriminate parent sampling is generally considered natural selection, whereas indiscriminate parent sampling is considered random drift.

What Is Sampling?

Suppose a population of red and brown squirrels shares a habitat with a color-blind predator. Although the predator is color blind, the brown squirrels die in greater numbers than the red squirrels, simply because they happen to be unlucky enough to come into contact with the predator more often. As a result, the frequency of brown squirrels in the next generation is reduced. More red squirrels survive to reproduce, or are sampled, but this happens without regard to any differences in fitness between the two groups. The physical differences between the groups do not play a causal role in the differences in reproductive success.

Now, let's say that the predator is not color blind and can see the red squirrels better than the brown squirrels, resulting in a better survival rate for the brown squirrels. This would be a case of discriminate parent sampling, or natural selection.
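Indiscriminate parent sampling can be sketched as a small simulation. The example below is a minimal sketch, assuming a fixed population size and non-overlapping generations (a Wright-Fisher style model, which the text above does not name); the function name, population sizes, and seed are hypothetical.

```python
import random

def drift(freq_red, population_size, generations, rng):
    """Track the frequency of red squirrels under indiscriminate sampling.

    Each generation, every parent is drawn at random from the previous
    generation, with no regard to color (no fitness difference).
    """
    history = [freq_red]
    for _ in range(generations):
        red_parents = sum(rng.random() < freq_red for _ in range(population_size))
        freq_red = red_parents / population_size
        history.append(freq_red)
    return history

rng = random.Random(1)  # fixed seed so that a run can be repeated
small = drift(0.5, population_size=20, generations=50, rng=rng)
large = drift(0.5, population_size=2000, generations=50, rng=rng)
print(small[-1], large[-1])
```

In runs of this kind the frequency in the small population typically wanders much further from its starting value than in the large one, even though neither color has any advantage.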

Founder Effect

Another important cause of genetic drift is the founder effect, the difference between the gene pool of a population as a whole and that of a newly isolated population of the same species. The founder effect occurs when populations are started from a small number of pioneer individuals of one original population. Because of small sample size, the new population could have a much different genetic ratio than the original population. An example of the founder effect would be when a plant population results from a single seed.
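The effect of a small founding group can be illustrated by drawing a handful of individuals from a larger gene pool and comparing allele frequencies. This is only a sketch: the source population, the number of founders, and the helper names are hypothetical.

```python
import random

def found_population(source, n_founders, rng):
    """Pick a few founders at random from the source population."""
    return rng.sample(source, n_founders)

def frequency(population, allele):
    """Proportion of the population carrying the given allele."""
    return population.count(allele) / len(population)

rng = random.Random(7)
source = ["A"] * 700 + ["a"] * 300   # hypothetical source gene pool
founders = found_population(source, 5, rng)
print(frequency(source, "A"), frequency(founders, "A"))
```

With only five founders, the frequency in the new population can take just six possible values, so it will rarely match the source population closely.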

Thus far, we have discussed natural selection and random drift as events that occur in isolation from one another. However, in most populations, the two processes will be occurring at the same time. Furthermore, there is great debate over whether, in particular instances and in general, natural selection is more prevalent than random drift.

Phylogenetic Trees: Presenting Evolutionary Relationships

Systematics describes the pattern of relationships among taxa and is intended to help us understand the history of all life. But history is not something we can see—it has happened once and leaves only clues as to the actual events. Scientists use these clues to build hypotheses, or models, of life's history. In phylogenetic studies, the most convenient way of visually presenting evolutionary relationships among a group of organisms is through illustrations called phylogenetic trees.

[Figure: a phylogenetic tree illustrating the terms defined below.]

Node: represents a taxonomic unit. This can be either an existing species or an ancestor.

Branch: defines the relationship between the taxa in terms of descent and ancestry.

Topology: the branching patterns of the tree.

Branch length: represents the number of changes that have occurred in the branch.

Root: the common ancestor of all taxa.

Distance scale: scale that represents the number of differences between organisms or sequences.

Clade: a group of two or more taxa or DNA sequences that includes both their common ancestor and all of that ancestor's descendants.

Operational Taxonomic Unit (OTU): taxonomic level of sampling selected by the user to be used in a study, such as individuals, populations, species, genera, or bacterial strains.

A phylogenetic tree is composed of nodes, each representing a taxonomic unit (species, populations, individuals), and branches, which define the relationship between the taxonomic units in terms of descent and ancestry. Only one branch can connect any two adjacent nodes. The branching pattern of the tree is called the topology, and the branch length usually represents the number of changes that have occurred in the branch. Such a branch is called a scaled branch. Scaled trees are often calibrated to represent the passage of time. Such trees have a theoretical basis in the particular gene or genes under analysis. Branches can also be unscaled, which means that the branch length is not proportional to the number of changes that have occurred, although the actual number may be indicated numerically somewhere on the branch. Phylogenetic trees may also be either rooted or unrooted. In rooted trees, there is a particular node, called the root, representing a common ancestor, from which a unique path leads to any other node. An unrooted tree specifies only the relationships among species, without identifying a common ancestor or evolutionary path.
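The terms above map naturally onto a small data structure. The following sketch assumes a rooted, scaled tree in which each node stores the length of the branch leading to it; the class, the function names, and the example tree are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    branch_length: float = 0.0   # changes on the branch leading to this node
    children: list = field(default_factory=list)

def leaves(node):
    """Return the names of all terminal nodes at or below this node."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]

def root_to_tip(node, target, distance=0.0):
    """Sum branch lengths along the unique path from this node to a target."""
    if node.name == target:
        return distance
    for child in node.children:
        found = root_to_tip(child, target, distance + child.branch_length)
        if found is not None:
            return found
    return None  # target is not in this subtree

# Hypothetical rooted, scaled tree, in Newick notation ((A:1,B:2)x:1,C:3)root
tree = Node("root", children=[
    Node("x", 1.0, [Node("A", 1.0), Node("B", 2.0)]),
    Node("C", 3.0),
])
print(leaves(tree))            # ['A', 'B', 'C']
print(root_to_tip(tree, "B"))  # 3.0
```

The topology is carried by the nesting of children, and the scaling by the branch_length values; an unscaled tree would simply ignore those values.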

Figure 1. Possible ways of drawing a tree.

Phylogenetic trees, a convenient way of representing evolutionary relationships among a group of organisms, can be drawn in various ways. Branches on phylogenetic trees may be scaled (top panel), representing the amount of evolutionary change, time, or both, when there is a molecular clock, or they may be unscaled (middle panel) and have no direct correspondence with either time or amount of evolutionary change. Phylogenetic trees may be rooted (top and middle panels) or unrooted (bottom panels). In the case of unrooted trees, branching relationships between taxa are specified by the way they are connected to each other, but the position of the common ancestor is not. An unrooted bifurcating tree with n taxa has 2n - 3 branches, so, for example, an unrooted tree with four species has five branches (four external, one internal) on which the tree can be rooted. Rooting on each of the five branches has different implications for evolutionary relationships.

Text and figures adapted with permission from A. Vierstraete, University of Ghent, Belgium.

Methods of Phylogenetic Analysis

Two major groups of analyses exist to examine phylogenetic relationships: phenetic methods and cladistic methods. It is important to note that phenetics and cladistics have had an uneasy relationship over the last 40 years or so. Most of today's evolutionary biologists favor cladistics, although a strictly cladistic approach may produce counterintuitive results.

Phenetic Method of Analysis

Phenetics, also known as numerical taxonomy, involves the use of various measures of overall similarity for the ranking of species. There is no restriction on the number or type of characters (data) that can be used, although all data must be first converted to a numerical value, without any character "weighting". Each organism is then compared with every other for all characters measured, and the number of similarities (or differences) is calculated. The organisms are then clustered in such a way that the most similar are grouped close together and the more different ones are linked more distantly. The taxonomic clusters, called phenograms, that result from such an analysis do not necessarily reflect genetic similarity or evolutionary relatedness. The lack of evolutionary significance in phenetics has meant that this system has had little impact on animal classification, and as a consequence, interest in and use of phenetics has been declining in recent years.
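The phenetic procedure described above can be sketched in a few lines: score every pair of organisms by the number of differing characters, then repeatedly join the two most similar clusters. The sketch below assumes unweighted binary characters and average-linkage clustering, which is one of several possible clustering rules; the character matrix is hypothetical.

```python
from itertools import combinations

def mismatches(a, b):
    """Count the characters at which two organisms differ (equal weighting)."""
    return sum(x != y for x, y in zip(a, b))

def cluster(characters):
    """Average-linkage clustering; returns a nested tuple (a phenogram)."""
    # Each cluster is a pair: (label, list of member names).
    clusters = [(name, [name]) for name in characters]

    def average_distance(pair):
        (_, members_a), (_, members_b) = pair
        total = sum(mismatches(characters[m], characters[n])
                    for m in members_a for n in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=average_distance)
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]

# Hypothetical character matrix: 1 = trait present, 0 = trait absent.
characters = {
    "sparrow": [1, 1, 1, 0, 0],
    "robin":   [1, 1, 1, 1, 0],
    "bat":     [1, 0, 0, 1, 1],
    "mouse":   [0, 0, 0, 1, 1],
}
print(cluster(characters))  # (('sparrow', 'robin'), ('bat', 'mouse'))
```

The grouping depends only on overall similarity in the chosen characters, which is why, as noted above, a phenogram need not reflect evolutionary relatedness.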

Cladistic Method of Analysis

An alternative approach to diagramming relationships between taxa is called cladistics. The basic assumption behind cladistics is that members of a group share a common evolutionary history. Thus, they are more closely related to one another than they are to other groups of organisms. Related groups of organisms are recognized because they share a set of unique features (apomorphies) that were not present in distant ancestors but are shared by most or all of the organisms within the group. These shared derived characteristics are called synapomorphies. Therefore, in contrast to phenetics, cladistic groupings do not depend on how many physical traits organisms share but on their evolutionary relationships. Indeed, in cladistic analyses two organisms may share numerous characteristics but still be considered members of different groups.

Cladistic analysis entails a number of assumptions. For example, species are assumed to arise primarily by bifurcation, or separation, of the ancestral lineage; species are often considered to become extinct upon hybridization (crossbreeding); and hybridization is assumed to be rare or absent. In addition, cladistic groupings must possess the following characteristics: all species in a grouping must share a common ancestor and all species derived from a common ancestor must be included in the taxon. The application of these requirements results in the following terms being used to describe the different ways in which groupings can be made:

A monophyletic grouping is one in which all species share a common ancestor, and all species derived from that common ancestor are included. This is the only form of grouping accepted as valid by cladists.
A paraphyletic grouping is one in which all species share a common ancestor, but not all species derived from that common ancestor are included.
A polyphyletic grouping is one in which species that do not share an immediate common ancestor are lumped together, while excluding other members that would link them.
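The definition of a monophyletic grouping can be checked mechanically on a rooted tree: a group is monophyletic if it is exactly the set of descendants of some node. The following is a minimal sketch, assuming the tree is written as nested tuples of leaf names; the function names and the example tree are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names in a tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))

def clades(tree):
    """Yield the leaf set below every node (every clade) in the tree."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)

def is_monophyletic(tree, group):
    """True if some ancestor's descendants are exactly the given group."""
    return any(clade == set(group) for clade in clades(tree))

# Hypothetical rooted tree: ((lizard, (crocodile, bird)), mammal)
tree = (("lizard", ("crocodile", "bird")), "mammal")
print(is_monophyletic(tree, {"crocodile", "bird"}))    # True
print(is_monophyletic(tree, {"lizard", "crocodile"}))  # False: bird is left out
```

On this example tree the second group is paraphyletic: its members share a common ancestor, but one descendant of that ancestor is excluded.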

The Origins of Molecular Phylogenetics

Macromolecular data, meaning gene (DNA) and protein sequences, are accumulating at an increasing rate because of recent advances in molecular biology. For the evolutionary biologist, the rapid accumulation of sequence data from whole genomes has been a major advance, because the very nature of DNA allows it to be used as a "document" of evolutionary history. Comparisons of the DNA sequences of various genes between different organisms can tell a scientist a lot about the relationships of organisms that cannot otherwise be inferred from morphology, or an organism's outer form and inner structure. Because genomes evolve by the gradual accumulation of mutations, the amount of nucleotide sequence difference between a pair of genomes from different organisms should indicate how recently those two genomes shared a common ancestor. Two genomes that diverged in the recent past should have fewer differences than two genomes whose common ancestor is more ancient. Therefore, by comparing different genomes with each other, it should be possible to derive evolutionary relationships between them, the major objective of molecular phylogenetics.
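The "amount of nucleotide sequence difference" between two sequences is often summarized as the proportion of aligned sites that differ. The sketch below assumes the sequences are already aligned and of equal length; the sequences and species labels are hypothetical fragments made up for the example.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)

# Hypothetical aligned fragments, 10 sites each.
human = "ACGTACGTAC"
chimp = "ACGTACGTAT"   # 1 difference from human
mouse = "ACTTACGAAT"   # 3 differences from human

print(p_distance(human, chimp))  # 0.1
print(p_distance(human, mouse))  # 0.3
```

Under the reasoning above, the smaller distance suggests a more recent common ancestor, although real analyses use far longer sequences and correct for multiple changes at the same site.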

Molecular phylogenetics attempts to determine the rates and patterns of change occurring in DNA and proteins and to reconstruct the evolutionary history of genes and organisms. Two general approaches may be taken to obtain this information. In the first approach, scientists use DNA to study the evolution of an organism. In the second approach, different organisms are used to study the evolution of DNA. Whatever the approach, the general goal is to infer process from pattern: the processes of organismal evolution deduced from patterns of DNA variation and processes of molecular evolution inferred from the patterns of variations in the DNA itself.

Molecular Phylogenetic Analysis: Fundamental Elements

Nucleotide and protein sequences can also be used to generate trees. DNA, RNA, and protein sequences can be considered as phenotypic traits. The sequences depict the relationship of genes and usually of the organism in which the genes are found.

As we just discussed, macromolecules, especially gene and protein sequences, have surpassed morphological and other organismal characters as the most popular forms of data for phylogenetic analyses. Therefore, this next section will concentrate only on molecular data.

It is important to point out that a single, all-purpose recipe does not exist for phylogenetic analysis of molecular data. Although numerous algorithms, procedures, and computer programs have been developed, their reliability and practicality are, in all cases, dependent upon the size and structure of the dataset under analysis. The merits and shortfalls of these various methods are subject to much scientific debate, because the danger of generating incorrect results is greater in computational molecular phylogenetics than in many other fields of science. Occasionally, the limiting factor in such analyses is not so much the computational method used, but the users' understanding of what the method is actually doing with the data. Therefore, the goal of this section is to demonstrate to the reader that practical analysis should be thought of both as a search for a correct model (analysis) as well as a search for the correct tree (outcome).

Phylogenetic tree-building methods presume particular evolutionary models. For any given set of data, these models may be violated because of various occurrences, such as the transfer of genetic material between organisms. Therefore, when interpreting a given analysis, one should always consider the model used and entertain possible explanations for the results obtained. For example, models used in molecular phylogenetic analysis methods make "default" assumptions, including:

The sequence is correct and originates from the specified source.

The sequences are homologous—all descended in some way from a shared ancestral sequence.

Each position in a sequence alignment is homologous across all of the sequences in that alignment.

Each of the multiple sequences included in a common analysis has a common phylogenetic history with the other sequences.

The sampling of taxa is adequate to resolve the problem under study.

Sequence variation among the samples is representative of the broader group.

The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem under study.

The Four Steps of Phylogenetic Analysis

A straightforward phylogenetic analysis consists of four steps:

Alignment—building the data model and extracting a dataset.

Determining the substitution model—consider sequence variation.

Tree building.

Tree evaluation.
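As an illustration of the second step, a substitution model converts the observed proportion of differing sites into an estimate of the number of changes that actually occurred, allowing for repeated changes at the same site. The sketch below uses the Jukes-Cantor model, one of the simplest such models; the text above does not name a particular model, so this choice is an assumption made for the example.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion p of differing sites.

    The model assumes that all nucleotide substitutions are equally likely.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor(p), 3))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as sequences become more divergent.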

Tree Building: Key Features of DNA-based Phylogenetic Trees

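Studies of gene and protein evolution often involve the comparison of homologs, sequences that have common origins but may or may not have common activity. In practice, sequences that share more than a chosen level of similarity, determined by aligning matching bases, are inferred to be homologous. Such sequences are inherited from a common ancestor that possessed a similar structure, although the ancestral sequence may be difficult to determine because it has been modified through descent.

Homologs are most commonly classified as orthologs, paralogs, or xenologs.

Orthologs: homologs produced by speciation, that is, genes in different species that descend from a single gene in their common ancestor.

Paralogs: homologs produced by gene duplication within a genome.

Xenologs: homologs resulting from the horizontal transfer of a gene between organisms.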

A typical gene-based phylogenetic tree is depicted below. This tree shows the relationship between four homologous genes: A, B, C, and D. The topology of this tree consists of four external nodes (A, B, C, and D), each one representing one of the four genes, and two internal nodes (e and f) representing ancestral genes. The branch lengths indicate the degree of evolutionary differences between the genes. This particular tree is unrooted—it is only an illustration of the relationships between genes A, B, C, and D and does not signify anything about the series of evolutionary events that led to these genes.

Image depicting a typical gene-based phylogenetic tree.

The second panel, below, depicts three rooted trees that can be drawn from the unrooted tree shown above, each representing the different evolutionary pathways possible between these four genes. A rooted tree is often referred to as an inferred tree. This is to emphasize that this type of illustration depicts only the series of evolutionary events that are inferred from the data under study and may not be the same as the true tree or the tree that depicts the actual series of evolutionary events that occurred.

Three examples of rooted trees which can be drawn from the unrooted tree described above.

To distinguish between the pathways, the phylogenetic analysis must include at least one outgroup, a gene that is less closely related to A, B, C, and D than these genes are to each other (panel below). Outgroups enable the root of the tree to be located and the correct evolutionary pathway to be identified. Let's say that the four homologous genes used in the previous tree examples come from human, chimpanzee, gorilla, and orangutan. In this case, an outgroup could be a gene from another primate, such as baboon, which is known to have branched away from the lineage leading to the four species above before their most recent common ancestor.

A diagram of a phylogenetic tree including an outgroup.

Gene Trees Versus Species Trees—Why Are They Different?

It is assumed that a gene tree, because it is based on molecular data, will be a more accurate and less ambiguous representation of the species tree than that obtainable by morphological comparisons. This may indeed be the case, but it does not mean that the gene tree is the same as the species tree. For this to be true, the internal nodes in both trees would have to be precisely equivalent, and they are not. An internal node in a gene tree indicates the divergence of an ancestral gene into two genes with different DNA sequences, usually resulting from a mutation of one sort or another. An internal node in a species tree represents what is called a speciation event, whereby the population of the ancestral species splits into two groups that are no longer able to interbreed. These two events, mutation and speciation, do not always occur at the same time.

Molecular Phylogenetics Terminology

Monophyletic: two or more DNA sequences that are derived from a single common ancestral DNA sequence.

Clade: a group of monophyletic DNA sequences that make up all of the sequences included in the analysis that are descended from a particular common ancestral sequence.

Parsimony: an approach that decides between different tree topologies by identifying the one that involves the shortest evolutionary pathway. This is the pathway that requires the smallest number of nucleotide changes to go from the ancestral sequence, at the root of the tree, to all of the present-day sequences that have been compared.

Molecular Clock Hypothesis: states that nucleotide substitutions, or amino acid substitutions if proteins are being compared, occur at a constant rate, that is, the degree of difference between two sequences can be used to assign a date to the time at which their ancestral sequence diverged. The rate of molecular change differs among groups of organisms, among genes, and even among different parts of the same gene. Furthermore, molecular clocks require calibration with fossils to determine timing of origin of clades, and thus their accuracy is crucially dependent on the fossil record, or lack thereof, for the groups under study. Fossil DNA older than about 25,000–50,000 years is virtually empty of phylogenetic signal except in rare instances, and therefore traditional morphological studies of extinct and extant organisms remain a crucial component of phylogenetic analysis.
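The parsimony criterion defined in the terminology above can be sketched with Fitch's algorithm, a standard way of counting the minimum number of changes a given tree requires at each aligned site. This is a minimal sketch, assuming strictly bifurcating trees written as nested tuples; the alignment and both candidate topologies are hypothetical.

```python
def fitch(tree, site):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is a nested 2-tuple of leaf names; site maps leaf name to nucleotide.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one change is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total minimum number of changes over all aligned sites."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Hypothetical alignment of four sequences, four sites each.
alignment = {"A": "AAGT", "B": "AAGC", "C": "GAAC", "D": "GAAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 4
print(parsimony_score(tree_2, alignment))  # 6
```

Under parsimony, the first topology would be preferred for this alignment because it requires fewer nucleotide changes.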

Systematics and NCBI

The Taxonomy Project

The purpose of NCBI's Taxonomy Project is to build a consistent phylogenetic taxonomy for the NCBI sequence databases. The Taxonomy Database contains the names and lineages of every organism represented by at least one nucleotide or protein sequence in the NCBI genetic databases. As of February 2003, this total is over 250,000 taxa. For current information, visit NCBI's Taxonomy Statistics Web page. The database is recognized as the standard reference by the international sequence database collaboration (GenBank, EMBL, DDBJ, and Swiss-Prot).

The Taxonomy Browser is an NCBI-derived search tool that allows an individual to search the Taxonomy database. Using the browser, information may be retrieved on available nucleotide, protein, and structure records for a particular species or higher taxon. The Taxonomy Browser can be used to view the taxonomic position or retrieve sequence and structural data for a particular organism or group of organisms. Searches may be made on the basis of whole, partial, or phonetically spelled organism names, and direct links to organisms commonly used in biological research are also provided. The Entrez Taxonomy system has the ability to display custom taxonomic trees representing user-defined subsets of the full NCBI taxonomy.

TaxPlot, another component of the Taxonomy project, is a research tool for conducting three-way comparisons of different genomes. Comparisons are based on the sequences of the proteins encoded in that organism's genome. To use TaxPlot, one selects a reference genome to which two other genomes will be compared. The TaxPlot tool then uses a pre-computed BLAST result to plot a point for each protein predicted to be included in the reference genome.

BLAST: Detecting New Sequence Similarities

Currently, the characters most widely used for phylogenetic analysis are DNA and protein sequences. DNA sequences may be compared directly, or for those regions that code for a known protein, translated into protein sequences. Creating phylogenies from nucleotide or amino acid sequences first requires aligning the bases so that the differences between the sequences being studied are easier to spot.

The introduction of NCBI's BLAST, or The Basic Local Alignment Search Tool, in 1990 made it easier to rapidly scan huge databases for overt homologies, or sequence similarity, and to statistically evaluate the resulting matches. BLAST works by comparing a user's unknown sequence against the database of all known sequences to determine likely matches. In a matter of seconds, the BLAST server compares the user's sequence with up to a million known sequences and determines the closest matches.

Specialized BLASTs are also available for human, mouse, microbial, and many other genomes. A single BLAST search can compare a sequence of interest to all other sequences stored in GenBank, NCBI's nucleotide sequence database. In this step, a researcher has the option of limiting the search to a specific taxonomic group. If the full scientific name or relationship of species of interest is not known, the user can search for such details using NCBI's Taxonomy Browser, which provides direct links to some of the organisms commonly used in molecular research projects, such as the zebrafish, fruit fly, bakers yeast, nematode, and many more.

BLAST next tallies the differences between sequences and assigns a "score" based on sequence similarity. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real sequence matches easier to distinguish from random background hits. In addition, BLAST uses an algorithm that seeks local as opposed to global alignments and is therefore able to detect relationships among sequences that share only isolated regions of similarity.

Taxonomy-related BLAST results are presented in three formats based on the information found in NCBI's Taxonomy database. The Organism Report sorts BLAST comparisons, also called hits, by species such that all hits to a given organism are grouped together. The Lineage Report provides a view of the relationships between the organisms based on NCBI's Taxonomy database. The Taxonomy Report provides in-depth details on the relationship between all the organisms in the BLAST hit list.
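To make the idea of local alignment concrete, here is a minimal sketch of a local alignment score computed with the classic Smith-Waterman dynamic-programming method. This is not BLAST's own procedure, which relies on faster heuristics, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets an alignment restart anywhere, which is
            # what makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share only an isolated region, ACGTACGT.
shared_region_score = local_alignment_score("TTTTACGTACGT", "GGGGACGTACGTCC")
```

Because the score never drops below zero, the dissimilar flanking regions do not penalize the shared stretch, so two sequences that are otherwise unrelated can still produce a strong local match.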

COGs: Phylogenetic Classification of Proteins

The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at a phylogenetic classification of the proteins from complete genomes, that is, a classification scheme that reflects evolutionary relationships. Each COG includes proteins that are thought to be orthologous, or connected through vertical evolutionary descent. COGs may be used to detect similarities and differences between species, to identify protein families and predict new protein functions, and to point to potential drug targets in disease-causing species. The database is accompanied by the COGnitor program, which assigns new proteins, typically from newly sequenced genomes, to pre-existing COGs. A Web page containing additional structural and functional information is now associated with each COG. These hyperlinked information pages include: systematic classification of the COG members under the different classification systems; indications of which COG members (if any) have been characterized genetically and biochemically; information on the domain architecture of the proteins constituting the COG and the three-dimensional structure of the domains if known or predictable; a succinct summary of the common structural and functional features of the COG members, as well as peculiarities of individual members; and key references.

HomoloGene

HomoloGene is a database of both curated and calculated orthologs and homologs for the organisms represented in NCBI's UniGene database. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information Network (ZFIN) database at the University of Oregon, and from published reports. Computed orthologs and homologs are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet clusters in which orthologous clusters in two organisms are both orthologous to the same cluster in a third organism. HomoloGene can be searched via the Entrez retrieval system.
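The triplet idea can be sketched as follows, assuming the pairwise ortholog calls are already available. The function name `find_triplets`, the `(organism, cluster_id)` representation, and the sample cluster identifiers are hypothetical; this is an illustration of the definition, not HomoloGene's actual procedure.

```python
from itertools import combinations


def find_triplets(ortholog_pairs):
    """Return triples of clusters, one per organism, that are pairwise orthologous.

    Each cluster is an (organism, cluster_id) tuple and each element of
    ortholog_pairs is a pair of such clusters (an assumed representation).
    """
    pairs = {frozenset(p) for p in ortholog_pairs}
    clusters = sorted({c for p in pairs for c in p})
    triplets = []
    for a, b, c in combinations(clusters, 3):
        # The three clusters must come from three different organisms.
        if len({a[0], b[0], c[0]}) != 3:
            continue
        needed = {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))}
        if needed <= pairs:
            triplets.append((a, b, c))
    return triplets


# Illustrative identifiers only.
example_pairs = [
    (("human", "Hs.1"), ("mouse", "Mm.1")),
    (("human", "Hs.1"), ("rat", "Rn.1")),
    (("mouse", "Mm.1"), ("rat", "Rn.1")),
    (("human", "Hs.2"), ("mouse", "Mm.2")),
]
triplets = find_triplets(example_pairs)
```

Checking every combination of three clusters is slow for large inputs; the exhaustive loop is used here only because it mirrors the definition directly.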

UniGene is a system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information, such as the tissue types in which the gene has been expressed and map location.

Entrez Genome

The whole genomes of over 1,200 organisms can be found in Entrez Genome. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and eukaryotes) are represented, as well as many viruses, viroids, plasmids, and eukaryotic organelles. Data can be accessed hierarchically starting from either an alphabetical listing or a phylogenetic tree for complete genomes in each of six principal taxonomic groups. One can follow the hierarchy to a variety of graphical overviews, including that of the whole genome of a single organism, a single chromosome, or even a single gene. At each level, one can access multiple views of the data, pre-computed summaries, and links to analyses appropriate for that level. In addition, any gene product (protein) that is a member of a COG is linked to the COGs database. A summary of COG functional groups is also presented in tabular and graphical formats at the genome level.

For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are presented graphically and linked to NCBI's Cn3D macromolecular viewer, which allows the interactive display of three-dimensional structures and sequence alignments.

PDBeast: Taxonomy in MMDB

NCBI's Structure Group, in collaboration with NCBI taxonomists, has undertaken taxonomy annotation for the three-dimensional structure data stored in the Molecular Modeling Database (MMDB). A semi-automated approach has been implemented in which a human expert checks, corrects, and validates automatic taxonomic assignments in MMDB. The PDBeast software tool was developed by NCBI for this purpose. It pulls text descriptions of "Source Organisms" from either the original entries or user-specified information and looks for matches in the NCBI Taxonomy database to record taxonomy assignments.
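The semi-automated workflow can be pictured with a small sketch: exact name matches are assigned automatically, and anything left unmatched is queued for a human expert. The function name `assign_taxonomy`, the entry identifiers, and the exact-match rule are assumptions made for illustration and do not describe how PDBeast itself is implemented.

```python
def assign_taxonomy(source_texts, taxonomy_names):
    """Match "Source Organism" text against known taxonomy names.

    source_texts maps an entry identifier to its free-text organism
    description; taxonomy_names maps a scientific name to a taxonomy ID.
    Returns (assigned, needs_review).
    """
    lookup = {name.lower(): taxid for name, taxid in taxonomy_names.items()}
    assigned, needs_review = {}, []
    for entry_id, text in source_texts.items():
        # Normalize case and collapse runs of whitespace before matching.
        key = " ".join(text.lower().split())
        if key in lookup:
            assigned[entry_id] = lookup[key]
        else:
            # Left for a human expert to check and correct.
            needs_review.append(entry_id)
    return assigned, needs_review


# Entry identifiers and descriptions are illustrative.
names = {"Homo sapiens": 9606, "Escherichia coli": 562}
texts = {
    "entry1": "HOMO  SAPIENS",
    "entry2": "escherichia coli",
    "entry3": "human",
}
assigned, needs_review = assign_taxonomy(texts, names)
```

Here the third description is a common name with no exact match, so it falls to the reviewer rather than being guessed at automatically.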

The Molecular Modeling Database (MMDB) is a compilation of three-dimensional structures of biomolecules obtained from the Protein Data Bank (PDB). The PDB, managed and maintained by the Research Collaboratory for Structural Bioinformatics, is a collection of all publicly available three-dimensional structures of proteins, nucleic acids, carbohydrates, and a variety of other complexes experimentally determined by X-ray crystallography and NMR. The difference between the two databases is that MMDB records reorganize and validate the information stored in the PDB in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction.

The Importance of Molecular Phylogenetics

The field of molecular phylogenetics has grown, both in size and in importance, since its inception in the early 1990s, a growth attributable mostly to advances in molecular biology and to more rigorous methods for phylogenetic tree building. The importance of phylogenetics has also been greatly enhanced by the successful application of tree reconstruction, as well as other phylogenetic techniques, to more diverse and perplexing issues in biology. Today, a survey of the scientific literature will show that molecular biology, genetics, evolution, development, behavior, epidemiology, ecology, systematics, conservation biology, and forensics are but a few examples of the many disparate fields conceptually united by the methods and theories of molecular phylogenetics. Phylogenies are used essentially the same way in all of these fields, with inferences drawn either from the structure of the tree or from the way the character states map onto the tree. Biologists can then use these clues to build hypotheses and models of important events in history. Broadly speaking, the relationships established by phylogenetic trees often describe a species' evolutionary history and, hence, its phylogeny: the historical relationships among lineages or organisms or their parts, such as their genes. Phylogenies may be thought of as a natural and meaningful way to order data, with an enormous amount of evolutionary information contained within their branches. Scientists working in these different areas can then use these phylogenies to study and elucidate the biological processes occurring at many levels of life's hierarchy.

Revised: April 1, 2004.
