DAVID Bioinformatics
The Database for Annotation, Visualization and Integrated Discovery
DAVID Bioinformatics Resources 2008
National Institute of Allergy and Infectious Diseases (NIAID), NIH
Help
Gene Functional Classification


1.  Introduction
2.  General Analysis Data Flow
3.  Options
4.  View Results in Text Mode
5.  View Gene-Annotation Association on 2-D View
6.  View Results in HeatMap
7.  Heuristic Multiple Linkage Clustering

    1. Introduction

Grouping genes based on functional similarity can systematically enhance the biological interpretation of large lists of genes derived from high-throughput studies. The Functional Classification Tool generates a gene-to-gene similarity matrix based on shared functional annotation, using over 75,000 terms from 14 functional annotation sources. Our novel clustering algorithm classifies highly related genes into functionally related groups. Tools are provided to further explore each functional gene cluster, including a listing of the "consensus terms" shared by the genes in the cluster, a display of enriched terms, and a heat map visualization of gene-to-term relationships. A global view of cluster-to-cluster relationships is provided using a fuzzy heat map visualization. Summary information provided by the Functional Classification Tool is extensively linked to the DAVID Functional Annotation Tools and to external databases, allowing further detailed exploration of gene and term information. The Functional Classification Tool provides a rapid means to organize large lists of genes into functionally related groups to help unravel the biological content captured by high-throughput technologies.
    2. General Analysis Data Flow



    3. Options

Standard Option:


Clustering Stringency (lowest -> highest): a single high-level control that sets the detailed parameters used by the functional classification algorithms. In general, a higher stringency setting generates fewer functional groups with more tightly associated genes in each group, so more genes are treated as "irrelevant" and placed in the unclustered group. The default setting is Medium, which gives balanced results for most cases based on our studies. Customize lets you set the individual parameters yourself through the Advanced Options.

Advanced Options:
   
Similarity Term Overlap (any value >=0; default = 4):
the minimum number of annotation terms that two genes must share to qualify for the kappa calculation. This parameter maintains the statistical power needed to make the kappa value meaningful. The higher the value, the more meaningful the result.

Similarity Threshold (any value between 0 and 1; default = 0.35):
the minimum kappa value considered biologically significant. The higher the setting, the more genes are placed in the unclustered group, which leads to a higher-quality functional classification with fewer groups and fewer gene members. Based on our genome-wide distribution study, kappa values of 0.3 and above start to give meaningful biology; anything below 0.3 has a high chance of being noise.
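
To make the two similarity parameters concrete, here is a minimal sketch of how a pair of genes could be tested. It assumes the kappa value is the standard Cohen's kappa computed over binary (present/absent) annotation profiles; this page does not give the formula, and the function names `kappa` and `is_related` are hypothetical, not part of DAVID.

```python
def kappa(terms_a, terms_b, all_terms):
    """Cohen's kappa between two genes' binary annotation profiles.

    Assumption: each gene is described by the set of terms it carries,
    and both sets are subsets of all_terms (the full term universe).
    """
    n = len(all_terms)
    both = len(terms_a & terms_b)        # terms annotated to both genes
    only_a = len(terms_a - terms_b)      # terms annotated to gene A only
    only_b = len(terms_b - terms_a)      # terms annotated to gene B only
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    chance = ((both + only_a) * (both + only_b)
              + (neither + only_b) * (neither + only_a)) / (n * n)
    if chance == 1.0:
        return 0.0  # degenerate profiles: agreement is entirely by chance
    return (observed - chance) / (1.0 - chance)


def is_related(terms_a, terms_b, all_terms, min_overlap=4, threshold=0.35):
    """Apply Similarity Term Overlap first, then Similarity Threshold."""
    if len(terms_a & terms_b) < min_overlap:
        return False  # too few shared terms for kappa to be meaningful
    return kappa(terms_a, terms_b, all_terms) >= threshold


# Toy data: 20 terms in total, two genes sharing 4 of their 5 terms.
universe = set(range(20))
gene_a = {0, 1, 2, 3, 4}
gene_b = {0, 1, 2, 3, 5}
print(round(kappa(gene_a, gene_b, universe), 3))
print(is_related(gene_a, gene_b, universe))
```

In this sketch the overlap check runs before the kappa calculation, so two genes with identical but very short annotation profiles are still rejected.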

Initial Group Members (any value >=2; default = 4):
the minimum number of genes in a seeding group, which affects the minimum size of each functional group in the final result. In general, a lower value includes more genes in functional groups and, in particular, generates many small groups.

Final Group Members (any value >=2; default = 4):
the minimum number of genes in a final group after the "cleanup" procedure. In general, a lower value includes more genes in functional groups and, in particular, generates many small groups. It works together with the previous parameters to control the minimum size of functional groups. If you are interested in functional groups containing only 2 or 3 genes, you need to set it to a very low value. Otherwise, such small groups are not displayed and their genes are placed in the unclustered group.

Multi-linkage Threshold (any value between 0% and 100%; default = 50%):
It controls how seeding groups merge with each other: two groups that share more than the given percentage of their gene members become one group. In general, a higher percentage gives sharper separation, i.e. it generates more final functional groups with more tightly associated genes in each group. Changing this parameter does not move additional genes into the unclustered group.
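
The effect of the Multi-linkage Threshold can be illustrated with a small sketch. It rests on two assumptions that this page does not confirm: the shared fraction is measured against the smaller of the two groups, and a pair merges when that fraction is at or above the threshold. The names `shared_fraction` and `merge_groups` are hypothetical.

```python
def shared_fraction(group_a, group_b):
    """Fraction of the smaller group's members found in both groups.

    Assumption: the smaller group is the denominator.
    """
    smaller = min(len(group_a), len(group_b))
    if smaller == 0:
        return 0.0
    return len(group_a & group_b) / smaller


def merge_groups(seeds, threshold=0.5):
    """Merge groups pairwise until no pair reaches the threshold."""
    groups = [set(s) for s in seeds]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if shared_fraction(groups[i], groups[j]) >= threshold:
                    groups[i] |= groups[j]   # absorb group j into group i
                    del groups[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return groups


# Toy seeding groups: the first two share 2 of their 4 genes.
seeds = [
    {"g1", "g2", "g3", "g4"},
    {"g3", "g4", "g5", "g6"},
    {"g7", "g8", "g9", "g10"},
]
print(len(merge_groups(seeds, threshold=0.5)))   # lower threshold, more merging
print(len(merge_groups(seeds, threshold=0.75)))  # higher threshold, more groups
```

Whatever the threshold, every gene that was in a seeding group is still in some final group, which matches the statement that this parameter does not add genes to the unclustered group.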


    4. View Results in Text Mode


       
Gene(s) not in the output:
Any genes in the user's list that are NOT mapped to any of the functional groups, i.e. orphan genes or irrelevant genes. The possible reasons are:
1. The gene has no relationship above the similarity threshold with any other gene.
2. The gene is related to a few other genes, but together they do not have enough members to form a functional group under the minimum final group members setting.
3. False negative. Our current algorithm can have a false negative rate of up to 2%. If you believe this has happened with your list, please report it to us.

Enriched Term in Group (T):
It submits the gene members of the group to our functional annotation engine. The resulting DAVID chart report highlights the most likely biology associated with the group.

2-D View:
It lets the user see gene members and their associated annotation terms in a heat-map style view, so that the user can further explore the gene-gene and term-term relationships within a group. The terms displayed in the map must pass the term frequency setting in the Options section; by default, a term must be associated with 50% of the genes.
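
As an illustration of the term frequency setting, the sketch below keeps only the terms carried by at least a given fraction of the genes in a group. Whether DAVID applies "at least" or "more than" at the 50% boundary is not stated here, so the `>=` comparison is an assumption, and `frequent_terms` is a hypothetical helper name.

```python
from collections import Counter


def frequent_terms(gene_terms, min_fraction=0.5):
    """Return the terms annotated to at least min_fraction of the genes.

    gene_terms maps each gene in the group to its set of annotation terms.
    """
    counts = Counter(
        term for terms in gene_terms.values() for term in set(terms)
    )
    cutoff = min_fraction * len(gene_terms)
    return sorted(term for term, count in counts.items() if count >= cutoff)


# Toy group of four genes with made-up annotations.
group = {
    "g1": {"kinase", "ATP binding"},
    "g2": {"kinase", "nucleus"},
    "g3": {"kinase", "ATP binding"},
    "g4": {"kinase"},
}
print(frequent_terms(group))
```

With this toy group, "nucleus" is carried by only one gene in four and is dropped, while the other two terms are kept.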

Group Enrichment Score:
It ranks the biological significance of gene groups based on the overall EASE scores of all enriched annotation terms. In other words: step 1, run the user's gene list through the DAVID functional annotation chart to get a p-value (EASE score) for each enriched annotation term; step 2, calculate the geometric mean of the EASE scores of the terms involved in this gene group.
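
Step 2 can be sketched as follows. The geometric mean comes from the description above; reporting it on a minus log10 scale, so that larger scores mean stronger enrichment, is an assumption of this sketch, and the function names are hypothetical. The p-values are made up for illustration.

```python
import math


def geometric_mean(p_values):
    """Geometric mean of a non-empty list of p-values in (0, 1]."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    log_sum = sum(math.log(p) for p in p_values)
    return math.exp(log_sum / len(p_values))


def enrichment_score(p_values):
    """Assumed presentation: minus log10 of the geometric mean."""
    return -math.log10(geometric_mean(p_values))


# Made-up EASE scores for the terms enriched in one gene group.
ease_scores = [0.01, 0.0001]
print(geometric_mean(ease_scores))
print(enrichment_score(ease_scores))
```

Working in log space avoids underflow when a group has many very small p-values.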

Search Related Genes (RG):
It summarizes the common (consensus) annotation term profile of the functional group based on term frequency and asks the question "which other genes have a similar annotation term profile?". The function lets the user search within the user's list or within a defined genome, e.g. Homo sapiens.

2-D View:
It lets users examine the commonalities and differences in annotations across the gene members of the group. See the 2-D View section for more details.



 
5. View Gene-Annotation Association on 2-D View



    6. View Results in HeatMap

Overall View:
 
Hierarchical Heat Map vs. Fuzzy Heat Map:
The major difference between a traditional heat map and a fuzzy heat map is that the latter allows genes and terms to appear multiple times within the heat map, providing a much clearer view of gene-to-term relationships within a cluster of related genes and a much clearer view of cluster-to-cluster relationships.

Global View of Fuzzy Heat Map:

Detail View of Fuzzy Heat Map:
 
7. Heuristic Multiple Linkage Clustering

We developed a novel heuristic partitioning procedure that allows an object (gene) to participate in more than one cluster. Using this method to group related genes better reflects the nature of biology, in that a given gene may be associated with more than one functional group of genes. Two additional advancements included in this algorithm are: 1) the automatic determination of the optimal number of clusters (K), and 2) the exclusion of members (genes) that have weak relationships to other members. Users can change the default parameters to set the similarity stringency for cluster membership. Fuzzy heuristic partitioning of a gene list yields high-quality clusters of highly related genes, with some genes participating in more than one functional cluster.
Algorithm:
o Fuzzy seeding by allowing each gene to serve as a medoid (number of neighbors > 4 and cross relevance > 50%)
o Merge seeding clusters by multiple linkage
o Repeat step 2 until no further merges are needed
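
The first step above, fuzzy seeding, can be sketched as follows (the merge step is not shown). Several details are assumptions, because this page does not define them: "cross relevance" is taken to be the fraction of member pairs in a candidate group whose kappa passes the similarity threshold, the size check uses the Initial Group Members default of 4 applied to the medoid plus its neighbors, and the kappa scores are assumed to be precomputed. The names `seed_groups` and `outliers` are hypothetical.

```python
from itertools import combinations


def seed_groups(genes, kappa, threshold=0.35, min_members=4, min_cross=0.5):
    """Let every gene act as a medoid and keep the qualified seeding groups.

    kappa maps frozenset({gene_a, gene_b}) to a precomputed kappa score;
    missing pairs are treated as unrelated.
    """
    def related(a, b):
        return kappa.get(frozenset((a, b)), 0.0) >= threshold

    seeds = set()
    for medoid in genes:
        group = [medoid] + [g for g in genes if g != medoid and related(medoid, g)]
        if len(group) < min_members:
            continue  # not enough closely related members
        pairs = list(combinations(group, 2))
        passing = sum(1 for a, b in pairs if related(a, b))
        if passing / len(pairs) > min_cross:   # assumed cross-relevance rule
            seeds.add(frozenset(group))
    return seeds


def outliers(genes, seeds):
    """Genes not covered by any qualified seeding group."""
    covered = set().union(*seeds) if seeds else set()
    return set(genes) - covered


# Toy data: a, b, c, d are tightly related; e is related to a only; f to nothing.
genes = ["a", "b", "c", "d", "e", "f"]
scores = {frozenset(p): 0.8 for p in combinations("abcd", 2)}
scores[frozenset(("a", "e"))] = 0.5

seeds = seed_groups(genes, scores)
print(sorted(sorted(s) for s in seeds))
print(sorted(outliers(genes, seeds)))
```

Because every gene gets a turn as medoid, the same gene can appear in several seeding groups, which is what lets a gene end up in more than one final group.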


Figure: Graphic illustration of the heuristic fuzzy partition algorithm. A. Hypothetically, each element (gene) can be positioned in a virtual two-dimensional space based on its characteristics (annotation terms). The distance represents the degree of relationship (kappa score) among the genes. B. Any gene has a chance to act as a medoid and form an initial seeding group. Only the initial groups with enough closely related members (e.g. members > 3 and kappa score >= 0.4) are qualified (solid circles); unqualified ones are shown as dashed circles. Importantly, the genes not covered by any qualified initial seeding group are considered outliers (gray), which are carried along but do not participate in the next steps. C. The qualified initial seeding groups are iteratively merged with each other to form larger groups based on the multi-linkage rule, i.e. sharing 50% or more of their members, until all secondary clusters (thicker ovals) are stable. D. Finally, three final groups (thicker ovals) are formed because they can no longer be merged with any other group. One gene (in red) belonging to two groups represents the fuzziness capability of the algorithm. Outliers (in gray) are removed for clearer presentation.

A hypothetical step-by-step example: example.doc
 

Last Edit:  Jan. 2007







 Please cite the web site or Genome Biology 2003; 4(5):P3 within any publication that makes use of any methods inspired by DAVID.
                          

        

                 