ORIOGEN (Order Restricted Inference for Ordered Gene ExpressioN)
Developed by:
Shyamal D. Peddada
Biostatistics Branch
National Institute of Environmental Health Sciences
National Institutes of Health
peddada@niehs.nih.gov
Programmed by:
John Zajd and Shawn Harris
Constella Group, Inc.
oriogen@ConstellaGroup.com
© Copyright 2004-2005
ORIOGEN Version 2.2 Release Description
ORIOGEN is a user-friendly Java-based software package for selecting and clustering genes according to their time-course or dose-response profiles. It is based on the methodology developed in Peddada et al. (2003), with a few modifications described below.
The user pre-specifies a list of profiles (or patterns) of mean gene expression over time/dose that may be of interest for a specific experiment. The present version of ORIOGEN can detect increasing, decreasing, umbrella-shaped, inverted-umbrella-shaped, and cyclic patterns (cyclic patterns are limited to one cycle).
Here the word "mean" refers to the population mean (which is unknown) and not the sample mean, which is calculated from the given data and provides an estimate of the population mean. Thus the profiles are described in terms of the population means. Because a sample mean is a random realization from a population, the observed sample mean expressions over time/dose may not conform exactly to the pattern satisfied by the population means. For example, the experimenter may be interested in selecting a gene whose mean expression increases with time/dose (known as an increasing shape). However, due to randomness in the data, the observed sample means may not have an increasing profile. Similarly, an experimenter may be interested in selecting genes whose mean expression increases with time/dose up to a certain point and then decreases (i.e., an umbrella shape). Again, due to randomness in the data, the sample means may not follow this pattern.
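The distinction between population and sample profiles can be illustrated with a minimal Python sketch. The function names and all numbers below are hypothetical and are not part of ORIOGEN; the checks use non-strict inequalities, as order restrictions do.

```python
def is_increasing(means):
    """True if each mean is <= the next one (non-strict inequalities)."""
    return all(a <= b for a, b in zip(means, means[1:]))


def is_umbrella(means, peak):
    """True if the means rise (non-strictly) up to index `peak`, then fall."""
    rising = is_increasing(means[:peak + 1])
    falling = is_increasing(means[peak:][::-1])
    return rising and falling


# Hypothetical population means that follow an increasing profile.
population_means = [1.0, 1.2, 1.4, 1.6]
# Hypothetical sample means observed for the same gene in one experiment.
sample_means = [1.05, 1.31, 1.27, 1.58]

print(is_increasing(population_means))  # True
print(is_increasing(sample_means))      # False, because 1.31 > 1.27
```

The sample means violate the increasing pattern at one step even though the underlying population means satisfy it, which is why selection relies on a statistical test rather than on a direct check of the sample means.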
ORIOGEN does not normalize the data, so it is recommended that the user pre-process the data by applying a suitable normalization method before submitting the data to ORIOGEN. ORIOGEN selects genes, based on a statistical decision rule with a pre-specified level of significance, and clusters each selected gene into an appropriate "best-fitting" pattern or profile. The methodology can be briefly described as follows.
ORIOGEN expresses each pre-specified profile in terms of mathematical inequalities (known as order restrictions) between the mean expressions.
Then using the methodology developed in Hwang and Peddada (1994) it fits each pre-specified profile to each gene.
Thus for a given gene, ORIOGEN computes a "goodness-of-fit" statistic for each candidate profile.
For a gene g, it then tests for significance using a minor modification to the statistic obtained in Step 3 of Peddada et al. (2003).
This modification replaces Step 7 described in Peddada et al. (2003).
The purpose of this modification is to not only test the null hypothesis that the mean expression stays constant across all time points/doses against the alternative hypothesis of pre-specified profiles, but also to test that at least one of the population means is significantly different from zero.
Thus the modified test statistic is
[Equation (1)]
where, for gene g, the quantities entering the statistic are the pooled sample standard deviation over all time points/dose groups; n_j and n_k, the numbers of replicates at the endpoints of the region; the sample mean of the i-th time point/dose group; and n_i, the number of replicates at that point.
As in Peddada et al. (2003), the P-value of the test is obtained by bootstrapping the null distribution of the above statistic.
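The general idea of a bootstrap P-value can be sketched in Python as follows. This is an illustration under stated assumptions, not ORIOGEN's implementation: `stand_in_statistic` is a hypothetical placeholder (range of group means over the pooled standard deviation) rather than the statistic in (1), and resampling residuals around each group mean is one common way to mimic a constant-mean null hypothesis, not necessarily the scheme ORIOGEN uses.

```python
import random
from statistics import mean


def pooled_sd(groups):
    """Pooled sample standard deviation across all groups."""
    ss = 0.0
    for g in groups:
        m = mean(g)
        ss += sum((x - m) ** 2 for x in g)
    df = sum(len(g) - 1 for g in groups)
    return (ss / df) ** 0.5


def stand_in_statistic(groups):
    """Placeholder statistic (assumption): range of group means / pooled SD."""
    means = [mean(g) for g in groups]
    spread = max(means) - min(means)
    sd = pooled_sd(groups)
    if sd == 0:
        # Degenerate resample with no within-group variation.
        return float("inf") if spread > 0 else 0.0
    return spread / sd


def bootstrap_p_value(groups, statistic, n_boot=2000, seed=0):
    """Share of null resamples whose statistic is >= the observed value."""
    rng = random.Random(seed)
    observed = statistic(groups)
    # Residuals carry no trend in the mean, so resampling them
    # imitates data generated under a constant mean.
    residuals = []
    for g in groups:
        m = mean(g)
        residuals.extend(x - m for x in g)
    hits = 0
    for _ in range(n_boot):
        resampled = [[rng.choice(residuals) for _ in g] for g in groups]
        if statistic(resampled) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate strictly positive.
    return (hits + 1) / (n_boot + 1)


# Hypothetical expression values: 4 time points, 3 replicates each.
trending_gene = [[1.0, 1.3, 0.8], [2.1, 1.9, 2.4], [3.2, 2.9, 3.0], [4.1, 3.7, 4.2]]
flat_gene = [[1.0, 1.3, 0.8], [1.1, 0.9, 1.2], [1.2, 0.9, 1.0], [1.1, 0.7, 1.2]]

p_trending = bootstrap_p_value(trending_gene, stand_in_statistic)
p_flat = bootstrap_p_value(flat_gene, stand_in_statistic)
```

With data like these, the trending gene would be expected to receive a much smaller P-value than the flat gene. The fixed seed only makes the sketch reproducible.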
Once a gene is declared significant by the above process, it is initially assigned to the profile with the largest goodness-of-fit statistic. However, in some situations the cluster assignment may require further refinements. For instance, consider the following two profiles:
In Figure 1 the mean expression of a gene (denoted by blue dots) increases from the first to the second time point and stays up at time point 3, whereas in Figure 2 the mean expression increases from time point 1 to time point 2 and then decreases at time point 3. In theory, both describe an umbrella shape, because the alternative hypotheses do not impose strict inequalities between the population means; equality ("=") is allowed. Figure 2 has a strict inequality between the means, while the profile in Figure 1 allows for equality between the means of the second and third time points. Thus the profile in Figure 1 can also be viewed as an increasing profile. In some applications, it may be important to distinguish between these two profiles. Similar refinements and re-classifications may be necessary for other profiles. ORIOGEN attempts to refine the clusters as follows.
For any given profile, we define a "segment" to be the collection of all points on the profile that are included between two "change-points". For example, in an umbrella order there exists one change-point, the point at which the profile changes its direction from an increasing shape to a decreasing shape. Thus in this case there are two segments. In the case of cyclic order with a single period (or one cycle), there are two change points and hence three segments. In the case of an increasing order there is exactly one segment and no change points.
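The segment definition can be made concrete with a short Python sketch. The function name is hypothetical, and the sketch assumes that each change-point belongs to both of the segments it separates; the text does not state how ORIOGEN handles this detail.

```python
def split_into_segments(n_points, change_points):
    """Return the time-point indices of each segment of a profile.

    Assumption: a change-point is shared by the two segments it separates.
    """
    bounds = [0] + sorted(change_points) + [n_points - 1]
    return [list(range(lo, hi + 1)) for lo, hi in zip(bounds, bounds[1:])]


# Increasing order: no change-points, one segment.
print(split_into_segments(5, []))      # [[0, 1, 2, 3, 4]]
# Umbrella order: one change-point (the peak), two segments.
print(split_into_segments(5, [2]))     # [[0, 1, 2], [2, 3, 4]]
# Cyclic order with one cycle: two change-points, three segments.
print(split_into_segments(6, [2, 4]))  # [[0, 1, 2], [2, 3, 4], [4, 5]]
```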
If the selected profile is either an umbrella or a cyclic profile then corresponding to each segment of the profile we compute the test statistic defined in (1), with pooled sample standard deviation computed from the data corresponding to the specific segment. We then order the segments according to the value of their test statistics.
In the case of an umbrella-shaped profile, we test whether the segment with the smallest test statistic is flat or not flat using bootstrap methodology. If the bootstrap P-value is less than the chosen level of significance then we retain the umbrella-shaped profile for the gene. If it is not significant then the gene is re-classified by replacing the segment corresponding to the smallest test statistic by a flat line. Thus in this case the gene is re-classified into an increasing (or decreasing) profile.
In the case of a cyclic profile, we perform the above test on each of the two segments that correspond to the two smaller test statistics. If both bootstrap P-values are significant then we retain the cyclic shape. If neither is significant, then the two segments are replaced with flat lines, thus resulting in an increasing or decreasing profile. If only one of them is significant then the gene will be re-classified as either increasing, decreasing, or an umbrella profile, depending on the location of the segment that is replaced by a flat line.
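The decision logic of this refinement step can be sketched in Python. The function names, the "up"/"down"/"flat" labels, and the example numbers are hypothetical; the per-segment statistics and bootstrap P-values are taken as given inputs rather than computed here.

```python
def name_profile(directions):
    """Name the shape that remains once flat segments are dropped."""
    merged = []
    for d in directions:
        # Skip flat segments and merge neighbours with the same direction.
        if d != "flat" and (not merged or merged[-1] != d):
            merged.append(d)
    names = {
        (): "flat",
        ("up",): "increasing",
        ("down",): "decreasing",
        ("up", "down"): "umbrella",
        ("down", "up"): "inverted umbrella",
    }
    return names.get(tuple(merged), "cyclic")


def reclassify(directions, statistics, p_values, alpha=0.10):
    """Flatten tested segments that are not significant, then rename."""
    order = sorted(range(len(directions)), key=lambda i: statistics[i])
    # Every segment except the one with the largest statistic is tested:
    # one segment for an umbrella, two for a one-cycle cyclic profile.
    tested = order[:-1]
    refined = list(directions)
    for i in tested:
        if p_values[i] >= alpha:
            refined[i] = "flat"
    return name_profile(refined)


# Umbrella whose weaker (decreasing) segment is not significant.
print(reclassify(["up", "down"], [5.0, 1.2], [0.001, 0.40]))  # increasing
# Cyclic profile whose last segment is not significant.
print(reclassify(["up", "down", "up"], [4.0, 3.0, 1.0], [0.01, 0.02, 0.50]))  # umbrella
```

The default `alpha=0.10` mirrors the default re-classification level stated in this document.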
Along the same lines, if a gene is initially classified as having an increasing profile, and if the fitted profile has a flat region in the middle, then we test the null hypothesis of a flat middle segment against the alternative that the middle segment is decreasing. If the null hypothesis is rejected then we re-classify the gene to a cyclic profile. A similar decision is made for genes that are initially classified as decreasing.
The above re-classification of profiles does not affect the gene selection process. The default level of significance for re-classification is set at 0.10. However, a user can modify this choice by clicking the "Advanced" button and entering the desired level of significance for re-classification. Note that lowering this level of significance makes it more likely that a gene's dose-response/time-course profile will contain "flat" regions.
The results for significant genes are saved to a text file and their profile fitted means as well as the raw sample means are displayed graphically. Q-values for each selected gene are calculated and stored in the output file (Storey, 2002). Gene ontology information for significant genes is provided where available.
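For readers unfamiliar with q-values, the following Python sketch shows the general form of the calculation in Storey (2002). It is a simplified illustration, not ORIOGEN's code: the function name is hypothetical, the tuning parameter `lam` is fixed at 0.5 as an assumption, and falling back to a null proportion of 1 when no P-value exceeds `lam` is a conservative choice made here for the sketch.

```python
def q_values(p_values, lam=0.5):
    """Simplified Storey-type q-values with a fixed tuning parameter."""
    m = len(p_values)
    # Estimated proportion of true null hypotheses, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    if pi0 == 0.0:
        pi0 = 1.0  # assumption: conservative fallback for this sketch
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that
    # q-values never decrease as P-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


# Hypothetical P-values for eight genes.
p = [0.01, 0.02, 0.03, 0.04, 0.6, 0.8, 0.9, 0.7]
q = q_values(p)
```

A q-value can be read as the estimated false discovery rate incurred if the gene in question, and every gene with a smaller P-value, is declared significant.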
Assumptions and limitations:
Acknowledgements
We thank Drs. Leping Li, David M. Umbach and Clarice Weinberg, Biostatistics Branch, NIEHS, for numerous discussions and their feedback during the preparation of this software.
References
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall Inc, New York, NY.
Hwang, J. and Peddada, S. (1994). Confidence interval estimation subject to order restrictions. Annals of Statistics, 22, 67-93.
Liu, D., Umbach, D. M., Peddada, S. D., Li, L., Crockett, P., and Weinberg, C. (2004). A random-periods model for expression of cell-cycle genes. Proceedings of the National Academy of Sciences, 101(19), 7240-7245.
Peddada, S., Prescott, K., and Conaway, M. (2001). Tests for order restrictions in binary data. Biometrics, 57, 1219-1227.
Peddada, S., Lobenhofer, E., Li, L., Afshari, C., Weinberg, C., and Umbach, D. M. (2003). Gene selection and clustering for time-course and dose-response microarray experiments using order restricted inference. Bioinformatics, 19(7), 834-841.
Storey, J.D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. B, 64, 479-498.