Analysis Method: vartest
Brian T. Luke (lukeb@ncifcrf.gov)
This procedure, based on the relevance index proposed by Yip and coworkers [Yip-03], attempts to find those features where the variance of intensities within each category of samples is smallest relative to the total variance of all intensities. If $s_i^2$ is the estimated variance of the samples in category $i$, and $s^2$ is the estimated variance for all samples, they are defined by the following formulas.

$$s_i^2 = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \left( x_{ik} - \bar{x}_i \right)^2$$

$$s^2 = \frac{1}{N - 1} \sum_{j=1}^{N} \left( x_j - \bar{x} \right)^2$$

In the top equation, $x_{ik}$ is the intensity of the $k$th sample in category $i$, $\bar{x}_i$ is the average intensity for all samples in this category, and $n_i$ is the number of samples in this category. The second equation sums over all $N$ samples, and $\bar{x}$ is the average intensity for this feature. The vartest score, $V$, for a given feature in a dataset containing $C$ categories is given by the following.

$$V = \sum_{i=1}^{C} \frac{s_i^2}{s^2}$$
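As a minimal sketch of how such a score could be computed, the Python snippet below assumes that V is the sum of the per-category sample variances divided by the sample variance of all intensities, with the unbiased divisor (n - 1) used throughout. The function name `vartest_score` and the example intensities are hypothetical and are not taken from the original software.

```python
from statistics import variance  # sample variance, divisor (n - 1)


def vartest_score(groups):
    """Return V for one feature, assuming V = sum_i(s_i^2) / s^2.

    groups: one list of intensities per category, each with at least
    two samples. The divisor (n - 1) is an assumption of this sketch.
    """
    pooled = [x for group in groups for x in group]
    total_var = variance(pooled)
    if total_var == 0:
        raise ValueError("all intensities are identical; V is undefined")
    return sum(variance(group) for group in groups) / total_var


# Hypothetical feature: two tight, well-separated categories.
cases = [1.0, 2.0, 3.0]
controls = [11.0, 12.0, 13.0]
v = vartest_score([cases, controls])
print(f"V = {v:.4f}")
```

Under these assumptions, a feature whose categories are tight and well separated gives a V close to zero, while a feature with no category information gives a V near C, the number of categories.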
The features are then ranked from lowest to highest V, since the objective is to find the feature with the smallest intra-category variance relative to the total variance. In many instances with equal numbers of Cases and Controls, this procedure yields the same ordering of features as the student method.
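The ranking step can be sketched as follows. This is an illustration only: it assumes V is the sum of per-category sample variances divided by the total sample variance, and the feature names (`peak_A`, `peak_B`), the intensities, and the helper `rank_features` are all hypothetical.

```python
from statistics import variance  # sample variance, divisor (n - 1)


def vartest_score(groups):
    """Assumed form of the score: sum of category variances / total variance."""
    pooled = [x for group in groups for x in group]
    return sum(variance(group) for group in groups) / variance(pooled)


def rank_features(features):
    """Return (names sorted by ascending V, dict of V per feature).

    features: dict mapping a feature name to a (cases, controls) pair.
    """
    scores = {name: vartest_score(list(groups)) for name, groups in features.items()}
    return sorted(scores, key=scores.get), scores


# Hypothetical intensities for two features.
features = {
    # Categories well separated: small intra-category variance.
    "peak_A": ([10.0, 12.0, 11.0, 13.0], [30.0, 31.0, 29.0, 32.0]),
    # Categories heavily overlapping: little category information.
    "peak_B": ([10.0, 30.0, 20.0, 40.0], [15.0, 35.0, 25.0, 45.0]),
}
order, scores = rank_features(features)
for name in order:
    print(name, round(scores[name], 4))
```

Sorting in ascending order places the feature with the smallest relative intra-category variance first, which matches the lowest-to-highest ranking described above.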
The following table shows the results of examining 10,000 features representing either Feature-a or Feature-b and comparing their scores against the lowest V obtained from features with no information.
Cases/Controls (each) | Threshold V | 10a | 10b | 15a | 15b | 20a | 20b | 25a | 25b | 30a | 30b | 35a | 35b | 40a | 40b |
30 | 1.5699 | 15 | 16 | 35 | 28 | 85 | 72 | 247 | 169 | 606 | 322 | 1277 | 605 | 2435 | 1030 |
45 | 1.6193 | 2 | 4 | 13 | 12 | 49 | 28 | 175 | 88 | 510 | 239 | 1349 | 532 | 2883 | 1138 |
60 | 1.7698 | 22 | 22 | 112 | 75 | 390 | 258 | 1137 | 664 | 2640 | 1480 | 5028 | 2777 | 7460 | 4327 |
90 | 1.7686 | 3 | 4 | 18 | 15 | 145 | 80 | 622 | 265 | 2164 | 954 | 4885 | 2310 | 7893 | 4461 |
150 | 1.8802 | 33 | 25 | 281 | 177 | 1481 | 915 | 4537 | 2805 | 8040 | 5591 | 9700 | 8098 | 9982 | 9491 |
300 | 1.9251 | 560 | 461 | 3463 | 2522 | 8104 | 6613 | 9866 | 9338 | 9997 | 9949 | 10000 | 10000 | 10000 | 10000 |
As stated earlier, the first column represents the number of Cases and the number of Controls in each dataset. The second column is the threshold: the minimum value of V obtained from 10,000 features where the intensities for both Cases and Controls are randomly assigned within the range of 0.0 to 100.0. The remaining columns show how many of 10,000 randomly generated features have a value of V below this threshold. The headings for these columns show whether the features represent Feature-a or Feature-b, described previously, and the value of Za or 2Zb. For example, the column labeled 10a is for features that represent Feature-a with Za=10, while the column labeled 10b is for features that represent Feature-b with 2Zb=10 (Zb=5).
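The threshold column can be approximated with a small simulation. The sketch below is not the original program: it assumes V is the sum of per-category sample variances divided by the total sample variance, draws intensities uniformly from 0.0 to 100.0 as described above, and uses hypothetical helper names (`null_threshold`, `count_below`) and an arbitrary seed, so it will not reproduce the tabulated values exactly.

```python
import random
from statistics import variance  # sample variance, divisor (n - 1)


def vartest_score(groups):
    """Assumed form of the score: sum of category variances / total variance."""
    pooled = [x for group in groups for x in group]
    return sum(variance(group) for group in groups) / variance(pooled)


def null_threshold(n_per_class, n_features=10_000, seed=0):
    """Smallest V among features whose intensities carry no category information."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_features):
        cases = [rng.uniform(0.0, 100.0) for _ in range(n_per_class)]
        controls = [rng.uniform(0.0, 100.0) for _ in range(n_per_class)]
        best = min(best, vartest_score([cases, controls]))
    return best


def count_below(scores, threshold):
    """Number of feature scores that fall below the no-information threshold."""
    return sum(1 for v in scores if v < threshold)


# A reduced run (1,000 features instead of 10,000) to keep the example fast.
threshold = null_threshold(n_per_class=30, n_features=1_000)
print(f"approximate threshold for 30 Cases and 30 Controls: {threshold:.4f}")
```

Scores computed for the Feature-a or Feature-b datasets could then be passed to `count_below` together with this threshold to obtain counts analogous to those in the table.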
This procedure recognizes putative biomarkers represented by Feature-a slightly better than those for Feature-b. For datasets with 300 Cases and 300 Controls, approximately 81% of the features with Za=20 produced a lower V value than any observed feature with no information. In contrast, if 2Zb=20, only 66.1% of the features had V values below this threshold. As with the other methods examined, the ability to identify a weak putative biomarker is much better if the dataset contains more samples. If there are only 30 Cases and 30 Controls and the features have the form of Feature-a, there is at least a 50% chance of having a V value lower than 1.5699 if Za=50, meaning that the range of intensities for one State is only 50% that of the other. As the number of Cases and Controls increases from 45 to 150, the range of the smaller intensity State increases from 55% to 85% of the range of the larger intensity State. If the features have the form of Feature-b, the region of overlap increases from 70% to 85% as the number of Cases and Controls increases from 30 to 150.
(Last updated 4/29/07)