Analysis Method: extreme
Brian T. Luke (lukeb@ncifcrf.gov)
The Kolmogorov-Smirnov test may identify a feature as important if, in the middle of overlapping ranges, one category has a high density of points while the other has a very low density. This may be due to a sampling problem; what is really desired are features where one category has a large number of intensities that lie either above or below the intensity range of the samples in all other categories.
The extreme algorithm simply orders all intensities from lowest to highest and, starting from both extremes, finds the maximum number of samples from a single category that occur before a sample from another category is observed. Therefore, in contrast to the Kolmogorov-Smirnov test, this procedure works with more than two categories in the dataset.
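The ordering-and-counting step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's code: the function name `extreme_score` and the choice to ignore tied intensities are assumptions made for the example.

```python
def extreme_score(intensities, labels):
    """Longest single-category run at either end of the sorted intensities.

    Hypothetical helper for illustration; ties in intensity are not
    treated specially (an assumption, the text does not discuss them).
    """
    if len(intensities) != len(labels):
        raise ValueError("intensities and labels must have the same length")
    if not labels:
        return 0

    # Order the category labels by intensity, lowest to highest.
    ordered = [lab for _, lab in sorted(zip(intensities, labels), key=lambda p: p[0])]

    def run_length(seq):
        # Count samples until a label different from the first one appears.
        count = 0
        for lab in seq:
            if lab != seq[0]:
                break
            count += 1
        return count

    # Scan once from the low extreme and once from the high extreme.
    return max(run_length(ordered), run_length(ordered[::-1]))


# Three categories: two "C" samples lie below everything else,
# three "B" samples lie above everything else, so the score is 3.
score = extreme_score([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                      ["C", "C", "A", "B", "B", "B"])
```

Because the scan only asks whether the next label differs from the first one, nothing in the sketch depends on there being exactly two categories.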
The following table shows the results of examining 10,000 features representing either Feature-a or Feature-b and comparing their scores against the maximum score obtained from features with no information.
Each | Thresh | 10a | 10b | 15a | 15b | 20a | 20b | 25a | 25b | 30a | 30b | 35a | 35b | 40a | 40b |
30 | 13 | 1 | 4 | 12 | 9 | 60 | 7 | 272 | 22 | 761 | 61 | 1854 | 142 | 3640 | 278 |
45 | 14 | 5 | 4 | 99 | 9 | 548 | 43 | 2029 | 113 | 4557 | 379 | 7159 | 839 | 8929 | 1815 |
60 | 18 | 6 | 0 | 42 | 2 | 463 | 7 | 2139 | 43 | 5136 | 182 | 8035 | 538 | 9429 | 1375 |
90 | 17 | 147 | 3 | 1866 | 56 | 6168 | 400 | 9192 | 1714 | 9917 | 4186 | 9996 | 7021 | 10000 | 8966 |
150 | 14 | 6239 | 888 | 9819 | 5049 | 9999 | 9023 | 10000 | 9929 | 10000 | 9998 | 10000 | 10000 | 10000 | 10000 |
300 | 15 | 9992 | 8210 | 10000 | 9991 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
As stated earlier, the first column represents the number of Cases and the number of Controls in each dataset. The second column represents the maximum count for samples from a single category obtained from 10,000 features where the intensities for both Cases and Controls are randomly assigned within the range of 0.0 to 100.0; this number is reasonably insensitive to the number of Cases and Controls. The remaining columns show the number of times in 10,000 randomly generated features that a feature has a sample count above this threshold. The headings for these columns show whether the features represent Feature-a or Feature-b, described previously, and the value of Za or 2Zb. For example, the column labeled 10a is for features that represent Feature-a with Za=10, while the column labeled 10b is for features that represent Feature-b with 2Zb=10 (Zb=5).
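The no-information threshold in the second column could be estimated along the following lines. This is a rough sketch under stated assumptions: intensities are drawn uniformly from 0.0 to 100.0 as the text describes, but the function names, the use of Python's `random` module, and the seed are all hypothetical, so the numbers it produces need not match the table exactly.

```python
import random


def extreme_score(intensities, labels):
    # Hypothetical helper: longest single-category run at either end
    # of the intensities once they are sorted from lowest to highest.
    ordered = [lab for _, lab in sorted(zip(intensities, labels), key=lambda p: p[0])]
    if not ordered:
        return 0

    def run_length(seq):
        count = 0
        for lab in seq:
            if lab != seq[0]:
                break
            count += 1
        return count

    return max(run_length(ordered), run_length(ordered[::-1]))


def null_threshold(n_each, n_features=10_000, seed=0):
    """Largest extreme score seen among features that carry no information.

    n_each is the number of Cases and also the number of Controls.
    The seed is an assumption added only to make the sketch repeatable.
    """
    rng = random.Random(seed)
    labels = ["case"] * n_each + ["control"] * n_each
    best = 0
    for _ in range(n_features):
        # Both categories draw from the same uniform range, so any run
        # of one category at an extreme arises purely by chance.
        intensities = [rng.uniform(0.0, 100.0) for _ in labels]
        best = max(best, extreme_score(intensities, labels))
    return best


# Example: estimate the threshold for 30 Cases and 30 Controls.
threshold_30 = null_threshold(30)
```

A feature generated with real structure would then be counted as detected only if its own extreme score exceeds this threshold, which is the comparison the remaining columns of the table report.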
This procedure recognizes putative biomarkers represented by Feature-a better than those represented by Feature-b if the dataset contains a relatively small number of samples. For datasets with 300 Cases and 300 Controls, approximately 99.9% of the features with Za=10 produced a higher count than any observed feature with no information, while with 2Zb=10, 82.1% of the features had a higher count. If there are only 30 Cases and 30 Controls and the features have the form of Feature-a, there is at least a 50% chance of having a sample count from a single category greater than 13 if Za=45, meaning that the range of intensities for one State is 55% that of the other. As the number of Cases and Controls increases from 45 to 150, the range of the smaller intensity State at which this 50% chance is reached increases from 65% to 90% of the range of the larger intensity State. If the features have the form of Feature-b, the corresponding region of overlap increases from 62.5% to 92.5% as the number of Cases and Controls increases from 30 to 150.
(Last updated 4/29/07)