Analysis Method: kolsmir

Brian T. Luke (lukeb@ncifcrf.gov)
Return to Contents

The  Kolmogorov-Smirnov test (K-S test) simply measures the maximum difference in the cumulative fraction plots of the two States.

F1(x) is the fraction of samples in the first State that have an intensity less than x, while F2(x) is the fraction of samples in the second State.  The metric is simply given by the following.

D = max|F1(x) – F2(x)| over all x.

The peaks should be ranked from highest to lowest value of D.

In general, the intensities are ranked from lowest to highest and x assumes each value of the intensity (meaning that the point with this intensity does not count towards the total).

Increment the total in each state starting with the sample with the lowest intensity and stopping with the sample with the next-to-highest intensity and find the largest difference after each increment.

This method is only valid for a two-State problem.

The results examining 10,000 features representing either Feature-a or Feature-b, and comparing their scores against the maximum possible score obtained from features with no information is shown in the following table.

Each Thresh 10a 10b 15a 15b 20a 20b 25a 25b 30a 30b 35a 35b 40a 40b
30 17 0 0 3 5 7 2 15 11 66 11 206 31 563 44
45 20 6 4 9 8 59 16 175 41 572 101 1675 178 3545 337
60 26 0 0 1 1 12 3 75 20 401 30 1451 76 3687 159
90 29 11 9 47 22 293 61 1545 171 4469 421 7811 969 9572 1941
150 39 12 4 142 30 1232 140 5365 485 9130 1253 9942 2697 9999 5052
300 53 155 62 2829 358 9137 1491 9994 3968 10000 7598 10000 9592 10000 9983

As stated earlier, the first column represents the number of Cases and the number of Controls in each dataset.  The second column represents the maximum value of the D obtained from 10,000 features where the intensities for both Cases and Controls are randomly assigned within the range of 0.0 to 100.0.  The remaining columns show the number of times in 10,000 randomly generated feature intensities that a feature has a value of D that is above this threshold.  The headings for these column show whether the features represent Feature-a or Feature-b, described previously, and the value of Za or 2Zb.  For example, the column labeled 10a is for features that represent Feature-a with Za=10, while the column labeled 10b is for features that represent Feature-b with 2Zb=10 (Zb=5).

This procedure recognizes putative biomarkers represented by Feature-a better than those for Feature-b.  For datasets with 300 cases and 300 controls, approximately 91.4% of the features with Za=20 produced a higher D value than any observed feature with no information.  In contrast, if 2Zb=20, only 14.9% of the features had higher D values.  As with the other methods examined, the ability to identify a weak putative biomarker is much better if the dataset contains more samples.  If there are only 30 Cases and 30 Controls and the features have the form of Feature-a, there is at least a 50% chance of having a D value greater than 17 if Za=60, meaning that the range of intensities for one State is only 40% that of the other.  As the number of Cases and Controls increases from 45 to 150, the range of the smaller intensity State increases from 55% to 75% of the range of the larger intensity State.  If the features have the form of Feature-b, the region of overlap increases from 52.5% to 85% as the number of Cases and Controls increases from 30 to 300.

(Last updated 4/29/07)