Analysis Method: extreme

Brian T. Luke (lukeb@ncifcrf.gov)

The Kolmogorov-Smirnov test may identify a feature as important if, in the middle of overlapping ranges, one category has a high density of points while the other has a very low density. This may be due to a sampling problem. What is really desired are features where one category has a large number of intensities that lie either above or below the intensity range of all samples in the other categories.

The extreme algorithm simply orders all intensities from lowest to highest and, starting from both extremes, finds the maximum number of samples from a single category before a sample from another category is observed. Because it relies only on the category of each ordered sample, this procedure, in contrast to the Kolmogorov-Smirnov test, works with more than two categories in the dataset.
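The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name `extreme_score` is hypothetical, the score is assumed to be the larger of the two end runs (rather than their sum), and tied intensities are assumed not to need special handling.

```python
from typing import Hashable, Sequence


def extreme_score(intensities: Sequence[float], labels: Sequence[Hashable]) -> int:
    """Longest single-category run at either end of the sorted intensities.

    Assumption: the score is the larger of the low-end and high-end runs.
    """
    if len(intensities) != len(labels):
        raise ValueError("intensities and labels must have the same length")

    # Order the category labels by intensity, lowest to highest.
    ordered = [lab for _, lab in sorted(zip(intensities, labels), key=lambda p: p[0])]

    def run_length(seq: Sequence[Hashable]) -> int:
        # Count samples until a sample from another category is observed.
        count = 0
        for lab in seq:
            if lab != seq[0]:
                break
            count += 1
        return count

    return max(run_length(ordered), run_length(ordered[::-1]))


# Two categories: four "A" samples lie below every "B" sample.
two = extreme_score([1, 2, 3, 4, 5, 6, 7, 8], ["A", "A", "A", "A", "B", "A", "B", "B"])

# Three categories work the same way: three "A" samples lie above all others.
three = extreme_score([0.5, 9.1, 3.3, 7.2, 1.1, 8.4], ["C", "A", "B", "A", "C", "A"])
```

Here `two` is 4 (the run at the low end) and `three` is 3 (the run at the high end).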

The results of examining 10,000 features representing either Feature-a or Feature-b, and comparing their scores against the maximum score obtained from features with no information, are shown in the following table.

Each	Thresh	10a	10b	15a	15b	20a	20b	25a	25b	30a	30b	35a	35b	40a	40b
30	13	1	4	12	9	60	7	272	22	761	61	1854	142	3640	278
45	14	5	4	99	9	548	43	2029	113	4557	379	7159	839	8929	1815
60	18	6	0	42	2	463	7	2139	43	5136	182	8035	538	9429	1375
90	17	147	3	1866	56	6168	400	9192	1714	9917	4186	9996	7021	10000	8966
150	14	6239	888	9819	5049	9999	9023	10000	9929	10000	9998	10000	10000	10000	10000
300	15	9992	8210	10000	9991	10000	10000	10000	10000	10000	10000	10000	10000	10000	10000

As stated earlier, the first column gives the number of Cases and the number of Controls in each dataset. The second column gives the maximum count for samples from a single category obtained from 10,000 features where the intensities for both Cases and Controls are randomly assigned within the range of 0.0 to 100.0; this number is reasonably insensitive to the number of Cases and Controls. The remaining columns show how many of 10,000 randomly generated features have a sample count above this threshold. The headings for these columns show whether the features represent Feature-a or Feature-b, described previously, and the value of Za or 2Zb. For example, the column labeled 10a is for features that represent Feature-a with Za=10, while the column labeled 10b is for features that represent Feature-b with 2Zb=10 (Zb=5).
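The no-information threshold in the second column can be approximated with a short simulation. The sketch below is an illustration under stated assumptions, not the code used to produce the table: intensities are assumed to be drawn uniformly from 0.0 to 100.0, the names `extreme_score` and `null_threshold` are hypothetical, and the score is again assumed to be the larger of the two end runs. Because the values are random, the result will vary with the seed and need not reproduce the tabulated thresholds exactly.

```python
import random


def extreme_score(intensities, labels):
    # Longest single-category run at either end of the sorted intensities.
    ordered = [lab for _, lab in sorted(zip(intensities, labels), key=lambda p: p[0])]

    def run(seq):
        n = 0
        for lab in seq:
            if lab != seq[0]:
                break
            n += 1
        return n

    return max(run(ordered), run(ordered[::-1]))


def null_threshold(n_per_group, n_features=10_000, seed=0):
    """Maximum extreme score over features that carry no information.

    Assumption: Cases and Controls are both uniform on [0.0, 100.0].
    """
    rng = random.Random(seed)
    labels = ["case"] * n_per_group + ["control"] * n_per_group
    best = 0
    for _ in range(n_features):
        values = [rng.uniform(0.0, 100.0) for _ in labels]
        best = max(best, extreme_score(values, labels))
    return best


# A smaller feature count keeps this quick; raise it to 10_000 to mirror the text.
threshold_30 = null_threshold(30, n_features=500, seed=1)
```

A feature from a real dataset would then be flagged only if its score exceeds the threshold for the matching number of Cases and Controls.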

This procedure recognizes putative biomarkers represented by Feature-a better than those represented by Feature-b if the dataset contains a relatively small number of samples. For datasets with 300 Cases and 300 Controls, approximately 99.9% of the features with Za=10 produced a higher count than any observed feature with no information, while if 2Zb=10, 82.1% of the features (8,210 of 10,000 in the table) had a higher count. If there are only 30 Cases and 30 Controls and the features have the form of Feature-a, there is at least a 50% chance of having a sample count from a single category greater than 13 if Za=45, meaning that the range of intensities for one State is 55% that of the other. As the number of Cases and Controls increases from 45 to 150, the range of the smaller intensity State at which this 50% chance is reached increases from 65% to 90% of the range of the larger intensity State. If the features have the form of Feature-b, the corresponding region of overlap increases from 62.5% to 92.5% as the number of Cases and Controls increases from 30 to 150.
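The Feature-a percentages quoted above can be checked against the table. The sketch below copies the Feature-a columns from the table and looks up the smallest tabulated Za at which at least half of the 10,000 features exceed the threshold. The helper name `smallest_za_at_half` is hypothetical, and the conversion "range = 100 - Za percent" is taken from the statement that Za=45 corresponds to a 55% range. The table only lists Za from 10 to 40 in steps of 5, so values outside that grid (such as Za=45 for 30 samples) cannot be recovered this way.

```python
ZA_VALUES = [10, 15, 20, 25, 30, 35, 40]

# Feature-a counts per row of the table, keyed by the number of Cases (= Controls).
FEATURE_A_COUNTS = {
    30: [1, 12, 60, 272, 761, 1854, 3640],
    45: [5, 99, 548, 2029, 4557, 7159, 8929],
    60: [6, 42, 463, 2139, 5136, 8035, 9429],
    90: [147, 1866, 6168, 9192, 9917, 9996, 10000],
    150: [6239, 9819, 9999, 10000, 10000, 10000, 10000],
    300: [9992, 10000, 10000, 10000, 10000, 10000, 10000],
}


def smallest_za_at_half(counts, total=10_000):
    """Smallest tabulated Za with at least 50% of features above threshold."""
    for za, count in zip(ZA_VALUES, counts):
        if count / total >= 0.5:
            return za
    return None  # not reached within the tabulated Za values


# Relative range of the smaller State, in percent, wherever the table resolves it.
range_percent = {
    n: (None if (za := smallest_za_at_half(counts)) is None else 100 - za)
    for n, counts in FEATURE_A_COUNTS.items()
}
```

With the table's values, `range_percent[45]` is 65 and `range_percent[150]` is 90, matching the text, while `range_percent[30]` is `None` because the 50% level is not reached for Za up to 40.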

(Last updated 4/29/07)