Analysis Method: dtgini
(formerly known as gini [Hab-05])
Brian T. Luke (lukeb@ncifcrf.gov)
This procedure uses each feature to perform a one-node split of all N samples in the parent node into D daughter nodes using D-1 cut points, as in a decision tree. The quality of the split is measured by the value of GINIsplit. In the dth daughter node, the probability of being in State s, $P_{s,d}$, is the number of samples from this State in this node divided by the total number of samples in this node. The Gini value for this node is
$$\mathrm{Gini}(d) = 1 - \sum_{s=1}^{S} P_{s,d}^{2}$$
If each daughter node contains $N_d$ samples, GINIsplit for this feature is then
$$\mathrm{GINI}_{split} = \sum_{d=1}^{D} \frac{N_d}{N}\,\mathrm{Gini}(d)$$
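As a minimal sketch of the arithmetic, the example below assumes the standard Gini impurity (one minus the sum of squared State probabilities in a node) and a GINIsplit that is the sample-weighted average of the daughter-node Gini values. The function names `gini_node` and `gini_split`, and the sample counts, are illustrative only and are not part of the original procedure.

```python
def gini_node(counts):
    """Gini value of one daughter node, given the sample count for each State."""
    n = sum(counts)
    if n == 0:
        return 0.0  # an empty node contributes nothing to the split
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(nodes):
    """Sample-weighted average of the daughter-node Gini values."""
    n_total = sum(sum(counts) for counts in nodes)
    return sum(sum(counts) / n_total * gini_node(counts) for counts in nodes)


# Hypothetical two-State split: each inner list is [Cases, Controls] in one node.
nodes = [[8, 2], [1, 9]]
print(f"Gini(1)   = {gini_node(nodes[0]):.3f}")
print(f"Gini(2)   = {gini_node(nodes[1]):.3f}")
print(f"GINIsplit = {gini_split(nodes):.3f}")
```

A pure node has a Gini value of 0, and a two-State node with equal numbers from each State has a Gini value of 0.5, so lower GINIsplit values indicate a better separation of the States.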
The features are then ordered from lowest to highest GINIsplit(l).
For this procedure, D=S, so there is one daughter node for each State. This means that if three States are present, two cut points are used to produce three daughter nodes. The feature intensities are sorted in ascending (or descending) order, and the possible cut points are the midpoints between consecutive intensities. GINIsplit is determined for each cut point (or combination of cut points), and the cut point(s) with the lowest GINIsplit are used.
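The cut-point search described above can be sketched as follows. This is an illustrative brute-force version, not the author's implementation: it assumes the standard Gini impurity for each node, skips midpoints between tied intensities, and tries every combination of D-1 candidate cut points. The names `gini_split` and `best_cuts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def gini_split(groups):
    """Sample-weighted Gini value over daughter nodes (lists of State labels)."""
    n_total = sum(len(g) for g in groups)
    total = 0.0
    for g in groups:
        if not g:
            continue
        counts = Counter(g)
        gini = 1.0 - sum((c / len(g)) ** 2 for c in counts.values())
        total += len(g) / n_total * gini
    return total


def best_cuts(intensities, states):
    """Return (lowest GINIsplit, cut points) for one feature, with D = S."""
    pairs = sorted(zip(intensities, states), key=lambda p: p[0])
    values = [v for v, _ in pairs]
    labels = [s for _, s in pairs]
    n_daughters = len(set(states))  # one daughter node per State
    # Candidate cut points: midpoints between consecutive distinct intensities,
    # stored with the index at which the sorted samples would be divided.
    candidates = [
        (i + 1, (values[i] + values[i + 1]) / 2)
        for i in range(len(values) - 1)
        if values[i] != values[i + 1]
    ]
    best = (float("inf"), ())
    for combo in combinations(candidates, n_daughters - 1):
        bounds = [0] + [idx for idx, _ in combo] + [len(labels)]
        groups = [labels[bounds[k]:bounds[k + 1]] for k in range(len(bounds) - 1)]
        score = gini_split(groups)
        if score < best[0]:
            best = (score, tuple(cut for _, cut in combo))
    return best


# Hypothetical feature that separates the two States perfectly.
score, cuts = best_cuts([1, 2, 3, 10, 11, 12], ["control"] * 3 + ["case"] * 3)
print(score, cuts)
```

Features can then be ranked by sorting on the returned score, lowest first. The exhaustive search over combinations grows quickly with the number of States, so this form is only practical for small S.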
The results of examining 10,000 features representing either Feature-a or Feature-b, and of comparing their scores against the minimum score obtained from features with no information, are shown in the following table.
N per State | Threshold | 10a | 10b | 15a | 15b | 20a | 20b | 25a | 25b | 30a | 30b | 35a | 35b | 40a | 40b |
30 | 0.333 | 1 | 1 | 8 | 1 | 28 | 12 | 76 | 12 | 275 | 42 | 728 | 65 | 1798 | 96 |
45 | 0.39 | 12 | 5 | 38 | 9 | 226 | 30 | 886 | 66 | 2488 | 166 | 5053 | 357 | 7484 | 738 |
60 | 0.41 | 8 | 2 | 81 | 6 | 557 | 42 | 2327 | 111 | 5308 | 301 | 8080 | 727 | 9517 | 1601 |
90 | 0.431 | 13 | 0 | 297 | 10 | 2610 | 48 | 6636 | 244 | 9253 | 925 | 9915 | 2434 | 9996 | 4963 |
150 | 0.462 | 952 | 34 | 6778 | 318 | 9744 | 2156 | 9995 | 6031 | 10000 | 9070 | 10000 | 9872 | 10000 | 7294 |
300 | 0.483 | 9892 | 3746 | 10000 | 9634 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
As stated earlier, the first column gives the number of Cases and the number of Controls in each dataset. The second column gives the minimum value of GINIsplit obtained from 10,000 features whose intensities for both Cases and Controls are randomly assigned within the range 0.0 to 100.0. The remaining columns show how many of 10,000 randomly generated features have a GINIsplit value below this threshold. The headings of these columns show whether the features represent Feature-a or Feature-b, described previously, and the value of Za or 2Zb. For example, the column labeled 10a is for features that represent Feature-a with Za=10, while the column labeled 10b is for features that represent Feature-b with 2Zb=10 (Zb=5).
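The threshold in the second column can be approximated with a small simulation. The sketch below covers only the no-information case, with two States and a single cut point, and assumes the standard Gini impurity and uniformly distributed intensities between 0.0 and 100.0. The function names and the random seed are illustrative, so the result will not reproduce the tabulated thresholds exactly.

```python
import random


def min_gini_split(intensities, is_case):
    """Lowest GINIsplit over all single cut points for a two-State feature."""
    order = sorted(range(len(intensities)), key=intensities.__getitem__)
    n = len(order)
    total_cases = sum(is_case)
    best = float("inf")
    left_cases = 0
    for pos in range(1, n):
        left_cases += is_case[order[pos - 1]]
        if intensities[order[pos - 1]] == intensities[order[pos]]:
            continue  # no midpoint exists between tied intensities
        right_cases = total_cases - left_cases
        n_left, n_right = pos, n - pos
        gini_left = 1.0 - (left_cases / n_left) ** 2 - ((n_left - left_cases) / n_left) ** 2
        gini_right = 1.0 - (right_cases / n_right) ** 2 - ((n_right - right_cases) / n_right) ** 2
        best = min(best, (n_left * gini_left + n_right * gini_right) / n)
    return best


def null_threshold(n_per_state, n_features=10_000, seed=0):
    """Minimum GINIsplit seen across features that carry no information."""
    rng = random.Random(seed)
    is_case = [1] * n_per_state + [0] * n_per_state
    return min(
        min_gini_split([rng.uniform(0.0, 100.0) for _ in is_case], is_case)
        for _ in range(n_features)
    )


# Threshold estimate for 30 Cases and 30 Controls.
print(null_threshold(30))
```

Because the threshold is the minimum over many random features, it depends on both the number of features examined and the random draw. Smaller datasets give lower thresholds, which is why weak features are harder to detect with fewer subjects.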
This procedure again identifies weak features more easily if they resemble Feature-a than if they resemble Feature-b. With 300 Cases and 300 Controls, a Feature-a type of feature with Za=10, meaning that the range of one State spans 90% of the range of the other, has a 98.9% chance of having a lower GINIsplit value than any observed feature with Za=0 (i.e. no information). Conversely, a feature that resembles Feature-b with 2Zb=10, meaning that there is a 95% overlap in the ranges, has only about a 37.5% chance of having a lower GINIsplit value than a non-informative feature. As the number of subjects in the dataset decreases, it becomes harder to distinguish a feature with different ranges from one without. For example, if there are only 30 Cases and 30 Controls, Za must be at least 50 for a Feature-a type of feature (meaning one State has an intensity range that is only 50% of the other) or 2Zb must be at least 85 for a Feature-b type of feature (meaning that the ranges have only a 57.5% overlap) before the feature has at least a 50% chance of having a GINIsplit value below the lowest one observed for a non-informative feature.
(Last updated 4/4/07)