Analysis of Datasets Containing No Biological Information

Brian T. Luke (lukeb@ncifcrf.gov)

To examine the classification accuracy of various methods, and therefore the significance of each method, on datasets with no biological information, 30 artificial datasets with random peak intensities were created. All datasets contain 300 “peaks” and have the same number of Cases and Controls (30, 42, 60, 90, 150, and 300 of each category). For each number of Cases and Controls, five independent datasets are available for download and analysis. Descriptions of the structure of the datasets, as well as the procedure used in their generation, are available. Below is a summary of the results for distinguishing Cases from Controls in all 30 datasets using the BioMarker Development Kit (BMDK), a decision tree (DT), and a medoid classification algorithm (MCA). Specific results for each dataset using these three classification methods are also available.
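As a rough illustration of this kind of construction, the sketch below (in Python) builds one such dataset. The uniform distribution, intensity range, and function name are assumptions made for illustration only; the actual generation procedure is the one described in the materials referenced above.

import numpy as np

def make_random_dataset(n_cases, n_controls, n_peaks=300, seed=0):
    # Minimal sketch of one artificial dataset; the uniform distribution
    # and intensity range are illustrative assumptions, not the
    # documented generation protocol.
    rng = np.random.default_rng(seed)
    # Intensities are pure noise, so they carry no class information.
    intensities = rng.uniform(0.0, 100.0, size=(n_cases + n_controls, n_peaks))
    labels = np.array([1] * n_cases + [0] * n_controls)  # 1 = Case, 0 = Control
    return intensities, labels

X, y = make_random_dataset(30, 30, seed=1)  # e.g., a 30-Case/30-Control set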

Table 1 lists the best classification results obtained by the BMDK suite of programs for each dataset. The second and third columns list the number of Cases and Controls in each dataset, and the fourth column lists the number of putative biomarkers identified by the 10 methods currently employed in BMDK. The final three columns list the quality of the best classifier after an exhaustive examination of all sets of one, two, and three putative biomarkers, respectively, using a distance-dependent 6-nearest neighbor algorithm. Since this classifier can return a classification of “Undetermined”, its quality is the sum of the sensitivity and specificity minus the percentage of samples that are “Undetermined”. Because an “Undetermined” classification may be caused by incomplete coverage of feature space by the available samples, which should be avoided, the results in Table 1 carry the added restriction that %undetermined cannot exceed 5% for any classifier. With this restriction, no valid 3-peak classifier could be found for the fifth dataset containing 30 Cases and 30 Controls (Set 30_5a). These results use a standard Euclidean distance between samples in feature space, since none of the other three available distance metrics identified a valid 3-peak classifier for any of the 30 datasets.
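The scoring and voting scheme can be sketched as follows; the inverse-distance weighting and tie handling shown here are illustrative assumptions rather than the documented BMDK rules.

import numpy as np

def knn6_classify(train_X, train_y, x, eps=1e-12):
    # Euclidean distances from the test sample to every training sample,
    # computed in the subspace of the selected peaks.
    d = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:6]              # the 6 nearest neighbors
    w = 1.0 / (d[nearest] + eps)             # distance-dependent weights (assumed form)
    case_w = w[train_y[nearest] == 1].sum()
    ctrl_w = w[train_y[nearest] == 0].sum()
    if case_w > ctrl_w:
        return "Case"
    if ctrl_w > case_w:
        return "Control"
    return "Undetermined"                    # tie in the weighted vote

def quality(sensitivity, specificity, pct_undetermined):
    # Quality = sensitivity + specificity - %undetermined (all in percent);
    # e.g., 85 + 85 - 5 = 165, the minimum acceptable score cited below.
    return sensitivity + specificity - pct_undetermined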

If the minimum required accuracy of a classifier is a sensitivity and specificity of 85%, the minimum acceptable quality score is 165, assuming that %undetermined=5%. None of the results presented in Table 1 reach this minimum level of accuracy. This result is expected since the datasets are constructed to contain no information. It is also expected that the quality of the best classifier should decrease as the number of Cases and Controls increases. While this is generally true, the 1-peak classifier for the fifth dataset containing 60 Cases and 60 Controls (Set 60_5a) produced an anomalously high quality of 151.7 (sensitivity=78.3%, specificity=73.3%, %undetermined=0.0%). An intensity plot for this peak is shown in Figure 1. The left column shows the intensities for all Cases, while the intensities for all Controls are in the right column. It is clear from this figure that the accuracy of the peak is caused by a high density of intensities for one category in a region with a low density of intensities for the other. The only way to justify this peak as a useful classifier is to assume that each category is composed of several States and that each State is represented by a specific range of intensities for this peak. This division of the individuals in each category into multiple States would have to be biologically verified before this classifier could be used, even though its quality is still well below the minimum threshold.

The classification results using a decision tree (DT) are listed in Table 2. This table lists the quality (sum of sensitivity and specificity) of the best and 200th-best decision tree across four runs for each dataset. Two of the runs convert a decision node into a terminal node if it contains at most 1% of either the Cases or the Controls, while the other two runs increase this criterion to 4%. All four runs use a different seed for the random number generator. For the five datasets with 30 Cases and 30 Controls, 18 of the 20 runs identify at least 200 decision trees with an average sensitivity and specificity of 95% or higher. All runs identified 200 decision trees with an average sensitivity and specificity above 91%, and one run identified at least 200 decision trees with a sensitivity and specificity of 100%.
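The pruning criterion described above can be stated compactly, as in the sketch below; the Evolutionary Programming search that grows the trees (see note (a) under Table 2) is not reproduced here.

def make_terminal(node_cases, node_controls, total_cases, total_controls, frac=0.01):
    # A decision node becomes terminal once it holds at most `frac`
    # (1% or 4% in Table 2) of either the Cases or the Controls.
    return (node_cases <= frac * total_cases or
            node_controls <= frac * total_controls)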

Using an average sensitivity and specificity of 85% as the minimum required accuracy for a “useful” classifier, Table 2 shows that this level is regularly achieved if the dataset contains 60 Cases and 60 Controls or fewer. If there are 90 Cases and 90 Controls, an average sensitivity and specificity of 83.3% is achieved, which is slightly below the required accuracy. For the largest datasets (300 Cases and 300 Controls) the average sensitivity and specificity is below 70%, but still much higher than the BMDK results in Table 1.

The classification accuracies (sum of sensitivity and specificity) using the medoid classification algorithm (MCA) are listed in Table 3. For each dataset, two runs are performed when five, six, or seven of the 300 features are used in the classifier. The first examines all Cases and then all Controls, while the second reverses this order. All six runs for a given dataset use a different seed for the random number generator, and the number of Case-cells or Control-cells is not allowed to exceed two-thirds of the number of Cases or Controls, respectively. This allows up to one-third of the samples to effectively be treated as a testing set while still maintaining full coverage of the fingerprint patterns.
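The classification step of such an algorithm can be sketched as a nearest-medoid assignment; the evolutionary search that places the Case-cells and Control-cells is omitted, and the two-thirds cap appears only as a constant. The Euclidean metric is an assumption for illustration.

import numpy as np

MAX_CELL_FRACTION = 2.0 / 3.0   # cells may not exceed this share of a category

def classify_by_medoids(x, case_medoids, control_medoids):
    # Assign the sample to whichever category owns the nearest cell medoid
    # in the selected-feature subspace.
    d_case = np.sqrt(((case_medoids - x) ** 2).sum(axis=1)).min()
    d_ctrl = np.sqrt(((control_medoids - x) ** 2).sum(axis=1)).min()
    return "Case" if d_case < d_ctrl else "Control"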

If the dataset contains only 30 Cases and 30 Controls, all runs found at least one classifier that produced a sensitivity and specificity of 100% without requiring more than 20 Cases and 20 Controls for complete coverage. Two of the 6-feature runs and four of the 7-feature runs produced a final population in which at least 200 unique classifiers produced a sensitivity and specificity of 100%. For the largest datasets (300 Cases and 300 Controls), one 5-feature run identified a classifier with an average sensitivity and specificity of 85%; the other nine runs each found a classifier with an average sensitivity and specificity of 83.5% or higher. All 6-feature and 7-feature runs produced a final population containing at least 200 unique classifiers with an average sensitivity and specificity above 85%, and one 7-feature run found a classifier with an average sensitivity and specificity just above 90%.

Conclusions

While it has been argued [Pet-03] that accurate classification of a testing set must imply some underlying biological principle, the results presented here clearly show that this is not true for fingerprint-based classifiers, especially MCA classifiers. Hundreds of 7-feature MCA classifiers produce an average sensitivity and specificity above 90% for datasets with 150 Cases, 150 Controls, and only 300 features (Table 3), even though the datasets are constructed to contain no biological information. A chance fitting of the data [Ran-05a, Ran-05b] is highly plausible since fingerprint-based classifiers appear to be overly flexible. Increasing the number of samples in the dataset generally decreases the quality of the best classifier if it is a chance fit, but increasing the number of features in the dataset will increase this quality. For fingerprint-based classifiers, the quality should improve as more features are used in the classifier, while the results in Table 1 suggest that increasing the number of peaks from two to three does not produce much improvement and may actually yield poorer results.
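A back-of-the-envelope count illustrates why such chance fits are easy to find: the number of candidate feature subsets grows combinatorially with the subset size, so a search over random features has an enormous number of opportunities to fit the noise.

from math import comb

print(comb(300, 3))   # roughly 4.5 million 3-peak subsets from 300 peaks
print(comb(300, 7))   # roughly 4e13 7-peak subsets available to an MCA-style search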

The bottom line is that fingerprint-based classifiers should not be used to analyze a dataset. Problems of chance, coverage, uniqueness, and significance lead to the conclusion that fingerprint-based classifiers are not generalizable to the underlying population.

Figure 1: Intensities for the 60 Cases (left column) and 60 Controls (right column) for the peak that yielded a quality score of 151.7 (sensitivity=78.3%, specificity=73.3%, %undetermined=0.0%) in the dataset of random peak intensities.

Table 1: Classification accuracy (sum of the sensitivity and specificity minus the percent undetermined) using between one and three peaks from the list of NPB putative biomarkers identified by BMDK in a distance-dependent 6-nearest neighbor classifier using absolute differences in the peak intensities and requiring that the %Undetermined is no more than 5%.

Set Cases Controls NPB 1 Peak 2 Peaks 3 Peaks
30_1a 30 30 21 146.7 151.1 157.8
30_2a 30 30 22 137.3 153.3 155.3
30_3a 30 30 22 147.4 137.3 145.7
30_4a 30 30 21 143.3 144.9 147.6
30_5a 30 30 16 144.2 146.7 None
42_1a 42 42 27 142.9 135.7 113.0
42_2a 42 42 16 121.7 136.4 130.2
42_3a 42 42 22 131.4 136.2 137.3
42_4a 42 42 26 135.7 135.7 135.7
42_5a 42 42 21 131.0 134.7 129.8
60_1a 60 60 22 136.7 138.3 140.1
60_2a 60 60 29 121.7 129.7 128.9
60_3a 60 60 26 133.3 140.0 133.6
60_4a 60 60 22 130.0 139.3 133.7
60_5a 60 60 27 151.7 137.0 136.2
90_1a 90 90 24 121.1 133.9 125.3
90_2a 90 90 23 121.1 130.4 125.8
90_3a 90 90 27 123.3 126.7 123.3
90_4a 90 90 28 121.1 137.9 137.3
90_5a 90 90 27 118.9 131.1 135.6
150_1a 150 150 22 114.0 127.3 106.6
150_2a 150 150 25 116.7 125.3 125.3
150_3a 150 150 25 115.3 120.9 123.4
150_4a 150 150 22 118.0 123.3 125.3
150_5a 150 150 23 116.0 123.6 121.7
300_1a 300 300 26 115.7 117.7 120.2
300_2a 300 300 31 110.7 121.1 122.1
300_3a 300 300 29 111.3 121.7 119.1
300_4a 300 300 26 112.7 118.3 117.7
300_5a 300 300 26 113.7 115.0 119.5

Table 2: Classification accuracy (sum of the sensitivity and specificity) using the decision tree algorithm for the 1st and 200th best classifier as a function of the number of Cases and Controls.(a)

Set Cases Controls 1% Run 1      1% Run 2      4% Run 1      4% Run 2
                   1st   200th   1st   200th   1st   200th   1st   200th
30_1a 30 30 200.0 196.7 196.7 196.7 190.0 190.0 196.7 190.0
30_2a 30 30 196.7 196.7 196.7 193.3 196.7 193.3 200.0 196.7
30_3a 30 30 190.0 190.0 193.3 190.0 196.7 193.3 196.7 193.3
30_4a 30 30 196.7 196.7 196.7 193.3 193.3 186.7 193.3 193.3
30_5a 30 30 196.7 193.3 200.0 200.0 190.0 183.3 196.7 193.3
42_1a 42 42 188.1 185.7 185.8 178.5 183.3 180.9 183.3 180.9
42_2a 42 42 188.1 185.7 185.8 178.5 183.3 180.9 183.3 180.0
42_3a 42 42 183.3 183.3 188.1 185.7 185.7 185.7 183.3 180.9
42_4a 42 42 183.3 180.9 178.5 178.6 178.5 176.2 180.9 178.5
42_5a 42 42 190.5 185.7 185.7 180.9 188.1 185.7 180.9 176.2
60_1a 60 60 176.6 175.0 175.0 170.0 175.0 166.6 176.6 171.6
60_2a 60 60 176.6 171.6 175.0 171.6 171.6 168.3 173.3 168.3
60_3a 60 60 171.6 171.6 178.3 175.0 176.6 173.4 163.3 171.7
60_4a 60 60 176.6 173.3 176.6 171.6 176.7 173.3 171.6 168.3
60_5a 60 60 171.6 170.0 170.0 168.3 171.6 168.3 170.0 165.0
90_1a 90 90 156.7 155.6 160.0 158.9 160.0 158.9 162.3 158.9
90_2a 90 90 166.7 165.6 163.4 160.0 160.0 156.7 158.9 155.6
90_3a 90 90 158.9 157.8 161.1 158.9 158.9 156.7 164.5 152.2
90_4a 90 90 160.0 158.9 162.2 160.0 160.0 156.7 160.0 157.8
90_5a 90 90 166.7 164.4 162.2 158.9 166.7 165.6 161.1 158.9
150_1a 150 150 152.0 150.0 152.0 150.0 150.0 148.0 152.7 150.0
150_2a 150 150 149.3 146.0 150.7 149.3 150.7 149.3 151.3 150.0
150_3a 150 150 151.3 148.7 150.7 148.7 149.3 147.3 150.7 149.3
150_4a 150 150 152.0 150.0 149.3 148.0 155.3 154.0 153.3 151.3
150_5a 150 150 152.7 151.3 154.0 152.7 149.3 146.7 148.7 146.7
300_1a 300 300 138.3 137.3 136.3 135.3 137.3 136.0 135.7 134.7
300_2a 300 300 137.0 136.0 136.3 135.3 137.0 135.7 137.7 136.3
300_3a 300 300 136.7 136.0 137.0 136.0 134.0 132.0 136.3 135.0
300_4a 300 300 136.0 135.0 136.7 135.3 136.3 133.7 136.0 134.7
300_5a 300 300 135.0 133.7 136.3 135.0 135.7 134.7 135.7 135.0

(a)A decision node was converted to a terminal node if it contained at most 1% of the samples from either category (1% runs) or if it contained at most 4% of the samples from either category (4% runs). All four runs used different seeds to the random number generator that controlled the Evolutionary Programming search.

Table 3: Classification accuracy (sum of the sensitivity and specificity) using the medoid classification algorithm for the 1st and 200th best classifier as a function of the number of Cases and Controls, from two runs using five, six, and seven peaks with random intensities.(a)

Set Cases Controls Run 5 Peaks        6 Peaks        7 Peaks
                       1st   200th    1st   200th    1st   200th
30_1a 30 30 1 200.0 193.3 200.0 196.7 200.0 196.7
2 200.0 193.3 200.0 196.7 200.0 200.0
30_2a 30 30 1 200.0 193.3 200.0 196.7 200.0 196.7
2 200.0 193.3 200.0 196.7 200.0 196.7
30_3a 30 30 1 200.0 190.0 200.0 196.7 200.0 200.0
2 200.0 193.3 200.0 200.0 200.0 196.7
30_4a 30 30 1 200.0 193.3 200.0 196.7 200.0 196.7
2 200.0 193.3 200.0 193.3 200.0 200.0
30_5a 30 30 1 200.0 193.3 200.0 200.0 200.0 196.7
2 200.0 193.3 200.0 196.7 200.0 200.0
42_1a 42 42 1 195.2 188.1 197.6 190.5 197.6 192.9
2 195.2 188.1 197.6 190.5 197.6 192.9
42_2a 42 42 1 197.6 188.1 197.6 192.9 197.6 192.9
2 192.9 188.1 197.6 192.9 197.6 195.2
42_3a 42 42 1 192.9 188.1 197.6 190.5 197.6 192.9
2 195.2 188.1 195.2 190.5 197.6 192.9
42_4a 42 42 1 195.2 188.1 197.6 190.5 197.6 192.9
2 195.2 188.1 195.2 190.5 197.6 195.2
42_5a 42 42 1 192.9 185.7 195.2 188.1 197.6 192.9
2 195.2 188.1 197.6 192.9 197.6 195.2
60_1a 60 60 1 191.7 183.3 191.7 185.0 195.0 190.0
2 188.3 181.7 193.3 186.7 195.0 190.0
60_2a 60 60 1 188.3 181.7 193.3 185.0 195.0 186.7
2 193.3 183.3 193.3 185.0 193.3 186.7
60_3a 60 60 1 190.0 183.3 192.7 185.0 193.3 186.7
2 193.3 183.3 193.3 186.7 193.3 188.3
60_4a 60 60 1 193.3 183.3 193.3 185.0 195.0 190.0
2 190.0 183.3 193.3 185.0 191.7 186.7
60_5a 60 60 1 193.3 181.7 193.3 186.7 195.0 190.0
2 190.0 183.3 191.7 186.7 193.3 190.0
90_1a 90 90 1 184.4 178.9 188.9 181.1 188.9 183.3
2 184.4 177.8 185.6 180.0 188.9 182.2
90_2a 90 90 1 185.6 177.8 188.9 181.1 190.0 184.4
2 184.4 178.9 186.7 180.0 190.0 183.3
90_3a 90 90 1 184.4 177.8 186.7 180.0 191.1 183.3
2 187.8 178.9 188.9 182.2 188.9 184.4
90_4a 90 90 1 183.3 178.9 186.7 181.1 188.9 182.2
2 185.6 177.8 187.8 181.1 190.0 182.2
90_5a 90 90 1 185.6 177.8 185.6 180.0 188.9 182.2
2 186.7 181.1 187.8 182.2 190.0 185.6
150_1a 150 150 1 182.0 176.0 183.3 179.3 186.7 181.3
2 180.0 174.0 182.0 178.0 184.7 179.3
150_2a 150 150 1 183.3 174.7 182.7 178.7 184.7 180.7
2 181.3 174.7 182.7 178.0 187.3 180.0
150_3a 150 150 1 180.0 174.7 183.3 178.7 184.0 180.7
2 180.0 175.3 184.0 179.3 185.3 180.7
150_4a 150 150 1 180.7 175.3 182.0 178.7 184.0 180.7
2 180.0 174.7 185.3 178.7 185.3 181.3
150_5a 150 150 1 182.0 175.3 182.7 178.7 185.3 180.7
2 180.0 174.0 182.7 178.0 184.0 180.0
300_1a 300 300 1 167.3 163.3 176.3 171.7 179.7 175.0
2 170.3 163.3 175.7 172.3 178.7 175.0
300_2a 300 300 1 169.7 163.3 175.3 171.3 178.7 174.0
2 169.3 163.3 179.0 172.7 180.3 175.7
300_3a 300 300 1 168.0 163.3 175.3 172.3 179.3 175.3
2 168.7 163.3 178.3 172.3 178.7 175.3
300_4a 300 300 1 168.0 163.3 176.7 172.3 178.3 175.3
2 167.0 163.3 176.7 171.3 178.7 175.0
300_5a 300 300 1 168.3 163.3 176.7 171.7 178.7 174.3
2 168.3 163.7 176.3 172.7 178.7 175.0

(a)Run 1 examined all Cases and then all Controls, while Run 2 examined the Controls and then the Cases.

(Last updated 5/2/07)