Random Intensity Datasets: 300 Cases, 300 Controls

Brian T. Luke (lukeb@ncifcrf.gov)

These five pairs of datasets each contain 300 features measured for 300 Cases and 300 Controls. The datasets are constructed with random peak intensities, so they contain no biological information. Structure of the Datasets gives a general description of the datasets that can be used by programs within the BioMarker Development Kit (BMDK). Since the Cases and Controls are stored in different files, the class indices are not included in the data. Each feature has a single label, and the labels are simply "F-00001" through "F-00300". Each dataset pair has an associated document that describes the results of an analysis using BMDK and using classifiers based on a decision tree (DT) and a medoid classification algorithm (MCA). To reduce the amount of repeated information in these tables of results, Description of the Tables gives details about each table.

Analysis                 #Cases/#Controls  #Features  Case Dataset     Control Dataset     Analysis
Random_Intensity_300_1a  300               300        case_300_1a.txt  control_300_1a.txt  Tables
Random_Intensity_300_2a  300               300        case_300_2a.txt  control_300_2a.txt  Tables
Random_Intensity_300_3a  300               300        case_300_3a.txt  control_300_3a.txt  Tables
Random_Intensity_300_4a  300               300        case_300_4a.txt  control_300_4a.txt  Tables
Random_Intensity_300_5a  300               300        case_300_5a.txt  control_300_5a.txt  Tables

The following table lists the best classification results observed for each dataset pair.

Set NPB BMDK-1 BMDK-2 BMDK-3 DT MCA-5 MCA-6 MCA-7
300_1a 26 115.7 117.7 120.2 138.3 170.3 176.3 179.7
300_2a 31 110.7 121.1 122.1 137.7 169.3 179.0 180.3
300_3a 29 111.3 121.7 119.1 137.0 168.7 178.3 179.3
300_4a 26 112.7 118.3 117.7 136.7 168.0 176.7 178.7
300_5a 26 113.7 115.0 119.5 136.3 168.3 176.7 178.7

For each set of Cases and Controls, BMDK uses 10 different methods to search for putative biomarkers, and the number of putative biomarkers (NPB) identified for each set is listed in the second column (the Tables linked above give details on which procedures selected which features). BMDK uses only these putative biomarkers to construct the final classifier, which is based on a distance-dependent K-nearest neighbor algorithm. This classifier allows for an "undetermined" classification, so the quality metrics shown above are the sum of the overall sensitivity and specificity minus the percent "undetermined" from a leave-one-out cross-validation analysis, with the constraint that no more than 5% of the samples can be "undetermined". The third, fourth and fifth columns (BMDK-1, BMDK-2 and BMDK-3) list the best result using one, two and three of the putative biomarkers, respectively. For the DT and MCA classifiers, the quality is the sum of the sensitivity and specificity.
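The BMDK quality metric described above can be written out as a small function. This is a minimal sketch, not BMDK code: the function name `quality_score` and its parameters are hypothetical, and the inputs are assumed to be percentages that have already been obtained from the leave-one-out analysis.

```python
def quality_score(sensitivity, specificity, pct_undetermined, max_undetermined=5.0):
    """Sum of sensitivity and specificity minus the percent "undetermined".

    All arguments are percentages (0-100). Returns None when more than
    max_undetermined percent of the samples are "undetermined", since such
    a classifier violates the constraint and is not scored.
    """
    if pct_undetermined > max_undetermined:
        return None
    return sensitivity + specificity - pct_undetermined


# Illustrative inputs only, not values taken from the analysis.
print(quality_score(60.0, 58.0, 2.0))   # 116.0
print(quality_score(70.0, 65.0, 6.0))   # None (over the 5% limit)
```

For the DT and MCA columns there is no "undetermined" class, so the same function applies with `pct_undetermined` set to 0.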

None of the final BMDK classifiers produced a sensitivity and specificity above 61.5%.

The best DT classifiers (Column 6) containing up to seven decision nodes yielded an average sensitivity and specificity between 68.2 and 69.2% for the 300 samples. The final three columns of the preceding table show the best results for an MCA classifier using five, six, and seven features, respectively. The best 5-feature classifiers had an average sensitivity and specificity of at least 84.0%, while the 6- and 7-feature classifiers have an average sensitivity and specificity of at least 88.2 and 89.3%, respectively. In all cases, the MCA classifier was constructed after effectively separating the data into a training set containing 200 Cases and 200 Controls, and a testing set containing 100 Cases and 100 Controls.
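The 200/100 separation of each class can be illustrated with a simple random split. This is only a sketch under stated assumptions: the source does not say how the training and testing sets were chosen, so the shuffling, the seeds, the sample identifiers and the helper name `split_train_test` are all hypothetical.

```python
import random


def split_train_test(samples, n_train, seed=0):
    """Shuffle a copy of samples and return (training, testing) lists."""
    rng = random.Random(seed)      # local generator, so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 300 Cases and 300 Controls.
cases = [f"case_{i:03d}" for i in range(1, 301)]
controls = [f"control_{i:03d}" for i in range(1, 301)]

# Split each class separately so both sets stay balanced.
train_cases, test_cases = split_train_test(cases, 200, seed=1)
train_controls, test_controls = split_train_test(controls, 200, seed=2)

print(len(train_cases), len(train_controls))  # 200 200
print(len(test_cases), len(test_controls))    # 100 100
```

Splitting Cases and Controls independently keeps the 1:1 class ratio in both the training and the testing set, which matches the counts given above.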

While there is a significant decrease in the quality of the DT classification, it is clear that the fingerprint-based methods are able to classify these samples to a higher accuracy than the biomarker-based method, even though these datasets are constructed to contain no biological information. These fingerprint-based methods are designed so that the average sensitivity and specificity cannot decrease if more features are used, while for two of the five sets of data (300_3a and 300_4a) the 3-feature BMDK classifier did not perform better than the 2-feature classifier. If a minimum required "accuracy for publication" is set at 85%, for example, a decision tree with more than seven decision nodes would be required, while even a 5-feature MCA classifier for one of the sets of data has sufficient accuracy. No biomarker-based classifier would ever be able to obtain this accuracy for these datasets.
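Since the table entries are sums of sensitivity and specificity, an average of 85% corresponds to a table entry of 170. The check below applies that threshold to the values copied from the results table. The variable and function names are illustrative, and the BMDK entries are treated as plain sums, which ignores the "undetermined" penalty already folded into them.

```python
# Values copied from the results table (sensitivity + specificity, in percent).
COLUMNS = ["BMDK-1", "BMDK-2", "BMDK-3", "DT", "MCA-5", "MCA-6", "MCA-7"]
RESULTS = {
    "300_1a": [115.7, 117.7, 120.2, 138.3, 170.3, 176.3, 179.7],
    "300_2a": [110.7, 121.1, 122.1, 137.7, 169.3, 179.0, 180.3],
    "300_3a": [111.3, 121.7, 119.1, 137.0, 168.7, 178.3, 179.3],
    "300_4a": [112.7, 118.3, 117.7, 136.7, 168.0, 176.7, 178.7],
    "300_5a": [113.7, 115.0, 119.5, 136.3, 168.3, 176.7, 178.7],
}


def sets_reaching(min_average):
    """Map each classifier column to the sets whose average meets min_average."""
    passing = {column: [] for column in COLUMNS}
    for set_name, scores in RESULTS.items():
        for column, score in zip(COLUMNS, scores):
            if score / 2.0 >= min_average:   # halve the sum to get the average
                passing[column].append(set_name)
    return passing


passing = sets_reaching(85.0)
for column in COLUMNS:
    print(column, passing[column])
```

With the 85% threshold, no BMDK or DT entry qualifies, one 5-feature MCA entry does, and every 6- and 7-feature MCA entry does.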

(Last updated 9/1/07)