Random Intensity Datasets: 60 Cases, 60 Controls

Brian T. Luke (lukeb@ncifcrf.gov)

Each of these five pairs of datasets contains 300 features measured for 60 Cases and 60 Controls. The datasets are constructed with random peak intensities, so they contain no biological information. Structure of the Datasets contains a general description of datasets that can be used by programs within the BioMarker Development Kit (BMDK). Since the Cases and Controls are stored in different files, the class indices are not included in the data. Each feature has a single label, and the labels are simply "F-00001" through "F-00300". Each dataset has an associated document that describes the results of an analysis using BMDK, as well as classifiers based on a decision tree (DT) and a medoid classification algorithm (MCA). To reduce the amount of repeated information in these tables of results, Description of the Tables gives details about each table.

Analysis                #Cases  #Controls  #Features  Case Dataset    Control Dataset    Analysis
Random_Intensity_60_1a  60      60         300        case_60_1a.txt  control_60_1a.txt  Tables
Random_Intensity_60_2a  60      60         300        case_60_2a.txt  control_60_2a.txt  Tables
Random_Intensity_60_3a  60      60         300        case_60_3a.txt  control_60_3a.txt  Tables
Random_Intensity_60_4a  60      60         300        case_60_4a.txt  control_60_4a.txt  Tables
Random_Intensity_60_5a  60      60         300        case_60_5a.txt  control_60_5a.txt  Tables

The following table lists the best classification results observed for each dataset pair.

Set NPB BMDK-1 BMDK-2 BMDK-3 DT MCA-5 MCA-6 MCA-7
60_1a 22 136.7 138.3 140.1 176.7 191.7 193.3 195.0
60_2a 29 121.7 129.7 128.9 176.7 193.3 193.3 195.0
60_3a 26 133.3 140.1 133.6 178.3 193.3 193.3 193.3
60_4a 22 130.0 139.3 133.7 176.7 193.3 193.3 195.0
60_5a 27 151.7 137.0 136.2 171.7 193.3 193.3 195.0

For each set of Cases and Controls, BMDK uses 10 different methods to search for putative biomarkers, and the number of putative biomarkers (NPB) identified for each set is listed in the second column (the Tables shown in the links above give details on which procedures selected which features). BMDK uses only these putative biomarkers to construct the final classifier, which is based on a distance-dependent K-nearest neighbor algorithm. This classifier allows for an "undetermined" classification, so the quality metrics shown above are the sum of the overall sensitivity and specificity minus the percent "undetermined" from a leave-one-out cross-validation analysis, with the constraint that no more than 5% of the samples can be "undetermined". The third, fourth, and fifth columns list the best result using one, two, and three of the putative biomarkers, respectively. For the DT and MCA classifiers, the quality is the sum of the sensitivity and specificity.
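The quality metric described above can be written as a short function. This is a minimal sketch, not BMDK code: the function name `quality_score` and its parameters are hypothetical, and all quantities are assumed to be expressed in percent.

```python
def quality_score(sensitivity, specificity, undetermined=0.0, max_undetermined=5.0):
    """Sum of sensitivity and specificity minus the percent "undetermined".

    All arguments are percentages. The name and signature are illustrative
    assumptions; they are not taken from BMDK itself.
    """
    # Enforce the constraint that at most 5% of samples may be "undetermined".
    if not 0.0 <= undetermined <= max_undetermined:
        raise ValueError("percent undetermined is outside the allowed range")
    return sensitivity + specificity - undetermined


# Rounded percentages quoted in the text for two of the classifiers.
print(round(quality_score(69.6, 74.6, 4.2), 1))  # 3-feature classifier, Set 60_1a
print(round(quality_score(78.3, 73.3), 1))       # 1-feature classifier, Set 60_5a
```

With these rounded inputs the function returns 140.0 and 151.6, while the table lists 140.1 and 151.7. The difference of 0.1 is presumably due to rounding of the quoted percentages. For the DT and MCA columns, the same function applies with `undetermined` left at zero.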

With the exception of the 1-feature classifier for Set 60_5a, none of the final BMDK classifiers produced both a sensitivity and a specificity above 70%. The 2-feature classifier for Set 60_3a yielded a sensitivity of 69.6% and a specificity of 74.6%, with five samples (4.2%) receiving an "undetermined" classification. The 3-feature classifier for Set 60_1a also had a sensitivity of 69.6% and a specificity of 74.6%, with five samples (4.2%) receiving an "undetermined" classification. The 1-feature classifier for Set 60_5a yielded a relatively high quality score (sensitivity=78.3%, specificity=73.3%, and no undetermined samples). The intensities of this feature are shown in the following figure for the Cases (left column) and Controls (right column).

This feature produced these relatively good results because several of the Cases had intensities in regions containing virtually no Controls, and several Controls had intensities in regions containing very few Cases. Since a random distribution of intensities is not uniform for a finite number of samples, it is possible to obtain a relatively good result by chance [Ran-05a, Ran-05b]. A visual inspection of the peak intensities is therefore necessary: this feature does not have a sufficient difference in the ranges of intensities for Cases and Controls to represent a true biomarker.
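The effect of chance can be illustrated with a small simulation. The sketch below makes several assumptions that are not from the source: intensities are drawn from a uniform distribution, each feature is scored with a simple single-threshold rule rather than BMDK's distance-dependent K-nearest neighbor classifier, and the score is computed on the full data without cross-validation. The function names are hypothetical.

```python
import random


def best_threshold_score(cases, controls):
    """Best sensitivity + specificity (in percent) over all single-threshold rules."""
    best = 0.0
    for t in sorted(set(cases) | set(controls)):
        # Rule A: call a sample a Case when its intensity is >= t.
        sens = sum(x >= t for x in cases) / len(cases)
        spec = sum(x < t for x in controls) / len(controls)
        # Rule B is the mirror image (Case when intensity < t), so its
        # sensitivity and specificity are 1 - sens and 1 - spec.
        best = max(best, sens + spec, (1 - sens) + (1 - spec))
    return 100.0 * best


def simulate(n_cases=60, n_controls=60, n_features=300, seed=0):
    """Score every feature of one random dataset pair (uniform intensities assumed)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_features):
        cases = [rng.random() for _ in range(n_cases)]
        controls = [rng.random() for _ in range(n_controls)]
        scores.append(best_threshold_score(cases, controls))
    return scores


scores = simulate()
print(f"median best score: {sorted(scores)[len(scores) // 2]:.1f}")
print(f"max best score:    {max(scores):.1f}")
```

A score of 100 corresponds to no discrimination. Because the best of 300 random features is selected, the maximum score is expected to lie well above 100 even though no feature carries any information.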

The best DT classifiers (Column 6) containing up to seven decision nodes misclassified between four and seven samples, yielding an average sensitivity and specificity of at least 85.8%. For four of the five datasets, the average sensitivity and specificity varied between 88.3% and 89.1% across all 120 samples. The final three columns of the preceding table show the best results for an MCA classifier using five, six, and seven features, respectively. The best 5-feature classifiers had an average sensitivity and specificity of at least 95.8%, while the 6- and 7-feature classifiers had an average sensitivity and specificity of at least 96.6%. In all cases, the MCA classifier was constructed after effectively separating the data into a training set containing 40 Cases and 40 Controls, and a testing set containing 20 Cases and 20 Controls.
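The averages quoted in this paragraph follow from the table by halving each quality score, since for DT and MCA the score is the sum of sensitivity and specificity. The short sketch below copies the DT and MCA columns from the table; the variable and function names are illustrative only.

```python
# DT and MCA quality scores copied from the results table, in Set order
# 60_1a, 60_2a, 60_3a, 60_4a, 60_5a.
scores = {
    "DT":    [176.7, 176.7, 178.3, 176.7, 171.7],
    "MCA-5": [191.7, 193.3, 193.3, 193.3, 193.3],
    "MCA-6": [193.3, 193.3, 193.3, 193.3, 193.3],
    "MCA-7": [195.0, 195.0, 193.3, 195.0, 195.0],
}


def average_sens_spec(score):
    """Average of sensitivity and specificity (percent) from their sum."""
    return score / 2.0


for name, column in scores.items():
    averages = [average_sens_spec(s) for s in column]
    print(f"{name}: lowest average = {min(averages):.2f}%, "
          f"highest average = {max(averages):.2f}%")
```

The lowest values in each column correspond to the "at least" figures in the text, which appear to be truncated to one decimal place.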

It is clear that the fingerprint-based methods are able to classify these samples with very high accuracy, even though these datasets are constructed to contain no biological information.

(Last updated 9/1/07)