National Cancer Institute U.S. National Institutes of Health www.cancer.gov
Publications

Publications Search

Abstract

Title: Prediction error estimation: a comparison of resampling methods.
Author: Molinaro AM, Simon R, Pfeiffer RM
Journal: Bioinformatics
Year: 2005
Month: May

Abstract:

MOTIVATION: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection, and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection.

RESULTS: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV), and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor, and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases.

SUPPLEMENTARY INFORMATION: A complete compilation of results is available in Molinaro et al. (2005). R code for simulations and analyses is available from the authors.
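The bias of the resubstitution estimate under feature selection can be illustrated with a minimal sketch. This is not the authors' R code. It assumes pure-noise data (so the true error of any classifier is 0.5), a simple mean-difference feature ranking, and a nearest-centroid classifier as a simplified stand-in for the classifiers named in the abstract. All function names and parameter values are hypothetical choices made for illustration. The key point is that LOOCV repeats feature selection inside every fold, while resubstitution selects and evaluates on the same samples.

```python
import random


def select_features(X, y, k):
    """Rank features by absolute difference in class means; return the top-k indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(sum(g1) / len(g1) - sum(g0) / len(g0)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, features):
    """Compute the per-class mean vector restricted to the selected features."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return centroids


def predict(row, centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroids[label]))
    return 0 if dist(0) <= dist(1) else 1


def resubstitution_error(X, y, k):
    """Select features, fit, and evaluate on the same samples (optimistically biased)."""
    features = select_features(X, y, k)
    centroids = fit_centroids(X, y, features)
    wrong = sum(predict(row, centroids, features) != label for row, label in zip(X, y))
    return wrong / len(y)


def loocv_error(X, y, k):
    """Leave-one-out CV with feature selection redone inside every fold."""
    wrong = 0
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        features = select_features(X_train, y_train, k)
        centroids = fit_centroids(X_train, y_train, features)
        wrong += predict(X[i], centroids, features) != y[i]
    return wrong / len(y)


# Assumed illustrative setup: 30 samples, 1000 noise features, 20 selected.
rng = random.Random(0)
n_samples, n_features, k = 30, 1000, 20
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # balanced labels, unrelated to X

resub = resubstitution_error(X, y, k)
loocv = loocv_error(X, y, k)
print(f"resubstitution error: {resub:.2f}")
print(f"LOOCV error (selection inside folds): {loocv:.2f}")
```

Because the labels carry no signal, an honest estimate should sit near 0.5. The resubstitution figure typically lands far below that, which is the kind of bias the abstract describes for small studies with thousands of candidate features.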