Statistics and Its Interface

Volume 7 (2014)

Number 2

Predictor augmentation in random forests

Pages: 177 – 186

DOI: https://dx.doi.org/10.4310/SII.2014.v7.n2.a3

Authors

Ruo Xu (Google Inc., Mountain View, California, U.S.A.)

Dan Nettleton (Department of Statistics, Iowa State University, Ames, Iowa, U.S.A.)

Daniel J. Nordman (Department of Statistics, Iowa State University, Ames, Iowa, U.S.A.)

Abstract

Random forest (RF) methodology is an increasingly popular nonparametric approach to prediction in both regression and classification problems. We describe a behavior of random forests (RFs) that may be unknown and surprising to many initial users of the methodology: out-of-sample prediction by RFs can sometimes be improved by augmenting the dataset with a new explanatory variable that is independent of all variables in the original dataset. We explain this phenomenon with a simulated example and show how independent variable augmentation can help RFs decrease prediction variance and improve prediction performance in some cases. We also provide real data examples for illustration, argue that this phenomenon is closely connected with overfitting, and suggest potential research directions for improving RFs.
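The augmentation described in the abstract can be sketched with scikit-learn. This is a minimal illustration, not the authors' procedure: the data-generating model, sample sizes, and forest settings are all assumptions, and whether augmentation actually helps depends on the setting.

```python
# Sketch of predictor augmentation: append a noise column that is
# independent of the response and all original predictors, then fit
# a random forest on the augmented design matrix.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 300, 5  # illustrative sample size and predictor count
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline forest on the original predictors.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
mse_base = mean_squared_error(y_te, rf.predict(X_te))

# Augment both splits with one independent noise predictor.
X_tr_aug = np.column_stack([X_tr, rng.normal(size=len(X_tr))])
X_te_aug = np.column_stack([X_te, rng.normal(size=len(X_te))])
rf_aug = RandomForestRegressor(n_estimators=200, random_state=0)
rf_aug.fit(X_tr_aug, y_tr)
mse_aug = mean_squared_error(y_te, rf_aug.predict(X_te_aug))

print(f"MSE without augmentation: {mse_base:.4f}")
print(f"MSE with augmentation:    {mse_aug:.4f}")
```

The noise column dilutes the pool of candidate split variables at each node, which can alter the variance of the ensemble's predictions; the direction and size of the effect vary with the data and forest parameters.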

Keywords

classification, machine learning, prediction, regression

2010 Mathematics Subject Classification

62-07

Published 17 April 2014