Statistics and Its Interface

Volume 2 (2009)

Number 3

Search for the smallest random forest

Pages: 381 – 388

DOI: http://dx.doi.org/10.4310/SII.2009.v2.n3.a11

Authors

Minghui Wang (Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, Conn., U.S.A.)

Heping Zhang (Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, Conn., U.S.A.)

Abstract

Random forests have emerged as one of the most commonly used nonparametric statistical methods in many scientific areas, particularly in analysis of high throughput genomic data. A general practice in using random forests is to generate a sufficiently large number of trees, although it is subjective as to how large is sufficient. Furthermore, random forests are viewed as “black-box” because of its sheer size. In this work, we address a fundamental issue in the use of random forests: how large does a random forest have to be? To this end, we propose a specific method to find a sub-forest (e.g., in a single digit number of trees) that can achieve the prediction accuracy of a large random forest (in the order of thousands of trees). We tested it on extensive simulation studies and a real study on prognosis of breast cancer. The results show that such sub-forests usually exist and most of them are very small, suggesting they are actually the “representatives” of the whole random forests. We conclude that the sub-forests are indeed the core of a random forest. Thus it is not necessary to use the whole forest for satisfying prediction performance. Also, by reducing the size of a random forest to a manageable size, the random forest is no longer a black-box.

Keywords

random forest, classification, smallest forest

Full Text (PDF format)