Statistics and Its Interface

Volume 6 (2013)

Number 2

Penalized unsupervised learning with outliers

Pages: 211 – 221

DOI: http://dx.doi.org/10.4310/SII.2013.v6.n2.a5

Author

Daniela M. Witten (Department of Biostatistics, University of Washington, Seattle, Wash., U.S.A.)

Abstract

We consider the problem of performing unsupervised learning in the presence of outliers, that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, $K$-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an “error” term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations’ errors to exactly equal zero. We show that this approach can be used to develop extensions of $K$-means clustering and principal components analysis that yield accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with $M$-estimation are explored.
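
The idea of per-observation error terms with a group lasso penalty can be illustrated for $K$-means with a minimal sketch. This is not the paper's implementation: the objective written in the comments, the alternating update scheme, and the names `group_soft_threshold`, `penalized_kmeans`, and `lam` are assumptions made for illustration only.

```python
import numpy as np

def group_soft_threshold(R, lam):
    """Row-wise minimizer of ||r - e||^2 + lam * ||e||_2 (assumed objective).

    Rows whose norm is at most lam / 2 are set exactly to zero;
    longer rows are shrunk toward zero by lam / 2.
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / (2.0 * np.maximum(norms, 1e-12)))
    return R * scale

def penalized_kmeans(X, k, lam, n_iter=50, seed=0):
    """Hypothetical alternating scheme for
        sum_i ||x_i - e_i - mu_{c(i)}||^2 + lam * sum_i ||e_i||_2.
    Returns cluster labels, centroids, and the error matrix E;
    rows of E that are nonzero flag candidate outliers.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    E = np.zeros_like(X)
    # Initialize centroids at k distinct observations (an arbitrary choice).
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        Z = X - E  # observations with current error estimates removed
        # Standard K-means assignment and centroid steps on Z.
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
        # Error step: shrink each observation's residual as a group.
        E = group_soft_threshold(X - centers[labels], lam)
    return labels, centers, E

# Usage: five points near the origin and one gross outlier, one cluster.
X = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1], [100, 100]], dtype=float)
labels, centers, E = penalized_kmeans(X, k=1, lam=10.0, n_iter=200)
outlier_flags = np.linalg.norm(E, axis=1) > 0
```

Under these assumptions, larger values of `lam` force more error terms to zero, and a sufficiently large `lam` recovers ordinary $K$-means updates. With a single cluster, the scheme amounts to a Huber-type location estimate, which is one way to see the link to $M$-estimation mentioned in the abstract.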

Keywords

robust, group lasso, clustering, principal components analysis, $M$-estimation
