Statistics and Its Interface

Volume 9 (2016)

Number 4

Special Issue on Statistical and Computational Theory and Methodology for Big Data

Guest Editors: Ming-Hui Chen (University of Connecticut); Radu V. Craiu (University of Toronto); Faming Liang (University of Florida); and Chuanhai Liu (Purdue University)

Iterative subsampling in solution path clustering of noisy big data

Pages: 415 – 431

DOI: http://dx.doi.org/10.4310/SII.2016.v9.n4.a2

Authors

Yuliya Marchetti (Department of Statistics, University of California at Los Angeles)

Qing Zhou (Department of Statistics, University of California at Los Angeles)

Abstract

We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method’s relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets.
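
The loop described above, clustering a small subsample and then sequentially assigning the remaining points, can be illustrated with a minimal sketch. This is not the authors' implementation. The clustering step below is a simple greedy "leader" clustering used as a stand-in for SPC, and the function names and the parameters `radius`, `min_size`, and `max_iter` are hypothetical choices made for illustration. The paper's solution selection mechanism is not reproduced.

```python
import math
import random


def nearest(point, centers):
    """Return (index, distance) of the center closest to point."""
    k = min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
    return k, math.dist(point, centers[k])


def cluster_subsample(sample, radius):
    """Stand-in clustering step (NOT SPC): greedy leader clustering.

    A point joins the nearest existing cluster if it lies within `radius`
    of that cluster's running-mean center; otherwise it starts a new one.
    """
    centers, members = [], []
    for p in sample:
        if centers:
            k, d = nearest(p, centers)
            if d <= radius:
                members[k].append(p)
                n = len(members[k])
                centers[k] = tuple(
                    sum(q[j] for q in members[k]) / n for j in range(len(p))
                )
                continue
        centers.append(tuple(p))
        members.append([p])
    return centers, members


def iterative_subsample_cluster(data, sample_size, radius, min_size,
                                max_iter=10, seed=0):
    """Alternate between clustering a subsample and assigning the rest.

    Returns (labels, centers); label -1 marks points left as noise.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    labels = [-1] * len(data)
    remaining = list(range(len(data)))
    all_centers = []
    for _ in range(max_iter):
        if len(remaining) < min_size:
            break
        idx = rng.sample(remaining, min(sample_size, len(remaining)))
        centers, members = cluster_subsample([data[i] for i in idx], radius)
        # Keep only clusters with enough support in the subsample.
        kept = [c for c, m in zip(centers, members) if len(m) >= min_size]
        if not kept:
            continue
        offset = len(all_centers)
        all_centers.extend(kept)
        # Sequentially assign the points that are still unlabeled.
        unassigned = []
        for i in remaining:
            k, d = nearest(data[i], kept)
            if d <= radius:
                labels[i] = offset + k
            else:
                unassigned.append(i)
        remaining = unassigned
    return labels, all_centers
```

In this sketch, points that never fall within `radius` of a retained center keep the label -1, which loosely mirrors the idea of isolating noise rather than forcing every point into a cluster.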

Keywords

big data, clustering, sparse regularization, subsampling

2010 Mathematics Subject Classification

Primary 62H30. Secondary 68T05.

Published 14 September 2016