Statistics and Its Interface

Volume 11 (2018)

Number 2

Dimension reduction for big data

Pages: 295 – 306

DOI: http://dx.doi.org/10.4310/SII.2018.v11.n2.a7

Authors

Tonglin Zhang (Department of Statistics, Purdue University, West Lafayette, Indiana, U.S.A.)

Baijian Yang (Department of Computer and Information Technology, Purdue University, West Lafayette, Indiana, U.S.A.)

Abstract

Dimension reduction aims to reduce the dimension of a high-dimensional vector-valued explanatory variable while preserving its relationship with a univariate or low-dimensional real-valued response. As one of the oldest and best-known dimension reduction approaches, principal component analysis (PCA) has been used extensively in high-dimensional data analysis. Classical PCA approaches cannot be applied to big data because of memory and storage barriers. Using a technique called scanning data by rows, the article proposes a new PCA approach and shows that it can provide exact solutions even when the size of the observed data exceeds the memory size of a computing system.
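
The idea of obtaining exact PCA results without holding the full data set in memory can be illustrated with a minimal sketch. It is not taken from the article. It assumes that "scanning data by rows" can be approximated by reading the data in row chunks and accumulating the row count, the column sums, and the cross-product matrix, from which the sample covariance follows. The function name streaming_pca and its parameters are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def streaming_pca(row_chunks, n_components):
    """PCA from a single pass over chunks of rows (illustrative sketch only).

    row_chunks: iterable of 2-D arrays, each holding some rows of the data.
    Only O(p^2) memory is needed for p columns, regardless of the row count.
    """
    n = 0
    total = None  # running column sums, shape (p,)
    cross = None  # running sum of x x^T over all rows, shape (p, p)
    for chunk in row_chunks:
        chunk = np.asarray(chunk, dtype=float)
        if total is None:
            p = chunk.shape[1]
            total = np.zeros(p)
            cross = np.zeros((p, p))
        n += chunk.shape[0]
        total += chunk.sum(axis=0)
        cross += chunk.T @ chunk

    mean = total / n
    # Sample covariance recovered from the accumulated statistics.
    cov = (cross - n * np.outer(mean, mean)) / (n - 1)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvals[order], eigvecs[:, order]
```

Under these assumptions the result agrees with PCA computed on the full data matrix up to floating-point rounding, because the accumulated statistics determine the sample covariance exactly. Subtracting n times the outer product of the mean can lose precision when the column means are large relative to the spread, so a practical implementation might centre the data with a preliminary pass or use a numerically stable update.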

Keywords

big data, dimension reduction, generalized linear models, parallel computation, principal component analysis, scanning data by rows

2010 Mathematics Subject Classification

Primary 62H25. Secondary 62J12.

Received 5 August 2016

Published 7 March 2018