Statistics and Its Interface

Volume 14 (2021)

Number 4

Online multiple learning with working sufficient statistics for generalized linear models in big data

Pages: 403 – 416

DOI: https://dx.doi.org/10.4310/20-SII661

Authors

Tonglin Zhang (Department of Statistics, Purdue University, West Lafayette, Indiana, U.S.A.)

Baijian Yang (Department of Computer and Information Technology, Purdue University, West Lafayette, Indiana, U.S.A.)

Abstract

The article proposes an online multiple learning approach to generalized linear models (GLMs) for big data. The approach relies on a new concept called working sufficient statistics (WSS), formulated under traditional iteratively reweighted least squares (IRWLS) for maximum likelihood estimation of GLMs. Because traditional IRWLS must access the entire data set multiple times, it cannot be applied directly to big data. To overcome this difficulty, a new approach, called one-step IRWLS, is proposed within the online-learning framework. The work investigates two methods: the first uses only the current block of data to formulate the objective function, while the second also incorporates information from previously processed data. Simulation studies show that the results given by the second method can be as precise and accurate as those given by exact maximum likelihood. A notable property is that one-step IRWLS avoids the memory and computational-efficiency barriers caused by the volume of big data. Because the size of the WSS does not vary with the sample size, the proposed approach can be used even when the size of the data far exceeds the memory of the computing system.
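To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how fixed-size working sufficient statistics can drive a one-step IRWLS update for logistic regression on streaming data blocks. The function name and block interface are hypothetical; the accumulated statistics are the standard IRWLS quantities $X^\top W X$ (a $p \times p$ matrix) and $X^\top W z$ (a length-$p$ vector), whose sizes depend only on the number of covariates $p$, not on the sample size.

```python
import numpy as np

def one_step_irwls_blocks(blocks, p):
    """Sketch of one-step IRWLS for logistic regression on streaming blocks.

    `blocks` is an iterable of (X, y) pairs, each small enough to hold in
    memory.  The working sufficient statistics A = X'WX and b = X'Wz are
    accumulated across blocks; their size (p x p and p) is fixed, so the
    procedure never stores more than one block of raw data at a time.
    """
    beta = np.zeros(p)
    A = np.zeros((p, p))     # running X'WX over all blocks seen so far
    b = np.zeros(p)          # running X'Wz over all blocks seen so far
    for X, y in blocks:
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))            # inverse logit link
        w = mu * (1.0 - mu)                        # IRWLS weights
        z = eta + (y - mu) / np.maximum(w, 1e-10)  # working response
        A += X.T @ (w[:, None] * X)
        b += X.T @ (w * z)
        beta = np.linalg.solve(A, b)               # one step after each block
    return beta
```

Accumulating the statistics over all previous blocks corresponds to the second method described above, which uses the information of the previous data; dropping the accumulation (resetting `A` and `b` each block) would correspond to the first.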

Keywords

big data, generalized linear models, one-step IRWLS, online multiple learning, parallel computation, working sufficient statistics

2010 Mathematics Subject Classification

Primary 62F10, 62J12. Secondary 62E20.

Received 6 July 2019

Accepted 18 December 2020

Published 8 July 2021