Statistics and Its Interface

Volume 11 (2018)

Number 1

Feature screening for ultrahigh dimensional binary data

Pages: 41 – 50

DOI: https://dx.doi.org/10.4310/SII.2018.v11.n1.a4

Authors

Guoyu Guan (Key Laboratory for Applied Statistics of MOE, School of Economics, Northeast Normal University, Changchun, Jilin Province, China)

Na Shan (Key Laboratory for Applied Statistics of MOE, School of Psychology, Northeast Normal University, Changchun, Jilin Province, China)

Jianhua Guo (Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, Jilin Province, China)

Abstract

With the rapid development of information technology, the volume of ultrahigh dimensional binary data has increased dramatically, and feature screening has become a necessary step in analyzing such data. In this article, we propose an $L_0$-regularization feature screening procedure for the naive Bayes classifier, which is equivalent to the classical mutual information screening method. However, the tuning parameter in $L_0$-regularization is hard to select and lacks theoretical support. To address this, a BIC-type criterion is applied to identify important features. Moreover, the asymptotic properties of the proposed method are investigated theoretically under some mild assumptions. Lastly, its outstanding performance is confirmed numerically on simulated data, and a real example of Chinese document classification is presented for illustration.
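The abstract notes that the proposed procedure is equivalent to classical mutual information screening. As a rough illustration of that classical ranking only, the following sketch scores each binary feature by its empirical mutual information with the class label. The function names `mutual_information_scores` and `top_k_features` are hypothetical, and the fixed cutoff `k` is a stand-in assumption: the BIC-type criterion of the article is not specified in the abstract and is not reproduced here.

```python
import numpy as np


def mutual_information_scores(X, y):
    """Empirical mutual information between each binary column of X and y.

    X: (n, p) array with entries in {0, 1}; y: (n,) array of class labels.
    Returns a length-p array of nonnegative scores (natural log units).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n, p = X.shape
    scores = np.zeros(p)
    for k in np.unique(y):
        in_class = (y == k)
        p_class = in_class.mean()  # estimate of P(Y = k)
        for v in (0, 1):
            # estimates of P(X_j = v, Y = k) and P(X_j = v) for every feature j
            joint = (X[in_class] == v).sum(axis=0) / n
            marginal = (X == v).mean(axis=0)
            # cells with zero joint probability contribute 0 by convention
            nonzero = joint > 0
            scores[nonzero] += joint[nonzero] * np.log(
                joint[nonzero] / (marginal[nonzero] * p_class)
            )
    return scores


def top_k_features(scores, k):
    """Indices of the k largest scores (illustrative cutoff, not the BIC rule)."""
    return np.argsort(-scores)[:k]


# Toy data: feature 0 copies the label, feature 1 is independent of it.
y_demo = np.array([0, 0, 1, 1])
X_demo = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])
demo_scores = mutual_information_scores(X_demo, y_demo)
demo_selected = top_k_features(demo_scores, 1)
```

On the toy data, the feature that copies the label receives a score of about $\log 2$, while the independent feature receives a score of zero, so the former is ranked first.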

Keywords

feature screening, $L_0$-regularization, naive Bayes, screening consistency

2010 Mathematics Subject Classification

Primary 62F07. Secondary 62H30.

The research of Guoyu Guan is supported in part by the National Natural Science Foundation of China (No. 11501093), the China Postdoctoral Science Foundation Funded Project (No. 2015M581378), and the Fundamental Research Funds for the Central Universities (No. 2412015KJ028, 130028613). The research of Na Shan is supported in part by the National Natural Science Foundation of China (No. 11401047, 11571050) and the Project of the Educational Department of Jilin Province of China (2016315). The research of all the authors is supported by the National Natural Science Foundation of China (No. 11631003, 11690012).

Received 14 September 2016

Published 23 August 2017