Statistics and Its Interface
Volume 11 (2018)
Feature screening for ultrahigh dimensional binary data
Pages: 41 – 50
With the rapid development of information technology, the volume of ultrahigh dimensional binary data has increased dramatically, and feature screening has become a necessary step in analyzing such data. In this article, we propose an $L_0$-regularization feature screening procedure for the naive Bayes classifier, which is equivalent to the classical mutual information screening method. However, the tuning parameter in $L_0$-regularization is hard to select and lacks theoretical support. To this end, a BIC-type criterion is applied to identify important features. Moreover, the asymptotic properties of the proposed method are theoretically investigated under some mild assumptions. Lastly, its performance is numerically confirmed on simulated data, and a real example of Chinese document classification is presented for illustration.
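As a rough illustration of the classical mutual information screening that the abstract refers to, the following minimal sketch scores each binary feature by its empirical mutual information with the class label and keeps the highest-scoring ones. The function names `mutual_information` and `screen_top_d`, and the fixed cutoff `d`, are assumptions made for this example; the BIC-type criterion proposed in the article is not specified in the abstract and is not reproduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Empirical mutual information (in nats) between a feature x and labels y."""
    n = len(x)
    # Joint counts of (feature value, class label) pairs.
    joint: Dict[Tuple[int, int], int] = {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    # Marginal counts derived from the joint counts.
    px: Dict[int, int] = {}
    py: Dict[int, int] = {}
    for (xi, yi), c in joint.items():
        px[xi] = px.get(xi, 0) + c
        py[yi] = py.get(yi, 0) + c
    # Sum over observed cells only; empty cells contribute zero.
    mi = 0.0
    for (xi, yi), c in joint.items():
        mi += (c / n) * math.log(c * n / (px[xi] * py[yi]))
    return mi


def screen_top_d(X: Sequence[Sequence[int]], y: Sequence[int], d: int) -> List[int]:
    """Return the indices of the d features with the largest mutual information.

    The fixed cutoff d is a hypothetical stand-in for a data-driven rule.
    """
    p = len(X[0])
    scores = [mutual_information([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return ranked[:d]
```

For instance, with `X = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1]]` and `y = [0, 0, 1, 1]`, only the first feature carries information about the label, so `screen_top_d(X, y, 1)` selects index 0. Choosing `d` by hand is exactly the difficulty that motivates a data-driven selection criterion.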
Keywords
feature screening, $L_0$-regularization, naive Bayes, screening consistency
2010 Mathematics Subject Classification
Primary 62F07. Secondary 62H30.
The research of Guoyu Guan is supported in part by the National Natural Science Foundation of China (No. 11501093), the China Postdoctoral Science Foundation Funded Project (No. 2015M581378), and the Fundamental Research Funds for the Central Universities (Nos. 2412015KJ028, 130028613). The research of Na Shan is supported in part by the National Natural Science Foundation of China (Nos. 11401047, 11571050) and the Project of the Educational Department of Jilin Province of China (2016315). The research of all the authors is supported by the National Natural Science Foundation of China (Nos. 11631003, 11690012).
Paper received on 14 September 2016.