Statistics and Its Interface

Volume 12 (2019)

Number 2

Pólya urn model and its application to text categorization

Pages: 227 – 237

DOI: https://dx.doi.org/10.4310/SII.2019.v12.n2.a4

Authors

Haibin Zhang (School of Statistics, East China Normal University, Shanghai, China)

Xianyi Wu (School of Statistics, East China Normal University, Shanghai, China)

Xueqin Zhou (College of Science, Shanghai Institute of Technology, Shanghai, China)

Abstract

The Pólya urn model is a basic model widely applied in statistics and text mining. Most algorithms for training the model are slow and complicated, so it is generally difficult to fit a Pólya urn model to big data sets. This paper proposes a new minorization-maximization (MM) algorithm for maximum likelihood estimation (MLE) of the Pólya urn model, in which the surrogate function is constructed by means of a simple convex function. The convergence of the MM algorithm is analyzed, and the asymptotic normality of the corresponding MLE for non-identically distributed observations is derived. The performance of the new MM algorithm is compared with that of Newton's method and other MM algorithms. The Pólya urn model is then applied to text categorization, and its superiority to the naive Bayes (NB) classifier, k-nearest neighbor (k-NN), and support vector machine (SVM) is demonstrated on a real newsgroup dataset.

Keywords

Pólya urn model, minorization-maximization, asymptotic properties, text categorization

This research was supported by NSFC under grant No. 71771089.

Received 20 October 2017

Published 11 March 2019