Statistics and Its Interface

Volume 1 (2008)

Number 1

Nonparametric clustering of functional data

Pages: 47 – 62

DOI: https://dx.doi.org/10.4310/SII.2008.v1.n1.a5

Authors

Forrest Miller (Department of Mathematics, Kansas State University, Manhattan, Ks., U.S.A.)

James Neill (Department of Statistics, Kansas State University, Manhattan, Ks., U.S.A.)

Haiyan Wang (Department of Statistics, Kansas State University, Manhattan, Ks., U.S.A.)

Abstract

This paper presents a method for effectively detecting unknown patterns or clusters in high dimensional functional data. Examples of such data include gene expression levels measured over time in microarray experiments, functional magnetic resonance imaging (fMRI) data, and mass spectrometry data from proteomics, lipidomics, etc. We define clusters through the unknown high dimensional multivariate distributions of all observations along each curve. Kullback-Leibler information and the Mahalanobis generalized squared distance can fail to provide a meaningful measure of distance between distributions in such a high dimensional setting. We propose a new similarity measure and an agglomerative clustering algorithm, called PCLUST, to effectively differentiate among high dimensional populations. The algorithm produces results that are invariant under monotone transformations of the data and does not require users to specify the number of clusters. Simulations show that PCLUST significantly outperforms nine other popular algorithms in both clustering accuracy and robustness. We illustrate an application to identifying biomarkers using time course gene expression data from Arabidopsis in response to environmental stresses.
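
The invariance under monotone transformations claimed above can be illustrated with a small sketch. The abstract does not say how PCLUST achieves this property; the example below assumes, purely for illustration, a rank-based quantity, which is one standard way to obtain such invariance. The helper `ranks` and the simulated values are hypothetical and are not taken from the paper.

```python
import numpy as np

def ranks(values):
    """Return 0-based ranks of a 1-D array (assumes no tied values)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(len(values))
    return r

rng = np.random.default_rng(0)
x = rng.normal(size=20)   # hypothetical measurements, not data from the paper

# Two strictly increasing transformations of the same measurements.
y = np.exp(x)
z = x ** 3 + 2 * x

# Ranks are unchanged, so any statistic computed from ranks alone is unchanged too.
same_after_exp = np.array_equal(ranks(x), ranks(y))
same_after_cubic = np.array_equal(ranks(x), ranks(z))
print(same_after_exp, same_after_cubic)
```

A procedure built only on such ranks would give the same clustering whether the data are analyzed on the raw scale or, for example, on a log scale.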

Keywords

cluster analysis, nonparametric inference, hypothesis testing, mixture model, high dimensional multivariate analysis, time course gene expression microarray data, lipid metabolism

2010 Mathematics Subject Classification

Primary 60H30, 62G10, 62G35. Secondary 62P10.

Published 1 January 2008