Abstract:
|
We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is SigClust, a method developed by Liu, Hayes, Nobel and Marron (2008) that introduces a test based on the k-means objective (with k = 2) to decide whether the data should be split into two clusters. When applied recursively, this test yields a method for hierarchical clustering that is equipped with a significance guarantee. In this paper, we study the power of this approach in some examples and show that there are large regions of the parameter space where the power is low. We then introduce a new test based on the idea of relative fit. In contrast to prior work, we do not assume that the distribution is either Gaussian or a mixture of Gaussians. Rather, we develop a test for whether a mixture of Gaussians provides a better fit relative to a single Gaussian, without assuming that either model is correct. The test we propose has a simple critical value and provides provable error control. We show how our tests can be used for clustering both hierarchically and sequentially, the latter in the manner of model selection.
|