Abstract:
|
Multivariate mixture models provide a convenient method for density estimation and model-based clustering, and they offer excellent insight into the actual data-generating process. However, choosing the number of components (k) in a statistically meaningful way remains a subject of considerable research. Available methods for estimating k include optimizing AIC and BIC, gradient checking in a nonparametric mixture model setup, and Bayesian approaches with entropy distances. In this paper we present rules for selecting k based on a one-sided nonparametric confidence set generated by a quadratic distance measure. In this methodology the goal is to find the minimal number of components needed to adequately describe the true distribution. We also present results for selecting k based on a risk analysis that includes a penalty for overfitting; here the goal is to find the fitted mixture that is closest to the true distribution. Finally, we fine-tune our methods to analyze gene-expression data from microarrays and compare them with competing methods.
|