Abstract:
|
Model-based clustering with finite mixture models has become a widely used clustering method, implemented in the R package MCLUST. Observations to be clustered are usually assumed to have been measured accurately, but in some situations this assumption does not hold. This article proposes a new model-based clustering algorithm, called MCLUST-ME, that properly accounts for measurement errors. More specifically, we assume that the distribution of each observation consists of an underlying true component distribution and an independent measurement error distribution. Under this assumption, for two-group clustering, the data are in general no longer linearly or quadratically separable. Instead, each unique value of the measurement error covariance corresponds to its own decision boundary. Through simulation, we confirm this point and also find that, on average, our method performs at least as well as MCLUST in terms of accuracy in the presence of measurement errors, and that the two methods do not always choose the same optimal model. A real data set from RNA-Seq analysis is used to further illustrate the difference in clustering results between the two methods.
|
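The dependence of the decision boundary on the measurement error can be illustrated with a minimal univariate sketch. It assumes Gaussian components and Gaussian errors, so that an observation with error variance `err_var` has marginal variance `variances[k] + err_var` under component k. The function names and all parameter values are hypothetical and chosen only for illustration; this is not the MCLUST-ME implementation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def membership_probs(x, err_var, weights, means, variances):
    """Posterior component probabilities for one observation.

    Assumption: observed value = true value + independent Gaussian error,
    so component k contributes a N(means[k], variances[k] + err_var) density.
    """
    dens = [
        w * normal_pdf(x, m, v + err_var)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return [d / total for d in dens]


# Hypothetical two-component mixture (illustrative values only).
weights = [0.5, 0.5]
means = [0.0, 3.0]
variances = [1.0, 4.0]

# The same observed value receives different membership probabilities
# depending on how noisy its measurement is.
x = 1.2
for err_var in (0.0, 1.0, 9.0):
    probs = membership_probs(x, err_var, weights, means, variances)
    print(err_var, [round(p, 3) for p in probs])
```

With `err_var = 0` the computation reduces to the ordinary mixture posterior. As `err_var` grows, the two marginal variances become more similar and the point at which the two posteriors cross moves, which is the sense in which each error variance has its own decision boundary.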