Abstract:
|
Cluster analysis aims to partition a data set into smaller groups of similar observations. In model-based clustering, the population is assumed to be a convex combination of sub-populations, each of which is modeled by a probability distribution. When a data set contains outliers, a contaminated normal (CN) distribution can be used to model each sub-population. The CN distribution is a two-component normal mixture: one component, with a large prior probability, represents good observations, and the other, with a small prior probability, the same mean, and an inflated covariance matrix, represents outliers. The CN distribution yields robust parameter estimates and detects mild outliers automatically. An extension of the CN distribution, the multiple scaled contaminated normal (MSCN) distribution, offers directional robust parameter estimation and outlier detection; that is, these procedures work separately for each principal component. However, this model cannot be fitted to incomplete data sets. Hence, we develop a framework, based on the expectation-conditional maximization algorithm, for fitting a mixture of MSCN distributions to data sets in which some values are missing at random.
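As a sketch of the two-component structure described above, the CN density is commonly written as follows. The symbols $\alpha$ (the prior probability of a good observation) and $\eta$ (the factor that inflates the covariance matrix) are notation assumed here for illustration and do not appear in the abstract; $\phi$ denotes the multivariate normal density.

```latex
f_{\mathrm{CN}}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \alpha, \eta)
  = \alpha \, \phi(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})
  + (1 - \alpha) \, \phi(\mathbf{x}; \boldsymbol{\mu}, \eta \boldsymbol{\Sigma}),
\qquad \alpha \in (0.5, 1), \quad \eta > 1 .
```

Under this assumed notation, one common rule flags an observation $\mathbf{x}$ as an outlier when its posterior probability of belonging to the good component, $\alpha \, \phi(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) / f_{\mathrm{CN}}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \alpha, \eta)$, falls below $0.5$.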
|