Abstract:
|
In model-based clustering, different density functions are used to model the sub-populations in the data. When the data contain outliers, robust distributions such as the Student's t (T) or the contaminated normal (CN) distribution can be used, along with their multiple scaled (MS) extensions, MST and MSCN, which allow for directional tail behavior. Model-based clustering methods take the number of clusters as an input parameter, and many indices exist for choosing it. In this paper, we use simulated and real data sets to compare indices for selecting the number of clusters when fitting mixtures of T, CN, MST, and MSCN distributions. The effectiveness of each index is measured by the number of times it selects the correct number of sub-populations in the data.
|