Abstract:
|
Clustering mixed-mode data is difficult for two reasons: the association between variables of different types must be handled, and the continuous and categorical variables must be weighted appropriately in the algorithm. We adopt a multivariate normal model for such data, assuming that each categorical variable arises from a latent continuous variable whose thresholds define the categories. We propose a new method to generate realizations of the latent variables corresponding to the observed categorical variables, and we then apply k-means clustering to the observed and generated continuous data. This new method, called the latent realization clustering method, depends on the Kendall rank correlation coefficient between variables of different types. When applied to simulated data, this method is less accurate than the mixture-model-based clustering method but takes much less time. Additionally, we use the variation across our latent data realizations to produce estimated probabilities of each observation belonging to each cluster, as well as a matrix that estimates the probability of each pair of observations falling in the same cluster.
|
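The pairwise probability matrix mentioned in the abstract can be illustrated with a minimal sketch. It assumes that cluster labels from several latent-data realizations are already available as plain lists. The function name `co_clustering_matrix` and the toy labels are hypothetical and are not taken from the authors' implementation. Each entry is simply the fraction of realizations in which two observations share a cluster, so the result does not depend on how clusters are numbered in any one run.

```python
def co_clustering_matrix(labelings):
    """Estimate pairwise same-cluster probabilities.

    labelings: list of label lists, one per realization, all of length n.
    Returns an n x n list of lists whose [i][j] entry is the fraction of
    realizations in which observations i and j received the same label.
    """
    n_runs = len(labelings)
    n_obs = len(labelings[0])
    counts = [[0] * n_obs for _ in range(n_obs)]
    for labels in labelings:
        for i in range(n_obs):
            for j in range(n_obs):
                # Only label equality matters, so relabeled clusters
                # (label switching between runs) do not affect the count.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


# Hypothetical labels for four observations over three realizations.
labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels swapped
    [0, 0, 0, 1],
]
pair_prob = co_clustering_matrix(labelings)
```

In this toy input, observations 0 and 1 are always clustered together, while observations 0 and 3 never are, so the corresponding entries are 1 and 0.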