Online Program


All Times ET

Wednesday, June 2
Machine Learning
Shaping Decisions with Classification and Clustering
Wed, Jun 2, 1:10 PM - 2:45 PM
TBD
 

Spectral Clustering of Mixed-Type Data (309695)

*Felix Mbuga, San Jose State University 
Cristina Tortora, San Jose State University 

Keywords: Cluster analysis, Spectral Clustering, Mixed-type data

Cluster analysis seeks to assign objects with similar characteristics into groups called clusters, such that objects within a group are similar to each other and dissimilar to objects in other groups. Most popular clustering methods work on either quantitative continuous data or qualitative (categorical) data. Among them, spectral clustering has been shown to perform well in different scenarios on continuous data: it can detect convex and non-convex clusters, it is robust to outliers, and it can detect overlapping clusters. However, the restriction to continuous data can be limiting in real applications, where data are often of mixed type, i.e., they contain both continuous and categorical features. This paper looks at extending spectral clustering to mixed-type data. The new method replaces the Euclidean-based distance used in regular spectral clustering with different dissimilarity measures for continuous and categorical variables. A global dissimilarity measure is then computed using a weighted sum, and a Gaussian kernel is used to convert the dissimilarity matrix into a similarity matrix. The new method includes automatic tuning of the variable weight and the kernel parameter. The performance of spectral clustering in different scenarios is compared with that of two state-of-the-art mixed-type data clustering methods, k-prototypes and KAMILA, using several real and simulated data sets. The simulated data were designed to test the effect of several factors on clustering performance: the number of clusters (2 or 4), the degree of overlap in the variables, the number of continuous and categorical variables, the number of levels in the categorical variable, and whether or not the clusters were balanced (number of points per cluster).
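
As an illustration only, the weighted-sum and Gaussian-kernel steps described in the abstract might be sketched as follows. The abstract does not name the individual dissimilarity measures, so this sketch assumes Euclidean distance for the continuous features and simple matching (the fraction of differing categories) for the categorical features. The function names, the weighting scheme `weight` and `1 - weight`, and the fixed values of `weight` and `sigma` are all hypothetical; the method in the abstract tunes the weight and kernel parameter automatically, which is not reproduced here.

```python
import math


def mixed_dissimilarity(cont, cat, weight):
    """Weighted sum of a continuous and a categorical dissimilarity.

    cont:   list of rows of continuous features
    cat:    list of rows of categorical features (same row order as cont)
    weight: hypothetical weight in [0, 1] given to the continuous part
    """
    n = len(cont)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed choice: Euclidean distance on the continuous features.
            d_cont = math.dist(cont[i], cont[j])
            # Assumed choice: simple matching, i.e. the fraction of
            # categorical features on which the two objects differ.
            d_cat = sum(a != b for a, b in zip(cat[i], cat[j])) / len(cat[i])
            d[i][j] = d[j][i] = weight * d_cont + (1.0 - weight) * d_cat
    return d


def gaussian_similarity(d, sigma):
    """Convert a dissimilarity matrix into a similarity matrix
    with a Gaussian kernel of bandwidth sigma."""
    return [[math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in row] for row in d]


# Toy data: three objects, two continuous and two categorical features.
cont = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
cat = [["a", "x"], ["a", "y"], ["b", "y"]]

D = mixed_dissimilarity(cont, cat, weight=0.5)
S = gaussian_similarity(D, sigma=1.0)
```

The resulting matrix `S` plays the role of the similarity (affinity) matrix that a standard spectral clustering routine takes as input. In practice the continuous features would usually be scaled first so that the two dissimilarities are comparable.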