Abstract Details

Activity Number: 663 - Regression, Clustering and Gene Set Methods in Genomics
Type: Contributed
Date/Time: Thursday, August 1, 2019 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistics in Genomics and Genetics
Abstract #307148
Title: Advances in the Hard Clustering of Categorical Data
Author(s): Karin Dorman*
Companies: Iowa State University
Keywords: clustering; categorical data; k-modes; microbiome; next generation sequencing

Mining clusters from datasets is an important endeavor in several contexts. The k-means method is a popular and efficient distribution-free approach to clustering numerical-valued data, but it cannot be applied to categorical-valued observations. The k-modes method fills this gap by introducing dissimilarity measures for categorical data and by replacing the mean with the mode of each categorical feature in the modified objective function. We provide a fast, computationally efficient implementation of k-modes that reduces unnecessary computations and finds better optima than existing k-modes algorithms. Furthermore, we extend k-modes to cluster categorical data when observations are subject to error that is reported as misclassification probabilities. We apply the algorithm to next-generation sequencing data with accompanying quality scores to identify species in microbiome communities.
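
To make the basic idea concrete, the following is a minimal sketch of a textbook-style k-modes iteration, assuming the simple matching (Hamming) dissimilarity and user-supplied initial modes. It is not the implementation described in the abstract and includes neither its speedups nor the extension to misclassification probabilities. The function names (hamming, column_modes, k_modes) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: number of features where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each feature (ties go to the first one seen)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=100):
    """Alternate assignment and mode updates until labels stop changing.

    Assumes max_iter >= 1. Returns (labels, modes, cost), where cost is the
    total dissimilarity between each observation and its cluster mode.
    """
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda k: hamming(row, modes[k]))
            for row in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: per-feature mode replaces the k-means centroid.
        for k in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == k]
            if members:  # keep the old mode if a cluster becomes empty
                modes[k] = column_modes(members)
    cost = sum(hamming(row, modes[lab]) for row, lab in zip(data, labels))
    return labels, modes, cost


# Toy categorical observations (hypothetical), e.g. short nucleotide strings.
data = [
    ("A", "C", "G"), ("A", "C", "T"), ("A", "C", "G"),
    ("T", "G", "A"), ("T", "G", "C"), ("T", "G", "A"),
]
labels, modes, cost = k_modes(data, init_modes=[data[0], data[3]])
print(labels, modes, cost)
```

Like k-means, this kind of iteration only reaches a local optimum that depends on the initial modes, which is why the quality of the optima found is a meaningful point of comparison between algorithms.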

Authors who are presenting talks have a * after their name.