JSM 2017 Online Program

Activity Number:	588 - Statistical Learning: Clustering
Type:	Contributed
Date/Time:	Wednesday, August 2, 2017 : 2:00 PM to 3:50 PM
Sponsor:	Section on Statistical Learning and Data Science
Abstract #324601
Title:	Variable Selection in K-Means Clustering
Author(s):	Nicholas Scott Berry* and Ranjan Maitra
Companies:	Iowa State University and Iowa State University
Keywords:	K-means ; Bootstrap ; Variable Selection ; Clustering ; Unsupervised
Abstract:	Clustering is often used as a first look at a dataset to determine whether it contains hard-to-detect patterns. The number of clusters and the dimensions analyzed both affect the usability of the information obtained from clustering, but neither can generally be known without heavy analysis of the dataset beforehand. We use a bootstrap approach to simultaneously identify the number of clusters and the dimensions that should be included in the most practical clustering scheme for the data. The algorithm iteratively compares simple and complex models, testing whether the information added by a more complex model is worth its added complexity. The algorithm is applied to a dataset of 100 km race runners and a dataset of written digits, and a simulation study assesses its performance in differing situations and in comparison with similar techniques.
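
The abstract does not spell out the test itself, so the following is only an illustrative sketch of the general idea of using a bootstrap to decide whether a more complex k-means model is worth its added complexity. It is not the authors' algorithm. Everything in it is an assumption made for illustration: the test statistic (the relative drop in within-cluster sum of squares when moving from k to k+1 clusters), the parametric bootstrap from spherical Gaussians centred on the fitted k-cluster model, and the hypothetical function names kmeans, wss_drop and bootstrap_pvalue. It covers only the choice of the number of clusters, not the selection of dimensions.

```python
import numpy as np


def kmeans(X, k, rng, n_init=5, n_iter=50):
    """Plain Lloyd's algorithm with random restarts.

    Returns (centers, labels, wss) for the restart with the smallest
    within-cluster sum of squares (wss).
    """
    best = None
    for _ in range(n_init):
        # Start from k distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute means; an empty cluster keeps its previous center.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        wss = d2.min(axis=1).sum()
        if best is None or wss < best[2]:
            best = (centers, labels, wss)
    return best


def wss_drop(X, k, rng):
    """Relative reduction in WSS when going from k to k + 1 clusters."""
    wss_simple = kmeans(X, k, rng)[2]
    wss_complex = kmeans(X, k + 1, rng)[2]
    return (wss_simple - wss_complex) / wss_simple


def bootstrap_pvalue(X, k, n_boot=99, seed=0):
    """Parametric-bootstrap p-value for 'k clusters' against 'k + 1 clusters'.

    Assumption for this sketch: under the simpler model the data are
    spherical Gaussians around the k fitted centers with a pooled variance.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    observed = wss_drop(X, k, rng)
    centers, labels, wss = kmeans(X, k, rng)
    sigma = np.sqrt(wss / (n * d))  # pooled per-coordinate standard deviation
    null = np.empty(n_boot)
    for b in range(n_boot):
        # Simulate a dataset from the fitted simpler model and re-test it.
        Xb = centers[labels] + sigma * rng.standard_normal((n, d))
        null[b] = wss_drop(Xb, k, rng)
    # Add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_boot)
```

A small p-value suggests that the extra cluster reduces the within-cluster sum of squares by more than would be expected if the simpler model were adequate. One could start at k = 1 and increase k until the test stops rejecting. A comparable comparison of models with and without a given dimension would be needed for variable selection, which this sketch does not attempt.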

Authors who are presenting talks have a * after their name.