JSM 2016 Online Program

Activity Number:	540
Type:	Contributed
Date/Time:	Wednesday, August 3, 2016 : 10:30 AM to 12:20 PM
Sponsor:	Section on Statistical Computing
Abstract #321515
Title:	A New Approach to Visualizing and Clustering Mixed Categorical and Numeric Data
Author(s):	Samuel Buttrey* and Lyn Whitaker
Companies:	Naval Postgraduate School and Naval Postgraduate School
Keywords:	Inter-point distance ; Mixed data ; Visualization
Abstract:	Measuring distances or dissimilarities between observations is critical in tasks such as clustering, collaborative filtering, and pattern recognition. An inter-observation dissimilarity should account for categorical variables, scale numeric ones, protect against distortion from outliers, remove noisy or redundant variables, and handle missing values gracefully. We propose such a dissimilarity based on a set of classification and regression trees. Our dissimilarity performs better than competitors in the face of noise and outliers, and it scales to large data sets. The procedure can produce a new data set whose inter-point Euclidean distances reflect the tree-based dissimilarities. This keeps the size of the problem manageable and allows low-dimensional visualization through methods such as multidimensional scaling or the t-SNE algorithm of van der Maaten and Hinton (2008). Some examples with their corresponding visualizations are presented.
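
The abstract does not spell out how the trees are turned into a dissimilarity, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes each observation has already been assigned a leaf in each of several fitted trees (the `leaf_ids` table is made-up illustrative data) and takes the dissimilarity to be the fraction of trees in which two observations land in different leaves. It then illustrates one way a "new data set" could carry that dissimilarity in its Euclidean distances: one-hot encoding the leaf memberships. All function names are hypothetical.

```python
import math

# Hypothetical leaf assignments: leaf_ids[i][t] is the leaf that observation i
# falls into in tree t. In practice these would come from fitted trees.
leaf_ids = [
    [0, 1, 2],
    [0, 1, 0],
    [1, 0, 0],
]

def tree_dissimilarity(a, b):
    """Assumed definition: fraction of trees where the two leaves differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def one_hot_embed(rows):
    """Encode every (tree, leaf) pair seen in the data as a 0/1 column."""
    columns = sorted({(t, leaf) for row in rows for t, leaf in enumerate(row)})
    return [[1.0 if row[t] == leaf else 0.0 for t, leaf in columns]
            for row in rows]

def euclidean(u, v):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

embedded = one_hot_embed(leaf_ids)
n_trees = len(leaf_ids[0])

for i in range(len(leaf_ids)):
    for j in range(i + 1, len(leaf_ids)):
        d = tree_dissimilarity(leaf_ids[i], leaf_ids[j])
        e = euclidean(embedded[i], embedded[j])
        # Each disagreeing tree contributes exactly 2 to the squared distance,
        # so the squared distance equals 2 * n_trees * d.
        assert abs(e ** 2 - 2 * n_trees * d) < 1e-9
        print(i, j, round(d, 3), round(e, 3))
```

Under these assumptions the embedded rows can be passed to any method that works on Euclidean distances, such as multidimensional scaling or t-SNE, without storing the full pairwise dissimilarity matrix.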

Authors who are presenting talks have a * after their name.