Abstract Details

Activity Number: 540
Type: Contributed
Date/Time: Wednesday, August 3, 2016 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Computing
Abstract #321515
Title: A New Approach to Visualizing and Clustering Mixed Categorical and Numeric Data
Author(s): Samuel Buttrey* and Lyn Whitaker
Companies: Naval Postgraduate School and Naval Postgraduate School
Keywords: Inter-point distance; Mixed data; Visualization
Abstract:

Measuring distances or dissimilarities between observations is critical in tasks like clustering, collaborative filtering, and pattern recognition. An inter-observation dissimilarity should account for categorical variables, scale numeric ones, protect against distortion from outliers, remove noisy or redundant variables, and handle missing values gracefully. We propose such a dissimilarity based on a set of classification and regression trees. Our dissimilarity performs better than competitors in the face of noise and outliers, and can be scaled to large data. The procedure can produce a new data set whose inter-point Euclidean distances reflect the tree-based dissimilarities. This keeps the size of the problem manageable and allows low-dimensional visualization through methods like multidimensional scaling or the t-SNE algorithm of van der Maaten and Hinton (2008). Some examples with their corresponding visualizations are presented.
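As a rough illustration of how a tree-based dissimilarity can be turned into a data set with matching Euclidean distances, the sketch below uses one plausible rule that this abstract does not itself confirm: two observations are dissimilar in proportion to the number of trees in which they land in different leaves. The leaf labels are hypothetical stand-ins for the output of already-fitted trees, and the helper names `leaf_dissimilarity`, `one_hot_embed`, and `euclidean` are invented for this example.

```python
from math import sqrt

# Hypothetical leaf labels: leaves[i][t] is the leaf that observation i
# falls into for tree t (three observations, four trees). Illustrative only;
# in practice these would come from fitted classification/regression trees.
leaves = [
    ["a", "x", "p", "m"],
    ["a", "y", "p", "n"],
    ["b", "y", "q", "n"],
]


def leaf_dissimilarity(u, v):
    """Assumed rule: fraction of trees in which u and v land in different leaves."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def one_hot_embed(rows):
    """Create one indicator column per (tree, leaf) pair.

    The indicators are scaled by 1/sqrt(2 * n_trees) so that the squared
    Euclidean distance between two embedded rows equals leaf_dissimilarity.
    """
    n_trees = len(rows[0])
    columns = sorted({(t, leaf) for row in rows for t, leaf in enumerate(row)})
    scale = 1.0 / sqrt(2 * n_trees)
    return [[scale if row[t] == leaf else 0.0 for t, leaf in columns] for row in rows]


def euclidean(u, v):
    """Plain Euclidean distance between two numeric vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


embedded = one_hot_embed(leaves)
d01 = leaf_dissimilarity(leaves[0], leaves[1])
e01 = euclidean(embedded[0], embedded[1])
d02 = leaf_dissimilarity(leaves[0], leaves[2])
e02 = euclidean(embedded[0], embedded[2])
```

Under these assumptions the embedded rows are ordinary numeric vectors, so they could be passed directly to multidimensional scaling or t-SNE without storing a full pairwise dissimilarity matrix.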


Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

 
 
Copyright © American Statistical Association