Abstract:
|
Measuring distances or dissimilarities between observations is critical in tasks like clustering, collaborative filtering, and pattern recognition. An inter-observation dissimilarity should account for categorical variables, scale numeric ones, protect against distortion from outliers, remove noisy or redundant variables, and handle missing values gracefully. We propose such a dissimilarity based on a set of classification and regression trees. Our dissimilarity performs better than competitors in the face of noise and outliers, and can be scaled to large data sets. The procedure can produce a new data set whose inter-point Euclidean distances reflect the tree-based dissimilarities. This keeps the size of the problem manageable and allows low-dimensional visualization through methods like multidimensional scaling or the t-SNE algorithm of van der Maaten and Hinton (2008). We present several examples with their corresponding visualizations.
|