301 – Large Data Discriminant, Classification, and Detection
A New Approach to the Parallel Coordinates Method for Large Data Sets
Norman Matloff
University of California, Davis
Yingkang Xie
University of California, Davis
Parallel coordinates is an exploratory method for visualizing interrelations among variables. The concept is highly appealing, but the method becomes difficult or impossible to use when the number of data points n, the number of variables p, or both become even moderately large. Large n causes the "black screen problem," in which overplotted lines fill the display, while large p makes relations between "distant" axes difficult to discern. Various remedies, such as line density, alpha-blending, and axis permutation, have been proposed. In this work we present a fresh approach, also based on multivariate density estimation but applied in a very different manner: we plot only the lines with the highest estimated densities, so that the graph contains a few "typical" lines. This solves the large-n problem and ameliorates the large-p problem. The user may instead specify that the lines with the smallest densities be plotted, as a means of detecting outliers. Finally, the user can specify that lines with locally maximal densities be plotted, with the goal of cluster hunting. We also present an application to regression diagnostics. The software uses parallel processing to speed the computation and is available on CRAN.
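The selection step described above can be illustrated with a minimal sketch. It is not the authors' software: the abstract does not say which density estimator is used, so the sketch assumes a simple k-nearest-neighbor estimate (a point whose k-th nearest neighbor is close lies in a dense region). The function names `knn_density_scores` and `select_lines`, the parameters `m` and `k`, and the synthetic data are all hypothetical, and the sketch assumes the variables are already on comparable scales.

```python
import math
import random


def knn_density_scores(rows, k=5):
    """Score each row by the inverse distance to its k-th nearest neighbor.

    Assumption: a k-NN estimate stands in for whatever multivariate
    density estimator the actual software uses.
    """
    scores = []
    for i, a in enumerate(rows):
        # Brute-force distances from row i to every other row.
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(1.0 / (dists[k - 1] + 1e-12))
    return scores


def select_lines(rows, m=5, k=5, lowest=False):
    """Return indices of the m rows with the highest density scores,
    or the lowest scores when lowest=True (outlier detection)."""
    scores = knn_density_scores(rows, k)
    order = sorted(range(len(rows)), key=lambda i: scores[i], reverse=not lowest)
    return order[:m]


# Synthetic illustration: 200 Gaussian points in 3 variables plus one planted outlier.
random.seed(0)
data = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
data.append([8.0, -8.0, 8.0])  # index 200, far from everything else

typical = select_lines(data, m=5)                 # rows to draw as "typical" lines
outliers = select_lines(data, m=1, lowest=True)   # row to draw as an outlier line
```

Only the rows indexed by `typical` (or `outliers`) would then be drawn as lines across the parallel axes, so the number of plotted lines stays at m regardless of n. The brute-force distance computation is quadratic in n, which suggests why parallel processing is useful for large data sets.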