Multivariate categorical data sets, and data sets whose clusters are not linearly separable, are ubiquitous across scientific and industrial fields. For example, in biologics and vaccine manufacturing process development, it is necessary to understand the distribution of process performance across many batches. Likewise, clinical and real-world observational data can contain a large number of categorical variables. When clusters that are not linearly separable are present, they further complicate the analysis.
Various dimensionality reduction methods have gained acceptance, with principal component analysis (PCA) among the simplest and kernel PCA among the more complex. Unsupervised random forest (URF) is another method capable of discovering patterns in multivariate data. In this talk we will examine the use of URF for clustering and variable selection, both in multivariate categorical data and in data with non-linearly separable clusters. Our focus will be threefold: 1) assessing the ability of URF to recover clusters in a multivariate latent space, 2) presenting a strategy for reconstructing the original features, and 3) assessing variable importance.
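For readers unfamiliar with URF, the following is a minimal sketch of one common construction, not necessarily the one used in the talk. It assumes scikit-learn is available and that categorical variables have already been encoded numerically. A synthetic copy of the data is made by permuting each column independently, a random forest is trained to tell real from synthetic observations, and the proximity of two real observations is taken as the fraction of trees in which they fall in the same leaf. The function name `urf_proximity` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def urf_proximity(X, n_trees=200, seed=0):
    """Return an (n, n) proximity matrix from an unsupervised random forest.

    Hypothetical helper for illustration; X is assumed to be numeric
    (categorical variables already encoded).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Synthetic data: permute each column independently. This keeps each
    # variable's marginal distribution but breaks dependence between variables.
    X_synth = np.column_stack([rng.permutation(col) for col in X.T])

    # Label real rows 1 and synthetic rows 0, then fit a classifier on both.
    data = np.vstack([X, X_synth])
    labels = np.r_[np.ones(len(X)), np.zeros(len(X_synth))]
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)

    # Leaf index of every real observation in every tree: shape (n, n_trees).
    leaves = forest.apply(X)

    # Proximity: fraction of trees in which two observations share a leaf.
    # Note this builds an (n, n, n_trees) boolean array, so it suits small n.
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)
```

Under these assumptions, `1 - urf_proximity(X)` can serve as a dissimilarity matrix for a clustering or embedding method that accepts precomputed distances; which downstream method is appropriate depends on the data.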