Abstract Details

Activity Number: 301 - SPEED: Statistics for Biopharmaceutical Studies
Type: Contributed
Date/Time: Tuesday, July 31, 2018 : 8:30 AM to 10:20 AM
Sponsor: Biopharmaceutical Section
Abstract #328736 Presentation
Title: A Study in the Use of Unsupervised Random Forest in the Analysis of Data Sets Composed of Categorical Variables/Features
Author(s): Nelson Lee Afanador* and Richard Baumgartner and Dai Feng
Companies: Merck and Merck and Merck
Keywords: unsupervised random forest; random forest; multivariate data; dimensionality reduction; categorical data; principal component analysis

Multivariate categorical data sets, and data sets with non-linearly separable clusters, are ubiquitous across scientific and industrial fields. For example, in biologics and vaccine manufacturing process development, it is necessary to understand the distribution of process performance across many batches. Clinical and real-world observational data can likewise contain a large number of categorical variables. Clusters that are not linearly separable, when present, further complicate the analysis.

Various dimensionality reduction methods have gained acceptance, with principal component analysis (PCA) being the simplest and kernel PCA one of the more complicated. Unsupervised random forest (URF) is another method capable of discovering patterns in multivariate data. In this talk we will examine the use of URF for clustering and variable selection in categorical multivariate data and in data with non-linearly separable clusters. Our focus will be threefold: 1) assessing the ability of URF to recover clusters in a multivariate latent space, 2) presenting a strategy for reconstructing the original features, and 3) assessing variable importance.
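
The abstract does not describe how the URF was built, so the following is only a minimal sketch of one common construction (the Breiman-style approach), not the authors' implementation. It assumes the categorical features are already integer-coded, uses scikit-learn's RandomForestClassifier, and introduces a hypothetical helper name, urf_proximity. A forest is trained to separate the observed rows from synthetic rows in which each column has been permuted independently, and the proximity of two observed rows is the fraction of trees in which they fall into the same leaf.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def urf_proximity(X, n_trees=200, seed=0):
    """Sketch of a Breiman-style unsupervised random forest (hypothetical helper).

    X is an (n_samples, n_features) array of integer-coded categories.
    Returns the (n_samples, n_samples) proximity matrix and the fitted forest.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Synthetic reference data: permute each column independently, which keeps
    # the marginal distributions but breaks dependence between features.
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])

    # Label observed rows 1 and synthetic rows 0, then fit a classifier.
    X_all = np.vstack([X, X_synth])
    y_all = np.r_[np.ones(n, dtype=int), np.zeros(n, dtype=int)]
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_all, y_all)

    # leaves[i, t] is the leaf index that tree t assigns to observed row i.
    leaves = forest.apply(X)

    # Proximity: fraction of trees in which two observed rows share a leaf.
    # Note: this builds an n x n x n_trees array, so it suits small n only.
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return prox, forest
```

Under these assumptions, the dissimilarity 1 - prox could be passed to multidimensional scaling or hierarchical clustering to look for clusters, and forest.feature_importances_ gives one (impurity-based) view of variable importance. Note that scikit-learn treats integer codes as ordered values, which is a simplification for truly nominal categories.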

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program