Abstract:
|
The idea of using data to train models that are both accurate and interpretable has been around for decades. The goal is to build such a model from the available predictors. In the age of big data, however, it is increasingly common for a data set to be high-dimensional, meaning the number of predictors vastly exceeds the number of observations. In this setting, many long-standing statistical modeling techniques, such as linear and logistic regression, no longer suffice. Regularization is a popular technique that imposes a penalty on the original model; some penalties yield sparse models, which retain only a subset of the predictors and are therefore easier to interpret. In this study, we investigate the potential effectiveness of using clustering algorithms to generate grouping structures for high-dimensional data sets. Using various regularization techniques, we seek to determine whether the generated groups are truly relevant to the response and whether the accuracy and interpretability of the models can be improved. We support the clustered group structure theory using two real-world data sets.
|