Conference Program

All Times ET

Thursday, June 9
Computational Statistics
Machine Learning
New Models, Methods, and Applications I
Thu, Jun 9, 3:45 PM - 5:15 PM

CGMM: an algorithm for constrained model-based clustering (310134)

Yujia Li, Department of Biostatistics, School of Public Health, University of Pittsburgh 
George C. Tseng, Department of Biostatistics, School of Public Health, University of Pittsburgh 
*Jian Zou, Department of Biostatistics, School of Public Health, University of Pittsburgh  

Keywords: machine learning, model-based clustering, constrained clustering

Model-based clustering methods assign samples to groups based on distributional assumptions about the observations and are widely used for rigorous inference, interpretation, and prediction. Conventional model-based clustering methods primarily consider distributional similarity among observations without predetermined bounds on cluster size, which leads to three drawbacks in practice: 1) loss of clustering accuracy due to vulnerability to local optima; 2) empty clusters when cluster sizes are imbalanced and small groups are merged into large ones; 3) failure to fulfill the clustering goal when a cluster size constraint is mandatory. To the best of our knowledge, little has been developed for model-based cluster analysis with cluster size constraints. To bridge this gap, we develop a novel algorithm, the constrained Gaussian mixture model (CGMM), by extending the Gaussian mixture model (GMM). We also generalize CGMM to a sparse variant (SCGMM) that uses a lasso penalty to allow feature selection in high-dimensional data. Extensive simulations and three real applications demonstrate the superior performance of our proposed method.
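
To make the idea of a cluster size constraint concrete, the following is a minimal sketch of a size-constrained assignment step. It is not the CGMM algorithm, whose details the abstract does not give. It assumes per-sample, per-cluster log-likelihood scores have already been computed (for example, from a fitted GMM). The function name assign_with_min_size, the greedy repair rule, and the toy scores are all hypothetical and chosen only for illustration.

```python
def assign_with_min_size(scores, min_size):
    """Hard-assign samples to clusters so every cluster has >= min_size members.

    scores[i][c] is the log-likelihood of sample i under cluster c
    (higher is better). This is a greedy illustration, not an optimal solver.
    """
    n, k = len(scores), len(scores[0])
    if k * min_size > n:
        raise ValueError("min_size is infeasible for this many samples")

    # Unconstrained step: each sample goes to its highest-scoring cluster.
    labels = [max(range(k), key=lambda c: row[c]) for row in scores]
    sizes = [labels.count(c) for c in range(k)]

    # Repair step: fill each undersized cluster from clusters with a surplus.
    for c in range(k):
        while sizes[c] < min_size:
            # Only clusters above the lower bound may give up a member.
            donors = [i for i in range(n) if sizes[labels[i]] > min_size]
            # Move the sample whose score drops least when reassigned to c.
            i = min(donors, key=lambda j: scores[j][labels[j]] - scores[j][c])
            sizes[labels[i]] -= 1
            labels[i] = c
            sizes[c] += 1
    return labels


# Toy scores (made up): every sample prefers cluster 0, so an unconstrained
# assignment would leave cluster 1 empty.
scores = [
    [-1.0, -5.0],
    [-1.2, -4.0],
    [-0.9, -6.0],
    [-1.1, -1.5],
    [-1.0, -1.3],
    [-0.8, -7.0],
]
labels = assign_with_min_size(scores, min_size=2)
print(labels)
```

In this toy case the two samples with the smallest loss in score are moved to cluster 1, which avoids the empty cluster described in drawback 2. A greedy repair of this kind is only one possible design; an exact approach would treat the assignment as an optimization problem subject to the size bounds.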