Online Program

Friday, May 18
Computational Statistics
Invitation to Statistical Analysis and Data Mining
Fri, May 18, 3:30 PM - 5:00 PM
Grand Ballroom E

Phyloclustering: A Model-Based Approach for Identifying Microbial Populations (304773)

*Wei-Chen Chen, pbdR Core Team 

Keywords: continuous-time Markov chain model, EM algorithm, model-based clustering, phyclust, unsupervised learning

Phylogenetic clustering (Phyloclustering) is an evolutionary, model-based approach that identifies population structure from molecular data and is especially efficient for large sequence data sets. A Continuous Time Markov Chain (CTMC) model is assumed for the mutation process, with sequences evolving from unknown ancestral sequences. A finite mixture model for the process is proposed, and an Expectation-Maximization (EM) algorithm with analytic formulae for both the E- and M-steps is established for finding maximum likelihood estimators. Individual sequences are assigned to clusters according to their maximum posterior probabilities. A bootstrap procedure for model selection, including determination of the number of clusters, is also introduced. In simulation studies, phyloclustering outperforms existing methods. An R package, phyclust, implements the phyloclustering approach and combines it with other utilities for bootstrap and simulation studies. An application to identify populations of Equine Infectious Anemia Virus (EIAV) will be demonstrated. Analysis of the viral sequences shows that the identified clusters are strikingly associated with the stage of the disease.
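To make the EM idea concrete, here is a minimal Python sketch of a finite mixture of CTMC models. It is not the phyclust implementation. It assumes the simplest substitution model (Jukes-Cantor, JC69), a single evolutionary time t shared by all clusters, and equal-length aligned DNA sequences; these choices, and the names encode, jc69_log_probs, em_phylocluster and init_idx, are illustrative assumptions. Under JC69 a site stays the same with probability 1/4 + 3/4 exp(-4t/3) and changes to a given other base with probability 1/4 - 1/4 exp(-4t/3), so the E-step depends only on Hamming distances to the cluster centers, and the M-step has closed forms: mixing proportions are mean posteriors, each center site is the posterior-weighted most frequent base, and t follows from the weighted proportion of mismatches.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def encode(seqs):
    """Map equal-length DNA strings to an (N, L) integer array."""
    return np.array([[NUCLEOTIDES.index(c) for c in s] for s in seqs])

def jc69_log_probs(t):
    """Log-probability that a site is unchanged / changed to one specific
    other base after time t under the JC69 model (an assumed model)."""
    e = np.exp(-4.0 * t / 3.0)
    return np.log(0.25 + 0.75 * e), np.log(0.25 - 0.25 * e)

def em_phylocluster(X, K, n_iter=50, seed=0, init_idx=None):
    """Hypothetical helper: EM for a K-component mixture of JC69 models.

    Returns (labels, centers, eta, t). `init_idx` optionally lists the rows
    of X used as initial cluster centers; otherwise they are drawn at random.
    """
    rng = np.random.default_rng(seed)
    N, L = X.shape
    if init_idx is None:
        init_idx = rng.choice(N, size=K, replace=False)
    centers = X[np.asarray(init_idx)].copy()   # (K, L) ancestral sequences
    eta = np.full(K, 1.0 / K)                  # mixing proportions
    t = 0.1                                    # shared evolutionary time
    onehot = np.eye(4)[X]                      # (N, L, 4) indicator of each base

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities from Hamming distances.
        d = (X[:, None, :] != centers[None, :, :]).sum(axis=2)   # (N, K)
        log_same, log_diff = jc69_log_probs(t)
        log_post = np.log(eta) + (L - d) * log_same + d * log_diff
        log_post -= log_post.max(axis=1, keepdims=True)          # stabilize
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step (closed form under the JC69 assumption).
        eta = np.clip(post.mean(axis=0), 1e-12, None)
        eta /= eta.sum()
        counts = np.einsum("nk,nla->kla", post, onehot)          # (K, L, 4)
        centers = counts.argmax(axis=2)                          # weighted plurality
        d = (X[:, None, :] != centers[None, :, :]).sum(axis=2)
        mismatch = np.clip((post * d).sum() / (N * L), 1e-6, 0.75 - 1e-6)
        t = -0.75 * np.log(1.0 - 4.0 * mismatch / 3.0)

    # Assign each sequence to its maximum-posterior cluster.
    return post.argmax(axis=1), centers, eta, t
```

A call such as em_phylocluster(encode(seqs), K=2) returns one cluster label per sequence together with the estimated centers, mixing proportions, and time. Like any EM procedure it can converge to a local optimum, so in practice several random starts would be compared by likelihood.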