Keywords: Continuous Time Markov Chain model, EM algorithm, model-based clustering, phyclust, unsupervised learning
Phylogenetic clustering (phyloclustering) is an evolutionary, model-based approach that identifies population structure from molecular data and is especially efficient for large sequence data sets. A Continuous Time Markov Chain (CTMC) model is assumed for the mutation process, with sequences evolving from unknown ancestral sequences. A finite mixture model for the process is proposed, and an Expectation-Maximization (EM) algorithm with analytic formulae for both the E- and M-steps is established for finding maximum likelihood estimators. Individual sequences are assigned to the cluster for which their posterior probability is highest. A bootstrap procedure for model selection, including determination of the number of clusters, is also introduced. In simulation studies, phyloclustering outperforms existing methods. An R package, phyclust, implements the phyloclustering approach and bundles utilities for bootstrap and simulation studies. An application to identifying populations of Equine Infectious Anemia Virus (EIAV) is demonstrated: the viral sequences are analyzed, and the identified clusters are seen to be strikingly associated with the stage of the disease.
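The E-step and the maximum-posterior assignment described above can be illustrated with a minimal sketch. It is not the phyclust implementation and does not use its API. It assumes the Jukes-Cantor (JC69) model as one concrete CTMC, and it treats the ancestral sequences, mixing proportions, and evolutionary time as already known, whereas the full EM algorithm would estimate them in the M-step. All function names, the toy sequences, and the parameter values are hypothetical.

```python
import math

def jc69_log_prob(ancestor, seq, t):
    """Log-probability of seq evolving from ancestor under JC69.

    t is the expected number of substitutions per site (an assumed,
    fixed value here; EM would estimate it).
    """
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # site is unchanged after time t
    p_diff = 0.25 - 0.25 * decay   # site changed to one specific other base
    return sum(math.log(p_same if a == b else p_diff)
               for a, b in zip(ancestor, seq))

def e_step(seqs, ancestors, weights, t):
    """Posterior probability of each cluster for each sequence."""
    posteriors = []
    for s in seqs:
        logs = [math.log(w) + jc69_log_prob(c, s, t)
                for c, w in zip(ancestors, weights)]
        top = max(logs)                        # subtract max for stability
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        posteriors.append([u / total for u in unnorm])
    return posteriors

def assign(posteriors):
    """Assign each sequence to its maximum-posterior cluster."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]

# Toy data: two hypothetical ancestral sequences and four observed sequences.
ancestors = ["AAAAAAAA", "CCCCCCCC"]
weights = [0.5, 0.5]
seqs = ["AAAAAAAT", "CCCCCCCA", "AAACAAAA", "CCCCGCCC"]

posteriors = e_step(seqs, ancestors, weights, t=0.1)
labels = assign(posteriors)
print(labels)
```

In a complete EM iteration, these posteriors would then weight the M-step updates of the mixing proportions, ancestral sequences, and CTMC parameters, and the two steps would alternate until the likelihood converges.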