Many popular methods in comparative genomics fit continuous-time, discrete-state Markov models to large sets of homologous sequences and interpret them in a variety of statistical frameworks, all of which require evaluating complex and expensive phylogenetic likelihood functions. When applied to coding sequence data, parameter estimates from these models can yield important biological insights, e.g., the genomic locations and evolutionary periods at which natural selection (conservation and adaptation) acted upon sequences. These methods can gain significant power when applied to large datasets, but current implementations become computationally intractable at that scale.
We describe an adaptation of the ideas popularized in the latent Dirichlet allocation literature to this domain. In particular, our implementation reduces the number of required likelihood calculations to a number fixed a priori (independent of the size of the data), obtains relevant parameter estimates via variational Bayes (VB) inference, and readily scales to datasets 100 times larger than can be tackled with current methods. We demonstrate applications of these methods to selection inference.
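
As an illustration only, the following minimal sketch shows one way that a fixed set of likelihood evaluations can be combined with VB inference; it is not the implementation described here. It assumes that per-site likelihoods have been evaluated once at each point of a fixed parameter grid, and that the grid weights carry a symmetric Dirichlet prior, as in latent Dirichlet allocation. The names `vb_grid_weights`, `cond_lik`, `alpha`, and `grid` are hypothetical, and a Poisson probability on simulated substitution counts stands in for the phylogenetic likelihood.

```python
import numpy as np
from math import lgamma


def digamma(x):
    """Digamma function via upward recurrence plus an asymptotic series (x > 0)."""
    x = np.asarray(x, dtype=float).copy()
    result = np.zeros_like(x)
    for _ in range(6):  # psi(x) = psi(x + 6) - sum_{k=0}^{5} 1 / (x + k)
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + np.log(x) - 0.5 / x - series


def vb_grid_weights(cond_lik, alpha=0.5, n_iter=500, tol=1e-8):
    """Mean-field VB for mixture weights over a fixed grid (illustrative sketch).

    cond_lik[i, k] is the likelihood of site i at grid point k. It is computed
    once, up front, and never re-evaluated inside the loop.
    """
    n_sites, n_grid = cond_lik.shape
    gamma = np.full(n_grid, alpha + n_sites / n_grid)  # Dirichlet parameters of q(weights)
    for _ in range(n_iter):
        # Expected log-weights under the current Dirichlet approximation.
        log_w = digamma(gamma) - digamma(gamma.sum())
        # Per-site responsibilities over grid points.
        resp = cond_lik * np.exp(log_w)
        resp /= resp.sum(axis=1, keepdims=True)
        new_gamma = alpha + resp.sum(axis=0)
        converged = np.max(np.abs(new_gamma - gamma)) < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, resp


# Toy data (an assumption for illustration): substitution counts per site,
# drawn from a mixture of Poisson rates located on the grid.
rng = np.random.default_rng(0)
grid = np.array([0.1, 0.5, 1.0, 2.0, 5.0])  # hypothetical rate grid
true_w = np.array([0.5, 0.0, 0.3, 0.0, 0.2])
n_sites = 2000
counts = rng.poisson(grid[rng.choice(len(grid), size=n_sites, p=true_w)])

# The only "expensive" step: one likelihood per (site, grid point) pair.
log_fact = np.array([lgamma(c + 1.0) for c in counts])
cond_lik = np.exp(counts[:, None] * np.log(grid)[None, :] - grid[None, :] - log_fact[:, None])

alpha = 0.5
gamma, resp = vb_grid_weights(cond_lik, alpha=alpha)
post_mean = gamma / gamma.sum()  # posterior mean of the grid weights
print(np.round(post_mean, 3))
```

In this sketch the matrix `cond_lik` is filled once, so the iterative updates involve only inexpensive array arithmetic; the grid size, and hence the number of distinct parameter values at which the likelihood is evaluated, is chosen in advance.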