Abstract:
|
Since the completion of the Human Genome Project, substantial effort has been put into identifying and annotating the functional DNA elements of the human genome. Because there is no universal definition of what constitutes function, any genetic variant, whether protein coding or noncoding, now comes with a diverse set of functional annotations. Current machine-learning methods focus primarily on predictive accuracy for a particular functional class and seldom take into account correlations between functional scores. We propose latent annotation class estimation (LACE), an unsupervised learning algorithm that integrates multiple annotations. Our model defines functional status as a vector of binary variables, each meant to capture function as defined by a specific group of annotations, e.g. evolutionary conservation scores or epigenetic scores. Using the EM algorithm, our approach calculates the posterior probability that a genomic position is functional and uses it as a composite score. The model also allows for correlations within and between the groups of annotations. We compare the predictive performance of LACE with existing supervised and unsupervised methods for both coding and noncoding variants in multiple databases.
|
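The idea of using an EM-estimated posterior probability as a composite functional score can be illustrated with a deliberately simplified sketch. This is not the LACE model: it assumes a single binary latent status per position (rather than a vector of binary variables) and treats annotation scores as conditionally independent Gaussians given that status (so it ignores the within-group and between-group correlations the abstract describes). The function names `em_posterior`, `log_normal`, and `sigmoid` are hypothetical and chosen only for illustration.

```python
import math


def log_normal(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def em_posterior(rows, n_iter=100, min_var=1e-6):
    """Fit a two-class latent model by EM (simplifying assumptions, see lead-in).

    rows: list of equal-length lists, one list of annotation scores per position.
    Returns (posteriors, prior), where posteriors[i] estimates
    P(position i is functional | its annotation scores).
    """
    n, d = len(rows), len(rows[0])

    # Crude initialisation (an assumption of this sketch): positions whose
    # total score exceeds the median start out as "probably functional".
    totals = [sum(r) for r in rows]
    median = sorted(totals)[n // 2]
    post = [0.9 if t > median else 0.1 for t in totals]
    prior = 0.5

    for _ in range(n_iter):
        # M-step: re-estimate the class prior and the per-class, per-annotation
        # means and variances, weighting each position by its posterior.
        weights = {1: post, 0: [1.0 - p for p in post]}
        prior = min(max(sum(post) / n, 1e-9), 1.0 - 1e-9)
        means, variances = {}, {}
        for k in (0, 1):
            w = weights[k]
            total_w = max(sum(w), 1e-12)
            means[k] = [
                sum(wi * r[j] for wi, r in zip(w, rows)) / total_w
                for j in range(d)
            ]
            variances[k] = [
                max(
                    sum(wi * (r[j] - means[k][j]) ** 2 for wi, r in zip(w, rows))
                    / total_w,
                    min_var,
                )
                for j in range(d)
            ]

        # E-step: posterior probability of the functional class, computed in
        # log space to avoid underflow when there are many annotations.
        new_post = []
        for r in rows:
            l1 = math.log(prior) + sum(
                log_normal(r[j], means[1][j], variances[1][j]) for j in range(d)
            )
            l0 = math.log(1.0 - prior) + sum(
                log_normal(r[j], means[0][j], variances[0][j]) for j in range(d)
            )
            new_post.append(sigmoid(l1 - l0))
        post = new_post

    return post, prior
```

Calling `em_posterior(rows)` on a matrix of annotation scores returns one posterior per position, which plays the role of the composite score in this toy setting. Because the fit is unsupervised, which class is labelled "functional" depends entirely on the initialisation; here it is the class that starts with higher total scores.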