Abstract:
|
B cells, an essential part of our immune system, develop antibodies through a process of random mutation of their DNA sequences to find variants with improved binding. High-throughput sequencing indicates that these mutations are not distributed uniformly across sequence sites and are consistent with ``mutation motifs'': short DNA subsequences surrounding a site that affect how likely that site is to mutate. Quantifying the effect of motifs on mutation rates is challenging: the problem is high-dimensional when a large number of motifs are considered, and the unobserved history of the mutation process leads to a nontrivial missing-data problem. We introduce an $\ell_1$-penalized proportional hazards model to infer mutation motifs and their effects. To estimate the model parameters, our method uses a Monte Carlo EM algorithm to marginalize over the unknown ordering of mutations. On simulated data, our method outperforms current methods and yields more parsimonious models. It also formalizes the current methods in a statistical framework that can be easily extended to analyze the effect of other biological features on mutation rates.
|