Abstract:
|
Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. An example that we focus on is the detection of short recurring patterns in DNA sequences, called motifs, that represent potential protein binding sites during gene regulatory processes. What makes this problem difficult is that these patterns can vary stochastically. We describe here a novel Bayesian data augmentation strategy for detecting such patterns based on a stochastic "dictionary" model, under which conserved patterns and nucleotides (stochastic words) are assumed to be generated according to probabilistic rules. Our missing data approach also addresses related problems, such as finding patterns of unknown width and patterns with varying degrees of insertions and deletions. However, the flexibility of this model comes with a high degree of computational complexity, which is tackled by means of recursion methods. Bayesian techniques are proposed for evaluating the statistical significance of found motifs, and results are illustrated by means of simulation studies and a real data example.
|