Abstract:
|
Modern high-throughput technologies yield large matrices of counts, and complex modeling is required to account for the many sources of bias in these data. For example, in cancer genomics, whole-exome sequencing data are commonly used to infer somatic copy number alterations across the genome. Accurate inference in this problem requires a model that accounts for locus-specific covariates (such as GC content), sample-specific covariates (such as sample material type), a latent changepoint structure (to handle locally constant copy ratios), and latent factors (to handle unobserved batch effects). We develop a hierarchical generalized bilinear regression model for datasets of this type, together with a fast approximate inference algorithm that scales to datasets with hundreds of thousands of dimensions and hundreds of samples. We demonstrate the method on simulated and real genome sequencing data.
|