Abstract:
|
We develop statistically interpretable counterparts to interestingness measures from the data mining literature, grounded in both Bayesian and frequentist foundations. We then derive scalable estimation procedures for the corresponding statistics, focusing on stable distributions, overdispersion parameters for count data, and heteroskedasticity parameters for continuous data. We also illustrate a method, most interesting subgroup prediction, that connects interestingness measures to decision tree and random forest methods, with particular relevance to risk prediction.
We present applications to traditional market basket analysis, genomic risk prediction, clinical trial subgroup analysis, and geographic variability in COVID-19 incidence.
|