Keywords: data mining, association mining, latent correlation, binary data
A common problem in statistical data mining is to identify groups of variables, often from a much larger pool, that are strongly associated. We introduce Coherent Set Mining (CSM), a new method of association mining in high-dimensional binary data. CSM uses an iterative testing-based method to extract significantly associated variable sets. Our approach relies on a new measure of association, coherence, which captures latent relationships between variables when the data consist of thresholded sample observations. We propose an estimator of coherence based on a null model and corresponding consistent parameter estimators, and derive significance tests for coherence from asymptotic results. We demonstrate the effectiveness of CSM via applications to market basket data, text mining, and music recommendation.