Abstract:
|
The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample. Such an approach has been used pervasively in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome the limitation, we develop a novel element-wise subset selection approach, called core-elements. We provide a deterministic algorithm to construct the core-elements-based estimator, requiring a computational cost only at the order of O(nnz(X)+p^3), where X is the predictor matrix with p covariates and nnz denotes the number of non-zero elements. Theoretically, we show the proposed estimator is unbiased, and it approximately minimizes an upper bound of the estimation variance. We also provide a coresets-like finite sample bound for the proposed estimator with approximation guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
|