Keywords: Optimal subsampling, Randomized algorithm, Massive data, Generalized linear models
For massive data, where the full data size n is much larger than the number of covariates p, subsampling is a popular technique for alleviating the computational burden by reducing the data size. One subsampling approach specifies a sampling probability for each data point in order to obtain an informative subsample, and then computes estimates from the subsample to approximate the full-data estimates. The subsampling probabilities can be defined through the A-optimality criterion, which minimizes the asymptotic mean squared error of the estimator from a general subsample. However, exact calculation of the probabilities assigned under the A-optimality criterion still demands substantial computing time. In this paper, we first review the A-optimal subsampling probabilities for generalized linear models (GLMs) with a known dispersion parameter, and then derive the A-optimal subsampling probabilities for the Gaussian linear model with an unknown dispersion parameter, after establishing the asymptotic properties of the subsample estimator in that model. We also propose FASA algorithms that approximate the A-optimal subsampling probabilities to alleviate the computational burden. These algorithms employ the Johnson-Lindenstrauss transform and the subsampled randomized Hadamard transform, two methods for reducing matrix dimension, to achieve efficient computation. Simulation studies indicate that estimators based on the proposed algorithms perform well for statistical inference and yield substantial savings in computing time compared with exact calculation of the A-optimal subsampling probabilities.
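As a rough illustration of the matrix-sketching ingredient mentioned above, the following is a minimal Python sketch of a subsampled randomized Hadamard transform (SRHT) applied to an n x p design matrix. It is an assumption-laden sketch, not the paper's FASA implementation: the function name `srht`, the zero-padding to a power of two, and the uniform row sampling are illustrative choices.

```python
import numpy as np

def srht(X, r, rng=None):
    """Illustrative SRHT sketch of an (n x p) matrix X, returning r x p.

    Pads n up to a power of two m, applies random sign flips D and the
    orthonormal Walsh-Hadamard transform H, then keeps r uniformly
    sampled rows rescaled by sqrt(m / r) so row norms are preserved
    in expectation. Not the paper's FASA algorithm, just the standard
    SRHT construction.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = 1 << (n - 1).bit_length()           # next power of two >= n
    Xp = np.vstack([X, np.zeros((m - n, p))])
    D = rng.choice([-1.0, 1.0], size=m)     # random sign flips
    Y = D[:, None] * Xp
    # In-place fast Walsh-Hadamard transform, O(m log m) per column
    h = 1
    while h < m:
        for i in range(0, m, 2 * h):
            a = Y[i:i + h].copy()
            b = Y[i + h:i + 2 * h].copy()
            Y[i:i + h] = a + b
            Y[i + h:i + 2 * h] = a - b
        h *= 2
    Y /= np.sqrt(m)                          # make the transform orthonormal
    rows = rng.choice(m, size=r, replace=False)
    return np.sqrt(m / r) * Y[rows]
```

Because the Hadamard transform spreads the energy of each column across all rows, a small uniform row sample of the transformed matrix approximately preserves quantities such as column norms, which is what makes sketches of this kind useful for cheaply approximating the leverage-type quantities entering A-optimal subsampling probabilities.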