Abstract:
|
Variable selection is an important problem in data analysis, especially in massive data analysis. Many variable selection methods, such as Lasso and SCAD, have been developed for high-dimensional data. However, their high computational cost makes these methods impractical for massive data. This manuscript introduces a framework that makes them practical for variable selection in massive data analysis via a divide-and-conquer approach. The approach randomly divides the massive data into subgroups, to each of which an existing variable selection method can be applied efficiently, and then finds the set of informative variables that is selected most frequently across the subgroups. The proposed method significantly reduces the computational cost when applied to massive data. Asymptotic properties, including selection consistency and the convergence rate, are derived. The performance of the proposed method is evaluated via simulation studies and real-data applications.
|
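The divide-and-conquer procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the manuscript's implementation: the per-subgroup selector here is a simple marginal-correlation screen used as a stand-in for Lasso or SCAD, and the names `marginal_screen` and `divide_and_conquer_select`, the threshold, the number of subgroups, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def marginal_screen(X, y, threshold=0.3):
    """Stand-in selector (NOT Lasso/SCAD): keep the columns whose absolute
    Pearson correlation with y exceeds `threshold` (an assumed cutoff)."""
    n, p = len(y), len(X[0])
    y_mean = sum(y) / n
    y_ss = sum((v - y_mean) ** 2 for v in y)
    selected = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_ss = sum((v - c_mean) ** 2 for v in col)
        cov = sum((a - c_mean) * (b - y_mean) for a, b in zip(col, y))
        denom = (c_ss * y_ss) ** 0.5
        if denom > 0 and abs(cov / denom) > threshold:
            selected.append(j)
    return tuple(selected)  # hashable, so it can be counted as a "set"


def divide_and_conquer_select(X, y, n_groups, selector, seed=0):
    """Randomly split the rows into `n_groups` subgroups, run `selector`
    on each one, and return the most frequently selected variable set."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)  # random assignment of rows to subgroups
    groups = [idx[g::n_groups] for g in range(n_groups)]

    votes = Counter()
    for rows in groups:
        X_sub = [X[i] for i in rows]
        y_sub = [y[i] for i in rows]
        votes[selector(X_sub, y_sub)] += 1

    # Ties are broken by first occurrence; this is an assumption of the sketch.
    best_set, count = votes.most_common(1)[0]
    return best_set, count, votes


# Toy data (assumed): only variables 0 and 1 are informative.
data_rng = random.Random(42)
n, p = 5000, 8
X = [[data_rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 3.0 * row[1] + data_rng.gauss(0, 1) for row in X]

best_set, count, votes = divide_and_conquer_select(X, y, 10, marginal_screen)
print("most frequent set:", best_set, "chosen by", count, "of 10 subgroups")
```

Because the selector is passed in as a function, a penalized-regression selector could be substituted without changing the splitting or voting steps. Each subgroup is processed independently, so the loop could also be run in parallel.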