Abstract:
|
Sufficient dimension reduction (SDR) reduces the dimensionality of the data without specifying a regression model while retaining the information about the response, and is therefore called "sufficient" for regression analysis. Most SDR approaches, such as Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), work well with continuous responses but not with binary responses, because a binary response allows only two slices. In this article, we develop a novel SDR approach, called the representative approach, to deal with binary responses. By converting each block of data points into a single representative data point, the corresponding binary responses become continuous and the size of the data is reduced significantly. The proposed representative approach therefore provides an ideal solution for big data dimension reduction and can be combined naturally with the classical SDR approaches. Through both theoretical justification and simulation studies, we show that the proposed approach recovers the central subspace better than the original SDR methods. To make the approach applicable to massive datasets, we develop a streaming algorithm for big data dimension reduction and apply it to a large real dataset, the Airline on-time performance data.
|