Abstract:
|
In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses and where the number of covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the PUlasso algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm involves using the majorization-minimization (MM) framework which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of slogp/n, where s is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm out-performs state-of-the-art algorithms in the moderate p settings in terms of classification performance and performs well on a real biochemistry example.
|