Abstract:
|
Suppose we have a response Y and 1,000 covariates Xi, and Y is a function of the first three covariates plus noise; Y and Xi need not even be Euclidean. How can we identify exactly the three "true" predictive variables from 200 iid observations? A variable selection algorithm, which we call KFOCI, can do the job without requiring the number of variables to select to be specified in advance, achieves performance much superior to that of its predecessors, and is provably consistent under sparsity assumptions. KFOCI is an application of the kernel partial correlation coefficient (KPC) we propose, a number between 0 and 1 measuring the strength of conditional dependence: the KPC between two random variables Y and Z given a third variable X is 0 if and only if Y is conditionally independent of Z given X, and 1 if and only if Y is a measurable function of Z and X. Given the predictors that have already been selected, KFOCI selects the next predictor Xi that maximizes the sample KPC between Y and Xi given the selected predictors, and stops when every such sample KPC is negative. Both KPC and KFOCI are readily available through our R package KPC on CRAN.
|