Keywords: fast k nearest neighbor graph construction; text analysis
The performance of several existing fast k nearest neighbor graph construction approaches is investigated using a large scientific publication corpus (on the order of tens of millions of publications). Particular attention is given to the domain-specific case of cosine similarity between document vectors (i.e., sparse, high-dimensional, non-negative vectors). Both exact and approximate methods are included, and performance is reported with respect to running time and k nearest neighbor recall. Additional analysis of the methods is reported with respect to corpus size, choice of k, and dimensionality (i.e., feature selection).
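
To make the two evaluation quantities concrete, the following is a minimal sketch of an exact (brute-force) k nearest neighbor graph under cosine similarity, together with a recall measure for comparing an approximate graph against it. It is an illustration only, not one of the methods evaluated here. The function names (`normalize`, `exact_knn_graph`, `knn_recall`), the dictionary representation of sparse vectors, and the toy documents are all assumptions made for this example; recall is taken to be the fraction of exact neighbors that the approximate graph recovers.

```python
import heapq
import math


def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def exact_knn_graph(docs, k):
    """Exact k nearest neighbor lists under cosine similarity.

    Assumes non-negative weights, so documents sharing no term have
    similarity 0 and are never returned as neighbors. A list may
    therefore hold fewer than k entries.
    """
    unit = [normalize(d) for d in docs]
    # Inverted index: term -> list of (document id, normalized weight).
    index = {}
    for i, vec in enumerate(unit):
        for t, w in vec.items():
            index.setdefault(t, []).append((i, w))
    graph = []
    for i, vec in enumerate(unit):
        scores = {}
        # Accumulate dot products only over documents sharing a term.
        for t, w in vec.items():
            for j, wj in index[t]:
                if j != i:
                    scores[j] = scores.get(j, 0.0) + w * wj
        # Highest similarity first; ties go to the lower document id.
        top = heapq.nlargest(k, scores.items(), key=lambda p: (p[1], -p[0]))
        graph.append([j for j, _ in top])
    return graph


def knn_recall(exact, approx):
    """Fraction of exact neighbors that also appear in the approximate graph."""
    hits = sum(len(set(e) & set(a)) for e, a in zip(exact, approx))
    total = sum(len(e) for e in exact)
    return hits / total if total else 1.0


# Hypothetical toy corpus: four documents as term -> weight dictionaries.
docs = [
    {"graph": 1.0, "neighbor": 1.0},
    {"graph": 1.0, "neighbor": 1.0, "search": 1.0},
    {"protein": 1.0, "fold": 1.0},
    {"protein": 2.0, "fold": 1.0, "graph": 1.0},
]
exact = exact_knn_graph(docs, k=1)
```

The inverted index restricts each comparison to documents that share at least one term, which is what makes sparsity exploitable, but the sketch remains quadratic in the worst case and is only practical for small corpora.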