Abstract:
|
The support vector machine (SVM) is a powerful supervised learning method for classification. However, training an SVM may be computationally infeasible for large datasets. In such cases, methods for instance selection (IS), in which a subset of representative units is selected for training, may be used. We propose the use of threshold clustering (TC), a recently developed and efficient clustering method, for IS when training an SVM. Given a fixed size threshold t, TC forms clusters of t or more units while ensuring that the maximum within-cluster dissimilarity is small. Unlike most traditional clustering methods, TC is designed to form many small clusters of units, making it ideal for IS. Our proposed method begins by performing TC on each class in the training set. The centroids of all clusters are then computed, and together they form a reduced training set. TC may be repeated if the data reduction after this first step is insufficient. We show, via simulation and application to datasets, that TC efficiently reduces the size of training sets without sacrificing the prediction accuracy of SVM. Moreover, it often outperforms competing IS methods in terms of both runtime and prediction accuracy.
|
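The reduction pipeline described in the abstract (cluster each class, replace each cluster by its centroid, optionally repeat) can be illustrated with a minimal sketch. This is not the TC algorithm itself: the function `greedy_min_size_clusters` is a hypothetical, simplified greedy stand-in that only guarantees clusters of at least t units, and it carries none of TC's guarantees on within-cluster dissimilarity or runtime. The names `reduce_training_set`, `t`, and `rounds` are likewise assumptions made for illustration.

```python
import numpy as np


def greedy_min_size_clusters(X, t):
    """Hypothetical stand-in for TC: label rows so each cluster has >= t rows.

    Repeatedly takes the first unassigned unit as a seed and groups it with
    its nearest unassigned neighbours until the group has t units.
    """
    n = len(X)
    labels = np.full(n, -1)
    if n < t:
        labels[:] = 0  # too few units to split: keep them as one cluster
        return labels
    seeds = []
    unassigned = np.arange(n)
    while len(unassigned) >= t:
        seed = unassigned[0]
        dist = np.linalg.norm(X[unassigned] - X[seed], axis=1)
        members = unassigned[np.argsort(dist, kind="stable")[:t]]
        labels[members] = len(seeds)
        seeds.append(seed)
        unassigned = unassigned[labels[unassigned] == -1]
    # Fewer than t units remain: attach each to the cluster with the nearest seed.
    for i in unassigned:
        labels[i] = int(np.argmin(np.linalg.norm(X[seeds] - X[i], axis=1)))
    return labels


def reduce_training_set(X, y, t=2, rounds=1):
    """Cluster each class separately and keep one centroid per cluster."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    for _ in range(rounds):  # repeat if one pass does not reduce the data enough
        new_X, new_y = [], []
        for c in np.unique(y):
            Xc = X[y == c]
            labels = greedy_min_size_clusters(Xc, t)
            for k in np.unique(labels):
                new_X.append(Xc[labels == k].mean(axis=0))  # cluster centroid
                new_y.append(c)  # centroid inherits the class label
        X, y = np.array(new_X), np.array(new_y)
    return X, y
```

Each pass shrinks every class by a factor of roughly t, so `rounds` passes give a reduction of about t raised to the power `rounds`. The reduced pair `(X, y)` can then be passed to any SVM trainer in place of the full training set.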