Abstract:
|
Training sets for classification problems often contain many missing values. A widely used remedy is to impute each missing value from the k nearest neighbors (kNN) of the incomplete observation. However, most previous studies do not take the class label into account when imputing data for a classification problem. In addition, existing kNN imputation methods measure distance with the Minkowski distance or one of its variants, which does not work well on heterogeneous data. In this paper, we propose a novel iterative kNN imputation technique based on a class-weighted gray distance between the incomplete datum and all the training data. Gray distance works well on heterogeneous data with missing values. The distance is weighted by Mutual Information (MI), a measure of the relevance of each discrete or continuous feature to the class label, so that the imputed dataset is better directed towards improving classification performance. The proposed class-weighted gray distance based kNN imputation algorithm is compared with traditional kNN imputation algorithms, as well as with MICE and missForest, on UCI datasets.
|
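To make the idea concrete, the following is a minimal single-pass sketch of kNN imputation under a feature-weighted gray relational similarity. It is not the algorithm proposed in the paper: the function name `grey_knn_impute` is hypothetical, the per-feature `weights` are assumed to be MI scores computed beforehand, the gray relational coefficient is assumed to take the common form with distinguishing coefficient `rho = 0.5`, numeric features are assumed to be scaled to [0, 1], and the iterative refinement described above is omitted.

```python
from collections import Counter


def grey_knn_impute(rows, target_row, target_col, weights,
                    k=3, rho=0.5, categorical=()):
    """Impute rows[target_row][target_col] from the k most gray-similar rows.

    Missing entries are None. `weights` holds one relevance score per
    feature (assumed here to be precomputed mutual information values).
    """
    query = rows[target_row]
    # Donors are the other rows in which the target feature is observed.
    donors = [r for i, r in enumerate(rows)
              if i != target_row and r[target_col] is not None]
    if not donors:
        raise ValueError("no row has the target feature observed")
    # Compare only on features observed in the query row.
    feats = [p for p, v in enumerate(query)
             if p != target_col and v is not None]

    def diff(a, b, p):
        # Categorical features use a 0/1 mismatch; numeric ones |a - b|.
        if p in categorical:
            return 0.0 if a == b else 1.0
        return abs(a - b)

    # Per-donor differences; None where the donor lacks the feature.
    deltas = [[diff(query[p], r[p], p) if r[p] is not None else None
               for p in feats] for r in donors]
    observed = [d for row in deltas for d in row if d is not None]
    if not observed:
        raise ValueError("no comparable features between query and donors")
    d_min, d_max = min(observed), max(observed)

    def grade(row):
        # Weighted mean of gray relational coefficients over shared features.
        num = den = 0.0
        for p, d in zip(feats, row):
            if d is None:
                continue
            coeff = 1.0 if d_max == 0 else (d_min + rho * d_max) / (d + rho * d_max)
            num += weights[p] * coeff
            den += weights[p]
        return num / den if den else 0.0

    grades = [grade(row) for row in deltas]
    # A larger grade means a closer neighbor; the sort is stable for ties.
    nearest = sorted(range(len(donors)), key=lambda i: -grades[i])[:k]
    values = [donors[i][target_col] for i in nearest]
    if target_col in categorical:
        return Counter(values).most_common(1)[0][0]  # majority vote
    return sum(values) / len(values)                 # mean of neighbors


# Toy data: the third feature of the first row is missing.
rows = [
    [0.10, 0.20, None],
    [0.10, 0.25, 1.0],
    [0.15, 0.20, 3.0],
    [0.90, 0.80, 10.0],
]
print(grey_knn_impute(rows, 0, 2, weights=[0.6, 0.4, 0.0], k=2))  # 2.0
```

In this toy example the two rows closest to the incomplete one have values 1.0 and 3.0, so the imputed value is their mean. Because a feature with a larger weight contributes more to the similarity grade, features that are more relevant to the class label have more influence on which neighbors are selected.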