Abstract:
|
Data discretization is a common pre-processing step for many statistical, machine learning and data mining methods. The greatest challenge in discretizing (binning) a dataset is to identify the optimal bin size: one that preserves the original distribution of the data while remaining reasonably large. Doing this manually is daunting and error-prone. Published research focuses only on preserving the data distribution; as a result, the bin widths determined by these methods are often very small and therefore of limited use to subsequent methods. The work presented here addresses this issue with a discretization method, based on optimizing a cost function, that preserves the data distribution while simultaneously ensuring a reasonably large bin width, allowing a meaningful discretization of the data. The method thus optimizes two competing factors simultaneously: preservation of the data distribution and bin width. The proposed method has been successfully tested on data drawn from a wide range of distributions and compared with state-of-the-art methods.
|