All Times EDT

Abstract Details

Activity Number: 246 - Data Science
Type: Contributed
Date/Time: Wednesday, August 11, 2021 : 10:00 AM to 11:50 AM
Sponsor: Section on Statistical Computing
Abstract #318448
Title: Optimized Data Discretization
Author(s): Rita Chattopadhyay*
Companies: Intel Corp
Keywords: Statistics; Discretization; data distribution; Cost function; bin width; preservation
Abstract:

Data discretization is a common pre-processing step for many statistical, machine learning, and data mining methods. The greatest challenge in discretizing (binning) a dataset is to preserve the original distribution of the data while keeping the bin width reasonably large; in other words, to identify the optimal bin width for the data. Doing this manually is daunting and error-prone. Published research focuses only on preserving the data distribution; as a result, the bin widths determined by these methods are often very small and therefore of little use to subsequent methods. The work presented here addresses this issue with a discretization method, based on optimizing a cost function, that preserves the data distribution while ensuring a reasonably large bin width, allowing a meaningful discretization of the data. The method thus optimizes two competing factors simultaneously: preservation of the data distribution and bin width. The proposed method has been successfully tested on data drawn from a wide range of distributions and compared with state-of-the-art methods.
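
The abstract does not state the form of the cost function, so the following is only an illustrative sketch of the general idea and not the author's method. It assumes equal-width bins, measures loss of distributional detail as the mean squared distance from each value to its bin center (normalized by the variance), and adds a penalty for narrow bins. The function names, the `width_weight` parameter, and both cost terms are hypothetical choices made for this example.

```python
def discretization_cost(values, n_bins, width_weight=0.2):
    """Hypothetical cost: distortion term plus a penalty for narrow bins."""
    lo, hi = min(values), max(values)
    span = hi - lo
    width = span / n_bins
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)

    # Distortion: how far values sit from the center of their bin.
    # It shrinks as bins get narrower (assumed proxy for "preservation").
    squared_error = 0.0
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum value
        center = lo + (idx + 0.5) * width
        squared_error += (v - center) ** 2
    distortion = squared_error / (len(values) * variance)

    # Narrowness: 0 for a single bin, approaching 1 as bins get very narrow.
    narrowness = 1.0 - width / span
    return distortion + width_weight * narrowness


def choose_bin_count(values, max_bins=50, width_weight=0.2):
    """Return the bin count in 1..max_bins with the lowest hypothetical cost."""
    if min(values) == max(values):
        return 1  # constant data: a single bin is the only sensible choice
    return min(
        range(1, max_bins + 1),
        key=lambda k: discretization_cost(values, k, width_weight),
    )


if __name__ == "__main__":
    data = [i / 999 for i in range(1000)]  # evenly spread toy data
    for weight in (0.05, 0.2, 0.5):
        print(weight, choose_bin_count(data, width_weight=weight))
```

Raising `width_weight` in this sketch pushes the choice toward fewer, wider bins, while lowering it favors fidelity to the data; this mirrors the trade-off between the two competing factors described above.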


Authors who are presenting talks have a * after their name.

Back to the full JSM 2021 program