509 – Balanced and Unbalanced Data
The Art of Balancing
Zhen Zhang
C Spire Wireless
Justin Croft
C Spire Wireless
Kendell Churchwell
C Spire Wireless
Data mining is widely used in strategic marketing to uncover actionable information for a broad range of critical marketing decisions. A common problem in data mining applications is that the data is often skewed, and skewed data often leads to degenerate models that assign most or all cases to the most common outcome. For modeling projects with extremely skewed targets, such as churn prediction or fraud detection, balancing the data before modeling is a crucial step toward a useful model. Because modeling cases are sensitive to both domain and algorithm, there is no one-size-fits-all balancing strategy. In this paper we present empirical guidelines on balancing strategies for extremely skewed data with a binary outcome. We suggest best practices for decision trees, logistic regression, and neural network models.
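To make the idea of balancing concrete, the following is a minimal sketch of random undersampling of the majority class, one common balancing technique. It is an illustration only and is not taken from the paper: the function name `undersample_majority`, its parameters, and the synthetic churn data are all hypothetical, and the 1:1 target ratio is an assumption rather than a recommendation.

```python
import random
from collections import Counter


def undersample_majority(rows, target_index, majority_ratio=1.0, seed=0):
    """Randomly drop majority-class rows from a binary-target data set.

    Hypothetical helper for illustration. After sampling, the majority
    class has at most `majority_ratio` times as many rows as the
    minority class. The target must be coded as 0 or 1.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[target_index] == 1]
    negatives = [r for r in rows if r[target_index] == 0]

    # Identify which class is rarer; the shorter list sorts first.
    minority, majority = sorted([positives, negatives], key=len)

    # Keep every minority row and a random subset of the majority rows.
    keep = min(len(majority), int(round(majority_ratio * len(minority))))
    balanced = minority + rng.sample(majority, keep)
    rng.shuffle(balanced)
    return balanced


# Synthetic example (assumed data): 1,000 customers, 2% churners.
rows = [(customer_id, 1 if customer_id < 20 else 0) for customer_id in range(1000)]
balanced = undersample_majority(rows, target_index=1)
counts = Counter(label for _, label in balanced)
print(counts)  # each class now has 20 rows
```

Raising `majority_ratio` above 1.0 keeps more majority rows, which gives a milder rebalancing at the cost of a still-skewed target. Which ratio works best depends on the domain and the algorithm.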