Keywords: Infrastructure data science, machine learning, model interpretability
Cloud computing is ubiquitous and transformative: businesses can use cloud infrastructure and applications with ease. For Salesforce as a cloud provider, site reliability is a paramount priority because it builds trust with our customers. A proactive approach to incident management helps identify trends, pinpoint potential problem areas, and allow corrective action before incidents occur. An interpretable machine learning model for incident prediction can greatly improve the efficiency and productivity of site reliability and performance engineering teams, enabling them to act proactively on cloud infrastructure operations at scale. However, incidents are rare events, and predicting them is challenging in complex and evolving infrastructure systems. Model interpretability is also crucial for understanding and triaging root causes in a business context. The Salesforce Infrastructure Analytics team has developed a robust and interpretable machine learning model for proactive incident management. We will walk through a practical data science workflow, from data collection and exploratory analysis through modeling to production, and share lessons learned from this use case. We will discuss how we applied techniques such as visualization, resampling, active learning, and SHAP values to address the challenges of noisy data, incorporating domain knowledge into feature engineering, imbalanced classification, and model interpretation.
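As a rough illustration of the resampling idea mentioned above, the sketch below randomly oversamples the minority class until all classes are the same size. The abstract does not say which resampling scheme the team used, so this is only one common option; the function name `oversample_minority` and the toy data are hypothetical and not taken from the talk.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest class (random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All rows that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap to the majority size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: label 1 = incident (rare), 0 = no incident.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

In practice, resampling of this kind is applied to the training split only, so that evaluation still reflects the true rarity of incidents.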