Keywords: random forest, malware, classifier, cyber-security
Colleagues at the Naval Surface Warfare Center applied a random forest classifier to the problem of classifying software as benign or malicious using (normalized) byte-count histograms as features. Though naïve, the approach proved phenomenally successful on the datasets on hand. We will discuss the original investigators' motivation for choosing these features and this classifier, and present the results they obtained. Knowing that the performance is “too good to be true,” we will then discuss our efforts to understand how the classifier assigns class labels, and our evidence that this level of classification performance will almost surely fail to generalize. Finally, we hope to address the issues encountered in presenting these somewhat subtle technical arguments to a (largely non-technical) audience of managers.
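For readers unfamiliar with the feature, the following is a minimal sketch of one way a normalized byte-count histogram could be computed. It assumes that "normalized" means dividing each of the 256 byte counts by the file length; the abstract does not specify the normalization, and the function name `normalized_byte_histogram` is hypothetical, not taken from the original investigators' code.

```python
from collections import Counter


def normalized_byte_histogram(data: bytes) -> list[float]:
    """Return 256 relative frequencies, one per byte value 0..255.

    Assumption: normalization divides each count by the total number
    of bytes, so the entries of a non-empty input sum to 1.
    """
    if not data:
        # An empty file has no bytes to count; return an all-zero vector.
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


# Tiny illustrative input: two 0x00 bytes, one 0x01, one 0xff.
sample = b"\x00\x00\x01\xff"
hist = normalized_byte_histogram(sample)
print(hist[0], hist[1], hist[255])
```

Each file is thus reduced to a fixed-length vector of 256 numbers regardless of its size, which is what allows it to be fed to a standard classifier such as a random forest. The vector discards all information about byte order.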