Online Program

Saturday, May 19
Data Science
Data Science and Machine Learning in Naval Applications
Sat, May 19, 8:30 AM - 10:00 AM
Grand Ballroom G
 

Using Found Data – A Cautionary Tale (304364)

Presentation

*David A. Johannsen, Naval Surface Warfare Center - Dahlgren 
David Marchette, Naval Surface Warfare Center 

Keywords: random forest, malware, classifier, cyber-security

Colleagues at the Naval Surface Warfare Center applied a random forest classifier to normalized byte-count histograms in order to classify software as benign or malicious. Though naïve, the approach proved phenomenally successful on the datasets on hand. We will discuss the original investigators' motivation for choosing these features and this classifier, and present the results obtained. Knowing that the performance is “too good to be true,” we will then discuss our efforts to understand how the classifier assigns class labels, along with our evidence that this classification success will almost surely fail to generalize. Finally, we hope to address the issues encountered in presenting these somewhat subtle technical arguments to a largely non-technical audience of managers.
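
For readers unfamiliar with the feature mentioned in the abstract, the following is a minimal sketch of how a normalized byte-count histogram might be computed for a single file. It is an illustration only and not the investigators' code: the function name byte_histogram, the 256-bin layout, and the choice to normalize by file length are all assumptions.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return a 256-bin histogram of byte values, normalized to sum to 1.

    Hypothetical helper: the abstract does not specify the normalization,
    so dividing by the total byte count is an assumption.
    """
    if not data:
        # An empty file has no bytes to count; return an all-zero vector.
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


# Toy input standing in for the contents of a binary file.
sample = b"\x00\x00\x01\xff"
features = byte_histogram(sample)
```

Under these assumptions each file becomes a fixed-length vector of 256 numbers, regardless of file size. Vectors of this kind could then be stacked into a matrix and passed to any random forest implementation, for example scikit-learn's RandomForestClassifier.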