Conference Program

All Times ET

Friday, June 10
Practice and Applications
New Models, Methods, and Applications II, Part 2
Fri, Jun 10, 10:30 AM - 11:25 AM
Allegheny I

Random Forest Is a Robust Model Choice on Feature Transformed Data for Binary Classification Task (310248)

*Emma Minasyan, Mimecast 

Keywords: random forest, naïve Bayes, PCA, MCA

Random Forest is shown to be a more robust modeling algorithm than Naïve Bayes when trained on structured datasets of varying data types and transformations to solve a binary classification problem. A URL dataset of malicious and benign websites was used, containing path length, symbol counts, number of characters, and other structural features. Feature transformation was performed using Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA). To establish an initial baseline, Naïve Bayes and Random Forest models with default hyperparameters were trained on the training portion of the PCA-transformed dataset and evaluated through k-fold cross-validation. Hyperparameter optimization was then performed to obtain the best models for Random Forest with PCA data, Naïve Bayes with PCA data, and Naïve Bayes with PCA and MCA data. A 5-by-2 paired cross-validation test was used to statistically compare the accuracy of Random Forest with PCA data versus Naïve Bayes with PCA data, providing an objective performance estimate for classifying benign and malicious URLs. Naïve Bayes models using PCA data and PCA with MCA data were also compared, and the Naïve Bayes model was found to suffer in performance with the inclusion of MCA data. The evaluation results indicate that Random Forest achieves higher precision and accuracy and is far less sensitive to data manipulation than Naïve Bayes.
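
The sketch below illustrates the general workflow the abstract describes (PCA transformation, baseline models with default hyperparameters under k-fold cross-validation, hyperparameter search, and a 5-by-2 paired cross-validation comparison), assuming a scikit-learn setting. The `load_url_features()` helper, the feature set, the hyperparameter grid, and the choice of Gaussian Naïve Bayes are illustrative assumptions, not details from the paper; the MCA step is omitted since it is not part of scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from mlxtend.evaluate import paired_ttest_5x2cv

# Hypothetical loader: X holds numeric URL features (path length,
# symbol counts, character counts, ...), y marks benign (0) / malicious (1).
X, y = load_url_features()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baselines with default hyperparameters on PCA-transformed features,
# evaluated with k-fold cross-validation. GaussianNB is an assumed
# Naïve Bayes variant for continuous PCA components.
rf_pca = Pipeline([("pca", PCA(n_components=0.95)),
                   ("clf", RandomForestClassifier(random_state=0))])
nb_pca = Pipeline([("pca", PCA(n_components=0.95)),
                   ("clf", GaussianNB())])
for name, model in [("RF+PCA", rf_pca), ("NB+PCA", nb_pca)]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Hyperparameter optimization for the Random Forest pipeline
# (the grid shown here is illustrative, not the one used in the study).
grid = GridSearchCV(
    rf_pca,
    param_grid={"clf__n_estimators": [100, 300],
                "clf__max_depth": [None, 10, 30]},
    cv=5, scoring="accuracy",
)
grid.fit(X_train, y_train)

# 5-by-2 paired cross-validation t-test comparing the two classifiers.
t_stat, p_value = paired_ttest_5x2cv(
    estimator1=grid.best_estimator_, estimator2=nb_pca,
    X=X_train, y=y_train, scoring="accuracy", random_seed=0
)
print(f"5x2cv paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
```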