Conference Program

All Times ET

Thursday, June 9
Practice and Applications
Applying and Evaluating Logistic Regression Models
Thu, Jun 9, 1:15 PM - 2:45 PM
Allegheny Grand Ballroom
 

Searching the Web for the Drone Industry: Classifying Websites in Multiple Countries and Languages with a Single Model (310081)

*Piet J.H. Daas, Statistics Netherlands 
Blanca de Miguel Moline, Universitat Politècnica de València 
Maria de Miguel Moline, Universitat Politècnica de València 

Keywords: Classification model, texts, languages

This paper describes the development of a model that identifies drone company websites across multiple European countries and languages. Three classification approaches were investigated: drone-word-based classification, Positive-Unlabeled learning, and supervised machine learning. Supervised logistic regression (with an L2-norm penalty) performed best, with a test set accuracy of 88%. The model was trained on Spanish websites translated into English. It was then applied to unseen websites for Spain, Ireland and Italy, after translating the 1600 words included in the model where needed. Manual inspection of random samples by experts showed that the results were between 84% and 86% accurate. These findings were additionally confirmed with lists of drone websites for Spain and Italy provided by experts.
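
To make the approach in the abstract concrete, the following is a minimal sketch of an L2-penalized logistic regression text classifier whose word list is translated so the same weights can score pages in another language. It is not the authors' implementation. The toy texts, the word-presence features, the hyperparameters (`l2`, `lr`, `epochs`), the `EN_TO_ES` mapping, and the reading of "translation of the words included in the model" as a word-for-word vocabulary mapping are all assumptions made for illustration only.

```python
import math
import re

def tokenize(text):
    # Lowercase and keep plain alphabetic tokens (assumption: ASCII text only).
    return re.findall(r"[a-z]+", text.lower())

def featurize(text, vocab):
    # Binary word-presence features, one per model word.
    present = set(tokenize(text))
    return [1.0 if word in present else 0.0 for word in vocab]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(X, y, l2=0.01, lr=0.5, epochs=500):
    # Full-batch gradient descent on mean log-loss plus (l2 / 2) * ||w||^2.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The L2 penalty shrinks the weights; the bias is left unpenalized.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(text, vocab, w, b):
    # Estimated probability that the text belongs to a drone company website.
    x = featurize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Invented toy training data (1 = drone company site, 0 = other site).
texts = [
    "drone aerial photography services",
    "uav drone pilot training",
    "drone inspection and aerial mapping",
    "fresh bread and pastry bakery",
    "family law firm legal services",
    "bakery cafe coffee and bread",
]
labels = [1, 1, 1, 0, 0, 0]

vocab = sorted({tok for t in texts for tok in tokenize(t)})
X = [featurize(t, vocab) for t in texts]
w, b = train(X, labels)

# Hypothetical English-to-Spanish mapping for some model words. Translating
# the vocabulary keeps each weight attached to the same concept.
EN_TO_ES = {
    "drone": "dron",
    "aerial": "aerea",
    "photography": "fotografia",
    "services": "servicios",
}
es_vocab = [EN_TO_ES.get(word, word) for word in vocab]

print(predict("aerial drone mapping company", vocab, w, b))
print(predict("servicios de dron y fotografia aerea", es_vocab, w, b))
```

Because only the word list changes between languages, the weights are trained once and reused. The accuracy of this toy example says nothing about the figures reported in the abstract.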