Keywords: Classification model, texts, languages
This paper describes the development of a model that identifies drone company websites in multiple European countries and languages. Three classification approaches were investigated: drone-word-based classification, positive-unlabeled (PU) learning, and supervised machine learning. Supervised logistic regression with an L2-norm penalty performed best, with a test-set accuracy of 88%. The model was trained on Spanish websites translated into English. It was then applied to unseen websites from Spain, Ireland and Italy after translating, where needed, the 1600 words included in the model. Manual inspection of random samples by experts showed that the results were between 84% and 86% accurate. These findings were further confirmed against lists of drone websites for Spain and Italy provided by experts.
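As a rough illustration of the idea, and not the authors' implementation, the sketch below trains an L2-penalized logistic regression on a binary bag-of-words representation, then reuses the learned weights on text in another language by translating the model's vocabulary rather than the websites. The toy documents, the `en_to_it` dictionary, the hyperparameters and all function names are assumptions made for this example. The real model uses about 1600 words and actual website text.

```python
import math

def tokenize(text):
    return text.lower().split()

def featurize(text, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    tokens = set(tokenize(text))
    return [1.0 if word in tokens else 0.0 for word in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg_l2(X, y, lam=0.01, lr=0.3, epochs=500):
    """Batch gradient descent on mean log-loss + (lam / 2) * ||w||^2.
    The bias term is left unpenalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(text, vocab, w, b):
    """True if the text is classified as a drone company website."""
    x = featurize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5

# Toy training data (hypothetical), standing in for translated website text.
train_docs = [
    ("we sell drone and uav aerial photography services", 1),
    ("professional drone pilot training and aerial inspection", 1),
    ("uav mapping and drone survey company", 1),
    ("family restaurant with fresh pasta and pizza", 0),
    ("law firm offering tax and legal advice", 0),
    ("hotel rooms near the beach with restaurant", 0),
]

vocab = sorted({word for text, _ in train_docs for word in tokenize(text)})
X = [featurize(text, vocab) for text, _ in train_docs]
y = [label for _, label in train_docs]
w, b = train_logreg_l2(X, y)

# Hypothetical English-to-Italian dictionary for a few model words.
# Words without an entry are kept unchanged.
en_to_it = {"aerial": "aeree", "pilot": "pilota", "restaurant": "ristorante"}
vocab_it = [en_to_it.get(word, word) for word in vocab]

# Same weights, translated vocabulary: position j still means the same concept.
is_drone_site = predict("riprese aeree con drone e pilota certificato", vocab_it, w, b)
is_other_site = predict("ristorante e hotel sul mare", vocab_it, w, b)
```

Because the weight vector is indexed by vocabulary position, translating only the vocabulary keeps the trained weights valid while avoiding the cost of translating every website. This works only to the extent that each model word has a reasonable single-word equivalent in the target language.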