Online Program

All Times ET

Thursday, June 3
Practice and Applications
Classification and Simulation: Methods, Analyses, and Applications
Thu, Jun 3, 10:00 AM - 11:35 AM
TBD
 

Identifying different types of companies via their website text (309790)

*Piet J. Daas, Eindhoven University of Technology 
Nick J. de Wolf, Statistics Netherlands 

Keywords: classification, text mining, innovation, platform economy, artificial intelligence

In this poster, we describe the findings of our work on identifying different types of companies based on the text on their websites. So far, we have focused on identifying innovative, platform economy, and artificial intelligence companies in the Netherlands. For each case, at least 2000 company websites were used, split into an 80% training set and a 20% test set. For innovation, the results of the Community Innovation Survey were used. This is a biennial, standardized European survey that detects various forms of innovation; we focused on product innovation. For the other cases, positive examples of company websites were provided by experts. An equal number of negative cases was added to these by taking a random sample of company websites from the Business Register (which is maintained at our office). After preprocessing, various classification algorithms included in the scikit-learn library for Python were applied to determine which was best able to distinguish between the two classes in each case: innovative vs. non-innovative, platform vs. non-platform economy, and artificial intelligence vs. non-artificial intelligence. In addition, the effect of adding word embeddings was tested. We found that logistic regression with word embeddings worked best to detect innovation (accuracy 88%), a linear SVM worked best for platform economy websites (accuracy 82%), and logistic regression worked best to detect artificial intelligence companies (accuracy 92%). In the first case, only the text on the main page of the website could be used, while for the other cases the text on all scraped pages was required. Including word-embedding-based vectors did not improve the findings for platform economy and artificial intelligence. When the probability-based classification results of the models were checked, clear U-shaped distributions were found for the test set in all cases. This demonstrates that the models developed are well able to distinguish the two classes in each application.
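
The 80/20 split and the U-shape check mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the bin count, and the example probabilities below are all hypothetical, and only the Python standard library is used rather than scikit-learn. It assumes a binary classifier has already produced, for each test-set website, a predicted probability of belonging to the positive class.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle items and split them into (train, test); 80/20 by default."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def probability_histogram(probs, n_bins=10):
    """Count predicted probabilities per equal-width bin on [0, 1]."""
    counts = [0] * n_bins
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        # A probability of exactly 1.0 is placed in the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def tail_share(counts, n_tail_bins=2):
    """Fraction of cases that fall in the outermost bins at both ends."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return (sum(counts[:n_tail_bins]) + sum(counts[-n_tail_bins:])) / total


# Hypothetical website identifiers; the abstract mentions at least 2000 per case.
train_ids, test_ids = split_train_test(range(2000))
print(len(train_ids), len(test_ids))

# Made-up predicted probabilities for illustration only (not results from the poster).
example_probs = [0.02, 0.05, 0.08, 0.11, 0.15, 0.35, 0.52,
                 0.64, 0.86, 0.91, 0.94, 0.97, 0.99, 1.0]
counts = probability_histogram(example_probs)

# Text histogram: a U shape shows long bars at both ends and short ones in the middle.
for i, c in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * c}")
print(f"share in outer bins: {tail_share(counts):.2f}")
```

A high share of cases in the outer bins means the model assigns most websites a probability close to 0 or close to 1, which is what a U-shaped distribution expresses. The choice of ten bins and two tail bins per side is arbitrary here.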