Keywords: Web scraping, Text mining, Innovation, Big Data
Getting an overview of the innovative companies in a country is a challenging task. Traditionally this is done with a survey, and the results are used to estimate how many innovative companies there are in a country or area. This approach, however, places a burden on companies, may suffer from considerable non-response, and usually focuses on large companies while hardly covering small ones. We therefore investigated an alternative approach: determining whether a company is innovative by studying the text on the main page of its website. This required four steps: I) selecting a set of known innovative and non-innovative companies from the survey data; II) making sure that the URL of each company's website is available; III) scraping the main page of each website and preprocessing the text displayed; IV) developing a model that determines whether a company is innovative based on the preprocessed texts and other features. We started with a sample of 3000 innovative and 3000 non-innovative companies according to the Community Innovation Survey of Statistics Netherlands. Next, links to the websites had to be found, as we discovered that for almost two-thirds of these companies the URL was absent from the register of the Chamber of Commerce. After scraping each page, a model was developed. Logistic regression with L1-norm regularization was found to perform well. Using a 70%/30% split into training and test sets, the trained model determined whether a company was innovative with 93% accuracy. The model was then applied to more than 500,000 company websites in the Netherlands, revealing very detailed information on innovative areas in the country. More details, the issues encountered, and the implementation of this Big Data based approach in official statistics production are discussed in the presentation.
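As an illustration of step IV, the following is a minimal sketch of L1-regularized logistic regression on bag-of-words features with a 70%/30% split. It is not the implementation used in the study: the abstract does not specify the features, software, or hyperparameters, so the word lists, the generated toy pages, the binary word-presence features, the proximal gradient descent solver, and the values of `lam`, `lr` and `epochs` are all assumptions made for this example.

```python
import math
import random
import re

# Toy stand-ins for scraped main-page text. These word lists are invented
# for illustration and are NOT taken from the study's data.
INNOVATIVE_WORDS = ["research", "prototype", "patent", "technology", "startup"]
TRADITIONAL_WORDS = ["bakery", "plumbing", "rental", "repairs", "wholesale"]
SHARED_WORDS = ["contact", "team", "welcome", "services", "about"]


def make_page(label, rng):
    """Generate a fake page: 3 class-specific words plus 3 shared words."""
    pool = INNOVATIVE_WORDS if label == 1 else TRADITIONAL_WORDS
    return " ".join(rng.sample(pool, 3) + rng.sample(SHARED_WORDS, 3))


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text, vocab):
    """Sparse binary bag-of-words: {feature index: 1.0} for known words."""
    return {vocab[tok]: 1.0 for tok in set(tokenize(text)) if tok in vocab}


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(x, w, b):
    return b + sum(w[j] * v for j, v in x.items())


def train_l1_logreg(X, y, n_features, lam=0.01, lr=0.5, epochs=200):
    """Proximal gradient descent: a gradient step on the logistic loss,
    then soft-thresholding, which applies the L1 penalty to the weights."""
    w, b, n = [0.0] * n_features, 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, label in zip(X, y):
            err = sigmoid(score(x, w, b)) - label
            grad_b += err
            for j, v in x.items():
                grad_w[j] += err * v
        b -= lr * grad_b / n  # the intercept is not penalized
        for j in range(n_features):
            wj = w[j] - lr * grad_w[j] / n
            w[j] = math.copysign(max(abs(wj) - lr * lam, 0.0), wj)
    return w, b


rng = random.Random(0)
pages = [(make_page(label, rng), label) for label in [1, 0] * 30]
rng.shuffle(pages)

split = len(pages) * 7 // 10  # 70% training, 30% test
train_pages, test_pages = pages[:split], pages[split:]

# The vocabulary is built from the training texts only.
vocab = {tok: i for i, tok in enumerate(
    sorted({tok for text, _ in train_pages for tok in tokenize(text)}))}

X_train = [vectorize(text, vocab) for text, _ in train_pages]
y_train = [label for _, label in train_pages]
weights, bias = train_l1_logreg(X_train, y_train, len(vocab))

correct = sum(
    (1 if score(vectorize(text, vocab), weights, bias) >= 0 else 0) == label
    for text, label in test_pages)
accuracy = correct / len(test_pages)
print(f"test accuracy: {accuracy:.2f}")
```

The L1 penalty pushes the weights of uninformative words towards exactly zero, which keeps the model small and makes it easier to inspect which words drive the classification. In practice one would more likely rely on an established library than on a hand-written solver; the pure-Python version is used here only to keep the sketch self-contained.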