Online Program


All Times ET

Program is Subject to Change

Monday, June 14
Mon, Jun 14, 10:30 AM - 12:00 PM
TBD
Topics in Classification and Frame Development

How a Statistical Office Scrapes and Processes Internet Data (308065)

*Crt Grahonja, Statistical Office of the Republic of Slovenia

Keywords: web scraping, web crawling, machine learning, text processing, text mining

In the last decade, Internet data have been used increasingly in the field of official statistics. The Statistical Office of the Republic of Slovenia has also recognized the potential, quantity and velocity of such data, and we take part in multiple internal and EU-wide projects that aim to use Internet data to our advantage and to the advantage of our users. However, such data present issues related to extraction and processing, and while solving these issues we are also expected to follow the online rules of conduct known as netiquette. At the office we capture semi-structured and unstructured data from the Internet using custom-made programs written in Python, with heavy use of community open-source modules (multiprocessing, selenium, BeautifulSoup and scrapy). We distinguish two types of scraping jobs according to the general structure of the pages: scraping of list-based information and full-page capturing, and a specific extraction process had to be devised for each. Scraping semi-structured list-based information lets the programs work fast, extracting only the pertinent data and moving on to the next known or expected page. While such extraction proves relatively easy, errors in the format and shape of the data are frequent due to the nature of online content. Full-page capturing, on the other hand, involves parsing and searching across a massive pool of content which may or may not include links to the next page. Furthermore, establishing whether a site contains any relevant information at all is time- and resource-consuming. For such jobs we employ rule-based methods (white-listed expressions), machine-learning algorithms (regressions, AdaBoost, kNN, clustering, neural networks, etc.) and text-mining algorithms (tokenization, lemmatization, stopword filtering, bag-of-words, named entity recognition). The presentation aims to show our endeavours in extracting and pre-processing Internet data and in mitigating the drawbacks that such methods pose.
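The list-based extraction described in the abstract can be sketched in a few lines. The office's production jobs use BeautifulSoup and scrapy; the stdlib-only sketch below (the HTML snippet, class names and fields are invented for illustration) shows the idea of pulling only the pertinent fields from a semi-structured listing page:

```python
from html.parser import HTMLParser

# Minimal sketch of list-based scraping: pull only the pertinent
# fields (here: item name and price) from a semi-structured listing.
# The HTML layout and class names are invented for illustration; real
# jobs would use BeautifulSoup/scrapy against live pages.
LISTING = """
<ul class="products">
  <li><span class="name">Apples</span><span class="price">1.99</span></li>
  <li><span class="name">Bread</span><span class="price">2.49</span></li>
</ul>
"""

class ListingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []      # collected (name, price) pairs
        self._field = None   # which field the next text node belongs to
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:   # both fields seen: emit one row
                self.items.append((self._current["name"],
                                   float(self._current["price"])))
                self._current = {}

parser = ListingParser()
parser.feed(LISTING)
print(parser.items)   # [('Apples', 1.99), ('Bread', 2.49)]
```

Because the page structure is known in advance, the parser can skip everything except the two fields of interest, which is what makes this style of scraping fast.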
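The simplest of the text-mining steps mentioned for full-page capture — tokenization, stopword filtering and a bag-of-words representation — can likewise be sketched briefly. The stopword list and sample sentence below are invented; a production pipeline would add lemmatization and named entity recognition on top:

```python
import re
from collections import Counter

# Sketch of basic text-mining steps applied to captured page text:
# tokenization, stopword filtering and a bag-of-words (BoW) count.
# The stopword set and sample text are invented for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def bag_of_words(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenize
    kept = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    return Counter(kept)                              # BoW counts

bow = bag_of_words("The price of the product is listed in the product page.")
print(bow.most_common(2))   # [('product', 2), ('price', 1)]
```

The resulting counts can then feed the classifiers listed in the abstract (regressions, AdaBoost, kNN, etc.) to decide whether a page contains relevant information.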