Abstract:
|
Digital marketers use the IAB taxonomy to categorize customers' interests based on the web pages they browse. We report on categorizing long-term, high-volume web page requests, made directly from end-user devices, into the IAB taxonomy. The corpus varies with time, personalization, etc., and is very noisy because it contains database query results, scripts, and the like. Our categorizer addresses these issues. A taxonomy was constructed and periodically updated. First, semantic concepts were extracted from each URL string and the page's content. Second, the concepts were modeled to create a vocabulary for each category. Third, the taxonomy was validated against Wikipedia and the observed, browsed web corpus. For estimation, a sample of 110,000 categorized URLs was used, with 70 of the 345 categories accounting for 95% of URLs. Our findings include the following: semantic concepts are effective for classification; a flattened version of the two-tier IAB taxonomy suffices for classification; and Naive Bayes and Random Forest classifiers produce the best results, with comparable accuracies. Standard Naive Bayes achieved 78.4% accuracy, 75.9% precision, and 77.9% recall.
|
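The pipeline summarized in the abstract (concepts drawn from the URL string, a per-category vocabulary, and a Naive Bayes classifier over a flattened label set) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the tokenizer `url_tokens`, the stop list `STOP`, the class `NaiveBayes`, the Laplace smoothing parameter `alpha`, and the four toy training URLs are all hypothetical, and plain URL tokens stand in for the semantic concepts the abstract refers to.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop list of URL boilerplate tokens (an assumption, not from the paper).
STOP = {"http", "https", "www", "com", "org", "net", "html", "php"}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens, a crude stand-in for concept extraction."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [t for t in parts if len(t) > 2 and t not in STOP]


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over URL tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # smoothing strength (assumed value)
        self.token_counts = defaultdict(Counter)   # per-category vocabulary counts
        self.doc_counts = Counter()                # number of training URLs per category
        self.vocab = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            toks = url_tokens(url)
            self.doc_counts[label] += 1
            self.token_counts[label].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, url):
        # Ignore tokens never seen in training; they carry no evidence.
        toks = [t for t in url_tokens(url) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for t in toks:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy training data with flat category labels (illustrative only).
TRAIN = [
    ("https://example.com/sports/football-scores", "Sports"),
    ("https://example.com/sports/tennis/results", "Sports"),
    ("https://example.org/recipes/chicken-soup", "Food & Drink"),
    ("https://example.org/cooking/pasta-recipes", "Food & Drink"),
]
model = NaiveBayes().fit(*zip(*TRAIN))
print(model.predict("http://example.net/football/results.html"))
```

Treating each category as a single flat label, as the sketch does, mirrors the finding that a flattened taxonomy suffices. A real system would also draw concepts from page content and would need far more training data than this toy set.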