JSM 2017 Online Program

Abstract Details

Activity Number:	416 - A Tour of Statistical Innovations in Marketing Research
Type:	Contributed
Date/Time:	Tuesday, August 1, 2017 : 2:00 PM to 3:50 PM
Sponsor:	Section on Statistics in Marketing
Abstract #324811
Title:	An IAB Taxonomy-Based Classifier for Categorizing High Volume Noisy Data of User Generated URLs
Author(s):	Zainab Jamal* and Peng Wang and Roger Brooks and Prateek Jain and Mudit Jain and Kuldeep Jiwani and Karun Arora and Nisha Vashist and Sahil Jain and Shipra Kapadia
Companies:	Guavus Inc and Guavus and Guavus and Guavus and Guavus and Guavus and Guavus and Guavus and Guavus and Guavus
Keywords:	taxonomy generation ; URL categorization ; Naïve Bayes classifier ; Random Forest classifier ; Singular Value Decomposition ; IAB
Abstract:	Digital marketers use the IAB taxonomy to categorize customers' interests based on the web pages they browse. We report on categorizing long-term, high-volume web page requests, made directly from end-user devices, into the IAB taxonomy. The corpus varies by time, personalization, etc., and is very noisy because it contains database query results, scripts, and similar non-content material. Our categorizer addresses these issues. A taxonomy was constructed and periodically updated. First, semantic concepts were extracted from each URL's string and the page's content. Second, the concepts were modeled to create a vocabulary for each category. Third, the taxonomy was validated against Wikipedia and the observed corpus of browsed web pages. For estimation, a sample of 110,000 categorized URLs was used, with 70 of the 345 categories accounting for 95% of URLs. Our findings include: semantic concepts are effective for classification; a flattened version of the two-tier IAB taxonomy suffices for classification; and Naïve Bayes and Random Forest classifiers produce the best results, with comparable accuracies. Standard Naïve Bayes achieved 78.4% accuracy, 75.9% precision, and 77.9% recall.
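
As a rough illustration of the kind of classifier the abstract describes, the following minimal sketch trains a multinomial Naïve Bayes model with Laplace smoothing on tokens taken from URL strings. It is not the authors' implementation: the tokenizer is a crude stand-in for their semantic concept extraction, the stop-token list, the smoothing parameter alpha, the example URLs, and the flat category labels are all hypothetical, and page content is ignored entirely.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop tokens: URL boilerplate that carries no topical signal.
STOP_TOKENS = {"http", "https", "www", "com", "org", "net"}


def tokenize(url):
    """Split a URL into lowercase alphabetic tokens, dropping short and stop tokens."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [t for t in parts if len(t) > 2 and t not in STOP_TOKENS]


class NaiveBayes:
    """Multinomial Naive Bayes over URL tokens with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, urls, labels):
        self.class_counts = Counter(labels)       # number of URLs per category
        self.token_counts = defaultdict(Counter)  # token frequencies per category
        for url, label in zip(urls, labels):
            self.token_counts[label].update(tokenize(url))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, url):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(url) if t in self.vocab]
        n_urls = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(count / n_urls)  # log prior
            for t in tokens:                  # add log likelihood of each token
                score += math.log((self.token_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data with flat (single-tier) category labels.
TRAIN = [
    ("http://example.com/sports/football-scores", "Sports"),
    ("http://example.com/sports/basketball-playoffs", "Sports"),
    ("http://example.org/travel/cheap-flights", "Travel"),
    ("http://example.org/travel/hotel-deals", "Travel"),
]

model = NaiveBayes().fit([u for u, _ in TRAIN], [c for _, c in TRAIN])
print(model.predict("https://news.example.net/football/playoffs?id=123"))
```

Treating every category as a flat label, as done here, mirrors the abstract's finding that a flattened version of the two-tier taxonomy suffices; a real system would also need to filter the noisy requests (query results, scripts) before classification.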

Authors who are presenting talks have a * after their name.


Copyright © American Statistical Association
