JSM 2016 Online Program

Activity Number:	88
Type:	Invited
Date/Time:	Sunday, July 31, 2016 : 6:00 PM to 8:00 PM
Sponsor:	Section on Statistical Learning and Data Science
Abstract #320716
Title:	Text Mining on Domain Names
Author(s):	Kenneth E. Shirley*
Companies:	Amazon
Keywords:	Text Mining ; Word Segmentation ; Topic Models ; Domain Names
Abstract:	In this poster I'll present two types of text mining analyses on domain names: one supervised and one unsupervised. To generate features for domain names (which are unique alphanumeric strings), we apply a probabilistic word segmentation algorithm as a key pre-processing step, dividing each domain name into its constituent words. Then, in the supervised learning problem, we gather ground-truth labels for whether each domain name is "malicious" (typically meaning it is associated with phishing or malware), and we train a model to learn which individual words within domain names are most strongly associated with maliciousness. In the unsupervised learning problem, we experiment with a variety of clustering and topic modeling techniques to see whether we can detect groups of words that frequently occur together within domain names.
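
To illustrate the pre-processing step, here is a minimal sketch of one common form of probabilistic word segmentation: a unigram language model with dynamic programming. The abstract does not say which segmentation algorithm is used, so this is an assumption rather than the author's method. The word counts in `TOY_COUNTS`, the unknown-word penalty, and the names `word_log_prob` and `segment` are all hypothetical and exist only for illustration. The sketch expects a bare domain label with the top-level domain already removed.

```python
import math

# Hypothetical toy unigram counts (assumption). A real system would
# estimate these from a large text corpus.
TOY_COUNTS = {
    "secure": 50, "login": 40, "bank": 60, "pay": 30, "pal": 5,
    "paypal": 45, "free": 70, "gift": 35, "card": 40, "cards": 20,
    "update": 30, "account": 55, "a": 200, "an": 150, "the": 300,
}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 20  # longest candidate word considered


def word_log_prob(word):
    """Log-probability of a single word under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Unknown strings get a penalty that grows with length (an assumed
    # heuristic), so long unrecognized chunks are discouraged.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(label):
    """Split a domain label into its most probable sequence of words."""
    text = label.lower()
    n = len(text)
    # best[i] = (log-prob of the best segmentation of text[:i],
    #            start index of the last word in that segmentation)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + word_log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    for label in ["securelogin", "paypalupdate", "freegiftcards"]:
        print(label, "->", segment(label))
```

The resulting word lists could then serve as bag-of-words features for either analysis, for example as inputs to a classifier of malicious versus benign domains or to a topic model.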

Authors who are presenting talks have a * after their name.