Abstract:
|
This poster presents two types of text mining analysis on domain names: one supervised and one unsupervised. Because domain names are unique alphanumeric strings, we first apply a probabilistic word segmentation algorithm as a key pre-processing step, splitting each domain name into its constituent words, which then serve as features. In the supervised learning problem, we gather ground-truth labels indicating whether each domain name is "malicious" (typically meaning it is associated with phishing or malware), and we train a model to learn which individual words within domain names are most strongly associated with maliciousness. In the unsupervised learning problem, we experiment with a variety of clustering and topic modeling techniques to see whether we can detect groups of words that frequently occur together within domain names.
|
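As a rough illustration of the word segmentation step described in the abstract, the sketch below splits a domain label into words by maximizing the sum of unigram log-probabilities with dynamic programming. It is only a minimal sketch under stated assumptions, not the algorithm used in the poster: the `TOY_COUNTS` table, the unknown-word penalty, and the function names `word_logprob` and `segment` are all hypothetical, and a real system would estimate word frequencies from a large corpus.

```python
import math

# Hypothetical toy unigram counts (an assumption for illustration only);
# a real system would estimate these from a large text corpus.
TOY_COUNTS = {
    "free": 50, "bank": 40, "login": 30, "secure": 20,
    "pay": 25, "pal": 5, "paypal": 30, "in": 60, "log": 15,
}
TOTAL = sum(TOY_COUNTS.values())


def word_logprob(word):
    """Log-probability of a word under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Assumed penalty for unknown strings: shrinks sharply with length,
    # so long unknown chunks are strongly discouraged.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(text, max_word_len=20):
    """Return the most probable split of `text` into words."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("securebanklogin"))
```

With these toy counts, `"login"` is kept whole because its log-probability is higher than the combined score of `"log"` and `"in"`. The resulting word lists could then be fed to a bag-of-words model for the supervised and unsupervised analyses.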