Abstract:
|
Letter n-grams are remarkably efficient features for many types of text classification models. These features can outperform bigram and trigram word features with a significantly smaller dictionary size. At first glance, this result is surprising: letters (or more generally, graphemes) carry no intrinsic linguistic meaning. This talk presents results showing that letter n-grams serve as a proxy for phonetic information in stylistic classifiers, such as gender detection, and for morphological information in topical classifiers. To show this, we compare the error rates of a cross-validated elastic net over different classes of feature matrices, constructed separately from graphemes, phonemes, and morphemes. A feature hashing scheme is used to control for the overall capacity of each linguistic unit. The talk concludes with a link to the associated R package gtm_text and the practical implications of this research for other text classification efforts.
|