
Abstract Details

Activity Number: 529 - SPEED: Machine Learning
Type: Contributed
Date/Time: Wednesday, August 2, 2017 : 10:30 AM to 11:15 AM
Sponsor: Section on Statistical Learning and Data Science
Abstract #325097
Title: Grapheme, Phoneme, Morpheme: Features for Text Classification
Author(s): Taylor Arnold*
Companies: University of Richmond
Keywords: text classification ; model selection ; feature hashing
Abstract:

Letter n-grams are surprisingly efficient features for many types of text classification models. These features can outperform bigram and trigram word features with a significantly smaller dictionary size. At first glance, this result is unexpected: letters (or more generally, graphemes) carry no intrinsic linguistic meaning. This talk presents results showing that letter n-grams serve as a proxy for phonetic information in stylistic classifiers, such as gender detection, and for morphological information in topical classifiers. To show this, we compare the error rates of a cross-validated elastic net over different classes of feature matrices. The feature matrices are constructed, separately, from graphemes, phonemes, and morphemes. A feature hashing scheme is used to control for the overall capacity of each linguistic unit. The talk will conclude with a link to the associated R package gtm_text and practical implications of the research for other text classification efforts.
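
As an illustration of the feature hashing idea mentioned in the abstract, the following is a minimal Python sketch, not the author's implementation and not the gtm_text API. It assumes that each document has already been split into tokens of one linguistic unit (here, letter n-grams) and that an MD5-based hash is an acceptable stand-in for whatever hash function the talk uses. The names char_ngrams, hash_bucket, and hashed_counts, and the bucket count, are hypothetical.

```python
import hashlib


def char_ngrams(text, n):
    """Return all overlapping letter n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_bucket(token, num_buckets):
    """Map a token to a stable bucket index in [0, num_buckets).

    MD5 is an assumption made for reproducibility; any stable hash works.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_counts(tokens, num_buckets):
    """Build a fixed-length count vector from an arbitrary token list.

    Because the vector length is num_buckets regardless of the vocabulary,
    graphemes, phonemes, and morphemes can all be given the same capacity.
    """
    vec = [0] * num_buckets
    for token in tokens:
        vec[hash_bucket(token, num_buckets)] += 1
    return vec


# Example: letter trigrams hashed into 64 buckets (hypothetical size).
features = hashed_counts(char_ngrams("Text classification", 3), 64)
```

Passing phoneme or morpheme tokens to hashed_counts with the same num_buckets would give each unit a feature matrix of identical width, which is one way to read "control for the overall capacity" in the abstract.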


Authors who are presenting talks have a * after their name.

Back to the full JSM 2017 program

 
 
Copyright © American Statistical Association