Abstract:
|
Letter n-grams are remarkably efficient features for many types of text classification models. These features can outperform bigram and trigram word features with a significantly smaller dictionary size. At first glance, this result is surprising: letters (or more generally, graphemes) carry no intrinsic linguistic meaning. This talk presents results showing that letter n-grams serve as a proxy for phonetic information in stylistic classifiers, such as gender detection, and for morphological information in topical classifiers. To show this, we compare the error rates of a cross-validated elastic net over different classes of feature matrices, constructed separately from graphemes, phonemes, and morphemes. A feature hashing scheme is used to control for the overall capacity of each linguistic unit. The talk concludes with a link to the associated R package gtm_text and the practical implications of this research for other text classification efforts.
|