Online Program

All Times ET

Friday, February 19
Fri, Feb 19, 1:30 PM - 3:00 PM
Virtual
Controlling Text and Texting Controls

Guidelines in Selecting Appropriate Text Preprocessing Methods (304141)

*Christine P. Chai, Microsoft 

Keywords: natural language processing, text mining, computational linguistics, data preprocessing, unstructured data

Statisticians and data scientists spend a large amount of time preprocessing data for analysis, and unstructured text data are no exception. Many text preprocessing methods are available, but there is no one-size-fits-all procedure for preparing text corpora for modeling. The appropriate steps depend not only on the application goals, but also on the nature of the corpus. For instance, separating each sentence in a document may not be important in topic modeling, but it is essential in end-user applications like machine translation and question answering.

Therefore, we provide some guidelines on how to select appropriate text preprocessing methods for a new dataset. We evaluate the pros and cons of methods such as removing punctuation, removing stopwords, stemming and lemmatization, and n-gramming to retain word order. We also review examples of text analysis to demonstrate the need for particular text preprocessing methods, empowering statistical practitioners to make better preprocessing decisions. This talk assumes the audience has a basic understanding of natural language processing and has probably performed simple analyses of text data before.
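
As a rough illustration of the trade-offs among these methods, the following minimal Python sketch applies punctuation removal, stopword removal, and n-gramming to one sentence. It is not taken from the talk: the function names, the sample sentence, and the tiny stopword list are all assumptions made for illustration, and stemming and lemmatization are left out because they typically rely on an external library.

```python
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "in", "of", "on", "and", "to", "not"}


def remove_punctuation(text):
    """Delete ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence.
text = "The movie is not good, and the ending is bad."
tokens = tokenize(remove_punctuation(text))

# Removing stopwords shrinks the vocabulary but also drops the negation "not".
unigrams = remove_stopwords(tokens)

# Bigrams built before stopword removal keep local word order, e.g. "not good".
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

In this sketch, stopword removal leaves "movie", "good", "ending", and "bad", so the negation disappears, while the bigram "not good" preserves it. That may be acceptable for topic modeling but harmful for sentiment analysis, which is one reason the order and choice of preprocessing steps depend on the application.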