
Abstract Details

Activity Number: 452
Type: Contributed
Date/Time: Tuesday, August 2, 2016, 3:05 PM to 3:50 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #321753
Title: Maximizing Text Mining Performance: The Impact of Pre-Processing
Author(s): Dario Gregori* and Paola Berchialla and Nicola Soriani and Ileana Baldi and Corrado Lanera
Companies: University of Padova and University of Torino and University of Padova and University of Padova and University of Padova
Keywords: text mining; preprocessing; n-grams; neural networks; SVM; boosting

In recent years, with the rise of electronic medical records, there has been a dramatic increase in the amount of text to analyze. While the selection and tuning of the best-performing algorithm, which are intertwined with the scalability and robustness of the algorithm itself, can be implemented in ready-to-use systems, pre-processing is still not an automated step of the analysis. Pre-processing is important because it serves as the basis of any further analysis, and poor pre-processing can hamper the performance of even the best-tuned algorithm. In this work, we studied the impact of the most common text pre-processing steps, such as stripping white space, removing stopwords, stemming, and building n-grams, on multi-labelling and classification. The motivating example is the labelling and classification of abstracts retrieved from bibliographic databases (PubMed, WoS, and Scopus, among others) and electronic clinical reports. Pre-processing is assessed in conjunction with neural networks, support vector machines, and boosting to highlight their synergistic impact and the importance of the order in which the individual steps are carried out.
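The pre-processing steps named in the abstract can be sketched as a small pipeline. This is a minimal, self-contained illustration, not the authors' actual implementation: the stopword list is a tiny placeholder, and the suffix-stripping "stemmer" is a deliberately naive stand-in for a real stemmer such as Porter's.

```python
import re

# Illustrative stopword list (assumption; a real pipeline would use a
# standard list such as the one shipped with NLTK).
STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}

def strip_whitespace(text):
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    # Lowercase and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Naive suffix stripping; a real pipeline would use e.g. a Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n=2):
    # Build contiguous n-grams from the token stream.
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def preprocess(text, n=2):
    # Order matters: strip whitespace, tokenize, remove stopwords,
    # stem, then build n-grams over the stemmed tokens.
    tokens = [stem(t) for t in remove_stopwords(tokenize(strip_whitespace(text)))]
    return tokens, ngrams(tokens, n)

tokens, bigrams = preprocess("Stripping   white space and removing stopwords")
print(tokens)   # ['stripp', 'white', 'space', 'remov', 'stopword']
print(bigrams)  # ['stripp white', 'white space', 'space remov', 'remov stopword']
```

Reordering the steps changes the output — stemming before stopword removal, for instance, can turn a stopword into a non-stopword or vice versa — which is the kind of interaction the abstract sets out to measure.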

Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

Copyright © American Statistical Association