353 – SPEED: Statistical Learning and Data Science

Maximizing Text Mining Performance: The Impact of Pre-Processing

Sponsor: Section on Statistical Learning and Data Science
Keywords: Text Mining, Preprocessing, Electronic Medical Records, Machine Learning, Boosting

Corrado Lanera

University of Padova

Paola Berchialla

University of Torino

Ileana Baldi

University of Padova

Dario Gregori

University of Padova

In recent years, the rise of electronic medical records has dramatically increased the volume of text available for analysis. While the selection and tuning of the best-performing algorithm, which are intertwined with the scalability and robustness of the algorithm itself, can be implemented in ready-to-use systems, pre-processing is still not an automated step of the analysis. Pre-processing matters because it serves as the basis of any further analysis, and poor pre-processing can hamper the performance of even the best-tuned algorithm. In this work, we studied the impact of the most common text pre-processing steps, such as stripping white space, removing stopwords, stemming or building n-grams, on multi-labelling and classification. The motivating example is the labelling and classification of abstracts retrieved from bibliographic databases (PubMed, WoS and Scopus, among others) and of electronic clinical reports. Pre-processing is assessed in conjunction with neural networks, support vector machines and boosting to highlight their synergistic impact and the importance of the order in which the individual steps are carried out.
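To make the point about step order concrete, the following is a minimal Python sketch, not the authors' pipeline. The stopword list, the `toy_stem` suffix stripper (a crude stand-in for a real stemmer such as Porter's), the sample sentence and all function names are assumptions introduced only for illustration. It applies the same four steps named in the abstract in two different orders and shows that the resulting bigram features differ.

```python
import re

# Tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"the", "of", "and", "in", "a", "was", "with"}


def strip_whitespace(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n=2):
    """Build contiguous n-grams, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sentence in the style of a clinical report.
text = "  The patient   was treated with  antibiotics  "
tokens = tokenize(strip_whitespace(text))

# Order A: remove stopwords, stem, then build bigrams.
order_a = ngrams([toy_stem(t) for t in remove_stopwords(tokens)])

# Order B: stem, build bigrams, then drop bigrams that contain a stopword.
order_b = [
    gram
    for gram in ngrams([toy_stem(t) for t in tokens])
    if not any(word in STOPWORDS for word in gram.split())
]

print(order_a)  # ['patient treat', 'treat antibiotic']
print(order_b)  # []
```

In this toy case, removing stopwords first joins content words that were not adjacent in the original sentence, whereas building bigrams first and filtering afterwards leaves no features at all. Which behaviour is preferable depends on the downstream classifier and task.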