Abstract:
|
In recent years, with the rise of electronic medical records, there has been a dramatic increase in the amount of text to analyze. While the selection and tuning of the best-performing algorithm, which are intertwined with the scalability and robustness of the algorithm itself, can be implemented in ready-to-use systems, pre-processing is still not an automated step of the analysis. Pre-processing matters because it serves as the basis for any further analysis, and poor pre-processing can hamper the performance of even the best-tuned algorithm. In this work, we study the impact of the most common text pre-processing steps, such as stripping whitespace, removing stopwords, stemming, and building n-grams, on multi-labelling and classification. The motivating example is the labelling and classification of abstracts retrieved from bibliographic databases (PubMed, WoS, and Scopus, among others) and of electronic clinical reports. The pre-processing is assessed in conjunction with neural networks, support vector machines, and boosting to highlight their synergistic impact and the importance of the order in which the individual steps are carried out.
|
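
To make the four pre-processing steps named in the abstract concrete, the snippet below is a minimal sketch in Python with NLTK. The paper does not specify its tooling, so NLTK, the `preprocess` helper, and the fixed step order shown here are illustrative assumptions, not the authors' actual implementation.

```python
# A minimal sketch of the pre-processing steps discussed in the abstract,
# assuming Python with NLTK (an assumption; the paper's tooling is unspecified).
import re

from nltk.corpus import stopwords      # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.util import ngrams


def preprocess(text, n=2):
    """Apply common pre-processing steps in one possible fixed order:
    whitespace stripping -> stopword removal -> stemming -> n-gram building.
    The effect of this ordering is one of the factors the paper studies."""
    # 1. Strip and normalise whitespace, lowercasing along the way.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # 2. Tokenise on whitespace and drop English stopwords.
    stop = set(stopwords.words("english"))
    tokens = [t for t in text.split() if t not in stop]
    # 3. Reduce each remaining token to its stem.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]
    # 4. Build n-grams (bigrams by default) over the stemmed tokens.
    return list(ngrams(stems, n))


# Example: preprocess("The patient was admitted with acute chest pain")
# yields bigrams such as ('patient', 'admit') and ('acut', 'chest').
```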