Abstract:
|
Deep learning-based NLP techniques are commonly applied to electronic health record (EHR) data to extract information from narrative text. Major drawbacks of these methods are their need for labor-intensive manual annotation and for a significant amount of computing resources to train models. To overcome these issues, we propose a semi-supervised system that combines unsupervised deep learning for pre-training with supervised machine learning on minimal training data. We randomly selected 100 patients seen in 2014-2018 at UTMB pulmonary clinics, collected 1010 scanned polysomnogram laboratory reports from the EHR (EpicCare), and applied the LSTM-based Tesseract OCR engine (NeuralNetsInTesseract4.00, 2019) to obtain machine-readable text. Subsequently, we built an unsupervised system using a publicly available pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. Last, we used the output of BERT as embeddings to train a random forest model. A sample of 50 reports was used for training. Evaluation on the 960 held-out reports showed 97.6% precision and 92.7% sensitivity. In conclusion, although our system used a small training sample, its performance was similar to that achieved by time-consuming deep learning classifiers.
|
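
The abstract does not say how the 50/960 split was drawn or how the metrics were computed. The following is a minimal sketch, assuming a simple random split and the standard binary definitions precision = TP / (TP + FP) and sensitivity = TP / (TP + FN). The function names, the fixed seed, and the integer report IDs are hypothetical and are not taken from the study.

```python
import random


def split_train_heldout(items, n_train, seed=0):
    """Randomly choose n_train items for training; the rest are held out.

    The seed is an illustrative assumption so the split is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def precision_sensitivity(y_true, y_pred):
    """Return (precision, sensitivity) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


# Hypothetical report IDs standing in for the 1010 OCR'd reports.
report_ids = range(1010)
train_ids, heldout_ids = split_train_heldout(report_ids, n_train=50)

# Toy labels for illustration only; these are not study data.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 1, 0]
toy_precision, toy_sensitivity = precision_sensitivity(toy_true, toy_pred)
```

In this sketch, the held-out set is simply every report not drawn for training, which matches the 50 + 960 = 1010 breakdown reported above.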