
All Times EDT

Abstract Details

Activity Number: 132 - SLDS CSpeed 1
Type: Contributed
Date/Time: Monday, August 9, 2021 : 1:30 PM to 3:20 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #318723
Title: Efficient Semi-Supervised Deep Learning and Machine Learning NLP System to Extract Clinical Measurements from Polysomnogram Laboratory Reports
Author(s): Ioannis Malagaris* and David En Shuo Hsu and Yong-fang Kuo
Companies: University of Texas Medical Branch and University of Texas Medical Branch and University of Texas Medical Branch
Keywords: NLP; Deep Learning; BERT; OCR; Unsupervised Learning; Machine Learning
Abstract:

Deep learning-based NLP techniques are commonly applied to EHR data to extract information from narrative text. Major drawbacks of these methods are the need for labor-intensive manual annotation and the significant computing resources required to train models. To overcome these issues, we propose a semi-supervised system: unsupervised deep learning for pre-training, followed by supervised machine learning with minimal use of training data. We randomly selected 100 patients seen in 2014-2018 at UTMB pulmonary clinics, collected 1010 scanned polysomnogram laboratory reports from the EHR (EpicCare), and applied the LSTM-based Tesseract OCR engine (NeuralNetsInTesseract4.00, 2019) to obtain machine-readable text. Subsequently, we built an unsupervised system using publicly available pre-trained BERT (Bidirectional Encoder Representations from Transformers). Finally, we used the output of BERT as embeddings to train a random forest model. A sample of 50 reports was used for training. Evaluation on the remaining 960 held-out reports showed 97.6% precision and 92.7% sensitivity. In conclusion, although our system used a small training sample, its performance was similar to that achieved by time-consuming deep learning classifiers.
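
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how precision and sensitivity (recall) are conventionally computed from binary labels. It is not the authors' evaluation code. The function names and the toy labels are hypothetical, and the abstract does not state how a correct extraction was defined, so treating each item as a simple 1/0 label is an assumption.

```python
from typing import Sequence, Tuple


def confusion_counts(gold: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for 1/0 labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    return tp, fp, fn


def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("precision is undefined without positive predictions")
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Share of true positives that were found: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined without positive gold labels")
    return tp / (tp + fn)


# Hypothetical toy labels, not data from the study.
toy_gold = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0]
toy_tp, toy_fp, toy_fn = confusion_counts(toy_gold, toy_pred)
toy_precision = precision(toy_tp, toy_fp)
toy_sensitivity = sensitivity(toy_tp, toy_fn)
```

In this toy case one true positive is missed and no false positives occur, so precision is higher than sensitivity, the same direction as the gap reported in the abstract.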


Authors who are presenting talks have a * after their name.

Back to the full JSM 2021 program