All Times EDT

Abstract Details

Activity Number: 356 - Statistical Learning: Methods and Applications
Type: Contributed
Date/Time: Wednesday, August 5, 2020 : 10:00 AM to 2:00 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #311128
Title: Combination of Optical Character Recognition and Natural Language Processing to Identify Patients with Sleep Apnea in EHR Data
Author(s): Enshuo Hsu* and Yong-Fang Kuo and Rizwana Sultana and Gulshan Sharma
Companies: University of Texas Medical Branch and University of Texas Medical Branch and University of Texas Medical Branch and University of Texas Medical Branch
Keywords: EHR; deep learning; text detection; OCR; NLP

Although NLP techniques have been applied in EHR studies, extracting information from images remains challenging. Fortunately, novel deep learning algorithms for text detection and optical character recognition (OCR) have recently become available. To our knowledge, no research has combined them with recurrent neural networks (RNNs) to extract laboratory test results. We developed a data pipeline that processes sleep study interpretation reports to identify diagnoses of sleep apnea, an increasingly common diagnosis (Gelburd, 2018). We randomly selected 100 patients seen in 2014-2018 at UTMB pulmonary clinics, collected their scanned reports from the EHR (EpicCare), and applied the LSTM-based Tesseract OCR engine (NeuralNetsInTesseract4.00, 2019) to obtain machine-readable text. We then trained an RNN-based NLP model to identify sleep apnea diagnoses and measurement values, including apnea-hypopnea index and oxygen saturation. Validation against physician chart review showed 100% sensitivity and 80% specificity for the proposed data pipeline. Future studies are needed to generalize this pipeline to other information, such as PLM arousal index and cardiac arrhythmia.
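
For readers less familiar with the validation metrics named above, the following is a minimal sketch of how sensitivity and specificity could be computed when pipeline output is compared against chart review. The function name `sensitivity_specificity` and the toy labels are hypothetical and chosen only for illustration; they are not the study's data or the authors' code.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = sleep apnea)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")

    sensitivity = tp / (tp + fn)  # share of true cases the pipeline found
    specificity = tn / (tn + fp)  # share of non-cases the pipeline cleared
    return sensitivity, specificity


# Hypothetical toy labels, NOT the study's data.
chart_review = [1, 1, 1, 1, 0, 0, 0, 0]  # reference standard (physician)
pipeline_out = [1, 1, 1, 0, 0, 0, 0, 1]  # assumed pipeline predictions

sens, spec = sensitivity_specificity(chart_review, pipeline_out)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Under this definition, a sensitivity of 100% means no chart-confirmed case was missed, while a specificity of 80% means one in five patients without the diagnosis was flagged.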

Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program