
All Times EDT

Abstract Details

Activity Number: 1 - Invited E-Poster Session
Type: Invited
Date/Time: Sunday, August 2, 2020 : 12:30 PM to 3:30 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #312183
Title: Unsupervised Multi-Granular Word Segmentation and Medical Term Discovery via Graph Partition
Author(s): Zheng Yuan and Yuanhao Liu and Qiuyang Yin and Boyao Li and Sheng Yu*
Companies: Tsinghua University and University of Michigan and Tsinghua University and Tsinghua University and Tsinghua University
Keywords: word segmentation; electronic health records; term discovery; graph partition; deep learning; text mining
Abstract:

Concept extraction is a fundamental step in free-text analysis of electronic health record (EHR) data, and it relies heavily on the comprehensiveness of the medical dictionary. Data-driven discovery of medical terms can help identify unrecorded terms, which are usually informal variants of the preferred standard terms. We propose an unsupervised algorithm that performs multi-granular word segmentation and term discovery; it is designed primarily for Chinese but may also work for other languages. For an input sentence, an undirected graph is built in which the nodes are the input characters and the edge weights, computed from corpus statistics, represent how likely the two connected characters are to belong to the same word. Ratio Cut is used to partition the graph into subgraphs, each corresponding to a word, which is a candidate term if it is not recorded in the dictionary. A BERT-based classifier is trained on simulated data to further evaluate how likely each candidate is to be a genuine term. A benchmark test on annotated EHR text for multi-granular term discovery shows that the proposed algorithm outperforms existing ones by a large margin.
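
To make the graph partition step concrete, the following is a minimal sketch of a single Ratio Cut bipartition over the characters of a sentence. It is not the authors' implementation: the abstract does not say how the edge weights are computed or how the Ratio Cut objective is optimized. The sketch assumes that words are contiguous character spans, so only the n - 1 split positions need to be scored, and it uses a hypothetical `AFFINITY` table of made-up adjacent-character weights in place of real corpus statistics. The names `ratio_cut`, `best_split`, `toy_weights`, and `split_once` are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple


def ratio_cut(weights: Sequence[Sequence[float]], k: int) -> float:
    """Ratio Cut value for splitting nodes 0..n-1 into A = [0, k) and B = [k, n).

    RatioCut(A, B) = cut(A, B) / |A| + cut(A, B) / |B|,
    where cut(A, B) is the total weight of edges crossing the split.
    """
    n = len(weights)
    cut = sum(weights[i][j] for i in range(k) for j in range(k, n))
    return cut / k + cut / (n - k)


def best_split(weights: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Return the split position k (and its score) with the smallest Ratio Cut.

    Assumption: words are contiguous, so only positions 1..n-1 are tried.
    """
    n = len(weights)
    score, k = min((ratio_cut(weights, k), k) for k in range(1, n))
    return k, score


def toy_weights(sentence: str, affinity: Dict[str, float]) -> List[List[float]]:
    """Build a symmetric weight matrix with edges between adjacent characters only.

    Assumption: `affinity` maps a two-character string to a made-up score;
    a real system would derive these weights from corpus statistics.
    """
    n = len(sentence)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        a = affinity.get(sentence[i:i + 2], 0.1)
        w[i][i + 1] = w[i + 1][i] = a
    return w


def split_once(sentence: str, weights: Sequence[Sequence[float]]) -> Tuple[str, str]:
    """Cut the sentence into two segments at the best Ratio Cut position."""
    k, _ = best_split(weights)
    return sentence[:k], sentence[k:]


# Hypothetical affinities for a placeholder five-character "sentence".
AFFINITY = {"AB": 0.9, "BC": 0.1, "CD": 0.8, "DE": 0.9}
weights = toy_weights("ABCDE", AFFINITY)
left, right = split_once("ABCDE", weights)
```

With these toy weights the weakest link is between `B` and `C`, so the sentence is cut into `AB` and `CDE`. Applying `split_once` again to each segment would give finer segmentations, which is one plausible way to obtain several granularities; the abstract does not state how the multi-granular output is actually produced.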


Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program