Abstract:
|
Concept extraction is a fundamental step in free-text analysis of electronic health record (EHR) data, and it relies heavily on the comprehensiveness of the medical dictionary. Data-driven discovery of medical terms can help identify unrecorded terms, which are usually informal variants of the preferred standard terms. We propose an unsupervised algorithm that performs multi-granular word segmentation and term discovery, designed primarily for Chinese but potentially applicable to other languages. For an input sentence, an undirected graph is built in which the nodes are the input characters and the edge weights, computed from corpus statistics, represent how likely the two connected characters are to belong to the same word. Ratio Cut is used to partition the graph into subgraphs, each corresponding to a word; a word is a potential term if it is not recorded in the dictionary. A BERT-based classifier is trained on simulated data to further evaluate the likelihood that each candidate is a term. A benchmark test on annotated EHR text for multi-granular term discovery shows that the proposed algorithm outperforms existing ones by a large margin.
|
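To make the graph-partitioning step concrete, the following is a minimal sketch of a single Ratio Cut bisection on a toy sentence. It is an illustration under stated assumptions and not the implementation evaluated in this work: the function name `ratio_cut_bisect`, the restriction to contiguous splits, and the hand-picked edge weights are all hypothetical, and the weights stand in for the corpus statistics mentioned above. The score used is the standard two-way objective cut(A, B)/|A| + cut(A, B)/|B|.

```python
from typing import Dict, Tuple


def ratio_cut_bisect(n: int, weights: Dict[Tuple[int, int], float]) -> Tuple[int, float]:
    """Return (split, score) minimising the two-way Ratio Cut.

    Assumption: only contiguous bipartitions are considered, so a split p
    sends characters [0, p) to the left part and [p, n) to the right part.
    `weights` maps a pair of character positions to an edge weight.
    """
    if n < 2:
        raise ValueError("need at least two characters to split")
    best_split, best_score = 0, float("inf")
    for p in range(1, n):
        # Total weight of the edges that cross the boundary before position p.
        cut = sum(w for (i, j), w in weights.items() if min(i, j) < p <= max(i, j))
        # RatioCut(A, B) = cut(A, B) / |A| + cut(A, B) / |B|
        score = cut * (1.0 / p + 1.0 / (n - p))
        if score < best_score:
            best_split, best_score = p, score
    return best_split, best_score


# Toy input: hypothetical weights, not derived from any real corpus.
sentence = "abcd"
toy_weights = {(0, 1): 5.0, (1, 2): 1.0, (2, 3): 5.0}

split, score = ratio_cut_bisect(len(sentence), toy_weights)
left, right = sentence[:split], sentence[split:]
print(left, right, score)
```

With these toy weights, the weak edge between positions 1 and 2 is the cheapest to cut and also yields balanced parts, so the sentence is split in the middle. Applying the same bisection recursively to each part would be one way to obtain segmentations at several granularities, although that is again an assumption about how the idea might be realised.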