Abstract:
|
Text mining has attracted much attention with the rapid development of digitization and OCR technologies, creating a large demand for text analysis tools. Domain-specific Chinese texts, such as ancient Chinese prose and electronic health records, vary widely in structure and style. Their syntactic structures and word usage frequencies differ from those of the modern official Chinese used in newspapers. In practical applications, pre-specified vocabularies and relevant corpora are sometimes unavailable, so semi-supervised methods based on statistical modelling are preferred. I will introduce TopWORDS 2, a weakly supervised method for Chinese meta-pattern discovery and named entity recognition. In my research, I found that TopWORDS 2 could effectively extract valuable information and facilitate further text analysis.
|