Abstract:
|
With the rapid growth of the Internet and digitization technologies, large quantities of digitized text data can be easily collected. There is a great need to develop computational tools that automatically extract information from these data and create new knowledge. Because natural language is complex and Internet texts are massive and noisy, it is not feasible to analyze these text data using precise linguistic analysis. Instead, statistical learning methods exhibit significant advantages even though they miss some subtleties of natural language. The Chinese language differs most significantly from alphabet-based languages in not marking word boundaries within a sentence, so most existing Chinese text mining methods require a pre-specified vocabulary and/or a large, suitable training corpus, which can be difficult to obtain under many circumstances. We propose a set of statistical approaches that can achieve high-quality unsupervised learning for big Chinese texts. Various real-data applications show that the proposed methods enjoy great advantages in word discovery, classification and sentence segmentation for mining domain-specific Chinese texts, to which supervised approa
|