
Abstract Details

Activity Number: 203 - Emerging Statistical Methods for Big Tensor Data in Chemometrics and Related Fields
Type: Invited
Date/Time: Monday, July 31, 2017 : 2:00 PM to 3:50 PM
Sponsor: Section on Physical and Engineering Sciences
Abstract #322173
Title: Novel Statistical Approaches for Mining Big Text Data
Author(s): Ke Deng* and Jun Liu
Companies: Tsinghua University and Harvard University
Keywords: Chinese text mining ; word discovery ; word classification ; sentence segmentation ; EM algorithm
Abstract:

With the rapid growth of the Internet and digitization technologies, large quantities of digitized text data can be easily collected. There is a great need to develop computational tools that automatically extract information from these data and create new knowledge. Because natural language is complex and Internet texts are massive and noisy, it is not feasible to analyze these text data using precise linguistics. Instead, statistical learning methods exhibit significant advantages, even though they miss some subtleties of natural language. Because the Chinese language differs most significantly from alphabet-based languages in not marking word boundaries within a sentence, most existing Chinese text mining methods require a pre-specified vocabulary and/or a large, suitable training corpus, which can be difficult to obtain in many circumstances. We propose a set of statistical approaches that achieve high-quality unsupervised learning for big Chinese texts. Various real-data applications show that the proposed methods offer great advantages in word discovery, classification, and sentence segmentation for mining domain-specific Chinese texts, to which supervised approa


Authors who are presenting talks have a * after their name.

Back to the full JSM 2017 program

 
 