Abstract Details

Activity Number: 229 - Advances in the Neyman-Pearson Classification
Type: Topic Contributed
Date/Time: Monday, July 29, 2019 : 2:00 PM to 3:50 PM
Sponsor: WNAR
Abstract #304942 Presentation
Title: Intentional Control of Type I Error Over Unconscious Data Distortion: a Neyman-Pearson Approach to Text Classification
Author(s): Richard Zhao* and Lucy Xia and Xin Tong and Yanhui Wu
Companies: Pennsylvania State University and Stanford University and University of Southern California and University of Southern California
Keywords: Neyman-Pearson; text classification; data distortion; social media; data mining; type I error

Digital texts have become an important source of data for social studies. However, textual data from open platforms are vulnerable to manipulation and distortion, which often biases subsequent empirical analysis. This work addresses the challenges that data distortion poses for classifying posts published on a leading Chinese micro-blogging platform. The classical classification paradigm, which minimizes the overall classification error, can yield an undesirably large type I error, and data distortion exacerbates this problem. As a solution to inestimable data distortion, we propose the Neyman-Pearson (NP) classification paradigm, which minimizes type II error under a user-specified type I error constraint. Theoretically, we show that the NP oracle is unaffected by data distortion when the class-conditional distributions remain the same. Even though the training and test data are susceptible to different distortion rates, our approach controls the more severe error within the test data at the targeted level. This case study can be generalized to many applications in which data distortion is present and controlling one type of error is the top priority.
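To make the type I error constraint concrete, the following is a minimal sketch of one known way to set an NP-style decision threshold: order-statistic thresholding on held-out class-0 scores, in the spirit of the NP umbrella algorithm. The abstract does not say which procedure the authors use, so this is an illustration under stated assumptions and not their method. It assumes that class 0 is the class whose misclassification counts as a type I error, that scores come from some already-trained scoring classifier, that the held-out class-0 scores are i.i.d. and were not used for training, and that a tolerance `delta` bounds the probability that the type I error exceeds `alpha`. The function names are hypothetical.

```python
import math


def np_threshold_rank(n, alpha, delta):
    """Return the smallest 1-indexed rank k such that thresholding at the
    k-th smallest of n held-out class-0 scores keeps
    P(type I error > alpha) <= delta, or None if n is too small.

    Assumption: the violation probability for rank k is bounded by
    sum_{j=k}^{n} C(n, j) * (1 - alpha)^j * alpha^(n - j).
    """
    for k in range(1, n + 1):
        violation = sum(
            math.comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if violation <= delta:
            return k
    return None


def np_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Pick the score threshold from held-out class-0 scores."""
    n = len(class0_scores)
    k = np_threshold_rank(n, alpha, delta)
    if k is None:
        raise ValueError("too few class-0 scores for this alpha and delta")
    return sorted(class0_scores)[k - 1]


def classify(score, threshold):
    """Predict class 1 only when the score strictly exceeds the threshold."""
    return 1 if score > threshold else 0
```

Because the threshold depends only on class-0 scores, a change in the class proportions does not enter this step, which is one way to read the claim that the NP oracle is unaffected when the class-conditional distributions remain the same. Under these assumptions, with `alpha = delta = 0.05` at least 59 held-out class-0 scores are needed, since that is the smallest `n` with `0.95 ** n <= 0.05`.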

Authors who are presenting talks have a * after their name.

Back to the full JSM 2019 program