Abstract:
|
Digital texts have become an important source of data for social studies. However, textual data from open platforms are vulnerable to manipulation and distortion, which often bias subsequent empirical analysis. This work addresses the challenges posed by data distortion in classifying posts published on a leading Chinese micro-blogging platform. The classical classification paradigm, which minimizes the overall classification error, can yield an undesirably large type I error, and data distortion exacerbates this problem. Because the distortion rate cannot be estimated, we propose the Neyman-Pearson (NP) classification paradigm, which minimizes the type II error subject to a user-specified upper bound on the type I error. Theoretically, we show that the NP oracle is unaffected by data distortion when the class-conditional distributions remain the same. Even when the training and test data are subject to different distortion rates, our approach controls the more severe type of error on the test data at the targeted level. This case study generalizes to many applications in which data distortion is present and controlling one type of error is the top priority.
|
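For reference, the sketch below states the standard NP classification objective referred to in the abstract; the notation (classifier \phi, errors R_0 and R_1, level \alpha) is illustrative and not taken from the paper itself.

% Sketch of the NP classification objective (standard formulation; the symbols
% are illustrative, not the paper's own notation). Assumes amsmath is loaded.
% \phi : X -> {0,1} is a classifier, Y the true class label,
% R_0(\phi) the type I error, R_1(\phi) the type II error,
% \alpha the user-specified upper bound on the type I error.
\begin{equation*}
  \phi^{*}_{\alpha} \in \operatorname*{arg\,min}_{\phi \,:\, R_0(\phi) \le \alpha} R_1(\phi),
  \qquad
  R_0(\phi) = P\bigl(\phi(X)=1 \mid Y=0\bigr),
  \quad
  R_1(\phi) = P\bigl(\phi(X)=0 \mid Y=1\bigr).
\end{equation*}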