Abstract:
|
A 5-element identity shows that for any dataset of size n, probabilistic or not, the difference between the sample and population averages is the product of three measures: (1) data quality, (2) data quantity, and (3) problem difficulty. This decomposition tells us: (I) Probabilistic sampling ensures high data quality by controlling a data defect index at the level of 1/√N, where N is the population size; (II) When we lose this control, the estimation error, relative to the benchmarking rate 1/√n, increases with √N, forming the Law of Large Populations; (III) The "bigness" of Big Data (for population inferences) should be measured by the relative size n/N, not the absolute size n; (IV) When combining data sources for population inferences, the relatively tiny but higher-quality ones should be given far more weight than their sizes suggest. An application to the 2016 US presidential election reminds us that, when we ignore data quality, population inferences with Big Data are subject to a Big Data Paradox: the bigger the data, the surer we fool ourselves. The identity also reveals the possibility of simultaneously enhancing data privacy and data quality for non-probabilistic samples.
|
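The three-factor decomposition described above can be checked numerically. A standard form of such an identity writes the error of a sample mean as (data defect correlation) × (a function of the sampling fraction f = n/N) × (population standard deviation). The sketch below is illustrative only: the synthetic population, the biased inclusion mechanism, and the specific factorization rho * sqrt((1-f)/f) * sigma are assumptions for demonstration, not taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic finite population of size N.
N = 100_000
Y = rng.normal(10.0, 3.0, N)

# Non-probabilistic inclusion: units with larger Y are more likely
# to be recorded, so the sample is biased (illustrative mechanism).
R = (rng.random(N) < 0.3 + 0.4 * (Y > 10.0)).astype(float)

n = R.sum()          # realized sample size
f = n / N            # relative size n/N ("bigness" of the data)

# Left side: sample mean minus population mean (the estimation error).
lhs = Y[R == 1.0].mean() - Y.mean()

# Right side: data quality (correlation between inclusion and outcome)
# x data quantity term sqrt((1-f)/f) x problem difficulty (sigma_Y).
rho = np.corrcoef(R, Y)[0, 1]        # data defect correlation
sigma = Y.std()                      # population std dev (ddof=0)
rhs = rho * np.sqrt((1.0 - f) / f) * sigma

print(f"error = {lhs:.6f}, decomposition = {rhs:.6f}")
```

The two sides agree to floating-point precision for any dataset, random or not, which is what makes the decomposition an identity rather than an approximation.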