69 – Federal Perspectives on Privacy, Confidentiality, and Data Quality
Inappropriate Use of Statistical Measures in the Name of Balancing Data Quality and Confidentiality of Tabular Format Magnitude Data
Ramesh A. Dandekar
U.S. Energy Information Administration
Statisticians are aware that measures such as the mean, the variance, and the Pearson correlation coefficient are disproportionately influenced by a relatively small number of extremely large observations and are therefore unreliable for comparing the overall quality of data that follow an extremely skewed distribution. Tabular data cells follow such a distribution. In this paper we show that linear-programming-based controlled tabular adjustment (CTA), which generates synthetic tabular data (Dandekar 2001), makes use of a least absolute difference linear regression model and is well suited to controlling overall data quality on its own, without the additional steps proposed by quality-preserving controlled tabular adjustment (QP-CTA), which has been heavily promoted to the statistical community since 2003.
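The sensitivity of these measures to a few large cells can be illustrated with a minimal sketch. The cell values below are hypothetical and chosen only to be strongly skewed; they are not data from the paper, and the perturbation is an arbitrary distortion, not CTA or QP-CTA. Every small cell is changed by 50 percent while the three largest cells are left untouched, yet the Pearson correlation and the overall total barely move.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical skewed table cells: 97 small cells and 3 very large ones.
small = [float(10 + (i % 7)) for i in range(97)]
large = [50_000.0, 120_000.0, 400_000.0]
original = small + large

# Distort every small cell by +50% or -50%; leave the large cells unchanged.
adjusted = [v * (1.5 if i % 2 == 0 else 0.5) for i, v in enumerate(small)] + large

r = pearson(original, adjusted)

# Relative change in the grand total (equivalently, in the mean).
total_rel_change = abs(sum(adjusted) - sum(original)) / sum(original)

# Median of the per-cell absolute relative errors.
rel_errors = sorted(abs(a - o) / o for o, a in zip(original, adjusted))
median_rel_error = rel_errors[len(rel_errors) // 2]

print(f"Pearson r             = {r:.8f}")
print(f"relative total change = {total_rel_change:.6f}")
print(f"median cell error     = {median_rel_error:.2f}")
```

In this sketch the correlation stays extremely close to 1 and the total is nearly unchanged, while the typical cell is off by half its value. A cell-level measure such as the median relative error reflects that loss of quality, whereas the aggregate measures do not.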