Abstract:
|
Multivariate categorical data nested within households often include reported values that fail edit constraints (for example, a participating household reports a child's age as greater than that of the child's biological parent) as well as missing values. Typically, agencies seek to create datasets free from such erroneous or missing values, which can then be used for analysis or disseminated to secondary data users. We present a model-based engine for doing so, built on a Bayesian hierarchical model that includes (i) a nested data Dirichlet process mixture of products of multinomial distributions as the model for the true unobserved values of the data, truncated to allow only households that satisfy all edit constraints, (ii) a model for the location of errors, and (iii) a reporting model for the observed responses in error. The approach propagates uncertainty due to unknown locations of errors and missing values, generates plausible datasets that satisfy all edit constraints, and can preserve multivariate relationships within and across individuals in the same household. We evaluate the performance of the approach using data from the 2012 American Community Survey.
|