Online Program

Thursday, May 17
Computing Science
Combining Federal and Regional Data Sources: Challenges and Solutions
Thu, May 17, 5:15 PM - 6:15 PM
Lake Fairfax B

Six Classes of Methodological Research Questions in the Integration of Multiple Data Sources for Granular Estimation (304716)

*John Eltinge, U.S. Census Bureau 

Keywords: big data; break in series; disclosure limitation; elicitation of utility functions; fault-tolerant designs; incomplete data

Recent reports from the Committee on National Statistics, the Commission on Evidence-Based Policymaking, and other groups have highlighted increased stakeholder interest in the production and dissemination of statistical information at relatively fine levels of cross-sectional and temporal granularity. To address these interests in practical settings, it is generally necessary to use a range of statistical methods (e.g., record linkage, imputation, multiple-frame approaches, observational-propensity modeling, hierarchical modeling, and small domain estimation) to integrate multiple data sources. Prospective data sources often include sample surveys, but may also include administrative or commercial records, sensors, or other forms of “non-designed data” (sometimes labeled “organic data” or “big data”). In addition, some prospective data sources are captured on a consistent national basis, while others may have quality characteristics that differ substantially across regions or other subpopulation groupings, or over time.
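To make one of the integration methods mentioned above concrete, the sketch below shows a composite (inverse-variance-weighted) small-domain estimator that combines a noisy direct survey estimate with a more stable synthetic estimate derived from administrative records. This is an illustrative sketch, not a method from the abstract; the function name and all numeric inputs are invented for the example.

```python
# Hypothetical illustration of small-domain estimation: combine a direct
# survey estimate with an administrative-records-based synthetic estimate.
# The weights follow the standard inverse-variance rule used in composite
# (Fay-Herriot-style) estimators; all numbers below are invented.

def composite_estimate(direct, var_direct, synthetic, var_synthetic):
    """Inverse-variance-weighted combination of two independent estimates."""
    w = var_synthetic / (var_direct + var_synthetic)  # weight on the direct estimate
    estimate = w * direct + (1 - w) * synthetic
    variance = (var_direct * var_synthetic) / (var_direct + var_synthetic)
    return estimate, variance

# Example: a high-variance direct survey estimate is shrunk toward a
# lower-variance synthetic estimate from administrative data.
est, var = composite_estimate(direct=52.0, var_direct=9.0,
                              synthetic=48.0, var_synthetic=3.0)
# est → 49.0, var → 2.25 (combined variance is below either input variance)
```

Note how the combined variance (2.25) is smaller than the variance of either input estimate, which is the basic motivation for borrowing strength across data sources at fine levels of granularity.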

This paper outlines six areas in which extensions of current statistical methodology may be of value in addressing the goals described above:

(1) Identification of high-priority estimands: do the stakeholders need to estimate small-domain means, or something else?
(2) Use of broad statistical design concepts to allocate efficiently the resources required to acquire and manage multiple data sources, and to produce and disseminate estimates based on those sources.
(3) Empirical assessment of the predominant factors that affect the quality of estimators based on the design developed under (2).
(4) Characterization and mitigation of risks arising from the prospective loss of, or undetected changes in, one or more data sources.
(5) Calibration of customary measures of data quality and risk with stakeholder utility functions.
(6) Use of results from (1)-(5) to inform decisions about the use of specific prospective data sources.
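Area (2) concerns efficient allocation of resources across data sources. One classical design concept of this kind is Neyman allocation, which distributes a fixed total sample size across strata in proportion to stratum size times within-stratum variability. The sketch below is a hypothetical illustration of that idea, not a procedure from the paper; the stratum sizes and standard deviations are invented.

```python
# Hypothetical sketch of a design-based resource allocation (area (2)):
# Neyman allocation of a fixed total sample size across strata, which
# minimizes the variance of a stratified mean for that fixed total.
# Stratum sizes and standard deviations are invented for illustration.

def neyman_allocation(n_total, sizes, std_devs):
    """Allocate n_total units across strata proportionally to N_h * S_h."""
    products = [N * S for N, S in zip(sizes, std_devs)]
    total = sum(products)
    return [round(n_total * p / total) for p in products]

alloc = neyman_allocation(n_total=1000,
                          sizes=[5000, 2000, 3000],    # stratum population sizes N_h
                          std_devs=[10.0, 30.0, 5.0])  # within-stratum SDs S_h
# N_h * S_h products: 50000, 60000, 15000 → alloc = [400, 480, 120]
```

The second stratum receives the largest allocation despite being the smallest, because its high within-stratum variability makes additional observations there most valuable, the same logic that would guide spending on a noisy but informative data source.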