Introductory Overview Lectures (IOLS)

Introductory overview lectures provide high-level overviews of important and timely statistical topics relevant to establishment surveys. The invited presenters are specifically selected as experts in their topic. IOLS are designed to give a basic understanding of an unfamiliar topic (in preparation for other sessions) or to catch up on the latest research in an emerging field.


IOL Session: Use of Administrative Data in Statistical Outputs

Editing and Imputation of Administrative Data – Jeffrey Hoogland (Statistics Netherlands)
Administrative data may contain errors that affect economic statistics based on these data. The types of errors for administrative data can be different than for survey data. Data quality aspects and possible types of errors are explained. Several editing strategies are discussed, both for single and integrated administrative data sources. Administrative data can also be incomplete for statistical purposes. Imputation methods are treated and some examples are given regarding the use of administrative data for economic statistics.

Use of Administrative Data in Statistical Outputs: Survey Estimation – Wesley Yung (Statistics Canada)
Economic surveys have been using administrative data in many different ways for many years. These uses include frame improvement, use in editing and imputation, estimation and data confrontation. In this presentation, we look at the use of administrative data in the survey estimation process. We discuss different ways that administrative data can be used such as direct replacement or as auxiliary data for modelling or calibration. We also touch on pre-conditions that must be fulfilled before considering the use of administrative data for survey estimation and the role that big data may have in survey estimation for economic surveys.

IOL Session: Big Data

Practical Applications of Big Data for Official Statistics – Peter Struijs and Barteld Braaksma (Statistics Netherlands)
It is widely recognized that big data have a high potential for official statistics. As a recent worldwide UN survey has shown, many initiatives have been started at the national as well as the international level to realize this potential. Most initiatives are aimed at exploring big data sources. So far only a few have led to actual dissemination of statistics that are based on big data. Referring to Dutch experiences with big data in explorative research as well as official statistics, this overview lecture aims at showing the main issues and possible ways of dealing with the challenges.

The following aspects of dealing with big data will be discussed: getting access to big data sources; data exploration; privacy aspects; data processing and IT aspects; methodological issues such as selectivity, integration of big data with other data sources, and the use of model-based techniques; and data visualization. The challenges also have an organizational dimension, for instance regarding the launching and fitting of big data activities in an existing organization, acquiring the knowledge and skills needed, and collaboration with private and non-private partners. There may be important cultural and strategic implications. The presentation will cover these aspects as well.

The examples used as a reference concern the use of (1) traffic sensor data for statistics on road traffic intensities, (2) mobile phone location data for following the spatial distribution of a population, (3) social media data for estimating sentiment, (4) scanner data for price statistics and (5) use of website data for statistics production.

Overview of Big Data Research in European Statistical Agencies – Loredana Di Consiglio, Martin Karlberg, Michail Skaliotis and Ioannis Xirouchakis (Eurostat)
The European Statistical System (ESS) has committed itself to exploring the potential of big data for producing official statistics by adopting the Scheveningen Memorandum in 2013 and the big data Action Plan and Roadmap in 2014. As a result, an ambitious collaborative research programme has been launched in big data and official statistics. In parallel, several statistical agencies, as well as Eurostat, are engaged in further methodological research and experimentation in specific statistical areas (e.g. prices, enterprise characteristics, tourism accommodation statistics, culture and job vacancies). These first experiences have highlighted both the great potential and the great challenges of using big data for official statistics. Intensive use of web scraping has been emerging as a new tool for collecting relevant statistical information, while, on the other hand, the change in the nature, and hence in the complexity, of the new types of data has underlined the need to enlarge the spectrum of methods in official statistics, for example with machine learning and text mining. At the same time, the use of big data sources requires a (new) reflection on ethical and privacy issues which, in turn, may imply methodological and technological investigations.

The aim of this overview is twofold: (i) to provide a summary and assessment of recent initiatives undertaken by European statistical agencies in big data research of direct relevance to enterprise surveys, business registers and the EuroGroups Register (EGR), and (ii) to raise awareness about near-future research opportunities available in the EU’s Horizon 2020 programme.

IOL Session: Imputation Methods

Single Imputation – Ton de Waal (Statistics Netherlands)
In this lecture, we will discuss single imputation methods, i.e. methods where the missing data are imputed only once. We will start by examining simple methods, such as (group) mean imputation and ratio imputation. We will discuss cross-sectional and longitudinal versions of these methods.

Mean imputation and ratio imputation are special cases of regression imputation. In regression imputation missing values are imputed by means of a regression model. Regression imputation can be carried out in two ways: with a stochastic term or without a stochastic term. In the lecture, we will discuss both options.
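The methods named above can be illustrated with a short Python sketch. It is not taken from the lecture: the function names, the use of a single auxiliary variable x and the toy figures are assumptions made for illustration only. Mean imputation corresponds to a regression with an intercept only, and ratio imputation to a regression through the origin; the third function fits an ordinary least-squares line and optionally adds a random residual (the stochastic variant).

```python
import math
import random


def mean_impute(y):
    """Fill each missing value (None) with the mean of the observed values."""
    observed = [v for v in y if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in y]


def ratio_impute(y, x):
    """Fill missing y_i with R * x_i, where R = sum(observed y) / sum(their x)."""
    pairs = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    ratio = sum(yi for yi, _ in pairs) / sum(xi for _, xi in pairs)
    return [ratio * xi if yi is None else yi for yi, xi in zip(y, x)]


def regression_impute(y, x, stochastic=False, rng=None):
    """Fill missing y_i from a least-squares line fitted to the observed pairs."""
    pairs = [(xi, yi) for yi, xi in zip(y, x) if yi is not None]
    n = len(pairs)
    x_bar = sum(xi for xi, _ in pairs) / n
    y_bar = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard deviation, used only by the stochastic variant.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in pairs)
    sigma = math.sqrt(sse / (n - 2)) if n > 2 else 0.0
    rng = rng or random.Random()
    result = []
    for yi, xi in zip(y, x):
        if yi is None:
            yi = intercept + slope * xi
            if stochastic:
                yi += rng.gauss(0.0, sigma)  # add a random residual
        result.append(yi)
    return result


# Toy data (hypothetical): None marks a missing turnover value.
turnover = [10.0, None, 31.0, 39.0]
employees = [1, 2, 3, 4]
print(mean_impute(turnover))
print(ratio_impute(turnover, employees))
print(regression_impute(turnover, employees))
print(regression_impute(turnover, employees, stochastic=True, rng=random.Random(1)))
```

The deterministic variant places every imputed value exactly on the fitted line, which tends to understate the variability of the completed data; the stochastic variant restores some of that variability at the cost of adding noise to the point estimates.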

We will also examine hot deck donor imputation, where data from a selected donor unit are used to impute missing data in another unit. We will describe two approaches for selecting such a donor: random hot deck imputation, where a donor is selected randomly, and nearest-neighbour hot deck imputation, where a donor is selected by minimizing a distance function.
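As a rough illustration of the two donor-selection approaches, the following Python sketch imputes one target variable from a donor record. The record layout, the variable names and the squared Euclidean distance are assumptions chosen for the example; in practice the distance function and any donor classes would be tailored to the survey.

```python
import random


def random_hot_deck(records, target, rng=None):
    """Impute missing `target` values with the value of a randomly chosen donor."""
    rng = rng or random.Random()
    donors = [r for r in records if r[target] is not None]
    completed = []
    for r in records:
        if r[target] is None:
            donor = rng.choice(donors)
            r = {**r, target: donor[target]}  # copy, leave the input untouched
        completed.append(r)
    return completed


def nearest_neighbour_hot_deck(records, target, aux):
    """Impute missing `target` values from the donor closest on the `aux` variables."""
    donors = [r for r in records if r[target] is not None]

    def distance(a, b):
        # Squared Euclidean distance over the auxiliary variables.
        return sum((a[k] - b[k]) ** 2 for k in aux)

    completed = []
    for r in records:
        if r[target] is None:
            donor = min(donors, key=lambda d: distance(d, r))
            r = {**r, target: donor[target]}
        completed.append(r)
    return completed


# Toy data (hypothetical): the third enterprise did not report turnover.
enterprises = [
    {"employees": 10, "turnover": 100.0},
    {"employees": 50, "turnover": 480.0},
    {"employees": 12, "turnover": None},
]
print(nearest_neighbour_hot_deck(enterprises, "turnover", ["employees"]))
print(random_hot_deck(enterprises, "turnover", random.Random(0)))
```

Because the imputed value is always copied from a real unit, both variants produce values that have actually been observed somewhere in the data.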

A further imputation method that we will discuss is predictive mean matching. Predictive mean matching is a hybrid method that first uses a regression model to predict values for the missing data of a unit and then uses these predicted values to find the nearest-neighbour donor.
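The two stages of predictive mean matching can be sketched as follows. This is a minimal illustration under simplifying assumptions (one auxiliary variable, an ordinary least-squares line, a single closest donor, hypothetical names and data), not the specific variant covered in the lecture.

```python
def predictive_mean_matching(y, x):
    """Impute missing y_i with the observed y of the donor whose predicted
    value is closest to the predicted value of the unit with missing data."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    x_bar = sum(xi for xi, _ in observed) / n
    y_bar = sum(yi for _, yi in observed) / n
    sxx = sum((xi - x_bar) ** 2 for xi, _ in observed)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in observed)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    def predict(xi):
        return intercept + slope * xi

    # Stage 1: predicted values for all potential donors.
    donors = [(predict(xi), yi) for xi, yi in observed]

    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            target = predict(xi)
            # Stage 2: nearest-neighbour search on the predicted values.
            _, yi = min(donors, key=lambda d: abs(d[0] - target))
        completed.append(yi)
    return completed


# Toy data (hypothetical): the third unit has a missing y value.
x = [1.0, 2.0, 3.4, 4.0, 5.0]
y = [12.0, 19.0, None, 41.0, 52.0]
print(predictive_mean_matching(x=x, y=y))
```

The regression model is used only to measure similarity between units; the value that is imputed is a donor's observed value rather than the model prediction itself.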

We will briefly sketch how variances can be estimated for single imputation methods.

Data sometimes have to satisfy logical relations, for example the profit of an enterprise should equal its total turnover minus its total expenses. In other cases, population totals may already be known, for example from an administrative data source, or have been estimated before. We will end the lecture by briefly sketching how single imputation methods can be extended so that the imputed data satisfy logical relations and preserve known or already estimated population totals.

Multiple Imputation – Dr. Rebecca Andridge (Ohio State University)
In many surveys, imputation procedures are used to account for nonresponse bias induced by either unit nonresponse or item nonresponse. In this lecture, we will discuss multiple imputation methods, whereby missing data values are filled in more than once to create multiple completed data sets. These completed data sets are then analyzed separately, and the results are pooled using so-called combining rules in order to obtain valid inference.
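For a scalar estimate, the combining rules usually attributed to Rubin can be written in a few lines. The Python sketch below assumes that each of the m completed data sets has already been analyzed and has produced a point estimate and a variance estimate; the function name and the numbers are hypothetical.

```python
import statistics


def combine_estimates(estimates, variances):
    """Pool m completed-data estimates and their variances.

    Returns the pooled point estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the average within-imputation
    variance and B is the between-imputation variance.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Toy results (hypothetical) from m = 3 completed data sets.
estimate, total_variance = combine_estimates([10.0, 12.0, 14.0], [4.0, 4.0, 4.0])
print(estimate, total_variance)
```

The between-imputation term is what single imputation lacks: it carries the extra uncertainty due to the missing values into the final variance.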

We will start by motivating the use of multiple imputation through a worked example, which provides an intuitive rationale behind the use of multiple rather than single imputation. We will then review the theoretical underpinnings of multiple imputation, originally developed by Rubin in the context of missing data in sample surveys. How single imputation methods may have to be altered to provide valid multiple imputation inference will be discussed. We will also discuss the important relationship between the imputation model and the analysis (substantive) model.

Missing data can occur throughout a data set, and often occur in what has been referred to as a “Swiss cheese” missingness pattern. We will review two general multiple imputation approaches that have been proposed to handle such sporadic missingness: joint modeling and fully conditional specification (also referred to as multiple imputation by chained equations).

While many implementations of multiple imputation require an assumption of multivariate normality, many establishment surveys collect data that clearly violate this assumption. We will illustrate alternative multiple imputation methods, including adaptations of ratio imputation, hot deck imputation, and predictive mean matching, that can be used when a normality assumption is questionable.

Multiple imputation has grown in popularity and is now widely available in a range of statistical software packages. Throughout this lecture, we will illustrate the use of MI methods in standard software.

IOL Session: Questionnaire Design and Response Burden

Questionnaire Design for Business Surveys – Jaki S. McCarthy (USDA’s National Agricultural Statistics Service)
One of the critical pieces of any survey is the questionnaire which collects data from the sample units. This introductory overview lecture will discuss the steps involved in designing and testing a business survey questionnaire, from defining the data user needs, identifying the relevant survey design elements, operationalizing the survey concepts, writing questions and testing the questionnaire. Good survey estimates are only possible with complete and accurate data on a questionnaire. A well-designed questionnaire instrument makes this possible.

General questionnaire design principles used to collect accurate data will be discussed, as well as considerations unique to establishment surveys. Establishment survey specific issues can include identification of the reporting unit and respondents, the nature of information collected from establishments, the impact of business record keeping, and the nature of business populations. The lecture will also briefly discuss additional considerations for internet-based questionnaires. As more data is collected with online instruments, the benefits of electronic questionnaires can be more fully realized. The importance of questionnaire testing will also be addressed. Attendees of the lecture will come away with a basic understanding of the steps in business survey questionnaire design and of the areas where questionnaire design may differ between surveys of establishments and surveys of households or individuals.

Response Burden in Business Surveys – Mojca Bavdaž (University of Ljubljana)
Response burden has long been a concern for survey organizations. It grew into a more urgent matter as business working time became more precious with global competition, pressure to achieve higher productivity, the spread of lean approaches that eliminate any “waste” activity, etc. Voluntary business surveys request, and many government business surveys simply require, that businesses dedicate some of this time to questionnaire completion. Businesses have been loudly questioning the need to provide data to survey organizations, especially because of the existence of alternative data sources, the growing demand for data, and the lack of convincing explanations about the necessity of participation and the usefulness of the resulting statistics.

This introductory overview lecture will first introduce the concept of response burden, which may refer to the time needed for a survey response, the costs associated with such a response, or the feeling experienced when faced with a survey. The lecture will then explain the relevance of response burden, and the purposes and challenges of its measurement. The measurement of response burden is not straightforward because it has to deal with complex relations among people within and between businesses, and may itself represent an additional burden for businesses. The lecture will review common approaches to measuring response burden and present burden-reduction actions in national statistical institutes. Attendees of the lecture will take away a broad understanding of response burden issues and an awareness of current solutions and gaps in present knowledge of response burden.

IOL Session: Adaptive Design

Responsive Design and Paradata: Paradata-Based Quality Indicators – James Wagner (University of Maryland)
Judging the quality of survey estimates has become more difficult. Recent research has demonstrated that the response rate is not a good indicator of when nonresponse bias is likely to occur. Nonresponse bias is a function of both the response rate and the survey values of responders and nonresponders. In establishment surveys, the size of the organization is also an important factor. Small establishments have less potential to create bias when unobserved. Evaluations of potential biases should examine both predictors of response and predictors of the survey variables. This would include examining the relationship between paradata and the observed survey data. It is also useful to carefully examine the relationship between the paradata and response probabilities. Such an analysis might usefully inform survey design if actions can be identified that simultaneously work on both dimensions of the problem, that is, actions that increase response propensities and bring in a different kind of respondent. This presentation will look at analyses of several surveys that examine both dimensions of the problem.
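The dependence of nonresponse bias on both dimensions can be made concrete with the standard deterministic approximation for a respondent mean, bias = (1 - response rate) x (respondent mean - nonrespondent mean). The Python sketch below uses this textbook expression with hypothetical numbers; it is an illustration, not a result from the presentation.

```python
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    """Approximate bias of the respondent mean under a deterministic
    response model: (1 - response rate) * (respondent - nonrespondent mean)."""
    return (1.0 - response_rate) * (mean_respondents - mean_nonrespondents)


# Same response rate, very different bias (hypothetical figures).
print(nonresponse_bias(0.6, 100.0, 100.0))  # respondents resemble nonrespondents
print(nonresponse_bias(0.6, 100.0, 80.0))   # respondents differ from nonrespondents
```

Two surveys with the same response rate can therefore carry very different biases, which is why the response rate alone is a weak quality indicator.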

Responsive Data Collection Design: Tailoring Fieldwork Effort – Annemieke Luiten, Ger Snijkers (Statistics Netherlands), and Barry Schouten (Statistics Netherlands and Utrecht University)
Statistical agencies in Europe and the US face several constraints. On the one hand, there is the demand for high quality data. On the other hand, collecting these data has become more difficult. Response rates have been declining. Costs have been increasing, and budgets have been decreasing. As a result, statistical agencies are looking for design options that control costs and errors. This situation has led to a growing interest in adaptive survey designs. Various institutes like the U.S. Census Bureau, Statistics Canada, RTI International, Statistics Sweden and Statistics Netherlands are using or considering adaptive survey designs for production. Adaptive survey designs are based on the rationale that any population is heterogeneous both in its response and answering behaviour to surveys and in the costs of recruiting and interviewing its members. Different survey design features may be effective for different members of the population.

The main components of adaptive survey designs are a set of candidate treatments, a stratification of the population into relevant subpopulations, a set of explicit quality and cost criteria that need to be optimized, and input parameters based on (historic) survey data that represent the effectiveness of the treatments for each of the subpopulations.

In this lecture, we will first discuss the kinds of adaptive survey designs that are distinguished, the circumstances under which these are deployed, and their implications.

Subsequently, the relevant steps in an adaptive survey design are discussed: the identification of design features that potentially affect survey errors and costs, the identification of indicators of quality and costs, the monitoring and analysis of process data, and the decision rules that govern appropriate interventions. The largest part of the lecture is dedicated to a discussion and comparison of various fieldwork approaches, in both person and business surveys.

IOL Session: Disclosure Avoidance Methods

Statistical Disclosure Methods for Tabular Data – Juan-Jose Salazar-Gonzalez (Universidad de La Laguna, Tenerife)
Statistical agencies collect individually identifiable data, process them, and publish statistical summaries (tables). During this process, the agencies are required to protect individually identifiable data through a variety of policies. In all cases, the aim is to provide the data users with useful statistical information while ensuring that the responses from the individuals are protected.

To this end, and due to the size of the data, combinatorial problems appear and require algorithmic approaches to find optimal or near-optimal solutions. This talk summarizes and compares the most common statistical disclosure control methods, which aim to minimize information loss while keeping the disclosure risk from different data snoopers small. A common definition of protection is first introduced.

The methods for finding protected tables in accordance with the given definition are then described. Two integer linear programming models described in the literature for the cell suppression methodology are extended to work also for the controlled rounding methodology. In addition, two relaxed variants are presented using two associated linear programming models, called partial cell suppression and partial controlled rounding, respectively.

A final discussion shows how to combine the four methods and how to implement a cutting-plane approach for the exact and heuristic resolution of the combinatorial problems in practice. The methods are implemented in free and open-source software called tau-ARGUS.

For details, we refer the reader to the book Statistical Confidentiality: Principles and Practice, by George Duncan, Mark Elliot and Juan-Jose Salazar-Gonzalez, Springer 2011.

Statistical Disclosure Methods for Microdata – Anna Oganian (Georgia Southern University and National Center for Health Statistics)
In this talk I will define key concepts of microdata protection and describe some relevant Statistical Disclosure Limitation (SDL) methods. I will start with basic definitions and describe the structure of a microdata file. Before releasing such data to the public, statistical agencies have an obligation by law to protect the confidentiality of the respondents/data providers, and at the same time they strive to release a product that would satisfy the ever-growing demands of potential data users. Thus, the goal of microdata protection is two-fold: minimize the risk of disclosure of respondents’ confidential information and maximize the utility of the released data for the user. The key issue here is that these goals conflict. To decrease the disclosure risk, the data protector typically has to perturb microdata in some way, which often leads to decreased utility of the resultant data to the user. On the other hand, when the data are modified with the goal of improving the utility, some protection may be undone, which will increase disclosure risk. Hence, a trade-off between data utility and disclosure risk is the main issue of SDL practice. This is why a decision about how to define and measure data utility and disclosure risk should be among the first steps in the process of microdata protection. It will help to better understand and compare the existing SDL methods, choose the most appropriate one, and develop the most appropriate protection strategy for a particular scenario of data release. I will give examples of such definitions and discuss their advantages and disadvantages. I will then present several SDL methods suitable for the protection of microdata and discuss their effectiveness based on the proposed metrics of utility and disclosure risk.

IOL Session: Economic Classification – Industry/Activity vs. Product

Part 1 – Overview of Industry/Activity and Product Classification Systems in North America and Europe – Klas Blomqvist (Statistics Sweden) and Andrew Baer (U.S. Census Bureau)
Economic classification systems provide the common language used by statistical agencies to collect, tabulate, present, and analyze economic activity. They constitute a fundamental tool for the whole production process for official statistics. This presentation will provide an overview of and introduction to the international systems of industry and product classifications. It will try to answer the question “What is an industry classification and what is a product classification?” This is mainly done by explaining the conceptual basis for the classifications and how the classifications are structured. There will also be a focus on the distinctions between industry and product classification systems used in North America and Europe. We will describe the conceptual organization, structure, and uses of the North American Industry and Product Classification Systems (NAICS, NAPCS), and the Statistical Classifications of Economic Activities and Products By Activity in the European Community (NACE, CPA), and show how the international classifications are linked to the regional and national classifications (for Europe only).

Part 2 – Practical Applications of Economic Classification Systems for Establishment Surveys – Klas Blomqvist (Statistics Sweden) and Andrew Baer (U.S. Census Bureau)
This part of the presentation introduces best practices and guidelines for classification management and maintenance, how classifications can be structured and modelled, and how they can be accessed in databases. We will consider both practical and methodological aspects of conducting establishment surveys by product and by industry. On the practical side, this includes the availability of business register sampling frames, respondent burden, and data collection costs. On the methodological side, this includes a look at which survey types are best suited for measuring prices, output, and productivity.