Short Courses will be given on Monday, June 20, 2016.
The courses are whole-day courses. Lunch and morning and afternoon refreshments are included in the fee.
List of Short Courses
- Ger Snijkers (contact person), Statistics Netherlands:
- Gustav Haraldsen, Statistics Norway:
- Diane Willimack, U.S. Census Bureau:
Using a process-quality approach, derived from the Generic Statistical Business Process Model (GSBPM), this course provides an introductory overview of methodological considerations in designing and conducting business surveys. We will present an integrated approach to methods and procedures that optimize the design of a survey within constraints, covering all stages in the survey data collection process, but focusing on (1) data collection issues (questionnaire design, communication strategies for response improvement, data capture and editing); (2) related planning issues; and (3) survey management and monitoring. This course will not include topics in statistical methodology for business surveys, such as sampling, estimation, imputation, and analysis.
This course is appropriate for practitioners, researchers, methodologists, and survey managers in Statistical Institutes, universities, non-profit and for-profit survey organizations, international statistical organizations (e.g., OECD, IMF, UN, Eurostat), and Central Banks around the world. It is also relevant for users of survey data and statistics, such as policy makers, analysts, and researchers, to improve their knowledge and aid their interpretation of statistical outputs that form the basis for policy decisions or statistical analyses.
Participants will gain an understanding of the stages in the business survey process, particularly with regard to the following:
- Differences between business and social surveys, and their impact on the survey design and processes
- Steps in the survey production process and quality trade-off decisions
- Effects of the business context on survey participation and response processes
- The concept of response burden
- Quality dimensions to be considered in business surveys
- Procedures for planning a business survey, and how to consider the organizational context when making design decisions
- How to design an effective questionnaire and accompanying communication, taking the business context into account
- How to develop a communication plan with response improvement strategies, integrating contextual considerations
- Monitoring and managing data collection using paradata-based indicators
- Issues in capturing, coding and cleaning survey data
In advance of the class, participants will be invited to provide examples or cases from their own work, related to the course topics listed above, for classroom discussion. Selected cases will be offered for group discussion during the final session, to illustrate practical application of course content.
- Kennon Copeland, NORC:
This short course provides an overview of sampling and estimation for establishment surveys. Examples from existing surveys will be interspersed within the materials. Attendees who have some familiarity with sampling applications and theory will benefit most.
Sample design and estimation approaches for establishment surveys differ from those used for household surveys due to the nature of population distributions and the amount of information typically available from sampling frames. Establishment populations tend to be skewed with a small number of very large units which greatly influence total measures of interest and a large number of very small units which contribute little to total measures of interest. Data sources used in creating sampling frames for establishment surveys typically have measures of size, as well as other characteristics about individual establishments that can be used to develop efficient sample designs and estimation approaches. Following an overview, in which we will examine unique aspects of establishment distributions and characteristics of establishment surveys, including data from surveys of establishment surveys, the course will focus on two segments.
The first segment will address aspects related to establishment survey sample design, and will cover topics related to sampling frame development and maintenance, common sample designs, stratification, PPS sampling, and sample size determination and allocation. We will also discuss special topics such as use of auxiliary data, cutoff sampling, certainty selection, panel survey sample design, representing establishment births and deaths, and subsampling within establishments.
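As a rough illustration of one of the design topics above, and not part of the course materials, the following sketch implements systematic PPS (probability proportional to size) selection on the cumulated measures of size from a frame; the size values and seed are made-up example data.

```python
import random

def pps_systematic(sizes, n, seed=None):
    """Select n units with probability proportional to size using
    systematic sampling along the cumulated size axis.

    A unit whose size exceeds the sampling step would be hit more
    than once; in practice such units are treated as certainties,
    which this simple sketch does not do.
    """
    rng = random.Random(seed)
    total = sum(sizes)
    step = total / n
    start = rng.uniform(0, step)
    # Equally spaced selection points along the cumulated sizes
    points = [start + k * step for k in range(n)]
    sample, cum = [], 0.0
    it = iter(points)
    p = next(it)
    for i, s in enumerate(sizes):
        cum += s
        while p is not None and p <= cum:
            sample.append(i)
            p = next(it, None)
    return sample
```

With a fixed seed the draw is reproducible, which is convenient when documenting a sample selection.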
The second segment will focus on estimation for establishment surveys, and will cover topics related to design-based weights under common sample designs, unit nonresponse adjustment, calibration techniques, and variance estimation. We will also discuss special topics such as survey weighting processes, use of auxiliary information, panel survey estimation, model-based estimation, linearization and replication variance estimation approaches, and generalized variance functions.
The sampling strategy for estimating a set of parameters of interest is defined by the pair formed by the sampling design and the estimator. In particular, the design must entail a random selection scheme for the sample, in accordance with a given set of inclusion probabilities. In the case of the Stratified Simple Random Sampling (SSRS) design, the set of inclusion probabilities fixes the sample size in each stratum, thus determining the allocation of the sample across the strata.
The literature on sampling design has devoted much attention to sample allocation, seeking to satisfy certain criteria of optimality in terms of efficiency and budget constraints. Under SSRS, the optimal allocation for a univariate population is well known (Cochran, 1977). In the multivariate scenario, where more than one characteristic is to be measured on each sampled unit, the optimal allocations for the individual characteristics have little practical use, unless the characteristics under study are highly correlated. The criteria established to handle the problem's multidimensionality lead to allocations that lose precision compared with the individual optimal allocations. For this reason, these methods are sometimes referred to as compromise allocation methods (Khan et al., 2010). Several authors have discussed various criteria for obtaining a usable compromise allocation; among these, see Cochran (1977), Kokan and Khan (1967), Chromy (1987), Bethel (1989), Falorsi and Righi (2008) and Choudhry et al. (2012).
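For reference, the well-known univariate optimal allocation under SSRS (Cochran, 1977) is easy to compute; the following sketch, with made-up stratum sizes and standard deviations, is only an illustration of that classical result, not of the compromise methods discussed above.

```python
def neyman_allocation(n, N_h, S_h):
    """Univariate Neyman (optimal) allocation under SSRS:
    n_h is proportional to N_h * S_h, where N_h is the stratum
    population size and S_h the stratum standard deviation of the
    study variable. Real applications round the results and cap
    each n_h at N_h (possibly creating take-all strata)."""
    products = [N * S for N, S in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]
```

With two equally sized strata whose standard deviations differ by a factor of three, three quarters of the sample goes to the more variable stratum.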
The course proposes a general method for deriving the inclusion probabilities that minimize the cost of collecting data while disseminating estimates of pre-established accuracy for a multiplicity of both variables and domains of interest. Given the constraints on either the required accuracy of the estimates or the overall budget, the optimal solution is derived. The method is based on the theory presented in Falorsi and Righi (2015, to be published).
The method can be fruitfully used in several business survey contexts, such as: standard sampling designs (SSRS, PPS, balanced sampling designs (Deville and Tillé, 2005), etc.); non-standard sampling designs (multi-way stratification designs, incomplete stratification designs (Falorsi and Righi, 2015), indirect sampling designs (Lavallée, 2007), etc.); and direct and indirect domain estimation. Furthermore, the method can easily be applied to find the optimal inclusion probabilities in some special cases: (i) when the sampling frame is obtained by a probabilistic linkage procedure (with linkage errors); (ii) when the response homogeneity groups are known at the design stage.
During the course, SAS and/or R code implementing the method will be introduced, with practical applications on real business data.
- Mauro Scanu (contact person), ISTAT:
- Tiziana Tuoto, ISTAT:
The goal of record linkage procedures is to identify pairs of records that belong to the same unit. The course will introduce participants to a formal procedure for finding records that belong to the same unit, whether they come from different data sources or from the same one. These procedures are based on probabilistic rather than deterministic criteria: they rely first on the comparison of values from two different records on a field-by-field basis, and then on the probability of agreement between values given the true, unknown status of the pair of records, that is, whether or not the two records actually belong to the same unit. This course will tackle both standard and alternative approaches to probabilistic record linkage: the former is widely known as the Fellegi-Sunter theory (Fellegi and Sunter, Journal of the American Statistical Association, 1969), while the latter is represented by some Bayesian approaches.
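As an informal illustration of the Fellegi-Sunter idea (not drawn from the course materials), the composite match weight of a record pair combines per-field log-likelihood ratios; the m- and u-probabilities below are invented example values.

```python
import math

def fs_weight(agreements, m, u):
    """Fellegi-Sunter composite match weight for one record pair.

    For each comparison field j, add log2(m_j / u_j) if the field
    agrees and log2((1 - m_j) / (1 - u_j)) if it disagrees, where
    m_j and u_j are the probabilities of agreement given that the
    pair is a true match / a true non-match."""
    w = 0.0
    for agree, m_j, u_j in zip(agreements, m, u):
        if agree:
            w += math.log2(m_j / u_j)
        else:
            w += math.log2((1 - m_j) / (1 - u_j))
    return w
```

Pairs with weights above an upper threshold are declared links, below a lower threshold non-links, and the rest are sent to clerical review; in practice the m- and u-probabilities are estimated (e.g., by EM) rather than assumed.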
The course will also provide a critical view of the record linkage problem, highlighting the limits of the available statistical methodologies. The record linkage problem will also be compared with the general problem of data integration and the different techniques that can be considered (such as statistical matching, micro integration, big data...).
On the practical side, the record linkage techniques will be illustrated in all their phases, from the harmonization of the data files to quality evaluation and actual use for estimation. A software tool developed at the Italian National Institute of Statistics (ISTAT) will be illustrated so that participants can put the course content into practice. Examples will refer to the business area.
Attendees are expected to have a basic knowledge of statistical inference (likelihood functions, statistical tests, regression analysis). A basic knowledge of methods for the treatment of missing values and the analysis of partially observed data sets is a benefit.
- Pascal Ardilly, INSEE:
When we wish to estimate parameters (e.g., revenues, expenses, investments) defined on small subpopulations (also called "domains") using a sample survey, we face problems because classical methods often produce poor-quality estimates. This is a mechanical consequence of the relatively small sample sizes obtained for these subpopulations, which may be small areas (e.g., towns, villages) or subpopulations defined by rather fine industrial/production criteria (e.g., use of benzene in industrial processes, capital investments, research and development). To improve the accuracy of the estimators, it is then necessary to use auxiliary information obtained from complete data sources or from very large sample surveys. We then obtain a set of methods called "small area estimation methods" (or "small domain estimation methods"), which are structured according to the availability of auxiliary information, as summarized below.
A first approach is calibration on a set of auxiliary variables known for all the units in the small domain (or small area). By weighting the sampled units so as to reproduce the exact structures known at the small domain level, we can generally reduce the sampling error considerably, without depending on a particular model for the units.
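As a minimal sketch of this first approach (not part of the course), post-stratification is the simplest calibration: design weights are rescaled so that weighted counts match known group totals. The groups and totals below are invented example data.

```python
from collections import defaultdict

def poststratify(weights, groups, known_totals):
    """Post-stratification, the simplest calibration adjustment:
    rescale each unit's design weight so that the weighted count in
    every group matches the known population total for that group."""
    wsum = defaultdict(float)
    for w, g in zip(weights, groups):
        wsum[g] += w
    return [w * known_totals[g] / wsum[g]
            for w, g in zip(weights, groups)]
```

General calibration (e.g., raking on several margins) follows the same logic but adjusts weights iteratively to satisfy all constraints at once.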
A second class of methods relies on descriptive hypotheses about some components of the parameters that we try to estimate. These hypotheses assume that a local average behavior (in the small domain) corresponds to a global average behavior (in the complete population). For example, to estimate a small domain mean, we might divide the global population into particular subpopulations and then consider that, for each of these subpopulations, the true mean of the small domain is equal to the true mean of the complete subpopulation. We can also make similar hypotheses about regression coefficients, i.e., consider that the relationship between variables is identical in the small domain and in the complete population. This kind of hypothesis allows us to carry out a local estimation that uses all the sampled units, which helps stabilize the estimates and thus reduce the global sampling error. It should be noted that this introduces a bias whose magnitude depends on the relevance of the hypotheses.
The third approach, certainly the richest and the most common one, is based on stochastic modeling, where the modeled unit can be either the 'basic' individuals of the population or the small domain itself. Numerous competing models exist to produce small area estimates (e.g., linear mixed models, generalized linear mixed models, Bayesian techniques), but in every case the underlying principle is the following: using good explanatory auxiliary information that is available for the whole population, we estimate the model parameters using the global sample, and then we produce small area estimates using these estimated model parameters. Therefore, we benefit from greater stability of the local estimators, because they integrate all the sampled units, both in and outside the small domain. Naturally, the relevance of this approach depends on the validity of the model. Fortunately, there are quality indicators to measure this.
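To give a flavor of this third approach (a simplification, not the course's method), area-level mixed models typically yield a composite estimate that shrinks a noisy direct estimate toward a model-based synthetic value; the variances below are invented for the example.

```python
def composite_estimate(direct, synthetic, var_direct, var_model):
    """Composite (shrinkage) small-area estimate in the spirit of
    area-level mixed models: the direct estimate gets weight
    gamma = var_model / (var_model + var_direct), so the noisier
    the direct estimate, the more it is shrunk toward the
    model-based synthetic value."""
    gamma = var_model / (var_model + var_direct)
    return gamma * direct + (1 - gamma) * synthetic
```

When the direct estimate's variance equals the model variance, the composite is simply the midpoint of the two values.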
The course will summarize these approaches and will specify the benefits and the limits of each of them.
- Jean-François Beaumont, Statistics Canada:
- David Haziza (contact person), Université de Montréal:
When the distribution of variables in a survey is highly skewed, it is likely that the methodologist will face the problem of influential values. The latter are problematic because they tend to lead to very unstable estimators. This problem is especially acute in business surveys because the distribution of most economic variables is highly skewed. It is possible to guard against the impact of influential values at the design stage by selecting with certainty the potentially influential units. For example, in business surveys, it is customary to use a stratified simple random sampling without-replacement design containing one or more take-all strata that are usually composed of large units that are expected to be influential. Unfortunately, it is seldom possible to completely eliminate the problem of influential values at the design stage. For example, in a survey that collects dozens of variables of interest, it is not unlikely that some of them will have little or no correlation with the stratification variables, which may result in the presence of influential values. Another problem that leads to influential values in the sample is the presence of stratum jumpers, which arises when the stratification information collected in the field is different from the information in the sampling frame. Classical estimators exhibit (virtually) no bias, but they can be very unstable in the presence of influential values. Thus, it is desirable to develop robust estimation procedures whose mean square error is significantly smaller than that of classical estimators when there are influential values in the population but which do not suffer a serious loss of efficiency when there are none.
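One simple robust treatment consistent with this discussion, sketched here as an illustration rather than as the course's recommended procedure, is one-sided winsorization of the collected values; the cutoff and data are made up.

```python
def winsorize(values, cutoff):
    """One-sided winsorization: replace any value above the cutoff
    by the cutoff itself, accepting a small bias in exchange for a
    large variance reduction when influential values are present.
    Choosing the cutoff well is the hard part in practice."""
    return [min(v, cutoff) for v in values]
```

Applying the estimator to the winsorized values rather than the raw ones limits the impact a single influential unit can have on the estimate.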
During the course, we will attempt to answer the following questions:
- What is an influential value in the context of surveys?
- How to measure the influence of a unit?
- How to reduce the impact of units that have a large influence at the estimation stage?
At the end of the course, the participants should have a better understanding of:
- The differences between concepts related to robust estimation such as outliers, extreme values and influential values under different modes of inference;
- The methodological issues related to outliers, extreme values and influential values;
- How to treat this type of data.