Student Contest

ICES VI is sponsoring a student contest with awards given in two tracks: non-response treatment and analysis/visualization of economic statistical data. The student contest will create interest and innovation in the establishment survey field by inspiring students and the faculty they work with to create interesting and challenging applications that test their technical skill and creativity.

The winner of each contest track will receive support to attend the conference and present their work at a special session.


Student Contest General Information

As the sixth in the series of international conferences on establishment statistics, ICES VI is designed to look at key issues and challenges pertaining to establishment statistics.

For this conference, we are introducing a student contest with awards given in two tracks: non-response treatment and analysis/visualization of economic statistical data. The winners will present their research at ICES VI.

Winners will receive $1,500 (USD), to be used toward registration and/or travel expenses to attend and present their paper.

Eligibility

  • Participants must be current students.
  • Students will perform their research independently or with a group of up to five students. However, support to attend the conference and present the work will only be awarded to one author from each track.
  • Students are to carry out this assignment autonomously. Contributions from faculty advisers, if any, and the advisers’ expertise should be made clear in the report.
  • The paper must be presented at the conference by one of the student participants.

Submission Deadline and Details

  • Submissions should be a report summarizing the methods and results, limited to 6,000 words, excluding tables, figures, references, and appendices.
  • The report must be submitted in English, the official conference language.
  • Supporting materials (e.g., figures, tables, appendices) are limited to 10 pages.
  • The software used should be R or SAS, and programming code must be submitted in an appendix (for ease of submission, programming code may be submitted as one file separate from the report).
  • You must submit your report by February 15, 2020.
  • The report (and programming code) must be sent to ices@amstat.org with the subject line “Student Contest: Nonresponse Track” or “Student Contest: Data Analysis/Visualization Track.”

Winners will be notified on March 31, 2020. You may not submit the paper to any other 2020 student/young investigator award competition until this decision is made public. If you have questions concerning these contests, you can contact ices@amstat.org. Use the subject line “Student Contest: Inquiry.”

We encourage the contest winners to produce a paper based on the report that could be submitted to a publication from the ICES VI conference. It should be noted that submission does not guarantee publication, as the report will go through the normal review procedures.

Judging Submissions

The reports will be reviewed by a panel of international experts in establishment survey design and analysis chosen by the ICES VI Program Committee.

Student Contest Track 1 – Nonresponse Treatment

This contest track challenges participants to deal with missing data in a survey to minimize its impact on data quality. This real-life problem is not limited to surveys and must be addressed whether the data is collected specifically for a survey, obtained from administrative sources, or harvested from big data sources. The contest requires students to treat the missing data to produce valid and precise population estimates for specific variables and domains from a survey on small businesses in the United States. The title of the survey is the Survey of Business Owners (SBO), and it was conducted by the US Census Bureau to gather information about the 2007 reference year.

To produce valid population estimates, the sample design and probabilities of selection must be properly used in the estimation procedures.

Read general information about sampling and estimation methods.

Read about the specific sample design and estimation procedures used in the Survey of Business Owners.

For this contest, a data set has been created from the complete survey data set, where missing values have been introduced according to probabilities specific to each unit. In practice, these probabilities are unknown and the nonresponding units must be accounted for properly to ensure the inferences from the data are valid.

Two approaches are commonly used to treat missing data in statistical data, namely imputation and reweighting. In reweighting, new estimation weights are produced for the responding units; in imputation, the missing values are estimated and filled in. Each approach offers flexibility to use auxiliary information available at the unit level, as well as different models for the probability of response or for the values of a variable. Any assumptions underlying models used in reweighting or imputation should be clearly spelled out and verified to ensure they can be applied to the data.
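To make the distinction concrete, the following is a minimal sketch of the two approaches on a made-up sample. It assumes nonresponse is ignorable within adjustment classes (here, hypothetical size bands); the data values, the class labels, and the function names are illustrative assumptions, not part of the contest materials. Python is used purely for illustration; contest code must be written in R or SAS.

```python
from collections import defaultdict

# Toy sample (all values made up): design weight w, adjustment class cls,
# and employment y, where None marks a nonrespondent.
sample = [
    {"w": 10.0, "cls": "small", "y": 4},
    {"w": 10.0, "cls": "small", "y": None},
    {"w": 10.0, "cls": "small", "y": 6},
    {"w": 5.0, "cls": "large", "y": 40},
    {"w": 5.0, "cls": "large", "y": None},
    {"w": 5.0, "cls": "large", "y": 60},
]


def class_sums(rows):
    """Per class: total weight, respondent weight, weighted respondent sum of y."""
    sums = defaultdict(lambda: {"w_all": 0.0, "w_resp": 0.0, "wy_resp": 0.0})
    for r in rows:
        s = sums[r["cls"]]
        s["w_all"] += r["w"]
        if r["y"] is not None:
            s["w_resp"] += r["w"]
            s["wy_resp"] += r["w"] * r["y"]
    return sums


def reweighted_total(rows):
    """Reweighting: inflate respondent weights by w_all / w_resp within each class."""
    sums = class_sums(rows)
    total = 0.0
    for r in rows:
        if r["y"] is not None:
            s = sums[r["cls"]]
            total += r["w"] * (s["w_all"] / s["w_resp"]) * r["y"]
    return total


def imputed_total(rows):
    """Imputation: fill each missing y with the weighted respondent mean of its class."""
    sums = class_sums(rows)
    total = 0.0
    for r in rows:
        s = sums[r["cls"]]
        y = r["y"] if r["y"] is not None else s["wy_resp"] / s["w_resp"]
        total += r["w"] * y
    return total


print(reweighted_total(sample), imputed_total(sample))
```

With these particular choices (weighting-class adjustment and weighted class-mean imputation) the two estimated totals coincide. Other response models or imputation models generally give different totals and different variances, which is why the modeling assumptions need to be stated and checked.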

For a discussion about nonresponse mechanisms, as well as information about reweighting and imputation, refer to Särndal, C.-E., and S. Lundström. 2005. Estimation in surveys with nonresponse. New York: Wiley.

The focus of this contest is to develop a methodology to produce valid and efficient estimators for the population in the presence of nonresponse. While several basic references are provided in this text, participants are encouraged to research and find information about recent developments in the field. Participants are encouraged to use state-of-the-art methods in treatment of missing data and explore data science techniques—for use in imputation and reweighting in particular. Furthermore, contest participants are encouraged to look beyond traditional statistics and survey methodology practices and consider theories and techniques from disciplines such as behavioral economics, marketing, social psychology, organizational psychology, decision-making theories, (mass) communication sciences, theories on influence and resistance, and motivation theories.

Participants will submit a report including the following three main components:

  1. An evaluation of the pattern of missing data, including formal models used to describe it and plausible explanations for the cause of missing data
  2. A detailed description of the proposed methodology for producing statistical estimates, including any models or assumptions related to the nonresponse treatment
  3. An evaluation of the application of the methods to address missing data in terms of their impact on the quality of population-level estimates

Note that all programming code (SAS, R, or both) used to produce estimates and develop evaluation statistics must also be provided.

Students may work individually, but we recommend working in small groups (of up to five students) to carry out this assignment. In practice, survey researchers often work in project groups to design and conduct a survey. Students are to carry out this assignment autonomously; contributions from faculty advisers, if any, should be made clear in the report.

Instructions
The data sets we have provided include only the domains of interest and are incomplete due to unit and item nonresponse. Your challenge is to develop a methodology to produce estimates of population totals for total employment (EMPLOYMENT_NOISY) from businesses primarily owned by males (SEX1=M) and females (SEX1=F) in the states of Louisiana (FIPST=22) and Texas (FIPST=48).
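As a starting point, the domain totals can be sketched as weighted sums within each state-by-sex cell. The example below runs on made-up rows and is not a recommended method: it simply drops records with missing values, which is the naive respondent-only estimate that a nonresponse treatment is meant to improve on. The column name WEIGHT is a hypothetical placeholder; the actual weight variable should be taken from the user guide. Python is used for illustration only, since submissions require R or SAS.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the layout described above. WEIGHT is a hypothetical
# column name standing in for the survey weight variable.
toy_csv = """FIPST,SEX1,WEIGHT,EMPLOYMENT_NOISY
22,M,10,5
22,F,10,3
22,M,20,
48,M,5,12
48,F,5,8
48,F,10,2
"""


def domain_totals(rows, weight_col="WEIGHT", y_col="EMPLOYMENT_NOISY"):
    """Weighted employment total per (FIPST, SEX1) domain, respondents only."""
    totals = defaultdict(float)
    for row in rows:
        if row[y_col] == "" or row["SEX1"] == "":
            continue  # naive: missing records are ignored, not treated
        key = (row["FIPST"], row["SEX1"])
        totals[key] += float(row[weight_col]) * float(row[y_col])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(toy_csv)))
totals = domain_totals(rows)
for domain in sorted(totals):
    print(domain, totals[domain])
```

In the toy data the third row has no employment value, so its weight of 20 contributes nothing to the Louisiana male total; this is the kind of shortfall that reweighting or imputation has to correct.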

The survey estimates obtained from the publicly available complete data set are presented in the table below. You are not expected to reproduce these estimates exactly; you should assume you do not know these statistics, which are provided for validation purposes only. In the data set provided for the competition, the employment (EMPLOYMENT_NOISY) and sex (SEX1-SEX4) variables may be subject to both unit nonresponse and item nonresponse. Nonresponding variables are represented by a missing value for the unit. Information such as state, weights, and payroll (PAYROLL_NOISY) is assumed to be available from the frame and thus not subject to nonresponse.
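A first look at the missingness pattern could tabulate records by type of nonresponse within each state. The sketch below rests on a labeled assumption that is not stated in the contest text: a record is treated as unit nonresponse when both EMPLOYMENT_NOISY and SEX1 are missing, and as item nonresponse when only one of them is missing. The records are made up, and Python is used for illustration only (contest code must be in R or SAS).

```python
from collections import Counter

# Assumption (not from the contest text): "unit" nonresponse means both
# study variables are missing; "item" nonresponse means only one is missing.
STUDY_VARS = ("EMPLOYMENT_NOISY", "SEX1")


def classify(record, study_vars=STUDY_VARS):
    """Return 'complete', 'item', or 'unit' for one record."""
    n_missing = sum(1 for v in study_vars if record.get(v) in (None, ""))
    if n_missing == 0:
        return "complete"
    if n_missing == len(study_vars):
        return "unit"
    return "item"


def pattern_by_state(records):
    """Count records per (FIPST, nonresponse type)."""
    counts = Counter()
    for r in records:
        counts[(r["FIPST"], classify(r))] += 1
    return counts


# Made-up records for illustration.
toy = [
    {"FIPST": "22", "EMPLOYMENT_NOISY": "5", "SEX1": "M"},
    {"FIPST": "22", "EMPLOYMENT_NOISY": "", "SEX1": ""},
    {"FIPST": "48", "EMPLOYMENT_NOISY": "7", "SEX1": ""},
    {"FIPST": "48", "EMPLOYMENT_NOISY": "2", "SEX1": "F"},
]
pattern = pattern_by_state(toy)
print(pattern)
```

Cross-tabulating the same classification against frame variables such as payroll is one way to start judging whether the missingness depends on observed information.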

  1. Download the data sets.
    The data is available at the following link in CSV format:

    https://ww2.amstat.org/meetings/ices/2020/studentcontest/track1missingdatacontestdata.csv

    In addition, documentation on the format of the file (variable names, etc.) is available from the SBO PUMS user guide:

    https://www2.census.gov/econ/sbo/07/pums/2007_sbo_pums_users_guide.pdf

  2. Analyze the data sets, produce population-level estimates, and prepare your report.

  3. Send your report and the programming code file to ices@amstat.org with the subject line “Student Contest: Nonresponse Treatment Track” by February 15, 2020.

Data
Based on the SBO, the estimated population totals for Employment within the relevant domains are as follows (rounded to the nearest unit):


                  Male        Female
Louisiana Total   759,298     133,305
Texas Total       3,586,251   744,181

Note that these values are provided for validation purposes only. Submissions will be evaluated based on the criteria identified below and not the ability to reproduce these estimates exactly.
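One simple validation check is the relative difference between a submitted estimate and the published total for each domain. The sketch below hard-codes the four benchmark totals from the table; the function name and the example estimate are illustrative assumptions. Python is used for illustration only, as contest code must be in R or SAS.

```python
# Published SBO totals for employment, copied from the table above.
BENCHMARKS = {
    ("Louisiana", "Male"): 759298,
    ("Louisiana", "Female"): 133305,
    ("Texas", "Male"): 3586251,
    ("Texas", "Female"): 744181,
}


def relative_difference(estimates, benchmarks=BENCHMARKS):
    """(estimate - benchmark) / benchmark for each domain in estimates."""
    return {
        domain: (value - benchmarks[domain]) / benchmarks[domain]
        for domain, value in estimates.items()
    }


# Hypothetical estimate for one domain, chosen to be 10% above the benchmark.
example = relative_difference({("Louisiana", "Female"): 146635.5})
print(example)
```

A small relative difference on these four totals does not by itself show that a method is sound, which is consistent with the note that submissions are judged on the listed criteria rather than on exact reproduction.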

Report
Your submission will be a written report that could serve as the basis for a scientific paper to be published in a journal. The report is limited to 6,000 words (excluding tables, figures, and references) and must do the following:

  • Describe how the survey design is taken into consideration when producing population estimates
  • Present an evaluation of the missingness, including formal models used to describe patterns or mechanisms, validation or sensitivity analysis procedures, and plausible explanations for potential causes of missing data in establishment data
  • Detail how the missing data was treated in estimating the population totals and justify the choice of methods relative to other options
  • Provide an evaluation of the application of the methods used in terms of their accuracy and impact on the quality of population-level estimates
  • Contain a list with the names, scientific disciplines, and levels of all student participants (bachelor/undergraduate, master’s/graduate, PhD/graduate), as well as, in the case of their involvement, the names, expertise, role, and contributions of faculty advisers
  • Provide the programming code used to perform imputation and compute estimates and variance estimates. SAS and R are acceptable programming languages. The code must be submitted separately from the report and does not count toward the word or page limits for the report. Failure to provide the programming code will result in disqualification.

Criteria for Judging Submissions

  1. Well-founded theoretical basis for the proposed method(s), as well as compliance with general scientific standards
  2. Clarity of explanation of the proposed method(s) and implementation approach
  3. Originality, as well as effectiveness at achieving good survey results in terms of reproducing population totals (as closely as possible) and minimizing the variance of the estimators
  4. Method of validating proposed model(s) to describe and treat missing data. Specifically, address how the procedures achieve the desired properties listed in the instructions (and other objectives of importance to the authors)
  5. Ideas for future improvements or other applications

Suggested References
General references on imputation and nonresponse were provided above, but the references below provide specific information about individual methods or concepts. These references are not meant to be a comprehensive list, however, and participants are encouraged to conduct independent research and literature reviews.

Imputation in General
Kalton, G., and D. Kasprzyk. 1986. The treatment of missing survey data. Survey Methodology 12:1–16.

Sande, I.G. 1982. Imputation in surveys: Coping with reality. The American Statistician 36(3):145–152.

Individual Imputation Methods
Andridge, R.R., and R.J.A. Little. 2010. A review of hot deck imputation for survey non-response. International Statistical Review 78(1):40–64.

Kim, J.K., and W. Fuller. 2004. Fractional hot deck imputation. Biometrika 91:559–578.

Chauvet, G., J.-C. Deville, and D. Haziza. 2011. On balanced random imputation in surveys. Biometrika 98(2):459–471.

Multiple Imputation
Zhang, P. 2003. Multiple imputation: Theory and methods. International Statistical Review 71(3):581–592.

https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/

https://stats.idre.ucla.edu/stata/faq/how-can-i-perform-multiple-imputation-on-longitudinal-data-using-ice/

Analysis and Variance Estimation with Imputed Data
Rao, J.N.K. 1996. On variance estimation with imputed survey data. Journal of the American Statistical Association 91(434):499–506.

Little, R.J.A., and D.B. Rubin. 2002. Statistical analysis with missing data (2nd Ed). New York: Wiley.

Särndal, C.-E., and S. Lundström. 2005. Estimation in surveys with nonresponse. New York: Wiley.

Shao, J., and P. Steel. 1999. Variance estimation for survey data with composite imputation and nonnegligible sampling fractions. Journal of the American Statistical Association 94:254–265.

Student Contest Track 2 – Data Analysis and Visualization

This contest track challenges participants to answer a research question using appropriate analysis and visualizations with data from a large establishment survey. Students will write a report explaining the methods used and presenting the results with an emphasis on visualizations.

Survey Data: 2007 Survey of Business Owners (SBO)
The Survey of Business Owners (SBO) is conducted every five years by the US Census Bureau. It is the only regularly collected source of information about US businesses and business owners including gender, race, ethnicity, and veteran status. Read details about the SBO.

The 2007 SBO is a large survey with a complex sampling design that collects data from all 50 states. The full SBO data set contains more than 2 million records; for the purposes of this student contest, data from only four states (California, Georgia, Massachusetts, and Ohio) are provided, producing a data set with 391,093 records. In addition to using only a subset of states for analysis, only a subset of variables is provided.

To produce valid population-level estimates, the sample design and unequal probabilities of selection must be properly used in estimation procedures. Read general information about sampling and estimation methods.
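The effect of ignoring unequal selection probabilities can be seen in a small sketch that compares an unweighted proportion with a weighted one. The records, the column name WEIGHT, and the indicator PRIMARY_INCOME are hypothetical and stand in for whatever variables the Data Dictionary defines. Python is used for illustration only; contest code must be in R or SAS.

```python
# Made-up records: WEIGHT and PRIMARY_INCOME are hypothetical column names.
# PRIMARY_INCOME is 1 if the business is the owner's primary income, else 0.
records = [
    {"WEIGHT": 1.0, "PRIMARY_INCOME": 1},
    {"WEIGHT": 1.0, "PRIMARY_INCOME": 1},
    {"WEIGHT": 8.0, "PRIMARY_INCOME": 0},
]


def unweighted_proportion(rows, flag="PRIMARY_INCOME"):
    """Share of sample records with flag == 1 (ignores the design)."""
    return sum(r[flag] for r in rows) / len(rows)


def weighted_proportion(rows, flag="PRIMARY_INCOME", weight="WEIGHT"):
    """Ratio of weighted counts: estimated population share with flag == 1."""
    numerator = sum(r[weight] for r in rows if r[flag] == 1)
    denominator = sum(r[weight] for r in rows)
    return numerator / denominator


print(unweighted_proportion(records), weighted_proportion(records))
```

Here the record with the large weight represents many population units, so the weighted share is much lower than the sample share. Standard errors also depend on the design and are not covered by this sketch; the survey procedures in R and SAS handle them.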

The specific sample design and estimation procedures used in the 2007 SBO are described in the 2007 SBO User Guide.

Note that, as stated in the user guide, the data contains “rounded, noise-infused estimates of receipts, payroll, and employment” for confidentiality reasons and disclosure avoidance. No special treatment is needed for these “noisy” variables (i.e., treat them as if they were not noise-infused).

The data is available at the following link in CSV format:
https://ww2.amstat.org/meetings/ices/2020/studentcontest/track2sbo.csv

Descriptions of variables and codes for categorical variables are in the Data Dictionary. Note that some variables have been modified for ease of use in the contest (specifically, some character variables have been modified to be numeric). Thus, the Data Dictionary provided should be used as the source of information (instead of referring to variable information at the end of the 2007 SBO User Guide).

Background and Research Question
A business is not the primary source of income for all business owners. “Running a side small business while working a full-time job is relatively common. Some people operate side businesses to supplement their incomes. Others simply keep their jobs for financial stability while trying to launch full-fledged companies.” https://smallbusinessbc.ca/article/how-run-your-business-while-working-a-job/

Logically, the hours dedicated to the business are likely to be smaller if the business is not the primary source of income than if it is. The number of hours all owners of a business combined dedicate to the business could vary across domains.

Research Question: How does the extent to which the business is not the primary source of income for a business owner vary by owner’s sex, ethnicity, race, and veteran status and by business characteristics (e.g., size, sector, location)?

Using data from four states in the 2007 SBO, students should answer this question using appropriate analysis methods and creative data visualizations. The final product will be a report (see below for details such as length) describing the methods and presenting results.

Criteria for Judging Submissions
Entries will be judged based on the following criteria:

  • Do the data analyses and visualizations address/answer the research question?
  • Is the data analysis methodology appropriate (for the complex survey design), and is it explained clearly?
  • Is the data visualized in a creative way that provides insight into the research question?
  • Are the data visualizations explained and interpreted clearly? Are they appropriate given the data source (i.e., a complex sample survey), and are they effective at conveying information?
  • Are limitations of the analyses explained in view of limitations of the data source and challenges encountered?

Submission Instructions

  • Submissions should be a report summarizing the methods and results, limited to 6,000 words, excluding tables, figures, references, and appendices.
  • The report must be submitted in English, the official conference language.
  • Supporting materials (e.g., figures, tables, appendices) are limited to 10 pages.
  • The software used should be R or SAS, and programming code must be submitted in an appendix (for ease of submission, programming code may be submitted as one file separate from the report).
  • You must submit your report by February 15, 2020.
  • The report (and programming code) must be sent to ices@amstat.org with the subject line “Student Contest: Data Analysis/Visualization Track.”

Data Tips

  • Each business in the 2007 SBO can report up to four owners. Thus, each owner-level characteristic is described by a set of four variables. For example, ethnicity of the business owner(s) is in the variables {ETH1, ETH2, ETH3, ETH4} for the first through fourth owner.
  • Some of the data in the file comes from administrative records and is thus relatively complete. For example, total employment (number of employees), payroll, and receipts are nonmissing for all units. However, some business owner characteristics collected on the survey are incomplete (i.e., there is missing data). The focus of this contest track is not on handling missing data, and thus sophisticated methods for handling any missing data are not required (though they can, of course, be used). However, as noted in the last judging criterion, limitations of the analyses due to missing data can and should be addressed in the report.
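Because each owner-level characteristic is spread over four variables, owner-level analyses are often easier after reshaping each business record into one row per owner. The sketch below shows one way to do this for the ETH1 through ETH4 layout described in the first tip. The function name, the ID field, and the category codes in the example are made up, and treating an all-blank slot as "no such owner" is an assumption, since a blank could also be missing data. Python is used for illustration only; contest code must be in R or SAS.

```python
def owners_long(business, prefixes=("ETH",), max_owners=4):
    """Turn one business record into a list of per-owner rows.

    For owner k, the value of each characteristic is read from the
    variable named prefix + k (for example ETH1, ETH2, ...).
    Slots where every requested characteristic is blank are skipped.
    """
    rows = []
    for k in range(1, max_owners + 1):
        values = {p: business.get(f"{p}{k}", "") for p in prefixes}
        if all(v in ("", None) for v in values.values()):
            continue  # assumption: blank slot means no k-th owner reported
        rows.append({"owner": k, **values})
    return rows


# Made-up record; the codes "H" and "N" are placeholders, not SBO codes.
business = {"ID": "A1", "ETH1": "H", "ETH2": "N", "ETH3": "", "ETH4": ""}
long_rows = owners_long(business)
print(long_rows)
```

If owner-level rows are used for estimation, the business-level weight still has to be carried along with each row, and the unit of analysis (owner versus business) should be stated in the report.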

Key Dates

  • January 2, 2019
    Invited Session Proposal Submission Opens
  • June 13, 2019
    Invited Session Proposal Submission Closes
  • July 16, 2019
    Topic Contributed Session Proposal Submission Opens
  • August 15, 2019
    Topic Contributed Session Proposal Submission Closes
  • August 20, 2019
    Software Demonstration Proposal Submission Opens
  • October 16, 2019
    Contributed Abstract Submission Opens
  • December 3, 2019
    Contributed Abstract Submission Closes
  • December 12, 2019
    Software Demonstration Proposal Submission Closes
  • February 11, 2020
    Early Registration and Housing Opens
  • April 15, 2020
    Speaker Registration Deadline
  • May 7, 2020
    Early Registration Deadline
  • May 8, 2020
    Regular Registration (increased fees apply)
  • May 22, 2020
    Housing Deadline, 5:00 p.m. ET
  • June 15, 2020 – June 18, 2020
    ICES VI in New Orleans, LA