
Legend: Boston Convention & Exhibition Center = CC, Westin Boston Waterfront = W, Seaport Boston Hotel = S

Activity Details


CE_01C Sun, 8/3/2014, 8:30 AM - 5:00 PM CC-162AB
Enhancing Big Data Projects Through Statistical Engineering — Professional Development Continuing Education Course
ASA
Massive data sets, or Big Data, have become much more common recently due to improved technology for data acquisition, storage, and processing. With the advent of Big Data, several disciplines, not only statistics, have developed new tools to analyze such data, including classification and regression trees (CART), neural nets, methods based on bootstrapping such as random forests, and various clustering algorithms. These tools make high-powered statistical methods available not only to professional statisticians but also to casual users. As with any tools, the results to be expected are proportional to the knowledge and skill of the user, as well as to the quality of the data. Unfortunately, much of the data mining, machine learning, and Big Data literature may give casual users the impression that if one has a powerful enough algorithm and a lot of data, good models and good results are guaranteed at the push of a button. Conversely, if one applies sound principles of statistical engineering (Anderson-Cook and Lu 2012; Snee et al. 2013) to the Big Data problem, several potential pitfalls become obvious. We consider four important principles of statistical engineering that, in our opinion, have been either overlooked or underemphasized in the Big Data literature:
* The need for a clear strategy to guide the analysis of Big Data sets and the solution of the associated problems of interest.
* The importance of using sequential approaches to scientific investigation, as opposed to the "one-shot study" so popular in the algorithms literature.
* The need for empirical modeling to be guided by domain knowledge (subject-matter theory), including interpretation of data within the context of the processes and measurement systems that generated them.
* The inaccuracy of the typical unstated assumption that all data are created equal, and therefore that data quantity is more important than data quality.
After an introduction to the Big Data problem and a brief overview of the newer tools now available to analysts, we will discuss the problems that can arise when these statistical engineering fundamentals are ignored, even with well-designed and powerful analytic tools. We will then share our thoughts on how to improve Big Data projects by incorporating these principles into the overall project. Specifically, we plan to review, and discuss how to apply, the building blocks of statistical engineering that we feel will help avoid major errors in Big Data projects. Next, we discuss how the major phases of typical statistical engineering projects can help provide a strategic approach to Big Data problems. During breakout exercises, attendees will develop approaches for each phase of statistical engineering within the context of Big Data projects.
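To make the "push of a button" caution concrete, here is a minimal sketch (assuming Python with numpy and scikit-learn; the data and names are illustrative and not part of the course) that fits a random forest to pure noise. Training accuracy looks near-perfect even though the labels carry no signal, while an honest out-of-sample estimate stays near chance:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative sketch only: 200 observations, 50 pure-noise features,
# with labels generated independently of the features (no real signal).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X, y)

# Resubstitution (training) accuracy looks excellent despite no signal:
print("training accuracy:", model.score(X, y))

# Cross-validated accuracy hovers near chance (about 0.5):
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

The gap between the two numbers is precisely the kind of pitfall that principles such as sequential investigation, domain knowledge, and attention to data quality are meant to surface.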
Instructor(s): Ronald Snee, Snee Associates; Richard D. De Veaux, Williams College; Roger Hoerl, Union College




For information, contact jsm@amstat.org or phone (888) 231-3473.

If you have questions about the Professional Development program, please contact the Education Department.

The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.

ASA Meetings Department  •  732 North Washington Street, Alexandria, VA 22314  •  (703) 684-1221  •  meetings@amstat.org
Copyright © American Statistical Association.