Abstract Details

Activity Number: 272 - Statistical Learning for Complex and High-Dimensional Data
Type: Invited
Date/Time: Tuesday, July 30, 2019, 8:30 AM to 10:20 AM
Sponsor: IMS
Abstract #300126
Title: How to Deal with Big Data? Understanding Large-Scale Distributed Regression
Author(s): Edgar Dobriban* and Yue Sheng
Companies: University of Pennsylvania and University of Pennsylvania
Keywords: linear regression; distributed computing; machine learning; random matrix theory
Abstract:

Modern massive datasets pose an enormous computational burden to practitioners. Distributed computation has emerged as a universal approach to ease this burden: datasets are partitioned across machines, which compute locally and communicate short messages. Distributed data also arises for privacy reasons, for instance in medicine. It is therefore important to study how to do statistical inference and machine learning in a distributed setting. In this talk, we present results on one-step parameter averaging in statistical linear models under data parallelism. We run linear regression on each machine and take a weighted average of the parameters. How much do we lose compared to running linear regression on the full data? We study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Second, different problems are affected differently by the distributed framework: estimation error and confidence interval length increase substantially, while prediction error increases much less.
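To make the averaging procedure concrete, here is a minimal sketch in Python (NumPy): rows are split across machines, each machine fits ordinary least squares locally, and the coefficient estimates are combined by a weighted average. The function names and the equal weights below are illustrative placeholders; the choice of weights and the exact performance loss are the subject of the talk and are not reproduced here.

import numpy as np

def local_ols(X, y):
    # Ordinary least squares fit on one machine's local data.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def one_step_average(X, y, n_machines, weights=None):
    # Split the rows across machines, fit OLS locally, and return
    # a weighted average of the local coefficient vectors.
    X_parts = np.array_split(X, n_machines)
    y_parts = np.array_split(y, n_machines)
    betas = [local_ols(Xi, yi) for Xi, yi in zip(X_parts, y_parts)]
    if weights is None:
        # Naive equal weights, used only for illustration; the talk
        # concerns how to weight optimally.
        weights = np.full(n_machines, 1.0 / n_machines)
    return sum(w * b for w, b in zip(weights, betas))

# Toy comparison against the full-data fit.
rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

beta_full = local_ols(X, y)
beta_avg = one_step_average(X, y, n_machines=4)
print("full-data error:", np.linalg.norm(beta_full - beta_true))
print("averaged error: ", np.linalg.norm(beta_avg - beta_true))

On such toy data the averaged estimator is typically close to, but somewhat worse than, the full-data fit; quantifying that gap exactly in high dimensions is what the abstract describes.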


Authors who are presenting talks have a * after their name.
