Abstract Details

Activity Number: 28 - SPEED: A Mixture of Topics in Health, Computing, and Imaging
Type: Contributed
Date/Time: Sunday, July 29, 2018 : 2:00 PM to 3:50 PM
Sponsor: Section on Statistical Computing
Abstract #329473 Presentation
Title: Fast Generalised Linear Models in a Database
Author(s): Thomas Lumley*
Companies: University of Auckland
Keywords: database; generalized linear model; sampling; two-phase sampling; algorithms

It is well known that one step of Fisher scoring from a sqrt(n)-consistent starting point gives a fully efficient estimator for a generalised linear model. It is less well known that the starting point need only be better than fourth-root consistent, that is, its error need only shrink faster than n^(-1/4). In particular, the starting point can be the maximum likelihood estimator from a subsample of the data, as long as the subsample size is large compared to the square root of the dataset size. For a billion-row dataset, whose square root is about 32,000, a 100,000-row subsample is more than adequate. Furthermore, the Fisher scoring update can be computed in a single SQL aggregation query, allowing efficient computation either on a local computer or in the cloud. Depending on the types of queries that are efficient in a particular database, the initial subsample need not be a simple random sample: case-control or other two-phase sampling designs may increase computational efficiency.
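
The one-step idea can be illustrated with a minimal sketch. It is not the author's implementation. It assumes a logistic regression with an intercept and a single covariate, simulated data in an in-memory SQLite database, and a simple random subsample. The table names obs and obs_sub, the helper functions, and the registered SQL function py_exp are all hypothetical; a production database would normally supply its own EXP. For the logit link, Fisher scoring coincides with Newton-Raphson, so the same aggregation query serves both to fit the subsample and to take the single step over the full data.

```python
import math
import random
import sqlite3


def fisher_step_sql(conn, table, beta):
    """One Fisher scoring step for logistic regression y ~ 1 + x.

    The score vector and the information matrix are computed by a single
    SQL aggregation query over `table`. The table name is interpolated
    into the SQL, so it must be a trusted, hard-coded name.
    """
    b0, b1 = beta
    # p is the fitted probability; w = p(1 - p) is the Fisher weight.
    query = f"""
        SELECT SUM(y - p), SUM((y - p) * x),
               SUM(w), SUM(w * x), SUM(w * x * x)
        FROM (SELECT x, y, p, p * (1.0 - p) AS w
              FROM (SELECT x, y,
                           1.0 / (1.0 + py_exp(-(? + ? * x))) AS p
                    FROM {table}))
    """
    u0, u1, i00, i01, i11 = conn.execute(query, (b0, b1)).fetchone()
    # Solve the 2x2 system (information) * delta = score by Cramer's rule.
    det = i00 * i11 - i01 * i01
    d0 = (i11 * u0 - i01 * u1) / det
    d1 = (i00 * u1 - i01 * u0) / det
    return (b0 + d0, b1 + d1)


def fit_mle(conn, table, start=(0.0, 0.0), iters=10):
    """Iterate the SQL Fisher scoring step a fixed number of times."""
    beta = start
    for _ in range(iters):
        beta = fisher_step_sql(conn, table, beta)
    return beta


def make_demo_db(n=40_000, m=4_000, seed=1):
    """Simulate n rows (true coefficients -0.5 and 1.0) plus a subsample of m rows."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    # Assumption: register exp() from Python so the sketch does not rely
    # on SQLite having been built with its optional math functions.
    conn.create_function("py_exp", 1, math.exp)
    conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, x REAL, y INTEGER)")
    rows = []
    for i in range(1, n + 1):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * x)))
        rows.append((i, x, 1 if rng.random() < p else 0))
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
    # Simple random subsample; m is large compared to sqrt(n) = 200.
    conn.execute("CREATE TABLE sub (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO sub VALUES (?)",
                     [(i,) for i in rng.sample(range(1, n + 1), m)])
    conn.execute("CREATE TABLE obs_sub AS "
                 "SELECT o.x, o.y FROM obs o JOIN sub s ON o.id = s.id")
    return conn


conn = make_demo_db()
beta_start = fit_mle(conn, "obs_sub")                      # MLE on the subsample
beta_one_step = fisher_step_sql(conn, "obs", beta_start)   # one pass over all rows
beta_full = fit_mle(conn, "obs")                           # reference: iterate fully
print("subsample MLE:", beta_start)
print("one-step     :", beta_one_step)
print("full MLE     :", beta_full)
```

In this sketch the full-data fit is computed only as a reference for comparison; the point of the approach is that the single aggregation pass over obs replaces those repeated passes. A case-control or other two-phase subsample would need a weighted or otherwise adjusted starting fit, which is not shown here.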

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program