Abstract:
|
It is well known that one step of Fisher scoring from a sqrt(n)-consistent starting point gives a fully efficient estimator for a generalised linear model. It is less well known that the starting point need only be better than fourth-root consistent. In particular, the starting point can be the maximum likelihood estimator from a subsample of the data, as long as the subsample size is large compared to the square root of the full dataset size. For a billion-row dataset, a 100,000-row subsample is more than adequate. Furthermore, the Fisher scoring update can be computed in a single SQL aggregation query, allowing efficient computation either on a local computer or in the cloud. Depending on the types of queries that are efficient in a particular database, the initial subsample need not be a simple random sample: case-control or other two-phase sampling designs may increase computational efficiency.
|
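As a rough sketch of what "a single SQL aggregation query" can look like (an illustration under stated assumptions, not code from the paper): for a logistic regression with an intercept and two covariates, the one-step update adds I(beta)^{-1} U(beta) to the subsample estimate beta, and every entry of the full-data score vector U(beta) and Fisher information matrix I(beta) is a sum over rows. The table name big_table(y, x1, x2) and the parameter placeholders :b0, :b1, :b2 (the coefficients fitted on the subsample) are assumptions made for this example.

```sql
-- Score and Fisher information for logistic regression, evaluated at the
-- subsample estimate (:b0, :b1, :b2), in one aggregation pass over the data.
-- big_table and the placeholders are hypothetical names for illustration.
SELECT
  SUM(y - mu)                  AS u0,   -- score: intercept
  SUM((y - mu) * x1)           AS u1,   -- score: x1
  SUM((y - mu) * x2)           AS u2,   -- score: x2
  SUM(mu * (1 - mu))           AS i00,  -- information (symmetric 3x3)
  SUM(mu * (1 - mu) * x1)      AS i01,
  SUM(mu * (1 - mu) * x2)      AS i02,
  SUM(mu * (1 - mu) * x1 * x1) AS i11,
  SUM(mu * (1 - mu) * x1 * x2) AS i12,
  SUM(mu * (1 - mu) * x2 * x2) AS i22
FROM (
  SELECT y, x1, x2,
         1.0 / (1.0 + EXP(-(:b0 + :b1 * x1 + :b2 * x2))) AS mu
  FROM big_table
) AS t;
```

The client then assembles the 3x3 information matrix and 3-vector score from the returned sums, solves the small linear system, and adds the correction to the subsample coefficients; the one-step estimator requires no further passes over the full data.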