Abstract:
|
It is well known that one step of Fisher scoring from a sqrt(n)-consistent starting point gives a fully efficient estimator for a generalised linear model. It is less well known that the starting point need only be better than fourth-root consistent. In particular, the starting point can be the maximum likelihood estimator from a subsample of the data, as long as the subsample size is large compared to the square root of the full dataset size. For a billion-row dataset, a 100,000-row subsample is more than adequate. Furthermore, the Fisher scoring update can be computed in a single SQL aggregation query, allowing efficient computation either on a local computer or in the cloud. Depending on the types of queries that are efficient in a particular database, the initial subsample need not be a simple random sample: case-control or other two-phase sampling designs may increase computational efficiency.
|
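As a rough sketch of what "a single SQL aggregation query" can look like (an illustration under stated assumptions, not code from the paper): for a logistic regression with an intercept and two covariates, the one-step update adds I(beta)^{-1} U(beta) to the subsample estimate beta, and every entry of the full-data score vector U(beta) and Fisher information matrix I(beta) is a sum over rows. The table name big_table(y, x1, x2) and the parameter placeholders :b0, :b1, :b2 (the coefficients fitted on the subsample) are assumptions made for this example.

```sql
-- Score and Fisher information for logistic regression, evaluated at the
-- subsample estimate (:b0, :b1, :b2), in one aggregation pass over the data.
-- big_table and the placeholders are hypothetical names for illustration.
SELECT
  SUM(y - mu)                  AS u0,   -- score: intercept
  SUM((y - mu) * x1)           AS u1,   -- score: x1
  SUM((y - mu) * x2)           AS u2,   -- score: x2
  SUM(mu * (1 - mu))           AS i00,  -- information (symmetric 3x3)
  SUM(mu * (1 - mu) * x1)      AS i01,
  SUM(mu * (1 - mu) * x2)      AS i02,
  SUM(mu * (1 - mu) * x1 * x1) AS i11,
  SUM(mu * (1 - mu) * x1 * x2) AS i12,
  SUM(mu * (1 - mu) * x2 * x2) AS i22
FROM (
  SELECT y, x1, x2,
         1.0 / (1.0 + EXP(-(:b0 + :b1 * x1 + :b2 * x2))) AS mu
  FROM big_table
) AS t;
```

The client then assembles the 3x3 information matrix and 3-vector score from the returned sums, solves the small linear system, and adds the correction to the subsample coefficients; the one-step estimator requires no further passes over the full data.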