Abstract #301937

This is the preliminary program for the 2003 Joint Statistical Meetings in San Francisco, California. The program currently includes the technical program (the schedule of invited, topic-contributed, regular contributed, and poster sessions); Continuing Education courses (August 2-5, 2003); and committee and business meetings. This online program will be updated frequently to reflect the most current revisions.


The views expressed here are those of the individual authors
and not necessarily those of the ASA or its board, officers, or staff.





JSM 2003 Abstract #301937
Activity Number: 475
Type: Contributed
Date/Time: Thursday, August 7, 2003 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Computing
Title: Sampling for Data Mining
Author(s): Douglas Willson*+
Companies: National Analysts, Inc.
Address: 1835 Market St., Philadelphia, PA 19103-2968
Keywords: data mining; instance selection; data squashing
Abstract:

Several "data squashing" algorithms have recently been proposed that generate a small dataset for modeling from a much larger "mother" dataset containing thousands or perhaps millions of records. Models estimated on the squashed dataset appear to reproduce important characteristics of models estimated on the mother data accurately, and to perform much better than models estimated on samples of the same size. However, the benchmark sample designs used in these comparisons have been simple random samples, which are typically not optimal designs for estimating models. This paper discusses "optimal" sample designs for building models in data mining. Optimal sample designs are developed for a class of nonlinear models. The role of calibrating weights so that sample moments match moments from the mother data is discussed. Models estimated from samples are also compared with models estimated from squashed datasets of the same size.
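
As an illustration of the weight calibration mentioned in the abstract, the sketch below adjusts the weights of a simple random sample so that its weighted first moments equal those of the mother data. The abstract does not say which calibration method the paper uses; linear (chi-square distance) calibration is assumed here, and the function name `calibrate_weights` and the simulated data are hypothetical.

```python
import numpy as np


def calibrate_weights(x_sample, target_means, base_weights=None):
    """Linear calibration sketch (an assumed method, not the paper's).

    Returns weights that sum to 1 and whose weighted column means of
    x_sample equal target_means, while staying close to the base weights
    in a chi-square distance sense.
    """
    n = x_sample.shape[0]
    if base_weights is None:
        d = np.full(n, 1.0 / n)  # equal weights for a simple random sample
    else:
        d = np.asarray(base_weights, dtype=float)
        d = d / d.sum()
    # A column of ones forces the calibrated weights to sum to 1.
    z = np.column_stack([np.ones(n), x_sample])
    targets = np.concatenate([[1.0], np.asarray(target_means, dtype=float)])
    # Solve (Z' D Z) lam = targets - Z' d for the Lagrange multipliers.
    m = z.T @ (d[:, None] * z)
    lam = np.linalg.solve(m, targets - z.T @ d)
    return d * (1.0 + z @ lam)


# Simulated "mother" data and a simple random sample drawn from it.
rng = np.random.default_rng(0)
mother = rng.normal(size=(100_000, 3))
sample = mother[rng.choice(len(mother), size=500, replace=False)]
weights = calibrate_weights(sample, mother.mean(axis=0))
```

To match second moments as well, squared or cross-product columns could be appended to both the sample matrix and the targets. Linear calibration can produce negative weights when the sample is far from the targets, so a bounded or raking-style method may be preferable in practice.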


  • The address information is for the authors that have a + after their name.
  • Authors who are presenting talks have a * after their name.


JSM 2003 For information, contact meetings@amstat.org or phone (703) 684-1221. If you have questions about the Continuing Education program, please contact the Education Department.
Revised March 2003