JSM Activity #CE2003_18C

This is the preliminary program for the 2003 Joint Statistical Meetings in San Francisco, California. It currently includes the technical program (the schedule of invited, topic-contributed, regular contributed, and poster sessions); Continuing Education courses (August 2-5, 2003); and committee and business meetings. This online program will be updated frequently to reflect the most current revisions.


The views expressed here are those of the individual authors
and not necessarily those of the ASA or its board, officers, or staff.





Hotels: H = Hilton San Francisco, R = Renaissance Parc Hotel 55, N = Nikko San Francisco
CE2003_18C Tue, 8/5/03, 8:00 AM - 12:00 PM N-Monterey Room II
Data Quality and Data Cleaning: An Overview - Continuing Ed
ASA
Instructor(s): Tamraparni Dasu, AT&T Research Labs
The success of any data mining exercise hinges on the quality of the data. Conventionally, data quality has been limited to static definitions such as completeness, accuracy, uniqueness and consistency. However, the nature and definition of data have expanded tremendously. For example, we have new paradigms such as streaming data, where the rate of data accumulation is very high. The World Wide Web has yielded new types of data such as "web scraped data" and web server logs. In addition, the scale and dimensionality of data have exploded, along with a high degree of heterogeneity caused by integrating data from diverse sources (federated and enterprise data). All these factors have made data increasingly uncontrolled and glitch-ridden. Our expectations of data have changed as well. We no longer want merely to analyze data; we want to predict, and to use the data to drive important decisions that have far-reaching consequences for corporations, the economy and various scientific endeavors.
In this course, we propose a general framework for defining, detecting, measuring and resolving data quality issues. Data quality is a complex and ill-defined concept, and generalization is difficult because it is so often context-specific. In practice, there are many technical and sociological factors that need to be addressed. We start with an updated, dynamic definition of data quality, treating it as a continuum in which data quality issues must be detected and monitored at each stage in the life cycle of the data. These stages include data gathering, data storage and retrieval, data integration, development of schema and business-related constraints, data summarization and publishing, and data analysis/mining. We will describe each of these in detail.
Furthermore, we will present an array of interdisciplinary tools drawn from process management, statistics, database research and metadata/domain expertise management for detecting and resolving data quality issues. Our approach has a significant focus on automatic detection for massive data sets, and we particularly emphasize data exploration: familiarizing ourselves with the data set and its nuances is a very important step in data cleaning. We conclude with a discussion of data quality metrics and their implementation. We include several case studies.
Prerequisite: None.
Text: Exploratory Data Mining and Data Cleaning, Publisher: John Wiley & Sons, Inc., $70.00
Fees: M-$200 ($270 after July 18), NM-$260 ($330 after July 18), SM-$125 ($200 after July 18)
 

JSM 2003 For information, contact meetings@amstat.org or phone (703) 684-1221. If you have questions about the Continuing Education program, please contact the Education Department.
Revised March 2003