JSM 2004 Online Program

Abstract #301533

This is the preliminary program for the 2004 Joint Statistical Meetings in Toronto, Canada. It currently includes the technical program (the schedule of invited, topic-contributed, regular contributed, and poster sessions); Continuing Education courses (August 7-10, 2004); and Committee and Business Meetings. This online program will be updated frequently to reflect the most current revisions.

The views expressed here are those of the individual authors
and not necessarily those of the ASA or its board, officers, or staff.

Activity Number:	46
Type:	Topic Contributed
Date/Time:	Sunday, August 8, 2004 : 4:00 PM to 5:50 PM
Sponsor:	Section on Statistical Computing
Title:	Comparison of Two Multiple-tree Algorithms on High-throughput Screening Data from Drug Discovery: RandomForest and Partitionator
Author(s):	Katja S. Remlinger*+ and Jacqueline M. Hughes-Oliver
Companies:	North Carolina State University and North Carolina State University
Address:	2304-C Bedford Ave., Raleigh, NC, 27607
Keywords:	classification trees ; data-mining ; dimension reduction ; variable importance ; accumulation curves
Abstract:	In drug discovery, large chemical libraries are screened to identify active compounds. Screening an entire library is neither cost- nor time-efficient, so methods are needed that can predict the biological activity of a compound from its chemical structure. Data-mining techniques are good candidates for this task: they perform well on large datasets and are very flexible. This paper compares two multiple-tree algorithms, RandomForest and Partitionator, on a drug discovery dataset that was used in the KDD Cup 2001. We first briefly describe both algorithms and point out their differences, and then compare their performance on the KDD dataset. In predicting test-set activities, both algorithms achieve weighted accuracies that rank in the top 5% of all competitor results. Furthermore, we propose three approaches, suitable for multiple-tree algorithms, for defining the screening order of test-set compounds.
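
The quantities named in the abstract (weighted accuracy, screening order, accumulation curve) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the abstract: "weighted accuracy" is read here as the average of the accuracy on active and on inactive compounds, and compounds are ordered by the fraction of trees voting "active". That ordering rule is only one plausible choice; it is not claimed to be one of the three approaches the authors propose, which the abstract does not describe. All function names and the vote layout are hypothetical.

```python
from typing import List, Sequence


def weighted_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average of the per-class accuracies (1 = active, 0 = inactive).

    Assumption: both classes receive equal weight regardless of their size.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)
    specificity = sum(1 for p in neg if p == 0) / len(neg)
    return 0.5 * (sensitivity + specificity)


def screening_order(tree_votes: Sequence[Sequence[int]]) -> List[int]:
    """Rank compounds by the fraction of trees voting 'active', highest first.

    tree_votes[i][j] is the 0/1 vote of tree j for compound i (hypothetical
    layout). Ties keep their original index order because sorted() is stable.
    """
    scores = [sum(votes) / len(votes) for votes in tree_votes]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def accumulation_curve(order: Sequence[int], y_true: Sequence[int]) -> List[int]:
    """Cumulative number of true actives found after screening each compound."""
    found, curve = 0, []
    for i in order:
        found += y_true[i]
        curve.append(found)
    return curve


# Toy data: 5 compounds, 3 trees (made up purely for illustration).
y_true = [1, 0, 0, 1, 0]
tree_votes = [[1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 1], [1, 1, 0]]

# Majority vote gives the predicted class for each compound.
y_pred = [1 if sum(v) / len(v) > 0.5 else 0 for v in tree_votes]
order = screening_order(tree_votes)
curve = accumulation_curve(order, y_true)
```

A steeper accumulation curve means that actives are found earlier in the screening order, which is the sense in which one ordering can be preferred over another.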

The address information is for the authors who have a + after their name.
Authors who are presenting talks have a * after their name.
