Abstract #301078

This is the preliminary program for the 2003 Joint Statistical Meetings in San Francisco, California. It currently includes the technical program (the schedule of invited, topic-contributed, regular contributed, and poster sessions); Continuing Education courses (August 2-5, 2003); and Committee and Business Meetings. This online program will be updated frequently to reflect the latest revisions.


The views expressed here are those of the individual authors
and not necessarily those of the ASA or its board, officers, or staff.





JSM 2003 Abstract #301078
Activity Number: 86
Type: Contributed
Date/Time: Monday, August 4, 2003 : 8:30 AM to 10:20 AM
Sponsor: Section on Physical and Engineering Sciences
Title: Quantitative Structure-Activity-Relationship Modeling Using Leo Breiman's Random Forest
Author(s): Christopher H. Tong*+, Vladimir B. Svetnik, and Andy I. Liaw
Companies: Merck & Co., Inc. (all authors)
Address: Biometrics Research RY 33-300, Rahway, NJ 07065-0900
Keywords: Random Forest ; classification ; regression ; tree ; data mining ; drug discovery
Abstract:

Quantitative structure-activity-relationship (QSAR) models relate a measure of a molecule's biological activity to its chemical structure. QSAR remains a very active area of research in chemistry and statistics. Unfortunately, most existing tools fail to accommodate multiple mechanisms of action (MOAs) and have therefore seen limited application in drug discovery. Recent progress, specifically in ensemble learning for classification and regression, has opened new opportunities. We present one example, Leo Breiman's Random Forest, with which we have had some success. Its advantages include the ability to handle multiple MOAs, an assessment of variable importance, and a meaningful measure of molecular similarity. We illustrate the application of Random Forest with a QSAR example for P-glycoprotein transport, using a publicly available data set. Our results compare favorably with those of previous methods. We give particular attention to variable selection and demonstrate potential pitfalls of a naive approach. We also discuss the R language interface, available on CRAN, that we have developed for Breiman and Cutler's Random Forest Fortran software.
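
The variable-importance idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' R interface or Breiman and Cutler's Fortran code: it is a minimal pure-Python approximation of a random forest with out-of-bag (OOB) permutation importance, run on synthetic data. The function names (`build_tree`, `permutation_importance`, `make_data`), the parameter values, and the data-generating rule are all assumptions made for illustration. The rule labels a sample active if either of two features crosses a threshold, which is only a loose stand-in for two mechanisms of action.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, rows, mtry, rng):
    """Grow an unpruned tree; each split considers only `mtry` random features."""
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return labels[0]  # pure node becomes a leaf
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        # Candidate thresholds: every observed value except the largest,
        # so both children are guaranteed to be non-empty.
        for t in sorted({X[i][j] for i in rows})[:-1]:
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # sampled features are constant here: majority-vote leaf
        return Counter(labels).most_common(1)[0][0]
    _, j, t, left, right = best
    return (j, t, build_tree(X, y, left, mtry, rng),
            build_tree(X, y, right, mtry, rng))


def predict(tree, x):
    """Follow splits until a leaf (a bare label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def permutation_importance(X, y, n_trees=30, mtry=2, seed=0):
    """Mean drop in OOB accuracy when one feature is permuted, per feature."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    used = 0
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        tree = build_tree(X, y, boot, mtry, rng)
        base = sum(predict(tree, X[i]) == y[i] for i in oob)
        for j in range(p):
            shuffled = [X[i][j] for i in oob]
            rng.shuffle(shuffled)
            hits = 0
            for i, v in zip(oob, shuffled):
                x = list(X[i])
                x[j] = v  # break the link between feature j and the label
                hits += predict(tree, x) == y[i]
            totals[j] += (base - hits) / len(oob)
        used += 1
    return [v / used for v in totals]


def make_data(n=150, p=5, seed=1):
    """Synthetic data (assumption): active if x0 > 0.7 OR x1 < 0.3; rest is noise."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [1 if (x[0] > 0.7 or x[1] < 0.3) else 0 for x in X]
    return X, y


X, y = make_data()
scores = permutation_importance(X, y)
for j, s in enumerate(scores):
    print(f"feature {j}: mean OOB accuracy drop = {s:.3f}")
```

In this toy setup, features 0 and 1 determine the label and the remaining features are noise, so the first two should show the largest accuracy drops. Because importance is computed on out-of-bag samples, the ranking does not rely on the data used to grow each tree. Any variable selection based on such a ranking still needs to be repeated inside each validation fold to avoid an optimistic error estimate.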


  • The address information is for the authors that have a + after their name.
  • Authors who are presenting talks have a * after their name.


Revised March 2003