Abstract Details

Activity Number: 361
Type: Contributed
Date/Time: Tuesday, August 2, 2016 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #319691
Title: Ranking Homologous Proteins Using an Ensemble of Logistic Regression Models Based on Subsets of Feature Variables
Author(s): Jabed Tomal* and William J. Welch and Ruben H. Zamar
Companies: University of Toronto and The University of British Columbia and The University of British Columbia
Keywords: Ensemble ; Ranking ; Logistic Regression Model ; Protein Homology

Homologous proteins are considered to have a common evolutionary origin. To produce an evolutionary sequence of proteins, a scientist needs to predict their biological homogeneity. We propose a model that predicts the biological homogeneity of proteins using feature variables obtained from a similarity search between candidate and target proteins. The assumption is that the structural similarity of proteins relates to their biological homogeneity. The proposed model is an ensemble of logistic regression models (LRMs), where each constituent LRM is fitted to a subset of the feature variables. An algorithm is developed to group the variables into subsets such that the variables within a subset work well together in one LRM, while the variables in different subsets work better in separate LRMs. The strength of the ensemble depends on the algorithm's ability to identify strong and diverse subsets of feature variables. The methods are applied to rank rare homologous proteins ahead of non-homologous proteins in the protein homology data obtained from the 2004 KDD Cup website. Our ensemble is found to perform better than the winning procedures of the competition.
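
The ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the subset-grouping algorithm or how the constituent models are combined, so the sketch assumes the subsets are already given and that the ensemble score is the plain average of the constituent LRMs' predicted probabilities. All function names, the toy data, and the fitting settings (batch gradient descent, learning rate, number of epochs) are hypothetical and chosen only to keep the example self-contained.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear(w, x):
    # w[0] is the intercept; w[1:] are the coefficients.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def fit_logistic(X, y, lr=0.5, epochs=500):
    # Fit one logistic regression by batch gradient descent (assumed fitting method).
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(linear(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def take(X, cols):
    # Keep only the feature columns listed in cols.
    return [[row[c] for c in cols] for row in X]

def ensemble_scores(X_train, y_train, X_test, subsets):
    # One LRM per feature subset; scores are averaged (assumed aggregation rule).
    totals = [0.0] * len(X_test)
    for cols in subsets:
        w = fit_logistic(take(X_train, cols), y_train)
        for i, xi in enumerate(take(X_test, cols)):
            totals[i] += sigmoid(linear(w, xi))
    return [t / len(subsets) for t in totals]

def rank(scores):
    # Candidate indices ordered from highest to lowest ensemble score.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy data: 4 similarity features, 1 = homologous, 0 = not.
X_train = [
    [0.9, 0.8, 0.1, 0.7],
    [0.8, 0.9, 0.2, 0.9],
    [0.1, 0.2, 0.9, 0.1],
    [0.2, 0.1, 0.8, 0.2],
    [0.3, 0.2, 0.7, 0.3],
    [0.1, 0.3, 0.9, 0.2],
]
y_train = [1, 1, 0, 0, 0, 0]
X_test = [
    [0.15, 0.2, 0.85, 0.1],
    [0.85, 0.9, 0.15, 0.8],
    [0.2, 0.3, 0.8, 0.3],
]
subsets = [[0, 1], [2, 3]]  # assumed to come from some grouping algorithm

scores = ensemble_scores(X_train, y_train, X_test, subsets)
order = rank(scores)
print(order)
```

Here `order` lists the candidate proteins from most to least likely homologous under the averaged score. In practice the quality of such an ensemble would depend on how the subsets are chosen, which is the part the abstract attributes to the authors' grouping algorithm.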

Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program