Activity Number: 361
Type: Contributed
Date/Time: Tuesday, August 2, 2016 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #319691

Title: Ranking Homologous Proteins Using an Ensemble of Logistic Regression Models Based on Subsets of Feature Variables
Author(s): Jabed Tomal* and William J. Welch and Ruben H. Zamar
Companies: University of Toronto and The University of British Columbia and The University of British Columbia
Keywords: Ensemble; Ranking; Logistic Regression Model; Protein Homology
Abstract:
Homologous proteins are considered to have a common evolutionary origin. To produce an evolutionary sequence of proteins, a scientist needs to predict their biological homogeneity. We propose a model that predicts the biological homogeneity of proteins from feature variables obtained in a similarity search between candidate and target proteins. The assumption is that the structural similarity of proteins relates to their biological homogeneity. The proposed model is an ensemble of logistic regression models (LRMs), where each constituent LRM is fitted to a subset of the feature variables. An algorithm is developed to group the variables into subsets so that variables within a subset appear to work well together in one LRM, while variables in different subsets appear to work better in separate LRMs. The strength of the ensemble depends on the algorithm's ability to identify strong and diverse subsets of feature variables. The methods are applied to rank rare homologous proteins ahead of non-homologous proteins in the protein homology data obtained from the 2004 KDD Cup website. The performance of our ensemble is found to be better than that of the winning procedures of the competition.
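
The ensemble idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the subset-grouping algorithm, so the subsets below are fixed by hand as a stand-in. The data are synthetic, the constituent LRMs are fitted by plain gradient descent, and averaging the predicted probabilities is an assumed way of combining them. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lrm(X, y, cols, lr=0.5, epochs=300):
    """Fit one logistic regression on the feature columns in `cols`
    using batch gradient descent (an assumed, simple fitting method)."""
    w = [0.0] * (len(cols) + 1)  # intercept plus one weight per feature
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, feats))) - label
            for j, fj in enumerate(feats):
                grad[j] += err * fj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def predict_lrm(w, row, cols):
    feats = [1.0] + [row[c] for c in cols]
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))


def fit_ensemble(X, y, subsets):
    # One constituent LRM per subset of feature variables.
    return [(cols, fit_lrm(X, y, cols)) for cols in subsets]


def ensemble_score(models, row):
    # Assumption: combine constituent LRMs by averaging their probabilities.
    return sum(predict_lrm(w, row, cols) for cols, w in models) / len(models)


def rank_candidates(models, X):
    # Indices of candidates, highest ensemble score first.
    scores = [ensemble_score(models, row) for row in X]
    return sorted(range(len(X)), key=lambda i: scores[i], reverse=True)


# Synthetic stand-in data: 200 candidates, 20 rare positives, 4 features.
random.seed(0)
X, y = [], []
for i in range(200):
    label = 1 if i < 20 else 0
    X.append([random.gauss(1.5 * label, 1.0) for _ in range(4)])
    y.append(label)

subsets = [[0, 1], [2, 3]]  # hand-picked; the paper's algorithm would choose these
models = fit_ensemble(X, y, subsets)
order = rank_candidates(models, X)
hits_in_top20 = sum(y[i] for i in order[:20])
print("positives ranked in the top 20:", hits_in_top20)
```

In this sketch the ranking is evaluated on the training data only to keep the example short. A real evaluation would use held-out blocks of candidate proteins.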
Authors who are presenting talks have a * after their name.