Abstract Details

Activity Number: 294 - SPEED: Statistical Learning and Data Science Speed Session 2, Part 1
Type: Contributed
Date/Time: Tuesday, July 30, 2019 : 8:30 AM to 10:20 AM
Sponsor: Section on Statistical Learning and Data Science
Abstract #307201 Presentation
Title: Predicting Sub-Cellular Location of Plant Protein Using Supervised Machine Learning
Author(s): David Arthur* and Benjamin Annan and Eric Quayson and Jack Min and Guang-Hwa Andy Chang
Companies: and Youngstown State University and Youngstown State University and Youngstown State University and Youngstown State University
Keywords: Neural Networks; Sub-cellular locations; Gradient Boosting; Class Imbalance; PseAA

The U.S. National Library of Medicine states that proteins play critical roles in the body and are required for the function of its tissues and organs. A protein is built from smaller units, amino acids, of which there are twenty different types. A distinctive feature of eukaryotic cells is that specific functions are performed in spatially defined, membrane-bound compartments, so the subcellular location of a protein gives information about its function. This information supports advances in fields such as drug design. Although many papers address the identification of protein subcellular locations, prediction accuracy has not improved significantly over the years, as discussed in this paper. Two general issues are how proteins are represented and which algorithms can improve prediction accuracy. This paper uses pseudo amino acid composition (PseAA) to represent protein sequences, and the classification algorithms used include Random Forest, AdaBoost, SAMME, Support Vector Machines, and Artificial Neural Networks. To improve prediction accuracy, gradient boosting algorithms are applied, along with techniques to address class imbalance.
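
As an illustration of the protein representation mentioned above, the following is a minimal sketch of a pseudo amino acid composition feature vector. It is not the authors' implementation. It assumes a simplified variant that uses a single physicochemical property (the Kyte-Doolittle hydropathy index) where the usual formulation combines several; the function name `pse_aac`, the parameters `lam` and `w`, and their default values are hypothetical choices for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index. Assumption: used here as the only
# property; the abstract does not say which properties the authors used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(table):
    """Rescale property values to zero mean and unit variance over the 20 residues."""
    values = [table[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {a: (table[a] - mean) / sd for a in AMINO_ACIDS}


def pse_aac(sequence, lam=2, w=0.05, prop=KYTE_DOOLITTLE):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector.

    The first 20 entries are weighted residue frequencies; the last `lam`
    entries are weighted sequence-order correlation factors. `lam` and `w`
    are illustrative defaults, not values from the paper.
    """
    # Keep only the 20 standard residues (drops gaps and ambiguous codes).
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")

    h = _standardize(prop)
    counts = Counter(seq)
    freqs = [counts[a] / n for a in AMINO_ACIDS]

    # theta_k: mean squared property difference between residues k apart.
    thetas = [
        sum((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]

    # Shared denominator so the whole vector sums to 1.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


features = pse_aac("MKTAYIAKQRQISFVKSHFSRQ", lam=2)
print(len(features))
```

Vectors of this form have a fixed length regardless of sequence length, so they can be passed directly to classifiers such as those listed in the abstract.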

Authors who are presenting talks have a * after their name.
