Online Program


All Times EDT

Thursday, June 4
Machine Learning
Software & Data Science Technologies
Machine Learning and Software and Data Science Technologies Posters
Thu, Jun 4, 2:00 PM - 5:00 PM
TBD

WITHDRAWN Decision Tree Model-Based Gene Selection and Classification for Breast Cancer Risk Prediction (308340)

Ismail El Moudden, Eastern Virginia Medical School 
Jiangtao Luo, Eastern Virginia Medical School 
Hamim Mohamed, Ecole Nationale Supérieure D’arts Et Métiers, University Hassan II 
Mohan Dev Pant, Eastern Virginia Medical School 

Keywords: Gene Selection; Feature Extraction; Supervised Learning; Dimensionality Reduction; Microarray Data Analysis

According to www.cancer.org, the average risk of an American woman developing breast cancer is about 13%. Early detection is key in the fight against this disease. With the advent of gene expression technology and machine learning algorithms, developing a prediction model that detects breast cancer with high accuracy has become more realistic than before. However, the most common challenge with gene expression data is its high dimensionality, which makes any prediction approach difficult to apply. The main objective of this study is to build a prediction model that uses the minimum number of genes and can be applied to new observations to classify patients with the highest possible accuracy. In this context, we propose a two-phase model for gene selection and classification. The model combines two feature selection methods (a Fisher score-based filter method to reduce the dimension, followed by the C5.0 classifier to select an optimal subset of important genes) to improve classification performance. Four classifiers (ANN, C5.0, LR, and SVM) are combined with the gene selection process to classify each sample. All model-building experiments were conducted on publicly available microarray breast cancer data. The dataset consists of 24,481 gene expressions for 97 patients, divided into a training set and a test set. The training set consists of 78 patients, 44 of whom are healthy and the remaining 34 of whom are diagnosed with breast cancer. The test set contains 7 healthy patients and 12 diagnosed with breast cancer. The results show that the proposed approach significantly reduces the number of genes (only five genes are retained out of 24,481) while achieving a higher prediction accuracy, reaching 93.28%. We next expect to test the proposed approach on a new microarray dataset with different properties in terms of the number of genes, the number of samples, and the number of classes.
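
The first phase described above, a Fisher score filter, can be illustrated with a minimal sketch. The abstract does not say which Fisher score variant was used, so this assumes one common definition: for each gene, the between-class scatter of the class means divided by the pooled within-class variance. The function names `fisher_scores` and `top_k_features` and the toy data are hypothetical, and the second phase (C5.0-based selection) is not shown.

```python
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature (assumed definition, see lead-in).

    X is a list of samples, each a list of feature values; y holds class labels.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between = 0.0  # scatter of class means around the overall mean
        within = 0.0   # size-weighted variance inside each class
        for c in classes:
            values = [row[j] for row, label in zip(X, y) if label == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class spread: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the highest Fisher score."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (not from the study): 6 samples, 3 "genes", 2 classes.
# Only the first gene separates the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 3.0, 0.9],
    [0.8, 4.0, 0.4],
    [3.0, 4.5, 0.5],
    [3.2, 3.5, 0.7],
    [2.8, 4.0, 0.3],
]
y = [0, 0, 0, 1, 1, 1]

scores = fisher_scores(X, y)
selected = top_k_features(X, y, k=1)
print(selected)
```

In a pipeline like the one the abstract outlines, the genes kept by such a filter would then be passed to a tree-based selector and finally to the classifiers; how many genes the filter should keep is a tuning choice the abstract does not state.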