Friday, May 31

Machine Learning E-Posters, II

Fri, May 31, 3:00 PM - 4:00 PM

Grand Ballroom Foyer

**Keywords:** Machine Learning, Survival Analysis, Personalized medicine

Survival analysis models data in which the outcome is the time until an event of interest occurs. Because survival data are censored, general-purpose predictive machine learning algorithms for classification and regression cannot be applied to them directly. Traditionally, statistical approaches (non-parametric, semi-parametric, and parametric) have been widely used to analyze censored data.
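To illustrate how a non-parametric approach accounts for censoring, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. The function name `kaplan_meier` and the toy data are hypothetical and are not taken from the lung cancer data set. A censored subject counts as at risk up to the censoring time but never counts as an event.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    # Distinct times at which at least one event was observed
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before t (censored ones included)
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical toy data: five subjects, two of them censored (event = 0)
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
print([(t, round(s, 3)) for t, s in km_curve])
```

The subject censored at time 3 stays in the risk set for the event at time 3 and then leaves it, which is exactly the information an ordinary regression on the observed times would lose.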

Helped by the development of data acquisition and big data technologies over the last decade, machine learning algorithms (Bayesian methods, support vector machines, and neural networks) have been developed to analyze censored data. Random Survival Forests (RSF) is an extension of Random Forest, a non-parametric method that requires no distributional assumptions. This paper presents the development of a tree-based method and RSF for survival data from lung cancer patients.
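As a rough illustration of the building block behind a survival tree, the sketch below scores candidate splits of one covariate with the two-sample log-rank statistic, a commonly used splitting rule for survival trees and RSF. It assumes log-rank splitting, since the abstract does not state which rule was used. The function names and the toy data are hypothetical, and the bootstrap sampling and random covariate selection that turn single trees into a forest are omitted.

```python
def logrank_statistic(times, events, in_left):
    """Two-sample log-rank chi-square statistic for a left/right split."""
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        # Risk-set sizes just before t, overall and in the left child
        n = sum(1 for u in times if u >= t)
        n_left = sum(1 for u, g in zip(times, in_left) if u >= t and g)
        # Events at t, overall and in the left child
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d_left = sum(1 for u, e, g in zip(times, events, in_left)
                     if u == t and e == 1 and g)
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            variance += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def best_split(times, events, x):
    """Scan thresholds x <= c on one covariate; return (threshold, statistic)."""
    best = (None, 0.0)
    for c in sorted(set(x))[:-1]:  # the largest value would leave the right child empty
        stat = logrank_statistic(times, events, [v <= c for v in x])
        if stat > best[1]:
            best = (c, stat)
    return best


# Hypothetical toy data: larger covariate values go with shorter survival
toy_times = [1, 2, 3, 4, 10, 11, 12, 13]
toy_events = [1, 1, 1, 1, 1, 0, 1, 1]
toy_x = [5, 6, 7, 8, 1, 2, 3, 4]
threshold, statistic = best_split(toy_times, toy_events, toy_x)
print(threshold, round(statistic, 2))
```

A tree applies this search recursively within each child node, and a forest averages the resulting survival estimates over many such trees.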

I apply statistical and machine learning methods to the lung cancer data to see whether there is any statistically significant difference in survival by lung cancer type and treatment. I used the data set from *The Statistical Analysis of Failure Time Data* (J. D. Kalbfleisch and R. L. Prentice, 1980), with 137 observations of lung cancer patients and 8 variables. The data contain four lung cancer types: Non-Small Cell Lung Cancer (NSCLC) - Squamous, Small Cell Lung Cancer (SCLC), Adenocarcinoma (Adeno), and Large Cell. The cell types show different survival probabilities, and the difference is statistically significant at alpha = 0.05 (p-value: 1.27e-05). Initially the Large cell type has the higher survival, but by the end of follow-up NSCLC (Squamous) has higher survival than the Large cell type. I also applied the Cox Proportional Hazards Model and found that the p-values for all three overall tests (Likelihood ratio, Wald, and Score) are significant, which indicates that the model is significant at alpha = 0.05.
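To show where the three overall tests of a Cox model come from, the following sketch fits a Cox model with a single binary covariate by Newton-Raphson on the partial likelihood and then computes the likelihood ratio, Wald, and score statistics. It is a simplified illustration under stated assumptions: one covariate instead of the full model, Breslow handling of ties, hypothetical function names, and hypothetical toy data rather than the 137 lung cancer observations. The toy numbers are therefore not expected to reproduce the significance reported above.

```python
import math


def cox_pieces(beta, times, events, x):
    """Log partial likelihood, score, and information for one covariate."""
    loglik = score = info = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue  # censored subjects contribute only through the risk sets
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        w = [math.exp(beta * x_j) for x_j in risk]
        s0 = sum(w)
        s1 = sum(x_j * w_j for x_j, w_j in zip(risk, w))
        s2 = sum(x_j * x_j * w_j for x_j, w_j in zip(risk, w))
        loglik += beta * x_i - math.log(s0)
        score += x_i - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return loglik, score, info


def fit_cox(times, events, x, iterations=25):
    """Plain Newton-Raphson from beta = 0 (no step-size control)."""
    beta = 0.0
    for _ in range(iterations):
        _, score, info = cox_pieces(beta, times, events, x)
        beta += score / info
    return beta


def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical toy data: x = 1 marks a group that tends to fail earlier
toy_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_events = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
toy_x = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

beta_hat = fit_cox(toy_times, toy_events, toy_x)
ll_hat, u_hat, i_hat = cox_pieces(beta_hat, toy_times, toy_events, toy_x)
ll_0, u_0, i_0 = cox_pieces(0.0, toy_times, toy_events, toy_x)

lr_stat = 2 * (ll_hat - ll_0)      # likelihood ratio test
wald_stat = beta_hat ** 2 * i_hat  # Wald test
score_stat = u_0 ** 2 / i_0        # score test, evaluated at beta = 0

for name, stat in [("LR", lr_stat), ("Wald", wald_stat), ("Score", score_stat)]:
    print(name, round(stat, 3), round(chi2_pvalue_1df(stat), 3))
```

With several covariates the same three statistics are compared against a chi-square distribution whose degrees of freedom equal the number of coefficients, and an established survival-analysis library would normally be used instead of hand-written Newton steps.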