Abstract:
|
The application of statistical methods to very large data sets with many variables has grown dramatically over the past few years. Data mining is a broad subject that encompasses several topics and problems, including supervised and unsupervised learning. In the supervised learning problems of classification and regression, concerns about how well a model generalizes to new data are effectively addressed by cross-validation, that is, by dividing the data into a training subset used to build a prediction model and a test subset used to evaluate the model's performance. Classification techniques, including naive Bayes, support vector machine, decision tree, and random forest, were compared empirically, and a Classification Learning Toolbox was developed in the R statistical programming language to analyze the data sets and report the relationships among the classifiers and their prediction accuracy.
|
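As a minimal sketch of the train/test evaluation described in the abstract, the following Python example splits a data set into a training subset and a test subset, builds a simple model on the former, and measures accuracy on the latter. It is not the Classification Learning Toolbox, which is written in R; the toy data, the 30% test fraction, the 1-nearest-neighbour classifier, and all function names are assumptions chosen only for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (training, test) subsets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = rows[:]                 # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, x):
    """Predict the label of the training row whose feature is closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test rows whose label is predicted correctly."""
    correct = sum(1 for x, label in test if predict_1nn(train, x) == label)
    return correct / len(test)


# Hypothetical toy data: one numeric feature and a two-class label.
data = [(x, "low" if x < 15 else "high") for x in range(30)]

train, test = train_test_split(data)
test_accuracy = accuracy(train, test)
print(f"train={len(train)} test={len(test)} accuracy={test_accuracy:.2f}")
```

The model never sees the test rows while it is built, so the reported accuracy estimates performance on unseen data. Repeating the split over several disjoint test subsets and averaging the results would give a k-fold estimate instead of a single holdout estimate.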