Online Program

All Times ET

Friday, June 4
Education
Data Science Education and Applications
Fri, Jun 4, 1:20 PM - 2:55 PM
TBD
 

Who teaches data science concepts? - Results from course catalog mining with machine learning (309806)

Presentation

Linda E Clark, Brown University 
Ethan Hawk, Valparaiso University 
Katherine M Kinnaird, Smith College 
*Sasha Lioutikova, Yale University 
Mikael Moise, Smith College 
Marius Orehovschi, Colby College 
Bjorn Sandstede, Brown University 
Karl R. B. Schmitt, Trinity Christian College 
Sydney E Shearer, Juniata College 
Ellie Strauss, Bates College 
Frankie Vazquez, Valparaiso University 
Ruth E.H. Wertz, Valparaiso University 

Keywords: data science, curriculum, courses, higher education, research, machine learning, data mining, random forests

This project explores the interdisciplinarity of data science instruction by identifying courses, and their respective departments, whose published course descriptions include topics associated with a data science body of knowledge. We identify these courses by applying a random forest algorithm to course catalog data obtained directly from the websites of higher education institutions. As a preprocessing step, catalogs are converted from PDF into usable data entries in XML files using natural language processing techniques. Each resulting data point consists of a department prefix, course number, title, and vector of words from the description. Training data for classification are selected by comparing the word vector from each catalog description to a word vector developed from the EDISON Data Science Body of Knowledge (BoK). Courses whose descriptions include refined terms from the BoK are labeled as data science. The final training datasets are constructed using a stratified 10-fold approach with undersampling to improve the balance between the two target classes, data science course and non-data science course. Finally, cross-validation is performed to assess how well the algorithm will classify courses when additional catalogs are added to the dataset. Using the trained random forest algorithm, we are able to identify courses outside of the primary data science disciplines (mathematics, statistics, and computer science) that teach data science concepts. Identifying these courses provides valuable insight into where data science is being used and where data education is occurring.
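
The labeling and fold-construction steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the term list `BOK_TERMS`, the `min_hits` threshold, the toy course records, and the helper names `label_course`, `undersample`, and `stratified_folds` are all assumptions made for illustration, and the order in which undersampling and fold assignment are applied is a guess. The random forest itself is omitted; a classifier would be trained and validated on the folds produced here.

```python
import random
from collections import defaultdict

# Hypothetical BoK terms; the refined EDISON term list is not given in the abstract.
BOK_TERMS = {"regression", "visualization", "database", "statistics", "mining"}


def label_course(description_words, bok_terms=BOK_TERMS, min_hits=1):
    """Return 1 (data science) if the description shares at least
    min_hits distinct words with the BoK term set, else 0."""
    hits = len(set(description_words) & bok_terms)
    return 1 if hits >= min_hits else 0


def undersample(labels, rng):
    """Return sorted indices with every class cut down to the size
    of the smallest class (random undersampling)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n))
    return sorted(kept)


def stratified_folds(indices, labels, k, rng):
    """Split indices into k folds, dealing each class out round-robin
    so that every fold keeps roughly the overall class proportions."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds


# Toy data points: (department prefix, course number, title, description words).
courses = []
for n in range(30):
    courses.append(("HIST", 100 + n, "Survey", ["history", "culture", "writing"]))
for n in range(10):
    courses.append(("BIOL", 200 + n, "Biostatistics", ["regression", "statistics", "biology"]))

labels = [label_course(words) for _, _, _, words in courses]
rng = random.Random(0)  # fixed seed so the split is reproducible
balanced = undersample(labels, rng)
folds = stratified_folds(balanced, labels, k=10, rng=rng)
```

In this toy setup the 30 non-data science courses are undersampled to match the 10 data science courses, and each of the 10 folds receives one course from each class. With real catalog data the class sizes would rarely divide evenly, so fold sizes would differ by at most one course per class.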