Keywords: Data Science Education, Big Data, multidisciplinary
Data Science, as defined by the draft NIH Strategic Plan, is “the interdisciplinary field of inquiry in which quantitative and analytical approaches … extract knowledge and insights from increasingly large and/or complex sets of data.” Such definitions implicitly stress both the multidisciplinary nature of data science and the broad set of necessary skills.
This presentation describes efforts at one institution to build capacity in data science, including 1) development of educational modules, 2) overarching efforts to share resources, and 3) coordination of educational efforts across departments.
The educational modules focus on 1) bioinformatics analysis of TCGA data and the Human Microbiome Project; 2) text and natural language processing of social media data; 3) machine learning for image segmentation and registration; 4) discovery of plausible underlying causal relationships; and 5) causal inference through propensity score-based methods. Once developed, these modules will be packaged in a common framework designed to facilitate comparison, indexing, and reuse with necessary metadata. The modules as a whole will emphasize reproducibility, skills for managing big data, and traditional statistical concepts (e.g., bias-variance trade-offs). Materials will be placed in accessible open-source hosting environments.
Similar work across 13 other institutions is being coordinated to leverage efforts and produce complementary training. In addition, collaborations at the university level are working to characterize definitions of data science, educational goals and competencies, and variations in educational programs across the schools of Medicine, Computing and Information, Public Health, and Arts and Sciences.
Thus far, data science educational programs have been insufficient to meet rapidly expanding workforce demands. We hope these efforts will motivate further collaboration that better aligns with the multidisciplinary nature of data science.