Keywords: Bioconductor, S4, machine learning, assay, omic
Bioconductor's S4 data containers for genomic assays are popular, well established data structures. Their data architecture facilitates the application of common analytical procedures and well established statistical methodologies to large assay data. They are extensible to encompass new emerging technologies and analytical methods. However, the S4 system enforces strict constraints on the data and these constraints raise barriers for interoperability and integration with software and packages outside of Bioconductor's repository.
mlr is a comprehensive package for machine learning. It aggregates hundreds of supervised and unsupervised models and facilitates analytics such as resampling, benchmarking, tuning, and ensemble. The mlrCPO package extends mlr's pre-processing and feature engineering functionality via composable Preprocessing Operators (CPO) 'pipelines'.
Bioc2mlr is a compact utility package designed to bridge between these approaches. It deploys transformations of SummarizedExperiment and MultiAssayExperiment S4 data structures into mlr's expected format. It also implements Bioconductor's popular feature selection (filtering) methods used by limma package and others, as a CPO. The vignettes present comparisons to the MLInterfaces package, which aims to achieve similar goals, and presents workflows for popular publicly available genomic datasets such as curatedTCGAData.