Abstract:
|
Identification of genomic biomarkers is a primary data analysis task in drug discovery experiments. These experiments produce several high-dimensional datasets that contain information about a set of new drugs (compounds) under development. This data structure introduces the challenge of multi-source data integration, which is needed to identify the biological pathways related to these compounds. Processing all of this information requires high-performance computing (HPC) techniques. Although R packages for parallel computing are available, they are not optimized for a specific setting or data structure. In the current study, we propose a new “master-slave” framework for parallel data analysis using R on a computer cluster. The proposed data analysis workflow is applied to a multi-source, high-dimensional drug discovery dataset, and the performance of the new framework is compared with that of existing R packages for parallel computing. Different configuration settings for parallel programming in R are presented to show that the computation time for the specific application under consideration can be reduced significantly.
|