Abstract:
|
In clinical settings, it is increasingly common for the number of variables to far exceed the sample size when multimodal datasets (e.g. demographics, lab data and high-throughput biomarkers) are integrated. Since a number of these variables could be noise, it is important to identify a small subset of variables that are informative and clinically actionable (i.e. prognostic, predictive or biologically relevant to the target drug or disease), and to understand relative variable importance in the presence of correlation. Further, missingness patterns can differ across modalities, leaving only a small sample with complete data. Many methods, especially those for identifying subgroups, do not perform well in high-dimensional settings because of the computational and multiplicity burden. For these reasons, we recommend a two-stage strategy: first nominating a subset of informative variables from each data domain, and then combining these subsets in the integrative analysis (e.g. subgroup identification). We illustrate the strategy by integrating clinical, lab and GWAS data to identify important predictors and subgroups in a phase 2 clinical study of non-alcoholic steatohepatitis.
|