
Keywords: Type 2 diabetes, filtering, Lasso, random forest
Background: Diabetic kidney disease (DKD) is a major comorbidity of type 2 diabetes mellitus (T2DM). There is an urgent need to identify novel biomarkers that reliably predict future DKD. Sample: Urine samples from 995 T2DM participants in the Chronic Renal Insufficiency Cohort (CRIC) and 198 quality control (QC) samples were assayed in duplicate for relative metabolite abundance, yielding 15,434 untargeted features (1,899 annotated). Data Processing: We developed stringent filtering criteria to eliminate noisy features. Using the technical duplicate QC samples, we computed Spearman and Pearson correlations (QC CC), the intraclass correlation (QC ICC), and the coefficient of variation (QC CV) for each metabolite. We used the 995 subjects to calculate intraclass correlations (CRIC ICC) for each metabolite. Metabolites with low reliability (QC CC < 0.85, QC ICC = 0.05, QC CV = 0.05) or low biological variation (CRIC ICC < 0.35) were excluded. Statistical Modeling: After filtering, we fit prognostic models for kidney function decline (defined as eGFR slope) using penalized regression (Lasso) and machine learning (random forest), with metabolites and clinical predictors (age, gender, race, smoking, baseline BMI, blood pressure, HbA1c, eGFR, albuminuria). The models with the lowest prediction error were further evaluated on the time-to-ESRD outcome via the C-statistic. Five-fold cross-validation was repeated 100 times to obtain the median and 95% CI of the C-statistic. Results: The sample was 56% male, with mean (SD) age 59.9 (9.4) years, eGFR 40.6 (11.2) mL/min/1.73 m², HbA1c 7.6 (1.5)%, and annual eGFR slope -1.8 (1.9). After filtering, approximately 2,000 reliably measured features remained (700 annotated). The eGFR slope models selected 9 to 122 features, depending on the Lasso penalty and the random forest variable importance metric. The best ESRD model, with 20 metabolites and 9 clinical factors, had a median (95% CI) C-statistic of 0.85 (0.85, 0.86). Conclusion: Modern statistical methods applied to untargeted metabolomics can reveal novel insights into DKD.
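
The reliability screen described under Data Processing can be illustrated with a minimal sketch for a single metabolite measured in duplicate. The abstract does not give the exact estimators, so several choices here are assumptions: a one-way random-effects ICC, a CV taken as the mean of per-pair SD/mean on positive raw-scale abundances, and Pearson correlation only (Spearman is omitted for brevity). The function names and the `cc_min`, `icc_min`, and `cv_max` parameters are hypothetical, and the direction of the ICC and CV comparisons (low ICC or high CV treated as unreliable) is also an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between the two replicate vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_oneway(rep1, rep2):
    """One-way random-effects ICC for k = 2 replicates per sample (assumed form)."""
    n, k = len(rep1), 2
    means = [(a + b) / 2 for a, b in zip(rep1, rep2)]
    grand = sum(means) / n
    # Between-sample and within-sample mean squares from a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(rep1, rep2, means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def duplicate_cv(rep1, rep2):
    """Mean per-pair coefficient of variation; assumes positive abundances."""
    # For two values, the sample SD equals |a - b| / sqrt(2).
    cvs = [abs(a - b) / sqrt(2) / ((a + b) / 2) for a, b in zip(rep1, rep2)]
    return sum(cvs) / len(cvs)


def is_reliable(rep1, rep2, cc_min, icc_min, cv_max):
    """Keep a metabolite only if all three reliability checks pass.

    Thresholds are supplied by the caller; their direction is an assumption.
    """
    return (pearson(rep1, rep2) >= cc_min
            and icc_oneway(rep1, rep2) >= icc_min
            and duplicate_cv(rep1, rep2) <= cv_max)
```

In use, `is_reliable` would be called once per feature on its QC duplicate vectors, and the same `icc_oneway` function could be applied to the subject duplicates to screen for low biological variation.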
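
The repeated cross-validation summary of the C-statistic can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the C-statistic is taken to be Harrell's concordance index, each repeat is summarized by the mean over its five folds, and the interval is the 2.5th to 97.5th percentile across repeats. The `fit_predict(train_idx, test_idx)` callable is hypothetical; it stands in for fitting any survival model on the training indices and returning risk scores for the test indices.

```python
import random
from statistics import median, quantiles


def harrell_c(times, events, risks):
    """Harrell's concordance index; a higher risk should mean an earlier event."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in this fold")
    return concordant / comparable


def repeated_cv_cstat(times, events, fit_predict,
                      n_folds=5, n_repeats=100, seed=0):
    """Return (median, 2.5th percentile, 97.5th percentile) over repeats."""
    rng = random.Random(seed)
    n = len(times)
    per_repeat = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        # Deal shuffled indices into n_folds roughly equal folds.
        folds = [idx[f::n_folds] for f in range(n_folds)]
        scores = []
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            risks = fit_predict(train, test)  # hypothetical model wrapper
            scores.append(harrell_c([times[i] for i in test],
                                    [events[i] for i in test],
                                    risks))
        per_repeat.append(sum(scores) / n_folds)
    cuts = quantiles(per_repeat, n=40)  # cut points every 2.5%
    return median(per_repeat), cuts[0], cuts[-1]
```

Averaging within each repeat before taking percentiles means the interval reflects variability due to the random fold assignment, not sampling uncertainty in the cohort itself.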