Therapies in development for chronic diseases often target a specific biomarker. A method for identifying patients with a higher probability of being biomarker-positive would therefore be valuable. Multiple existing databases, such as cohort studies and registries, can be used to develop algorithms for this purpose. However, heterogeneity across study designs poses challenges. We implemented a new sampling framework that uses nested cross-validation accompanied by a stratified subsampling procedure. The proposed method can alleviate problems caused by heterogeneity among these databases and make them more comparable to a target population. We also propose an innovative visualization method, a heatmap, to represent the statistical model, which gives the probability of biomarker positivity for different combinations of risk factors. We found that these statistical procedures facilitated model validation in an independent dataset and also facilitated tuning of the model for specific target populations (e.g., a particular clinical setting), thereby improving the utility of the tool.
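As an illustration of the two sampling ingredients named above, the following minimal sketch shows one way stratified subsampling toward a target population and nested cross-validation splits could be set up. It is not the procedure used in this work: the function names (`stratified_subsample`, `nested_cv_splits`), the choice of strata, the target proportions, and the fold counts are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(records, stratum_of, target_props, seed=0):
    """Subsample records so that stratum proportions match target_props.

    Hypothetical helper: target_props maps each stratum label to its
    assumed proportion in the target population (values sum to 1).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)

    # Largest total sample size that every stratum can supply.
    n_total = min(
        len(by_stratum[s]) / p for s, p in target_props.items() if p > 0
    )

    sample = []
    for s, p in target_props.items():
        # Small epsilon guards against floating-point rounding just below
        # an integer; the min() keeps the draw within what is available.
        k = min(int(n_total * p + 1e-9), len(by_stratum[s]))
        sample.extend(rng.sample(by_stratum[s], k))
    return sample


def k_folds(indices, k, seed=0):
    """Shuffle indices and split them into k roughly equal folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv_splits(n, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for n observations.

    inner_splits is a list of (fit, valid) index pairs drawn only from
    outer_train, so tuning never sees the outer test fold.
    """
    outer = k_folds(range(n), outer_k, seed)
    for i, test in enumerate(outer):
        train = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_folds(train, inner_k, seed + 1 + i)
        inner_splits = [
            ([j for g, fold in enumerate(inner) if g != m for j in fold], valid)
            for m, valid in enumerate(inner)
        ]
        yield train, test, inner_splits


# Usage with made-up data: a source database that over-represents stratum "A"
# is subsampled toward an assumed 50/50 target population, then split.
source = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
matched = stratified_subsample(source, lambda r: r[0], {"A": 0.5, "B": 0.5})
for outer_train, outer_test, inner_splits in nested_cv_splits(len(matched)):
    pass  # tune on inner_splits, refit on outer_train, evaluate on outer_test
```

In this sketch the inner folds would be used to tune the model and the outer folds to estimate performance; subsampling before splitting is only one possible ordering of the two steps.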