Abstract:
|
Reinforcement learning (RL) has shown great success in estimating sequential treatment regimes (STRs) that account for patient heterogeneity. However, the health-outcome information used as the reward in RL methods is often not well coded and is instead embedded in clinical notes. Extracting outcome information is a resource-intensive task, so most available well-annotated cohorts are small. We propose a semi-supervised learning (SSL) approach that efficiently leverages a small labeled dataset with observed true outcomes and a large unlabeled dataset with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off-policy value estimation. Generalizing SSL to STRs brings interesting challenges: 1) the feature distribution for Q-learning is unknown, as it includes previous outcomes; 2) the surrogate variables we leverage are predictive of the outcome but not informative about the optimal policy or value function. We provide theoretical results for our Q-function and value function estimators to quantify the degree of efficiency gained from SSL. In addition, our method is robust to misspecification of the imputation models.
|