Dynamic treatment regimes operationalize precision medicine as a sequence of decision rules, one per stage of clinical intervention, that map up-to-date patient information to a recommended intervention. An optimal treatment regime maximizes mean utility when applied to the population of interest. Q-learning is a primary method for estimating an optimal treatment regime from data collected in an observational or randomized study. However, methods for accommodating missing data in the context of Q-learning remain underdeveloped. Standard practice is to use multiple imputation, which requires estimating the joint distribution of patient trajectories; this distribution can be high-dimensional, especially when there are multiple stages of intervention. We propose a variant of Q-learning based on augmented inverse probability weighting that does not require modeling the trajectory distribution. The proposed estimator is shown to be consistent under mild regularity conditions and to perform well relative to multiple imputation in simulation experiments.
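To make the backward-induction structure of Q-learning concrete, the following is a minimal sketch of standard two-stage Q-learning with linear working models on fully observed (complete-case) simulated data. All variable names (`X1`, `A1`, `X2`, `A2`, `Y`) and the data-generating process are illustrative assumptions, not the paper's setup, and the sketch does not implement the proposed augmented inverse probability weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical two-stage study: X1 baseline covariate, A1/A2 binary
# randomized treatments, X2 intermediate covariate, Y final utility.
X1 = rng.normal(size=n)
A1 = rng.integers(0, 2, size=n)
X2 = X1 + A1 + rng.normal(size=n)
A2 = rng.integers(0, 2, size=n)
Y = X2 + A2 * (X2 > 0) + rng.normal(size=n)

def design(x, a):
    """Linear Q-model features with a treatment interaction: 1, x, a, a*x."""
    return np.column_stack([np.ones_like(x), x, a, a * x])

def fit_q(features, outcome):
    """Least-squares fit of a linear working Q-function."""
    beta, *_ = np.linalg.lstsq(features, outcome, rcond=None)
    return beta

# Stage 2: regress Y on stage-2 history and treatment.
beta2 = fit_q(design(X2, A2), Y)

# Pseudo-outcome: predicted value under the best stage-2 treatment.
q2_0 = design(X2, np.zeros(n)) @ beta2
q2_1 = design(X2, np.ones(n)) @ beta2
pseudo = np.maximum(q2_0, q2_1)

# Stage 1: regress the pseudo-outcome on stage-1 history and treatment;
# the estimated rule recommends the treatment maximizing the fitted Q.
beta1 = fit_q(design(X1, A1), pseudo)
d1 = (design(X1, np.ones(n)) @ beta1
      > design(X1, np.zeros(n)) @ beta1).astype(int)
```

When some trajectories are incomplete, the complete-case regressions above are generally biased; the multiple-imputation approach fills in missing values before running this recursion, whereas the proposed approach instead reweights and augments the observed data.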