Long-term observational clinical studies are a common setting for evaluating treatment safety and efficacy in chronic mental illnesses such as bipolar disorder. A dynamic treatment regime is a sequence of decision rules that map evolving patient information to personalized treatments so as to maximize the patient outcome. Estimating such a regime is challenging because the observed patient information is a noisy realization of a latent disease state process. Moreover, the number of potential decision stages is large and varies across patients. In addition, the definition of the patient outcome must balance the long-term and short-term benefits of treatment. In this context, we estimate the optimal dynamic treatment regime using an infinite-horizon partially observable Markov decision process. We establish the consistency of the proposed estimators under regularity conditions and evaluate their performance via simulations. Finally, we apply the proposed method to construct an optimal treatment regime for patients under the standard care pathway in the Systematic Treatment Enhancement Program for Bipolar Disorder.