We propose a new method, termed stabilized O-learning, for deriving stabilized dynamic treatment regimes (DTRs), which are sequential decision rules for individual patients that not only adapt over the course of disease progression but also remain consistent over time in their format. The method provides a robust and efficient learning framework for constructing DTRs by directly optimizing a doubly robust estimator of the expected long-term outcome. It can accommodate various types of outcomes, including continuous, categorical, and potentially censored survival outcomes. In addition, the method is flexible enough to incorporate clinical preferences into a qualitatively fixed rule, in which the parameters indexing the decision rules are shared across stages and can be estimated simultaneously. We conducted extensive simulation studies, which showed superior performance of the proposed method. We also applied the proposed method to data from the prospective Canary Prostate Cancer Active Surveillance study.