Dynamic treatment regimes (DTRs) are sequential decision rules that tailor treatment to the individual patient and adapt it over time. We first present an adaptive contrast weighted learning (ACWL) method for estimating optimal DTRs. At each stage, we develop robust semiparametric regression-based contrasts that adapt to the ordering of treatment effects for each patient; these adaptive contrasts reduce the optimization problem with multiple treatment comparisons to a weighted classification problem. We then develop a tree-based reinforcement learning (T-RL) method that estimates optimal DTRs directly in a multi-stage, multi-treatment setting. At each stage, T-RL builds an unsupervised decision tree that preserves the nature of batch-mode reinforcement learning. Unlike ACWL, T-RL handles the optimization with multiple treatment comparisons directly, through a purity measure constructed from augmented inverse probability weighted (AIPW) estimators. T-RL is generally robust and efficient, and the resulting DTRs are easy to interpret. However, when the true optimal DTR is not tree-structured, ACWL appears more robust to this tree-type misspecification than T-RL.
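To make the single-stage building block concrete, the following is a minimal sketch (not the authors' implementation) of the AIPW pseudo-outcomes that underlie both methods, on simulated data with three treatments. The data-generating model, the per-arm linear working model, and the known propensities are illustrative assumptions. The final labels and weights correspond to the ACWL-style reduction to weighted classification; T-RL would instead use the same AIPW estimates inside a decision tree's purity measure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 3
X = rng.normal(size=(n, 2))                 # baseline covariates
A = rng.integers(0, K, size=n)              # randomized treatment assignment
# Hypothetical truth: arm means 0, x0, -x0, so the optimal rule depends on sign(x0)
mu = np.stack([0 * X[:, 0], X[:, 0], -X[:, 0]], axis=1)
Y = mu[np.arange(n), A] + rng.normal(scale=0.5, size=n)

# 1) Outcome regressions: a linear working model fit separately within each arm
Xd = np.column_stack([np.ones(n), X])
muhat = np.empty((n, K))
for a in range(K):
    idx = A == a
    beta, *_ = np.linalg.lstsq(Xd[idx], Y[idx], rcond=None)
    muhat[:, a] = Xd @ beta

# 2) Propensity scores: known under randomization (1/K per arm)
pihat = np.full((n, K), 1.0 / K)

# 3) AIPW pseudo-outcomes: regression prediction plus an IPW-corrected residual
#    for the arm each patient actually received (doubly robust construction)
obs = A[:, None] == np.arange(K)
aipw = muhat + obs * (Y[:, None] - muhat) / pihat

# 4) ACWL-style reduction: label each patient with the estimated best arm and
#    weight by the adaptive contrast (gap between best and second-best arm)
order = np.sort(aipw, axis=1)
label = np.argmax(aipw, axis=1)             # pseudo-label for classification
weight = order[:, -1] - order[:, -2]        # misclassification cost, >= 0
```

Any multiclass classifier that accepts sample weights (e.g., a classification tree) can then be fit to `(X, label, weight)`; its predictions form the estimated single-stage rule, and the multi-stage problem is handled by backward induction over stages.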