A dynamic treatment regime (DTR) is a sequence of decision rules that adapt to the time-varying state of an individual. Interpretable DTRs are desirable because human experts must be able to understand and implement them. Many existing estimation methods require pre-specified working models, which can be problematic, especially with data from large observational studies. We develop a stochastic tree-based reinforcement learning method, ST-RL, for estimating optimal DTRs in a multi-stage, multi-treatment setting with data from either randomized trials or observational studies. At each stage, ST-RL constructs a decision tree by first modeling the mean counterfactual outcome via Bayesian nonparametric regression and then stochastically searching for the optimal tree-structured regime with a Markov chain Monte Carlo algorithm. The method proceeds by backward induction through the decision stages. Compared with existing methods, ST-RL delivers optimal DTRs with better interpretability and does not require explicit working-model specification. In addition, ST-RL demonstrates stable and strong performance with moderately high-dimensional data.
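To make the per-stage procedure concrete, the following is a self-contained toy sketch under heavy simplifying assumptions that are not from the paper itself: a single stage, a binary treatment, a depth-1 tree-structured rule, a k-nearest-neighbour average standing in for the Bayesian nonparametric regression of counterfactual outcome means, and a simple Metropolis-style random search standing in for the full MCMC tree search.

```python
import math
import random
from statistics import mean

random.seed(0)

# Toy randomized trial: one stage, binary treatment A, three covariates.
# The (unknown) optimal rule is: treat (a = 1) iff x[0] > 0.5.
n = 400
X = [[random.random() for _ in range(3)] for _ in range(n)]
A = [random.randint(0, 1) for _ in range(n)]
Y = [float(a == (x[0] > 0.5)) + random.gauss(0, 0.1) for x, a in zip(X, A)]

def q_hat(x, a, k=25):
    """Crude nonparametric estimate of E[Y | X = x, A = a]: the average
    outcome of the k nearest neighbours that received treatment a
    (a stand-in for the Bayesian nonparametric regression)."""
    idx = [i for i in range(n) if A[i] == a]
    idx.sort(key=lambda i: sum((x[d] - X[i][d]) ** 2 for d in range(3)))
    return mean(Y[i] for i in idx[:k])

# Pre-compute counterfactual mean estimates for both arms at each x.
Q = [[q_hat(x, a) for a in (0, 1)] for x in X]

def value(rule):
    """Estimated value of a depth-1 tree rule (feature j, cutoff c):
    assign a = 1 when x[j] > c, else a = 0."""
    j, c = rule
    return mean(Q[i][1] if X[i][j] > c else Q[i][0] for i in range(n))

# Metropolis-style stochastic search over candidate tree rules
# (a stand-in for the MCMC tree search in ST-RL).
rule = (random.randrange(3), random.random())
best, best_val = rule, value(rule)
temp = 0.02
for _ in range(500):
    prop = (random.randrange(3), random.random())
    if random.random() < min(1.0, math.exp((value(prop) - value(rule)) / temp)):
        rule = prop
    if value(rule) > best_val:
        best, best_val = rule, value(rule)

print("best rule: treat iff x[%d] > %.2f (estimated value %.3f)"
      % (best[0], best[1], best_val))
```

In a multi-stage version, this search would be repeated backward from the last stage to the first, with each stage's fitted rule plugged in when estimating the earlier stages' counterfactual outcomes.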