Patients with advanced cancer often receive multiple lines of treatment, and each line typically offers several options, such as chemotherapy, immunotherapy, and targeted therapy. For example, if the first-line treatment fails, clinicians can choose an alternative therapy as the second line, and patients may go on to receive third, fourth, or later lines if earlier treatments do not work. Selecting the right therapies in the optimal sequence is therefore important in both drug development and clinical practice. In machine learning and data mining, Q-learning is a model-free reinforcement learning technique that can be used to find an optimal action-selection policy for a sequential decision process. As an estimation method, Q-learning also has statistical properties of its own. We use Q-learning to estimate the optimal treatment regime, that is, the optimal treatment strategy. We will present results from simulation studies, including the performance of optimal treatment regime estimation, and compare the statistical properties of Q-learning with those of other causal inference methods, such as weighting methods.
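To make the idea concrete, the following is a minimal sketch of two-stage Q-learning by backward induction with linear working models. It is an illustration under stated assumptions, not the implementation used in the simulation studies: the data-generating model in `simulate`, the two-option action set `ACTIONS`, the linear Q-functions, and all function names are hypothetical choices made for this example.

```python
import numpy as np

ACTIONS = (-1.0, 1.0)  # assumed: two treatment options per line, coded -1/+1


def simulate(n, seed=0):
    """Toy two-line trial with randomized treatments (assumed data model)."""
    rng = np.random.default_rng(seed)
    s1 = rng.normal(size=n)                      # covariate before first line
    a1 = rng.choice(ACTIONS, size=n)             # randomized first-line treatment
    s2 = 0.5 * s1 + rng.normal(size=n)           # covariate before second line
    a2 = rng.choice(ACTIONS, size=n)             # randomized second-line treatment
    y = a1 * s1 + a2 * s2 + rng.normal(size=n)   # final outcome, larger is better
    return s1, a1, s2, a2, y


def design1(s1, a1):
    """First-line design: intercept, covariate, treatment, interaction."""
    return np.column_stack([np.ones_like(s1), s1, a1, s1 * a1])


def design2(s1, a1, s2, a2):
    """Second-line design: first-line history plus second-line terms."""
    return np.column_stack([design1(s1, a1), s2, a2, s2 * a2])


def ols(X, y):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def q_learning(s1, a1, s2, a2, y):
    """Backward induction: fit the last line first, then the first line."""
    # Second-line Q-function: regress the outcome on history and a2.
    beta2 = ols(design2(s1, a1, s2, a2), y)
    # Pseudo-outcome: predicted outcome if the best second-line option is taken.
    q2 = [design2(s1, a1, s2, np.full_like(s2, a)) @ beta2 for a in ACTIONS]
    pseudo = np.max(q2, axis=0)
    # First-line Q-function: regress the pseudo-outcome on s1 and a1.
    beta1 = ols(design1(s1, a1), pseudo)
    return beta1, beta2


def rule1(beta1, s1):
    """Estimated first-line rule: the option that maximizes the fitted Q1."""
    return np.where(beta1[2] + beta1[3] * s1 >= 0, 1.0, -1.0)


def rule2(beta2, s2):
    """Estimated second-line rule: the option that maximizes the fitted Q2."""
    return np.where(beta2[5] + beta2[6] * s2 >= 0, 1.0, -1.0)


if __name__ == "__main__":
    beta1, beta2 = q_learning(*simulate(5000))
    print("first-line coefficients:", np.round(beta1, 2))
    print("second-line coefficients:", np.round(beta2, 2))
```

In this toy model the best option at each line is the sign of the current covariate, so the fitted interaction coefficients (`beta1[3]` and `beta2[6]`) should both be close to 1 in large samples. With more than two lines, the same backward step would be repeated once per line.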