Policy optimization (with neural networks as actor and critic) is the workhorse behind the success of deep reinforcement learning. However, its global convergence remains poorly understood, even in classical settings with linear function approximation. In this talk, I will show that, coupled with neural networks, a variant of proximal/trust-region policy optimization (PPO/TRPO) globally converges to the optimal policy. In particular, I will illustrate how the overparametrization of neural networks enables us to establish strong guarantees. (Joint work with Qi Cai, Jason Lee, Boyi Liu, and Zhuoran Yang.)