Global guarantees for policy gradient methods.
- Jalaj Bhandari (Columbia University)
Policy gradient algorithms are among the most widely used reinforcement learning methods and have driven much of the recent empirical success in applications of reinforcement learning, especially with deep neural network based policies. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. However, even for simple control problems solvable by classical dynamic programming techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point.

We make fundamental progress in developing a theory for policy gradient methods, providing insights into when and why policy gradients are guaranteed to converge to globally optimal policies. In particular, we identify structural properties of the underlying Markov decision process (MDP) which ensure that, despite non-convexity, the policy gradient objective has no suboptimal stationary points. We also show that when some of these conditions are strengthened, the policy gradient objective satisfies a Polyak-Łojasiewicz (gradient dominance) condition, yielding convergence rates; when they are relaxed, we bound the optimality gap of any stationary point. Our results leverage a key connection to policy iteration, a classic and well-studied dynamic programming algorithm, via the policy gradient theorem.
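To make the setting concrete, the following is a minimal sketch (our own illustration, not code from the talk): exact policy gradient ascent with a tabular softmax policy on a small randomly generated MDP, where the gradient is computed via the policy gradient theorem as discounted state-visitation weights times advantages. The objective is non-convex even in this tabular case, yet gradient ascent approaches the optimal value computed independently by value iteration. All sizes, step sizes, and iteration counts below are illustrative assumptions.

```python
import numpy as np

# Illustrative tabular MDP: 3 states, 2 actions, random transitions/rewards.
np.random.seed(0)
nS, nA, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']
r = np.random.rand(nS, nA)                           # r[s, a]
rho = np.full(nS, 1.0 / nS)                          # start-state distribution

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_eval(pi):
    """Exact V^pi and Q^pi by solving the Bellman linear system."""
    P_pi = np.einsum('sa,sab->sb', pi, P)   # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                   # shape (nS, nA)
    return V, Q, P_pi

theta = np.zeros((nS, nA))
for _ in range(20000):
    pi = softmax(theta)
    V, Q, P_pi = policy_eval(pi)
    # Discounted state-visitation distribution d^pi under rho.
    d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho)
    # Policy gradient theorem (softmax): grad = d(s) pi(a|s) A^pi(s,a) / (1-gamma).
    grad = d[:, None] * pi * (Q - V[:, None]) / (1 - gamma)
    theta += 2.0 * grad                     # illustrative constant step size

# Independent baseline: optimal values from value iteration.
V_star = np.zeros(nS)
for _ in range(2000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

V_final = policy_eval(softmax(theta))[0]
print(np.abs(V_final - V_star).max())       # optimality gap of the learned policy
```

The connection to policy iteration is visible in the update: the gradient pushes probability toward actions with positive advantage at every visited state, a soft version of the greedy improvement step in policy iteration.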