Simultaneous estimation of rewards and dynamics in inverse reinforcement learning problems

Herman, Michael

PhD Thesis, University of Freiburg, January 28, 2020

Abstract:

As the capabilities of autonomous systems improve, they can solve more and more tasks in increasingly complex environments. Often, an autonomous system needs to be adjusted to the specific task or environment, which typically requires extensive research and engineering. Allowing non-experts to adjust these systems to new behaviors and goals makes them faster and easier to deploy for various tasks and environments, which increases their acceptability. An intuitive way to describe a task is to provide demonstrations of the desired behavior. These demonstrations can be used to learn a representation of the expert’s motivation and goal. Learning from Demonstration is a class of approaches that make it possible to teach new behaviors by demonstrating the task instead of programming it directly. Two subfields of Learning from Demonstration are Behavioral Cloning and Inverse Reinforcement Learning (IRL). Behavioral Cloning approaches estimate the expert’s policy directly from demonstrations and therefore learn to mimic the expert. However, the learned policies are typically only appropriate if the environment, the dynamics, and the task remain unchanged. A popular approach that learns more generalizable representations is Inverse Reinforcement Learning, which estimates the unknown reward function of a Markov Decision Process (MDP) from demonstrations of an expert. Many IRL approaches exist that solve the problem under various assumptions. Most of them assume that the environment’s dynamics are known, that the dynamics can be learned from expert demonstrations, that additional samples from suboptimal behavior can be queried, or that appropriate heuristics account for unknown transition models. However, these assumptions are often not satisfied, since transition models of environments can be complex, accurate models may be unknown, querying samples may be too expensive, and heuristics may distort the reward estimate.
To solve IRL problems under unknown dynamics, we propose a framework that simultaneously estimates both the reward function and the dynamics from expert demonstrations. This is possible because both influence the expert’s policy and thus the long-term behavior. Therefore, not only the observed transitions in the expert’s demonstrations but also the observed actions contain information about the dynamics of the environment. The contribution of this thesis is the formulation of a new problem class, called Simultaneous Estimation of Rewards and Dynamics (SERD). Furthermore, we derive several solutions to this problem under different assumptions on how experts compute their policy. The evaluation on a minimal example, a grid world navigation task, and a high-level navigation task of a simulated robot in a populated hallway shows that incorporating the estimation of the transition model into Inverse Reinforcement Learning yields more accurate models of the dynamics and the reward function.
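The claim that observed actions carry information about the dynamics can be illustrated with a small sketch. It is not taken from the thesis: it assumes, purely for illustration, an expert who follows a soft (maximum-entropy style) optimal policy, and the two-state MDP, the function names, and all numbers are hypothetical. With the reward held fixed, changing only the transition model changes the expert's action probabilities.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_policy(P, R, gamma=0.9, iters=500):
    """Soft value iteration on a tabular MDP (an assumed expert model).

    P[s][a][t] is the probability of moving from state s to state t
    under action a; R[s] is the reward of state s.
    Returns pi[s][a], the stochastic policy of the soft-optimal expert.
    """
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        Q = [[R[s] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V = [logsumexp(Q[s]) for s in range(n_states)]
    return [[math.exp(Q[s][a] - V[s]) for a in range(n_actions)]
            for s in range(n_states)]


def make_dynamics(p_success):
    """Hypothetical two-state MDP: state 0 is the start, state 1 an absorbing goal.

    Action 0 ("stay") keeps the agent in place; action 1 ("go") reaches
    the goal from state 0 with probability p_success.
    """
    return [
        [[1.0, 0.0], [1.0 - p_success, p_success]],  # transitions from state 0
        [[0.0, 1.0], [0.0, 1.0]],                    # state 1 is absorbing
    ]


R = [0.0, 1.0]  # same reward under both transition models

pi_reliable = soft_policy(make_dynamics(0.9), R)
pi_unreliable = soft_policy(make_dynamics(0.1), R)

print("P(go | start), reliable dynamics:  ", pi_reliable[0][1])
print("P(go | start), unreliable dynamics:", pi_unreliable[0][1])
```

In this toy setting the expert chooses "go" more often when the action is more likely to succeed, so the action frequencies in demonstrations constrain the transition model even though the reward is identical in both cases.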

@phdthesis{Herman2020PhDThesis,
  author  = {Herman, Michael},
  year    = {2020},
  month   = {01},
  school  = {University of Freiburg},
  address = {Freiburg, Germany},
  title   = {Simultaneous estimation of rewards and dynamics in inverse reinforcement learning problems}
}