We train neural networks to control various systems efficiently towards arbitrary goals. In the videos above, pink markers (■) denote goals passed as input to the neural networks.
Imitation learning—learning to solve a task by imitating expert demonstrations—is a well-established approach for machine-learning-based control. However, its applicability depends on having access to demonstrations, which are often expensive to collect and/or suboptimal for solving the task. In this work, we present an approach to train goal-conditioned policies on datasets generated by trajectory optimization. Our approach for dataset generation is computationally efficient, can generate thousands of optimal trajectories in minutes on a laptop computer, and produces high-quality demonstrations. Further, by means of a data augmentation scheme that treats intermediate states as goals, we are able to increase the training dataset size by an order of magnitude. Using our generated datasets, we train goal-conditioned neural network policies that can control the system towards arbitrary goals. To demonstrate the generality of our approach, we generate datasets and then train policies for various control tasks, namely cart-pole stabilization, planar and three-dimensional quadcopter stabilization, and point reaching using a 6-DoF robot arm. We show that our trained policies can achieve high success rates and efficient control profiles, all while being small enough (less than 80,000 neural network parameters) that they could be deployed onboard resource-constrained controllers.
How do we bring a spacecraft to the Moon while minimizing fuel consumption? How do we make a robot arm reach an object in the shortest possible time? How do we control a drone to fly from point \(A\) to point \(B\) with the least energy expenditure?
These are all examples of optimal control problems (OCPs). We have a dynamical system whose state evolves over time: \[ \dot{\bm{x}}(t) = \bm{f}(\bm{x}(t), \bm{u}(t)), \] where \(\bm{x}(t)\) is the system's state at time \(t\) and \(\bm{u}(t)\) is the control. We cannot change the state directly, but we do get to pick the controls: these are, for instance, the forces of a drone's motors. But what controls do we pick to accomplish our task?
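To make this concrete, here is a minimal sketch of a dynamics function and an integration step in Python, for a toy double integrator (a point mass whose acceleration we command directly); the system and all names are illustrative, not part of our implementation.

```python
import numpy as np

def f(x, u):
    """Dynamics of a planar double integrator: the state x = (px, py, vx, vy)
    stacks position and velocity, and the control u = (ax, ay) is an
    acceleration command."""
    px, py, vx, vy = x
    ax, ay = u
    return np.array([vx, vy, ax, ay])

def step(x, u, dt=0.02):
    """Advance the state by dt with a single forward-Euler integration step."""
    return x + dt * f(x, u)
```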
Optimal control is about picking the best possible controls according to a specific performance measure. Suppose the system starts at state \(\bm{x}_0\) and that we wish to control it to reach a goal state \(\bm{x}_g\) at time \(t_f\). Since we are seeking an optimal control, we should specify a performance measure or cost functional; this could take the form \[ \int_{0}^{t_f} L(\bm{x}(t), \bm{u}(t)) \, \mathrm{d}t + L_f(\bm{x}(t_f), t_f). \] Part of the cost is accumulated over time (integrated), while we may also incur some cost at the end of the trajectory; it all depends on what we mean by a best control, and that definition determines the choice of \(L\) and \(L_f\).
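For example, if the best control is the one that spends the least energy, a natural choice is \(L(\bm{x}, \bm{u}) = \lVert \bm{u} \rVert^2\) with \(L_f = 0\), so the cost is the integrated squared control effort; if instead we want to reach the goal as quickly as possible, we can take \(L = 1\) and \(L_f = 0\), so the cost reduces to the final time \(t_f\).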
We can then specify a full OCP, for example \[ \small \begin{align*} \underset{\bm{u}(t), t_f}{\small\text{minimize}} & \int_{0}^{t_f} L(\bm{x}(t), \bm{u}(t)) \mathrm{d}t + L_f(\bm{x}(t_f), t_f) \\ \small{\text{subject to}} \ & \dot{\bm{x}}(t) = \bm{f}(\bm{x}(t), \bm{u}(t)), \\ & \bm{x}(0) = \bm{x}_0, \\ & \bm{x}(t_f) = \bm{x}_g. \end{align*} \] For a fixed pair of initial and goal states, we can solve this problem by trajectory optimization, for which very fast solvers exist nowadays [1]. Solving the problem yields a control \(\bm{u}(t)\) that minimizes the cost while satisfying the constraints (and there could be more constraints, e.g., control limits).
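As an illustration, here is a minimal trajectory optimization sketch in Python using CasADi's `Opti` interface, applied to the toy double integrator above with a minimum-energy cost, fixed final time, and forward-Euler transcription; this is one plausible setup, not necessarily the solver or transcription used in our work.

```python
import casadi as ca

N, dt = 100, 0.02          # number of steps and step length (t_f = N * dt, fixed here)
x0 = [0.0, 0.0, 0.0, 0.0]  # initial state (px, py, vx, vy)
xg = [1.0, 1.0, 0.0, 0.0]  # goal state

opti = ca.Opti()
X = opti.variable(4, N + 1)  # state trajectory, one column per time step
U = opti.variable(2, N)      # control trajectory

# Minimum-energy cost: a discrete approximation of the integral of ||u||^2.
opti.minimize(ca.sumsqr(U) * dt)

# Dynamics constraints via forward-Euler transcription.
for k in range(N):
    xdot = ca.vertcat(X[2, k], X[3, k], U[0, k], U[1, k])
    opti.subject_to(X[:, k + 1] == X[:, k] + dt * xdot)

opti.subject_to(X[:, 0] == x0)  # start at the initial state...
opti.subject_to(X[:, N] == xg)  # ...and reach the goal at t_f

opti.solver("ipopt")
sol = opti.solve()
states, controls = sol.value(X), sol.value(U)  # the optimal trajectory
```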
The problem, however, is that trajectory optimization gives us an open-loop control for one specific initial state and goal, whereas we'd like a controller that can drive the system from any state: a closed-loop controller, or policy. In addition, we'd like to tell the policy which goal it should aim to reach; that is, the policy should be goal-conditioned, computing controls as \(\bm{u} = \bm{\pi}(\bm{x}, \bm{x}_g)\). How do we obtain a goal-conditioned policy that is also efficient according to our measure of cost?
Our approach: we sample many (initial state, goal state) pairs and obtain their corresponding optimal controls by trajectory optimization, yielding a dataset of optimal trajectories. We can then treat these trajectories as expert demonstrations to train a model such as a neural network: given a state and a goal, the model should approximate (imitate) the actions of the expert controller; this is thus a form of imitation learning.
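A hypothetical sketch of this pipeline in Python (using PyTorch) is shown below, including one plausible instantiation of the data augmentation scheme that treats intermediate states as goals; `sample_pair` and `solve_ocp` are assumed helpers (the latter could wrap the CasADi sketch above), and the network architecture is illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

def build_dataset(sample_pair, solve_ocp, num_trajectories):
    """Build a behavior-cloning dataset of (state ++ goal) -> control pairs.

    `sample_pair()` is assumed to draw a random (initial state, goal state)
    pair, and `solve_ocp(x0, xg)` to return arrays of optimal states and
    controls with shapes (N+1, state_dim) and (N, control_dim)."""
    inputs, targets = [], []
    for _ in range(num_trajectories):
        x0, xg = sample_pair()
        states, controls = solve_ocp(x0, xg)
        for t in range(len(controls)):
            # Standard pair: current state with the trajectory's final goal.
            inputs.append(np.concatenate([states[t], xg]))
            targets.append(controls[t])
            # Augmentation (one plausible scheme; the exact scheme in the
            # thesis may differ): by the principle of optimality, the subarc
            # from states[t] to any later states[t2] is itself optimal, so
            # controls[t] is also a correct label for the goal states[t2].
            for t2 in range(t + 1, len(states)):
                inputs.append(np.concatenate([states[t], states[t2]]))
                targets.append(controls[t])
    return (np.asarray(inputs, dtype=np.float32),
            np.asarray(targets, dtype=np.float32))

# A small goal-conditioned MLP (about 18k parameters for a 4-dimensional
# state and goal and a 2-dimensional control, well under the 80k budget).
policy = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2),
)

def train(policy, X, Y, epochs=100, lr=1e-3):
    """Plain behavior cloning: regress the expert controls with an MSE loss."""
    X, Y = torch.from_numpy(X), torch.from_numpy(Y)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(X), Y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```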
Having trained a policy on the dataset of optimal demonstrations, we can proceed to evaluate it in simulation; to that end, we measure the policy's success rate in reaching sampled goals and the efficiency of its control profiles.
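For example, the success rate could be estimated with closed-loop rollouts like the following sketch, where `step` advances the simulation by one time step (e.g., the Euler step above); the tolerance and episode budget are illustrative, and the accumulated control cost could be tracked analogously.

```python
import numpy as np
import torch

def success_rate(policy, sample_pair, step, episodes=100, max_steps=500, tol=0.05):
    """Fraction of rollouts in which the policy drives the state to within
    `tol` of the goal before the step budget runs out."""
    successes = 0
    for _ in range(episodes):
        x, xg = sample_pair()
        for _ in range(max_steps):
            inp = torch.from_numpy(np.concatenate([x, xg]).astype(np.float32))
            with torch.no_grad():
                u = policy(inp).numpy()  # closed loop: query the policy...
            x = step(x, u)               # ...then advance the simulation
            if np.linalg.norm(x - xg) < tol:
                successes += 1
                break
    return successes / episodes
```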
We test our method on four different dynamical systems: a cart-pole system, a planar (two-dimensional) quadrotor, a three-dimensional quadrotor, and the Franka Emika Panda robot arm. The first three are based on the safe-control-gym library [2].
In the table below, we report the wall-clock time of the dataset generation step for each task, measured on a laptop computer.
| Task | Number of trajectories | Dataset generation time (mm:ss) |
|---|---|---|
| Cart-pole | 20,000 | 03:08 |
| Planar quadrotor | 20,000 | 05:27 |
| Three-dimensional quadrotor | 20,000 | 11:23 |
| Robot arm reaching | 20,000 | 00:19 |
For all the aforementioned control tasks, we train neural network policies that achieve success rates greater than 96%. The following videos show the trained policies controlling each dynamical system in simulation; we use the panda-gym library [3] for the robot arm simulation. We refer to the thesis report for detailed quantitative evaluations.
We draw the following conclusions from our work:

- Trajectory optimization is a computationally efficient source of expert demonstrations: it can generate thousands of optimal trajectories in minutes on a laptop computer.
- Treating intermediate states of optimal trajectories as additional goals increases the training dataset size by an order of magnitude.
- The resulting goal-conditioned policies achieve high success rates and efficient control profiles while remaining small enough (fewer than 80,000 parameters) to be deployed onboard resource-constrained controllers.
Ideas for future research include evaluating the method on more varied control tasks, handling partial observability, and using alternative policy representations.