Learning efficient goal-conditioned policies by imitating optimal trajectories

Department of Statistics, Mathematics and Computer Science
Public University of Navarre (UPNA/NUP)

Bachelor's Thesis (Trabajo Fin de Grado)

We train neural networks to control various systems efficiently towards arbitrary goals. In the videos above, pink markers denote the goals passed as input to the neural networks.

Abstract

Imitation learning—learning to solve a task by imitating expert demonstrations—is a well-established approach for machine-learning-based control. However, its applicability depends on having access to demonstrations, which are often expensive to collect and/or suboptimal for solving the task. In this work, we present an approach to train goal-conditioned policies on datasets generated by trajectory optimization. Our approach for dataset generation is computationally efficient, can generate thousands of optimal trajectories in minutes on a laptop computer, and produces high-quality demonstrations. Further, by means of a data augmentation scheme that treats intermediate states as goals, we are able to increase the training dataset size by an order of magnitude. Using our generated datasets, we train goal-conditioned neural network policies that can control the system towards arbitrary goals. To demonstrate the generality of our approach, we generate datasets and then train policies for various control tasks, namely cart-pole stabilization, planar and three-dimensional quadcopter stabilization, and point reaching using a 6-DoF robot arm. We show that our trained policies can achieve high success rates and efficient control profiles, all while being small enough (less than 80,000 neural network parameters) that they could be deployed onboard resource-constrained controllers.

Optimal control

How do we bring a spacecraft to the Moon while minimizing fuel consumption? How do we make a robot arm reach an object in the shortest possible time? How do we control a drone to fly from point \(A\) to point \(B\) with the least energy expenditure?

These are all examples of optimal control problems (OCPs). We have a dynamical system whose state evolves over time: \[ \dot{\bm{x}}(t) = \bm{f}(\bm{x}(t), \bm{u}(t)), \] where \(\bm{x}(t)\) is the system's state at time \(t\) and \(\bm{u}(t)\) is the control. We cannot change the state directly, but we do get to pick the controls: these are, for instance, the forces produced by a drone's motors. But what controls do we pick to accomplish our task?
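As a concrete toy example (a stand-in for illustration, not one of the systems studied in this work), consider a planar double integrator with state \(\bm{x} = (p_x, p_y, v_x, v_y)\) and control \(\bm{u} = (a_x, a_y)\): the position changes according to the velocity, and the velocity according to the commanded acceleration.

```python
import numpy as np

def f(x, u):
    """x_dot = f(x, u) for a planar double integrator (illustrative only):
    state x = [px, py, vx, vy], control u = [ax, ay] (commanded accelerations)."""
    px, py, vx, vy = x
    ax, ay = u
    return np.array([vx, vy, ax, ay])
```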

Optimal control is about picking the best possible controls according to a specific performance measure. Suppose the system starts at state \(\bm{x}_0\) and that we wish to control it to reach a goal state \(\bm{x}_g\) at time \(t_f\). Since we are seeking an optimal control, we must specify a performance measure or cost functional; this could take the form \[ \int_{0}^{t_f} L(\bm{x}(t), \bm{u}(t)) \mathrm{d}t + L_f(\bm{x}(t_f), t_f). \] Part of the cost is accumulated over time (the integral term), and we may also incur a cost at the end of the trajectory; what counts as the best control is up to us, and that choice determines the appropriate \(L\) and \(L_f\).
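For instance (these are standard textbook choices, not necessarily the exact costs used in this work), choosing \[ L(\bm{x}(t), \bm{u}(t)) = \|\bm{u}(t)\|^2, \quad L_f = 0 \] penalizes control effort, whereas choosing \[ L(\bm{x}(t), \bm{u}(t)) = 1, \quad L_f = 0 \] gives \(\int_{0}^{t_f} 1 \, \mathrm{d}t = t_f\), i.e. a minimum-time problem.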

We can then specify a full OCP, for example \[ \small \begin{align*} \underset{\bm{u}(t), t_f}{\small\text{minimize}} & \int_{0}^{t_f} L(\bm{x}(t), \bm{u}(t)) \mathrm{d}t + L_f(\bm{x}(t_f), t_f) \\ \small{\text{subject to}} \ & \dot{\bm{x}}(t) = \bm{f}(\bm{x}(t), \bm{u}(t)), \\ & \bm{x}(0) = \bm{x}_0, \\ & \bm{x}(t_f) = \bm{x}_g. \end{align*} \] For a fixed pair of initial and goal states, we can solve this problem by trajectory optimization, for which very fast solvers are available nowadays [1]. Solving the problem yields a control \(\bm{u}(t)\) that minimizes the cost while satisfying the constraints (a real problem may include further constraints, such as bounds on the controls).
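To make the trajectory-optimization step concrete, below is a minimal sketch of such an OCP using a direct (forward-Euler) transcription and CasADi's Opti interface with IPOPT. This is an illustration only, not the FATROP-based setup [1] used in this work; the double-integrator dynamics and control-effort cost are just the toy choices from above, and the final time is fixed for simplicity.

```python
import casadi as ca
import numpy as np

# Illustrative settings; the final time is fixed here, whereas the OCP above also optimizes t_f.
T, N = 2.0, 50                # horizon length and number of control intervals
dt = T / N

opti = ca.Opti()
X = opti.variable(4, N + 1)   # states [px, py, vx, vy] at each knot point
U = opti.variable(2, N)       # controls [ax, ay] on each interval

x0 = np.array([0.0, 0.0, 0.0, 0.0])   # initial state
xg = np.array([1.0, 1.0, 0.0, 0.0])   # goal state

def f(x, u):
    # Double-integrator dynamics, as in the toy example above.
    return ca.vertcat(x[2], x[3], u[0], u[1])

# Running cost L = ||u||^2 (control effort), no terminal cost.
cost = 0
for k in range(N):
    cost += ca.sumsqr(U[:, k]) * dt
    # Forward-Euler transcription of the dynamics constraint.
    opti.subject_to(X[:, k + 1] == X[:, k] + dt * f(X[:, k], U[:, k]))

opti.minimize(cost)
opti.subject_to(X[:, 0] == x0)             # start at the initial state
opti.subject_to(X[:, N] == xg)             # reach the goal state
opti.subject_to(opti.bounded(-1, U, 1))    # example control bounds

opti.solver("ipopt")
sol = opti.solve()
u_opt = sol.value(U)   # optimal controls, shape (2, N)
```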

The problem, however, is that we'd like a controller that can drive the system from any state; this is a closed-loop controller or policy. In addition, we'd like to be able to tell the policy which goal it should reach; that is, the policy should be goal-conditioned. How do we obtain a goal-conditioned policy that is also efficient according to our measure of cost?

Our method

We sample many pairs of initial and goal states and obtain the corresponding optimal controls via trajectory optimization, yielding a dataset of optimal trajectories; we then use these trajectories as expert demonstrations to train a model such as a neural network. Given a state and a goal, the model should approximate (imitate) the actions of the expert controller; this is thus a form of imitation learning.
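A sketch of this dataset-generation step is given below. Here `sample_state` and `solve_ocp` are hypothetical helpers standing in for the actual state sampling and trajectory-optimization code, and the augmentation shown is one plausible reading of the intermediate-states-as-goals scheme mentioned in the abstract; the thesis report specifies the exact procedure.

```python
def generate_dataset(num_trajectories, sample_state, solve_ocp):
    """Build an imitation-learning dataset of (state, goal, expert control) tuples.

    `sample_state()` returns a random valid system state; `solve_ocp(x0, xg)`
    runs trajectory optimization and returns the discretized optimal states
    and controls. Both are hypothetical placeholders for this sketch.
    """
    dataset = []
    for _ in range(num_trajectories):
        x0, xg = sample_state(), sample_state()
        states, controls = solve_ocp(x0, xg)

        # One training example per time step: (current state, goal) -> expert control.
        for x_t, u_t in zip(states, controls):
            dataset.append((x_t, xg, u_t))

        # Augmentation: sub-arcs of an optimal trajectory are themselves optimal,
        # so intermediate states can also serve as goals for earlier states.
        for j in range(1, len(states)):
            for t in range(j):
                dataset.append((states[t], states[j], controls[t]))
    return dataset
```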

Having trained a policy on the dataset of optimal demonstrations, we can proceed to evaluate it in simulation; to that end, we measure

  • the policy's success rate, i.e. what percentage of the time it is successfully able to reach its goal state; and
  • its efficiency, that is, the cost it accumulates according to our OCP's cost functional; the lower this cost relative to the optimal solution, the better. A sketch of this evaluation loop is given below.
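In the sketch, `policy`, `env`, `running_cost` and `optimal_costs` are placeholders for the trained network, the simulator, the OCP's running cost \(L\), and the costs of the corresponding optimal trajectories; the success criterion is likewise illustrative, since real criteria are task-specific.

```python
import numpy as np

def evaluate(policy, env, goals, running_cost, optimal_costs,
             tol=1e-2, max_steps=600):
    """Roll out the policy towards each goal and report
    (success rate, mean cost ratio). All arguments are hypothetical placeholders."""
    successes, cost_ratios = [], []
    for goal, optimal_cost in zip(goals, optimal_costs):
        state = env.reset()
        accumulated_cost = 0.0
        for _ in range(max_steps):
            action = policy(state, goal)                     # goal-conditioned control
            accumulated_cost += running_cost(state, action) * env.dt
            state = env.step(action)
            if np.linalg.norm(state - goal) < tol:           # goal reached
                break
        successes.append(np.linalg.norm(state - goal) < tol)
        cost_ratios.append(accumulated_cost / optimal_cost)  # 1.0 would match the expert
    return float(np.mean(successes)), float(np.mean(cost_ratios))
```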

Results and videos

We test our method on four different dynamical systems: a cart-pole, a planar (two-dimensional) quadrotor, a three-dimensional quadrotor, and the Franka Emika Panda robot arm. The first three are based on the safe-control-gym library [2].

In the table below we report the wall-clock time of the dataset generation step for each task, measured on a laptop computer.

Task                          Number of trajectories    Dataset generation time (mm:ss)
Cart-pole                     20000                     03:08
Planar quadrotor              20000                     05:27
Three-dimensional quadrotor   20000                     11:23
Robot arm reaching            20000                     00:19

For all the aforementioned control tasks, we train neural network policies that achieve success rates greater than 96%. The following videos show trained neural network policies controlling each dynamical system in simulation. We refer to the thesis report for detailed quantitative evaluations.

Example cart-pole trajectories in simulation using a 3-layer 64-unit MLP policy running at 60Hz. Pink spheres indicate goal cart-pole positions.

Example planar quadrotor trajectories in simulation using a 3-layer 64-unit MLP policy running at 60Hz. Pink spheres indicate goal quadrotor positions.

Example three-dimensional quadrotor trajectories in simulation using a 5-layer 128-unit MLP policy running at 60Hz. Pink spheres indicate goal quadrotor positions.

Example robot arm trajectories in simulation using a 3-layer 64-unit MLP policy running at 50Hz. Pink spheres indicate goal end effector positions. We use the panda-gym library [3] for the simulation.
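The policies in these videos are small multilayer perceptrons that take the current state and the goal as input and output a control. As an illustrative sketch (assuming PyTorch; layer sizes are in the spirit of the 3-layer, 64-unit policies above, not the exact thesis architectures), such a goal-conditioned policy and its behavioral-cloning loss could look like the following.

```python
import torch
import torch.nn as nn

class GoalConditionedMLP(nn.Module):
    """Illustrative goal-conditioned policy: a small MLP mapping the
    concatenated (state, goal) vector to a control."""

    def __init__(self, state_dim, goal_dim, control_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, control_dim),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

# Training amounts to regressing the expert controls from the generated dataset,
# e.g. (dimensions here are illustrative):
# policy = GoalConditionedMLP(state_dim=4, goal_dim=4, control_dim=2)
# loss = nn.functional.mse_loss(policy(states, goals), expert_controls)
```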

We extract the following conclusions from our work:

  • Small neural networks can represent goal-conditioned policies that are efficient and achieve their goal with high success rates.
  • While a lower regression error sometimes corresponds to better policy performance, this is not always the case!
  • The three-dimensional quadrotor was the most challenging task, due to the system's relatively high dimensionality and nonlinearity.

Ideas for future research include evaluating the method on more varied control tasks, handling partial observability, and exploring alternative policy representations.

References

  1. L. Vanroye, A. Sathya, J. De Schutter, and W. Decré, “FATROP: A fast constrained optimal control problem solver for robot trajectory optimization and control”, in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2023, pp. 10036–10043.
  2. Z. Yuan et al., “Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics”, IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11142–11149, 2022. doi: 10.1109/LRA.2022.3196132.
  3. Q. Gallouédec et al., “panda-gym: Open-source goal-conditioned environments for robotic learning”, arXiv preprint arXiv:2106.13687, 2021.