We train neural networks to efficiently control various dynamical systems towards arbitrary goals. In the videos above, pink markers (■) denote the goals passed as input to the neural networks.
Imitation learning is a well-established approach for machine-learning-based control. However, its applicability depends on having access to demonstrations, which are often expensive to collect and/or suboptimal for the task at hand. In this work, we present GCImOpt, an approach for learning efficient goal-conditioned policies by training on datasets generated with trajectory optimization. Our dataset generation procedure is computationally efficient, producing thousands of high-quality optimal trajectories within minutes on a laptop computer. Further, a data augmentation scheme that treats intermediate states as goals lets us increase the training dataset size by an order of magnitude. Using the generated datasets, we train goal-conditioned neural network policies that can drive the system towards arbitrary goals. To demonstrate the generality of our approach, we generate datasets and train policies for several control tasks, namely cart-pole stabilization, planar and three-dimensional quadrotor stabilization, and point reaching with a 6-DoF robot arm. We show that the trained policies achieve high success rates and efficient control profiles, while remaining small enough (fewer than 80,000 neural network parameters) to be deployed on resource-constrained onboard controllers.
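To make the augmentation concrete, the following sketch shows one plausible form of such a relabeling, in which every later state on an optimal trajectory is reused as a goal for the transitions preceding it; the array shapes and the all-pairs pairing are illustrative assumptions, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def relabel_trajectory(states, actions):
    """Hindsight-style augmentation sketch: reuse intermediate states of an
    optimal trajectory as additional goals (illustrative, not necessarily the
    paper's exact scheme).

    states:  (T + 1, state_dim) array of visited states
    actions: (T, action_dim) array of optimal controls

    Returns (state, goal, action) tuples where every later state on the
    trajectory serves as a goal for each earlier transition.
    """
    tuples = []
    T = len(actions)
    for t in range(T):
        for k in range(t + 1, T + 1):
            tuples.append((states[t], states[k], actions[t]))
    return tuples

# Toy example: a 5-step trajectory gives 5 original (state, goal, action)
# pairs, but 5 + 4 + 3 + 2 + 1 = 15 augmented ones.
rng = np.random.default_rng(0)
states = rng.normal(size=(6, 4))
actions = rng.normal(size=(5, 1))
print(len(relabel_trajectory(states, actions)))  # 15
```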
We sample many pairs of initial and goal states and obtain the corresponding optimal controls via trajectory optimization, yielding a dataset of optimal trajectories; these trajectories then serve as expert demonstrations for training a model such as a neural network. Given a state and a goal, the model should approximate (imitate) the actions of the expert controller; this is therefore a form of imitation learning.
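As an illustration of the imitation step, the sketch below trains a small goal-conditioned multilayer perceptron with a behavioral-cloning (mean-squared-error) loss on flattened (state, goal, action) tuples; the layer sizes and hyperparameters are assumptions chosen to stay well under the 80,000-parameter budget mentioned above, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Small MLP mapping a (state, goal) pair to a control action."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def behavioral_cloning(policy, states, goals, actions, epochs=100, lr=1e-3):
    """Regress the expert actions of the optimal trajectories (MSE loss)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(policy(states, goals), actions)
        loss.backward()
        opt.step()
    return policy

# Toy usage with random tensors standing in for the generated dataset.
policy = GoalConditionedPolicy(state_dim=4, goal_dim=4, action_dim=1)
print(sum(p.numel() for p in policy.parameters()))  # a few thousand parameters
states, goals, actions = torch.randn(256, 4), torch.randn(256, 4), torch.randn(256, 1)
behavioral_cloning(policy, states, goals, actions, epochs=5)
```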
Having trained a policy on the dataset of optimal demonstrations, we can evaluate it in simulation; to that end, we measure how reliably and how efficiently the trained policies reach the commanded goals.
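For illustration, a minimal evaluation loop could look like the sketch below, which records the fraction of rollouts whose final state lands close to the commanded goal; the Gym-style `reset`/`step` interface, the distance threshold, and the episode horizon are placeholders rather than the paper's exact protocol.

```python
import numpy as np
import torch

def evaluate_success_rate(policy, env, sample_goal, episodes=100,
                          horizon=500, tol=0.05):
    """Roll out a goal-conditioned policy and count how often the final state
    ends up within `tol` of the commanded goal (illustrative metric)."""
    successes = 0
    for _ in range(episodes):
        state = env.reset()          # assumes a Gym-style environment
        goal = sample_goal()         # user-supplied goal sampler
        for _ in range(horizon):
            with torch.no_grad():
                action = policy(torch.as_tensor(state, dtype=torch.float32),
                                torch.as_tensor(goal, dtype=torch.float32))
            state, _, done, _ = env.step(action.numpy())
            if done:
                break
        if np.linalg.norm(state - goal) < tol:
            successes += 1
    return successes / episodes
```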
We test our method on four different dynamical systems: a cart-pole system, a two-dimensional (planar) quadrotor, a three-dimensional quadrotor, and the Franka Emika Panda robot arm. The first three are based on the safe-control-gym library [2], while the robot arm task uses the panda-gym library [3] for the simulation.
In the table below, we report the wall-clock time of the dataset generation step for each task, measured on a laptop computer; a sketch of how such bulk generation could be organized follows the table.
| Task | Number of trajectories | Dataset generation time (mm:ss) |
|---|---|---|
| Cart-pole | 20000 | 03:08 |
| Planar quadrotor | 20000 | 05:27 |
| Three-dimensional quadrotor | 20000 | 11:23 |
| Robot arm reaching | 20000 | 00:19 |
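The timings above suggest that individual trajectory-optimization calls are cheap enough to be run in bulk. Below is a rough sketch of how such a dataset could be assembled by distributing independently sampled (initial state, goal state) pairs across CPU cores; `solve_trajectory_optimization` is a hypothetical stand-in for whatever optimizer is used, and the actual generation pipeline may differ.

```python
import numpy as np
from multiprocessing import Pool

def solve_trajectory_optimization(pair):
    """Hypothetical placeholder: return optimal (states, actions) for one
    (initial state, goal state) pair. In practice this would call a
    trajectory optimizer tailored to the system at hand."""
    x0, xg = pair
    # Dummy "trajectory" (linear interpolation) just to keep the sketch runnable.
    states = np.linspace(x0, xg, num=50)
    actions = np.diff(states, axis=0)
    return states, actions

def generate_dataset(sample_pair, n_trajectories=20000, workers=8):
    """Sample independent (initial, goal) pairs and solve them in parallel."""
    pairs = [sample_pair() for _ in range(n_trajectories)]
    with Pool(workers) as pool:
        return pool.map(solve_trajectory_optimization, pairs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dataset = generate_dataset(lambda: (rng.normal(size=4), rng.normal(size=4)),
                               n_trajectories=100, workers=4)
    print(len(dataset))  # 100 (states, actions) trajectories
```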
For all the aforementioned control tasks, we train neural network policies that achieve success rates greater than 96%. The following videos show the trained neural network policies controlling each dynamical system in simulation. We refer the reader to the full paper for detailed quantitative evaluations.
We draw the following conclusions from our work:
Ideas for future research include evaluating the method on more varied control tasks, handling partial observability, and exploring alternative policy representations.