We train neural networks to control various systems near-optimally towards arbitrary goals. In the videos above, pink markers (■) denote goals passed as input to the neural networks.
Imitation learning is a well-established approach to machine-learning-based control. However, its applicability depends on access to demonstrations, which are often expensive to collect or suboptimal for the task at hand. In this work, we present GCImOpt, an approach for learning efficient goal-conditioned policies by training on datasets generated by trajectory optimization. Our dataset generation procedure is computationally efficient, producing thousands of high-quality optimal trajectories in minutes on a laptop computer. Further, a data augmentation scheme that treats intermediate states as goals lets us increase the training dataset size by an order of magnitude. Using our generated datasets, we train goal-conditioned neural network policies that can drive the system towards arbitrary goals. To demonstrate the generality of our approach, we generate datasets and train policies for several control tasks: cart-pole stabilization, planar and three-dimensional quadcopter stabilization, and point reaching with a 6-DoF robot arm. Our trained policies achieve high success rates and near-optimal control profiles while remaining small (fewer than 80,000 neural network parameters) and fast (more than 6,000 times faster than a trajectory optimization solver in the best case), making them suitable for deployment on resource-constrained onboard controllers.
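The augmentation scheme can be sketched as follows: since the prefix of an optimal trajectory is itself an optimal trajectory to any state along it (Bellman's principle of optimality), every intermediate state can be relabeled as a goal, multiplying the number of training examples. This is a minimal sketch; the function name, array layout, and `stride` parameter are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def augment_with_intermediate_goals(states, controls, stride=10):
    """Expand one optimal trajectory into many (state, goal, control) examples.

    `states` has shape (T+1, state_dim) and `controls` shape (T, control_dim).
    Besides the final state, every `stride`-th intermediate state is also
    treated as a goal: the prefix of an optimal trajectory is itself optimal
    for reaching any state along it.
    """
    examples = []
    T = len(controls)
    goal_indices = list(range(stride, T, stride)) + [T]
    for g in goal_indices:
        goal = states[g]
        for t in range(g):  # every step before the goal yields one example
            examples.append((states[t], goal, controls[t]))
    return examples

# Toy usage: a 1-D trajectory of 20 steps yields 30 examples instead of 20;
# denser strides on longer trajectories give much larger multipliers.
states = np.linspace(0.0, 1.0, 21).reshape(-1, 1)
controls = np.full((20, 1), 0.05)
data = augment_with_intermediate_goals(states, controls, stride=10)
print(len(data))  # → 30
```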
We sample many pairs of initial and goal states and obtain the corresponding optimal controls via trajectory optimization, yielding a dataset of optimal trajectories; we then treat these trajectories as expert demonstrations. We train neural networks on these demonstrations by behavior cloning, so that, given the current state and goal, they approximate the optimal control.
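Behavior cloning on such a dataset reduces to supervised regression from (state, goal) inputs to expert controls. The sketch below uses a linear policy fit by ordinary least squares purely to make the regression setup concrete; the paper trains small MLPs on the same mean-squared-error objective, and the synthetic "expert" here is an assumption standing in for trajectory-optimization outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert" data: the optimal control is a fixed linear function of
# state and goal, plus noise (a stand-in for trajectory-optimization output).
state_dim, goal_dim, control_dim, n = 4, 4, 2, 1000
W_true = rng.normal(size=(state_dim + goal_dim, control_dim))
X = rng.normal(size=(n, state_dim + goal_dim))             # [state, goal]
U = X @ W_true + 0.01 * rng.normal(size=(n, control_dim))  # expert controls

# Behavior cloning = supervised regression from (state, goal) to control.
# With a linear policy this is ordinary least squares; a neural network
# would minimize the same mean-squared error by gradient descent.
Xb = np.hstack([X, np.ones((n, 1))])                       # bias feature
W_fit, *_ = np.linalg.lstsq(Xb, U, rcond=None)

mse = np.mean((Xb @ W_fit - U) ** 2)
print(f"training MSE: {mse:.5f}")  # close to the 1e-4 noise floor
```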
Having trained a policy on the dataset of optimal demonstrations, we evaluate it in simulation; to that end, we measure the policy's success rate in reaching the goal and its inference time.
We test our method on four different dynamical systems: a cart-pole system, a two-dimensional (planar) quadrotor, a three-dimensional quadrotor, and the Franka Emika Panda robot arm. The first three are based on the safe-control-gym library [2], while the robot arm is simulated with the panda-gym library [3].
In the table below, we report the wall-clock time of the dataset generation step for each task, measured on a laptop computer.
| Task | Number of trajectories | Dataset generation time (mm:ss) |
|---|---|---|
| Cart-pole | 20000 | 03:08 |
| Planar quadrotor | 20000 | 05:27 |
| Three-dimensional quadrotor | 20000 | 11:23 |
| Robot arm reaching | 20000 | 00:19 |
For all the aforementioned control tasks, we train neural network policies that achieve success rates above 97%. Moreover, our policies run 97 to 6278 times faster than the Fatrop [1] trajectory optimization solver, suggesting that GCImOpt policies could match or exceed the control frequency of MPC controllers. We refer to the full paper for detailed quantitative evaluations.
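Inference cost at this scale is easy to estimate: a policy under 80,000 parameters is a few small matrix-vector products per control step. The sketch below times the forward pass of a randomly initialized tanh MLP; the architecture and timing harness are illustrative assumptions, not the paper's exact setup, and absolute numbers will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes):
    """Random weights for a tanh MLP; `sizes` like [8, 128, 128, 2].
    (Architecture is an illustrative assumption, not the paper's.)"""
    return [(rng.normal(size=(a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b  # linear output layer -> control command

params = make_mlp([8, 128, 128, 2])  # ~18k parameters, well under 80k
x = rng.normal(size=(1, 8))          # one concatenated (state, goal) input

n_calls = 1000
t0 = time.perf_counter()
for _ in range(n_calls):
    u = forward(params, x)
t1 = time.perf_counter()
print(f"mean inference latency: {(t1 - t0) / n_calls * 1e6:.1f} us")
```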
We extract the following conclusions from our work:
Ideas for future research include evaluating the method on more varied control tasks, handling sensor/actuator noise and partial observability, and exploring alternative policy representations.
```bibtex
@article{goikoetxea2026gcimopt,
  title={{GCImOpt}: Learning efficient goal-conditioned policies by imitating optimal trajectories},
  author={Goikoetxea, Jon and Palaci{\'{a}}n, Jes{\'{u}}s F.},
  journal={arXiv preprint arXiv:2604.22724},
  year={2026}
}
```