We train neural networks to control various systems near-optimally towards arbitrary goals. In the videos above, pink markers (■) denote goals passed as input to the neural networks.
Imitation learning is a well-established approach to machine-learning-based control. However, its applicability depends on access to demonstrations, which are often expensive to collect or suboptimal for the task at hand. In this work, we present GCImOpt, an approach for learning efficient goal-conditioned policies by training on datasets generated by trajectory optimization. Our dataset generation procedure is computationally efficient, producing thousands of high-quality optimal trajectories in minutes on a laptop computer. Further, a data augmentation scheme that treats intermediate states as goals lets us increase the training dataset size by an order of magnitude. Using our generated datasets, we train goal-conditioned neural network policies that can drive the system towards arbitrary goals. To demonstrate the generality of our approach, we generate datasets and train policies for several control tasks: cart-pole stabilization, planar and three-dimensional quadcopter stabilization, and point reaching with a 6-DoF robot arm. Our trained policies achieve high success rates and near-optimal control profiles while remaining small (fewer than 80,000 neural network parameters) and fast (more than 6,000 times faster than a trajectory optimization solver in the best case), making them suitable for deployment on resource-constrained onboard controllers.
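The augmentation scheme can be sketched as follows: since the prefix of an optimal trajectory is itself an optimal trajectory to any state along it (Bellman's principle of optimality), every intermediate state can be relabeled as a goal, multiplying the number of training examples. This is a minimal sketch; the function name, array layout, and `stride` parameter are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def augment_with_intermediate_goals(states, controls, stride=10):
    """Expand one optimal trajectory into many (state, goal, control) examples.

    `states` has shape (T+1, state_dim) and `controls` shape (T, control_dim).
    Besides the final state, every `stride`-th intermediate state is also
    treated as a goal: the prefix of an optimal trajectory is itself optimal
    for reaching any state along it.
    """
    examples = []
    T = len(controls)
    goal_indices = list(range(stride, T, stride)) + [T]
    for g in goal_indices:
        goal = states[g]
        for t in range(g):  # every step before the goal yields one example
            examples.append((states[t], goal, controls[t]))
    return examples

# Toy usage: a 1-D trajectory of 20 steps yields 30 examples instead of 20;
# denser strides on longer trajectories give much larger multipliers.
states = np.linspace(0.0, 1.0, 21).reshape(-1, 1)
controls = np.full((20, 1), 0.05)
data = augment_with_intermediate_goals(states, controls, stride=10)
print(len(data))  # → 30
```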
We sample many pairs of initial and goal states and obtain the corresponding optimal controls via trajectory optimization, yielding a dataset of optimal trajectories; we then treat these trajectories as expert demonstrations. We train neural networks on these demonstrations by behavior cloning, so that, given the current state and goal, they approximate the optimal control.
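Behavior cloning on such a dataset reduces to supervised regression from (state, goal) inputs to expert controls. The sketch below uses a linear policy fit by ordinary least squares purely to make the regression setup concrete; the paper trains small MLPs on the same mean-squared-error objective, and the synthetic "expert" here is an assumption standing in for trajectory-optimization outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert" data: the optimal control is a fixed linear function of
# state and goal, plus noise (a stand-in for trajectory-optimization output).
state_dim, goal_dim, control_dim, n = 4, 4, 2, 1000
W_true = rng.normal(size=(state_dim + goal_dim, control_dim))
X = rng.normal(size=(n, state_dim + goal_dim))             # [state, goal]
U = X @ W_true + 0.01 * rng.normal(size=(n, control_dim))  # expert controls

# Behavior cloning = supervised regression from (state, goal) to control.
# With a linear policy this is ordinary least squares; a neural network
# would minimize the same mean-squared error by gradient descent.
Xb = np.hstack([X, np.ones((n, 1))])                       # bias feature
W_fit, *_ = np.linalg.lstsq(Xb, U, rcond=None)

mse = np.mean((Xb @ W_fit - U) ** 2)
print(f"training MSE: {mse:.5f}")  # close to the 1e-4 noise floor
```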
Having trained a policy on the dataset of optimal demonstrations, we evaluate it in simulation; to that end, we measure the policy's success rate in reaching the goal and its inference time.
We test our method on four different dynamical systems: a cart-pole system, a two-dimensional (planar) quadrotor, a three-dimensional quadrotor, and the Franka Emika Panda robot arm. The first three are based on the safe-control-gym library [2], while the robot arm is simulated with the panda-gym library [3].
In the table below, we report the wall-clock time of the dataset generation step for each task, measured on a laptop computer.
| Task | Number of trajectories | Dataset generation time (mm:ss) |
|---|---|---|
| Cart-pole | 20000 | 03:08 |
| Planar quadrotor | 20000 | 05:27 |
| Three-dimensional quadrotor | 20000 | 11:23 |
| Robot arm reaching | 20000 | 00:19 |
For all the aforementioned control tasks, we train neural network policies that achieve success rates above 97%. Moreover, our policies run 97 to 6278 times faster than the Fatrop [1] trajectory optimization solver, suggesting that GCImOpt policies could match or exceed the control frequency of MPC controllers. We refer to the full paper for detailed quantitative evaluations.
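Inference cost at this scale is easy to estimate: a policy under 80,000 parameters is a few small matrix-vector products per control step. The sketch below times the forward pass of a randomly initialized tanh MLP; the architecture and timing harness are illustrative assumptions, not the paper's exact setup, and absolute numbers will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes):
    """Random weights for a tanh MLP; `sizes` like [8, 128, 128, 2].
    (Architecture is an illustrative assumption, not the paper's.)"""
    return [(rng.normal(size=(a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b  # linear output layer -> control command

params = make_mlp([8, 128, 128, 2])  # ~18k parameters, well under 80k
x = rng.normal(size=(1, 8))          # one concatenated (state, goal) input

n_calls = 1000
t0 = time.perf_counter()
for _ in range(n_calls):
    u = forward(params, x)
t1 = time.perf_counter()
print(f"mean inference latency: {(t1 - t0) / n_calls * 1e6:.1f} us")
```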
We extract the following conclusions from our work:
Ideas for future research include evaluating the method on more varied control tasks, handling sensor/actuator noise and partial observability, and exploring alternative policy representations.
```bibtex
@article{goikoetxea2026gcimopt,
  title={{GCImOpt}: Learning efficient goal-conditioned policies by imitating optimal trajectories},
  author={Goikoetxea, Jon and Palaci{\'{a}}n, Jes{\'{u}}s F.},
  journal={arXiv preprint arXiv:2604.22724},
  year={2026}
}
```