Geometry-Aware Policy Imitation (GPI)


Yiming Li1,2 Nael Darwiche2 Amirreza Razmjoo1,2 Sichao Liu2
Yilun Du3 Auke Ijspeert2 Sylvain Calinon1,2
1 Idiap Research Institute  |  2 EPFL  |  3 Harvard University

Simple, Fast and Flexible

Figure 1 from the paper summarizing the GPI pipeline and Push-T results.

Evaluations on the Push-T benchmark

Talk

Watch the GPI overview and a real-robot deployment from the paper.

Method overview and fruit delivery task: progression and attraction flows cooperate to stay close to demonstrations while navigating clutter.

Real-world box flip on the ALOHA platform, showcasing multimodal demonstrations and robustness to visual disturbances.

Motivation

Diffusion and flow-matching policies deliver strong imitation performance but are computationally heavy and hard to interpret. GPI reinterprets demonstrations Γ as geometric curves, builds distance fields d(x | Γ), and induces two primitives: (i) a progression flow along the demonstration tangent and (ii) an attraction flow given by the negative distance gradient. Their superposition yields a controllable, non-parametric vector field that is fast, interpretable, and robust.
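As a concrete illustration, the sketch below approximates d(x | Γ) for a single demonstration stored as a discretized curve. This is a minimal sketch, not the paper's implementation; the names distance_to_demo and demo are illustrative.

```python
import numpy as np

def distance_to_demo(x, demo):
    """Approximate d(x | Gamma) as the distance to the nearest demo waypoint."""
    dists = np.linalg.norm(demo - x, axis=1)  # Euclidean distance to each waypoint
    idx = int(np.argmin(dists))               # index of the closest waypoint
    return dists[idx], idx

# Toy 2D demonstration: a quarter circle discretized into 100 waypoints.
t = np.linspace(0.0, np.pi / 2, 100)
demo = np.stack([np.cos(t), np.sin(t)], axis=1)

d, idx = distance_to_demo(np.array([0.5, 0.2]), demo)
print(f"d(x | Gamma) ≈ {d:.3f}, closest waypoint index {idx}")
```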

Figure 2 illustrating three ways to obtain latent embeddings.
Figure 2 in the paper outlines the three feature pipelines we experiment with: a lightweight task-specific encoder, an autoencoder that learns task-agnostic latents, and pretrained models (e.g., CLIP or SAM). GPI only needs the distance in this latent space, so any of these choices can be swapped in without retraining the controller.
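A minimal sketch of this decoupling, assuming the encoder is any callable that maps observations to feature vectors; latent_distance and the random-projection encode below are placeholders, not the paper's feature pipelines.

```python
import numpy as np

def latent_distance(obs, demo_obs, encode):
    """Distance from the current observation to each demo observation,
    measured in the encoder's latent space."""
    z = encode(obs)                                   # latent of the current observation
    z_demo = np.stack([encode(o) for o in demo_obs])  # latents of the demo observations
    return np.linalg.norm(z_demo - z, axis=1)

# Placeholder encoder: a fixed random projection of a flattened 64x64 RGB image.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64 * 64 * 3))
encode = lambda img: W @ img.reshape(-1)

obs = rng.random((64, 64, 3))
demo_obs = [rng.random((64, 64, 3)) for _ in range(5)]
print(latent_distance(obs, demo_obs, encode).round(2))
```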

Progression flow

The progression component advances the system along the expert trajectory using its local tangent u_demo(x). It ensures forward motion and task completion without backtracking.

π_prog(x) = λ₁(x) · u_demo(x)
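A minimal sketch of the progression primitive under the same discretized-demo assumption: the tangent u_demo(x) is taken as the normalized finite difference at the waypoint closest to x. The names progression_flow and lam1 are illustrative.

```python
import numpy as np

def progression_flow(x, demo, lam1=1.0):
    """π_prog(x) = λ₁ · u_demo(x), with u_demo the local tangent of the demo."""
    dists = np.linalg.norm(demo - x, axis=1)
    idx = int(np.argmin(dists))
    nxt = min(idx + 1, len(demo) - 1)          # clamp at the final waypoint
    tangent = demo[nxt] - demo[idx]            # local forward direction of the demo
    norm = np.linalg.norm(tangent)
    u = tangent / norm if norm > 1e-8 else np.zeros_like(tangent)
    return lam1 * u

t = np.linspace(0.0, np.pi / 2, 100)
demo = np.stack([np.cos(t), np.sin(t)], axis=1)
print(progression_flow(np.array([0.9, 0.1]), demo))
```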
Figure 3 showing demonstrations, distance landscape, and flow fields.
Figure 3 from the paper: (a) two demonstrations forming a Y-shape, (b) the composed distance landscape, and (c–d) the induced flow fields. The top row shows pure progression following tangents, while the bottom row adds attraction to pull the policy back toward either branch. The combined field (right) illustrates how GPI produces smooth, bifurcating behavior without training.

Attraction flow

The attraction component corrects deviations by following the negative gradient of the distance field in the actuated subspace. This stabilizes the dynamics, pulling the state back toward the demonstration manifold.

π_attr(x) = − λ₂(x) · ∇_{x′} d(x | Γ)
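A matching sketch of the attraction primitive: with the nearest-waypoint distance above, the distance gradient is the unit vector pointing from the closest waypoint toward x, so the negative gradient pulls the state back onto the demonstration. The names attraction_flow and lam2 are illustrative, and the projection onto the actuated subspace x′ is omitted for brevity.

```python
import numpy as np

def attraction_flow(x, demo, lam2=1.0):
    """π_attr(x) = −λ₂ · ∇ d(x | Γ) for the nearest-waypoint distance."""
    dists = np.linalg.norm(demo - x, axis=1)
    idx = int(np.argmin(dists))
    offset = x - demo[idx]                     # points from the demo toward x
    norm = np.linalg.norm(offset)
    grad = offset / norm if norm > 1e-8 else np.zeros_like(offset)
    return -lam2 * grad                        # descend the distance field

t = np.linspace(0.0, np.pi / 2, 100)
demo = np.stack([np.cos(t), np.sin(t)], axis=1)
print(attraction_flow(np.array([0.5, 0.2]), demo))
```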

Multimodality

GPI preserves distinct demonstrations as separate models and composes the K nearest via soft weights w_i(x) ∝ exp(−β · d(x | Γ^{(i)})), enabling natural multimodal behavior without mode collapse.

π(x) = Σ_i w_i(x) · [ λ₁ u_demo^{(i)}(x) − λ₂ ∇_{x′} d(x | Γ^{(i)}) ]
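Putting the pieces together, the sketch below blends the per-demonstration flows of the K closest curves with the softmax weights above. gpi_policy and the toy Y-shaped demonstrations are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gpi_policy(x, demos, lam1=1.0, lam2=1.0, beta=5.0, K=2):
    """Blend progression + attraction flows of the K closest demonstrations."""
    flows, dists = [], []
    for demo in demos:
        d = np.linalg.norm(demo - x, axis=1)
        idx = int(np.argmin(d))
        nxt = min(idx + 1, len(demo) - 1)
        tangent = demo[nxt] - demo[idx]                  # progression direction
        tangent = tangent / (np.linalg.norm(tangent) + 1e-8)
        offset = x - demo[idx]                           # ∇d points away from the demo
        grad = offset / (np.linalg.norm(offset) + 1e-8)
        flows.append(lam1 * tangent - lam2 * grad)       # per-demo flow
        dists.append(d[idx])
    dists = np.array(dists)
    nearest = np.argsort(dists)[:K]                      # K closest demonstrations
    w = np.exp(-beta * dists[nearest])
    w = w / w.sum()                                      # w_i ∝ exp(−β · d(x | Γ^(i)))
    return sum(wi * flows[i] for wi, i in zip(w, nearest))

# Two toy demonstrations forming a Y-shape, as in Figure 3.
s = np.linspace(0.0, 1.0, 50)[:, None]
stem = np.hstack([np.zeros_like(s), s])
left = np.vstack([stem, np.hstack([-s, 1.0 + s])])
right = np.vstack([stem, np.hstack([s, 1.0 + s])])
print(gpi_policy(np.array([0.1, 1.2]), [left, right]))
```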

Decoupling representation and policy

We explicitly separate the learned distance metric (from vision/state encoders) from policy synthesis. Encoders can be swapped or fine-tuned independently; the reactive controller remains a simple first-order system.
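A sketch of what "simple first-order system" means in practice, under the assumption that actions are applied by integrating the composed flow. rollout is an illustrative helper, and the toy policy below merely contracts to the origin.

```python
import numpy as np

def rollout(x0, policy, dt=0.05, steps=200):
    """First-order integration x_{t+1} = x_t + dt · π(x_t)."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * policy(x)    # follow the composed flow field
        traj.append(x.copy())
    return np.stack(traj)

# Swapping the encoder only changes how π is computed; the integrator is unchanged.
traj = rollout([1.0, 0.5], lambda x: -x)   # toy flow contracting to the origin
print(traj[-1].round(3))
```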

Limitations

GPI relies on meaningful distance fields; poor representation learning can degrade attraction. Extremely long-horizon tasks may require waypointing or light receding-horizon planning. Safety constraints are not enforced by default.

Results

Push-T snapshot (state vs. vision; per-step latency and memory).

| Method | State Avg / Max | State Train / Infer | State Memory | Vision Avg / Max | Vision Train / Infer | Vision Memory |
|---|---|---|---|---|---|---|
| DDPM | 82.3 / 86.3 | 1.0 h / 641 ms | 252 MB | 80.9 / 85.5 | 2.5 h / 647 ms | 353 MB |
| DDIM | 81.5 / 85.1 | 1.0 h / 65 ms | 252 MB | 79.1 / 83.1 | 2.5 h / 67 ms | 353 MB |
| GPI (Ours) | 85.8 / 89.0 | 0 h / 0.6 ms | 0.7 MB | 83.3 / 86.9 | 0.3 h / 3.3 ms | 44 MB |

Evaluation follows standard Push-T protocols; latencies are per-step.

Figure 4 (below) studies receding-horizon rollouts. Even when we plan over 64-step horizons, both state and vision settings retain strong performance, showing the geometric controller can operate reactively or in a longer-horizon mode without degradation.

Figure 4 plotting reward versus action horizon for state and vision inputs.
Figure 4: Reward versus planning horizon. Solid curves show best runs; dashed curves show averages across seeds.

Figure 5 (next) highlights how performance scales with demonstration coverage. Whether actions are expressed in a relative or absolute frame, increasing the demonstration subset size consistently boosts average reward, and performance remains stable across different neighbor counts K.

Figure 5 showing average reward versus subset size for relative and absolute actions.
Figure 5: Demonstration density analysis. Each curve corresponds to a different number of blended neighbors.

Figure 7 explores the interplay between the progression and attraction primitives. A broad range of weights (λ₁, λ₂) produces high reward, underscoring that the two fields combine smoothly without delicate tuning.

Figure 7 heatmap of average reward as progression and attraction weights vary.
Figure 7: Heatmap of average reward as progression (λ₁) and attraction (λ₂) coefficients vary.

Figures 8 and 9 showcase the real-robot evaluations described in Section 3.2 of the paper. The ALOHA clip (top) captures multiple successful box-flip trajectories, while the Franka arm experiment (bottom) demonstrates human–robot interaction where the robot reacts to a user presenting fruit.

Figure 8 montage of the ALOHA box-flip trials.
Figure 8: Multiple rollouts of the box-flip task on ALOHA, illustrating multimodal strategies learned from demonstrations.
Figure 9 showing the Franka arm handing fruit during a human-robot interaction trial.
Figure 9: Human–robot interaction on the Franka platform. The geometry-aware flows adapt to new fruit placements presented by a human partner.

BibTeX

@inproceedings{anonymous2026gpi,
  title     = {Geometry-Aware Policy Imitation},
  author    = {Anonymous},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  note      = {Under review}
}