ImplicitRDP:
An End-to-End Visual-Force Diffusion Policy
with Structural Slow-Fast Learning

Wendi Chen\(^{1,2}\), Han Xue\(^{1}\), Yi Wang\(^{1,2}\), Fangyuan Zhou\(^{1,2}\), Jun Lv\(^{1,3}\), Yang Jin\(^{1}\), Shirun Tang\(^{3}\),
Chuan Wen\(^{1\dagger}\), Cewu Lu\(^{1,2,3\dagger}\)
\(^{1}\)Shanghai Jiao Tong University, \(^{2}\)Shanghai Innovation Institute, \(^{3}\)Noematrix Ltd.
\(\dagger\)Equal advising

ImplicitRDP is a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control. By leveraging Structural Slow-Fast Learning (SSL), it performs closed-loop adjustments at high frequency while maintaining temporal coherence. Additionally, Virtual-target-based Representation Regularization (VRR) prevents modality collapse, enabling robust performance in challenging contact-rich tasks.

Abstract

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities.

In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks.

Furthermore, to mitigate modality collapse, where end-to-end models fail to adaptively weight the different input modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction.

Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline.

Method

Structural Slow-Fast Learning (SSL)

ImplicitRDP is an end-to-end policy that unifies low-frequency visual planning and high-frequency force control. Instead of a hand-designed hierarchy like RDP, we utilize Structural Slow-Fast Learning (SSL). We concatenate slow visual tokens and fast force tokens into a unified sequence. By employing a temporally causal structure (a GRU and a causal attention mask), the policy can perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks.

Network Architecture

Fig. 1: Network Architecture. We enforce a temporally causal structure using a GRU for force signal encoding and a causal attention mask for action-force interaction. This design enables the model to effectively process asynchronous visual and force tokens simultaneously.
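The listing below gives a minimal PyTorch sketch of this structure. The token layout, GRU encoder, and mask convention are illustrative assumptions rather than the released implementation: visual tokens serve as always-visible global context, while a force or action token may only attend to tokens at the same or an earlier timestep.

```python
import torch
import torch.nn as nn

class ForceEncoder(nn.Module):
    """Causal encoder for the high-frequency force stream (assumed 6-axis)."""
    def __init__(self, force_dim=6, d_model=256):
        super().__init__()
        self.gru = nn.GRU(force_dim, d_model, batch_first=True)

    def forward(self, force_seq):           # (B, Nf, force_dim)
        tokens, _ = self.gru(force_seq)     # step t only sees readings <= t
        return tokens                       # (B, Nf, d_model)

def causal_mask(n_vis, force_t, action_t):
    """Boolean attention mask over the unified [visual | force | action]
    sequence. Visual tokens get t = -inf so every token can attend to them;
    otherwise token i may attend to token j only if t[j] <= t[i].
    True means blocked, matching torch.nn.MultiheadAttention's attn_mask."""
    t = torch.cat([torch.full((n_vis,), float("-inf")), force_t, action_t])
    return t[None, :] > t[:, None]          # (N, N): True where j is in i's future
```

In this sketch, each newly arrived force reading advances the GRU by one step and adds one more visible token, which is what permits closed-loop adjustment at the force frequency without discarding the rest of the action chunk.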


Virtual-target-based Representation Regularization (VRR)

To mitigate modality collapse, where end-to-end models ignore force feedback, we introduce Virtual-target-based Representation Regularization (VRR). Rather than predicting raw future forces, we train the model to predict a "virtual target" derived from compliance control theory. This maps force feedback into the same space as the action and adaptively weights the loss based on force magnitude, forcing the model to internally align force feedback with motion planning.

Virtual Target Representation

Fig. 2: Virtual Target. The virtual target represents the trajectory position that the robot intends to track under a specific stiffness. We apply adaptive stiffness to weight high-force contact events more heavily, encouraging the policy to attend to critical force feedback.
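To make this concrete, the sketch below follows standard compliance control: under a diagonal stiffness \(K\), a measured contact force \(f\) at end-effector position \(x\) implies the virtual target \(x_v = x + K^{-1}f\), i.e., the position a compliant tracker would have to command to produce \(f\). The force-magnitude weighting here is an illustrative stand-in for the paper's adaptive stiffness; names and shapes are assumptions.

```python
import torch

def virtual_target(x, f, stiffness):
    """x: (B, 3) measured end-effector position, f: (B, 3) contact force,
    stiffness: (3,) diagonal stiffness K. Returns x_v = x + K^{-1} f."""
    return x + f / stiffness

def vrr_loss(pred_xv, x, f, stiffness, eps=1.0):
    """Auxiliary regression to the virtual target. The weight grows with
    force magnitude so high-force contact events dominate the objective
    (a simple stand-in for the paper's adaptive stiffness)."""
    target = virtual_target(x, f, stiffness)
    w = 1.0 + f.norm(dim=-1, keepdim=True) / eps   # heavier weight in contact
    return (w * (pred_xv - target) ** 2).mean()
```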

Experiments

Q1: Comparison with Baselines

We compare ImplicitRDP against a vision-only method (DP) and a hierarchical baseline (RDP). Experiments on Box Flipping and Switch Toggling demonstrate that our end-to-end approach significantly outperforms baselines. While DP fails to regulate force and RDP struggles with contact precision, ImplicitRDP achieves the highest success rates by effectively integrating visual planning with reactive force control.

TABLE I: Success Rate Compared with Baseline Methods

Method               Box Flipping   Switch Toggling
DP                   0/20           8/20
RDP                  16/20          10/20
ImplicitRDP (Ours)   18/20          18/20

Q2: Effectiveness of Closed-Loop Control

To validate the effectiveness of Structural Slow-Fast Learning (SSL), we compare the full model against open-loop variants. Removing SSL leads to a significant performance drop, especially in tasks requiring sustained force maintenance like Box Flipping. The results confirm that the closed-loop force control enabled by SSL is critical for contact-rich manipulation.

TABLE II: Comparison Between Open-Loop and Closed-Loop Control

Method                      Box Flipping   Switch Toggling
ImplicitRDP w/o SSL & VRR   6/20           5/20
ImplicitRDP w/o SSL         4/20           15/20
ImplicitRDP (Ours)          18/20          18/20

Q3: Choice of Auxiliary Task

We investigate the impact of Virtual-target-based Representation Regularization (VRR) compared with standard force prediction and with no auxiliary task. VRR consistently yields the best performance, while the other auxiliary settings fail to effectively exploit the high-frequency force signals, resulting in premature loss of contact.

TABLE III: Comparison of Different Auxiliary Tasks

Auxiliary Task                     Box Flipping   Switch Toggling
None                               6/20           6/20
Force Prediction                   8/20           10/20
Virtual Target Prediction (Ours)   18/20          18/20

As visualized in the attention map below, VRR encourages the network to adaptively attend to force tokens during contact phases, whereas models without VRR fail to utilize force information effectively.

Attention Weight Visualization

Fig. 3: Attention Weight Visualization. We visualize the summed attention weights of visual tokens and force tokens from the first transformer layer in the Switch Toggling task. VRR enables adaptive attention to different modalities during different phases.
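The sketch below shows how such per-modality curves can be extracted, assuming first-layer attention weights of shape (batch, heads, N, N) and the [visual | force | action] token ordering from the earlier sketch; both are assumptions about the implementation.

```python
import torch

def modality_attention(attn, n_vis, n_force):
    """attn: (B, H, N, N) first-layer attention weights.
    Returns the total attention mass each action-token query places on
    visual tokens vs. force tokens, each of shape (B, Na)."""
    a = attn.mean(dim=1)                    # average over heads: (B, N, N)
    rows = a[:, n_vis + n_force:, :]        # queries = action tokens
    vis_w = rows[..., :n_vis].sum(dim=-1)
    force_w = rows[..., n_vis:n_vis + n_force].sum(dim=-1)
    return vis_w, force_w                   # plot both over rollout time
```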
