Siqi Chen / Ice

Core implementer (model + env interface + evaluation) · 2026

Reinforcement Learning · Diffusion Policy · RLinf · LIBERO

Integrated a diffusion-based manipulation policy (RDT) into the RLinf framework: aligned observations (including joint proprioception), implemented DDPM action-chunk sampling and LIBERO action extraction, and enabled checkpoint-based evaluation. PPO-style RL fine-tuning is still in progress because it needs action log-probabilities, which diffusion sampling does not directly provide.

Outcome: The evaluation pipeline runs pretrained and fine-tuned RDT checkpoints on LIBERO. PPO-style RL fine-tuning is still under development; the open problem is estimating log-probabilities for diffusion-sampled actions.

TL;DR

Problem: Online RL fine-tuning (e.g., PPO) typically requires action log-probabilities, but diffusion policies like RDT generate actions via iterative denoising and do not naturally expose tractable log-probs.
Method: Integrated Robotics Diffusion Transformer (RDT) into RLinf by wrapping DDPM sampling as an RLinf policy, aligning observation interfaces (including joint proprioception), and implementing LIBERO action extraction + evaluation scripts.
Result: Checkpoint-based evaluation / rollout (behavior cloning style) is working on LIBERO. RL training is in progress, with the main remaining challenge being diffusion log-probability estimation for PPO-style policy gradients.

Overview

This project integrates a diffusion-based robot manipulation policy (RDT) into the RLinf reinforcement learning framework. The goal is to make it possible to start from a pretrained / fine-tuned RDT checkpoint and eventually perform online RL fine-tuning in simulation.

The integration is intentionally staged:

✅ Make RDT runnable as an RLinf agent and evaluate rollouts with pretrained checkpoints.
⏳ Add RL training support (e.g., PPO), which requires a principled way to compute or approximate action log-probabilities for diffusion sampling.

What I Implemented

1) RDT policy wrapper inside RLinf

Implemented an RLinf policy class that runs DDPM denoising to generate an action chunk (e.g., a 64-step chunk) and returns the actions for environment execution.
Followed the diffusers.DDPMScheduler interface to keep sampling behavior consistent and debuggable.
Implemented posterior mean/variance computation and kept the denoising chain for potential log-prob estimation.

Key file:

rlinf/models/embodiment/rdt/rdt_action_model_withlogprob.py
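
The sampling loop described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in the key file: the linear beta schedule, the `denoiser` callable (assumed here to return an estimate of the clean action chunk), and all function names are assumptions made for the example.

```python
import math
import random


def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus cumulative products of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas_cumprod, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alphas_cumprod.append(prod)
    return betas, alphas_cumprod


def posterior_mean_var(x_t, x0_hat, t, betas, alphas_cumprod):
    """Mean and variance of the DDPM posterior q(x_{t-1} | x_t, x0_hat)."""
    a_bar_t = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
    beta_t = betas[t]
    coef_x0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = [coef_x0 * x0 + coef_xt * x for x0, x in zip(x0_hat, x_t)]
    var = beta_t * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    return mean, var


def sample_chunk(denoiser, dim, betas, alphas_cumprod, rng):
    """Run the reverse chain from pure noise and keep every transition."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    chain = []  # one (t, mean, var, x_prev) record per denoising step
    for t in reversed(range(len(betas))):
        x0_hat = denoiser(x, t)  # hypothetical model call
        mean, var = posterior_mean_var(x, x0_hat, t, betas, alphas_cumprod)
        std = math.sqrt(var)  # var is 0 at t == 0, so the last step is deterministic
        x_prev = [m + std * rng.gauss(0.0, 1.0) for m in mean]
        chain.append((t, mean, var, x_prev))
        x = x_prev
    return x, chain
```

Keeping `chain` around costs little memory for short action chunks and is what makes a later per-step log-probability estimate possible.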

2) Observation interface alignment (LIBERO env + RLinf I/O)

RDT requires joint proprioception (arm joints + gripper), while the default LIBERO interface often exposes end-effector pose rather than full joint state.

Extended the LIBERO environment wrapper to extract and provide:
- 7-DoF arm joint positions
- 2-DoF gripper joint positions
Extended RLinf I/O structures to pass joint states from env → workers in distributed rollouts.

Key files:

rlinf/envs/libero/libero_env.py
rlinf/data/io_struct.py
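
As a rough illustration of the joint-state extraction, the sketch below builds a 9-dimensional proprioception vector from an observation dictionary. The keys `robot0_joint_pos` and `robot0_gripper_qpos` follow the robosuite naming convention and are assumed here; the function name is hypothetical and does not come from the repository.

```python
def extract_joint_state(obs):
    """Concatenate 7 arm joints and 2 gripper joints into one flat list."""
    arm = [float(v) for v in obs["robot0_joint_pos"]]          # assumed key
    gripper = [float(v) for v in obs["robot0_gripper_qpos"]]   # assumed key
    if len(arm) != 7 or len(gripper) != 2:
        raise ValueError(
            f"expected 7 arm + 2 gripper joints, got {len(arm)} + {len(gripper)}"
        )
    return arm + gripper


# Usage with a fake observation:
fake_obs = {
    "robot0_joint_pos": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "robot0_gripper_qpos": [0.02, -0.02],
}
proprio = extract_joint_state(fake_obs)
```

Validating the dimensions at the environment boundary makes shape mismatches fail early instead of surfacing inside a distributed rollout worker.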

3) Action-space mapping (RDT unified action → LIBERO action)

RDT produces actions in a unified high-dimensional space and outputs an action chunk, while LIBERO expects a lower-dimensional incremental control command.

Implemented a deterministic extraction mapping (EEF delta + gripper) from RDT’s unified action vector.
Added clamping to ensure actions stay within valid control ranges.
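
A minimal sketch of such an extraction step is shown below. The index layout (`EEF_DELTA_IDX`, `GRIPPER_IDX`), the 128-dimensional unified vector, and the [-1, 1] control range are illustrative assumptions; the real indices depend on RDT's unified action definition and are not reproduced here.

```python
# Hypothetical positions of the end-effector delta and gripper entries
# inside the unified action vector (placeholders for illustration only).
EEF_DELTA_IDX = (0, 1, 2, 3, 4, 5)
GRIPPER_IDX = 6


def extract_libero_action(unified, low=-1.0, high=1.0):
    """Pick 6 EEF delta values + 1 gripper value and clamp to [low, high]."""
    raw = [unified[i] for i in EEF_DELTA_IDX] + [unified[GRIPPER_IDX]]
    return [min(max(float(v), low), high) for v in raw]


def extract_chunk(unified_chunk):
    """Map every step of an action chunk to a 7-dimensional command."""
    return [extract_libero_action(step) for step in unified_chunk]


# Usage: one out-of-range step gets clamped.
step = [0.0] * 128
step[0], step[3], step[6] = 2.0, 0.25, -3.0
action = extract_libero_action(step)
```

Because the mapping is a fixed index selection followed by clamping, it is deterministic and easy to unit-test independently of the model.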

4) Practical engineering fixes for reproducibility

Fixed a 180° image rotation mismatch between LeRobot-formatted datasets and LIBERO simulation observations during preprocessing.
Worked around a PEFT version-check issue in diffusers by disabling the compatibility check via an environment variable (script-level fix).
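
For the image rotation fix, a 180° rotation amounts to reversing both the row order and the column order. The sketch below shows this on a nested-list image; the function name is hypothetical, and with NumPy arrays of shape (H, W, C) the equivalent slice would be `img[::-1, ::-1]`.

```python
def rotate_180(image):
    """Rotate an H x W (or H x W x C) nested-list image by 180 degrees."""
    # Reverse the rows, then reverse the pixels inside each row.
    # Channel values inside a pixel are left untouched.
    return [row[::-1] for row in image[::-1]]


# Usage on a tiny 2 x 2 single-channel image:
rotated = rotate_180([[1, 2],
                      [3, 4]])
```

Applying the same function twice returns the original image, which is a convenient sanity check when aligning dataset frames with simulator frames.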

Current Status

✅ Working: evaluation with pretrained/fine-tuned checkpoints

The evaluation scripts can load an RDT checkpoint and run rollouts on LIBERO suites through RLinf’s runner. This provides a consistent way to benchmark and debug the integrated agent.
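
Conceptually, a chunk-based evaluation rollout looks like the sketch below. It does not use RLinf's runner or the LIBERO API: `ToyEnv`, `run_episode`, `success_rate`, and the `(obs, done, success)` step signature are all stand-ins invented for illustration.

```python
class ToyEnv:
    """Stand-in environment: succeeds once the summed actions reach a goal."""

    def __init__(self, goal=3.0):
        self.goal = goal
        self.total = 0.0

    def reset(self):
        self.total = 0.0
        return self.total

    def step(self, action):
        self.total += action
        success = self.total >= self.goal
        return self.total, success, success  # obs, done, success


def run_episode(env, policy, max_steps=50):
    """Query the policy for a chunk, execute it, and re-plan until done."""
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        chunk = policy(obs)
        if not chunk:
            return False  # guard against an empty chunk
        for action in chunk:
            obs, done, success = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                return success
    return False


def success_rate(make_env, policy, episodes=10, max_steps=50):
    """Fraction of episodes that end in success."""
    wins = sum(run_episode(make_env(), policy, max_steps) for _ in range(episodes))
    return wins / episodes


# Usage: a fixed "policy" that always returns a 4-step chunk.
rate = success_rate(ToyEnv, lambda obs: [0.5] * 4, episodes=5)
```

In practice one might execute only the first few actions of each chunk before re-planning; the loop above executes the full chunk for simplicity.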

⏳ In progress: RL fine-tuning (PPO)

The main blocker is log-probability computation:

PPO-style methods require log π(a|s).
Diffusion policies generate actions through iterative denoising. The likelihood of the final action marginalizes over all intermediate noisy samples, so exact log-probs are intractable in general.
The current implementation includes the sampling chain and posterior calculations as building blocks for future log-prob estimation.
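
One possible direction, sketched below as an assumption rather than as this project's method, is to treat each stochastic denoising step as a Gaussian transition and sum the per-step log-densities along the stored chain. This yields the log-probability of the whole denoising trajectory, not the marginal log π(a|s), so it is a surrogate. The `(mean, var, x_prev)` record layout and the function names are hypothetical.

```python
import math


def gaussian_log_prob(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) evaluated at x."""
    dim = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (sq_dist / var + dim * math.log(2.0 * math.pi * var))


def chain_log_prob(chain, min_var=1e-12):
    """Sum per-step log-probs over a stored denoising chain.

    Each record is (mean, var, x_prev). Steps whose variance is (near) zero
    are deterministic and are skipped, since they have no density.
    """
    total = 0.0
    for mean, var, x_prev in chain:
        if var <= min_var:
            continue
        total += gaussian_log_prob(x_prev, mean, var)
    return total


# Usage: two unit-variance steps sampled exactly at their means,
# followed by a deterministic final step.
toy_chain = [([0.0], 1.0, [0.0]), ([0.0], 1.0, [0.0]), ([0.0], 0.0, [0.0])]
log_p = chain_log_prob(toy_chain)
```

A PPO-style ratio could then compare this quantity under the new and old policy parameters for the same stored trajectory, at the cost of backpropagating through every denoising step.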

Why this matters for Embodied AI

Many modern manipulation policies are diffusion-based, but most RL systems assume policies provide tractable log-probabilities. This project explores how to bridge that mismatch:

making diffusion policies runnable in an RL framework,
identifying the exact algorithmic bottleneck (log-prob),
and building the infrastructure needed for future online fine-tuning experiments.

Siqi Chen 陈思齐