Siqi Chen / Ice

Lead implementer (data → training → evaluation) · 2025

Imitation Learning Diffusion Policy LIBERO DeepSpeed LoRA

Code

Built an end-to-end supervised fine-tuning pipeline for Robotics Diffusion Transformer (RDT) on LIBERO (HDF5 parsing, SigLIP-style vision preprocessing, T5-XXL embedding caching, DeepSpeed training, and automated checkpoint evaluation), and quantified key tradeoffs (Full FT 76–94% vs LoRA ~20%).

Outcome: Identified practical bottlenecks and levers: inference dominates evaluation time (~375 ms/step, ~69%), and diffusion steps (H) + action chunk length (N) significantly affect success rate.

TL;DR

Problem: Adapting large multimodal diffusion policies to language-conditioned robot manipulation tasks is non-trivial due to mismatched observation/action formats and heavy training/inference cost.
Method: Built a full supervised fine-tuning pipeline for Robotics Diffusion Transformer (RDT) on LIBERO, including HDF5 parsing, vision/language preprocessing, DeepSpeed ZeRO-2 training, LoRA vs full fine-tuning comparison, and profiling-driven evaluation.
Result: Full fine-tuning achieves 76–94% success (suite-dependent), while LoRA (rank 8–32) saturates at ~20%. Caching T5-XXL instruction embeddings yields ~30% training speedup. Evaluation profiling shows ~375 ms/step model time dominates (~69%) and hyperparameters (diffusion steps H, action chunk length N) significantly impact success.

Overview

This project reproduces and extends the supervised fine-tuning workflow of Robotics Diffusion Transformer (RDT) on the LIBERO benchmark (five suites including libero_10/libero_90/libero_spatial/libero_object/libero_goal).
My focus is building a reliable and inspectable pipeline: data ingestion and normalization, scalable training, automated checkpoint evaluation, failure-mode analysis, and performance profiling.

What I Built

1) Data preprocessing: LIBERO HDF5 → RDT inputs

LIBERO provides RGB observations at 128×128, while RDT expects SigLIP-style inputs and a unified state/action layout.

Image preprocessing
- Pad 128×128 RGB frames to 336×336 (SigLIP-style) without aspect distortion
- Normalize with SigLIP mean/std (0.5/0.5)
- Augmentations: random horizontal flip, color jitter
State/action remapping
- Implemented LIBERO-specific index mapping (UNI_STATE_INDICES)
- State: 7 joint angles + 2D gripper width (normalized to [0,1])
- Action: 6D end-effector delta (xyz + rpy) + gripper command
Language embedding caching (T5-XXL)
- Pre-computed embeddings for all unique task instructions
- Cached to disk to avoid redundant encoding during training
- ~30% training speedup in practice

(Primary implementation: dataset classes in train/dataset_sft.py.)

2) Distributed full fine-tuning (DeepSpeed ZeRO-2)

Implemented full-parameter supervised fine-tuning with:

PyTorch 2.1 + CUDA 12.1, BF16
DeepSpeed 0.14 (ZeRO-2) for scalable training
Experiment management: timestamped checkpoint folders, config snapshots, resume-from-checkpoint, WandB logging

3) Parameter-efficient fine-tuning (LoRA) and systematic comparison

Implemented LoRA fine-tuning via peft and evaluated multiple setups:

ranks 8 / 16 / 32
different target module scopes (e.g., all / adaptor_only / cross_attn variants)
typical alpha/dropout sweeps

Key empirical finding: Across tested configurations, LoRA plateaus at ~20% success, far below full fine-tuning (76–94%, suite-dependent). This suggests low-rank adaptation is insufficient for this diffusion-policy + continuous-control setting at the tested ranks and module scopes.

4) Evaluation + checkpoint analysis + profiling infrastructure

Built evaluation utilities that:

Load checkpoints and run LIBERO rollouts with fixed trial counts
Output per-episode CSV logs and save videos for failure analysis
Support batch evaluation over many checkpoints to identify best training steps

Profiling results (A100):

Model inference dominates step time: ~375 ms/step (~69%)
Environment stepping: ~170 ms/step (~31%)
This strongly affects evaluation throughput (checkpoint evaluation becomes hours-scale).

Results (selected)

Success rates (full fine-tuning)

libero_object: 92% (46/50), checkpoint-40000
libero_goal: 94% (47/50), checkpoint-30000
libero_spatial: 76% (38/50), checkpoint-22000
libero_long: 38% (19/50), checkpoint-38000

Full FT vs LoRA

Full fine-tuning: 76–94% (suite-dependent)
LoRA (rank 8–32): ~20% upper bound in my experiments

Ablations that matter

On libero_object:

Increasing diffusion steps H from 1 → 8 improves success (e.g., ~78% → ~88%)
Increasing action chunk length N from 1 → 10 improves success (e.g., ~78% → ~90%)
Combined (H=8, N=10) reaches ~92%

My Contribution (summary)

Implemented the full fine-tuning pipeline end-to-end (data → training → evaluation).
Added language embedding caching to reduce compute overhead (~30% speedup).
Built batch checkpoint evaluation and profiling tools to quantify bottlenecks.
Ran systematic comparisons (Full FT vs LoRA) and identified practical failure modes and performance limits.

Siqi Chen 陈思齐