TL;DR
- Problem: Adapting large multimodal diffusion policies to language-conditioned robot manipulation tasks is non-trivial due to mismatched observation/action formats and heavy training/inference cost.
- Method: Built a full supervised fine-tuning pipeline for Robotics Diffusion Transformer (RDT) on LIBERO, including HDF5 parsing, vision/language preprocessing, DeepSpeed ZeRO-2 training, LoRA vs full fine-tuning comparison, and profiling-driven evaluation.
- Result: Full fine-tuning achieves 76–94% success (suite-dependent), while LoRA (rank 8–32) saturates at ~20%. Caching T5-XXL instruction embeddings yields ~30% training speedup. Evaluation profiling shows ~375 ms/step model time dominates (~69%) and hyperparameters (diffusion steps H, action chunk length N) significantly impact success.
Overview
This project reproduces and extends the supervised fine-tuning workflow of Robotics Diffusion Transformer (RDT) on the LIBERO benchmark (five suites including libero_10/libero_90/libero_spatial/libero_object/libero_goal).
My focus is building a reliable and inspectable pipeline: data ingestion and normalization, scalable training, automated checkpoint evaluation, failure-mode analysis, and performance profiling.
What I Built
1) Data preprocessing: LIBERO HDF5 → RDT inputs
LIBERO provides RGB observations at 128×128, while RDT expects SigLIP-style inputs and a unified state/action layout.
- Image preprocessing
- Pad 128×128 RGB frames to 336×336 (SigLIP-style) without aspect distortion
- Normalize with SigLIP mean/std (0.5/0.5)
- Augmentations: random horizontal flip, color jitter
- State/action remapping
- Implemented LIBERO-specific index mapping (
UNI_STATE_INDICES) - State: 7 joint angles + 2D gripper width (normalized to [0,1])
- Action: 6D end-effector delta (xyz + rpy) + gripper command
- Implemented LIBERO-specific index mapping (
- Language embedding caching (T5-XXL)
- Pre-computed embeddings for all unique task instructions
- Cached to disk to avoid redundant encoding during training
- ~30% training speedup in practice
(Primary implementation: dataset classes in train/dataset_sft.py.)
2) Distributed full fine-tuning (DeepSpeed ZeRO-2)
Implemented full-parameter supervised fine-tuning with:
- PyTorch 2.1 + CUDA 12.1, BF16
- DeepSpeed 0.14 (ZeRO-2) for scalable training
- Experiment management: timestamped checkpoint folders, config snapshots, resume-from-checkpoint, WandB logging
3) Parameter-efficient fine-tuning (LoRA) and systematic comparison
Implemented LoRA fine-tuning via peft and evaluated multiple setups:
- ranks 8 / 16 / 32
- different target module scopes (e.g., all / adaptor_only / cross_attn variants)
- typical alpha/dropout sweeps
Key empirical finding: Across tested configurations, LoRA plateaus at ~20% success, far below full fine-tuning (76–94%, suite-dependent). This suggests low-rank adaptation is insufficient for this diffusion-policy + continuous-control setting at the tested ranks and module scopes.
4) Evaluation + checkpoint analysis + profiling infrastructure
Built evaluation utilities that:
- Load checkpoints and run LIBERO rollouts with fixed trial counts
- Output per-episode CSV logs and save videos for failure analysis
- Support batch evaluation over many checkpoints to identify best training steps
Profiling results (A100):
- Model inference dominates step time: ~375 ms/step (~69%)
- Environment stepping: ~170 ms/step (~31%)
- This strongly affects evaluation throughput (checkpoint evaluation becomes hours-scale).
Results (selected)
Success rates (full fine-tuning)
- libero_object: 92% (46/50), checkpoint-40000
- libero_goal: 94% (47/50), checkpoint-30000
- libero_spatial: 76% (38/50), checkpoint-22000
- libero_long: 38% (19/50), checkpoint-38000
Full FT vs LoRA
- Full fine-tuning: 76–94% (suite-dependent)
- LoRA (rank 8–32): ~20% upper bound in my experiments
Ablations that matter
On libero_object:
- Increasing diffusion steps H from 1 → 8 improves success (e.g., ~78% → ~88%)
- Increasing action chunk length N from 1 → 10 improves success (e.g., ~78% → ~90%)
- Combined (H=8, N=10) reaches ~92%
My Contribution (summary)
- Implemented the full fine-tuning pipeline end-to-end (data → training → evaluation).
- Added language embedding caching to reduce compute overhead (~30% speedup).
- Built batch checkpoint evaluation and profiling tools to quantify bottlenecks.
- Ran systematic comparisons (Full FT vs LoRA) and identified practical failure modes and performance limits.