Lead implementer (data → training → evaluation) · 2025
Imitation Learning · Diffusion Policy · LIBERO · DeepSpeed · LoRA
Built an end-to-end supervised fine-tuning pipeline for the Robotics Diffusion Transformer (RDT) on LIBERO, covering HDF5 parsing, SigLIP-style vision preprocessing, T5-XXL embedding caching, DeepSpeed training, and automated checkpoint evaluation. Quantified the key tradeoff: full fine-tuning reaches 76–94% success on libero_object, libero_goal, and libero_spatial, while LoRA plateaus at ~20%.
Outcome: Identified the practical bottlenecks and levers. Model inference dominates evaluation time (~375 ms/step, ~69% of step time), and both the number of diffusion steps (H) and the action chunk length (N) significantly affect success rate.

TL;DR

  • Problem: Adapting large multimodal diffusion policies to language-conditioned robot manipulation tasks is non-trivial due to mismatched observation/action formats and heavy training/inference cost.
  • Method: Built a full supervised fine-tuning pipeline for Robotics Diffusion Transformer (RDT) on LIBERO, including HDF5 parsing, vision/language preprocessing, DeepSpeed ZeRO-2 training, LoRA vs full fine-tuning comparison, and profiling-driven evaluation.
  • Result: Full fine-tuning achieves 76–94% success on libero_object, libero_goal, and libero_spatial (38% on libero_long), while LoRA (rank 8–32) plateaus at ~20%. Caching T5-XXL instruction embeddings yields a ~30% training speedup. Evaluation profiling shows that model inference (~375 ms/step, ~69% of step time) dominates, and that the diffusion steps H and action chunk length N significantly affect success.

Overview

This project reproduces and extends the supervised fine-tuning workflow of the Robotics Diffusion Transformer (RDT) on the LIBERO benchmark (five suites: libero_10, libero_90, libero_spatial, libero_object, and libero_goal; libero_10 is the long-horizon suite, reported below as libero_long).
My focus is building a reliable and inspectable pipeline: data ingestion and normalization, scalable training, automated checkpoint evaluation, failure-mode analysis, and performance profiling.

What I Built

1) Data preprocessing: LIBERO HDF5 → RDT inputs

LIBERO provides RGB observations at 128×128, while RDT expects SigLIP-style inputs and a unified state/action layout.

  • Image preprocessing
    • Pad 128×128 RGB frames to 336×336 (SigLIP-style) without aspect distortion
    • Normalize with SigLIP mean/std (0.5/0.5)
    • Augmentations: random horizontal flip, color jitter
  • State/action remapping
    • Implemented LIBERO-specific index mapping (UNI_STATE_INDICES)
    • State: 7 joint angles + 2D gripper width (normalized to [0,1])
    • Action: 6D end-effector delta (xyz + rpy) + gripper command
  • Language embedding caching (T5-XXL)
    • Pre-computed embeddings for all unique task instructions
    • Cached to disk to avoid redundant encoding during training
    • ~30% training speedup in practice
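
The image preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in train/dataset_sft.py: the function name pad_and_normalize, the choice of center padding, and filling the border with the mean value (so padded pixels become 0 after normalization) are all assumptions made for the example.

```python
import numpy as np


def pad_and_normalize(frame: np.ndarray, target: int = 336,
                      mean: float = 0.5, std: float = 0.5) -> np.ndarray:
    """Center-pad an HxWx3 uint8 frame to target x target and normalize it.

    Hypothetical helper: padding with the mean value is an assumption,
    chosen so that the border maps to 0 after normalization.
    """
    h, w, c = frame.shape
    if c != 3:
        raise ValueError("expected an RGB frame with 3 channels")
    if h > target or w > target:
        raise ValueError("frame is larger than the target canvas")

    # Scale pixel values from [0, 255] to [0, 1].
    img = frame.astype(np.float32) / 255.0

    # Paste the frame into the middle of a mean-filled canvas.
    # No resizing happens, so the aspect ratio is untouched.
    canvas = np.full((target, target, 3), mean, dtype=np.float32)
    top = (target - h) // 2
    left = (target - w) // 2
    canvas[top:top + h, left:left + w] = img

    # Normalize with mean/std 0.5/0.5, which maps [0, 1] to [-1, 1].
    normalized = (canvas - mean) / std

    # Return channels-first (3, target, target), the usual layout for vision encoders.
    return normalized.transpose(2, 0, 1)


if __name__ == "__main__":
    dummy = np.zeros((128, 128, 3), dtype=np.uint8)
    print(pad_and_normalize(dummy).shape)
```

Augmentations (flip, color jitter) are left out of the sketch; they would be applied to the frame before padding.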

(Primary implementation: dataset classes in train/dataset_sft.py.)
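
The embedding cache can be illustrated with a small disk-backed lookup. This is a sketch under stated assumptions: the encoder is passed in as a plain callable (standing in for a T5-XXL forward pass), and the helper names, the hashing scheme, and the .npy file layout are hypothetical rather than taken from the project.

```python
import hashlib
from pathlib import Path
from typing import Callable, Sequence

import numpy as np


def cache_path(cache_dir: Path, instruction: str) -> Path:
    """Map an instruction string to a stable file name (hypothetical scheme)."""
    digest = hashlib.sha256(instruction.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.npy"


def get_embedding(instruction: str,
                  encode: Callable[[str], Sequence[float]],
                  cache_dir: Path) -> np.ndarray:
    """Return the cached embedding, calling the encoder only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, instruction)
    if path.exists():
        return np.load(path)
    embedding = np.asarray(encode(instruction), dtype=np.float32)
    np.save(path, embedding)
    return embedding


def precompute(instructions: Sequence[str],
               encode: Callable[[str], Sequence[float]],
               cache_dir: Path) -> int:
    """Encode each unique instruction once; return how many were processed."""
    unique = sorted(set(instructions))
    for text in unique:
        get_embedding(text, encode, cache_dir)
    return len(unique)
```

Because LIBERO has a small set of distinct task instructions relative to the number of training samples, each instruction needs encoding only once, and the large text encoder can stay out of the training loop.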

2) Distributed full fine-tuning (DeepSpeed ZeRO-2)

Implemented full-parameter supervised fine-tuning with:

  • PyTorch 2.1 + CUDA 12.1, BF16
  • DeepSpeed 0.14 (ZeRO-2) for scalable training
  • Experiment management: timestamped checkpoint folders, config snapshots, resume-from-checkpoint, WandB logging
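
For orientation, a DeepSpeed configuration for this kind of setup might look like the dictionary below. Only BF16 and ZeRO stage 2 come from the description above; the batch size, gradient accumulation, and clipping values are placeholder assumptions, not the settings used in the project.

```python
import json

# Sketch of a DeepSpeed config for BF16 + ZeRO-2 training.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # assumption: placeholder value
    "gradient_accumulation_steps": 4,      # assumption: placeholder value
    "gradient_clipping": 1.0,              # assumption: placeholder value
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                        # shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}


def effective_batch_size(config: dict, world_size: int) -> int:
    """Global batch size = micro batch per GPU x accumulation steps x number of GPUs."""
    return (config["train_micro_batch_size_per_gpu"]
            * config["gradient_accumulation_steps"]
            * world_size)


if __name__ == "__main__":
    # The JSON form is what would be snapshotted next to each checkpoint folder.
    print(json.dumps(ds_config, indent=2))
    print(effective_batch_size(ds_config, world_size=4))
```

Saving the serialized config alongside each timestamped checkpoint folder is one simple way to make runs reproducible and resumable.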

3) Parameter-efficient fine-tuning (LoRA) and systematic comparison

Implemented LoRA fine-tuning via peft and evaluated multiple setups:

  • ranks 8 / 16 / 32
  • different target module scopes (e.g., all / adaptor_only / cross_attn variants)
  • sweeps over typical alpha (scaling) and dropout values

Key empirical finding: Across tested configurations, LoRA plateaus at ~20% success, far below full fine-tuning (76–94%, suite-dependent). This suggests low-rank adaptation is insufficient for this diffusion-policy + continuous-control setting at the tested ranks and module scopes.
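
To give a sense of how small the trainable budget is at these ranks, the arithmetic below counts LoRA parameters for a single linear layer. The 2048×2048 layer size is an illustrative assumption, not a measured property of the model, and the calculation by itself does not explain the success-rate gap.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA learns an update B @ A with B: (d_out, rank) and A: (rank, d_in)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the full weight matrix."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    d = 2048  # assumption: illustrative layer width
    for rank in (8, 16, 32):
        count = lora_param_count(d, d, rank)
        share = lora_fraction(d, d, rank)
        print(f"rank={rank:2d}  params={count:7d}  share={share:.2%}")
```

Under this assumption, even rank 32 updates only about 3% as many parameters per adapted layer as full fine-tuning does.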

4) Evaluation + checkpoint analysis + profiling infrastructure

Built evaluation utilities that:

  • Load checkpoints and run LIBERO rollouts with fixed trial counts
  • Output per-episode CSV logs and save videos for failure analysis
  • Support batch evaluation over many checkpoints to identify best training steps
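
Selecting the best checkpoint from per-episode logs can be sketched as a small aggregation. The CSV column names (suite, checkpoint, success) are assumptions for illustration; the actual log schema may differ.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO, Tuple


def success_by_checkpoint(csv_file: TextIO) -> Dict[Tuple[str, str], float]:
    """Aggregate per-episode rows into a success rate per (suite, checkpoint)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, episodes]
    for row in csv.DictReader(csv_file):
        key = (row["suite"], row["checkpoint"])
        counts[key][0] += int(row["success"])
        counts[key][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def best_checkpoint(rates: Dict[Tuple[str, str], float], suite: str) -> Tuple[str, float]:
    """Return the (checkpoint, success rate) pair with the highest rate for a suite."""
    candidates = {ckpt: rate for (s, ckpt), rate in rates.items() if s == suite}
    if not candidates:
        raise KeyError(f"no results for suite {suite!r}")
    return max(candidates.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Tiny made-up log, only to show the expected shape of the input.
    sample = io.StringIO(
        "suite,checkpoint,episode,success\n"
        "libero_goal,checkpoint-20000,0,1\n"
        "libero_goal,checkpoint-20000,1,0\n"
        "libero_goal,checkpoint-30000,0,1\n"
        "libero_goal,checkpoint-30000,1,1\n"
    )
    print(best_checkpoint(success_by_checkpoint(sample), "libero_goal"))
```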

Profiling results (A100):

  • Model inference dominates step time: ~375 ms/step (~69%)
  • Environment stepping: ~170 ms/step (~31%)
  • At ~545 ms per step in total, evaluation throughput is limited: evaluating many checkpoints takes hours.
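
The percentages follow directly from the two per-step timings, and the same numbers give a rough evaluation-time estimate. In the sketch below, the per-episode step count is a hypothetical value chosen for illustration (it treats every episode as running to a fixed length, so the result is an upper-bound style estimate); only the 375 ms and 170 ms figures and the 50-episode count come from the measurements reported here.

```python
def step_breakdown(model_ms: float, env_ms: float) -> dict:
    """Split total per-step time into model and environment shares."""
    total = model_ms + env_ms
    return {
        "total_ms": total,
        "model_share": model_ms / total,
        "env_share": env_ms / total,
    }


def eval_hours(step_ms: float, steps_per_episode: int,
               episodes: int, checkpoints: int = 1) -> float:
    """Wall-clock hours for sequential rollouts, ignoring resets and model loading."""
    total_ms = step_ms * steps_per_episode * episodes * checkpoints
    return total_ms / 3_600_000.0


if __name__ == "__main__":
    timing = step_breakdown(model_ms=375.0, env_ms=170.0)
    print(f"model share: {timing['model_share']:.1%}, env share: {timing['env_share']:.1%}")

    # Assumption: 300 steps per episode is a placeholder, not a measured value.
    hours = eval_hours(timing["total_ms"], steps_per_episode=300, episodes=50)
    print(f"estimated hours per suite and checkpoint: {hours:.2f}")
```

Since model inference is the larger share, reducing the number of model calls or the cost of each call is the most direct lever on evaluation time.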

Results (selected)

Success rates (full fine-tuning)

  • libero_object: 92% (46/50), checkpoint-40000
  • libero_goal: 94% (47/50), checkpoint-30000
  • libero_spatial: 76% (38/50), checkpoint-22000
  • libero_long: 38% (19/50), checkpoint-38000

Full FT vs LoRA

  • Full fine-tuning: 76–94% on libero_object, libero_goal, and libero_spatial; 38% on libero_long
  • LoRA (rank 8–32): at most ~20% in my experiments

Ablations that matter

On libero_object:

  • Increasing diffusion steps H from 1 → 8 improves success (e.g., ~78% → ~88%)
  • Increasing action chunk length N from 1 → 10 improves success (e.g., ~78% → ~90%)
  • Combined (H=8, N=10) reaches ~92%
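
The role of the action chunk length can be illustrated with a receding-horizon rollout loop. The sketch assumes that N is the number of predicted actions executed before the policy is queried again; the policy and environment are stand-in callables, and all names are hypothetical. Diffusion steps H would live inside the policy call and are not modeled here.

```python
from typing import Any, Callable, Sequence, Tuple


def rollout_with_chunks(policy: Callable[[Any], Sequence[Any]],
                        env_step: Callable[[Any], Tuple[Any, bool]],
                        first_obs: Any,
                        max_steps: int,
                        n_exec: int) -> Tuple[int, int]:
    """Run one episode, executing n_exec actions per policy query.

    Returns (environment steps taken, number of policy queries).
    """
    if n_exec < 1:
        raise ValueError("n_exec must be at least 1")

    obs = first_obs
    steps = 0
    queries = 0
    while steps < max_steps:
        chunk = policy(obs)          # one (expensive) model call per chunk
        queries += 1
        actions = chunk[:n_exec]     # execute only the first n_exec actions
        if len(actions) == 0:
            raise ValueError("policy returned an empty action chunk")
        for action in actions:
            obs, done = env_step(action)
            steps += 1
            if done or steps >= max_steps:
                return steps, queries
    return steps, queries


if __name__ == "__main__":
    # Dummy policy and environment, only to show the query count.
    dummy_policy = lambda obs: [0.0] * 64
    dummy_env_step = lambda action: (None, False)
    for n in (1, 10):
        print(n, rollout_with_chunks(dummy_policy, dummy_env_step, None, 100, n))
```

Under this reading, a larger N also means fewer model queries per episode, so it interacts with the inference cost measured in the profiling section as well as with success rate.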

My Contribution (summary)

  • Implemented the full fine-tuning pipeline end-to-end (data → training → evaluation).
  • Added language embedding caching to reduce compute overhead (~30% speedup).
  • Built batch checkpoint evaluation and profiling tools to quantify bottlenecks.
  • Ran systematic comparisons (Full FT vs LoRA) and identified practical failure modes and performance limits.