Siqi Chen / Ice

End-to-end fine-tuning + evaluation infrastructure · 2026

VLA · Language-conditioned Manipulation · LIBERO · GR00T-N1.6

Fine-tuned NVIDIA GR00T-N1.6 (3B VLA) on four LIBERO suites with a parameter-efficient recipe (train projector + diffusion decoder, freeze vision/LLM), and achieved 97.8% average success (782/800) with released, benchmark-validated checkpoints.

Outcome: 97.8% success (782/800 episodes) across 4 LIBERO suites; +0.8 percentage points over the published baseline under the same evaluation protocol.

TL;DR

Problem: Fine-tuning large vision-language-action (VLA) models on language-conditioned manipulation benchmarks requires careful embodiment mapping, scalable training, and reliable evaluation.
Method: Reuse an end-to-end fine-tuning pipeline for NVIDIA GR00T-N1.6 (3B) on LIBERO, combining a parameter-efficient strategy (freeze vision + LLM; train projector + diffusion decoder), DeepSpeed ZeRO-2, and a config-driven parallel evaluation system with video recording.
Result: Achieved 97.8% average success (782/800 episodes) across 40 tasks from 4 LIBERO suites, surpassing the published baseline by 0.8 percentage points under the same evaluation protocol. Released 4 validated checkpoints on HuggingFace.

Overview

This project demonstrates a full workflow to adapt GR00T-N1.6 (a 3B-parameter vision-language-action model) to LIBERO language-conditioned manipulation tasks.
My focus is not only training, but also building a reproducible system: dataset integration, embodiment modality mapping, scalable training scripts for multiple suites, and a reliable evaluation harness (parallel envs + logging + videos).

What I Did

1) Dataset integration & embodiment mapping

Integrated LIBERO datasets in LeRobot v2 format.
Implemented modality mapping for Franka Panda (7-DoF + gripper) so the model can consume observations and output actions in the correct embodiment interface.
Built a simulation wrapper that supports parallel rollouts for efficient evaluation.

Key artifacts:

examples/LIBERO/modality.json (state/action mapping)
gr00t/eval/sim/LIBERO/libero_env.py (environment wrapper)
gr00t/eval/sim/LIBERO/setup_libero.sh (environment setup)
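
To illustrate what a state/action modality mapping does, here is a minimal sketch that slices a flat vector into named parts. The key names and index ranges are hypothetical (7 joint values followed by 1 gripper value is an assumption for illustration); the real examples/LIBERO/modality.json may be organized differently.

```python
# Hypothetical layout: names and (start, end) ranges are assumptions,
# not the contents of the real modality.json.
MODALITY = {
    "state": {"joint_position": (0, 7), "gripper": (7, 8)},
    "action": {"joint_command": (0, 7), "gripper": (7, 8)},
}


def split_by_modality(vector, layout):
    """Slice a flat vector into named parts using half-open (start, end) ranges."""
    expected = max(end for _, end in layout.values())
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    return {name: list(vector[start:end]) for name, (start, end) in layout.items()}


# Example: an 8-dimensional state vector split into arm joints and gripper.
state = [0.1 * i for i in range(8)]
parts = split_by_modality(state, MODALITY["state"])
```

The length check is there because a silent mismatch between the dataset layout and the model's embodiment interface is hard to spot later from success rates alone.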

2) Parameter-efficient fine-tuning strategy

To make fine-tuning practical for a 3B model, I adopted a selective training plan:

Frozen: vision encoder (ViT), language model (LLM), and backbone components to preserve pre-trained capabilities.
Trained: multimodal projector (fusion/adaptation) and diffusion decoder (DiT) to specialize action generation.
Trainable parameters: ~40% (≈1.2B) of the full model, reducing memory footprint while keeping adaptation capacity.
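
The freeze/train split can be expressed as a simple rule over parameter names. The sketch below is pure Python with hypothetical module prefixes and made-up parameter counts (in millions) chosen only so that the arithmetic lands on the ~40% figure above; it is not the actual GR00T-N1.6 module layout. In PyTorch, the same rule would typically be applied by setting `requires_grad = False` on the matching parameters.

```python
# Hypothetical prefixes for the frozen parts (vision encoder and LLM).
FROZEN_PREFIXES = ("vision_encoder.", "language_model.")


def is_trainable(param_name):
    """A parameter is trained unless its name starts with a frozen prefix."""
    return not param_name.startswith(FROZEN_PREFIXES)


def trainable_fraction(param_counts):
    """Return the share of parameters that remain trainable."""
    total = sum(param_counts.values())
    trainable = sum(n for name, n in param_counts.items() if is_trainable(name))
    return trainable / total


# Illustrative counts in millions (assumed, not measured from the model).
counts = {
    "vision_encoder.blocks": 400,
    "language_model.layers": 1400,
    "projector.mlp": 100,
    "action_head.dit": 1100,
}
fraction = trainable_fraction(counts)
print(f"trainable share: {fraction:.0%}")
```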

3) Suite-specific training scripts (4 suites)

Implemented and validated suite-specific fine-tuning scripts:

LIBERO Spatial
LIBERO Object
LIBERO Goal
LIBERO-10

Training highlights:

DeepSpeed ZeRO-2 distributed training
BF16 mixed precision
Large effective batch size via gradient accumulation
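
As a rough sketch of how these three settings fit together, the snippet below builds a DeepSpeed-style config dictionary. The config keys are standard DeepSpeed options; the micro-batch size and accumulation steps are assumed values for illustration, not the ones used in this project. The effective batch size is the product of micro-batch size, accumulation steps, and GPU count.

```python
import json


def build_deepspeed_config(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Assemble a ZeRO-2 + BF16 config with a consistent effective batch size."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        # Effective batch = micro-batch x accumulation steps x number of GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }


# Assumed values: 8 samples per GPU and 4 accumulation steps on 8 GPUs.
config = build_deepspeed_config(micro_batch_per_gpu=8, grad_accum_steps=4, num_gpus=8)
print(json.dumps(config, indent=2))
```

With these assumed values the effective batch size is 8 x 4 x 8 = 256, while each GPU only holds 8 samples in memory at a time.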

4) Automated evaluation framework (parallel envs + videos)

Configuration-driven evaluation system:

YAML configs per suite (examples/LIBERO/eval/config/)
Parallel evaluation with 5 concurrent environments
Automatic logging (CSV + detailed logs)
Automatic video recording for qualitative verification (e.g., 3 videos per task)

Outputs are organized per run with results, logs, and videos:

examples/LIBERO/eval/out/<suite_timestamp>/...
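
The parallel-rollout and CSV-logging pattern can be sketched as follows. This is an illustration only: `run_episode` is a placeholder that returns a seeded random outcome instead of stepping a LIBERO environment with the policy, and the function and column names are hypothetical. A thread pool is used for brevity; a real simulator may need process-level isolation instead.

```python
import csv
import io
import random
from concurrent.futures import ThreadPoolExecutor

NUM_ENVS = 5  # number of concurrent environments, as in the text


def run_episode(task_id, episode_idx):
    """Placeholder rollout: a real one would step the simulator with the policy."""
    rng = random.Random(task_id * 1000 + episode_idx)
    return {"task_id": task_id, "episode": episode_idx, "success": int(rng.random() < 0.9)}


def evaluate(task_ids, episodes_per_task):
    """Run every (task, episode) pair with at most NUM_ENVS rollouts in flight."""
    jobs = [(t, e) for t in task_ids for e in range(episodes_per_task)]
    with ThreadPoolExecutor(max_workers=NUM_ENVS) as pool:
        return list(pool.map(lambda job: run_episode(*job), jobs))


def to_csv(rows):
    """Serialize per-episode results to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["task_id", "episode", "success"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# 10 tasks x 20 episodes = 200 episodes, matching one suite's episode count.
rows = evaluate(task_ids=range(10), episodes_per_task=20)
csv_text = to_csv(rows)
```

Logging one row per episode, rather than only per-task averages, makes it possible to recompute any aggregate later and to match failures against the recorded videos.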

5) Model release

Published four benchmark-validated fine-tuned checkpoints on HuggingFace:

Spatial (98.5% SR)
Object (100.0% SR)
Goal (97.0% SR)
LIBERO-10 (95.5% SR)

Results

Overall

Average success: 97.8% (782/800 episodes)
Perfect tasks: 31/40 tasks achieve 100% SR
Training cost: ~100 GPU-hours (8 GPUs)

Per-suite performance

Spatial: 98.5% (197/200), checkpoint-20000
Object: 100.0% (200/200), checkpoint-20000
Goal: 97.0% (194/200), checkpoint-40000
LIBERO-10: 95.5% (191/200), checkpoint-20000
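
The overall figure follows directly from the per-suite counts. A short check, using only the numbers listed above:

```python
# Per-suite (successes, episodes) taken from the results above.
SUITE_RESULTS = {
    "Spatial": (197, 200),
    "Object": (200, 200),
    "Goal": (194, 200),
    "LIBERO-10": (191, 200),
}


def success_rate(successes, episodes):
    """Success rate as a percentage."""
    return 100.0 * successes / episodes


total_successes = sum(s for s, _ in SUITE_RESULTS.values())
total_episodes = sum(n for _, n in SUITE_RESULTS.values())
overall = success_rate(total_successes, total_episodes)

for suite, (s, n) in SUITE_RESULTS.items():
    print(f"{suite}: {success_rate(s, n):.1f}% ({s}/{n})")
print(f"Overall: {overall:.2f}% ({total_successes}/{total_episodes})")
```

Because every suite has the same number of episodes, the pooled rate (782/800 = 97.75%, reported as 97.8%) equals the mean of the four per-suite rates.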

Baseline comparison (published vs mine)

Under the same evaluation protocol:

Mine: 97.8% average
Published baseline: 97.0% average
Delta: +0.8 percentage points on average

Siqi Chen 陈思齐