CMPT 983 Generative Modeling Course Project
Project Overview
- Course: CMPT 983 - Generative Modeling (Fall 2025)
- Topic: Vision-Language-Action (VLA) Model Optimization
- Team: 2 members
Abstract
We study action sequence generation in Vision-Language-Action (VLA) models, focusing on the action expert component. Specifically, we replace the flow-matching-based action expert in π0.5 with a conditional generative model trained using rejection sampling implicit maximum likelihood estimation (RS-IMLE). This modification removes the need for iterative denoising during inference and enables single-pass action sequence generation.
Technologies Used
- Framework: PyTorch, LeRobot
- Models: π0.5, PaliGemma-2B (VLM), Gemma-300M (Action Expert), U-Net (Action Generator)
- Techniques: RS-IMLE, LoRA, Flow Matching
- Benchmark: LIBERO (40 robot manipulation tasks over 4 task suites)
Method
π0.5 Architecture
The π0.5 model consists of two major components:
- Vision-Language Model (VLM): PaliGemma-2B, which encodes the current task instruction and state observation into conditional latent embeddings
- Action Expert: Gemma-300M, which takes the latent embeddings and generates action chunk sequences via flow matching
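The sketch below illustrates how these two components interact at inference time with the flow-matching expert. The class and method names, the Euler integration loop, and the chunk/action dimensions are illustrative assumptions for this write-up, not the actual π0.5 / LeRobot API.

```python
import torch

CHUNK_LEN, ACTION_DIM = 50, 7  # illustrative chunk length / action dimension

class Pi05Policy(torch.nn.Module):
    def __init__(self, vlm, action_expert, num_denoise_steps=10):
        super().__init__()
        self.vlm = vlm                      # PaliGemma-2B backbone
        self.action_expert = action_expert  # Gemma-300M flow-matching head
        self.num_denoise_steps = num_denoise_steps

    @torch.no_grad()
    def act(self, images, instruction, proprio):
        # 1) VLM encodes the task instruction and observation into a
        #    sequence of conditioning embeddings
        cond = self.vlm(images=images, text=instruction, state=proprio)

        # 2) The action expert integrates a learned velocity field from
        #    Gaussian noise to an action chunk with simple Euler steps
        actions = torch.randn(1, CHUNK_LEN, ACTION_DIM)
        dt = 1.0 / self.num_denoise_steps
        for k in range(self.num_denoise_steps):
            t = torch.full((1,), k * dt)
            v = self.action_expert(actions, t, cond)  # predicted velocity
            actions = actions + dt * v
        return actions
```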
Our Approach: RS-IMLE Action Expert
We replaced the flow-matching action expert with a model trained using Rejection Sampling IMLE (RS-IMLE):
- Enables single-pass action generation (vs. 10 iterative denoising steps)
- Reduces inference latency by ~70%
- Uses likelihood-free training that encourages coverage of the training data distribution
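A minimal sketch of one conditional RS-IMLE training step is shown below. The function signature, the L2 distance, and the hyperparameters (`n_samples`, `eps`) are illustrative assumptions about our setup, not a library API.

```python
import torch
import torch.nn.functional as F

def rs_imle_step(expert, cond, target_actions, n_samples=16, eps=0.1):
    """One RS-IMLE step: sample candidate action chunks, reject those already
    within eps of the ground truth, then pull the nearest surviving candidate
    toward the ground-truth chunk."""
    B, T, D = target_actions.shape
    # Draw candidate action chunks from random latents, each conditioned on
    # the same VLM embedding as its ground-truth example
    z = torch.randn(B, n_samples, T, D, device=target_actions.device)
    cand = expert(z.flatten(0, 1), cond.repeat_interleave(n_samples, dim=0))
    cand = cand.view(B, n_samples, T, D)

    # Per-candidate L2 distance to the ground-truth chunk
    dist = (cand - target_actions.unsqueeze(1)).flatten(2).norm(dim=-1)  # (B, n_samples)

    # Rejection: candidates already within eps are not valid targets, so mask
    # them out before the nearest-neighbour selection
    dist = torch.where(dist < eps, torch.full_like(dist, float("inf")), dist)
    nearest = dist.argmin(dim=1)  # closest accepted candidate per example
    # (If every candidate of an example is rejected, its loss is near zero;
    # in practice such examples can simply be skipped.)

    selected = cand[torch.arange(B), nearest]
    return F.mse_loss(selected, target_actions)
```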
LoRA Fine-tuning
Applied Low-Rank Adaptation (LoRA) to efficiently fine-tune the VLM:
- Target modules: Attention layers (q, k, v, o projections) and FFN layers
- Rank r=32, scaling factor α=2r, dropout=0.05
- Two-stage training: First train action expert (6k steps), then joint fine-tuning with LoRA (6k steps)
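Below is a minimal sketch of this LoRA configuration using Hugging Face's peft library. The target module names follow the Gemma naming convention (`q_proj`, ..., `down_proj`); the exact list depends on the checkpoint, so treat it as an assumption.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,              # rank
    lora_alpha=64,     # scaling factor alpha = 2r
    lora_dropout=0.05,
    target_modules=[
        # attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # FFN layers
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)

# vlm is the PaliGemma-2B backbone loaded elsewhere (e.g. via transformers);
# get_peft_model wraps the matching layers with trainable LoRA adapters:
# vlm = get_peft_model(vlm, lora_config)
```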
My Contributions
- U-Net Action Expert Implementation: Trained and evaluated π0.5-IMLE-U-Net variants with different pooling strategies (average pooling, attention pooling, type encoding + attention pooling)
- π0.5-IMLE-LoRA Implementation: Implemented LoRA fine-tuning for the VLM backbone
- Benchmark & Profiling: Measured inference time and evaluation time across all model variants
- Report Writing: Co-authored the final project report
Results
Main Results
| Variant | Success Rate | Inference Time | Training Steps |
|---|---|---|---|
| π0.5 (baseline) | 94% | 131.13ms | - |
| π0.5-one-step-flow-matching | 94.25% | 28.6ms | 6k |
| π0.5-LoRA-RSIMLE | 86% | 39.62ms | 12k (2-stage) |
U-Net Action Expert Results (My Experiments)
| Variant | Success Rate (LIBERO Spatial) | Success Rate (Full LIBERO) |
|---|---|---|
| U-Net (avg pool) | 99% | 0% |
| U-Net (attn pool) | 98% | 6.5% |
| U-Net (type enc+attn pool) | 98% | 19.75% |
Key Findings
- RS-IMLE reduces inference time by ~70% but incurs an 8-14% drop in success rate
- One-step flow matching achieves the best trade-off: it maintains a 94%+ success rate with the lowest inference time (28.6ms)
- Objective alignment matters: Flow-matching fine-tuning preserves pretrained action manifold better than RS-IMLE
- U-Net information bottleneck: Pooling-based conditioning works for simpler task suites (LIBERO Spatial) but fails on diverse tasks because it compresses the rich VLM output sequence into a single vector
Discussion
RS-IMLE vs Flow Matching Trade-off
Our experiments reveal a trade-off between inference efficiency and task performance when fine-tuning the action expert with the RS-IMLE objective. While π0.5-LoRA-RSIMLE substantially improves inference speed relative to the baseline, it incurs an absolute drop of 8-14% in success rate across the LIBERO benchmark.
Importance of Objective Alignment
The comparison with one-step flow matching highlights the importance of objective alignment with the pretrained action expert. Fine-tuning the flow-matching model to operate in a single inference step preserves both task performance and inference efficiency, likely because the pretrained checkpoint already encodes a velocity field consistent with the flow-matching objective.
In contrast, RS-IMLE optimizes a fundamentally different objective that emphasizes sample coverage rather than trajectory likelihood. This objective mismatch causes the model to deviate from the pretrained action manifold, resulting in reduced task success.
Inference Speedup Analysis
Although RS-IMLE reduces the number of forward passes during inference, the observed speedup is substantially smaller than the theoretical 10x reduction implied by decreasing the number of denoising steps from 10 to 1. This suggests that non-negligible overheads such as action chunking and model I/O contribute significantly to the overall action generation latency.
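As a rough back-of-the-envelope check, assuming latency decomposes into a fixed per-call overhead plus a per-denoising-step cost, the reported numbers for the flow-matching expert suggest that fixed overhead dominates the one-step latency:

```python
# Assumes T(N) = c + N * s: c = fixed per-call overhead (e.g. action
# chunking, model I/O), s = cost of one denoising step of the action expert.
T10, T1 = 131.13, 28.6   # ms: 10-step baseline vs one-step flow matching
s = (T10 - T1) / 9       # ~11.4 ms per denoising step
c = T1 - s               # ~17.2 ms fixed overhead
print(f"per-step ~{s:.1f} ms, overhead ~{c:.1f} ms "
      f"({100 * c / T1:.0f}% of one-step latency)")
```

Under this crude model, roughly 60% of the one-step latency is overhead that reducing the number of denoising steps cannot remove, consistent with the observed sub-10x speedup.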
U-Net Information Bottleneck
The U-Net experiments revealed a critical insight: the information bottleneck created by pooling VLM outputs into a single conditioning vector discards spatial relationships, fine-grained language grounding, and token-level semantics. This explains why π0.5 uses a transformer-based action expert with cross-attention instead of a U-Net: the transformer can attend selectively to different parts of the conditioning sequence at each action generation step.
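The toy snippet below contrasts the two conditioning pathways; the shapes, hidden size, and single cross-attention layer are hypothetical, chosen only to illustrate what pooling discards.

```python
import torch
import torch.nn as nn

B, L, H = 1, 256, 2048
vlm_tokens = torch.randn(B, L, H)   # full VLM output sequence

# (a) Pooling-based conditioning (our U-Net variants): the whole sequence is
#     compressed into one vector, discarding token-level structure.
pooled = vlm_tokens.mean(dim=1)     # (B, H) -- one vector conditions every step

# (b) Cross-attention conditioning (pi0.5's transformer action expert):
#     each action query can attend to different VLM tokens.
attn = nn.MultiheadAttention(embed_dim=H, num_heads=8, batch_first=True)
action_queries = torch.randn(B, 50, H)  # e.g. one query per action step
conditioned, _ = attn(action_queries, vlm_tokens, vlm_tokens)  # (B, 50, H)
```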
