RoboReward — Vision-Language Reward Dataset for Robotics

RoboReward is a dataset and benchmark for training and evaluating general-purpose vision-language reward models for robotics, released in January 2026 by researchers at Stanford University and UC Berkeley (Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, Chelsea Finn). Each example pairs a task instruction with a real-robot rollout video and a discrete end-of-episode progress score from 1 to 5. The dataset contains 45,000 scored robot episodes built from two large-scale real-robot corpora, Open X-Embodiment and RoboArena. Because these source corpora are heavily skewed toward successful demonstrations, RoboReward uses counterfactual relabeling and temporal clipping to synthetically generate calibrated failure and near-miss examples from the same videos. The dataset is used to train the RoboReward 4B and 8B reward models (fine-tuned from Qwen3-VL) and provides a standardized reward-accuracy benchmark. It is released under CC BY 4.0, which permits commercial use with attribution.

Dataset specifications
Year: 2026
Episodes: 45,000
Embodiments: Open X-Embodiment robots, RoboArena robots
Modalities: RGB, language
Task categories: manipulation, pick-and-place, inspection
Data format: MP4, Parquet, JSON
License: CC BY 4.0
Access: open (commercial use permitted)
Maintainer: Stanford University, UC Berkeley
Origin country: US

What is it?

RoboReward is a dataset and benchmark for vision-language reward models in robotics, released in January 2026 by Stanford University and UC Berkeley researchers. Each example pairs a natural-language task instruction with a real-robot rollout video and a discrete end-of-episode progress reward score from 1 (no success) to 5 (perfect completion). The dataset contains 45,000 scored episodes across diverse tasks and embodiments, built from Open X-Embodiment and RoboArena.
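
As a rough illustration of the example structure described above, a single scored episode could be modeled as follows. This is a minimal sketch: the class name, field names, and file path are hypothetical and are not taken from the released schema. Only the three components (instruction, rollout video, score from 1 to 5) come from the description.

```python
from dataclasses import dataclass

MIN_SCORE, MAX_SCORE = 1, 5  # score range stated in the dataset description


@dataclass(frozen=True)
class RewardExample:
    """One scored episode. Field names are hypothetical, not the released schema."""

    instruction: str  # natural-language task instruction
    video_path: str   # location of the rollout video (for example an MP4 file)
    score: int        # end-of-episode progress: 1 (no success) to 5 (perfect completion)

    def __post_init__(self) -> None:
        # Reject anything outside the discrete 1..5 scale.
        if not isinstance(self.score, int) or not MIN_SCORE <= self.score <= MAX_SCORE:
            raise ValueError(
                f"score must be an integer in [{MIN_SCORE}, {MAX_SCORE}], got {self.score!r}"
            )


example = RewardExample(
    instruction="put the red block in the bowl",  # illustrative instruction
    video_path="episodes/000001.mp4",             # illustrative path only
    score=4,
)
print(example)
```

Validating the score at construction time keeps out-of-range labels from silently entering a training set.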

Who is it for?

RoboReward is for researchers working on reinforcement-learning-based policy improvement who need automatic reward signals instead of labor-intensive human labeling or brittle hand-crafted objectives. Unlike a demonstration dataset, it is used to train and evaluate reward models that judge whether a robot completed a task. Reward models trained on it are reported to close much of the performance gap to human-provided rewards.

How it compares

RoboReward fills a category no other dataset in the catalogue covers: reward modeling and evaluation rather than demonstration data. Where Open X-Embodiment and DROID provide demonstrations for imitation learning, RoboReward provides scored success and failure examples for training reward functions. It is built on top of Open X-Embodiment and RoboArena, adding the failure and near-miss examples that success-heavy demonstration corpora lack.

Limitations and access notes

RoboReward is a reward-modeling dataset, not a source of demonstration trajectories for policy imitation. Its failure and near-miss examples are partly synthetic: they are generated by counterfactual relabeling and temporal clipping of successful episodes rather than collected from genuine robot failures. CC BY 4.0 permits commercial use with attribution.

Frequently asked questions

What is RoboReward used for?

RoboReward trains and evaluates vision-language reward models for robotics — models that watch a robot's rollout video against a task instruction and output a reward score. This automates reward generation for reinforcement learning, reducing reliance on manual human labeling or hand-crafted objectives.
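
One way a discrete progress score could feed a reinforcement-learning loop is to rescale it to a scalar reward. The sketch below assumes a simple linear mapping from the 1 to 5 scale onto [0, 1]; the function name and the mapping itself are illustrative assumptions, not something the dataset description specifies.

```python
def score_to_reward(score: int, min_score: int = 1, max_score: int = 5) -> float:
    """Linearly rescale a discrete progress score to a reward in [0.0, 1.0].

    The linear mapping is an assumption made for illustration only.
    """
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} is outside [{min_score}, {max_score}]")
    return (score - min_score) / (max_score - min_score)


# Rewards for every score on the scale, lowest to highest.
rewards = [score_to_reward(s) for s in range(1, 6)]
print(rewards)
```

Other mappings, such as a binary reward that fires only at score 5, fit the same interface.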

How large is the RoboReward dataset?

RoboReward contains 45,000 scored robot episodes across diverse tasks and embodiments, built from the Open X-Embodiment and RoboArena corpora.

Can RoboReward be used commercially?

Yes. RoboReward is licensed under CC BY 4.0, permitting commercial use with attribution.

Why does RoboReward generate synthetic failure examples?

Existing large-scale robot corpora like Open X-Embodiment are heavily skewed toward successful demonstrations, which makes them poorly suited for training reward models that must recognize both success and failure. RoboReward uses counterfactual relabeling (pairing a video with a different instruction that the recorded behavior does not satisfy) and temporal clipping (truncating successful episodes so that they end at partial progress) to create calibrated negative and near-miss examples.
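
The two augmentations can be sketched on a toy episode as follows. This is a minimal illustration under stated assumptions: the `Episode` type, the function names, and the use of strings as stand-ins for video frames are all hypothetical. The description does not say how scores are assigned to the generated examples, so the caller supplies them here.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """Toy episode; a hypothetical stand-in for an instruction, video, and score."""

    instruction: str
    frames: Sequence[str]  # placeholder for video frames
    score: int             # progress score on the 1..5 scale


def counterfactual_relabel(episode: Episode, new_instruction: str, new_score: int) -> Episode:
    """Keep the video, swap in a different instruction, attach the score for the new pairing."""
    return replace(episode, instruction=new_instruction, score=new_score)


def temporal_clip(episode: Episode, keep_fraction: float, new_score: int) -> Episode:
    """Keep only the first part of the video so the episode ends at partial progress."""
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must be strictly between 0 and 1")
    n_keep = max(1, int(len(episode.frames) * keep_fraction))
    return replace(episode, frames=episode.frames[:n_keep], score=new_score)


success = Episode("open the drawer", [f"frame_{i}" for i in range(10)], score=5)

# Same video, instruction it does not satisfy; the score of 1 is chosen by the caller.
negative = counterfactual_relabel(success, "close the drawer", new_score=1)

# First half of the video only; the score of 3 is chosen by the caller.
near_miss = temporal_clip(success, keep_fraction=0.5, new_score=3)

print(negative.instruction, negative.score, len(negative.frames))
print(near_miss.instruction, near_miss.score, len(near_miss.frames))
```

Both functions return new episodes and leave the original untouched, so one successful rollout can yield several negative and near-miss examples.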

What are the RoboReward 4B and 8B models?

RoboReward 4B and 8B are vision-language reward models fine-tuned from Qwen3-VL on the RoboReward dataset. They predict discrete end-of-episode progress scores and outperform larger existing vision-language models on robotics reward accuracy.
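
To make the idea of reward accuracy concrete, the sketch below compares predicted scores against labeled scores with two simple metrics. The choice of metrics (exact-match accuracy and mean absolute error) and the function names are assumptions for illustration; the benchmark's actual metric is not specified here, and the numbers are made up for the example.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[int], labeled: Sequence[int]) -> float:
    """Fraction of episodes where the predicted score equals the labeled score."""
    if len(predicted) != len(labeled) or not labeled:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, labeled)) / len(labeled)


def mean_absolute_error(predicted: Sequence[int], labeled: Sequence[int]) -> float:
    """Average distance between predicted and labeled scores on the 1..5 scale."""
    if len(predicted) != len(labeled) or not labeled:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, labeled)) / len(labeled)


# Made-up scores for four episodes, for illustration only.
predicted_scores = [5, 3, 1, 4]
labeled_scores = [5, 2, 1, 5]

print(exact_match_accuracy(predicted_scores, labeled_scores))
print(mean_absolute_error(predicted_scores, labeled_scores))
```

Mean absolute error distinguishes a prediction that is off by one point from one that is off by four, which exact-match accuracy treats identically.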