CALVIN — Composing Actions from Language and Vision

CALVIN (Composing Actions from Language and Vision) is a benchmark for language-conditioned, long-horizon robot manipulation developed at the University of Freiburg and the University of Tübingen. Released in 2021 under the MIT license, it contains 24,000 demonstration episodes for a simulated Franka Panda robot in a tabletop environment with diverse objects. CALVIN evaluates whether a policy can follow natural language instructions through a sequence of up to 5 chained manipulation tasks, a key capability for general-purpose robot deployment. Its multi-environment evaluation protocol (the ABCD→D split) trains on four environments with different visual contexts, A to D, and evaluates in environment D. CALVIN has become a standard benchmark for language-conditioned manipulation research and is widely used to evaluate vision-language-action (VLA) models. Results are tracked on a public leaderboard that reports the average number of tasks completed in a 5-task evaluation chain.

Dataset specifications
Year	2021
Episodes	24,000
Embodiments	Franka Panda (simulated)
Modalities	rgb, proprioception, language
Task categories	manipulation, pick-and-place, long-horizon, human-robot-interaction
Data format	npz, pkl
License	MIT
Access	open — commercial use permitted
Maintainer	University of Freiburg, University of Tübingen
Origin country	DE

Who is it for?

CALVIN is aimed at researchers working on vision-language-action (VLA) models, language-conditioned manipulation, and long-horizon task planning. It is a standard benchmark for evaluating whether robots can understand natural language instructions and execute multi-step task sequences, and teams developing VLA models commonly use it as a primary evaluation benchmark.

Key specifications

Episodes: 24,000 demonstrations
Task sequences: Up to 5 chained language-conditioned steps per evaluation
Robot platform: Simulated Franka Panda
Environments: 4 (A, B, C, D) — ABCD→D evaluation split
Format: NumPy, pickle
License: MIT — commercial use permitted
Access: Open — GitHub
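
Because the data ships as NumPy archives, a quick way to get oriented is to list the arrays stored in a file before writing a loader. The sketch below is a minimal, hedged example: it builds a tiny stand-in `.npz` file so that it runs without the dataset, and the array names (`rgb`, `proprio`, `action`), shapes, and file name are hypothetical placeholders, not CALVIN's actual schema.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array name: (shape, dtype string)} for every array in an .npz file."""
    with np.load(path) as data:
        return {name: (data[name].shape, str(data[name].dtype)) for name in data.files}


# Build a small stand-in file. Names and shapes here are hypothetical,
# chosen only to mirror the listed modalities (rgb, proprioception).
tmp_dir = tempfile.mkdtemp()
frame_path = os.path.join(tmp_dir, "example_frame.npz")
np.savez(
    frame_path,
    rgb=np.zeros((200, 200, 3), dtype=np.uint8),
    proprio=np.zeros(15, dtype=np.float32),
    action=np.zeros(7, dtype=np.float32),
)

summary = describe_npz(frame_path)
for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Pointing `describe_npz` at a real file from the download shows the actual array names and shapes, which is safer than assuming them.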

How it compares

CALVIN is a standard benchmark for language-conditioned long-horizon manipulation. Meta-World covers more tasks but does not condition them on language. LIBERO-Long covers long-horizon tasks, but evaluates each as a single task rather than as a chain of separately instructed steps. CALVIN's 5-task chaining evaluation and public leaderboard make it a common reference point in VLA model papers.

Limitations and access notes

CALVIN is simulation-only. At 24,000 episodes it is small, so it is better treated as a benchmark than as a pretraining source. The MIT license permits commercial use, provided the copyright and license notice are retained.

Frequently asked questions

What makes CALVIN unique among manipulation benchmarks?

CALVIN evaluates 5-task chained execution with natural language instructions — the robot must complete a sequence of 5 different language-specified tasks without resetting. Most benchmarks evaluate single tasks. This tests long-horizon planning and instruction following simultaneously.
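
The chained evaluation can be summarised with a few lines of arithmetic. The sketch below assumes that each evaluation chain is recorded as the number of tasks completed in a row before the first failure; the function name `summarize_chains` and the sample counts are illustrative and are not taken from CALVIN's evaluation code.

```python
from typing import List, Sequence, Tuple

CHAIN_LENGTH = 5  # tasks per evaluation chain, as described above


def summarize_chains(
    completed: Sequence[int], chain_length: int = CHAIN_LENGTH
) -> Tuple[List[float], float]:
    """Summarise chained-evaluation results.

    completed[i] is the number of consecutive tasks solved in chain i
    before the first failure, so it lies between 0 and chain_length.
    Returns (success rate for completing at least k tasks, for k = 1..chain_length;
    average number of tasks completed per chain).
    """
    if not completed:
        raise ValueError("need at least one evaluated chain")
    if any(c < 0 or c > chain_length for c in completed):
        raise ValueError("counts must lie in [0, chain_length]")
    n = len(completed)
    # Fraction of chains in which at least k tasks in a row were completed.
    success_at_k = [sum(c >= k for c in completed) / n for k in range(1, chain_length + 1)]
    average_length = sum(completed) / n
    return success_at_k, average_length


# Made-up results for five chains, for illustration only.
rates, avg_len = summarize_chains([5, 3, 0, 5, 2])
print(rates)    # [0.8, 0.8, 0.6, 0.4, 0.4]
print(avg_len)  # 3.0
```

The average chain length equals the sum of the per-step success rates, which is why a single failure early in a chain costs more than one late in it.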

What is the ABCD→D evaluation protocol?

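In the ABCD→D split, policies are trained on data from environments A, B, C, and D and then evaluated exclusively in environment D. Because D is also part of the training data, this split measures how well a policy performs in one specific visual context after training on diverse environments, rather than after training on that environment alone. CALVIN also defines an ABC→D split that holds environment D out of training, which tests generalisation to an unseen environment.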
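
One way to make a split explicit is to record which environments supply training data and which one is used for evaluation. The sketch below is illustrative only: the `Split` class and its field names are hypothetical, and the ABC→D entry, a variant that holds D out of training, is included for contrast.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Split:
    """Hypothetical record of a train/evaluate environment split."""

    name: str
    train_envs: FrozenSet[str]
    eval_env: str

    @property
    def eval_env_seen_in_training(self) -> bool:
        # True when the evaluation environment also contributes training data.
        return self.eval_env in self.train_envs


ABCD_TO_D = Split("ABCD→D", frozenset("ABCD"), "D")
# Held-out variant for contrast: D is excluded from training.
ABC_TO_D = Split("ABC→D", frozenset("ABC"), "D")

for split in (ABCD_TO_D, ABC_TO_D):
    print(split.name, "eval env seen in training:", split.eval_env_seen_in_training)
```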

Can CALVIN be used commercially?

Yes. CALVIN is MIT licensed, which permits commercial use provided the copyright and license notice are retained.

How is CALVIN related to VLA models?

CALVIN is widely used to evaluate vision-language-action models and related language-conditioned policies, including SuSIE, RT-2, GR-1, and others. Performance on its 5-task chain, reported as the average number of tasks completed, is one of the most commonly cited metrics for comparing VLA models on language-conditioned manipulation.

How do I access CALVIN?

CALVIN is available at github.com/mees/calvin. No registration is required. The benchmark includes the simulation environment, dataset, and evaluation code.