CALVIN — Composing Actions from Language and Vision
CALVIN (Composing Actions from Language and Vision) is a language-conditioned, long-horizon robot manipulation benchmark developed at the University of Freiburg and University of Tübingen. Released in 2021 under the MIT license, it contains 24,000 demonstration episodes for a simulated Franka Panda robot in a tabletop environment with diverse objects, covering tasks that require sequences of up to 5 chained language-conditioned manipulation steps. CALVIN evaluates whether robot policies can follow natural language instructions to complete long-horizon task sequences, a key capability for general-purpose robot deployment. The benchmark uses a multi-environment evaluation protocol (ABCD→D split): policies train on data from four environments with different visual contexts and are evaluated in environment D. CALVIN has become a standard benchmark for language-conditioned robot manipulation research and is widely used to evaluate vision-language-action models. Results are tracked on a public leaderboard that reports the average number of tasks completed in a 5-task evaluation chain.
| Field | Value |
|---|---|
| Year | 2021 |
| Episodes | 24,000 |
| Embodiments | Franka Panda (simulated) |
| Modalities | rgb, proprioception, language |
| Task categories | manipulation, pick-and-place, long-horizon, human-robot-interaction |
| Data format | npz, pkl |
| License | MIT |
| Access | open — commercial use permitted |
| Maintainer | University of Freiburg, University of Tübingen |
| Origin country | DE |
What is it?
CALVIN (Composing Actions from Language and Vision) is a language-conditioned, long-horizon robot manipulation benchmark developed at the University of Freiburg and University of Tübingen. Released in 2021 under the MIT license, it contains 24,000 demonstration episodes for a simulated Franka Panda, covering tasks that require sequences of up to 5 chained language-conditioned manipulation steps. CALVIN evaluates whether robot policies can follow natural language instructions to complete long-horizon task sequences.
Who is it for?
CALVIN is aimed at researchers working on vision-language-action (VLA) models, language-conditioned manipulation, and long-horizon task planning. It serves as a standard benchmark for evaluating whether robots can understand natural language instructions and execute multi-step task sequences, and teams developing VLA models use it as a primary evaluation benchmark.
Key specifications
- Episodes: 24,000 demonstrations
- Task sequences: Up to 5 chained language-conditioned steps per evaluation
- Robot platform: Simulated Franka Panda
- Environments: 4 (A, B, C, D) — ABCD→D evaluation split
- Format: NumPy, pickle
- License: MIT — commercial use permitted
- Access: Open — GitHub
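Since the data ships as NumPy archives, a first look at a file can be done without knowing its field names in advance. The sketch below is a minimal illustration under stated assumptions: `summarize_npz` is a hypothetical helper, and the stand-in file it reads uses made-up keys (`rgb`, `proprio`) and shapes that are not CALVIN's actual schema.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path: str) -> dict:
    """Map each array name in an .npz file to its (shape, dtype)."""
    with np.load(path, allow_pickle=False) as data:
        return {
            name: (data[name].shape, str(data[name].dtype))
            for name in data.files
        }


# Build a stand-in file so the sketch runs without the dataset.
# The key names and shapes below are hypothetical, not CALVIN's schema.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_frame.npz")
np.savez(
    demo_path,
    rgb=np.zeros((64, 64, 3), dtype=np.uint8),
    proprio=np.zeros(8, dtype=np.float32),
)

summary = summarize_npz(demo_path)
for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```

Pointing `summarize_npz` at a real file from the download should list whatever arrays that file actually contains. Loading with `allow_pickle=False` is a cautious default for files obtained from the internet.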
How it compares
CALVIN is a standard benchmark for language-conditioned long-horizon manipulation. Meta-World covers more tasks but without language conditioning. LIBERO-Long also covers long-horizon tasks, but it evaluates each task on its own rather than as a chain of instructions executed without a reset. CALVIN's 5-task chaining evaluation and public leaderboard make it a common reference for VLA model papers.
Limitations and access notes
CALVIN is simulation-only. At 24,000 episodes it is small by pretraining standards, so it is better treated as a benchmark than as a pretraining source. The MIT license permits unrestricted commercial use.
Frequently asked questions
What makes CALVIN unique among manipulation benchmarks?
CALVIN evaluates 5-task chained execution with natural language instructions — the robot must complete a sequence of 5 different language-specified tasks without resetting. Most benchmarks evaluate single tasks. This tests long-horizon planning and instruction following simultaneously.
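The chained metric can be sketched in a few lines. This is an illustrative sketch, not the benchmark's evaluation code: it assumes a chain stops counting at the first failed task, the function names are hypothetical, and the outcomes are made-up data rather than benchmark results.

```python
from typing import Sequence


def completed_in_chain(outcomes: Sequence[bool]) -> int:
    """Count tasks completed before the first failure in one chain."""
    count = 0
    for ok in outcomes:
        if not ok:
            break  # assumption: later tasks do not count after a failure
        count += 1
    return count


def chain_metrics(chains: Sequence[Sequence[bool]], chain_len: int = 5) -> dict:
    """Average completed tasks, plus the success rate at each chain position."""
    if not chains:
        raise ValueError("need at least one evaluation chain")
    counts = [completed_in_chain(c[:chain_len]) for c in chains]
    avg_len = sum(counts) / len(counts)
    success_at = {
        k: sum(n >= k for n in counts) / len(counts)
        for k in range(1, chain_len + 1)
    }
    return {"avg_len": avg_len, "success_at": success_at}


# Hypothetical outcomes for three chains (made-up data).
example = [
    [True, True, True, True, True],
    [True, True, False, False, False],
    [False, False, False, False, False],
]
metrics = chain_metrics(example)
print(metrics["avg_len"])
print(metrics["success_at"])
```

Here the three chains complete 5, 2, and 0 tasks, so the average is 7/3 tasks per chain. Reporting the per-position success rates alongside the average shows where in the chain a policy tends to fail.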
What is the ABCD→D evaluation protocol?
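In the ABCD→D split, policies train on environments A, B, C, and D and are evaluated exclusively on environment D. Because D is part of the training data, this split measures performance in a known visual context after training on diverse environments. The benchmark's ABC→D split, which holds D out of training, is the stricter test of generalisation to an unseen environment.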
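The split name can be read mechanically: letters before the arrow are training environments, and the letter after it is the evaluation environment. The sketch below assumes that naming convention; `parse_split` is a hypothetical helper and not part of the CALVIN codebase.

```python
def parse_split(name: str) -> dict:
    """Split a name such as 'ABCD→D' into training and evaluation environments."""
    normalized = name.replace("->", "→")  # accept an ASCII arrow as well
    train_part, sep, eval_part = normalized.partition("→")
    if not sep or not train_part or len(eval_part) != 1:
        raise ValueError(f"unrecognised split name: {name!r}")
    train_envs = sorted(set(train_part))
    return {
        "train_envs": train_envs,
        "eval_env": eval_part,
        # True when the evaluation environment also appears in training.
        "eval_env_seen_in_training": eval_part in train_envs,
    }


split = parse_split("ABCD→D")
print(split)
```

For `ABCD→D` the flag `eval_env_seen_in_training` is `True`, which makes explicit that the evaluation environment is also a training environment. A name such as `ABC->D` would yield `False`.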
Can CALVIN be used commercially?
Yes. CALVIN is MIT licensed, permitting unrestricted commercial use.
How is CALVIN related to VLA models?
CALVIN is widely used to evaluate vision-language-action models, including SuSIE, GR-1, and others. The average number of tasks completed in the 5-task chain is a commonly reported metric for comparing VLA model performance on language-conditioned manipulation.
How do I access CALVIN?
CALVIN is available at github.com/mees/calvin. No registration is required. The benchmark includes the simulation environment, dataset, and evaluation code.