Can LLMs build with LEGO?
BrickAGI is a public benchmark for AI models. It tests whether a model can produce a LEGO design that actually holds together — not just one that lists the right parts.
Leaderboard
First verified entry under the placement protocol. How scoring works →
- CaSS = catalog-and-scope gate. The per-task pass rate (bom_pass × scope_pass), averaged equally across the five tiers. Higher is better. A model passes a task only when every part exists in the catalog AND the BOM matches the brief's BOM-level scope rules. (A scoring sketch follows this list.)
- CBS = confirmed build score. Requires CaSS = 1 on a task plus a structural placement proof that the parts physically interlock (build_pass = 1). Only scaffold-assembly-v1 submissions can earn CBS > 0; BOM-only protocols (raw-v1, scaffold-v1) cannot. Shown as "—" when a run has no placement-eligible protocol.
- T·E·M·H·S = per-tier CaSS: Trivial, Easy, Medium, Hard, Stretch. Each tier holds 4 tasks; the cell is the fraction passed. Hover for the full label.
- raw-v1, scaffold-v1, and scaffold-assembly-v1 are separate rows, never blended. scaffold-v1 adds a bonding-layer rule to the system prompt; scaffold-assembly-v1 additionally requires placement output that the structural prover can verify.
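Here is a minimal sketch of the gating logic above, assuming per-task 0/1 fields named bom_pass, scope_pass, and build_pass. The field names and data layout are illustrative, not the official scorer's schema:

```python
# Sketch of CaSS/CBS gating, assuming each task is a dict with
# "tier", "bom_pass", "scope_pass", and "build_pass" (0 or 1).
# These names are illustrative, not BrickAGI's actual scorer schema.
from statistics import mean

TIERS = ["trivial", "easy", "medium", "hard", "stretch"]

def task_cass(task: dict) -> int:
    # A task passes CaSS only when BOTH gates pass.
    return task["bom_pass"] * task["scope_pass"]

def run_cass(tasks: list[dict]) -> float:
    # Equal weight per tier: average the five per-tier pass fractions,
    # each tier being the fraction of its 4 tasks that passed.
    per_tier = [
        mean(task_cass(t) for t in tasks if t["tier"] == tier)
        for tier in TIERS
    ]
    return mean(per_tier)

def task_cbs(task: dict, protocol: str) -> int:
    # CBS needs CaSS = 1 on the task plus a verified placement proof,
    # and only the placement protocol is eligible to earn it.
    if protocol != "scaffold-assembly-v1":
        return 0
    return task_cass(task) * task["build_pass"]
```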
What is this?
Twenty LEGO building tasks. BrickAGI gives each task to an LLM and asks it to produce a list of parts. The benchmark then scores whether the parts exist in the catalog, whether the design follows the brief, and whether it is physically buildable.
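As an illustration of the first of those checks, catalog verification reduces to set membership over part IDs in the submitted bill of materials. The catalog contents and record layout below are invented for the example, not BrickAGI's actual format:

```python
# Hypothetical BOM check: every submitted part must exist in the catalog.
# The catalog and BOM structure here are invented for illustration.
CATALOG = {"3020", "3021", "3023", "3710"}  # example plate part IDs

bom = [
    {"part": "3020", "qty": 1},  # 2x4 plate
    {"part": "3023", "qty": 2},  # 1x2 plate
]

bom_pass = all(line["part"] in CATALOG for line in bom)
print(bom_pass)  # True: every submitted part exists in the catalog
```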
Why does this matter?
LLMs ace catalog questions but fail construction questions. GPT-5.5 picks all the right pieces for a 5×7 LEGO plate, earning a perfect CaSS (Catalog and Scope Score) of 1.0. Whether those pieces actually hold together as a single plate is a different problem, and models routinely get it wrong.
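The gap comes down to what "holds together" means: plates bond only where studs overlap between adjacent layers, so the right parts laid side by side are still a loose pile. Here is a toy connectivity check in that spirit, using an axis-aligned placement format made up for illustration (BrickAGI's structural prover is more involved):

```python
# Toy interlock check: the placement format and bonding rule here are
# illustrative, not BrickAGI's structural prover.
Plate = tuple[int, int, int, int, int]  # (layer, x, y, width, height)

def overlaps(a: Plate, b: Plate) -> bool:
    # Footprints share at least one stud cell.
    _, ax, ay, aw, ah = a
    _, bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def bonded(a: Plate, b: Plate) -> bool:
    # Plates bond only across adjacent layers with overlapping studs.
    return abs(a[0] - b[0]) == 1 and overlaps(a, b)

def holds_together(plates: list[Plate]) -> bool:
    # The assembly passes only if the bond graph is connected.
    if not plates:
        return False
    seen = {plates[0]}
    frontier = [plates[0]]
    while frontier:
        cur = frontier.pop()
        for other in plates:
            if other not in seen and bonded(cur, other):
                seen.add(other)
                frontier.append(other)
    return len(seen) == len(plates)

# Two 1x2 plates side by side on one layer: right parts, no bond.
print(holds_together([(0, 0, 0, 2, 1), (0, 2, 0, 2, 1)]))  # False
# A bridging plate on the layer above makes one connected build.
print(holds_together([(0, 0, 0, 2, 1), (0, 2, 0, 2, 1), (1, 1, 0, 2, 1)]))  # True
```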
Who is this for?
Developers benchmarking a model against physical-construction tasks. LLM builders writing agents that reason about spatial constraints. Researchers studying whether language models understand assembly rules.
Run a model on the benchmark and submit your score.
Submit a run →

Drop-in primer your agent can read to start solving BrickAGI tasks.
Agent Skill file →

How we score: formulas, protocols, anti-gaming approach.
How we score →