BrickAGI — A LEGO-building benchmark for LLMs

Can LLMs build with LEGO?

BrickAGI tests whether AI models can design LEGO that actually holds together — not just name the right parts. Models earn points for every build we can prove is physically sound; harder builds are worth more, and that total is the ranking below.

GitHub repo

Current leader: gpt-5.5 with 76 Build Points of 2,631 available.

37 building tasks across 7 difficulty tiers — from a 5×7 plate to a 64×64 mosaic and a 447-voxel sculpture.

What is this?

A set of LEGO building tasks. BrickAGI hands each one to an AI model and asks for a list of parts and how they fit together. It then checks whether the parts are real, whether they match the brief, and whether they actually interlock into the requested shape.

Why does this matter?

Ask a model for a flat 5×7 LEGO plate and it lists every correct piece — a perfect parts score. But laid out as described, the pieces just sit side by side and fall apart: nothing clicks together. Catalog-correct is not the same as buildable. That gap is what BrickAGI measures.
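
To make the gap concrete, here is a minimal sketch of a connectivity check. It is not BrickAGI's structural prover: the `Plate` type, the helper names, and the rule that two plates bond only when one sits directly on the other with at least one shared stud are all simplifying assumptions made for illustration. The single-layer layout covers a 5×7 footprint yet has no bonds, while adding a second layer with shifted seams ties every plate into one assembly. Whether a two-layer answer satisfies the original brief is a separate question this sketch does not address.

```python
from collections import deque
from typing import NamedTuple


class Plate(NamedTuple):
    """Hypothetical plate: origin (x, y) in studs, layer z, w studs along x, d studs along y."""
    x: int
    y: int
    z: int
    w: int
    d: int


def studs_overlap(a: Plate, b: Plate) -> bool:
    """True if the two footprints share at least one stud position."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.d and b.y < a.y + a.d)


def clicks_together(a: Plate, b: Plate) -> bool:
    """Assumed bonding rule: adjacent layers with overlapping footprints."""
    return abs(a.z - b.z) == 1 and studs_overlap(a, b)


def is_single_assembly(plates: list[Plate]) -> bool:
    """Breadth-first search over the bond graph; True if every plate is reachable."""
    if not plates:
        return False
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j, other in enumerate(plates):
            if j not in seen and clicks_together(plates[i], other):
                seen.add(j)
                queue.append(j)
    return len(seen) == len(plates)


# Layer 0: six plates laid side by side, covering 5x7 studs with no vertical bonds.
flat = [
    Plate(0, 0, 0, 2, 4), Plate(2, 0, 0, 2, 4), Plate(4, 0, 0, 1, 4),
    Plate(0, 4, 0, 2, 3), Plate(2, 4, 0, 2, 3), Plate(4, 4, 0, 1, 3),
]

# Layer 1: the same plate sizes with seams shifted so each one straddles a seam below.
bonded = flat + [
    Plate(0, 0, 1, 1, 3), Plate(1, 0, 1, 2, 3), Plate(3, 0, 1, 2, 3),
    Plate(0, 3, 1, 1, 4), Plate(1, 3, 1, 2, 4), Plate(3, 3, 1, 2, 4),
]

print(is_single_assembly(flat))    # False
print(is_single_assembly(bonded))  # True
```

Both layouts use the same kinds of plates, so a parts-only check cannot tell them apart; only the placement data can.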

Who is this for?

Developers benchmarking a model against physical-construction tasks. LLM builders writing agents that reason about spatial constraints. Researchers studying whether language models understand assembly rules.

Leaderboard

Ranked by Build Points. How scoring works →

How to read this table
  • Build Points — the ranking: total difficulty of everything the model proved it can build. Unbounded; harder builds worth more. Shown as earned / available.
  • Core (0–100%) — how much of the core benchmark (trivial–hard tiers) the model proved it can build. The core band was sized on corpus v1.4 so the strongest one-shot agent run measured to date landed near 50% — an anchor on one model at one point in time, not a stable scale; expect drift as models improve. With 24 core tasks the 95% confidence interval on a single run spans roughly ±20 points. The stretch/expert/master tiers are unbounded frontier headroom that feeds Build Points instead.
  • CaSS — Catalog & Scope Score (0–100%) — the gate: are the parts real and on-brief?
  • CBS — Confirmed Build Score (0–100%) — the prize: do the parts actually interlock into the requested shape?
  • T·E·M·H·S·X·R = the seven difficulty tiers; each cell = % of that tier passed.
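
The ±20-point figure quoted for Core can be reproduced with a back-of-the-envelope calculation. The sketch below assumes each of the 24 core tasks is an independent, equally weighted pass/fail trial and uses the normal-approximation (Wald) interval; the function name `ci_half_width` is hypothetical, and the benchmark's own interval may be computed differently.

```python
import math


def ci_half_width(n_tasks: int, pass_rate: float, z: float = 1.96) -> float:
    """Wald half-width for a binomial proportion, in percentage points.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    return 100 * z * math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)


# 24 core tasks, evaluated at the 50% anchor where the interval is widest.
print(round(ci_half_width(24, 0.5), 1))  # 20.0
```

The half-width shrinks as the pass rate moves away from 50%, so ±20 points is the worst case under these assumptions.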

About the protocols

raw-v1, scaffold-v1, and scaffold-assembly-v1 are separate rows, never blended. scaffold-v1 adds a bonding-layer rule to the system prompt; scaffold-assembly-v1 additionally requires placement output the structural prover can verify — only it can earn CBS > 0. Full protocol definitions →

#  Model             Protocol         PTS         Core  CaSS   CBS    T     E    M    H    S   X   R
1  gpt-5.5 (openai)  raw-assembly-v1  76 / 2,631  17%   28.7%  14.3%  100%  50%  22%  29%  0%  0%  0%

🔨 Developers

Run a model on the benchmark and submit your score.

Submit a run →

🤖 LLM builders

Drop-in primer your agent can read to start solving BrickAGI tasks.

Agent Skill file →

📚 Researchers

How we score: formulas, protocols, anti-gaming approach.

How we score →