BrickAGI — A LEGO-building benchmark for LLMs

Can LLMs build with LEGO?

BrickAGI is a public benchmark for AI models. It tests whether a model can produce a LEGO design that actually holds together — not just one that lists the right parts.

Leaderboard

First verified entry under the placement protocol. How scoring works →

How to read this table

CaSS = Catalog and Scope Score, the catalog-and-scope gate. Each task scores bom_pass × scope_pass, which is 1 only when every part exists in the catalog AND the BOM matches the brief's BOM-level scope rules. The pass rate is computed per tier, then averaged equally across the five tiers. Higher is better.
CBS = confirmed build score. A task counts only if it passes the CaSS gate (bom_pass × scope_pass = 1) and also has a structural placement proof that the parts physically interlock (build_pass = 1). Only scaffold-assembly-v1 submissions can earn CBS > 0; BOM-only protocols (raw-v1, scaffold-v1) cannot. Shown as "—" when a run has no placement-eligible protocol.
T·E·M·H·S = per-tier CaSS: Trivial, Easy, Medium, Hard, Stretch. Each tier holds 4 tasks; the cell is the fraction of those tasks passed.
raw-v1, scaffold-v1, and scaffold-assembly-v1 are separate rows, never blended. The scaffold-v1 protocol adds a bonding-layer rule to the system prompt; scaffold-assembly-v1 additionally requires placement output that the structural prover can verify.
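
The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not BrickAGI's scoring code: the `TaskResult` type and the `cass` and `cbs` functions are hypothetical names, and the sketch assumes that CBS is averaged across tiers the same way as CaSS, which this page does not state explicitly.

```python
from collections import defaultdict
from dataclasses import dataclass

TIERS = ("Trivial", "Easy", "Medium", "Hard", "Stretch")


@dataclass(frozen=True)
class TaskResult:
    """One task's gate outcomes (hypothetical record layout)."""
    tier: str
    bom_pass: int        # 1 if every part exists in the catalog
    scope_pass: int      # 1 if the BOM matches the brief's scope rules
    build_pass: int = 0  # 1 if a structural placement proof succeeded


def _tier_average(results, passed):
    """Fraction of tasks passed per tier, averaged equally across tiers."""
    by_tier = defaultdict(list)
    for r in results:
        by_tier[r.tier].append(1 if passed(r) else 0)
    missing = [t for t in TIERS if not by_tier[t]]
    if missing:
        raise ValueError(f"no results for tiers: {missing}")
    return sum(sum(by_tier[t]) / len(by_tier[t]) for t in TIERS) / len(TIERS)


def cass(results):
    """Per-task gate bom_pass x scope_pass, tier-averaged."""
    return _tier_average(results, lambda r: r.bom_pass * r.scope_pass == 1)


def cbs(results):
    """Gate plus build_pass; tier-averaging here is an assumption."""
    return _tier_average(
        results, lambda r: r.bom_pass * r.scope_pass * r.build_pass == 1
    )


# Illustrative run: 4 tasks per tier, with pass counts chosen to give
# tier fractions of 75%, 50%, 50%, 0%, 0%.
passes = {"Trivial": 3, "Easy": 2, "Medium": 2, "Hard": 0, "Stretch": 0}
run = [
    TaskResult(tier, int(i < n), int(i < n))
    for tier, n in passes.items()
    for i in range(4)
]
print(f"CaSS = {cass(run):.1%}")  # CaSS = 35.0%
```

Because every tier holds the same number of tasks, the equal-weight tier average here coincides with the plain fraction of all twenty tasks passed; the two would differ only if tiers had unequal sizes.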

#	Model	Protocol	T	E	M	H	S	CaSS	CBS
1	gpt-5.5 (openai)	`scaffold-assembly-v1`	75%	50%	50%	0%	0%	35.0%	15.0%

What is this?

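BrickAGI has twenty LEGO-building tasks, four in each of five difficulty tiers. It gives each task to an LLM and asks for a list of parts (a bill of materials, or BOM), plus placement output under the scaffold-assembly-v1 protocol. The benchmark then scores whether the parts exist in the catalog, whether the design follows the brief, and, where placements are submitted, whether it is physically buildable.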
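
As a rough illustration of the first two checks, the sketch below gates a parts list on catalog membership and on a set of scope rules. Everything in it is an assumption made for illustration: the function names, the part-id-to-quantity BOM shape, the sample part ids, the rejection of empty BOMs, and the example scope rule are not taken from BrickAGI. The third check, physical buildability, needs a structural prover and is not sketched here.

```python
from typing import Callable, Iterable, Mapping, Set

Bom = Mapping[str, int]            # hypothetical shape: part id -> quantity
ScopeRule = Callable[[Bom], bool]  # hypothetical: one rule derived from the brief


def bom_pass(bom: Bom, catalog: Set[str]) -> int:
    """1 if the BOM is non-empty and every part id exists in the catalog."""
    ok = len(bom) > 0 and all(
        part_id in catalog and qty > 0 for part_id, qty in bom.items()
    )
    return int(ok)


def scope_pass(bom: Bom, rules: Iterable[ScopeRule]) -> int:
    """1 if the BOM satisfies every scope rule."""
    return int(all(rule(bom) for rule in rules))


def task_gate(bom: Bom, catalog: Set[str], rules: Iterable[ScopeRule]) -> int:
    """Per-task gate: bom_pass x scope_pass."""
    return bom_pass(bom, catalog) * scope_pass(bom, list(rules))


# Illustrative catalog and a made-up scope rule (at most 10 pieces in total).
catalog = {"3001", "3020"}
rules = [lambda bom: sum(bom.values()) <= 10]

print(task_gate({"3001": 2, "3020": 4}, catalog, rules))  # 1
print(task_gate({"no-such-part": 1}, catalog, rules))     # 0
print(task_gate({"3001": 11}, catalog, rules))            # 0
```

Multiplying the two checks means a single unknown part or a single violated rule zeroes the task, which matches the all-or-nothing wording of the gate.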

Why does this matter?

LLMs handle catalog questions far better than construction questions. GPT-5.5 picks all the right pieces for a 5×7 LEGO plate, so its CaSS (Catalog and Scope Score) on that task is 1.0. Whether those pieces actually hold together as a single plate is a different problem, and models routinely get it wrong.

Who is this for?

Developers benchmarking a model against physical-construction tasks. LLM builders writing agents that reason about spatial constraints. Researchers studying whether language models understand assembly rules.

🔨

Developers

Run a model on the benchmark and submit your score.

Submit a run →

🤖

LLM builders

Drop-in primer your agent can read to start solving BrickAGI tasks.

Agent Skill file →

📚

Researchers

How we score: formulas, protocols, anti-gaming approach.

How we score →