About BrickAGI
A benchmark for whether large language models can produce buildable LEGO designs — not merely catalog-correct ones.
The gap this benchmark exists to surface
During the validation spike that led to this project, both Claude Opus 4.7 and GPT-5.5 produced outputs we'd previously have called "valid": every part-number resolved in the Rebrickable catalog, every color-id was real, and the BOM passed a strict catalog validator. Those same outputs would then fall apart on a table when assembled — the parts shared edges, but no studs mated to anti-studs at adjacent Z-heights.
Catalog correctness ≠ buildability. A flat 5×5 surface composed of five 1×5 plates side-by-side is catalog-correct and physically disconnected. The "bonding-layer rule" — that any flat surface wider than the largest single plate must be a multi-layer build, not a single-layer tile — is not in any model's training data as an explicit rule. They learn it (or don't) implicitly. BrickAGI exists to measure how well.
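The 5×5 example can be checked mechanically. The following is a minimal sketch, not BrickAGI's validator: it assumes every part is a rectangular plate one layer tall, described by a hypothetical `(x, y, z, width, depth)` tuple, and it treats two plates as bonded only when they sit on adjacent layers and their footprints overlap.

```python
from itertools import product

# Hypothetical representation: (x, y, z, width, depth) in studs and plate layers.
# This is an illustrative assumption, not BrickAGI's actual schema.
Plate = tuple[int, int, int, int, int]

def cells(p: Plate) -> set[tuple[int, int]]:
    """Return the set of (x, y) stud positions the plate covers."""
    x, y, _, w, d = p
    return {(x + i, y + j) for i, j in product(range(w), range(d))}

def bonded(a: Plate, b: Plate) -> bool:
    """Plates bond only on adjacent layers with overlapping footprints."""
    return abs(a[2] - b[2]) == 1 and bool(cells(a) & cells(b))

def is_connected(plates: list[Plate]) -> bool:
    """Flood-fill the bond graph; True if every plate is reachable from the first."""
    if not plates:
        return True
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(plates)):
            if j not in seen and bonded(plates[i], plates[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(plates)

# Five 1x5 plates side by side on one layer: a 5x5 surface with no bonds.
single_layer = [(x, 0, 0, 1, 5) for x in range(5)]

# The same surface plus a second layer of five plates laid crosswise.
two_layer = single_layer + [(0, y, 1, 5, 1) for y in range(5)]

print(is_connected(single_layer))  # False
print(is_connected(two_layer))     # True
```

The single-layer surface fails because no pair of plates differs in Z, so the bond graph has no edges at all. Adding a crosswise second layer gives every bottom plate a stud-to-anti-stud overlap with every top plate.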
Why LEGO specifically
- A finite, public part vocabulary. Rebrickable provides ~50,000 canonical parts with stable IDs. We can validate a BOM deterministically.
- A simple, deterministic mating rule. Studs mate to anti-studs at adjacent Z-heights. Parts that merely sit side by side at the same Z-height do not bond.
- Many human reference solves available. AFOL communities (LDraw, BrickLink, MOC sites) provide ground truth without us having to author every solve from scratch.
- The failure modes are visible. A bad LEGO design falls apart on the table; a bad code generation hides in a test suite. LEGO's cost-of-being-wrong is legible to a layperson.
What we measure
Three metrics, in order of what they tell you:
- Build Points — the ranking: total difficulty of everything the model proved it can build; unbounded, harder builds worth more.
- CaSS — Catalog & Scope Score (0–100%): the gate — are the parts real and on-brief?
- CBS — Confirmed Build Score (0–100%): the prize — do the parts actually interlock into the requested shape?
CaSS is the gate: a model can earn CBS (and therefore Build Points) only on a task whose parts already pass the catalog-and-scope check. See methodology for the formulas.
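The gating relationship can be sketched in a few lines. This is an illustration only: the real formulas live in the methodology, and the `TaskResult` fields, the boolean pass flags, and the difficulty weights below are all hypothetical simplifications (CaSS and CBS are percentage scores, and how they aggregate per task is not specified here).

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    difficulty: float      # hypothetical per-task weight
    cass_passed: bool      # parts are real and on-brief (the gate)
    build_confirmed: bool  # parts interlock into the requested shape

def build_points(results: list[TaskResult]) -> float:
    """Sum difficulty over tasks that pass the CaSS gate and are confirmed buildable."""
    return sum(r.difficulty for r in results if r.cass_passed and r.build_confirmed)

results = [
    TaskResult("t1", 1.0, cass_passed=True, build_confirmed=True),
    TaskResult("t2", 3.0, cass_passed=True, build_confirmed=False),
    TaskResult("t3", 5.0, cass_passed=False, build_confirmed=True),  # gated out
]
print(build_points(results))  # 1.0
```

Only `t1` contributes: `t2` passes the gate but does not build, and `t3` is excluded by the gate regardless of its build outcome.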
What's open, what's frozen
- The task corpus is open. Every task.yaml, every reference solve, every pass criterion lives in the public repo.
- The protocols are frozen. raw-v1, scaffold-v1, and scaffold-assembly-v1 are version-pinned with sha256 prompt hashes; baselines run before protocol freeze are research notes, not benchmark data.
- The validators are deterministic. CI re-scores every submission from the submitted BOMs; submitter-supplied scores are not honored. Anyone can run the validator locally.
- The contamination canary is in every task. A 16-hex canary GUID appears in each task.yaml; we use it to detect training-set leakage.
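Two of the mechanisms above, prompt-hash pinning and canary detection, can be sketched with the Python standard library. The function names, the example prompt, and the canary value are hypothetical placeholders, not the ones used in the repo.

```python
import hashlib

def prompt_sha256(prompt_text: str) -> str:
    """Hex SHA-256 digest of a prompt, used as a version pin."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()

def protocol_matches(prompt_text: str, pinned_hash: str) -> bool:
    """True if the prompt is byte-identical to the one that was pinned."""
    return prompt_sha256(prompt_text) == pinned_hash

def contains_canary(text: str, canary: str) -> bool:
    """Case-insensitive check for a canary string inside model output or a corpus."""
    return canary.lower() in text.lower()

# Hypothetical frozen prompt and its pinned hash.
frozen_prompt = "Build a 2x4 wall, three bricks high."
pinned = prompt_sha256(frozen_prompt)

print(protocol_matches(frozen_prompt, pinned))        # True
print(protocol_matches(frozen_prompt + " ", pinned))  # False

# Hypothetical 16-hex canary; real values live in each task.yaml.
canary = "0123456789abcdef"
print(contains_canary("... canary 0123456789ABCDEF ...", canary))  # True
```

Any edit to a frozen prompt, even a trailing space, changes the digest, which is what makes the hash useful as a pin. A canary hit in model output suggests the task text was seen during training.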
Status
The corpus spans seven difficulty tiers (trivial → master) across 26 tasks, with three frozen protocols. The structural validator is live and stable, and community submissions are open via GitHub PR. See methodology for current validator coverage and remaining work.
License & contributions
BrickAGI is open source under the MIT License. Submissions are welcome via GitHub PR (see Submit). Community task contributions enter via the AFOL track described in the human-validators design doc and require a curator-approved reference solve.
Acknowledgements
- τ²-bench (Sierra Research) — submission flow template.
- Terminal-Bench — canary-GUID and Docker-sandboxed eval pattern.
- BIG-bench — canary-GUID convention.
- HELM — scenario × metric matrix inspiration.
- The LDraw community — part vocabulary and hand-built reference solves.
LEGO is a trademark of the LEGO Group. The LEGO Group does not sponsor, authorize, or endorse this project.