About BrickAGI

A benchmark for whether large language models can produce buildable LEGO designs — not merely catalog-correct ones.

The gap this benchmark exists to surface

During the validation spike that led to this project, both Claude Opus 4.7 and GPT-5.5 produced outputs we'd previously have called "valid": every part number resolved in the Rebrickable catalog, every color ID was real, and the BOM passed a strict catalog validator. Yet those same designs would fall apart when assembled on a table: the parts shared edges, but no studs mated with anti-studs at adjacent Z-heights.

Catalog correctness ≠ buildability. A flat 5×5 surface composed of five 1×5 plates laid side by side is catalog-correct yet physically disconnected. The "bonding-layer rule" (any flat surface wider than the largest single plate must be a multi-layer build, not a single-layer tiling) is not in any model's training data as an explicit rule. Models learn it (or don't) implicitly. BrickAGI exists to measure how well.
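The 5×5 example can be made concrete with a small connectivity check. This is a minimal sketch, not BrickAGI's structural validator: the `Plate` type, the `mates` and `is_connected` functions, and the integer stud-grid coordinates are all assumptions introduced for illustration. It treats two plates as joined only when they sit on adjacent layers and their footprints overlap by at least one stud, then asks whether the whole design forms one connected component.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Plate:
    """Hypothetical plate model: corner position and size on a stud grid."""
    x: int  # stud column of the plate's corner
    y: int  # stud row of the plate's corner
    z: int  # layer index, counted in plate heights
    w: int  # extent along x, in studs
    d: int  # extent along y, in studs


def mates(a: Plate, b: Plate) -> bool:
    """True if the plates are on adjacent layers and share at least one stud cell."""
    if abs(a.z - b.z) != 1:
        return False  # same layer (edges only) or too far apart vertically
    overlap_x = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    overlap_y = min(a.y + a.d, b.y + b.d) - max(a.y, b.y)
    return overlap_x > 0 and overlap_y > 0


def is_connected(plates: list[Plate]) -> bool:
    """True if every plate is reachable from the first via stud connections."""
    if not plates:
        return True
    adjacency = {i: set() for i in range(len(plates))}
    for i, j in combinations(range(len(plates)), 2):
        if mates(plates[i], plates[j]):
            adjacency[i].add(j)
            adjacency[j].add(i)
    seen = {0}
    stack = [0]
    while stack:  # depth-first traversal of the connection graph
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(plates)


# Five 1x5 plates side by side on one layer: they share edges, nothing mates.
single_layer = [Plate(x=i, y=0, z=0, w=1, d=5) for i in range(5)]

# The same surface with a perpendicular second layer that bonds it together.
bonded = single_layer + [Plate(x=0, y=j, z=1, w=5, d=1) for j in range(5)]

print(is_connected(single_layer))  # False
print(is_connected(bonded))        # True
```

The single-layer surface fails because plates on the same layer never satisfy `mates`, while the perpendicular second layer ties every plate into one component. A real validator would also need to handle bricks of different heights, rotations, and non-rectangular parts, which this sketch ignores.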

Why LEGO specifically

What we measure

Three metrics, in order of what they tell you:

CaSS acts as a gate: a model can earn CBS (and therefore Build Points) only on a task whose parts already pass the catalog-and-scope check. See methodology for the formulas.
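The gating relationship can be sketched in a few lines. This is an illustration of the gate only, not the scoring formulas, which live in the methodology: the function name `gated_cbs`, the task records, the score values, and the choice to count a gated-out task as zero are all assumptions made for the example.

```python
def gated_cbs(passes_cass: bool, raw_cbs: float) -> float:
    """Return the CBS a task contributes once the CaSS gate is applied."""
    # Assumption: a task that fails the catalog-and-scope check contributes
    # zero, rather than being dropped from the tally.
    return raw_cbs if passes_cass else 0.0


# Hypothetical task results; the numbers are placeholders, not benchmark data.
tasks = [
    {"task": "example-a", "passes_cass": True, "raw_cbs": 0.8},
    {"task": "example-b", "passes_cass": False, "raw_cbs": 0.9},
]

earned = [gated_cbs(t["passes_cass"], t["raw_cbs"]) for t in tasks]
print(earned)  # [0.8, 0.0]
```

The second task earns nothing despite its higher raw score, because its parts never cleared the catalog-and-scope check.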

What's open, what's frozen

Status

The corpus spans seven difficulty tiers (trivial → master) across 26 tasks, with three frozen protocols. The structural validator is live and stable, and community submissions are open via GitHub PR. See methodology for current validator coverage and remaining work.

License & contributions

BrickAGI is open source under the MIT License. Submissions are welcome via GitHub PR (see Submit). Community task contributions enter via the AFOL track described in the human-validators design doc and require a curator-approved reference solve.

Acknowledgements

LEGO is a trademark of the LEGO Group. The LEGO Group does not sponsor, authorize, or endorse this project.