About BrickAGI

A benchmark for whether large language models can produce buildable LEGO designs — not merely catalog-correct ones.

The gap this benchmark exists to surface

During the validation spike that led to this project, both Claude Opus 4.7 and GPT-5.5 produced outputs we would previously have called "valid": every part number resolved in the Rebrickable catalog, every color ID was real, and the BOM passed a strict catalog validator. Assembled on a table, those same outputs fell apart: the parts shared edges, but no studs mated to anti-studs at adjacent Z-heights.

Catalog correctness ≠ buildability. A flat 5×5 surface composed of five 1×5 plates laid side by side is catalog-correct and physically disconnected. The "bonding-layer rule" says that any flat surface wider than the largest single plate must be a multi-layer build, not a single-layer tiling, so that an overlapping layer ties the seams together. It is not in any model's training data as an explicit rule. Models learn it (or don't) implicitly. BrickAGI exists to measure how well.
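The 5×5 example can be made concrete with a small connectivity check. The sketch below is illustrative only and is not the BrickAGI structural validator: the `Plate` type, the stud-grid coordinates, and the rule "two plates bond when they sit on adjacent layers and their footprints overlap by at least one stud" are simplifying assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Plate:
    """Hypothetical axis-aligned plate on a stud grid (not a BrickAGI type)."""
    x: int  # stud column of the minimum corner
    y: int  # stud row of the minimum corner
    z: int  # layer index, in plate heights
    w: int  # studs along x
    d: int  # studs along y


def studs_mate(a: Plate, b: Plate) -> bool:
    """Assumed bonding rule: adjacent layers and overlapping footprints."""
    if abs(a.z - b.z) != 1:
        return False  # same layer (shared edges only) or too far apart
    overlap_x = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    overlap_y = min(a.y + a.d, b.y + b.d) - max(a.y, b.y)
    return overlap_x > 0 and overlap_y > 0


def connected_components(plates: list[Plate]) -> int:
    """Count rigidly connected groups using a small union-find."""
    parent = list(range(len(plates)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(plates)), 2):
        if studs_mate(plates[i], plates[j]):
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(plates))})


# Five 1x5 plates side by side on one layer: catalog-correct, disconnected.
single_layer = [Plate(x=0, y=row, z=0, w=5, d=1) for row in range(5)]
# Add a crosswise bonding layer on top: every seam is now bridged.
bonded = single_layer + [Plate(x=col, y=0, z=1, w=1, d=5) for col in range(5)]

print(connected_components(single_layer))  # 5
print(connected_components(bonded))        # 1
```

Five components means five loose plates; one component means a single assembly. Real parts need more than footprint overlap (orientation, connector types), so this only shows why a single layer can never pass.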

Why LEGO specifically

What v1 measures

The v1 headline metric is CaSS (Catalog-And-Scope Score). Each task scores the gated product bom_pass × scope_pass, and the headline is the equal-weighted mean of the per-tier means across the 5 difficulty tiers. It rewards a model that produces a catalog-valid BOM whose total piece count, required colors, and declared BOM-level scope rules obey the brief. It does not measure whether the BOM physically assembles; that is the deferred CBS metric, gated on the structural validator's coverage.
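As a rough illustration of that aggregation, here is a minimal sketch. It assumes "equal-tier-mean" means averaging task scores within each tier and then averaging the tier means with equal weight. The record layout (`tier`, `bom_pass`, `scope_pass` tuples) and the function names are hypothetical and are not taken from the BrickAGI codebase.

```python
from collections import defaultdict
from statistics import mean


def task_score(bom_pass: bool, scope_pass: bool) -> int:
    """Gated product: scope only counts when the BOM is catalog-valid."""
    return int(bom_pass and scope_pass)


def cass(results: list[tuple[int, bool, bool]]) -> float:
    """Mean of per-tier means over (tier, bom_pass, scope_pass) records.

    Assumption: every tier has at least one task. A tier with no records
    would simply be absent from the average.
    """
    by_tier: dict[int, list[int]] = defaultdict(list)
    for tier, bom_pass, scope_pass in results:
        by_tier[tier].append(task_score(bom_pass, scope_pass))
    return float(mean(mean(scores) for scores in by_tier.values()))


# Made-up results for illustration only, not benchmark data.
example = [
    (1, True, True), (1, True, True),    # tier 1 mean: 1.0
    (2, True, True), (2, True, False),   # tier 2 mean: 0.5
    (3, True, True), (3, False, True),   # tier 3 mean: 0.5
    (4, True, False),                    # tier 4 mean: 0.0
    (5, False, False),                   # tier 5 mean: 0.0
]
print(cass(example))  # 0.4
```

Averaging by tier first keeps a tier with many easy tasks from dominating the headline number: a flat mean over the same eight tasks would give 0.5 rather than 0.4.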

We chose CaSS as the v1 headline because CBS, the long-term canonical metric, is still low-coverage in v1. The conservative structural validator can disprove narrow flat-plate failures and prove supported assembly-backed placements, but plausibly bonded builds submitted as a BOM only remain inconclusive and therefore score 0. A leaderboard whose top line is forced near 0 cannot rank anyone fairly. CaSS is meaningful today; CBS becomes meaningful as coverage grows. See methodology.
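The effect of scoring inconclusive results as 0 can be sketched with a three-valued verdict. This is a minimal illustration under assumed names (`Verdict`, `cbs_task_score`, `coverage`), not the project's actual CBS implementation, and the verdict counts are made up.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes of a conservative structural validator."""
    PROVEN = "proven"              # buildability demonstrated
    DISPROVEN = "disproven"        # a concrete failure was found
    INCONCLUSIVE = "inconclusive"  # the validator cannot decide


def cbs_task_score(verdict: Verdict) -> int:
    """Conservative scoring: only a proof earns credit."""
    return 1 if verdict is Verdict.PROVEN else 0


def coverage(verdicts: list[Verdict]) -> float:
    """Fraction of tasks where the validator reached a decision."""
    decided = [v for v in verdicts if v is not Verdict.INCONCLUSIVE]
    return len(decided) / len(verdicts)


# Illustrative verdicts: most builds may be fine, but cannot be proven.
verdicts = [Verdict.PROVEN, Verdict.DISPROVEN] + [Verdict.INCONCLUSIVE] * 8
scores = [cbs_task_score(v) for v in verdicts]

print(sum(scores) / len(scores))  # 0.1
print(coverage(verdicts))         # 0.2
```

With low coverage, the score is capped by what the validator can prove rather than by what the model can build, which is why such a number cannot rank models fairly.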

What's open, what's frozen

Status

v1.1 stable. 20 tasks, 3 protocols, and a Phase 0 baseline of 3 frontier runs (GPT-5.5 raw, GPT-5.5 scaffold, Gemini 3 Pro raw). 11 of 20 tasks now have placement-backed buildability proofs; 2 are honestly inconclusive (vertical-plane orientation not modeled); 7 are connector-heavy and pending prover audit. The Phase 0 v2 memo (baselines-v2.md) recommended Fork A — proceed to public Phase 1 launch — and this site is part of that launch. Community submissions are open via GitHub PR.

License & contributions

BrickAGI is open source under the MIT License. Submissions are welcome via GitHub PR (see Submit). Community task contributions enter via the AFOL track described in the human-validators design doc and require a curator-approved reference solve.

Acknowledgements

LEGO is a trademark of the LEGO Group. The LEGO Group does not sponsor, authorize, or endorse this project.