Methodology
Scoring formulas, frozen protocol definitions, tier definitions, and the anti-gaming approach. CBS is the benchmark’s long-term headline; CaSS is the gate every submission must clear before CBS becomes meaningful.
The structural validator is live and stable (v1.1):
BOM-only flat-tile failures receive PROVEN_UNBUILDABLE_FOR_TARGET,
and 11 of 20 tasks have placement-backed reference
fixtures that produce PROVEN_BUILDABLE_AS_PLACED. Tasks
0008 and 0016 are honestly inconclusive (vertical-plane orientation not
yet modeled). Tasks 0013–0015 and 0017–0020 are connector-heavy and
pending prover audit. The first verified scaffold-assembly-v1
entry lands gpt-5.5 at CBS 0.15 — meaning at least one task on
the 11-task placement-backed corpus proved buildable. Other Phase 0
baselines on the leaderboard are BOM-only and so cannot earn CBS > 0;
that's expected for raw-v1 and scaffold-v1 runs
which don't submit placements.
CaSS — Catalog-And-Scope Score (the v1 gate)
cass_task_i = bom_pass_i × scope_pass_i # ∈ {0, 1}
tier_cass_t = mean(cass_task_i for i in tier t,
excluding did_not_run)
cass_total = mean(tier_cass_t over tiers present) # equal-tier weighting
- bom_pass = 1 iff every BOM line has zero validator errors. Warnings (non-canonical mold aliases, unseen colour combos) are accepted.
- scope_pass = 1 iff total piece count is in the task's expected_piece_range, every required color_id is present, and any declared BOM-level scope_rules pass.
- did_not_converge tasks count as a 0 toward CaSS; did_not_run tasks (cost circuit-breaker tripped) are excluded from the denominator entirely.
- Equal-tier weighting prevents 4 trivial tasks from drowning 4 stretch tasks.
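The CaSS rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's scorer: the function name `cass_scores`, the dict keys (`tier`, `status`, `bom_pass`, `scope_pass`), and the `"ok"` status value are assumptions made for the example. It also assumes that a tier whose tasks all did not run is not "present".

```python
from statistics import mean

def cass_scores(tasks):
    """Return (cass_total, tier_cass) for a list of task-result dicts.

    Each dict is assumed to carry: tier, status, bom_pass, scope_pass.
    """
    per_tier = {}
    for task in tasks:
        if task["status"] == "did_not_run":
            continue  # excluded from the denominator entirely
        converged = task["status"] != "did_not_converge"
        passed = converged and task["bom_pass"] and task["scope_pass"]
        per_tier.setdefault(task["tier"], []).append(1 if passed else 0)
    tier_cass = {tier: mean(scores) for tier, scores in per_tier.items()}
    # Equal-tier weighting: average the tier means, not the raw tasks.
    total = mean(tier_cass.values()) if tier_cass else 0.0
    return total, tier_cass

def result(tier, status, bom_pass=False, scope_pass=False):
    return {"tier": tier, "status": status,
            "bom_pass": bom_pass, "scope_pass": scope_pass}

example = (
    [result("trivial", "ok", True, True)] * 4
    + [result("stretch", "did_not_converge"),
       result("stretch", "ok", True, False),
       result("stretch", "did_not_run"),
       result("stretch", "did_not_run")]
)
total, tiers = cass_scores(example)
print(total, tiers)
```

In this made-up input, four passing trivial tasks and two failing stretch tasks give a total of 0.5, whereas a task-weighted mean would report 4/6. That difference is what equal-tier weighting is meant to produce.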
CBS — Confirmed Build Score (the long-term headline)
cbs_task_i = bom_pass_i × scope_pass_i × (build_pass_i == 1 ? 1 : 0)
tier_cbs_t = mean(cbs_task_i for i in tier t)
cbs_total = mean(tier_cbs_t over tiers present)
coverage = (#tasks with build_pass ∈ {0,1}) / (#tasks where !did_not_run)
CBS is the benchmark’s stated thesis: did the
model produce parts that physically interlock into the requested
shape? It is the metric the public leaderboard will eventually
rank on. The first verified scaffold-assembly-v1 entry
demonstrates CBS > 0 is reachable: gpt-5.5 lands at
CBS 0.15 — the first non-zero CBS entry on the
leaderboard. BOM-only protocols (raw-v1,
scaffold-v1) cannot earn CBS > 0 because they
don’t submit placements for the structural prover.
Assembly-backed submissions under scaffold-assembly-v1
can earn CBS on any of the 11 tasks with placement-backed
proofs (0001–0007, 0009–0012).
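As a rough sketch of how the CBS and coverage formulas combine, the following Python follows the definitions as written. The dict keys and the use of `None` for an inconclusive `build_pass` are assumptions for illustration. A did_not_run task is treated here as scoring 0 inside its tier, since the tier formula above does not exclude it.

```python
from statistics import mean

def cbs_and_coverage(tasks):
    """Return (cbs_total, coverage) for a list of task-result dicts.

    build_pass is assumed to be 1 (proven buildable), 0 (proven
    unbuildable), or None (inconclusive / not checked).
    """
    per_tier = {}
    ran = 0
    definitive = 0
    for task in tasks:
        did_run = task["status"] != "did_not_run"
        build = task.get("build_pass")
        if did_run:
            ran += 1
            if build in (0, 1):
                definitive += 1
        solved = bool(did_run and task.get("bom_pass")
                      and task.get("scope_pass") and build == 1)
        per_tier.setdefault(task["tier"], []).append(1 if solved else 0)
    cbs_total = mean(mean(v) for v in per_tier.values()) if per_tier else 0.0
    coverage = definitive / ran if ran else 0.0
    return cbs_total, coverage

sample = [
    {"tier": "trivial", "status": "ok", "bom_pass": True, "scope_pass": True, "build_pass": 1},
    {"tier": "trivial", "status": "ok", "bom_pass": True, "scope_pass": True, "build_pass": 0},
    {"tier": "easy", "status": "ok", "bom_pass": True, "scope_pass": True, "build_pass": None},
    {"tier": "easy", "status": "did_not_run"},
]
cbs, cov = cbs_and_coverage(sample)
print(cbs, cov)
```

With this hypothetical input, one proven build in the trivial tier gives a CBS of 0.25, and two definitive verdicts out of three tasks that ran give a coverage of 2/3. A BOM-only run has no `build_pass == 1` anywhere, so its CBS stays at 0 regardless of CaSS.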
v1 ships the disprover: an area-arithmetic check
plus a stud-grid theorem that says “if the target is a
single-layer flat plate, the BOM has multiple plates whose total
area equals the target area, and no part exists at any other Z-layer,
then the BOM is provably unbuildable — coplanar tiles do not
interlock.” That theorem fires on GPT-5.5’s actual 5×5-plate
BOM and any other naïve flat-tile attempt. The disprover’s
verdicts are definitive (build_pass = 0); they are
what currently raise coverage above 0.
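The flat-tile theorem reduces to a small arithmetic check. The sketch below is one possible reading of it, not the validator's code: the function name, the `(width, depth, quantity)` tuple shape, and the `other_layer_parts` counter are all assumptions, and how the real validator decides that nothing sits on another Z-layer is not specified here.

```python
def flat_tile_disproof(target_w, target_d, plates, other_layer_parts=0):
    """Return the verdict string if the theorem fires, else None.

    plates: list of (width, depth, quantity) in studs, all one plate tall.
    other_layer_parts: hypothetical count of parts that would have to sit
    at a different Z-layer than the target surface.
    """
    plate_count = sum(qty for _, _, qty in plates)
    total_area = sum(w * d * qty for w, d, qty in plates)
    fires = (
        plate_count >= 2                        # multiple coplanar plates
        and total_area == target_w * target_d   # area exactly used up
        and other_layer_parts == 0              # nothing left to bond seams
    )
    return "PROVEN_UNBUILDABLE_FOR_TARGET" if fires else None

# A 4x4 plate beside a 4x2 plate, aimed at a flat 4x6 surface.
print(flat_tile_disproof(4, 6, [(4, 4, 1), (4, 2, 1)]))
# A single 4x6 plate: the theorem does not apply.
print(flat_tile_disproof(4, 6, [(4, 6, 1)]))
```

The theorem only ever disproves. When it does not fire, the sketch returns `None` and makes no claim that the BOM is buildable.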
BOM-only mode explicitly does not attempt to confirm
buildability when a BOM has parts at multiple Z-layers (a plausible
bonding layer) — those return
inconclusive (POSSIBLY_BONDED). Submissions that include
final_bom.assembly are checked by the placement prover
for collision-free integer-grid connectivity and target fit on supported
target specs. Today 11 of 20 tasks have supported target
specs; 2 are inconclusive (vertical-plane orientation not
modeled), and 7 are connector-heavy (pending prover audit).
Until definitive coverage is broad enough for fair ranking, the leaderboard
ranks CaSS as a necessary-but-not-sufficient gate:
every CBS-eligible submission must already have CaSS = 1 on a task before
its build_pass can be 1. As assembly-backed submissions and
broader target specs land, CBS for tasks with definitive verdicts becomes
meaningful alongside CaSS.
Why this is honest rather than embarrassing: the spike that motivated BrickAGI established that frontier models pass CaSS while still failing buildability — see the 5×5-plate example on the home page. The v1 validator now scores that example. See the structural-validator design doc and STRUCTURAL-VALIDATOR-V1.md for current scope and remaining work.
Protocols
A protocol is a frozen, version-pinned artifact
specifying the system prompt, required tools, validator-iteration
cap, and output format. Each protocol is a separate column on the
leaderboard — raw-v1 and scaffold-v1 are
never blended.
| id | name | frozen at | prompt hash |
|---|---|---|---|
| raw-v1 | BrickAGI Raw Protocol v1 | 2026-04-27 | sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c |
| scaffold-assembly-v1 | BrickAGI Scaffolded Assembly Protocol v1 | 2026-04-29 | sha256:394f2d859fafd42d0ac144c5f251e68ea2e7f4c06da48bf733e1c2505055ee9c |
| scaffold-v1 | BrickAGI Scaffolded Protocol v1 | 2026-04-27 | sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d |
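A pinned prompt hash can be checked with the standard library. The sketch below assumes the hash is a SHA-256 digest over the UTF-8 bytes of the prompt text, written with a `sha256:` prefix as in the table. The exact bytes that are hashed (whitespace, trailing newline, surrounding YAML) are not specified here, so treat that as an assumption. The function names are illustrative.

```python
import hashlib

def prompt_hash(prompt_text: str) -> str:
    """Hash a prompt in the 'sha256:<hex>' form used in the table."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return "sha256:" + digest

def matches_pin(prompt_text: str, pinned: str) -> bool:
    """True if the prompt text reproduces the pinned hash exactly."""
    return prompt_hash(prompt_text) == pinned

# Well-known SHA-256 test vector, used here only to show the call shape.
print(prompt_hash("abc"))
```

Any single-character change to a frozen prompt yields a different digest, which is what lets a hash column stand in for the full prompt text.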
raw-v1 — minimal
You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.
scaffold-v1 — bonding-layer addendum
The bonding-layer rule below is the load-bearing addendum. Quoted verbatim from the protocol's frozen system prompt.
You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.
---
ASSEMBLY-INTENT REQUIREMENT (read carefully — this is the most important rule):
Before listing parts in your final BOM, you MUST verbalize:
(a) The target shape and its dimensions in studs (e.g., "5 studs × 5 studs × 1 plate tall").
(b) The layer structure: what is at each Z-height level. Use Z=0 for the lowest layer.
(c) For each layer, name which parts in the layer above (or below) interlock with the parts in this layer. Studs only mate to anti-studs at *different* Z-heights.
CRITICAL LEGO PHYSICS — THE BONDING-LAYER RULE:
Two parts placed side-by-side at the same Z-height share an *edge*, not a *connection*. They are not held together. Only studs mating with anti-studs at adjacent Z-heights actually bond parts.
Therefore: when you propose multiple parts that occupy the same Z-layer and tile a footprint (e.g., to form a flat surface larger than any single available mold), those parts WILL FALL APART unless an additional layer above OR below has parts whose boundaries are *offset* from the seams of the layer they are bonding. The offset bonding parts span across the seams, and their studs/anti-studs interlock the tiles below/above.
Worked example (use this pattern, do not copy these dimensions for your task):
- Goal: build a flat 4-stud × 6-stud × 1-plate-tall surface.
- WRONG: a 4×4 plate next to a 4×2 plate at Z=0. They share an edge but no studs interlock — falls apart.
- RIGHT: at Z=0, the same 4×4 + 4×2 plates as the visible top tile. At Z=-1 (one plate below), TWO 4×3 plates oriented so their seam crosses the seam of the Z=0 layer. The Z=-1 plates' studs poke up into the anti-studs of the Z=0 plates and lock the seams together. Result: 2 layers, 4 plate parts, structurally bonded.
Equivalently: any flat surface wider/longer than the largest available single-piece plate is a multi-layer build — never a single-layer tile.
When stating your assembly intent (requirement (b) above), you must explicitly identify any seam-crossing offset that locks the layers. If you cannot, the design has not been thought through — go back and add a bonding layer.
This rule applies to flat surfaces, walls, roofs, baseplate extensions, anything composed of multiple parts in the same layer.
Tier definitions
| Tier | What's tested | n | Mean CaSS across submissions |
|---|---|---|---|
| trivial | Single-component shapes. Tests connectivity & the bonding-layer rule. | 4 | 81.3% |
| easy | <30 pieces, single color, simple composition. | 4 | 37.5% |
| medium | 30–55 pieces, multi-attachment, recognizable form. | 4 | 43.8% |
| hard | 40–120 pieces, articulation (wheels, hinges). | 4 | 20.8% |
| stretch | 120–250 pieces, multi-feature, large flat-tile bonding. | 4 | 0.0% |
Means computed over the 4 baseline submissions currently on the leaderboard. Hard and stretch are floored at near-0 in v1 — see the v2 baseline memo's discussion of discriminating-power ceiling.
Anti-gaming
- Canary GUID in every task.yaml. Free contamination signal, borrowed from BIG-bench.
- Versioned task sets. Old leaderboard entries stay pinned to their version. Refresh ~quarterly.
- Hidden held-out set (deferred to Phase 2). 5 hidden tasks per tier the maintainer runs internally. Reported alongside public scores; large delta = overfitting signal.
- Protocol/task pinning + CI re-score. Every submission's protocol hash, task canaries/hashes, scorer version, and validator version must match. CI recomputes scores from the submitted BOMs; submitter-supplied scores are ignored.
- Transcript archive required for audited. Trajectories must resolve and be retained.
- Sampling pinned per protocol. Temperature, max-tokens, validator-iteration cap all in the protocol YAML; CI rejects submissions outside the pinned envelope.
- Validator versioning. validate_bom@0.3 and validate_buildability@1.0 are pinned per submission; old entries get a re-scored=true marker if a validator bug-fix flips them.
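The pinning check in the list above amounts to comparing a handful of fields before any re-scoring happens. The following is a hypothetical sketch of that comparison. The field names, the `task_set_version` value, and the scorer version are invented for illustration and are not the real submission schema. Only the raw-v1 hash and the two validator versions come from this page.

```python
PINNED_FIELDS = (
    "protocol_hash",
    "task_set_version",
    "scorer_version",
    "validate_bom",
    "validate_buildability",
)

def pin_mismatches(submission: dict, pinned: dict) -> list:
    """Return the pinned fields whose values differ from the expected pins."""
    return [f for f in PINNED_FIELDS if submission.get(f) != pinned.get(f)]

pinned = {
    "protocol_hash": "sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c",
    "task_set_version": "v1",      # hypothetical value
    "scorer_version": "1.0",       # hypothetical value
    "validate_bom": "0.3",
    "validate_buildability": "1.0",
}
# A submission that pins an older validator and also reports its own score.
submission = dict(pinned, validate_bom="0.2", claimed_cass=1.0)
print(pin_mismatches(submission, pinned))  # → ['validate_bom']
```

The `claimed_cass` key is deliberately never read, mirroring the rule that submitter-supplied scores are ignored and scores are recomputed from the submitted BOMs.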