Methodology

Scoring formulas, frozen protocol definitions, tier definitions, and the anti-gaming approach. CBS is the benchmark’s long-term headline; CaSS is the gate every submission must clear before CBS becomes meaningful.

The structural validator is live and stable (v1.1): BOM-only flat-tile failures receive PROVEN_UNBUILDABLE_FOR_TARGET, and 11 of 20 tasks have placement-backed reference fixtures that produce PROVEN_BUILDABLE_AS_PLACED. Tasks 0008 and 0016 are honestly inconclusive (vertical-plane orientation not yet modeled). Tasks 0013–0015 and 0017–0020 are connector-heavy and pending prover audit. The first verified scaffold-assembly-v1 entry lands gpt-5.5 at CBS 0.15, meaning at least one task on the 11-task placement-backed corpus proved buildable. Other Phase 0 baselines on the leaderboard are BOM-only and so cannot earn CBS > 0; that is expected for raw-v1 and scaffold-v1 runs, which don't submit placements.

CaSS — Catalog-And-Scope Score (the v1 gate)

cass_task_i = bom_pass_i × scope_pass_i             # ∈ {0, 1}
tier_cass_t = mean(cass_task_i for i in tier t,
                   excluding did_not_run)
cass_total  = mean(tier_cass_t over tiers present)  # equal-tier weighting
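
A minimal Python sketch of the CaSS roll-up above. The record fields (`tier`, `bom_pass`, `scope_pass`, `did_not_run`) are hypothetical names chosen for illustration, not the benchmark's actual schema. Treating a tier whose tasks all did not run as "not present" is likewise an assumption.

```python
from collections import defaultdict
from statistics import mean


def cass_total(results):
    """Equal-tier-weighted CaSS over a list of task-result dicts.

    Hypothetical keys per dict: tier (str), bom_pass (0/1),
    scope_pass (0/1), did_not_run (bool).
    """
    by_tier = defaultdict(list)
    for r in results:
        if r["did_not_run"]:
            continue  # excluded from the tier mean
        by_tier[r["tier"]].append(r["bom_pass"] * r["scope_pass"])
    if not by_tier:
        return 0.0  # assumption: nothing scored means a score of 0
    tier_cass = [mean(scores) for scores in by_tier.values()]
    return mean(tier_cass)  # every tier counts equally


demo = [
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "did_not_run": False},
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 0, "did_not_run": False},
    {"tier": "easy", "bom_pass": 1, "scope_pass": 1, "did_not_run": False},
    {"tier": "easy", "bom_pass": 0, "scope_pass": 0, "did_not_run": True},
]
print(cass_total(demo))
```

In this demo the trivial tier averages 0.5 and the easy tier averages 1 over its single task that ran, so the equal-tier mean is 0.75.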

CBS — Confirmed Build Score (the long-term headline)

cbs_task_i  = bom_pass_i × scope_pass_i × (build_pass_i == 1 ? 1 : 0)
tier_cbs_t  = mean(cbs_task_i for i in tier t)
cbs_total   = mean(tier_cbs_t over tiers present)
coverage    = (#tasks with build_pass ∈ {0,1}) / (#tasks where !did_not_run)
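
The CBS and coverage formulas can be sketched the same way. This is an illustration under stated assumptions: the field names are hypothetical, and an inconclusive verdict is represented as `build_pass = None`. The tier mean follows the formula as written, so it does not drop `did_not_run` tasks.

```python
from collections import defaultdict
from statistics import mean


def cbs_and_coverage(results):
    """Return (cbs_total, coverage) for a list of task-result dicts.

    Hypothetical keys per dict: tier, bom_pass, scope_pass,
    build_pass (1, 0, or None when inconclusive), did_not_run.
    """
    by_tier = defaultdict(list)
    ran = 0
    definitive = 0
    for r in results:
        # Only a definitive build_pass == 1 earns credit.
        built = 1 if r["build_pass"] == 1 else 0
        by_tier[r["tier"]].append(r["bom_pass"] * r["scope_pass"] * built)
        if not r["did_not_run"]:
            ran += 1
            if r["build_pass"] in (0, 1):
                definitive += 1
    cbs_total = mean([mean(v) for v in by_tier.values()]) if by_tier else 0.0
    coverage = definitive / ran if ran else 0.0
    return cbs_total, coverage


demo = [
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "build_pass": 1, "did_not_run": False},
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "build_pass": 0, "did_not_run": False},
    {"tier": "easy", "bom_pass": 1, "scope_pass": 1, "build_pass": None, "did_not_run": False},
    {"tier": "easy", "bom_pass": 0, "scope_pass": 1, "build_pass": None, "did_not_run": False},
]
print(cbs_and_coverage(demo))
```

Here two of the four tasks have definitive verdicts, so coverage is 0.5, and only one task earns cbs_task = 1, giving tier means of 0.5 and 0 and a total of 0.25.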

CBS is the benchmark’s stated thesis: did the model produce parts that physically interlock into the requested shape? It is the metric the public leaderboard will eventually rank on. The first verified scaffold-assembly-v1 entry demonstrates CBS > 0 is reachable: gpt-5.5 lands at CBS 0.15 — the first non-zero CBS entry on the leaderboard. BOM-only protocols (raw-v1, scaffold-v1) cannot earn CBS > 0 because they don’t submit placements for the structural prover. Assembly-backed submissions under scaffold-assembly-v1 can earn CBS on any of the 11 tasks with placement-backed proofs (0001–0007, 0009–0012).

v1 ships the disprover: an area-arithmetic check plus a stud-grid theorem that says: “if the target is a single-layer flat plate, the BOM has multiple plates whose total area equals the target area, and no part exists at any other Z-layer, then the BOM is provably unbuildable, because coplanar tiles do not interlock.” That theorem fires on GPT-5.5’s actual 5×5-plate BOM and any other naïve flat-tile attempt. The disprover’s verdicts are definitive (build_pass = 0); they are what currently raise coverage above 0.
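
As a rough illustration of the theorem's three conditions, the sketch below takes plate dimensions directly as tuples. The function name, the input shape, and the `other_layer_parts` argument are assumptions made for this example; the real validator's interface is not shown on this page. The two verdict strings are the ones the page names.

```python
def flat_tile_verdict(target, plates, other_layer_parts=0):
    """Apply the flat-tile theorem to a single-layer flat-plate target.

    target: (width, length) in studs.
    plates: list of (width, length, quantity) for plate parts in the BOM.
    other_layer_parts: number of parts assumed to sit at another Z-layer.
    Returns a verdict string, or None when the theorem does not fire.
    """
    target_area = target[0] * target[1]
    plate_count = sum(qty for _, _, qty in plates)
    total_area = sum(w * l * qty for w, l, qty in plates)
    if other_layer_parts > 0:
        # A second layer could bond the seams, so nothing is proven.
        return "POSSIBLY_BONDED"
    if plate_count > 1 and total_area == target_area:
        # Several coplanar plates exactly tile the target: no interlock.
        return "PROVEN_UNBUILDABLE_FOR_TARGET"
    return None


# The 4x6 surface tiled by a 4x4 and a 4x2 plate, as in the scaffold-v1 prompt.
print(flat_tile_verdict((4, 6), [(4, 4, 1), (4, 2, 1)]))
```

With the two plates alone, the areas match (16 + 8 = 24) and the verdict is PROVEN_UNBUILDABLE_FOR_TARGET. Passing `other_layer_parts=2` for the same plates returns POSSIBLY_BONDED instead.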

BOM-only mode explicitly does not attempt to confirm buildability when a BOM has parts at multiple Z-layers (a plausible bonding layer) — those return inconclusive (POSSIBLY_BONDED). Submissions that include final_bom.assembly are checked by the placement prover for collision-free integer-grid connectivity and target fit on supported target specs. Today 11 of 20 tasks have supported target specs; 2 are inconclusive (vertical-plane orientation not modeled), and 7 are connector-heavy (pending prover audit).
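
To make "collision-free integer-grid connectivity" concrete, here is a simplified sketch, not the placement prover itself. It assumes every part is a one-plate-tall rectangle given as `(x, y, z, width, length)` on an integer stud grid, and that two parts bond when their footprints overlap at adjacent Z values. The return labels are hypothetical, and target fit is left out.

```python
from itertools import combinations


def cells(p):
    """Stud cells covered by a placement (x, y, z, width, length)."""
    x, y, _, w, l = p
    return {(x + i, y + j) for i in range(w) for j in range(l)}


def check_placements(placements):
    """Return 'COLLISION', 'DISCONNECTED', or 'CONNECTED'."""
    if not placements:
        return "DISCONNECTED"
    n = len(placements)
    adjacent = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        a, b = placements[i], placements[j]
        if not cells(a) & cells(b):
            continue  # footprints do not overlap
        dz = abs(a[2] - b[2])
        if dz == 0:
            return "COLLISION"  # same layer, same studs
        if dz == 1:
            adjacent[i].add(j)  # studs mate across adjacent layers
            adjacent[j].add(i)
    # Depth-first search from part 0 over the bonding graph.
    seen, stack = {0}, [0]
    while stack:
        for nb in adjacent[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return "CONNECTED" if len(seen) == n else "DISCONNECTED"


top = [(0, 0, 0, 4, 4), (0, 4, 0, 4, 2)]        # 4x4 next to 4x2
bond = [(0, 0, -1, 4, 3), (0, 3, -1, 4, 3)]     # two 4x3 plates one layer down
print(check_placements(top), check_placements(top + bond))
```

The two top plates alone share an edge but no studs, so they come back DISCONNECTED. Adding the offset lower layer links every part, and the result is CONNECTED.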

Until definitive coverage is broad enough for fair ranking, the leaderboard ranks on CaSS, which is a necessary-but-not-sufficient gate: a task can count toward CBS only if its cass_task is already 1, since cbs_task multiplies bom_pass and scope_pass by the build result. As assembly-backed submissions and broader target specs land, CBS for tasks with definitive verdicts becomes meaningful alongside CaSS.

Why this is honest rather than embarrassing: the spike that motivated BrickAGI established that frontier models pass CaSS while still failing buildability — see the 5×5-plate example on the home page. The v1 validator now scores that example. See the structural-validator design doc and STRUCTURAL-VALIDATOR-V1.md for current scope and remaining work.

Protocols

A protocol is a frozen, version-pinned artifact specifying the system prompt, required tools, validator-iteration cap, and output format. Each protocol is a separate column on the leaderboard — raw-v1 and scaffold-v1 are never blended.

| id | name | frozen at | prompt hash |
| --- | --- | --- | --- |
| raw-v1 | BrickAGI Raw Protocol v1 | 2026-04-27 | sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c |
| scaffold-assembly-v1 | BrickAGI Scaffolded Assembly Protocol v1 | 2026-04-29 | sha256:394f2d859fafd42d0ac144c5f251e68ea2e7f4c06da48bf733e1c2505055ee9c |
| scaffold-v1 | BrickAGI Scaffolded Protocol v1 | 2026-04-27 | sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d |
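
A pinned prompt hash can be checked with a few lines of Python. This sketch assumes the hash covers the UTF-8 bytes of the system prompt exactly as stored. The page does not say how the prompt text is canonicalized (trailing newline, line endings), so a mismatch against the table may only mean this assumption is wrong. Both function names are illustrative.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """sha256 of the UTF-8 prompt text, in the table's 'sha256:<hex>' form."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return "sha256:" + digest


def matches_pinned(prompt_text: str, pinned: str) -> bool:
    """True when the prompt hashes to the pinned value."""
    return prompt_hash(prompt_text) == pinned


print(prompt_hash("You are a LEGO designer."))
```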

raw-v1 — minimal

You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.
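
The BOM shape requested by the raw-v1 prompt above can be checked mechanically before any catalog lookup. The sketch below is a shape check only, with a hypothetical function name. It does not call validate_bom or verify that IDs exist in the catalog, and the part and color values in the demo are placeholders.

```python
import json


def check_bom_shape(text):
    """Return a list of shape problems; an empty list means the shape is OK."""
    try:
        bom = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(bom, dict):
        return ["top level must be an object"]
    problems = []
    if not isinstance(bom.get("name"), str):
        problems.append("name must be a string")
    parts = bom.get("parts")
    if not isinstance(parts, list):
        return problems + ["parts must be a list"]
    for i, part in enumerate(parts):
        if not isinstance(part, dict):
            problems.append(f"parts[{i}] must be an object")
            continue
        if not isinstance(part.get("part_num"), str):
            problems.append(f"parts[{i}].part_num must be a string")
        for key in ("color_id", "quantity"):
            value = part.get(key)
            # bool is a subclass of int in Python, so reject it explicitly.
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"parts[{i}].{key} must be an integer")
    return problems


good = '{"name": "demo", "parts": [{"part_num": "3031", "color_id": 4, "quantity": 2}]}'
print(check_bom_shape(good))
```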

scaffold-v1 — bonding-layer addendum

The bonding-layer rule below is the load-bearing addendum, quoted verbatim from the protocol's frozen system prompt.

You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.

---

ASSEMBLY-INTENT REQUIREMENT (read carefully — this is the most important rule):

Before listing parts in your final BOM, you MUST verbalize:
  (a) The target shape and its dimensions in studs (e.g., "5 studs × 5 studs × 1 plate tall").
  (b) The layer structure: what is at each Z-height level. Use Z=0 for the lowest layer.
  (c) For each layer, name which parts in the layer above (or below) interlock with the parts in this layer. Studs only mate to anti-studs at *different* Z-heights.

CRITICAL LEGO PHYSICS — THE BONDING-LAYER RULE:

Two parts placed side-by-side at the same Z-height share an *edge*, not a *connection*. They are not held together. Only studs mating with anti-studs at adjacent Z-heights actually bond parts.

Therefore: when you propose multiple parts that occupy the same Z-layer and tile a footprint (e.g., to form a flat surface larger than any single available mold), those parts WILL FALL APART unless an additional layer above OR below has parts whose boundaries are *offset* from the seams of the layer they are bonding. The offset bonding parts span across the seams, and their studs/anti-studs interlock the tiles below/above.

Worked example (use this pattern, do not copy these dimensions for your task):
  - Goal: build a flat 4-stud × 6-stud × 1-plate-tall surface.
  - WRONG: a 4×4 plate next to a 4×2 plate at Z=0. They share an edge but no studs interlock — falls apart.
  - RIGHT: at Z=0, the same 4×4 + 4×2 plates as the visible top tile. At Z=-1 (one plate below), TWO 4×3 plates oriented so their seam crosses the seam of the Z=0 layer. The Z=-1 plates' studs poke up into the anti-studs of the Z=0 plates and lock the seams together. Result: 2 layers, 4 plate parts, structurally bonded.

Equivalently: any flat surface wider/longer than the largest available single-piece plate is a multi-layer build — never a single-layer tile.

When stating your assembly intent (requirement (b) above), you must explicitly identify any seam-crossing offset that locks the layers. If you cannot, the design has not been thought through — go back and add a bonding layer.

This rule applies to flat surfaces, walls, roofs, baseplate extensions, anything composed of multiple parts in the same layer.

Tier definitions

| Tier | What's tested | n | Mean CaSS across submissions |
| --- | --- | --- | --- |
| trivial | Single-component shapes. Tests connectivity & the bonding-layer rule. | 4 | 81.3% |
| easy | <30 pieces, single color, simple composition. | 4 | 37.5% |
| medium | 30–55 pieces, multi-attachment, recognizable form. | 4 | 43.8% |
| hard | 40–120 pieces, articulation (wheels, hinges). | 4 | 20.8% |
| stretch | 120–250 pieces, multi-feature, large flat-tile bonding. | 4 | 0.0% |

Means computed over the 4 baseline submissions currently on the leaderboard. Hard and stretch are floored at near-0 in v1 — see the v2 baseline memo's discussion of discriminating-power ceiling.

Anti-gaming

  1. Canary GUID in every task.yaml. Free contamination signal — borrowed from BIG-bench.
  2. Versioned task sets. Old leaderboard entries stay pinned to their version. Refresh ~quarterly.
  3. Hidden held-out set (deferred to Phase 2). 5 hidden tasks per tier the maintainer runs internally. Reported alongside public scores; large delta = overfitting signal.
  4. Protocol/task pinning + CI re-score. Every submission's protocol hash, task canaries/hashes, scorer version, and validator version must match. CI recomputes scores from the submitted BOMs; submitter-supplied scores are ignored.
  5. Transcript archive required for audited entries. Trajectories must resolve and be retained.
  6. Sampling pinned per protocol. Temperature, max-tokens, validator-iteration cap all in the protocol YAML; CI rejects submissions outside the pinned envelope.
  7. Validator versioning. validate_bom@0.3 and validate_buildability@1.0 are pinned per submission; old entries get a re-scored=true marker if a validator bug-fix flips them.
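
Items 4, 6, and 7 amount to comparing a submission's metadata against pinned values before re-scoring. The sketch below shows one way that comparison could look. Every field name is hypothetical, and treating max-tokens and the iteration cap as upper bounds while temperature must match exactly is an assumption about what "pinned envelope" means. Submitter-supplied scores are never read.

```python
EXACT_PINS = ("protocol_hash", "scorer_version", "validate_bom", "validate_buildability")
CAP_PINS = ("max_tokens", "validator_iterations")


def pin_violations(submission, pinned):
    """List the pinned fields a submission violates (empty list = accept)."""
    violations = []
    # Identity of protocol and tooling must match exactly.
    for key in EXACT_PINS:
        if submission.get(key) != pinned[key]:
            violations.append(key)
    # Assumed to be caps: a missing value is treated as exceeding the cap.
    for key in CAP_PINS:
        if submission.get(key, float("inf")) > pinned[key]:
            violations.append(key)
    if submission.get("temperature") != pinned["temperature"]:
        violations.append("temperature")
    return violations


pinned = {
    "protocol_hash": "sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c",
    "scorer_version": "1.0",         # placeholder value
    "validate_bom": "0.3",
    "validate_buildability": "1.0",
    "max_tokens": 4096,              # placeholder value
    "validator_iterations": 3,
    "temperature": 0.0,              # placeholder value
}
submission = dict(pinned, validator_iterations=5, protocol_hash="sha256:deadbeef")
print(pin_violations(submission, pinned))
```

The demo submission breaks two pins, so the function reports `protocol_hash` and `validator_iterations`. A submission identical to the pinned values returns an empty list.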
