Methodology
How models are scored. The ranking is Build Points — the total difficulty of everything a model proves it can build. CaSS — Catalog & Scope Score is the gate (are the parts real and on-brief?), and CBS — Confirmed Build Score is the prize (do the parts actually interlock into the requested shape?). Below: the formulas, frozen protocols, tier definitions, and the anti-gaming approach.
Build Points — the ranking
build_points = Σ difficulty(task) over tasks where build_pass == 1
A model earns a task's difficulty only when that task is
confirmed buildable (build_pass = 1). Build Points is the sum
of those difficulties — unbounded, so harder builds are
worth more and the ceiling rises as the corpus grows.
Difficulty is a property of the task — verified mechanisms + scale
— and is identical for every submitter, so it can't be padded by adding
filler parts. On the leaderboard it's shown as earned / available.
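The sum above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scorer: the `TaskResult` record, the task names, and the difficulty values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TaskResult:
    task_id: str               # hypothetical identifier
    difficulty: float          # property of the task, identical for every submitter
    build_pass: Optional[int]  # 1 = proven buildable, 0 = proven unbuildable, None = inconclusive


def build_points(results: Iterable[TaskResult]) -> float:
    """Sum difficulty over tasks that are confirmed buildable."""
    return sum(r.difficulty for r in results if r.build_pass == 1)


def available_points(results: Iterable[TaskResult]) -> float:
    """Total difficulty on offer, whatever the verdicts were."""
    return sum(r.difficulty for r in results)


# Made-up difficulties, for illustration only.
results = [
    TaskResult("flat-plate", 2.0, 1),
    TaskResult("small-car", 8.0, 0),
    TaskResult("mosaic", 20.0, None),
]
print(f"{build_points(results):g} / {available_points(results):g}")  # prints "2 / 30"
```

Only the first task contributes: a disproved or inconclusive verdict earns nothing, and adding parts to a BOM cannot change a task's difficulty.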
CaSS — Catalog & Scope Score (the gate)
cass_task_i = bom_pass_i × scope_pass_i # ∈ {0, 1}
tier_cass_t = mean(cass_task_i for i in tier t,
excluding did_not_run)
cass_total = mean(tier_cass_t over tiers present) # equal-tier weighting
- bom_pass = 1 iff every BOM line has zero validator errors. Warnings (non-canonical mold aliases, unseen colour combos) are accepted.
- scope_pass = 1 iff total piece count is in the task's expected_piece_range, every required color_id is present, and any declared BOM-level scope_rules pass.
- did_not_converge tasks count as a 0 toward CaSS; did_not_run tasks (cost circuit-breaker tripped) are excluded from the denominator entirely.
- Equal-tier weighting prevents 4 trivial tasks from drowning 4 stretch tasks.
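A minimal sketch of the CaSS aggregation, assuming each task is a plain dict with hypothetical `tier`, `status`, `bom_pass` and `scope_pass` keys (the real record layout is not specified here). It shows how `did_not_converge` scores 0, how `did_not_run` leaves the denominator, and how tiers are averaged with equal weight.

```python
from collections import defaultdict
from statistics import mean


def cass_task(bom_pass, scope_pass):
    """Per-task gate: 1 only when both checks pass."""
    return int(bom_pass == 1 and scope_pass == 1)


def cass_total(tasks):
    """Return (cass_total, per-tier means) for an iterable of task dicts."""
    per_tier = defaultdict(list)
    for t in tasks:
        if t["status"] == "did_not_run":
            continue                      # excluded from the denominator
        if t["status"] == "did_not_converge":
            per_tier[t["tier"]].append(0)  # counts as a failure
        else:
            per_tier[t["tier"]].append(cass_task(t["bom_pass"], t["scope_pass"]))
    tier_means = {tier: mean(scores) for tier, scores in per_tier.items()}
    # Equal-tier weighting: mean of tier means, not a pooled mean over tasks.
    return mean(list(tier_means.values())), tier_means


tasks = [{"tier": "trivial", "status": "ok", "bom_pass": 1, "scope_pass": 1}] * 4 + [
    {"tier": "stretch", "status": "ok", "bom_pass": 1, "scope_pass": 1},
    {"tier": "stretch", "status": "ok", "bom_pass": 1, "scope_pass": 0},
    {"tier": "stretch", "status": "did_not_converge", "bom_pass": 0, "scope_pass": 0},
    {"tier": "stretch", "status": "did_not_run", "bom_pass": 0, "scope_pass": 0},
]
total, tiers = cass_total(tasks)
print(round(total, 3), tiers)
```

In this made-up run the trivial tier scores 1 and the stretch tier scores 1/3 (one pass out of three counted tasks), so the total is 2/3. A pooled mean over the seven counted tasks would give 5/7, which is the drowning effect that equal-tier weighting avoids.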
CBS — Confirmed Build Score (the prize)
cbs_task_i = bom_pass_i × scope_pass_i × (build_pass_i == 1 ? 1 : 0)
tier_cbs_t = mean(cbs_task_i for i in tier t)
cbs_total = mean(tier_cbs_t over tiers present)
coverage = (#tasks with build_pass ∈ {0,1}) / (#tasks where !did_not_run)
CBS answers the benchmark's core question: did the model produce parts
that physically interlock into the requested shape? Earning CBS on a
task requires CaSS = 1 plus a structural placement proof
(build_pass = 1). Only scaffold-assembly-v1
submissions can earn CBS > 0; BOM-only protocols
(raw-v1, scaffold-v1) cannot, because they
don't submit placements for the structural prover.
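The CBS and coverage formulas can be sketched as follows. The dict keys are hypothetical, and one point is an assumption: the tier_cbs_t formula as written states no exclusion, so this sketch averages over every task in the tier.

```python
from collections import defaultdict
from statistics import mean


def cbs_task(bom_pass, scope_pass, build_pass):
    """1 only when the gate passes and the build is proven."""
    return int(bom_pass == 1 and scope_pass == 1 and build_pass == 1)


def cbs_total(tasks):
    """Mean of per-tier means (equal-tier weighting)."""
    per_tier = defaultdict(list)
    for t in tasks:
        per_tier[t["tier"]].append(
            cbs_task(t["bom_pass"], t["scope_pass"], t["build_pass"])
        )
    return mean([mean(scores) for scores in per_tier.values()])


def coverage(tasks):
    """Share of tasks that ran and received a definitive verdict (0 or 1)."""
    ran = sum(1 for t in tasks if not t["did_not_run"])
    definitive = sum(1 for t in tasks if t["build_pass"] in (0, 1))
    return definitive / ran if ran else 0.0


tasks = [
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "build_pass": 1, "did_not_run": False},
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "build_pass": 0, "did_not_run": False},
    {"tier": "hard", "bom_pass": 1, "scope_pass": 1, "build_pass": None, "did_not_run": False},
    {"tier": "hard", "bom_pass": 0, "scope_pass": 1, "build_pass": 1, "did_not_run": False},
]
print(cbs_total(tasks), coverage(tasks))  # prints "0.25 0.75"
```

The last task shows why CBS requires CaSS = 1: a placement proof alone earns nothing when the BOM fails validation. The third task is inconclusive, so it lowers coverage without counting as a disproof.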
The validator ships a disprover: an area-arithmetic
check plus a stud-grid theorem that says "if the target is a
single-layer flat plate, the BOM has multiple plates whose total area
equals the target area, and no part exists at any other Z-layer, then
the BOM is provably unbuildable — coplanar tiles do not interlock."
That theorem fires on any naïve flat-tile attempt, including the 5×7
and 5×5 plates that motivated the benchmark. Disprover verdicts are
definitive (build_pass = 0).
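A rough sketch of the flat-tile theorem as stated, not the validator's actual code. The `Plate` record and its `z` field are hypothetical, and how the real disprover infers layers from a BOM is not covered here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Plate:
    width: int   # studs
    length: int  # studs
    z: int = 0   # layer index, in plate heights


def flat_tile_disproved(target_w: int, target_l: int, plates: Sequence[Plate]) -> bool:
    """True when the theorem fires: several coplanar plates exactly tile the target area."""
    if len(plates) < 2:
        return False  # a single plate has no seams to bond
    if len({p.z for p in plates}) != 1:
        return False  # another Z-layer exists, so the theorem does not apply
    total_area = sum(p.width * p.length for p in plates)
    return total_area == target_w * target_l


# Naive 5x5 attempt: 16 + 4 + 4 + 1 = 25 studs, all on one layer.
naive = [Plate(4, 4), Plate(1, 4), Plate(1, 4), Plate(1, 1)]
# Two-layer 4x6 build: 4x4 + 4x2 on top, two 4x3 plates one layer below.
bonded = [Plate(4, 4, 0), Plate(4, 2, 0), Plate(4, 3, -1), Plate(4, 3, -1)]

print(flat_tile_disproved(5, 5, naive))   # True
print(flat_tile_disproved(4, 6, bonded))  # False
```

A `True` result corresponds to build_pass = 0. A `False` result only means the theorem did not fire; it does not prove the build holds together.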
BOM-only mode does not attempt to confirm buildability when a
BOM has parts at multiple Z-layers (a plausible bonding layer) — those
return inconclusive (POSSIBLY_BONDED). Submissions that
include final_bom.assembly are checked by the placement
prover for collision-free integer-grid connectivity and target fit.
Tiers and tasks without a definitive verdict simply show no CBS yet;
the gate (CaSS) and the ranking (Build Points) remain meaningful.
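To make "collision-free integer-grid connectivity" concrete, here is a simplified sketch under strong assumptions: every part is an axis-aligned plate one layer tall, two parts bond when they sit on adjacent layers and share at least one stud cell, and the target-fit check is left out. The `Placed` tuple is a hypothetical stand-in for whatever final_bom.assembly actually contains.

```python
from itertools import combinations
from typing import NamedTuple, Sequence, Set, Tuple


class Placed(NamedTuple):
    x: int  # stud coordinate of one corner
    y: int
    w: int  # extent along x, in studs
    l: int  # extent along y, in studs
    z: int  # layer index, in plate heights


def cells(p: Placed) -> Set[Tuple[int, int]]:
    """Stud cells covered by the part's footprint."""
    return {(p.x + i, p.y + j) for i in range(p.w) for j in range(p.l)}


def collision_free(parts: Sequence[Placed]) -> bool:
    """No two parts on the same layer may share a cell."""
    return not any(a.z == b.z and cells(a) & cells(b) for a, b in combinations(parts, 2))


def connected(parts: Sequence[Placed]) -> bool:
    """All parts form one component under stud/anti-stud bonds."""
    if not parts:
        return False
    adj = {i: set() for i in range(len(parts))}
    for (i, a), (j, b) in combinations(enumerate(parts), 2):
        if abs(a.z - b.z) == 1 and cells(a) & cells(b):
            adj[i].add(j)
            adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for m in adj[stack.pop()] - seen:
            seen.add(m)
            stack.append(m)
    return len(seen) == len(parts)


top = [Placed(0, 0, 4, 4, 0), Placed(4, 0, 2, 4, 0)]        # seam at x = 4
bottom = [Placed(0, 0, 3, 4, -1), Placed(3, 0, 3, 4, -1)]  # seam at x = 3
print(collision_free(top + bottom), connected(top + bottom))  # True True
print(connected(top))                                         # False
```

The offset seam is what links the two top plates: the second bottom plate overlaps both of them. With the top layer alone, the plates share an edge but no bond, so the graph splits into two components.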
Status: the structural validator is live and stable. See the structural-validator design doc and STRUCTURAL-VALIDATOR-V1.md for current scope and remaining work.
Protocols
A protocol is a frozen, version-pinned artifact
specifying the system prompt, required tools, validator-iteration
cap, and output format. Each protocol is a separate column on the
leaderboard — raw-v1 and scaffold-v1 are
never blended.
| id | name | frozen at | prompt hash |
|---|---|---|---|
| raw-assembly-v1 | BrickAGI Raw Assembly Protocol v1 | 2026-06-10 | sha256:ba354b3a8041d12ddf491577fb232f529f67bf54ecc996af73b7d45fb671d3ff |
| raw-v1 | BrickAGI Raw Protocol v1 | 2026-04-27 | sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c |
| scaffold-assembly-v1 | BrickAGI Scaffolded Assembly Protocol v1 | 2026-04-29 | sha256:394f2d859fafd42d0ac144c5f251e68ea2e7f4c06da48bf733e1c2505055ee9c |
| scaffold-v1 | BrickAGI Scaffolded Protocol v1 | 2026-04-27 | sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d |
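A prompt hash in the `sha256:<hex>` form used in the table can be computed and compared along these lines. Exactly which bytes the benchmark hashes (encoding, whitespace normalization, prompt only or the whole protocol file) is not stated here, so the UTF-8 encoding of the raw prompt text is an assumption, and `prompt_hash` and `matches_pin` are hypothetical helper names.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """Hash the prompt text (assumed UTF-8, no normalization) in sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def matches_pin(prompt_text: str, pinned: str) -> bool:
    """True when the prompt hashes to the pinned value."""
    return prompt_hash(prompt_text) == pinned


# Pinned value for scaffold-v1, copied from the table.
PINNED = "sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d"

print(matches_pin("You are a LEGO designer. (edited)", PINNED))  # False
```

Any edit to a frozen prompt, even a single character, yields a different digest, which is what makes the hash usable as a pin.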
raw-v1 — minimal
You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.
scaffold-v1 — bonding-layer addendum
The bonding-layer rule below is the load-bearing addendum. Quoted verbatim from the protocol's frozen system prompt.
You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.
---
ASSEMBLY-INTENT REQUIREMENT (read carefully — this is the most important rule):
Before listing parts in your final BOM, you MUST verbalize:
(a) The target shape and its dimensions in studs (e.g., "5 studs × 5 studs × 1 plate tall").
(b) The layer structure: what is at each Z-height level. Use Z=0 for the lowest layer.
(c) For each layer, name which parts in the layer above (or below) interlock with the parts in this layer. Studs only mate to anti-studs at *different* Z-heights.
CRITICAL LEGO PHYSICS — THE BONDING-LAYER RULE:
Two parts placed side-by-side at the same Z-height share an *edge*, not a *connection*. They are not held together. Only studs mating with anti-studs at adjacent Z-heights actually bond parts.
Therefore: when you propose multiple parts that occupy the same Z-layer and tile a footprint (e.g., to form a flat surface larger than any single available mold), those parts WILL FALL APART unless an additional layer above OR below has parts whose boundaries are *offset* from the seams of the layer they are bonding. The offset bonding parts span across the seams, and their studs/anti-studs interlock the tiles below/above.
Worked example (use this pattern, do not copy these dimensions for your task):
- Goal: build a flat 4-stud × 6-stud × 1-plate-tall surface.
- WRONG: a 4×4 plate next to a 4×2 plate at Z=0. They share an edge but no studs interlock — falls apart.
- RIGHT: at Z=0, the same 4×4 + 4×2 plates as the visible top tile. At Z=-1 (one plate below), TWO 4×3 plates oriented so their seam crosses the seam of the Z=0 layer. The Z=-1 plates' studs poke up into the anti-studs of the Z=0 plates and lock the seams together. Result: 2 layers, 4 plate parts, structurally bonded.
Equivalently: any flat surface wider/longer than the largest available single-piece plate is a multi-layer build — never a single-layer tile.
When stating your assembly intent (requirement (b) above), you must explicitly identify any seam-crossing offset that locks the layers. If you cannot, the design has not been thought through — go back and add a bonding layer.
This rule applies to flat surfaces, walls, roofs, baseplate extensions, anything composed of multiple parts in the same layer.
Tier definitions
| Tier | What's tested | n | Mean CaSS across submissions |
|---|---|---|---|
| trivial | Single-component shapes. Tests connectivity & the bonding-layer rule. | 4 | 68.8% |
| easy | <30 pieces, single color, simple composition. | 4 | 50.0% |
| medium | 30–55 pieces, multi-attachment, recognizable form. | 9 | 39.1% |
| hard | 40–120 pieces, articulation (wheels, hinges). | 7 | 31.3% |
| stretch | 120–250 pieces, multi-feature, large flat-tile bonding. | 4 | 25.0% |
| expert | Larger reuses of proven build kinds at greater scale. | 4 | 10.4% |
| master | Exact-reproduction classes: mosaic portraits and voxel sculpts. | 5 | 8.3% |
The corpus spans 37 tasks across 7 difficulty tiers. Means are computed over the 12 submissions currently on the leaderboard. Equal-tier weighting keeps the easy tiers from drowning the hard ones.
Anti-gaming
- Canary GUID in every task.yaml. Free contamination signal, borrowed from BIG-bench.
- Versioned task sets. Old leaderboard entries stay pinned to their version. Refresh ~quarterly.
- Hidden held-out set (deferred to Phase 2). 5 hidden tasks per tier the maintainer runs internally. Reported alongside public scores; large delta = overfitting signal.
- Protocol/task pinning + CI re-score. Every submission's protocol hash, task canaries/hashes, scorer version, and validator version must match. CI recomputes scores from the submitted BOMs; submitter-supplied scores are ignored.
- Transcript archive required for audited. Trajectories must resolve and be retained.
- Sampling pinned per protocol. Temperature, max-tokens, validator-iteration cap all in the protocol YAML; CI rejects submissions outside the pinned envelope.
- Validator versioning. validate_bom@0.3 and validate_buildability@1.0 are pinned per submission; old entries get a re-scored=true marker if a validator bug-fix flips them.
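The pinning check in the list above might look roughly like this in CI. The field names, the flat-dict submission layout, and the scorer version value are hypothetical. Only the validator versions and the scaffold-v1 prompt hash come from this page.

```python
from typing import Dict, List

# Fields that must match the pinned values before a submission is re-scored.
PINNED_FIELDS = ("protocol_hash", "scorer_version", "validate_bom", "validate_buildability")


def pin_mismatches(submission: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return the pinned fields whose values differ from the expected pins."""
    return [f for f in PINNED_FIELDS if submission.get(f) != expected.get(f)]


expected = {
    "protocol_hash": "sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d",
    "scorer_version": "1.0",          # made-up value
    "validate_bom": "0.3",
    "validate_buildability": "1.0",
}

# A submission built against an older validator, carrying its own claimed score.
submission = dict(expected, validate_bom="0.2", claimed_score="100%")

print(pin_mismatches(submission, expected))  # ['validate_bom']
```

The `claimed_score` key is never read, mirroring the rule that submitter-supplied scores are ignored and scores are recomputed from the submitted BOMs.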
Glossary
- Build Points
- The ranking. Total difficulty of every task a model proves it can build (unbounded; harder builds worth more).
- CaSS — Catalog & Scope Score
- The gate (0–100%): are the parts real and on-brief?
- CBS — Confirmed Build Score
- The prize (0–100%): do the parts actually interlock into the requested shape?
- BOM
- Bill of Materials — the list of parts (with colors and quantities) a model proposes for a task.
- build_pass
- Per-task verdict from the structural prover: 1 = proven buildable, 0 = proven unbuildable, otherwise inconclusive.
- tier
- A difficulty band. The corpus spans seven: trivial, easy, medium, hard, stretch, expert, master.
- stud / anti-stud
- The bump on top of a brick (stud) and the receiving cavity beneath (anti-stud). A bond is a stud mating to an anti-stud at an adjacent height — bricks merely sharing an edge do not connect.