Methodology

How models are scored. The ranking is Build Points: the total difficulty of everything a model proves it can build. CaSS (Catalog & Scope Score) is the gate (are the parts real and on-brief?), and CBS (Confirmed Build Score) is the prize (do the parts actually interlock into the requested shape?). Below: the formulas, frozen protocols, tier definitions, and the anti-gaming approach.

Build Points — the ranking

build_points = Σ difficulty(task)  over tasks where build_pass == 1

A model earns a task's difficulty only when that task is confirmed buildable (build_pass = 1). Build Points is the sum of those difficulties — unbounded, so harder builds are worth more and the ceiling rises as the corpus grows.

Difficulty is a property of the task — verified mechanisms + scale — and is identical for every submitter, so it can't be padded by adding filler parts. On the leaderboard it's shown as earned / available.
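
As a minimal sketch of the formula above, the following Python computes the earned / available pair for one submission. The TaskResult record, its field names, and the sample difficulties are hypothetical, and "available" is assumed here to mean the summed difficulty of every task in the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    difficulty: float          # fixed per task, identical for every submitter
    build_pass: Optional[int]  # 1 = proven buildable, 0 = proven unbuildable, None = inconclusive


def build_points(results):
    """Return (earned, available) Build Points for one submission."""
    earned = sum(r.difficulty for r in results if r.build_pass == 1)
    available = sum(r.difficulty for r in results)  # assumption: all tasks count
    return earned, available


# Hypothetical tasks and difficulties, for illustration only.
results = [
    TaskResult("flat-plate", 2.0, 1),
    TaskResult("small-car", 5.0, 0),
    TaskResult("mosaic", 13.0, None),
]
earned, available = build_points(results)
print(f"{earned:g} / {available:g}")
```

Only the confirmed task contributes, so adding filler parts to a BOM cannot change the earned total.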

CaSS — Catalog & Scope Score (the gate)

cass_task_i = bom_pass_i × scope_pass_i             # ∈ {0, 1}
tier_cass_t = mean(cass_task_i for i in tier t,
                   excluding did_not_run)
cass_total  = mean(tier_cass_t over tiers present)  # equal-tier weighting
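
The CaSS formulas above can be sketched in Python as follows. The dict keys mirror the names in the formulas, but the record layout is an assumption, as is the reading that a tier whose tasks all did not run counts as not present.

```python
from collections import defaultdict


def cass_total(tasks):
    """Return (cass_total, per-tier CaSS) using equal-tier weighting."""
    by_tier = defaultdict(list)
    for t in tasks:
        if t["did_not_run"]:
            continue  # excluded from the tier mean, as in the formula
        by_tier[t["tier"]].append(t["bom_pass"] * t["scope_pass"])
    tier_cass = {tier: sum(s) / len(s) for tier, s in by_tier.items()}
    if not tier_cass:
        return 0.0, {}
    return sum(tier_cass.values()) / len(tier_cass), tier_cass


# Hypothetical submission: two trivial tasks, two hard tasks (one did not run).
tasks = [
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "did_not_run": False},
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 0, "did_not_run": False},
    {"tier": "hard", "bom_pass": 1, "scope_pass": 1, "did_not_run": False},
    {"tier": "hard", "bom_pass": 0, "scope_pass": 0, "did_not_run": True},
]
total, per_tier = cass_total(tasks)
print(total, per_tier)
```

In this example a task-weighted mean over the three tasks that ran would be 2/3, while the equal-tier mean is 0.75, because each tier contributes one vote regardless of how many tasks it holds.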

CBS — Confirmed Build Score (the prize)

cbs_task_i  = bom_pass_i × scope_pass_i × (build_pass_i == 1 ? 1 : 0)
tier_cbs_t  = mean(cbs_task_i for i in tier t)
cbs_total   = mean(tier_cbs_t over tiers present)
coverage    = (#tasks with build_pass ∈ {0,1}) / (#tasks where !did_not_run)

CBS answers the benchmark's core question: did the model produce parts that physically interlock into the requested shape? Earning CBS on a task requires CaSS = 1 plus a structural placement proof (build_pass = 1). Only scaffold-assembly-v1 submissions can earn CBS > 0; BOM-only protocols (raw-v1, scaffold-v1) cannot, because they don't submit placements for the structural prover.
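
A hedged Python sketch of the CBS and coverage formulas is shown below. The dict layout is hypothetical, and build_pass is represented as None when the verdict is inconclusive. Because the tier_cbs formula lists no exclusions, this sketch keeps did_not_run tasks in the tier mean (where they score 0); that reading is an assumption.

```python
def cbs_and_coverage(tasks):
    """Return (cbs_total, coverage) for one submission."""
    by_tier = {}
    decided = ran = 0
    for t in tasks:
        # A task earns CBS only when the gate passes and the build is proven.
        score = t["bom_pass"] * t["scope_pass"] * (1 if t["build_pass"] == 1 else 0)
        by_tier.setdefault(t["tier"], []).append(score)
        if not t["did_not_run"]:
            ran += 1
            if t["build_pass"] in (0, 1):  # definitive verdict either way
                decided += 1
    tier_cbs = [sum(s) / len(s) for s in by_tier.values()]
    cbs_total = sum(tier_cbs) / len(tier_cbs) if tier_cbs else 0.0
    coverage = decided / ran if ran else 0.0
    return cbs_total, coverage


# Hypothetical submission with one proven build, one disproof,
# one inconclusive verdict, and one task that did not run.
tasks = [
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "build_pass": 1, "did_not_run": False},
    {"tier": "trivial", "bom_pass": 1, "scope_pass": 1, "build_pass": 0, "did_not_run": False},
    {"tier": "easy", "bom_pass": 1, "scope_pass": 1, "build_pass": None, "did_not_run": False},
    {"tier": "easy", "bom_pass": 0, "scope_pass": 0, "build_pass": None, "did_not_run": True},
]
cbs, coverage = cbs_and_coverage(tasks)
print(cbs, coverage)
```

Coverage is reported separately so that a low CBS caused by inconclusive verdicts can be told apart from one caused by disproved builds.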

The validator ships a disprover: an area-arithmetic check plus a stud-grid theorem that says "if the target is a single-layer flat plate, the BOM has multiple plates whose total area equals the target area, and no part exists at any other Z-layer, then the BOM is provably unbuildable — coplanar tiles do not interlock." That theorem fires on any naïve flat-tile attempt, including the 5×7 and 5×5 plates that motivated the benchmark. Disprover verdicts are definitive (build_pass = 0).
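
The area-arithmetic part of that theorem can be illustrated with a short Python sketch. This is not the validator's code: the function name, the (width, length, quantity) tuples, and the plate dimensions are all hypothetical, and the sketch assumes the target has already been identified as a single-layer flat plate.

```python
def flat_tile_disproved(target, plates):
    """Return True when the flat-tile theorem fires (build_pass = 0).

    target: (width, length) of a single-layer flat plate, in studs.
    plates: list of (width, length, quantity) for every part in the BOM.
    """
    count = sum(q for _, _, q in plates)
    area = sum(w * l * q for w, l, q in plates)
    target_area = target[0] * target[1]
    # Several plates whose area exactly fills the target leave nothing over
    # for a second layer, so they can only sit side by side and never bond.
    return count > 1 and area == target_area


# Naive 5x5 attempt: three coplanar plates that exactly tile 25 studs.
print(flat_tile_disproved((5, 5), [(4, 4, 1), (1, 4, 1), (1, 5, 1)]))

# 4x6 attempt with surplus area: a bonding layer is possible, so the
# theorem does not fire and the verdict stays inconclusive.
print(flat_tile_disproved((4, 6), [(4, 4, 1), (4, 2, 1), (4, 3, 2)]))
```

A False result here is not a proof of buildability; it only means this particular disproof does not apply.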

BOM-only mode does not attempt to confirm buildability when a BOM has parts at multiple Z-layers (a plausible bonding layer) — those return inconclusive (POSSIBLY_BONDED). Submissions that include final_bom.assembly are checked by the placement prover for collision-free integer-grid connectivity and target fit. Tiers and tasks without a definitive verdict simply show no CBS yet; the gate (CaSS) and the ranking (Build Points) remain meaningful.
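
To make the three placement checks concrete, here is a simplified Python sketch, assuming every part is a one-plate-tall rectangle given as (x, y, z, width, length) on an integer stud grid. The verdict strings, the tuple layout, and the definition of target fit (the top layer exactly covers the target rectangle) are assumptions for illustration, not the prover's actual interface.

```python
from itertools import product


def cells(p):
    """Grid cells occupied by one plate placement (x, y, z, width, length)."""
    x, y, z, w, l = p
    return {(x + dx, y + dy, z) for dx, dy in product(range(w), range(l))}


def check_assembly(placements, target):
    if not placements:
        return "EMPTY"
    occupied = {}
    for i, p in enumerate(placements):
        for c in cells(p):
            if c in occupied:
                return "COLLISION"  # two parts claim the same cell
            occupied[c] = i
    # A bond exists where one part sits directly on top of another.
    adj = {i: set() for i in range(len(placements))}
    for (x, y, z), i in occupied.items():
        j = occupied.get((x, y, z + 1))
        if j is not None:
            adj[i].add(j)
            adj[j].add(i)
    # Depth-first search: every part must be reachable through bonds.
    seen, stack = {0}, [0]
    while stack:
        for n in adj[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    if len(seen) != len(placements):
        return "DISCONNECTED"
    top = max(z for _, _, z in occupied)
    top_cells = {(x, y) for x, y, z in occupied if z == top}
    want = set(product(range(target[0]), range(target[1])))
    return "OK" if top_cells == want else "WRONG_SHAPE"


# Two coplanar plates sharing only an edge: no bond between them.
flat = [(0, 0, 0, 4, 4), (0, 4, 0, 4, 2)]
# Same top layer at z=1, with two 4x3 plates underneath spanning its seam.
bonded = [(0, 0, 0, 4, 3), (0, 3, 0, 4, 3), (0, 0, 1, 4, 4), (0, 4, 1, 4, 2)]
print(check_assembly(flat, (4, 6)), check_assembly(bonded, (4, 6)))
```

The sketch treats any vertical overlap as a bond, which is a simplification of real stud and anti-stud geometry.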

Status: the structural validator is live and stable. See the structural-validator design doc and STRUCTURAL-VALIDATOR-V1.md for current scope and remaining work.

Protocols

A protocol is a frozen, version-pinned artifact specifying the system prompt, required tools, validator-iteration cap, and output format. Each protocol is a separate column on the leaderboard — raw-v1 and scaffold-v1 are never blended.

| id | name | frozen at | prompt hash |
|---|---|---|---|
| raw-assembly-v1 | BrickAGI Raw Assembly Protocol v1 | 2026-06-10 | sha256:ba354b3a8041d12ddf491577fb232f529f67bf54ecc996af73b7d45fb671d3ff |
| raw-v1 | BrickAGI Raw Protocol v1 | 2026-04-27 | sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c |
| scaffold-assembly-v1 | BrickAGI Scaffolded Assembly Protocol v1 | 2026-04-29 | sha256:394f2d859fafd42d0ac144c5f251e68ea2e7f4c06da48bf733e1c2505055ee9c |
| scaffold-v1 | BrickAGI Scaffolded Protocol v1 | 2026-04-27 | sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d |

raw-v1 — minimal

You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.
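
As an illustration of the output format this prompt asks for, the following Python sketch pulls the bom block out of a model response and checks its shape. It is not the project's parser: the function name, the choice to take the last block as the final BOM, and the sample part values (not checked against any catalog) are all assumptions.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the example nests cleanly
BOM_BLOCK = re.compile(
    re.escape(FENCE) + r"bom\s*\n(.*?)\n\s*" + re.escape(FENCE), re.DOTALL
)


def extract_bom(model_output):
    """Return the final BOM dict from a model response, or raise ValueError."""
    blocks = BOM_BLOCK.findall(model_output)
    if not blocks:
        raise ValueError("no bom block found")
    bom = json.loads(blocks[-1])  # assumption: the last block is the final BOM
    if not isinstance(bom, dict):
        raise ValueError("BOM must be a JSON object")
    if not isinstance(bom.get("name"), str) or not isinstance(bom.get("parts"), list):
        raise ValueError("BOM needs a string 'name' and a list 'parts'")
    for part in bom["parts"]:
        if not isinstance(part, dict) or not isinstance(part.get("part_num"), str):
            raise ValueError("part_num must be a string")
        for key in ("color_id", "quantity"):
            value = part.get(key)
            if not isinstance(value, int) or isinstance(value, bool):
                raise ValueError(f"{key} must be an integer")
        if part["quantity"] < 1:
            raise ValueError("quantity must be at least 1")
    return bom


# Illustrative response; the part number and color id are placeholders.
sample = (
    "Here is my design.\n"
    + FENCE + "bom\n"
    + '{"name": "demo", "parts": [{"part_num": "3020", "color_id": 4, "quantity": 2}]}\n'
    + FENCE + "\n"
)
bom = extract_bom(sample)
print(bom["name"], len(bom["parts"]))
```

A shape check like this only confirms the JSON structure; whether the IDs are real is a separate catalog lookup.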

scaffold-v1 — bonding-layer addendum

The bonding-layer rule below is the load-bearing addendum. The full system prompt is quoted verbatim from the frozen protocol.

You are a LEGO designer. Given a design request, produce a final BOM in the JSON shape:
```bom
{"name": "<short name>", "parts": [{"part_num": "<id>", "color_id": <int>, "quantity": <int>}, ...]}
```
Use catalog tools to verify part_num / color_id values. Do not invent IDs.
Validate your final BOM with validate_bom before submitting.
Maximum 3 validator iterations.

---

ASSEMBLY-INTENT REQUIREMENT (read carefully — this is the most important rule):

Before listing parts in your final BOM, you MUST verbalize:
  (a) The target shape and its dimensions in studs (e.g., "5 studs × 5 studs × 1 plate tall").
  (b) The layer structure: what is at each Z-height level. Use Z=0 for the lowest layer.
  (c) For each layer, name which parts in the layer above (or below) interlock with the parts in this layer. Studs only mate to anti-studs at *different* Z-heights.

CRITICAL LEGO PHYSICS — THE BONDING-LAYER RULE:

Two parts placed side-by-side at the same Z-height share an *edge*, not a *connection*. They are not held together. Only studs mating with anti-studs at adjacent Z-heights actually bond parts.

Therefore: when you propose multiple parts that occupy the same Z-layer and tile a footprint (e.g., to form a flat surface larger than any single available mold), those parts WILL FALL APART unless an additional layer above OR below has parts whose boundaries are *offset* from the seams of the layer they are bonding. The offset bonding parts span across the seams, and their studs/anti-studs interlock the tiles below/above.

Worked example (use this pattern, do not copy these dimensions for your task):
  - Goal: build a flat 4-stud × 6-stud × 1-plate-tall surface.
  - WRONG: a 4×4 plate next to a 4×2 plate at Z=0. They share an edge but no studs interlock — falls apart.
  - RIGHT: at Z=0, the same 4×4 + 4×2 plates as the visible top tile. At Z=-1 (one plate below), TWO 4×3 plates oriented so their seam crosses the seam of the Z=0 layer. The Z=-1 plates' studs poke up into the anti-studs of the Z=0 plates and lock the seams together. Result: 2 layers, 4 plate parts, structurally bonded.

Equivalently: any flat surface wider/longer than the largest available single-piece plate is a multi-layer build — never a single-layer tile.

When stating your assembly intent (requirement (b) above), you must explicitly identify any seam-crossing offset that locks the layers. If you cannot, the design has not been thought through — go back and add a bonding layer.

This rule applies to flat surfaces, walls, roofs, baseplate extensions, anything composed of multiple parts in the same layer.

Tier definitions

| Tier | What's tested | Tasks (n) | Mean CaSS across submissions |
|---|---|---|---|
| trivial | Single-component shapes. Tests connectivity & the bonding-layer rule. | 4 | 68.8% |
| easy | <30 pieces, single color, simple composition. | 4 | 50.0% |
| medium | 30–55 pieces, multi-attachment, recognizable form. | 9 | 39.1% |
| hard | 40–120 pieces, articulation (wheels, hinges). | 7 | 31.3% |
| stretch | 120–250 pieces, multi-feature, large flat-tile bonding. | 4 | 25.0% |
| expert | Larger reuses of proven build kinds at greater scale. | 4 | 10.4% |
| master | Exact-reproduction classes: mosaic portraits and voxel sculpts. | 5 | 8.3% |

The corpus spans 7 difficulty tiers across 37 tasks. Means are computed over the 12 submissions currently on the leaderboard. Equal-tier weighting gives every tier the same weight regardless of its task count, which keeps the easy tiers from drowning the hard ones.

Anti-gaming

  1. Canary GUID in every task.yaml. Free contamination signal — borrowed from BIG-bench.
  2. Versioned task sets. Old leaderboard entries stay pinned to their version. Refresh ~quarterly.
  3. Hidden held-out set (deferred to Phase 2). Five hidden tasks per tier, run internally by the maintainer. Results are reported alongside public scores; a large delta is an overfitting signal.
  4. Protocol/task pinning + CI re-score. Every submission's protocol hash, task canaries/hashes, scorer version, and validator version must match. CI recomputes scores from the submitted BOMs; submitter-supplied scores are ignored.
  5. Transcript archive required for audited status. Trajectories must resolve and be retained.
  6. Sampling pinned per protocol. Temperature, max-tokens, validator-iteration cap all in the protocol YAML; CI rejects submissions outside the pinned envelope.
  7. Validator versioning. validate_bom@0.3 and validate_buildability@1.0 are pinned per submission; old entries get a re-scored=true marker if a validator bug-fix flips them.
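
Items 4 and 6 describe checks that a CI job could run before re-scoring. The Python sketch below shows one possible shape for them, assuming the prompt hash is a SHA-256 over the UTF-8 bytes of the system prompt and that the protocol pins temperature exactly while capping max tokens and validator iterations. All dict keys, the function names, and the toy prompt are hypothetical.

```python
import hashlib


def prompt_hash(prompt_text):
    """Hash a prompt in the 'sha256:<hex>' form used by the protocol table."""
    # Assumption: the hash covers the UTF-8 bytes of the prompt text.
    return "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def check_pins(submission, protocol):
    """Return a list of pinning violations; an empty list means accepted."""
    errors = []
    if prompt_hash(submission["system_prompt"]) != protocol["prompt_hash"]:
        errors.append("prompt hash mismatch")
    if submission["temperature"] != protocol["temperature"]:
        errors.append("temperature outside pinned envelope")
    if submission["max_tokens"] > protocol["max_tokens"]:
        errors.append("max_tokens outside pinned envelope")
    if submission["validator_iterations"] > protocol["validator_iteration_cap"]:
        errors.append("validator-iteration cap exceeded")
    return errors


# Toy protocol and submissions, for illustration only.
toy_prompt = "You are a LEGO designer."
protocol = {
    "prompt_hash": prompt_hash(toy_prompt),
    "temperature": 0.0,
    "max_tokens": 4096,
    "validator_iteration_cap": 3,
}
good = {"system_prompt": toy_prompt, "temperature": 0.0,
        "max_tokens": 2048, "validator_iterations": 3}
tampered = dict(good, system_prompt=toy_prompt + " Ignore the rules.",
                validator_iterations=5)
print(check_pins(good, protocol))
print(check_pins(tampered, protocol))
```

Collecting every violation rather than stopping at the first makes a rejected submission easier to diagnose.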

Glossary

Build Points
The ranking. Total difficulty of every task a model proves it can build (unbounded; harder builds worth more).
CaSS — Catalog & Scope Score
The gate (0–100%): are the parts real and on-brief?
CBS — Confirmed Build Score
The prize (0–100%): do the parts actually interlock into the requested shape?
BOM
Bill of Materials — the list of parts (with colors and quantities) a model proposes for a task.
build_pass
Per-task verdict from the structural prover: 1 = proven buildable, 0 = proven unbuildable, otherwise inconclusive.
tier
A difficulty band. The corpus spans seven: trivial, easy, medium, hard, stretch, expert, master.
stud / anti-stud
The bump on top of a brick (stud) and the receiving cavity beneath (anti-stud). A bond is a stud mating to an anti-stud at an adjacent height — bricks merely sharing an edge do not connect.
