# Submit your model
BrickAGI is PR-driven and scored by CI: submitter-supplied scores are not honored, and CI recomputes everything from your submission JSON before merge.
LLM agents: see AGENT-SKILL.md for a self-contained skill file covering output format, tool signatures, protocols, and the bonding-layer rule.
## 0. Read first
- Methodology: what CaSS measures and why we publish it as the v1 headline.
- Schema: SCHEMA.md.
- Step-by-step: SUBMITTING.md.
## 1. Pick a protocol
A protocol is a frozen artifact specifying the system prompt, required tools, validator-iteration cap, and output format. Three protocols ship in v1.1, and they are never blended on the leaderboard: `raw-v1`, `scaffold-v1`, and `scaffold-assembly-v1` are separate rows.
| id | frozen | prompt hash |
|---|---|---|
| `raw-v1` | 2026-04-27 | `sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c` |
| `scaffold-assembly-v1` | 2026-04-29 | `sha256:394f2d859fafd42d0ac144c5f251e68ea2e7f4c06da48bf733e1c2505055ee9c` |
| `scaffold-v1` | 2026-04-27 | `sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d` |
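You can recompute a prompt hash locally in the same `sha256:` format the table uses. A minimal sketch, assuming a local prompt file; the stand-in path below is hypothetical, so point `sha256sum` at the repo's actual frozen prompt file:

```shell
# Write a stand-in prompt file (replace with the repo's real frozen prompt).
printf 'frozen system prompt text' > /tmp/prompt.txt

# Hash it and prefix with "sha256:" to match the table's format.
sha256sum /tmp/prompt.txt | awk '{print "sha256:" $1}'
```

If the printed value does not match the table row for your chosen protocol, you are not running against the frozen prompt.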
## 2. Run the benchmark
The reference runner is `brickagi` (Node, since the validators are Node):

```shell
# install
git clone https://github.com/withtally/brickagi.git
cd brickagi/brickagi && npm install

# run all 20 tasks under a frozen protocol
brickagi run --model your-model-name \
  --protocol scaffold-v1 \
  --max-cost 5 \
  --out submissions/community/yourhandle-model-scaffold-v1.json

# optional re-score if you edited/generated JSON outside the runner
brickagi score submissions/community/yourhandle-model-scaffold-v1.json

# local schema/provenance check
brickagi validate-submission submissions/community/yourhandle-model-scaffold-v1.json
```
Submitters can produce a compatible submission JSON outside `brickagi run`, but the JSON must include a matching `protocol_hash`, current task/scorer/validator provenance, and per-task results that reproduce under `brickagi score`. CI rejects mismatches.
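As a rough sketch of what such a hand-built file might contain: only `protocol_hash` is named by this document, and every other field name below is a placeholder, so consult SCHEMA.md for the authoritative structure. The hash shown is the frozen `scaffold-v1` value from the table above.

```json
{
  "model": "your-model-name",
  "protocol": "scaffold-v1",
  "protocol_hash": "sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d",
  "provenance": { "tasks": "…", "scorer": "…", "validators": "…" },
  "results": [
    { "task": "…", "score": 0.0 }
  ]
}
```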
## 3. Open a PR
Open a PR with your JSON under `brickagi/submissions/community/`. The repo's CI (`.github/workflows/verify-submission.yml`) runs `brickagi validate-submission`, recomputes scores, scans for canary contamination outside allowed provenance fields, and rejects mismatches.
## 4. Badges
- `audited`: trajectories link resolves, protocol/task/scorer provenance matches, and the maintainer reran a sample against the same model + protocol with matching numbers.
- `ranked`: schema-valid submission with a matching protocol hash and reproducible scores; on the leaderboard but not yet audited.
## Anti-gaming, in brief
- Each `task.yaml` embeds a 16-hex canary GUID; if your model's outputs correlate with the canary, you have a contamination signal worth investigating.
- Sampling parameters (temperature, max tokens, validator iteration cap) are pinned per protocol, and CI rejects out-of-envelope runs.
- Hand-edited scores are caught the moment CI re-pipes your submission through `brickagi score`.
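A quick local version of the canary check can be sketched with `grep`. Everything here is simulated: the file layout, the `canary_guid` field name, and the canary value are stand-ins, not the benchmark's actual schema, so this only illustrates the idea of scanning outputs for a 16-hex token lifted from a task file:

```shell
# Simulate a task file with a 16-hex canary and a model output that leaks it.
mkdir -p /tmp/canary-demo
printf 'canary_guid: deadbeefcafef00d\n' > /tmp/canary-demo/task.yaml
printf 'output text containing deadbeefcafef00d\n' > /tmp/canary-demo/out.txt

# Extract the first 16-hex token from the task file, then scan outputs for it.
canary=$(grep -oE '[0-9a-f]{16}' /tmp/canary-demo/task.yaml | head -1)
grep -rq "$canary" /tmp/canary-demo/out.txt && echo "possible contamination: $canary"
```

A hit here is a signal to investigate, not proof of contamination; CI performs its own scan outside the allowed provenance fields.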