# Submit your model
BrickAGI is PR-driven and scored by CI: submitter-supplied scores are not honored, and CI recomputes everything from your submission JSON before merge.
LLM agents: see AGENT-SKILL.md for a self-contained skill file covering output format, tool signatures, protocols, and the bonding-layer rule.
## 0. Read first
- Methodology: what CaSS measures and why we publish it as the v1 headline.
- Schema: SCHEMA.md.
- Step-by-step: SUBMITTING.md.
## 1. Pick a protocol
A protocol is a frozen artifact specifying the system prompt, required tools, validator-iteration cap, and output format. Three protocols ship in v1.1, and they are never blended on the leaderboard: `raw-v1`, `scaffold-v1`, and `scaffold-assembly-v1` are separate rows.
| id | frozen | prompt hash |
|---|---|---|
| `raw-v1` | 2026-04-27 | `sha256:ec6641f2d101a9de13bfca90229cdb93b49f1bd9369f3f1e7587f41b33c4c69c` |
| `scaffold-assembly-v1` | 2026-04-29 | `sha256:394f2d859fafd42d0ac144c5f251e68ea2e7f4c06da48bf733e1c2505055ee9c` |
| `scaffold-v1` | 2026-04-27 | `sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d` |
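You can recompute a prompt hash locally in the same `sha256:` format the table uses. A minimal sketch, assuming a local prompt file; the stand-in path below is hypothetical, so point `sha256sum` at the repo's actual frozen prompt file:

```shell
# Write a stand-in prompt file (replace with the repo's real frozen prompt).
printf 'frozen system prompt text' > /tmp/prompt.txt

# Hash it and prefix with "sha256:" to match the table's format.
sha256sum /tmp/prompt.txt | awk '{print "sha256:" $1}'
```

If the printed value does not match the table row for your chosen protocol, you are not running against the frozen prompt.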
## 2. Run the benchmark
The reference runner is `brickagi` (Node, since the validators are Node):

```shell
# install
git clone https://github.com/withtally/brickagi.git
cd brickagi/brickagi && npm install

# run all 20 tasks under a frozen protocol
brickagi run --model your-model-name \
  --protocol scaffold-v1 \
  --max-cost 5 \
  --out submissions/community/yourhandle-model-scaffold-v1.json

# optional re-score if you edited/generated JSON outside the runner
brickagi score submissions/community/yourhandle-model-scaffold-v1.json

# local schema/provenance check
brickagi validate-submission submissions/community/yourhandle-model-scaffold-v1.json
```
Submitters can produce a compatible submission JSON outside `brickagi run`, but the JSON must include a matching `protocol_hash`, current task/scorer/validator provenance, and per-task results that reproduce under `brickagi score`. CI rejects mismatches.
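As a rough sketch of what such a hand-built file might contain: only `protocol_hash` is named by this document, and every other field name below is a placeholder, so consult SCHEMA.md for the authoritative structure. The hash shown is the frozen `scaffold-v1` value from the table above.

```json
{
  "model": "your-model-name",
  "protocol": "scaffold-v1",
  "protocol_hash": "sha256:83fca868afffd414061a997e14f4688a0c96a4e750c2c91de23e7c5be11a2b7d",
  "provenance": { "tasks": "…", "scorer": "…", "validators": "…" },
  "results": [
    { "task": "…", "score": 0.0 }
  ]
}
```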
## 3. Open a PR
Open a PR with your JSON under `brickagi/submissions/community/`. The repo's CI (`.github/workflows/verify-submission.yml`) runs `brickagi validate-submission`, recomputes scores, scans for canary contamination outside allowed provenance fields, and rejects mismatches.
## 4. Badges
- `audited`: trajectories link resolves, protocol/task/scorer provenance matches, and the maintainer reran a sample against the same model + protocol with matching numbers.
- `ranked`: schema-valid submission with a matching protocol hash and reproducible scores; on the leaderboard but not yet audited.
## Anti-gaming, in brief
- Each `task.yaml` embeds a 16-hex canary GUID; if your model's outputs correlate with the canary, you have a contamination signal worth investigating.
- Sampling parameters (temperature, max tokens, validator iteration cap) are pinned per protocol, and CI rejects out-of-envelope runs.
- Hand-edited scores are caught the moment CI re-pipes your submission through `brickagi score`.
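A quick local version of the canary check can be sketched with `grep`. Everything here is simulated: the file layout, the `canary_guid` field name, and the canary value are stand-ins, not the benchmark's actual schema, so this only illustrates the idea of scanning outputs for a 16-hex token lifted from a task file:

```shell
# Simulate a task file with a 16-hex canary and a model output that leaks it.
mkdir -p /tmp/canary-demo
printf 'canary_guid: deadbeefcafef00d\n' > /tmp/canary-demo/task.yaml
printf 'output text containing deadbeefcafef00d\n' > /tmp/canary-demo/out.txt

# Extract the first 16-hex token from the task file, then scan outputs for it.
canary=$(grep -oE '[0-9a-f]{16}' /tmp/canary-demo/task.yaml | head -1)
grep -rq "$canary" /tmp/canary-demo/out.txt && echo "possible contamination: $canary"
```

A hit here is a signal to investigate, not proof of contamination; CI performs its own scan outside the allowed provenance fields.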