Models
One card per (model, protocol) baseline. Ranked rows are
current, complete, and task-set-pinned; diagnostic rows remain visible
for auditability.
gpt-5.5
openai · raw-assembly-v1 · CaSS 28.7%
claude-opus-4-8-coordinator
anthropic · scaffold-assembly-v1 · CaSS 100.0%
claude-opus-4-8-coordinator-1shot
anthropic · scaffold-assembly-v1 · CaSS 57.4%
claude-opus-4-8-coordinator-blind
anthropic · scaffold-assembly-v1 · CaSS 57.4%
gpt-5.5
openai · scaffold-v1 · CaSS 45.0%
gemini-3-pro-preview
google · raw-v1 · CaSS 39.6%
gpt-5.5
openai · scaffold-assembly-v1 · CaSS 39.3%
claude-opus-4-7
anthropic · scaffold-v1 · CaSS 37.5%
gemini-3-pro-preview
google · scaffold-v1 · CaSS 36.1%
gpt-5.5
openai · raw-v1 · CaSS 35.0%
gpt-5.4-mini
openai · scaffold-assembly-v1 · CaSS 10.7%
claude-opus-4-7
anthropic · raw-v1 · CaSS 0.0%