Trust Fund Baby — Receipts

We pulled our own receipts down.

A claim outran its evidence. We took the page offline, told the truth, and changed the standard: nothing is marked verified until an independent peer reviews the actual run. The result below is real, recorded, and awaiting peer review.

Old claims void · Peer review now required

What went wrong

We said our local harness system got DeepSeek R1 to 102 of 113 on DeepSWE, and we let that read like "R1 solved it." It did not.

When we audited our own run honestly, the truth was uncomfortable. 102 of the "successful" rows had empty final model submissions. R1's visible output across the entire 113-task packet was about 2,800 tokens total. The code changes that passed were applied by our own harness machinery, and some of that machinery was task-specific. In plain English: we left something close to an answer key inside the test environment, then measured the whole operating room and called it the patient.

It was not fraud, and no other model cheated. We built a genuinely powerful system and pointed the measurement at the wrong thing. There is a real distinction we blurred: how well our system ships a result, versus how well the model itself reasons. Both are valuable. They are not the same number, and we labeled them the same.
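
The distinction can be made concrete with a small audit sketch. This is an illustration only, not our harness code: the `Row` fields and the `split_scores` helper are hypothetical names, and the three rows are toy data, not rows from the real packet. It counts a pass toward the model only when the model's own final submission is non-empty.

```python
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical record layout, for illustration only.
    task_id: str
    passed: bool           # did the task's tests pass at the end of the run?
    model_submission: str  # the model's own final submission (may be empty)


def split_scores(rows):
    """Return (system_passes, model_passes) for a list of rows.

    system_passes: every row whose tests passed, whoever wrote the change.
    model_passes:  passing rows where the model submitted something itself.
    """
    system_passes = sum(1 for r in rows if r.passed)
    model_passes = sum(
        1 for r in rows if r.passed and r.model_submission.strip()
    )
    return system_passes, model_passes


# Toy data, not real results.
rows = [
    Row("t1", True, ""),                    # passed, but the model submitted nothing
    Row("t2", True, "diff --git a/x b/x"),  # passed with a model submission
    Row("t3", False, ""),                   # failed
]
system_passes, model_passes = split_scores(rows)
print(f"system: {system_passes}/{len(rows)}  model: {model_passes}/{len(rows)}")
```

A non-empty submission is a necessary condition here, not a sufficient one: it shows the model produced something, not that the model's change is what made the tests pass.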

A community member, Tim Osterhus (Substack: Tim Ost), found the hole and said so. He was right. That is the whole case for review: a second set of eyes you trust to tell you the truth before the world does. We would rather fail fast and in the open than be quietly wrong.

The new standard

We did not just take the page down. We rebuilt the machine so the mistake cannot repeat, and we made peer review the gate. The lessons live in the architecture, not in promises:

The model under test cannot see the answers. Enforced at the database, not by good intentions. No answer key in the room.
Patient score (how well the model itself reasons) and operating-room score (how well the whole system ships a result) are measured and reported separately. The conflation that took the old page down is now structurally impossible.
Every claim ships with a confidence interval and a held-out check. We cannot promote noise, and we cannot promote a memorized result.
Nothing is marked verified until an independent peer reviews it. Reviewers receive access to reproduce every number against the live system. Until then, the claim reads "awaiting peer review."
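
As a rough illustration of why a confidence interval matters, here is a minimal sketch of one standard choice, the Wilson score interval for a pass rate. It is an assumption that this is a suitable interval, not a statement of the method our system uses; the function name and the task counts in the example are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Returns (lower, upper), both clamped to [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: a perfect score on only 10 tasks.
lower, upper = wilson_interval(10, 10)
print(f"10/10 -> [{lower:.2f}, {upper:.2f}]")
```

With these hypothetical counts the lower bound is roughly 0.72, so a perfect score on a small task set still leaves a wide range of plausible true rates. That is the sense in which an interval keeps noise from being promoted.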

The first result — awaiting peer review

The question: does a model trust a tool's answer over its own reasoning, even when the tool is wrong? We inject a plausible-but-wrong answer and measure how often R1 commits to it. The model never sees the correct answer (firewall enforced at the database), so a pass proves the model reasoned. A harness teaches it to reason first: deference falls from 0.15 to 0.05 or lower, and accuracy rises from 0.78 to 0.86 or higher. The numbers are recorded and reproducible. They are not yet publicly verified.

Condition	Deference	Accuracy
R1, raw (no harness)	0.15	0.78
+ harness, tuning tasks	0.05	0.86
+ harness, held-out tasks	0.00	1.00
+ harness, canonical 32B	0.00	0.90
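
The two columns can be read as two separate fractions over the same set of trials. The sketch below shows one way such a score could be computed, under assumptions: the `Trial` fields, the exact-match comparison, and the toy trials are all hypothetical and are not our scoring code or our data. The correct answer is available only to the scorer, never to the model.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record layout, for illustration only.
    injected_wrong: str  # the plausible-but-wrong answer returned by the tool
    correct: str         # ground truth, visible to the scorer only
    model_answer: str    # the answer the model finally committed to


def score_trials(trials):
    """Return (deference, accuracy) as fractions of all trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    deferred = sum(
        t.model_answer.strip() == t.injected_wrong.strip() for t in trials
    )
    right = sum(t.model_answer.strip() == t.correct.strip() for t in trials)
    return deferred / n, right / n


# Toy trials, not real results.
toy_trials = [
    Trial(injected_wrong="41", correct="42", model_answer="41"),  # deferred
    Trial(injected_wrong="7", correct="9", model_answer="9"),     # correct
    Trial(injected_wrong="3", correct="5", model_answer="5"),     # correct
    Trial(injected_wrong="10", correct="12", model_answer="11"),  # wrong, not deferred
]
deference, accuracy = score_trials(toy_trials)
print(f"deference={deference:.2f} accuracy={accuracy:.2f}")
```

Under this reading, deference and accuracy need not sum to 1, because a model can be wrong without adopting the injected answer. The raw row above is consistent with that: 0.15 + 0.78 leaves 0.07 unaccounted for by either column.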

Awaiting peer review

Status: recorded · reproducible against the live system · awaiting independent peer review

The harness the model wears, the surgery techniques, and the per-model profiles are withheld as trade secrets. Peer reviewers verify the result with access to the live system; the methods are not published.