Trust Fund Baby — The Harness
Verified by review. Not by trust.
We build harnesses that change how a model behaves — less deference to a wrong tool, steadier reasoning, fewer silent failures. We learned the hard way that a strong result is worth nothing until someone independent can reproduce it. So we made that the rule.
Why review is the standard
Our first big number was wrong — not faked, but pointed at the wrong thing, and stated louder than the evidence underneath it. A community reviewer caught it before we did. He was right, and we were grateful, and we changed how we operate because of it. The full account is on the receipts page.
The lesson was simple: a second set of trusted eyes, before the world's, is not a courtesy — it is the control. We would rather fail fast and in the open than be quietly wrong in private. Review is how we make sure the next claim is earned.
What a reviewer verifies
An independent reviewer receives scoped, time-boxed access — under agreement — to reproduce a claim against the live system. They confirm the result is real and that it was measured honestly:
- The model under test could not see the answers. They check the separation themselves — it is enforced at the database, not asserted in a slide.
- The model's own contribution is measured separately from the surrounding machinery's. Mixing the two is the exact mistake that took our first page down. They are now structurally separated, and a reviewer can confirm it.
- The numbers carry confidence intervals and a held-out check. A reviewer re-runs them and sees the same result, on tasks the harness never tuned against.
- The result reproduces. Same commands, same database, same model — a reviewer gets the same numbers, or the claim does not stand.
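For readers who want a concrete picture of the last two checks, here is a minimal sketch, not our actual pipeline. It assumes per-task pass/fail scores on a held-out set and uses a seeded percentile bootstrap for the interval. The names `bootstrap_ci`, `reproduces`, and the sample data are hypothetical, chosen only for illustration.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-task scores."""
    rng = random.Random(seed)  # fixed seed, so a re-run yields identical bounds
    n = len(scores)
    # Resample the tasks with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), low, high

def reproduces(claimed, rerun, tol=1e-9):
    """True when a re-run matches the claimed (point, low, high) triple."""
    return all(abs(a - b) <= tol for a, b in zip(claimed, rerun))

# Hypothetical held-out results: 1 = task passed, 0 = task failed.
held_out = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

claimed = bootstrap_ci(held_out)  # what the claim reports
rerun = bootstrap_ci(held_out)    # what a reviewer computes independently
print(reproduces(claimed, rerun))
```

The fixed seed is the design choice that matters here: with the same data and the same procedure, a reviewer should land on the same point estimate and the same bounds, so any mismatch is a finding rather than noise.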
What stays protected
The harness itself — the instructions a model wears, the surgery techniques, the per-model profiles — is our trade secret and stays that way. A reviewer verifies that the result is true. They do not receive the recipe that produced it, and we do not publish it. Honest evidence and protected methods are not in tension; review is exactly how you get both.
The status every claim carries
Every published claim wears its review status, in the open. A blurred receipt is not us hiding the number — it is the number waiting for the gate it has not cleared yet.
What is not yet live
The same standard we ask of others, applied to ourselves. The measurement layers of this system run today and carry receipts. The online policy layer, the part that picks configurations for live production traffic, is built and gated but has not yet served a real request. It runs in shadow, logging what it would have chosen, until it earns promotion through its own staged protocol. When a layer is dormant, this page says so before any claim does.
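Running in shadow can be pictured with a small sketch. This is an illustration under assumptions, not the policy layer itself: `ShadowRouter`, `toy_policy`, and the configuration names are hypothetical. The candidate policy is consulted on every request and its choice is logged, but the default configuration is what gets served until the router is promoted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class ShadowRouter:
    """Serves the default config while logging what a candidate policy would pick."""
    default_config: str
    policy: Callable[[str], str]
    promoted: bool = False  # flipped only after the policy clears its gates
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def choose(self, request: str) -> str:
        proposed = self.policy(request)
        # In shadow mode the proposal is recorded but never served.
        served = proposed if self.promoted else self.default_config
        self.log.append((request, proposed, served))
        return served

def toy_policy(request: str) -> str:
    # Hypothetical rule: route long requests to a different configuration.
    return "config_b" if len(request) > 20 else "config_a"

router = ShadowRouter(default_config="config_a", policy=toy_policy)
served = router.choose("a fairly long request that the policy would reroute")
print(served, router.log[0][1])
```

The log of proposed versus served choices is what a staged promotion would be judged on: the policy's decisions can be inspected in full before any of them touches a real request.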
Become a reviewer
We are opening a small cohort of independent reviewers — people with the background to verify model-evaluation claims and the spine to say when a number does not hold. If that is you, reach out through projecttfb.com. The first verified receipt is waiting on the right reviewer.