How we measure token savings and what the numbers mean.
STAR50 is a set of 50 representative Claude Code sessions drawn from our own development of kwj.ai: the fleet-evolve campaign, the value-research scanner, and web3 audit work. These are real sessions, not synthetic benchmarks.
All sessions run with IS_SANDBOX=1. Baseline = Claude Code without KWJ MCP tools active. "With KWJ" = the same session type with custom-context, custom-digest, custom-cache, and custom-recall active.
We publish the raw session token counts in our GitHub repo under /benchmark/star50/.
| Task Type | Baseline Tokens | With KWJ Tokens | Savings % |
|---|---|---|---|
| Code audit | 2,620,000 | 254,000 | 90.3% |
| Daily research | 180,000 | 22,000 | 87.8% |
| Iterative builds | 450,000 | 81,000 | 82.0% |
| File navigation | 120,000 | 8,000 | 93.3% |
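
The Savings % column follows from the two token columns as 1 − (tokens with KWJ / baseline tokens). A minimal Python sketch that recomputes it from the rows above; the function name `savings_pct` is ours for illustration and is not part of any KWJ tool:

```python
# Rows copied from the table above: (task type, baseline tokens, tokens with KWJ).
ROWS = [
    ("Code audit", 2_620_000, 254_000),
    ("Daily research", 180_000, 22_000),
    ("Iterative builds", 450_000, 81_000),
    ("File navigation", 120_000, 8_000),
]


def savings_pct(baseline: int, with_kwj: int) -> float:
    """Percentage reduction in tokens relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (1 - with_kwj / baseline) * 100


for task, baseline, with_kwj in ROWS:
    print(f"{task}: {savings_pct(baseline, with_kwj):.1f}%")
```

Running this prints 90.3%, 87.8%, 82.0%, and 93.3%, matching the table.
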
Token counts come from custom-meter, which taps the Claude API response stream and counts input + output tokens per session. The meter runs identically in baseline and KWJ-active conditions.
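
The counting rule described above (input plus output tokens, summed per session) can be sketched as follows. This is an illustration under assumptions, not the actual custom-meter code: the `SessionMeter` class and the session IDs are hypothetical, and the `input_tokens` / `output_tokens` field names assume the usage object that Claude API responses carry.

```python
class SessionMeter:
    """Hypothetical per-session token counter (not the real custom-meter)."""

    def __init__(self) -> None:
        self._totals: dict[str, int] = {}

    def record(self, session_id: str, usage: dict) -> None:
        # Assumption: each API response exposes a usage mapping with
        # input_tokens and output_tokens counts; missing keys count as 0.
        tokens = usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
        self._totals[session_id] = self._totals.get(session_id, 0) + tokens

    def total(self, session_id: str) -> int:
        # Sessions that were never recorded report 0 tokens.
        return self._totals.get(session_id, 0)


meter = SessionMeter()
meter.record("session-a", {"input_tokens": 1200, "output_tokens": 300})
meter.record("session-a", {"input_tokens": 1500, "output_tokens": 250})
print(meter.total("session-a"))  # 3250
```

Because the same counter is applied in both conditions, any systematic bias in the count affects baseline and KWJ sessions alike.
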
Each of the four task types covers 10 to 15 sessions, for 50 in total. Savings % is the mean reduction across sessions of that type, not the best case.
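
When reading a per-type mean, it helps to know that averaging per-session reductions weights every session equally, while a reduction computed on summed tokens weights long sessions more heavily, so the two can differ slightly. The sketch below shows both calculations on made-up session pairs; the values are illustrative only and are not STAR50 data, and the function names are ours.

```python
from statistics import mean

# Illustrative (baseline, with_kwj) token pairs for one task type.
# These are made-up values, not STAR50 measurements.
sessions = [(100_000, 10_000), (200_000, 40_000), (50_000, 10_000)]


def mean_of_reductions(pairs):
    """Average of per-session percentage reductions (each session weighs the same)."""
    return mean((1 - kwj / base) * 100 for base, kwj in pairs)


def reduction_of_totals(pairs):
    """Percentage reduction on summed tokens (long sessions weigh more)."""
    base_total = sum(base for base, _ in pairs)
    kwj_total = sum(kwj for _, kwj in pairs)
    return (1 - kwj_total / base_total) * 100


print(f"mean of per-session reductions: {mean_of_reductions(sessions):.1f}%")
print(f"reduction of summed tokens:     {reduction_of_totals(sessions):.1f}%")
```
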
Baseline sessions used Claude Code defaults: full file reads via the Read tool, no caching, and no output compression. KWJ sessions used the same Claude Code setup with the MCP bridge active; no prompt changes were made.
Individual results vary by workflow. Token savings are highest for tasks with repeated file reads and research queries. Sessions that are already minimal (short conversations, small files) will see lower relative savings. The numbers above represent our workload profile and may not match yours exactly.