STAR50 Benchmark Methodology

How we measure token savings and what the numbers mean.

What is STAR50?

STAR50 is a set of 50 representative Claude Code sessions drawn from our own development of kwj.ai: the fleet-evolve campaign, the value-research scanner, and web3 audit work. These are real sessions, not synthetic benchmarks.

All sessions run with IS_SANDBOX=1. The baseline is Claude Code without KWJ MCP tools active. "With KWJ" is the same session type with custom-context, custom-digest, custom-cache, and custom-recall active.

We publish the raw session token counts in our GitHub repo under /benchmark/star50/.

Results by Task Type

Task Type          Baseline Tokens   With KWJ Tokens   Savings %
Code audit               2,620,000           254,000       90.3%
Daily research             180,000            22,000       87.8%
Iterative builds           450,000            81,000       82.0%
File navigation            120,000             8,000       93.3%
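
As a quick arithmetic check, the savings column can be recomputed from the two token columns as (baseline - with KWJ) / baseline. The sketch below does this for the four rows above. The names `ROWS` and `savings_pct` are illustrative only and are not part of any KWJ tooling.

```python
# Recompute the Savings % column from the published token counts.
# ROWS and savings_pct are illustrative names, not part of KWJ tooling.

ROWS = {
    "Code audit": (2_620_000, 254_000),
    "Daily research": (180_000, 22_000),
    "Iterative builds": (450_000, 81_000),
    "File navigation": (120_000, 8_000),
}


def savings_pct(baseline: int, with_kwj: int) -> float:
    """Percentage reduction in tokens relative to the baseline, to one decimal."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round(100.0 * (baseline - with_kwj) / baseline, 1)


if __name__ == "__main__":
    for task, (baseline, with_kwj) in ROWS.items():
        print(f"{task}: {savings_pct(baseline, with_kwj)}%")
```

Each recomputed value agrees with the published percentage to one decimal place.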

How we measure

Token counts come from custom-meter, which taps the Claude API response stream and counts input + output tokens per session. The meter runs identically in baseline and KWJ-active conditions.
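
To illustrate the counting rule, here is a minimal sketch of a per-session tally that sums input and output tokens. It is not the custom-meter implementation. It assumes each API response has already been reduced to a record with a session ID plus `input_tokens` and `output_tokens` counts (field names chosen to mirror the usage block in Claude API responses). `tally_sessions` and `session_id` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def tally_sessions(records: Iterable[Mapping]) -> Dict[str, int]:
    """Sum input + output tokens per session ID.

    Each record is assumed (hypothetically) to look like:
    {"session_id": "...", "input_tokens": int, "output_tokens": int}
    """
    totals: Dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["session_id"]] += rec["input_tokens"] + rec["output_tokens"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        {"session_id": "a", "input_tokens": 1200, "output_tokens": 300},
        {"session_id": "a", "input_tokens": 800, "output_tokens": 200},
        {"session_id": "b", "input_tokens": 500, "output_tokens": 100},
    ]
    print(tally_sessions(sample))
```

Applying the same tally to baseline and KWJ-active sessions keeps the two conditions comparable.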

Each task type represents 10-15 sessions (STAR50 = 50 total). Savings % is the mean reduction across sessions of that type, not the best case.
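
One reading of "mean reduction across sessions" is to compute the reduction for each baseline/KWJ pair and then average those percentages. The sketch below shows that calculation next to the pooled alternative (reduction on summed totals), since the two can differ. The session numbers are made up for illustration, and pairing baseline with KWJ sessions is an assumption: this document does not say how sessions are matched.

```python
from statistics import mean
from typing import List, Tuple

# Each pair is (baseline_tokens, with_kwj_tokens) for one session.
Pair = Tuple[int, int]


def mean_reduction_pct(pairs: List[Pair]) -> float:
    """Average of the per-session percentage reductions."""
    if not pairs:
        raise ValueError("need at least one session pair")
    return mean(100.0 * (b - k) / b for b, k in pairs)


def pooled_reduction_pct(pairs: List[Pair]) -> float:
    """Percentage reduction computed on tokens summed over all sessions."""
    if not pairs:
        raise ValueError("need at least one session pair")
    total_baseline = sum(b for b, _ in pairs)
    total_kwj = sum(k for _, k in pairs)
    return 100.0 * (total_baseline - total_kwj) / total_baseline


if __name__ == "__main__":
    # Hypothetical sessions: one large, one small.
    pairs = [(100_000, 10_000), (20_000, 10_000)]
    print(mean_reduction_pct(pairs))    # per-session mean
    print(pooled_reduction_pct(pairs))  # weighted toward the large session
```

The per-session mean weights every session equally, while the pooled figure is dominated by the largest sessions.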

Baseline sessions used Claude Code defaults: full file reads via the Read tool, no caching, no output compression. KWJ sessions used the same Claude Code session with the MCP bridge active; no prompt changes were made.

Caveats

Individual results vary by workflow. Token savings are highest for tasks with repeated file reads and research queries. Sessions that are already minimal (short conversations, small files) will see lower relative savings. The numbers above represent our workload profile and may not match yours exactly.