STAR50 Benchmark

What is STAR50?

STAR50 is 50 representative Claude Code sessions drawn from our own development of kwj.ai — the fleet-evolve campaign, the value-research scanner, and web3 audit work. These are real sessions, not synthetic benchmarks.

All sessions run with IS_SANDBOX=1. "Baseline" is Claude Code with no KWJ MCP tools active. "With KWJ" is the same session type with custom-context, custom-digest, custom-cache, and custom-recall active.

We publish the raw session token counts in our GitHub repo under /benchmark/star50/.

Results by Task Type

Task Type	Baseline Tokens	With KWJ	Savings %
Code audit	2,620,000	254,000	90.3%
Daily research	180,000	22,000	87.8%
Iterative builds	450,000	81,000	82.0%
File navigation	120,000	8,000	93.3%
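
The Savings % column can be re-derived from the two token columns. The following is a minimal Python sketch that assumes savings are computed as (baseline - with KWJ) / baseline; the function name `savings_pct` and the `RESULTS` dictionary are illustrative and are not part of any KWJ tool.

```python
# Token counts copied from the "Results by Task Type" table.
RESULTS = {
    "Code audit": (2_620_000, 254_000),
    "Daily research": (180_000, 22_000),
    "Iterative builds": (450_000, 81_000),
    "File navigation": (120_000, 8_000),
}


def savings_pct(baseline: int, with_kwj: int) -> float:
    """Percentage reduction in tokens relative to the baseline count."""
    return (baseline - with_kwj) / baseline * 100


if __name__ == "__main__":
    for task, (baseline, with_kwj) in RESULTS.items():
        print(f"{task}: {savings_pct(baseline, with_kwj):.1f}%")
```

Rounded to one decimal place, each result matches the corresponding Savings % entry in the table.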

How we measure

Token counts come from custom-meter, which taps the Claude API response stream and counts input + output tokens per session. The meter runs identically in baseline and KWJ-active conditions.
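
The per-session accounting described above can be sketched as follows. This is not custom-meter's actual code: the `Usage` record, the `SessionMeter` class, and the field names `input_tokens` and `output_tokens` are assumptions made for illustration, and the token values in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical usage record for one API response (assumed schema)."""
    input_tokens: int
    output_tokens: int


class SessionMeter:
    """Accumulates input + output tokens for a single session."""

    def __init__(self) -> None:
        self.total_tokens = 0

    def record(self, usage: Usage) -> None:
        # Count both directions, as described for custom-meter.
        self.total_tokens += usage.input_tokens + usage.output_tokens


if __name__ == "__main__":
    meter = SessionMeter()
    meter.record(Usage(input_tokens=1_200, output_tokens=300))  # made-up values
    meter.record(Usage(input_tokens=1_500, output_tokens=250))  # made-up values
    print(meter.total_tokens)
```

Because the same accumulation runs in both conditions, any difference in the session totals comes from the session itself rather than from the counting method.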

Each task type represents 10-15 sessions (STAR50 = 50 total). Savings % is the mean reduction across sessions of that type, not the best case.
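
One way to read "mean reduction across sessions" is as the average of per-session reductions. The sketch below assumes each baseline session can be paired with a KWJ session of the same type, which the text above does not state. The function names are hypothetical and the session figures are made-up illustrations, not STAR50 data.

```python
from statistics import mean


def reduction_pct(baseline: int, with_kwj: int) -> float:
    """Percentage token reduction for one (baseline, with KWJ) pair."""
    return (baseline - with_kwj) / baseline * 100


def mean_reduction_pct(pairs: list[tuple[int, int]]) -> float:
    """Average of the per-session reductions, so one outlier cannot stand in for the group."""
    return mean(reduction_pct(b, k) for b, k in pairs)


if __name__ == "__main__":
    # Made-up (baseline, with KWJ) token counts for three sessions of one type.
    sessions = [(100_000, 10_000), (200_000, 40_000), (50_000, 10_000)]
    print(f"mean reduction: {mean_reduction_pct(sessions):.1f}%")
    print(f"best case:      {max(reduction_pct(b, k) for b, k in sessions):.1f}%")
```

Reporting the mean rather than the maximum keeps a single unusually favorable session from defining the headline number.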

Baseline sessions used Claude Code defaults: full file reads via the Read tool, no caching, no output compression. KWJ sessions used the same Claude Code setup with the MCP bridge active; no prompt changes were made.

Caveats

Individual results vary by workflow. Token savings are highest for tasks with repeated file reads and research queries. Sessions that are already minimal (short conversations, small files) will see lower relative savings. The numbers above represent our workload profile and may not match yours exactly.
