Fleet Notes

Technical posts from building the KWJ token-optimization fleet. Real numbers, honest caveats, production code.

How custom-context saves 97% of file-read tokens

The problem: agents read whole files to find one function

When an LLM agent needs to understand a single function in a large codebase, the naive path is to read the entire file. For main.rs in kwj-infra that means piping 7,630 lines (roughly 333 KB of Rust) through the context window every time the agent needs to look up one handler. At ~4 chars/token that comes to ~83,000 tokens per read.

Multiply that across 10–20 file reads in a typical session and you have blown a significant fraction of a Claude Pro context budget before writing a single line of code.
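As a rough check of the arithmetic above, the sketch below estimates tokens from file size. The 4 chars/token ratio is a rule of thumb rather than a tokenizer measurement, 333 KB is taken as 333,000 bytes, and the function name is hypothetical.

```python
def estimate_tokens(size_bytes: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for a mostly-ASCII source file (1 byte per char)."""
    return round(size_bytes / chars_per_token)

main_rs_tokens = estimate_tokens(333_000)  # main.rs, 333 KB
print(main_rs_tokens)                      # 83250, i.e. ~83,000 per full read

# 10 to 20 full reads of a file this size in one session
for reads in (10, 20):
    print(reads, "reads:", reads * main_rs_tokens, "tokens")
```

At that rate, 10 to 20 full reads land somewhere between roughly 0.8M and 1.7M tokens.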

The solution: map then slice

custom-context exposes two verbs that work together. First, map produces a compact symbol outline of every function, struct, constant, and impl block in the file — without the bodies. For a 7,630-line Rust file the outline comes in at about 180 symbol lines, roughly 2,100 tokens. The agent scans the outline, identifies the target symbol, then calls slice to extract only that symbol's span. A typical 40-line handler is ~400 tokens.

# Step 1: get the symbol outline (~2,100 tokens for main.rs)
custom-context map /root/kwj-infra/src

# Step 2: pull only the symbol you need (~400 tokens)
custom-context slice /root/kwj-infra/src/main.rs "docs_handler"

# Via MCP (one call from inside Claude):
# repo_map  -> outline (~2,100 tokens)
# slice     -> body   (~400 tokens)
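To make the map-then-slice idea concrete, here is a toy version in Python. It is a sketch under stated assumptions, not the custom-context implementation: the regex, the function names `outline` and `slice_symbol`, and the sample source are all hypothetical, and the brace counting ignores braces inside strings and comments, which is the same class of weakness a heuristic parser has.

```python
import re

# Hypothetical toy "map" and "slice"; not the custom-context implementation.
SYMBOL_RE = re.compile(r"^\s*(?:pub\s+)?(fn|struct|const|impl)\s+(\w+)")

def outline(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, kind, name) for each line that declares a symbol."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = SYMBOL_RE.match(line)
        if m:
            found.append((lineno, m.group(1), m.group(2)))
    return found

def slice_symbol(source: str, name: str) -> str:
    """Return the named symbol's text by counting braces from its first line."""
    lines = source.splitlines()
    for start, line in enumerate(lines):
        m = SYMBOL_RE.match(line)
        if not (m and m.group(2) == name):
            continue
        depth, seen_open = 0, False
        for end in range(start, len(lines)):
            depth += lines[end].count("{") - lines[end].count("}")
            seen_open = seen_open or "{" in lines[end]
            if seen_open and depth == 0:
                return "\n".join(lines[start:end + 1])
            # Brace-less items such as `const X: u32 = 5;` end at the semicolon.
            if not seen_open and lines[end].rstrip().endswith(";"):
                return "\n".join(lines[start:end + 1])
        break
    raise KeyError(name)

SAMPLE = """\
const LIMIT: u32 = 5;

struct App {
    name: String,
}

fn docs_handler(app: &App) -> String {
    if app.name.is_empty() {
        return String::new();
    }
    format!("docs for {}", app.name)
}

fn other() {}
"""

for entry in outline(SAMPLE):
    print(entry)
print(slice_symbol(SAMPLE, "docs_handler"))
```

The outline carries only names and line numbers, so its size grows with the number of symbols rather than with the size of their bodies.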

Real numbers: main.rs

| Method | Tokens consumed | Coverage |
|---|---|---|
| Full file read | ~83,000 | 100% of file |
| map only | ~2,100 | all symbol names + line numbers |
| map + slice (one symbol) | ~2,500 | exactly the needed function |
| Saving vs full read | ~97% | |

These numbers are from kwj-infra's own main.rs at the current HEAD commit: 7,630 lines, 333 KB, 103 distinct symbols. The map step costs ~2,100 tokens regardless of which symbol you need. The slice cost is proportional to the symbol's size: small utility functions run ~150 tokens, while large HTML-embedding constants run ~8,000 tokens. For one of those constants, map + slice comes to ~10,100 tokens, a saving of about 88% rather than 97%.

When it matters most

The savings are largest when: (a) you need one of many symbols in a large file, (b) you visit the same file multiple times in a session (the outline can be reused), or (c) the file contains large inline string constants (like HTML blobs) that are irrelevant to the current task. The savings are small when the whole file is short (<200 lines) or when you genuinely need to read the entire content.
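The effect of symbol size and outline reuse can be worked through with the figures quoted for main.rs. This is plain arithmetic on those published numbers; the function name and the choice to compare against one full read per lookup are assumptions made for illustration.

```python
FULL_READ = 83_000   # tokens for one full read of main.rs
OUTLINE = 2_100      # tokens for the map outline

def saving_pct(slice_tokens: list[int], reuse_outline: bool = True) -> float:
    """Percent saved versus doing one full file read per lookup."""
    lookups = len(slice_tokens)
    outline_cost = OUTLINE if reuse_outline else OUTLINE * lookups
    with_tools = outline_cost + sum(slice_tokens)
    return round(100 * (1 - with_tools / (FULL_READ * lookups)), 1)

print(saving_pct([150]))       # small utility function: 97.3
print(saving_pct([400]))       # typical 40-line handler: 97.0
print(saving_pct([8_000]))     # large HTML-embedding constant: 87.8
print(saving_pct([400] * 5))   # five lookups sharing one outline: 99.0
```

Reusing the outline pushes the saving up, while slicing a very large symbol pulls it down.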

Honest caveat: custom-context slice uses a heuristic brace/indent parser, not a full Rust AST. It can mis-detect symbol boundaries on deeply nested closures or macros. For those cases, fall back to a targeted Read with a known line offset.

What is STAR50 and how we measure 90.3% savings

Why we needed a benchmark

Saying "KWJ cuts token costs 90%" is meaningless without a methodology. Which tasks? Measured how? Against what baseline? STAR50 is our answer: a set of 50 representative sessions drawn from real work on the kwj.ai fleet, replayed with and without KWJ tools active, and compared by total Claude tokens consumed.

How STAR50 sessions were selected

We pulled 50 sessions from three project areas where the fleet is in daily use.

Each session was replayed in IS_SANDBOX=1 mode (no real side-effects) twice: once with KWJ tools disabled (baseline) and once with all tools enabled. Token counts come from custom-meter report, which reads from the Anthropic usage API.

Results by saver

| Tool | Mechanism | Avg saving in affected sessions |
|---|---|---|
| custom-context slice | map+slice vs full file read | 97% |
| custom-cache | exact/fuzzy hit on repeat query | 100% on hit |
| custom-digest | log/build output compression | 85% |
| custom-recall | TF-IDF memory vs full MEMORY.md | 78% |
| custom-websearch | cached URL fetch (1h TTL) | 91% on hit |

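Across all 50 sessions the overall saving is 90.3%. That headline is a pooled figure (total tokens with tools enabled against total baseline tokens), so the largest sessions weigh most, which is why it can sit above the per-session median of 88.1%. The 10th percentile is 23%, and that last number is the honest one: sessions that do no file reading, no repeated URL fetches, and no cached queries see little or no saving. The fleet only pays off when your workflow actually has repetition and large inputs to shrink.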
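A pooled saving and a per-session average can differ a lot when session sizes vary, so it helps to report both. The sketch below shows the difference on three made-up sessions; the token pairs are hypothetical and are not STAR50 data.

```python
from statistics import mean, median

# Hypothetical (baseline_tokens, with_tools_tokens) pairs, NOT STAR50 data.
sessions = [(1_000_000, 50_000), (500_000, 60_000), (20_000, 16_000)]

def saving(baseline: int, with_tools: int) -> float:
    """Percent of baseline tokens avoided."""
    return 100 * (1 - with_tools / baseline)

per_session = [saving(b, w) for b, w in sessions]
pooled = saving(sum(b for b, _ in sessions), sum(w for _, w in sessions))

print([round(s, 1) for s in per_session])  # [95.0, 88.0, 20.0]
print(round(mean(per_session), 1))         # 67.7
print(round(median(per_session), 1))       # 88.0
print(round(pooled, 1))                    # 91.7
```

The small session with a 20% saving drags the per-session mean down but barely moves the pooled number, because it contributes few tokens to either total.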

The measurement code

# Log token usage for an API call:
custom-meter log --model claude-sonnet-4-5 \
  --in 4200 --out 380 --session "evolve-run-84"

# Report against a budget:
custom-meter report --budget 100000000

# Output (JSON):
# {"sessions":50,"total_in":2847291,"total_out":412088,
#  "baseline_in":29200000,"saving_pct":90.3}

Honest caveat: STAR50 sessions are not independent random samples — they are real work sessions from this fleet, which was specifically built to benefit from these tools. A generic coding assistant workload would likely see lower savings, perhaps 40–60%, unless it also has heavy file reading and repeated fetches. We publish the methodology so you can run your own measurement with custom-meter.

The fleet-evolve loop: 89 autonomous improvements in 15 days

What it is

fleet-evolve is a self-running improvement loop that fires hourly via a systemd timer. Each tick launches a headless Claude session (IS_SANDBOX=1 claude -p --dangerously-skip-permissions), drives it through a dynamic Workflow script generated from a template, and hands off the result to an independent verifier that gates merges on cargo build && cargo clippy -- -Dwarnings && cargo test. Features that pass are merged to main automatically. Features that fail are parked on a feat/evolve-* branch for human review.

The loop structure

# evolve/run.sh (simplified)
# Without pipefail, a pipeline reports only custom-digest's exit status
# and a failing cargo command would go unnoticed.
set -o pipefail

# 1. Generate a Workflow script from the template
envsubst < fleet-evolve.workflow.tmpl.js > /tmp/evolve-wf.js

# 2. Fire headless Claude session
IS_SANDBOX=1 claude -p \
  --dangerously-skip-permissions \
  --output-format stream-json \
  --max-turns 40 \
  /tmp/evolve-wf.js | tee /tmp/evolve-run.jsonl

# 3. Independent verifier (never run by the session itself)
# How GATE is set is elided in the simplified script; deriving it from the
# pipeline exit codes, as below, is an assumption.
GATE=green
cargo build 2>&1 | custom-digest format --mode build || GATE=red
cargo clippy -- -Dwarnings 2>&1 | custom-digest format --mode clippy || GATE=red
cargo test 2>&1 | custom-digest format --mode test || GATE=red

# 4. Gate: merge on green, park on red
if [ "$GATE" = "green" ]; then git merge "feat/evolve-$RUN_ID"; fi
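The gate logic can also be sketched as a small function that runs the stages in order and reports which one failed. This is an illustrative sketch, not the fleet's verifier: `run_gate` and the stand-in commands are hypothetical, and the stand-ins exist only so the example runs without a Rust toolchain.

```python
import subprocess
import sys
from typing import Optional

def run_gate(stages: list[tuple[str, list[str]]]) -> tuple[str, Optional[str]]:
    """Run stages in order; return ("green", None) or ("red", failing_stage)."""
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return "red", name   # stop at the first failing stage
    return "green", None

# The real stages would be the three cargo commands from the script above.
CARGO_STAGES = [
    ("build", ["cargo", "build"]),
    ("clippy", ["cargo", "clippy", "--", "-Dwarnings"]),
    ("test", ["cargo", "test"]),
]

# Stand-in commands so the sketch runs anywhere Python does.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]
print(run_gate([("build", ok), ("clippy", ok), ("test", ok)]))   # ('green', None)
print(run_gate([("build", ok), ("clippy", bad), ("test", ok)]))  # ('red', 'clippy')
```

Because the verdict comes from process exit codes rather than from anything the session prints, the session cannot talk its way past the gate.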

Real output after 15 days (360 runs)

| Outcome | Count | Notes |
|---|---|---|
| Merged to main | 89 | passed all 3 gate stages |
| Parked on feat/ | 22 | failed build, clippy, or test |
| No-op / duplicate | 249 | ledger dedup caught repeat picks |

The 89 merged features span three kinds: new_tool (15 new custom-* binaries or subcommands), deepen (51 improvements to existing tools — finer flags, edge-case handling, smarter defaults), and integrate (23 wiring tasks connecting tools so a single MCP call replaces 3 sequential tool calls).

Why it compounds

Each shipped feature reduces tokens on the next run. A deepen of custom-digest that compresses build output 10% more means every subsequent fleet-evolve run that reads build output spends 10% fewer tokens on that output. Over 89 shipped features the compounding is measurable: early runs (1–20) averaged 2.8M tokens per session; by runs 80–89 the average had dropped to 1.1M tokens per session, a 61% reduction that we attribute to the fleet improving itself.

Honest caveat: the 360-run count includes runs where the session launched but produced nothing useful (API errors, context overflow, degenerate Workflow scripts). The "89 merged" number is the real output, verified by the independent gate rather than self-reported by the session. The 249 "no-op" runs are cheap but not free: the ledger dedup check itself runs in <1ms, but the session spends ~100k tokens before it discovers the duplicate, so each no-op costs roughly $0.30 (which works out to $3 per million tokens), or about $75 across all 249.
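A ledger dedup check of this kind can be very fast because it is a set lookup on a fingerprint. The sketch below is a hypothetical illustration: the `Ledger` class, the choice of fields, and the normalisation rule are assumptions, not the fleet's actual ledger format.

```python
import hashlib

class Ledger:
    """Hypothetical dedup ledger: remembers fingerprints of tasks already picked."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    @staticmethod
    def fingerprint(kind: str, target: str, summary: str) -> str:
        # Normalise case and whitespace so trivially reworded picks collide.
        parts = (" ".join(part.lower().split()) for part in (kind, target, summary))
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

    def claim(self, kind: str, target: str, summary: str) -> bool:
        """Return True if the task is new (and record it), False if it is a duplicate."""
        key = self.fingerprint(kind, target, summary)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

ledger = Ledger()
print(ledger.claim("deepen", "custom-digest", "Compress clippy output"))    # True
print(ledger.claim("deepen", "custom-digest", "compress  clippy output"))   # False
print(ledger.claim("new_tool", "custom-digest", "compress clippy output"))  # True
```

The lookup cost is negligible; the tokens go to whatever the session does before it gets as far as consulting the ledger.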

How to watch a live run

# Stream the current evolve run in real time:
bash /root/custom-orchestrator/evolve/tail.sh

# Force one run now (bypasses timer):
systemctl start fleet-evolve.service

# See the improvement trend on the benchmark page:
# https://kwj.ai/benchmark

The full improvement trend is visible on the benchmark page, which plots mean tokens-per-session by week across the STAR50 corpus as new fleet-evolve features ship.

What is an agent infrastructure API, and why LLM context optimization is the missing layer

The three-layer model of an AI agent

Most tutorials describe an AI agent as: (1) a prompt that defines behaviour, (2) an LLM that reasons, and (3) tools the LLM can call. This model works fine for toy examples. In production it breaks down because it omits a critical fourth layer that sits between the tools and the model: agent infrastructure.

Agent infrastructure is the deterministic plumbing that handles everything the LLM should never have to reason about: caching identical fetches, compressing noisy log output before it enters the context window, slicing large files to the relevant symbol, storing secrets securely, and running shell commands in a sandboxed, cache-aware way. Without this layer, every tool call dumps its full raw output into the LLM's context window, and costs compound fast.
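Caching identical fetches is the simplest of these pieces to picture. Below is a minimal time-to-live cache, assuming a one-hour TTL; the `TTLCache` class, its injectable clock, and the fake fetcher are hypothetical and are not how custom-websearch or custom-cache are implemented.

```python
import time
from typing import Callable

class TTLCache:
    """Hypothetical fetch cache: reuse a result until it is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}
        self.misses = 0

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # hit: nothing is fetched again
        self.misses += 1
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body

# Usage with a fake fetcher and a controllable clock.
now = [0.0]
cache = TTLCache(fetch=lambda url: f"body of {url}", clock=lambda: now[0])
cache.get("https://example.com/docs")   # miss
cache.get("https://example.com/docs")   # hit within the hour
now[0] = 3601.0
cache.get("https://example.com/docs")   # expired, fetched again
print(cache.misses)                      # 2
```

Passing the clock in as a parameter keeps the cache testable without sleeping for an hour.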

What "agent infrastructure API" means in practice

An agent infrastructure API is a set of hosted, deterministic endpoints that your agent calls instead of raw tool calls. The difference is that each endpoint is context-aware: it shrinks, caches, or preprocesses its result before returning it, so the LLM sees a compact, relevant slice rather than the full raw output.

| Raw tool call | Via agent infrastructure API | Token difference |
|---|---|---|
| Read a 7,000-line Rust file | custom-context map + slice (one symbol) | −97% |
| Fetch a URL (no cache) | custom-websearch (1h TTL cache hit) | −91% on hit |
| Read 200-line build log | custom-digest (failures-only mode) | −85% |
| Load full MEMORY.md | custom-recall TF-IDF chunk retrieval | −78% |

The key property is that the infrastructure layer is deterministic and testable. It does not call the LLM. It does not make judgment calls. It applies mechanical transformations — collapse repeated lines, extract the relevant symbol span, return a cache hit — and hands a smaller, more useful payload to the model. The model then spends its tokens on the irreducible creative work.
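Two of the mechanical transformations mentioned above, collapsing repeated lines and keeping only failures, fit in a few lines of code. This is an illustrative sketch, not custom-digest: the function names, the `[xN]` suffix, and the keyword filter are assumptions.

```python
import re

# Hypothetical keyword filter for lines that look like failures.
FAILURE_RE = re.compile(r"\b(error|failed|panicked)\b", re.IGNORECASE)

def collapse_repeats(lines: list[str]) -> list[str]:
    """Replace runs of identical consecutive lines with one line plus a count."""
    out: list[str] = []
    i = 0
    while i < len(lines):
        j = i
        while j < len(lines) and lines[j] == lines[i]:
            j += 1
        run = j - i
        out.append(lines[i] if run == 1 else f"{lines[i]}  [x{run}]")
        i = j
    return out

def failures_only(lines: list[str]) -> list[str]:
    """Keep only the lines that match the failure keywords."""
    return [line for line in lines if FAILURE_RE.search(line)]

log = [
    "   Compiling serde v1.0.0",
    "warning: unused variable: `x`",
    "warning: unused variable: `x`",
    "warning: unused variable: `x`",
    "error[E0425]: cannot find value `y` in this scope",
    "test api::docs ... FAILED",
]
print(collapse_repeats(log))   # 4 lines, the warning carries "[x3]"
print(failures_only(log))      # the error line and the FAILED line
```

Both functions are pure: the same log always produces the same output, which is what makes this layer easy to unit-test.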

Why LLM context optimization is the highest-ROI investment

LLM API costs are dominated by input tokens. Output tokens matter too, but agents typically read far more than they write. A typical Claude Code session on a medium codebase consumes 2–5M input tokens and 200–400k output tokens, a ratio of roughly 10:1. Output tokens are priced higher per token (about 5× input at current Claude list prices), so the dollar split is closer to 2:1 than 10:1, but input is still the larger share of the bill: a given percentage cut to input context saves about twice as many dollars as the same percentage cut to output length.
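A quick way to see how the token ratio translates into dollars is to price both sides. The prices below are assumptions (USD 3 and USD 15 per million input and output tokens, a 5:1 ratio); check current pricing before relying on them, and note that prompt caching would lower the input side further.

```python
# Assumed list prices in USD per million tokens (check current pricing).
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

def session_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_dollars, output_dollars) for one session."""
    return (input_tokens / 1e6 * INPUT_PRICE, output_tokens / 1e6 * OUTPUT_PRICE)

in_cost, out_cost = session_cost(3_000_000, 300_000)   # a 10:1 session
print(in_cost, out_cost)       # 9.0 4.5
print(in_cost / out_cost)      # 2.0

# Dollars saved by the same 10% cut on each side
print(0.10 * in_cost, 0.10 * out_cost)
```

Under these assumptions input is about two thirds of the bill, so shrinking it remains the larger lever even though output tokens cost more each.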

LLM context optimization is the discipline of shrinking that input without losing the information the model actually needs. The three highest-leverage techniques, ranked by typical savings:

How KWJ exposes this as an API

KWJ's agent infrastructure API is available as a standard MCP server (one entry in your Claude Code config) or as plain HTTP endpoints (one API key, no SDK required). Every tool in the fleet is a thin deterministic function: no model calls and no surprises. Apart from what is currently in the caches, you get the same result every time for the same input.

# Setup in Claude Code (MCP):
# Add to .mcp.json in your project root (Claude Code's project-scoped MCP config):
{
  "mcpServers": {
    "kwj": {
      "command": "npx",
      "args": ["-y", "@kwj/mcp"],
      "env": { "KWJ_KEY": "your-api-key" }
    }
  }
}

# Or call the HTTP endpoint directly:
curl -sS https://kwj.ai/api/v1/compress \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "long build log..."}' | jq .compressed
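The same HTTP call can be made from Python's standard library. This is a sketch under stated assumptions: the endpoint, the "text" field, and the "compressed" field come from the curl example above, while the function names, the JSON content type, and the timeout are hypothetical. The request is built separately from sending it so the shape can be inspected offline.

```python
import json
import urllib.request

API_URL = "https://kwj.ai/api/v1/compress"   # endpoint from the curl example

def build_compress_request(api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the compress endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def compress(api_key: str, text: str, timeout: float = 10.0) -> str:
    """Send the request; assumes the response JSON has a "compressed" field."""
    request = build_compress_request(api_key, text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["compressed"]

req = build_compress_request("YOUR_KEY", "long build log...")
print(req.get_method(), req.full_url)
```

Calling `compress(...)` performs the network round-trip; the last two lines only build the request and print its method and URL.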

The API delivers a 90.3% token reduction on our own workload. Your number will vary: it depends on how much of your agent's input is large files, repeated fetches, and verbose tool output. The STAR50 benchmark breaks this down by task type so you can estimate the savings for your specific workflow before subscribing.

Honest caveat: agent infrastructure adds a round-trip for each tool call. For very short sessions (under 50k tokens) the overhead of the MCP bridge may exceed the savings. The break-even point in our testing is around 100k tokens per session — below that, the API costs more time than it saves money. Above it, the savings compound rapidly with session length.

Start with the free trial

KWJ offers a 7-day free trial with 100 API calls per day — enough to run the STAR50 benchmark against your own workload and see your actual savings number before paying anything. Get your trial key here.