The Domesday Book of KWJ · AI

Dispatches from the Field

Practitioner notes from C.W. Jameson. Production observations, cost analyses, and field reports on what actually works.

History · 6 min

The Day GPT-4 Stopped Being the Default

For fourteen months, GPT-4 was the only serious choice. Then Anthropic benchmarked past it on a Tuesday, and the market never went back.

history · claude · gpt-4
Cost · 8 min

Prompt Caching Economics: A Worked Example

I ran the same 5,000-token system prompt through 10,000 calls. With caching: $12. Without: $112. The setup took twenty minutes.

prompt-caching · cost · anthropic
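The arithmetic behind a comparison like this can be sketched in a few lines. This is a minimal sketch, not the article's calculation: the base rate of $3 per million input tokens is a placeholder, the 1.25× cache-write and 0.10× cache-read multipliers are assumptions, and the function name `prompt_cost_usd` is hypothetical. It also assumes every call after the first hits the cache, so the totals will not match the $12 and $112 quoted above.

```python
def prompt_cost_usd(prompt_tokens, calls, base_usd_per_mtok,
                    cached=False, write_mult=1.25, read_mult=0.10):
    """Cost of sending the same system prompt `calls` times.

    Assumptions (not from the article): the first cached call pays a
    write premium, every later call is a cache hit billed at a discount.
    """
    if calls <= 0:
        return 0.0
    # Uncached cost of one pass over the prompt.
    per_call = prompt_tokens / 1_000_000 * base_usd_per_mtok
    if not cached:
        return per_call * calls
    # One cache write, then (calls - 1) cache reads.
    return per_call * write_mult + per_call * read_mult * (calls - 1)


# Placeholder rate: 3 USD per million input tokens.
without_cache = prompt_cost_usd(5_000, 10_000, 3.0)
with_cache = prompt_cost_usd(5_000, 10_000, 3.0, cached=True)
print(f"without caching: ${without_cache:.2f}")
print(f"with caching:    ${with_cache:.2f}")
```

Swapping in the real rates for a given model, and a hit rate below 100%, is what turns this from a sketch into an estimate.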
Architecture · 10 min

Five Agent Patterns That Actually Work in Production

After running five different loop architectures in production, I ended up keeping three of them. Here are the five patterns and which two I stopped using.

agents · patterns · production
Analysis · 9 min

DeepSeek R1: Six Weeks Later

The January 2025 release of R1 changed three things: the floor for open-weights reasoning, the confidence of Chinese AI labs, and the board slide at every US lab.

deepseek · reasoning · open-weights
Ecosystem · 8 min

Model Context Protocol: Six Months In

MCP was supposed to be the USB of AI tool connections. After six months: it mostly is, with two significant caveats.

mcp · tooling · ecosystem
Cost · 12 min

The Real Cost of Self-Hosting a 70B Model

At 5 million tokens per day, self-hosting wins. Below 1 million, it is almost always cheaper to use the API. Here is the full calculation.

self-hosting · cost · open-weights
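A break-even volume of this kind comes from comparing a fixed daily hosting cost against a per-token API price. The sketch below shows the shape of that calculation only. The function name `break_even_tokens_per_day` is hypothetical, and the dollar figures are made-up placeholders chosen to exercise the function, not numbers from the article.

```python
def break_even_tokens_per_day(fixed_daily_cost_usd, api_usd_per_mtok,
                              selfhost_marginal_usd_per_mtok=0.0):
    """Daily token volume above which self-hosting is cheaper than the API.

    Self-hosting costs: fixed + marginal * volume.
    API costs:          api_rate * volume.
    Setting them equal gives volume = fixed / (api_rate - marginal).
    """
    saving_per_mtok = api_usd_per_mtok - selfhost_marginal_usd_per_mtok
    if saving_per_mtok <= 0:
        raise ValueError("API is never more expensive per token; no break-even")
    return fixed_daily_cost_usd / saving_per_mtok * 1_000_000


# Placeholder inputs: 60 USD/day of fixed cost, 20 USD per million API tokens.
threshold = break_even_tokens_per_day(60.0, 20.0)
print(f"break-even at about {threshold:,.0f} tokens per day")
```

Engineering time and idle GPU hours belong in the fixed term, which is usually where rough estimates go wrong.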
Cost · 7 min

The Hidden Cost of Reasoning Tokens

You asked for ten lines of code. The model thought for four thousand tokens before writing them. You paid for both.

cost · reasoning · extended-thinking
Engineering · 9 min

A Taxonomy of Structured Output Failures

There are exactly twelve ways a model will break your schema. Seven of them have deterministic fixes. Five require prompt engineering.

structured-output · json · engineering
Evaluation · 8 min

Perplexity vs ChatGPT vs Claude for Research Tasks

The citation quality comparison was the most surprising result: Perplexity cited more, but Claude was more accurate about what the cited source said.

comparison · research · citation
Education · 11 min

The Transformer Architecture: What Practitioners Actually Need to Know

The transformer is not magic. It is a very well-designed function approximator with three specific properties that make it scale. Understanding those three properties explains most model behaviour operators observe.

architecture · transformers · education
Engineering · 9 min

Fine-Tuning: When Does It Actually Help?

Fine-tuning helps with format and style. It rarely helps with knowledge. I have the invoice to prove it.

fine-tuning · prompting · engineering
Reliability · 8 min

Hallucination: A Field Classification

Confabulation (inventing plausible details), source drift (misattributing real content), and contradiction (output that conflicts with the prompt) are three different problems requiring three different solutions.

hallucination · reliability · grounding
Analysis · 7 min

The Refusal Landscape in 2025

The trend over three years has been toward more specific refusals with better alternatives. The outlier remains xAI's Grok, which is trending in the other direction.

refusal · safety · comparison
Architecture · 10 min

RAG vs Long Context: A Practical Comparison

The right answer depends on three variables: update frequency, query specificity, and document density. Here is the decision matrix.

rag · long-context · architecture
Reliability · 12 min

Agentic Systems: A Failure Modes Catalog

Mode 7 — the agent decides it is done when it is not — accounts for 38% of the production failures I have logged. It is also the hardest to detect.

agents · reliability · production
Evaluation · 10 min

OpenAI o3 vs Claude Opus 4.7 on Hard Problems

o3 wins on competition math. Opus wins on long-horizon code. For everything else, the answer depends on the specific task more than the model.

comparison · reasoning · evaluation
Cost · 6 min

The Prompt Cache Hit Rate Problem

The cache expects your prefix to be stable. Your prefix is not stable. Here is what to do about it.

prompt-caching · cost · anthropic
Security · 8 min

Agent Identity: The New Security Problem

When an agent can spawn sub-agents, who is responsible for the sub-agent's actions? The answer is 'nobody' until you design it to be otherwise.

security · agents · identity
Analysis · 8 min

Llama 4: The New Open-Weights Baseline

Meta shipped a model that can be self-hosted on eight H100s and matches GPT-4o on most benchmarks. The inference industry responded within two weeks.

llama · meta · open-weights
Engineering · 9 min

Grounding: The Systematic Approach

Grounding is not just RAG. It is a claim verification process. The retrieval is the easy part.

grounding · rag · reliability
Engineering · 7 min

The Tool Call Design Checklist

The most common tool design mistake is an error handling contract that returns nothing. The model has no idea what to do with silence.

tool-use · engineering · production
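One way to avoid the silent-failure contract described above is to wrap every tool so that it always returns a structured result, even when it raises or produces nothing. This is a minimal sketch under that assumption, not the checklist from the article; the wrapper name `call_tool` and the field names `ok`, `error`, `message`, and `note` are hypothetical.

```python
from typing import Any, Callable


def call_tool(fn: Callable[..., Any], **kwargs: Any) -> dict[str, Any]:
    """Run a tool and always hand the model something explicit to read."""
    try:
        result = fn(**kwargs)
    except Exception as exc:
        # Report the failure type and message instead of returning nothing.
        return {"ok": False, "error": type(exc).__name__, "message": str(exc)}
    if result is None:
        # Success with no output is stated, not left as silence.
        return {"ok": True, "result": None,
                "note": "tool completed with no output"}
    return {"ok": True, "result": result}


def divide(a: float, b: float) -> float:
    # Hypothetical example tool.
    return a / b


print(call_tool(divide, a=6, b=3))
print(call_tool(divide, a=1, b=0))
```

The point of the design is that the model can branch on `ok` and read `message`, rather than guessing what an empty response means.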
Cost · 8 min

Extended Thinking: A Cost-Benefit Analysis

Competition math, architecture design, legal contract review: yes. Customer support, classification, summarisation: almost never.

extended-thinking · cost · reasoning
Operations · 9 min

Batch Processing at Scale: The Practical Guide

Batch processing typically cuts cost by 50% and increases throughput 5×. It requires accepting up to 24 hours of latency. For most workloads, that is a straightforward trade.

batch · cost · operations
Architecture · 10 min

Memory Architectures for Agents

In-context memory is the simplest and breaks first. Episodic memory is the most human-like and the hardest to implement correctly.

memory · agents · architecture
Analysis · 8 min

The Agentic Era: What 2026 Looks Like From Inside It

The reliable part: agent loops that run for hours on well-defined tasks. The unreliable part: knowing whether a task is actually well-defined before it turns out not to be.

agentic-era · 2026 · analysis
Analysis · 7 min

Qwen: The Chinese Open-Weights Standard

Qwen handles Chinese-language content with the same quality that Claude brings to English. That sounds obvious; it was not obvious until someone built it.

qwen · alibaba · multilingual
Cost · 9 min

The Token Economics of Long Documents

A 500-page PDF is 250,000 tokens. At Opus pricing, reading it costs $3.75 per pass. With caching, each pass after the first costs $0.75.

cost · tokens · long-context
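The per-pass figures follow from simple token arithmetic. In the sketch below, the two rates are back-derived from the teaser's own numbers ($3.75 and $0.75 over 250,000 tokens imply $15 and $3 per million tokens); they are not taken from any price list, and the function name `pass_cost_usd` is hypothetical.

```python
def pass_cost_usd(tokens, usd_per_mtok):
    """Input cost of reading `tokens` tokens once at a per-million rate."""
    return tokens / 1_000_000 * usd_per_mtok


PAGES = 500
TOKENS = 250_000

# Rates implied by the figures quoted above (an inference, not a price list).
UNCACHED_USD_PER_MTOK = 15.0
CACHED_USD_PER_MTOK = 3.0

tokens_per_page = TOKENS / PAGES
uncached = pass_cost_usd(TOKENS, UNCACHED_USD_PER_MTOK)
cached = pass_cost_usd(TOKENS, CACHED_USD_PER_MTOK)

print(f"{tokens_per_page:.0f} tokens per page")
print(f"uncached pass: ${uncached:.2f}")
print(f"cached pass:   ${cached:.2f}")
```

Output tokens and any cache-write premium on the first pass are left out, so this covers only the cost of reading the document.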
Tools · 6 min

NotebookLM's Podcast Feature: One Year Later

The feature still works exactly as advertised. The content is still shallow. For onboarding to a new domain, it remains genuinely useful.

notebooklm · google · audio
Evaluation · 9 min

Evaluation Without Ground Truth

LLM-as-judge is surprisingly reliable for quality dimensions. It is completely unreliable for factual accuracy.

evaluation · llm-judge · quality
Security · 11 min

The State of LLM Security in 2026

Prompt injection is not a solved problem. It is a managed risk. The operators who do it well have accepted that and built accordingly.

security · prompt-injection · 2026
Engineering · 10 min

Why Agents Fail at Long Tasks

Most agents fail before the 30-minute mark. The reasons are structural, not just model limitations.

agents · engineering · failure-modes
Architecture · 7 min

How MCP Changes Tool Use for AI Agents

MCP standardises tool definitions across models. The implications for multi-agent systems are significant.

mcp · tools · architecture
Strategy · 9 min

The Case Against Fine-Tuning in 2025

Fine-tuning felt necessary in 2022. By 2025, base models were improving faster than fine-tunes could keep up.

fine-tuning · strategy · prompting
Architecture · 9 min

Vision Models in Production: What Works

Vision models handle PDFs and screenshots well. They fail predictably on handwriting and dense tables.

vision · multimodal · production
Architecture · 8 min

Embedding Models Explained for Engineers

Choosing the wrong embedding model for your language pair or domain quietly degrades RAG quality for months.

embeddings · rag · architecture
Engineering · 10 min

How Retrieval-Augmented Generation Fails

RAG pipelines look simple until they are in production. Here are the failure modes that surprise every team.

rag · engineering · failure-modes
Engineering · 8 min

Function Calling Best Practices for Production

Function calling looks simple. It breaks in production when schemas are ambiguous or error messages are not informative.

function-calling · tools · engineering
Research · 9 min

GPT-4 to GPT-5: What Actually Changed

GPT-5 is not GPT-4 with more parameters. The architecture and training approach changed in observable ways.

gpt-5 · openai · research
Architecture · 8 min

Model Routing in Production Systems

Sending every query to the frontier model is expensive. Smart routing cuts costs while preserving quality.

routing · architecture · economics
History · 8 min

DeepSeek: What Changed Everything in January 2025

DeepSeek showed that frontier-level capability did not require frontier-level compute. The implications rippled through the industry for months.

deepseek · history · economics
Engineering · 9 min

Agentic Web Research: Workflows That Scale

Web research agents are easy to build and hard to make reliable. Source quality and citation accuracy are the hard problems.

research · agents · web
Economics · 10 min

AI Cost Optimisation: Ten Strategies That Work

Production AI costs surprised most teams. Here are the ten approaches that deliver the highest cost reduction per unit of effort.

cost · economics · production
Security · 11 min

LLM Security in Enterprise Deployments

Enterprise security teams are still building mental models for LLM-specific risks. This is the current map.

security · enterprise · compliance
Engineering · 9 min

State Management in Agent Loops

Agent state management is where most frameworks fall short. Here are the patterns that handle real-world failure conditions.

state · agents · engineering
Strategy · 9 min

Building AI Products That Stick

The AI feature graveyard is full of products that users tried once. Here is the pattern that separates sticky AI from abandoned AI.

product · strategy · retention
Prompting · 8 min

Claude Extended Thinking: A Practical Guide

Extended thinking is powerful on hard reasoning tasks and wasteful on simple ones. Here is how to tell which is which.

claude · reasoning · prompting
Engineering · 12 min

Deploying Open-Source LLMs: A Practical Runbook

Running open-source models sounds like cost savings. It is, until you account for operations, GPU rental, and engineering time.

open-source · deployment · engineering
Security · 10 min

Red Teaming AI Systems: A Practitioner's Guide

Red teaming an LLM is not pen testing. The attack surface is fundamentally different and the failures are harder to categorise.

red-teaming · security · safety
Architecture · 8 min

Long Context vs Retrieval: When to Use Each

Long context windows change the calculus for RAG. Here is how to decide which approach is right for your use case.

long-context · rag · architecture
Engineering · 9 min

Observability for LLM Applications

LLM monitoring is not like service monitoring. The failure modes are different and the metrics that matter are counterintuitive.

observability · monitoring · production
Engineering · 9 min

Tool Use Design Patterns for AI Agents

Tool design is half of agent design. A poorly defined tool is as damaging as a poorly designed prompt.

tools · agents · engineering
Research · 8 min

Synthetic Data Generation with LLMs

Synthetic data sounds like a shortcut. Used correctly, it is one of the highest-leverage data strategies available.

synthetic-data · training · research
Operations · 8 min

Content Moderation at Scale with LLMs

LLMs are better moderators than classifiers for nuanced content. They are slower and more expensive, which changes the architecture.

moderation · safety · operations
Legal · 10 min

AI for Legal Workflows: What Actually Helps

Legal is high-stakes. The tasks where AI helps and the tasks where it can damage client outcomes are not always obvious.

legal · workflows · applications
Finance · 9 min

AI for Financial Analysis: Where It Fits

Financial analysis and LLMs have a complicated relationship. The models are capable; the liability frameworks are not ready.

finance · workflows · applications