Agentic Engineering Weekly for April 17-25, 2026


This week, the receipts arrived from every direction at once. Thoughtworks' Tech Radar reads like a warning letter, frontier models silently corrupt a quarter of what you delegate to them, and the companies that told you AI would pay for itself are doubling their prices, while employees worldwide burn tokens just to look productive (and not get fired).


My top 3 picks this week


Code is cheap, code isn't cheap. Which is it?

Two takes landed in the same week and they look like opposites until you read them carefully. Cat Wu, Head of Product for Claude Code at Anthropic, said it plainly: "As code becomes much cheaper to write, the thing that becomes more valuable is deciding what to write." Code is cheap. The marginal cost of the next prototype or the next draft implementation is approaching zero, and the scarce resource while prototyping is no longer typing speed. It is product taste, problem selection, and knowing what to build next.

Pocock made the counter-argument in the same week. After eighteen months of teaching developers to ship with AI agents, his observation was sharp: AI coding tools are extraordinary when used well and ruinous when used badly, and the difference is the process, not the tool. The fundamentals (TDD, clean boundaries, friction as a feature) are the only thing standing between a productive AI workflow and a codebase nobody can reason about in six months. Code isn't cheap when you measure it in the comprehension you have to maintain and the cognitive debt you have to pay back later.

Both claims are true at once, which is why this is the harder conversation to have right now. The cost of generating code went to zero. The cost of owning code did not. A free extract from Alistair Cockburn's upcoming book Slice the Problem, Grow the Solution landed the same week with a frame that sharpens the point: unvalidated decisions are toxic inventory, and the fix is moving decisions through the system in very small, validated increments. Walking skeletons, nano-increments, Learning-Value-Trim loops. His AI chapter says it plainly: treat AI as an unreliable partner, and use tests, boundaries, and tiny increments for control. The teams that thrive in the next two years will be the ones who internalise both halves of the paradox: ambitious enough to use the cheap-code superpower, disciplined enough to refuse the expensive-code trap. The fundamentals we have known about for 50 years just became your competitive advantage again.

Worth reading:


Tokenmaxxing: gaming AI metrics on both sides of the API

The Pragmatic Engineer named a strange new behaviour at Meta, Microsoft, and Salesforce: developers are deliberately burning tokens (and money) to inflate their AI-usage metrics, because their employers are treating tokens-consumed as a proxy for AI adoption. Goodhart's law just landed in the AI talent market. The moment you incentivise a metric, people optimise for the metric, even when the metric is decoupled from the outcome. Engineering managers reading this should check what dashboards their leadership is looking at, and what story those numbers are telling them about the team they think they have.

The vendor side of the API is doing the inverse. Anthropic briefly removed Claude Code from the $20 Pro plan with no announcement, then partially walked it back when Simon Willison and others noticed. Microsoft's leaked internal docs show the weekly cost of running GitHub Copilot has doubled since January, prompting a shift from request-based to token-based billing, suspended individual signups, and tightened rate limits. The Verge calls it the AI monetization cliff. The economics that funded the last eighteen months of cheap inference are tightening simultaneously across providers, and the practitioner trust that the cheap-inference era was building on is the first casualty. Opus 4.7 and GPT-5.5 cost twice as much as their predecessors. There is some light at the end of the tunnel: open-weight models are catching up for agentic engineering workloads.

The two trends together describe a specific dysfunction. Companies are paying more for AI while measuring usage in ways that reward the worst kind of consumption. Token-hygiene practices that used to be optional optimisations are turning into essential engineering discipline. The harness that emits fewer tokens is now also the harness that survives the next price hike.
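One way out of the measurement trap described above is to price outcomes rather than consumption. A minimal sketch of the idea, assuming a per-run record of tokens spent and whether the result actually shipped (the `AgentRun` shape is an illustration, not any vendor's dashboard):

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    tokens_used: int       # total tokens this delegated run consumed
    change_accepted: bool  # did the output land with minimal rework?

def tokens_per_accepted_change(runs: list[AgentRun]) -> float:
    """Cost per outcome: total tokens divided by accepted changes.

    Unlike raw tokens-consumed, this number gets *worse* when someone
    burns tokens without shipping anything, so it resists tokenmaxxing.
    """
    accepted = sum(1 for r in runs if r.change_accepted)
    if accepted == 0:
        return float("inf")  # all burn, no outcome
    return sum(r.tokens_used for r in runs) / accepted
```

A dashboard built on this ratio rewards the harness that emits fewer tokens per shipped change, which is exactly the behaviour a raw usage metric punishes.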

Worth reading:


Frontier models silently corrupt 25% of delegated documents

A new benchmark called DELEGATE-52 simulates long delegated workflows across fifty-two professional domains, from coding to crystallography to music notation. Across nineteen LLMs tested, even the frontier (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupted an average of 25% of document content by the end of long workflows. Errors are sparse but severe. They compound silently. Agentic tool use does not improve performance, and degradation gets worse with document size, interaction length, or the presence of distractor files. SlopCodeBench measures the same shape in coding agents.
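To make "corrupted 25% of document content" concrete, here is a toy stand-in for that kind of measurement: compare the reference document against what survived the delegated workflow. Using `difflib`'s similarity ratio is my assumption for illustration only, not DELEGATE-52's actual scoring method:

```python
import difflib

def corruption_rate(reference: str, produced: str) -> float:
    """Rough fraction of reference content that did not survive a
    delegated workflow. A ratio of 1.0 means nothing recognisable
    remains; 0.0 means the document came back intact."""
    matcher = difflib.SequenceMatcher(a=reference, b=produced)
    return 1.0 - matcher.ratio()
```

The point of even a crude gauge like this is that it runs at the boundary, outside the agent, which is where the benchmark says the silent errors accumulate.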

This is the first hard number on a failure mode practitioners have been describing anecdotally for months. The Hak video on this week's list captures the lived version: two days debugging code with a subtle logic flaw buried deep within the AI-generated output, exactly the failure mode DELEGATE-52 measures. The MIT, Stanford and Upwork data on a 97% failure rate for AI coding tools on real freelance jobs lands in the same week. The benchmarks and the field reports finally agree, and the agreement is not flattering for AI shills. Humans are still essential custodians of your dark factory, and guardrails remain an essential ingredient of your agentic harness.

The implication for delegation patterns is sharp. Long-horizon agent runs are not safe by default. Short tasks with explicit verification gates are. Whatever harness you build should assume silent degradation as a baseline and verify aggressively at the boundaries. Kent Beck's framing fits here: nobody actually wants an agent, they want the outcome.
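The delegation pattern recommended above can be sketched as a loop: short tasks, each followed by a verification gate that does not trust the agent's self-report. Everything here (`run_agent`, `verify`, the retry budget) is a hypothetical stand-in for whatever your harness actually calls:

```python
def run_gated(tasks, run_agent, verify, max_retries=1):
    """Delegate short tasks one at a time, verifying at every boundary.

    Assumes silent degradation as the baseline: a task only counts as
    done when `verify` (e.g. running the test suite) says so, and the
    run halts rather than letting an unverified step compound.
    """
    completed = []
    for task in tasks:
        for _attempt in range(max_retries + 1):
            run_agent(task)          # hypothetical agent invocation
            if verify(task):         # the gate, not the agent, decides
                completed.append(task)
                break
        else:
            raise RuntimeError(f"task failed verification gate: {task!r}")
    return completed
```

Halting on a failed gate is the whole design: it converts a long-horizon run, where DELEGATE-52 says errors compound silently, into a series of short runs where each error surfaces at its own boundary.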

Worth reading:


Composition is the next architecture problem. Again.

A small but consistent signal is starting to converge: the cost of fragmented systems is being named again, but in the agent-era language. Cynefin's Snowden wrote about elephants in canals versus elephants in estuaries: context determines whether your system is a forcing function or an enabling constraint. Software Enchiridion framed a platform as a place rather than a product. Different vocabularies, one underlying claim.

The architectural conversation is shifting from microservices versus monoliths to whether the whole thing holds together when agents are both the builders and the users. Vercel's Malte Ubl described AI engineering as the legitimate successor to web development. Burgess' paper applying promise theory to mixed human-machine systems gives the formal version: agents make promises, systems hold together when promises compose. The second edition of Designing Data-Intensive Applications dropped this month as a reminder that the architectural fundamentals do not get cheaper to ignore as you stack agents on top.

Worth reading:


Change is complex, not complicated. Today we're all being forced to change.

Esther Derby's new piece, The Fingerprint Principle, made the same argument she has been making for years from a fresh angle: when leaders make a change, they want buy-in, but presenting a polished proposal prevents that. Let people get their fingerprints on it. The framing is starting to feel load-bearing for AI rollouts specifically, because AI is the largest organisational change most teams will face this decade, and the default playbook of top-down rewrites and quarterly retraining is producing the predictable result. People do not resist change. They resist being changed.

The Serverless Craic episode on psychological safety in the AI era asks the question that should be on every engineering leader's mind: what happens to the conditions that allow learning and experimentation when everything is changing at once? Coercion, rewards, and positional authority produce compliance, not engagement. AI rollouts that lean on those levers are predictably stalling, and the data on adoption gaps tells the same story from a different direction. The frameworks worth reaching for here are old: small FINE experiments that need no permission, the principle of mission command instead of micromanagement, invite over inflict, finding the bright spots already working in the context and amplifying them.

Worth reading:


The Tech Radar is blinking red

Thoughtworks dropped Volume 34 of their recurring Tech Radar, and for the first time ever it reads less like a technology map and more like a warning letter. The blips that matter most: codebase cognitive debt (teams shipping faster than they can understand), coding throughput as a measure of productivity (Goodhart's law wearing a hoodie: measure lines generated and you get a flood of barely-reviewed output while cycle times go up), and semantic diffusion (AI terms proliferating faster than shared meaning, with over 300 proposed technologies submitted to this edition of the radar, some contributed by coding agents themselves).

Thoughtworks' conclusion is blunt: the current cognitive demand is not sustainable and will likely undermine the very gains AI is supposed to deliver. Their suggested counter-metric, first-pass acceptance rate (how often AI output gets used with minimal rework), connects directly to the tokenmaxxing problem from earlier in this issue.
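As a rough illustration of how a first-pass acceptance rate could be computed, here is a sketch under stated assumptions: the per-change input shape and the 20% rework threshold are mine, not Thoughtworks' definition:

```python
def first_pass_acceptance_rate(reviews, rework_threshold=0.2):
    """Fraction of AI-generated changes accepted with minimal rework.

    `reviews` is a list of (lines_generated, lines_reworked) pairs; a
    change counts as first-pass accepted when its rework ratio stays
    under `rework_threshold`. Both the input shape and the threshold
    are assumptions for illustration.
    """
    if not reviews:
        return 0.0
    accepted = sum(
        1 for generated, reworked in reviews
        if generated and reworked / generated < rework_threshold
    )
    return accepted / len(reviews)
```

Note what this metric rewards compared with coding throughput: generating more barely-reviewed output drives it down, not up, which is why it pairs naturally with the tokens-per-outcome framing from the tokenmaxxing section.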

Worth reading:


Quick Hits


Curated from articles, podcasts, and videos. Week of April 17-25, 2026.
