Agentic Engineering Weekly for April 17-25, 2026


This week, the receipts arrived from every direction at once. Thoughtworks' Tech Radar reads like a warning letter, frontier models silently corrupt a quarter of what you delegate to them, and the companies that told you AI would pay for itself are doubling their prices, while employees worldwide burn tokens just to look productive (and not get fired).


My top 3 picks this week


Code is cheap, code isn't cheap. Which is it?

Two takes landed in the same week and they look like opposites until you read them carefully. Cat Wu, Head of Product for Claude Code at Anthropic, said it plainly: "As code becomes much cheaper to write, the thing that becomes more valuable is deciding what to write." Code is cheap. The marginal cost of the next prototype or the next draft implementation is approaching zero, and the scarce resource while prototyping is no longer typing speed. It is product taste, problem selection, and knowing what to build next.

Pocock made the counter-argument in the same week. After eighteen months of teaching developers to ship with AI agents, his observation was sharp: AI coding tools are extraordinary when used well and ruinous when used badly, and the difference is the process, not the tool. The fundamentals (TDD, clean boundaries, friction as a feature) are the only thing standing between a productive AI workflow and a codebase nobody can reason about in six months. Code isn't cheap when you measure it in the comprehension you have to maintain and the cognitive debt you have to pay back later.

Both claims are true at once, which is why this is the harder conversation to have right now. The cost of generating code went to zero. The cost of owning code did not. A free extract from Alistair Cockburn's upcoming book Slice the Problem, Grow the Solution landed the same week with a frame that sharpens the point: unvalidated decisions are toxic inventory, and the fix is moving decisions through the system in very small, validated increments. Walking skeletons, nano-increments, Learning-Value-Trim loops. His AI chapter says it plainly: treat AI as an unreliable partner, and use tests, boundaries, and tiny increments for control. The teams that thrive in the next two years will be the ones who internalise both halves of the paradox: ambitious enough to use the cheap-code superpower, disciplined enough to refuse the expensive-code trap. The fundamentals we have known about for 50 years just became your competitive advantage again.

Worth reading:


Tokenmaxxing: gaming AI metrics on both sides of the API

The Pragmatic Engineer named a strange new behaviour at Meta, Microsoft, and Salesforce: developers are deliberately burning tokens (and money) to inflate their AI-usage metrics, because their employers are treating tokens-consumed as a proxy for AI adoption. Goodhart's law just landed in the AI talent market. The moment you incentivise a metric, people optimise for the metric, even when the metric is decoupled from the outcome. Engineering managers reading this should check what dashboards their leadership is looking at, and what story those numbers are telling them about the team they think they have.

The vendor side of the API is doing the inverse. Anthropic briefly removed Claude Code from the $20 Pro plan with no announcement, then partially walked it back when Simon Willison and others noticed. Microsoft's leaked internal docs show the weekly cost of running GitHub Copilot has doubled since January, prompting a shift from request-based to token-based billing, suspended individual signups, and tightened rate limits. The Verge calls it the AI monetization cliff. The economics that funded the last eighteen months of cheap inference are tightening simultaneously across providers, and the practitioner trust that the cheap-inference era was building on is the first casualty. Opus 4.7 and GPT-5.5 cost twice as much as their predecessors. There is some light at the end of the tunnel: open-weight models are catching up for agentic engineering workloads.

The two trends together describe a specific dysfunction. Companies are paying more for AI while measuring usage in ways that reward the worst kind of consumption. Token-hygiene practices that used to be optional optimisations are turning into essential engineering discipline. The harness that emits fewer tokens is now also the harness that survives the next price hike.
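One way out of the measurement trap described above is to price outcomes rather than consumption. A minimal sketch of the idea, assuming a per-run record of tokens spent and whether the result actually shipped (the `AgentRun` shape is an illustration, not any vendor's dashboard):

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    tokens_used: int       # total tokens this delegated run consumed
    change_accepted: bool  # did the output land with minimal rework?

def tokens_per_accepted_change(runs: list[AgentRun]) -> float:
    """Cost per outcome: total tokens divided by accepted changes.

    Unlike raw tokens-consumed, this number gets *worse* when someone
    burns tokens without shipping anything, so it resists tokenmaxxing.
    """
    accepted = sum(1 for r in runs if r.change_accepted)
    if accepted == 0:
        return float("inf")  # all burn, no outcome
    return sum(r.tokens_used for r in runs) / accepted
```

A dashboard built on this ratio rewards the harness that emits fewer tokens per shipped change, which is exactly the behaviour a raw usage metric punishes.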

Worth reading:


Frontier models silently corrupt 25% of delegated documents

A new benchmark called DELEGATE-52 simulates long delegated workflows across fifty-two professional domains, from coding to crystallography to music notation. Across nineteen LLMs tested, even the frontier (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupted an average of 25% of document content by the end of long workflows. Errors are sparse but severe. They compound silently. Agentic tool use does not improve performance, and degradation gets worse with document size, interaction length, or the presence of distractor files. SlopCodeBench measures the same shape in coding agents.
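To make "corrupted 25% of document content" concrete, here is a toy stand-in for that kind of measurement: compare the reference document against what survived the delegated workflow. Using `difflib`'s similarity ratio is my assumption for illustration only, not DELEGATE-52's actual scoring method:

```python
import difflib

def corruption_rate(reference: str, produced: str) -> float:
    """Rough fraction of reference content that did not survive a
    delegated workflow. A ratio of 1.0 means nothing recognisable
    remains; 0.0 means the document came back intact."""
    matcher = difflib.SequenceMatcher(a=reference, b=produced)
    return 1.0 - matcher.ratio()
```

The point of even a crude gauge like this is that it runs at the boundary, outside the agent, which is where the benchmark says the silent errors accumulate.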

This is the first hard number on a failure mode practitioners have been describing anecdotally for months. The Hak video on this week's list captures the lived version: two days debugging code with a subtle logic flaw buried deep within the AI-generated output, exactly the failure mode DELEGATE-52 measures. The MIT, Stanford and Upwork data on a 97% failure rate for AI coding tools on real freelance jobs lands in the same week. The benchmarks and the field reports finally agree, and the agreement is not flattering for AI shills. Humans are still essential custodians of your dark factory, and guardrails remain an essential ingredient of your agentic harness.

The implication for delegation patterns is sharp. Long-horizon agent runs are not safe by default. Short tasks with explicit verification gates are. Whatever harness you build should assume silent degradation as a baseline and verify aggressively at the boundaries. Kent Beck's framing fits here: nobody actually wants an agent, they want the outcome.
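The delegation pattern recommended above can be sketched as a loop: short tasks, each followed by a verification gate that does not trust the agent's self-report. Everything here (`run_agent`, `verify`, the retry budget) is a hypothetical stand-in for whatever your harness actually calls:

```python
def run_gated(tasks, run_agent, verify, max_retries=1):
    """Delegate short tasks one at a time, verifying at every boundary.

    Assumes silent degradation as the baseline: a task only counts as
    done when `verify` (e.g. running the test suite) says so, and the
    run halts rather than letting an unverified step compound.
    """
    completed = []
    for task in tasks:
        for _attempt in range(max_retries + 1):
            run_agent(task)          # hypothetical agent invocation
            if verify(task):         # the gate, not the agent, decides
                completed.append(task)
                break
        else:
            raise RuntimeError(f"task failed verification gate: {task!r}")
    return completed
```

Halting on a failed gate is the whole design: it converts a long-horizon run, where DELEGATE-52 says errors compound silently, into a series of short runs where each error surfaces at its own boundary.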

Worth reading:


Composition is the next architecture problem. Again.

A small but consistent signal is starting to converge: the cost of fragmented systems is being named again, but in the agent-era language. Cynefin's Snowden wrote about elephants in canals versus elephants in estuaries: context determines whether your system is a forcing function or an enabling constraint. Software Enchiridion framed a platform as a place rather than a product. Different vocabularies, one underlying claim.

The architectural conversation is shifting from microservices versus monoliths to whether the whole thing holds together when agents are both the builders and the users. Vercel's Malte Ubl described AI engineering as the legitimate successor to web development. Burgess' paper applying promise theory to mixed human-machine systems gives the formal version: agents make promises, systems hold together when promises compose. The second edition of Designing Data-Intensive Applications dropped this month as a reminder that the architectural fundamentals do not get cheaper to ignore as you stack agents on top.

Worth reading:


Change is complex, not complicated. Today we're all being forced to change.

Esther Derby's new piece, The Fingerprint Principle, made the same argument she has been making for years from a fresh angle: when leaders make a change, they want buy-in, but presenting a polished proposal prevents that. Let people get their fingerprints on it. The framing is starting to feel load-bearing for AI rollouts specifically, because AI is the largest organisational change most teams will face this decade, and the default playbook of top-down rewrites and quarterly retraining is producing the predictable result. People do not resist change. They resist being changed.

The Serverless Craic episode on psychological safety in the AI era asks the question that should be on every engineering leader's mind: what happens to the conditions that allow learning and experimentation when everything is changing at once? Coercion, rewards, and positional authority produce compliance, not engagement. AI rollouts that lean on those levers are predictably stalling, and the data on adoption gaps tells the same story from a different direction. The frameworks worth reaching for here are old: small FINE experiments that need no permission, the principle of mission command instead of micromanagement, invite over inflict, finding the bright spots already working in the context and amplifying them.

Worth reading:


The Tech Radar is blinking red

Thoughtworks dropped Volume 34 of their recurring Tech Radar, and for the first time ever it reads less like a technology map and more like a warning letter. The blips that matter most: codebase cognitive debt (teams shipping faster than they can understand), coding throughput as a measure of productivity (Goodhart's law wearing a hoodie: measure lines generated and you get a flood of barely-reviewed output while cycle times go up), and semantic diffusion (AI terms proliferating faster than shared meaning, with over 300 proposed technologies submitted to this edition of the radar, some contributed by coding agents themselves).

Thoughtworks' conclusion is blunt: the current cognitive demand is not sustainable and will likely undermine the very gains AI is supposed to deliver. Their suggested counter-metric, first-pass acceptance rate (how often AI output gets used with minimal rework), connects directly to the tokenmaxxing problem from earlier in this issue.
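As a rough illustration of how a first-pass acceptance rate could be computed, here is a sketch under stated assumptions: the per-change input shape and the 20% rework threshold are mine, not Thoughtworks' definition:

```python
def first_pass_acceptance_rate(reviews, rework_threshold=0.2):
    """Fraction of AI-generated changes accepted with minimal rework.

    `reviews` is a list of (lines_generated, lines_reworked) pairs; a
    change counts as first-pass accepted when its rework ratio stays
    under `rework_threshold`. Both the input shape and the threshold
    are assumptions for illustration.
    """
    if not reviews:
        return 0.0
    accepted = sum(
        1 for generated, reworked in reviews
        if generated and reworked / generated < rework_threshold
    )
    return accepted / len(reviews)
```

Note what this metric rewards compared with coding throughput: generating more barely-reviewed output drives it down, not up, which is why it pairs naturally with the tokens-per-outcome framing from the tokenmaxxing section.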

Worth reading:


Quick Hits


Curated from articles, podcasts, and videos. Week of April 17-25, 2026.
