Agentic Engineering Weekly for May 8-15, 2026

Harness engineering is gaining traction, the maintenance bill from a year of vibe coding is coming due, and METR puts the first credible number on agent autonomy. The vocabulary of 2026 is stabilizing in front of us.

My top 3 picks this week

Machines of Code and Grace: Great podcast where two experienced engineers share their experiences and tips for working with the clankers. No poop candy! (podcast)
LLMs are functions, not brains: Software engineers are over-emphasizing the "agent" angle and underemphasizing the "tool" angle (article)
Maintenance: Of Everything: Great book on "maintenance", from sail boats to motorcycles. Maintenance is where the cost of AI-generated code still lies in 2026 (book)

Last week's video

Does programming language choice still matter in the agentic engineering era? I went in with a hypothesis: popular strongly typed languages would give coding agents a leg up. The compiler is the agent's best friend, right? Fast feedback, fewer hallucinations, types-as-tests. I expected my personal language of choice, F#, to rank highly. The data refused to confirm it.

More thoughts on Harness Engineering

The vocabulary stabilized. Thoughtworks called it the year's term on their latest podcast episode. LangChain shipped the anatomy diagram: filesystem, sandbox, memory. Matt Wynne wrote up the Dark Factory pattern, Vincent Koc showed it shipping at OpenClaw, and Russ Miles wrote the parable. Six months ago this was prompt engineering plus glue. This week it has a job title, conference tracks, and a reference architecture.

The shift is in what counts as "the work." Stacking markdown rules in CLAUDE.md was v1. V2 is computational sensors that fire when the agent does something wrong, sandboxes that contain the blast radius, and skills that compose like compiler passes. Pedro Rodrigues at Supabase showed why the stakes are real: agents were constantly bypassing row-level security because nobody told them about the security_invoker flag. The harness fixed it. Markdown rules would not have.

Where this is going: harness engineering converges with platform engineering. The team that ships agents reliably is the team with the strongest sensors, the best sandboxes, and the most honed skill library, no longer the team with the cleverest prompts in markdown files. The compiler analogy from Skip Labs is the right mental model. Our discomfort with AI-generated code is really discomfort with the missing tools we have not built around it yet.
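To make "computational sensor" concrete, here is a minimal sketch in Python, inspired by the row-level-security story above. It scans a SQL migration for CREATE VIEW statements that do not enable the security_invoker option and turns each hit into feedback for the agent. The function names and the regexes are my own assumptions for illustration, not the harness Supabase showed, and a regex is a crude stand-in for a real SQL parser (it ignores TEMP and RECURSIVE views and assumes no semicolons inside string literals).

```python
import re
from typing import List

# Crude matcher for "CREATE [OR REPLACE] VIEW <name> ... ;" statements.
# Assumption: statements end at the first semicolon.
CREATE_VIEW = re.compile(
    r'create\s+(?:or\s+replace\s+)?view\s+(?P<name>[\w."]+)(?P<rest>.*?);',
    re.IGNORECASE | re.DOTALL,
)
# Matches "security_invoker = true" or "security_invoker = on".
INVOKER_ON = re.compile(r"security_invoker\s*=\s*(?:true|on)\b", re.IGNORECASE)


def views_missing_security_invoker(sql: str) -> List[str]:
    """Return the names of views created without security_invoker enabled."""
    return [
        match.group("name")
        for match in CREATE_VIEW.finditer(sql)
        if not INVOKER_ON.search(match.group("rest"))
    ]


def sensor_feedback(sql: str) -> List[str]:
    """Turn each finding into a message the harness can hand back to the agent."""
    return [
        f"View {name} is created without security_invoker = true; "
        "callers may bypass row-level security. Add WITH (security_invoker = true)."
        for name in views_missing_security_invoker(sql)
    ]
```

The point of the design is where it runs: wired into the loop after every migration the agent writes, it fires on the mistake itself instead of hoping the agent read a rule in a markdown file.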

Worth reading:

What is harness engineering?: Podcast to accompany Birgitta Boeckeler's original article (podcast)
The Anatomy of an Agent Harness: The reference diagram you can hand to your team (article)
Harness engineering beyond skills: Using sensors to keep your coding agent in check: The sensor-driven feedback channel, with concrete examples (video)
Combine Skills and MCP to Close the Context Gap: The row-level-security war story (video)
Treat Agent Output Like Compiler Output: The mental model that reframes the whole problem (article)

Code is cheap, maintenance is where you can find the real cost

James Shore's piece started this topic off by stating that if your AI doubles your speed, it had better halve your maintenance costs, otherwise you are trading a temporary speed boost for permanent indenture. David Fowler on Microsoft's Aspire team published the experience report under a title that does not need explanation: "AI Made Us Faster. That Was the Problem."

The heuristic that holds up best, I'm finding, is this: code is cheap, software still isn't. Maintenance cost, not creation cost, is the major contributor to total cost of ownership. A tool that took two hours to build still takes the same effort to keep alive as one that took two weeks. The operational implication: productivity is determined by maintenance costs. Every codebase you keep around should be built with that in mind.

A k10s.dev developer made the case study at human scale this week. Seven months of vibe-coding a GPU-aware Kubernetes TUI. Archived. Started over by hand. AI got it wrong when the project grew complex enough that the maintenance horizon outran the creation speedup. Kent Beck's "Hey N00b, We Didn't Hire You to Complete Tasks" completes the circle: ongoing thinking, not task completion, is what the maintenance bill is paying for.
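The "double the speed, halve the maintenance" claim can be checked with a toy model. This is my own sketch in Python, not the math from James Shore's article, and every number in it is made up: a team has a fixed capacity of 100 effort units per month, every unit of shipped code costs a fixed fraction of a unit in maintenance each month forever, and maintenance is paid before feature work.

```python
def monthly_feature_output(speed, maintenance_rate, months, capacity=100.0):
    """Simulate feature output per month for a team with fixed capacity.

    Toy assumptions: each unit of shipped code costs `maintenance_rate`
    units of effort per month forever, and maintenance is paid first.
    `speed` is how much code one unit of feature effort produces.
    """
    code = 0.0
    output = []
    for _ in range(months):
        feature_time = max(capacity - maintenance_rate * code, 0.0)
        shipped = speed * feature_time
        output.append(shipped)
        code += shipped
    return output


MONTHS = 61  # months 0 through 60

baseline = monthly_feature_output(speed=1, maintenance_rate=0.02, months=MONTHS)
faster = monthly_feature_output(speed=2, maintenance_rate=0.02, months=MONTHS)
faster_and_cheaper = monthly_feature_output(speed=2, maintenance_rate=0.01, months=MONTHS)

# First month in which the team that only doubled its speed ships less than baseline.
crossover = next(m for m in range(MONTHS) if faster[m] < baseline[m])

print(f"2x speed, same maintenance: falls behind baseline in month {crossover}")
print(
    f"month 60 output: baseline={baseline[60]:.1f}, faster={faster[60]:.1f}, "
    f"faster_and_cheaper={faster_and_cheaper[60]:.1f}"
)
```

Under these assumptions the team that only doubled its speed ships less than the baseline from month 34 onward (counting from month 0), because it accumulated maintenance load twice as fast. The team that doubled speed and halved the per-unit maintenance cost stays at exactly twice the baseline output every month. The specific crossover month depends entirely on the invented rates; the shape of the curves is the part worth keeping.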

Worth reading:

Maintenance: Of Everything: Great book on "maintenance", from sail boats to motorcycles and where the real cost of AI-generated code still lies (book)
You Need AI That Reduces Maintenance Costs: The piece everyone is quoting, with the math (article)
AI Made Us Faster. That Was the Problem: An honest production experience report from the Microsoft Aspire team (article)
I'm going back to writing code by hand: Seven months of vibe-coding, archived. The honest postmortem (article)
Hey, N00b, We Didn't Hire You to Complete Tasks: Kent Beck on why you should spend some of the time you gain on learning and improving, not on more production (article)
9 Ways AI Coding Has Rewired My Brain: Matt Pocock's personal-practice companion (article)

Programming languages have stopped being lock-in

Mitchell Hashimoto watched Bun migrate from Zig to Rust in a week or two and saw the discontinuity. His quote, lightly edited: "Programming languages used to be lock-in, and they're increasingly not so. Rust is expendable. Useful until it's not, then it can be thrown out." This is confirmation of what last week's almost-uniform PL benchmark numbers were saying. The cost of being wrong about your language pick just collapsed.

Simon Willison heard the same thing at a conference. Someone's medium-sized tech company had just completed an agent-driven rewrite of both iPhone and Android apps to React Native. The reason they chose unification: agents drove the rewrite cost low enough that maintaining two separate native codebases stopped being cheaper. The economic argument for cross-platform frameworks just changed shape, fifteen years after the original pitch.

The companion piece is the Pragmatic Engineer's two-hour interview with Anders Hejlsberg, the designer who shipped Turbo Pascal, Delphi, C#, and TypeScript. The same week the canonical PL designer is publishing wisdom about ecosystem decades, his successors are watching language choices become weekly decisions. Both readings can be true. The one that is actionable for most teams is the second.

Worth reading:

Quoting Mitchell Hashimoto: The expendable-language quote that defined the week (article)
Not so locked in any more: The React Native rewrite as concrete evidence (article)
TypeScript, C# and Turbo Pascal with Anders Hejlsberg: The canonical counter-example, in his own words (podcast)
Deep Engineering #47: Why Experienced Developers Have the Hardest Time Learning Rust: The borrow checker as a design tool (and the only language abstraction that broke my brain) (article)

METR puts numbers on autonomy: 50-minute tasks today, doubling every 7 months

METR landed another headline metric. Frontier models now have a 50% time horizon of about 50 minutes on real software tasks, and that horizon has been doubling every seven months since 2019. The trend may have accelerated in 2024. Their own extrapolation: within five years, AI systems will be capable of automating many tasks that currently take humans a month. The discourse finally has a number to argue about instead of a vibe.

The number changes the planning question. If your roadmap is a year long, the capability available at the end of that year will not be the capability you planned against at the start, and this metric gives you a way to estimate the gap.

What the METR paper is careful about: the increase comes from greater reliability and better adaptation to mistakes, not from raw reasoning gains. Agents are going further by getting less wrong, not by getting smarter. That is a different lever for harness engineers to pull.
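The doubling claim turns into planning arithmetic quickly. The sketch below, in Python, is plain exponential extrapolation from the two figures quoted above (a 50-minute horizon and a seven-month doubling time). It assumes the trend simply continues, which is exactly what is up for debate, and it assumes a working month of about 167 hours; both the function names and that constant are mine.

```python
import math


def horizon_minutes(months_ahead, current_minutes=50.0, doubling_months=7.0):
    """Extrapolated 50% time horizon after `months_ahead` months."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(target_minutes, current_minutes=50.0, doubling_months=7.0):
    """Months until the extrapolated horizon reaches `target_minutes`."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Assumption: one working month is roughly 167 hours.
WORK_MONTH_MINUTES = 167 * 60

print(f"horizon after a 12-month roadmap: {horizon_minutes(12):.0f} minutes")
print(f"months until month-long tasks: {months_until(WORK_MONTH_MINUTES):.1f}")
```

With these inputs the horizon at the end of a one-year roadmap is about 164 minutes, a bit over three times today's, and the month-long-task mark arrives after roughly 53 to 54 months, which is consistent with the "within five years" extrapolation. Change the doubling time to see how sensitive the answer is.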

Worth reading:

Task-Completion Time Horizons of Frontier AI Models: The METR report, the number, the methodology (article)
Measuring AI Ability to Complete Long Software Tasks: The arxiv paper behind it. Worth reading the limitations section (article)
AI Will Hit a Wall in 2026, if nothing changes: The contrarian read landing the same week (video)
Revisiting "No Silver Bullets" in the age of AI: The forty-year historical lens on the same claim (article)

Anthropic's capacity squeeze turned hostile to its own users

Three independent vectors converged this week. Gergely Orosz's Pulse asked whether the "dumber" Claude model and the Claude Code lockouts traced back to capacity issues Anthropic could not admit publicly. The Verge reported Microsoft canceling Claude Code licenses for thousands of developers. Theo declared "I'm done" over the new Agents SDK terms: you can no longer ship downstream products against a Claude subscription.

The shape underneath the news is the same. The era of "unlimited compute behind a flat subscription" is closing. Armin Ronacher and Ben walked through enterprises clamping down on token spend and the beginning of the end of subsidies. The free experimentation phase that funded the last twelve months of agent tooling innovation is coming to a close, and the bill is going to land downstream.

For practitioners the operational change is concrete. If your team's workflow depends on a single vendor's subscription, model the case where access changes overnight. Multi-provider harnesses, local model fallbacks, and explicit cost tracking move from "nice to have" to "table stakes." The teams that built token-conscious and provider-agnostic harnesses are the ones that'll sleep a bit better this week.
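Here is roughly what "provider-agnostic with explicit cost tracking" can look like at its smallest. This is a sketch in Python under stated assumptions: Provider, Harness, and ProviderUnavailable are hypothetical names of mine, each provider is just a callable that returns the text and the tokens used, and the per-token prices are whatever you configure. No real vendor SDK is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ProviderUnavailable(Exception):
    """Raised by a provider adapter when access is denied or rate limited."""


@dataclass
class Provider:
    name: str
    # Hypothetical adapter: takes a prompt, returns (text, tokens_used).
    complete: Callable[[str], Tuple[str, int]]
    usd_per_1k_tokens: float


@dataclass
class Harness:
    providers: List[Provider]  # tried in order, first success wins
    spend_usd: Dict[str, float] = field(default_factory=dict)

    def complete(self, prompt: str) -> str:
        errors = []
        for provider in self.providers:
            try:
                text, tokens = provider.complete(prompt)
            except ProviderUnavailable as exc:
                # Remember why this provider was skipped, then fall through.
                errors.append(f"{provider.name}: {exc}")
                continue
            # Record spend per provider so the bill is visible, not a surprise.
            cost = tokens / 1000 * provider.usd_per_1k_tokens
            self.spend_usd[provider.name] = self.spend_usd.get(provider.name, 0.0) + cost
            return text
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Put the hosted model first and a local model last, and an overnight access change degrades quality instead of stopping work. The spend_usd dictionary is the start of the cost tracking; a real harness would also want budgets and alerts on top of it.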

Worth reading:

The Pulse: Did capacity shortages turn Anthropic hostile to devs?: The hypothesis that frames the rest of the story (article)
Microsoft starts canceling Claude Code licenses: Enterprise blast radius made visible (article)
I'm done.: Theo on the new Agents SDK fence (video)
State of Agentic Coding #6 with Armin and Ben: Twenty minutes on the end of the subsidy era (video)

HTML is the unreasonably effective output format for agent work

Thariq Shihipar from the Claude Code team posted the most quoted prompt suggestion of the year: "Help me review this PR by creating an HTML artifact that describes it." Simon Willison wrote it up. The argument is small, sharp, and immediately useful. Markdown is the lowest-common-denominator format the model was trained on. HTML carries more semantic weight per token and lets the agent build navigable, structured deliverables.

The practical change is one prompt edit. For PR reviews, architecture explorations, and any output you will read more than once, ask for an HTML artifact. The model produces something you can click through, search, and collapse, rather than a wall of text you scroll past once. Reading the output stops being the bottleneck. The same trick works for design docs, refactoring plans, and any artifact where structure carries meaning.

The broader lesson is worth keeping. Output format is part of the agent's contract, not an afterthought. The teams that get the most out of agents are the ones treating prompt design, output schema, and review surface as a single problem. HTML is the cheap experiment that proves the principle.
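If output format is part of the contract, the contract can be checked. A minimal sketch in Python using the standard library's html.parser: it counts the tags in an artifact and reports which required ones are missing. The required set (a title heading plus collapsible details/summary sections) is a hypothetical contract I picked for illustration; choose whatever structure your reviews actually need.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List


class TagCounter(HTMLParser):
    """Counts opening tags seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


# Hypothetical contract: minimum number of each tag the artifact must contain.
REQUIRED_TAGS = {"h1": 1, "details": 1, "summary": 1}


def missing_from_contract(html_text: str, required: Dict[str, int] = REQUIRED_TAGS) -> List[str]:
    """Return the required tags that appear fewer times than the contract asks for."""
    counter = TagCounter()
    counter.feed(html_text)
    return sorted(tag for tag, minimum in required.items() if counter.tags[tag] < minimum)
```

An empty list means the artifact meets the contract; anything else goes back to the agent as a request to restructure the output before a human reads it.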

Worth reading:

Using Claude Code: The Unreasonable Effectiveness of HTML: The Simon Willison writeup with the full prompt examples (article)
Thariq on X: The Unreasonable Effectiveness of HTML: The original thread from the Claude Code team (article)
Turns out, HTML is King: The video version for the lunch-break crowd (video)

Quick Hits

Interrogatory LLM: Martin Fowler on inverting the context problem: let the LLM interview you. /grill-me for boomers (article)
Fragments: May 14: Fowler's Chatham House notebook, including a behavioral clone of GNU Cobol in Rust (article)
How Claude Code works in large codebases: Anthropic's enterprise playbook for the tool (article)
Learning on the Shop floor: Tobi Lutke on Shopify's River agent operating entirely in public Slack channels (article)
AI Gateway production index: Vercel's seven-month look at production AI traffic across hundreds of models (article)
Why Agentic AI Fails: Infinite Loops, Planning Errors, and More: IBM's taxonomy of where the agent loop breaks (video)
How Uber Runs 60,000 AI Agent Tasks Per Week With MCP: Production numbers from a name-brand engineering org (video)
CI/CD Is Dead, Agents Need Continuous Compute: What pipelines look like when thousands of agents open PRs continuously (video)
The AI Soft Ban Assessment: The "AI friendly" employer that quietly stayed un-friendly. Pattern recognition for managers (article)