The week's three pieces kept arriving at the same place from different directions: generation is no longer the hard part. The WSJ reported Anthropic's reliability gap as an enterprise defection story, but the signal underneath it is that inference demand has compounded past the point where raw capability differentiates; Retool's CEO didn't leave over model quality. Anthropic's own alignment research then demonstrated the same structure internally: nine Claude instances can generate alignment research at $22/hour, and the production failure on Sonnet 4 revealed that evaluation infrastructure is now the binding constraint on that pipeline, not the generation itself. Davies lands the argument at the human layer, drawing on vigilance decrement research from autonomous vehicle monitoring to name the number organizations are structurally incentivized never to measure: how many AI outputs a person can verify per day before their judgment quietly degrades. Across all three pieces, the constraint isn't what's being produced; it's the layer that checks whether what was produced can be trusted. That's the shift the week traced: the intelligence layer is decoupling from the execution layer, and the value is moving toward whoever can make verification legible and scalable. The organizations still optimizing for generation throughput are running the wrong race.
The 3 reads that mattered most
Retool's CEO switched from Anthropic to OpenAI this quarter, and the reason wasn't a benchmark: it was 98.95% uptime versus the alternative. Enterprise AI competition has shifted from capability to reliability, the same transition cloud infrastructure went through in 2010. The Anthropic paper this week shows the same pattern one layer up: automated alignment research can generate at $22/hour, but generation without stable evaluation infrastructure is just faster reward-hacking. Davies' vigilance decrement argument lands it at the human layer: even if the infrastructure holds, the person reviewing outputs degrades before the system does. Whoever solves five-nines for the full stack (model plus evaluation plus human judgment) owns the enterprise market regardless of whose Elo score leads.
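The distance between 98.95% and five nines is easy to underrate, so here is the back-of-envelope conversion. The 98.95% figure is from the Retool story and 99.999% is the standard "five nines" target; the rest is unit arithmetic, nothing more.

```python
# What an uptime percentage means in allowed downtime per 30-day month.
# 98.95% is the figure from the Retool story; 99.999% ("five nines") is
# the standard availability target. Everything else is arithmetic.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def downtime_minutes(uptime_pct: float) -> float:
    """Allowed downtime, in minutes per 30-day month, at a given uptime %."""
    return MINUTES_PER_MONTH * (1 - uptime_pct / 100)

for label, pct in [("98.95%", 98.95), ("five nines", 99.999)]:
    mins = downtime_minutes(pct)
    print(f"{label}: {mins:.1f} min/month (~{mins / 60:.2f} h)")

# 98.95%     -> ~453.6 min/month, roughly 7.5 hours of outage
# five nines -> ~0.43 min/month, roughly 26 seconds
```

That's the gap between an outage enterprise customers schedule around and one they never notice.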
Nine autonomous Claude instances achieved PGR 0.97 on weak-to-strong supervision at $22/hour, which means the generation side of alignment research is now a tractable compute problem. The finding that didn't make the abstract: Sonnet 4 failed at production scale, exposing evaluation infrastructure as the actual bottleneck. The WSJ piece this week traced the same structure in inference markets: Blackwell GPUs up 48% in two months, yet the scarcity isn't GPU cycles but reliable delivery of those cycles under enterprise load. Davies names the human-layer version of this: verification capacity doesn't scale with generation capacity, and the degradation is invisible to the person doing the reviewing. Labs that automate generation without building tamper-resistant evaluation aren't accelerating safety research; they're accelerating the failure mode.
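For readers outside the weak-to-strong literature: PGR (performance gap recovered) measures how much of the gap between a weak supervisor and the strong model's ceiling the weakly supervised model closes. A minimal sketch of the metric follows; the accuracies in it are hypothetical, chosen only to land at 0.97.

```python
# Performance gap recovered (PGR), as defined in the weak-to-strong
# generalization literature: the fraction of the gap between a weak
# supervisor and the strong model's ceiling that the weakly supervised
# strong model closes. The example accuracies below are hypothetical.

def pgr(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """(weak_to_strong - weak) / (strong_ceiling - weak)."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# E.g. a weak supervisor at 0.60 accuracy, a strong-model ceiling of 0.90,
# and a weakly supervised model scoring 0.891 recovers 97% of the gap:
print(f"PGR = {pgr(weak=0.60, weak_to_strong=0.891, strong_ceiling=0.90):.2f}")  # 0.97
```

A PGR near 1.0 is exactly why generation looks solved; the metric says nothing about whether the evaluation that produced it can be gamed.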
Dan Davies asks how many words of AI output a manager can actually verify per day before judgment silently degrades, and the honest answer is that almost no organization has tried to find out. The self-driving car literature documented this vigilance decrement precisely; the same cognitive dynamic applies to anyone reviewing model outputs at volume, and unlike physical fatigue it's invisible to the person experiencing it. The Anthropic alignment paper this week hit the same wall at the research level: automated generation scaled, evaluation didn't, and the production failure on Sonnet 4 is the visible edge of that gap. The WSJ piece shows what it looks like at the infrastructure level: reliability became the competitive moat the moment generation capacity exceeded the enterprise's ability to trust it. Organizations are measuring tokens per second and cost per query; the number that will actually constrain their AI leverage is one nobody is tracking.
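The untracked number Davies points at is simple to sketch, which is part of his point: the obstacle is incentive, not difficulty. Every parameter below is an assumption invented for illustration; none comes from the piece or the vigilance-decrement literature.

```python
# The calculation Davies says organizations never run: effective words of
# AI output one reviewer can verify per day. All parameters are assumptions
# for illustration only.

REVIEW_WPM = 200           # assumed reading speed for careful review
VIGILANT_MINUTES = 150     # assumed minutes of full attention per day (~2.5 h)
DEGRADED_MINUTES = 330     # assumed review minutes after vigilance decays
DEGRADED_CATCH_RATE = 0.5  # assumed fraction of errors still caught once degraded

nominal = REVIEW_WPM * (VIGILANT_MINUTES + DEGRADED_MINUTES)
effective = REVIEW_WPM * (VIGILANT_MINUTES + DEGRADED_MINUTES * DEGRADED_CATCH_RATE)

print(f"Nominal review throughput:  {nominal:,} words/day")         # 96,000
print(f"Vigilance-adjusted figure:  {int(effective):,} words/day")  # 63,000
```

The dashboard sees the first number; the organization's real exposure is set by the second, and nothing in the first warns you when the gap widens.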