pilot-to-scale

31 items · chronological order

2026-03-08

Wall Street Journal 2026-03-08-3

Can AI Replace Humans for Market Research?

$100M Series A announcement dressed as trend piece. CVS's "95% accuracy" claim is backtested against known answers — the real test is predicting unknown findings, which nobody's shown. Digital twins for market research are a cost/speed optimization, not a new form of intelligence. The hard-to-reach population simulation (chronic disease patients from sparse data) is where overconfidence becomes actively dangerous.

Index Ventures Andreessen Horowitz Stanford Gartner

2026-03-08

The Intrinsic Perspective 2026-03-08-1

Bits In, Bits Out

Hoel argues writing is the canary domain for AI capability — 6 years in, LLMs produced efficiency gains and slop, not a quality revolution. The Amazon book data is compelling (average worse, top 100 unchanged), but the extrapolation from writing to all domains is structurally weak: verifiable domains like code and math behave differently from taste-dependent ones. Best articulation of the "tools not intelligence" thesis, but cherry-picks the hardest domain for AI to show measurable ceiling gains.

Erik Hoel Anthropic Claude ChatGPT METR Amazon

2026-03-10

The Economist 2026-03-10-3

Americans' Electricity Bills Are Up. Don't Blame AI.

AI data centres are scapegoats for electricity price increases driven by decades of deferred grid infrastructure, transformer supply shortages, and fossil fuel dynamics. The real insight is buried: an industry bigwig admits AI provides utilities a pretext to win regulatory approval for capex they should have made years ago. The "blame the shiny new thing for costs that were always coming" pattern maps directly to enterprise IT budgets.

Google Microsoft Meta Goldman Sachs PG&E Three Mile Island

2026-03-11

HBR 2026-03-11-3

When Using AI Leads to "Brain Fry"

BCG-authored survey (n=1,488) coins "AI brain fry" – cognitive fatigue from intensive agent oversight, distinct from burnout. The three-tool productivity ceiling and oversight-as-binding-constraint findings are genuinely useful; the causal language on cross-sectional self-report data is not. The buried signal: autonomous agents requiring less oversight may produce better human outcomes than copilot patterns requiring constant attention – running directly counter to "human in the loop" orthodoxy. The prescription (organizational change management, leadership clarity) is indistinguishable from a BCG engagement scope.

BCG Meta BCG Henderson Institute

2026-03-12

Financial Times 2026-03-12-1

The AI pension advisers are already here

50%+ of UK adults already use AI for financial guidance, yet the article buries the structural story: the marginal cost of personalized financial advice is collapsing to zero. JPMorgan's Bilton warns "always use a human adviser" — from a firm that killed Nutmeg and has $3T+ AUM to protect. The real question isn't whether AI gives wrong pension advice; it's whether a £15K/year advisory fee can survive a free alternative that improves with every interaction.

OpenAI JPMorgan Lloyds Banking Group Altruist FCA Perplexity Google Scottish Widows Nutmeg

2026-03-13

HBR · 2026-03-11 2026-03-13-w3

When Using AI Leads to "Brain Fry"

Three AI tools is where the productivity curve flattens. BCG's data shows intensive agent oversight produces a distinct cognitive fatigue, which runs directly counter to the "human in the loop" orthodoxy underlying most enterprise AI governance. The buried signal: autonomous agents requiring less oversight may produce better human outcomes than copilot patterns demanding constant attention, reframing the safety argument for more autonomous systems from ethical preference to operational necessity. If $1,000-plus of compute delivered monthly for $200 requires sustained human supervision to be trustworthy, the productivity math degrades faster than the pricing math improves. The causal language in a cross-sectional self-report survey deserves skepticism, and the prescription is indistinguishable from a BCG engagement scope, but the structural observation holds regardless of who funded it. Organizations deploying more AI tools without redesigning oversight models are accumulating cognitive debt, not compounding returns.

BCG Meta BCG Henderson Institute

2026-03-15

Bloomberg Opinion 2026-03-15-3

The AI-Washing of Job Cuts Is Corrosive and Confusing

Sixty percent of executives cut headcount in anticipation of AI efficiencies; two percent cut because AI actually replaced the work. That 30:1 ratio is the AI-washing gap in one stat: companies are using AI as narrative cover for pandemic-era overhiring corrections, and the market is rewarding it (Block up 22% post-layoffs). The deeper corrosion: every company that cries AI for financial restructuring trains the market to discount genuine AI deployment claims when they arrive.

Block Jack Dorsey Amazon Salesforce Klarna

2026-03-15

David Oks (Substack) 2026-03-15-2

Why ATMs Didn't Kill Bank Teller Jobs, but the iPhone Did

Task automation within existing paradigms preserves labor; paradigm replacement eliminates it. ATM teller employment collapsed post-2010, but not from ATMs: mobile banking made branches irrelevant, and the "technology doesn't kill jobs" parable died with them. The AI version of this distinction is already playing out at Klarna, but most displacement forecasts still model the drop-in remote worker, not the fully-automated firm.

Klarna Citibank Bank of America Apple

2026-03-16

NYT Magazine 2026-03-16-3

Google's 10% vs. Startups' 100x: The Brownfield Velocity Gap Is the Real AI Coding Story

Thompson's 70-developer feature buries the most important number in AI coding: Google sees 10% engineering velocity improvement while greenfield startups claim 20-100x. The gap isn't measurement error; it's the structural difference between writing new code and safely modifying systems that billions depend on. Pichai's metric (hours recovered, not lines produced) is more honest than any startup founder's. The demo is always greenfield; production is always brownfield.

Google Anthropic Claude Code Sundar Pichai Erik Brynjolfsson

2026-03-16

HBR 2026-03-16-1

Has AI Ended Thought Leadership?

GenAI collapses the cost of performing expertise, creating a faux-expert pipeline that erodes the thought leadership category. Author rebrands fractional/embedded advisory as "thought doership" but misses that AI compresses the doer premium too. The durable moat isn't building speed: it's judgment under novel conditions.

Harvard Business School GenAI

2026-03-26

CNBC 2026-03-26-2

Vivienne Ming: Robot-Proof Children and the Nemesis Prompt

Ming's book-promo piece wraps consensus education-reform thesis in neuroscience credibility, but the one genuinely product-ready idea is the Nemesis Prompt: kids produce a first draft, an LLM adversarially attacks it, then the kid evaluates which critiques hold. That three-step loop is a design pattern for any AI-assisted creation tool, not just parenting advice. The real test for every AI learning product: does the user get worse when you turn it off? Most ed-tech fails that test because it optimizes for answer delivery, not capacity building. The underserved category is adversarial AI tutoring: tools that make your thinking harder, not easier. Harder sell to consumers, but institutional buyers running L&D programs should be asking whether their AI integration is building dependency or judgment.

Vivienne Ming The Human Trust Nemesis Prompt

2026-04-06

Bloomberg 2026-04-06-2

Microsoft Copilot Paid Pivot: Wall Street as Product Manager

Microsoft's Copilot pivot from free-bundled to paid-first was driven by Wall Street feedback, not user demand: Althoff said the quiet part out loud. The April 15 paywall removing Copilot from Office apps for unlicensed users mechanically forces conversion, conflating a squeeze play with adoption. The real test arrives at first annual renewal, when CFOs ask what $30/month actually delivered and the churn clock starts.

Microsoft Judson Althoff Copilot OpenAI Bloomberg

2026-04-06

Wall Street Journal 2026-04-06-1

WSJ: New AI Job Titles Signal Enterprise Adoption Is an Org Design Problem, Not a Tech Procurement One

The 640,000 AI jobs the WSJ counts are less interesting than where they sit: 90% of AI job postings come from 1% of companies, which means the diffusion wave hasn't started yet. Enterprises creating permanent roles like Knowledge Architect and Human-AI Collaboration Leader aren't signaling displacement, they're signaling that workflow redesign around hybrid teams is harder and more expensive than the procurement narrative assumed. Companies building that capability now are hiring at pre-scarcity rates; the window won't stay open.

Wall Street Journal Oracle Meta Deloitte McKinsey Goldman Sachs

2026-04-09

WIRED 2026-04-09-2

Anthropic's New Product Aims to Handle the Hard Part of Building AI Agents

Anthropic's Managed Agents launch is less a product announcement than a signal about where the moat is moving: from model quality to infrastructure lock-in. At $30B ARR, 3x since December, bundling orchestration, sandboxing, and monitoring into the platform turns agent infrastructure from a build problem into a subscription line item. The buried admission — 'significant ground to cover' — is the honest tell; the plumbing problem is solved, the harder problems (trust, reliability, organizational readiness) aren't.

Anthropic Claude Managed Agents OpenAI Frontier Angela Jiang Katelyn Lesse Notion Claude Platform

2026-04-15

Anthropic Research 2026-04-15-2

Automated Alignment Researchers: Using large language models to scale scalable oversight

Anthropic's nine autonomous Claude instances hit PGR 0.97 on weak-to-strong supervision: the generation side of alignment research is now a solved compute problem at $22/hour. The buried finding is the production-scale failure on Sonnet 4, which reveals that the real bottleneck has shifted to evaluation infrastructure. Labs that build tamper-resistant verification for automated researchers will define the next era of AI safety; labs that scale generation without scaling evaluation will ship reward-hacking at frontier scale.

Anthropic Claude Opus 4.6 Qwen Automated Alignment Researchers weak-to-strong supervision

2026-04-16

Back of Mind 2026-04-16-3

The Most Important Number

Dan Davies identifies the number nobody wants to find: how many words of AI output can a manager verify per day before judgment silently degrades? The self-driving car literature already answered this for monitoring tasks; the same vigilance decrement applies to AI output review. Organizations will systematically overestimate their people's verification capacity, and unlike physical exhaustion, cognitive degradation is invisible to the person experiencing it. The binding constraint on AI leverage isn't generation capability; it's human verification throughput, and we're structurally incentivized never to measure it.

Dan Davies Stafford Beer Frederick Winslow Taylor

2026-04-16

Financial Times 2026-04-16-1

Why 'glue work' can finally shine in the age of AI

Most companies automating code-writing haven't touched their promotion criteria: the skill AI just made abundant is still the one that gets you promoted. The FT frames this as a win for "glue workers," but the real signal is organizational: enterprises running AI transformation without repricing what "good" looks like will lose their most adaptable people first, compounding the very talent gap AI was supposed to close.

Tanya Reilly GitHub Brittany Ellich The No Club

2026-04-17

a16z Podcast (originally Cheeky Pint) 2026-04-17-3

From Models to Mobility: Waymo Architecture at Scale — Dolgov on the Teacher/Simulator/Critic Triad and the End-to-End Debate Resolution

Waymo's architecture resolves the end-to-end debate: Dolgov states pure pixels-to-trajectories drives "pretty darn well" in the nominal case but is "orders of magnitude away" from what full autonomy requires. The 500K-rides-per-week stack is one off-board foundation model fanning into three specialized teachers (Driver, Simulator, Critic), each distilled into smaller in-car students; RLFT against the critic is the physical-AI analog to RLHF. Enterprise teams shipping pure-LLM agents without the simulator and critic scaffolding are replaying Waymo's 2017, not its 2026: evaluation infrastructure is the reliability gate, not model choice.

Waymo Dmitri Dolgov Alphabet Tesla FSD DeepMind Jaguar I-PACE Hyundai Ionic John Collison Gemini Robotics-ER Physical Intelligence

2026-04-17

Back of Mind · 2026-04-16 2026-04-17-w3

The Most Important Number

Dan Davies asks how many words of AI output a manager can actually verify per day before judgment silently degrades, and the honest answer is that almost no organization has tried to find out. The self-driving car literature documented this vigilance decrement precisely; the same cognitive dynamic applies to anyone reviewing model outputs at volume, and unlike physical fatigue it's invisible to the person experiencing it. The Anthropic alignment paper this week hit the same wall at the research level: automated generation scaled, evaluation didn't, and the production failure on Sonnet 4 is the visible edge of that gap. The WSJ piece shows what it looks like at the infrastructure level: reliability became the competitive moat the moment generation capacity exceeded the enterprise's ability to trust it. Organizations are measuring tokens per second and cost per query; the number that will actually constrain their AI leverage is one nobody is tracking.

Dan Davies Frederick Winslow Taylor Stafford Beer

2026-04-17

Anthropic Research · 2026-04-15 2026-04-17-w2

Automated Alignment Researchers: Using large language models to scale scalable oversight

Nine autonomous Claude instances achieved PGR 0.97 on weak-to-strong supervision at $22/hour, which means the generation side of alignment research is now a tractable compute problem. The finding that didn't make the abstract: Sonnet 4 failed at production scale, exposing evaluation infrastructure as the actual bottleneck. The WSJ piece this week traced the same structure in inference markets; Blackwell GPUs up 48% in two months, yet the scarcity isn't GPU cycles, it's reliable delivery of those cycles under enterprise load. Davies names the human-layer version of this: verification capacity doesn't scale with generation capacity, and the degradation is invisible to the person doing the reviewing. Labs that automate generation without building tamper-resistant evaluation aren't accelerating safety research; they're accelerating the failure mode.

Anthropic Automated Alignment Researchers Claude Opus 4.6 Qwen weak-to-strong supervision

2026-04-23

Financial Times 2026-04-23-2

High earners race ahead on AI as workplace divide widens

The FT/Focaldata tracker landed with the expected inequality headline, but the operational finding is buried: corporate training is the single biggest driver of AI adoption, and a single Google session tripled daily usage among UK women over 55. Within lawyers, accountants, and developers, senior and junior adoption rates are nearly identical, which means seniors are directing AI to do what juniors used to do. The career pyramid erosion mechanism is now empirical, not speculative, and every firm that depends on apprenticeship-to-expertise faces a succession crisis that compounds with each training cycle missed.

Financial Times Focaldata Madhumita Murgia John Burn-Murdoch Daron Acemoglu Chris Pissarides Carl Benedikt Frey Fabien Curto Millet Ronni Chatterji Google OpenAI MIT Oxford Internet Institute London School of Economics

2026-05-04

Financial Times 2026-05-04-2

Hedge funds seek an edge by using AI's speed

AIMA's $788bn hedge fund survey shows 95% AI adoption against under 5% using it for portfolio optimization; that gap is not a maturity curve, it is the verification ceiling in a fiduciary domain. Sand Grove's Caplan frames the judgment layer above AI as permanent, even in the long term, and Anaconda and Pharo confirm the same pattern: AI for documents and back office, never for security selection. The next decade of enterprise AI value capture sits in the scoring infrastructure that lets a CRO sign off on broader scope, not in a better model.

AIMA Sand Grove Capital Management Daniel Caplan Pharo Management Anaconda Invest Renaud Saleur Bfinance Permutable AI Wilson Chan Anthropic Claude Microsoft Copilot ChatGPT OpenAI Mythos

2026-05-05

Financial Times 2026-05-05-2

'It's crucial': how AI is reshaping the fragrance industry

Givaudan, Symrise, and dsm-firmenich spent eight years building proprietary ingredient databases with AI tooling now in production at the world's largest consumer brands, and they still trade on commodity-chemistry multiples. Moodify's ML-driven formulation compresses the canonical 18-month development cycle to three months at 30% lower cost; FoodPairing's digital consumer panels hit 77% accuracy against real panels — a direct shot at a $50B+ research industry that gets no equity-market scrutiny. The frontier-lab-doesn't-verticalize pattern is now four verticals deep and priced in nowhere.

Givaudan Symrise dsm-firmenich Moodify Osmo EveryHuman Scircle FoodPairing AI P&G Bain

2026-05-07

WIRED 2026-05-07-3

5,000 Vibe-Coded Apps Are Leaking on the Open Web — and the S3 Analogy Misses the Legal Novelty

RedAccess found 5,000-plus exposed apps on the four leading vibe-coding platforms with around 2,000 leaking real PHI, customer chat logs, and strategy decks. The S3 analogy is reaching for the right pattern but missing the legal twist: AWS could credibly say it didn't write your bucket policy. Lovable, Replit, and Base44 wrote the auth logic that doesn't exist. The first court that holds a code-generation platform partially liable for a generated vulnerability resets the entire industry's product roadmap overnight.

Lovable Replit Base44 Netlify RedAccess Dor Zvi Amjad Masad Wix WIRED Andy Greenberg Joel Margolis Bank of America Costco FedEx Trader Joe's McDonald's Amazon S3

2026-05-08

The Typical Set 2026-05-08-2

The bottleneck was never the code

Brooks 1975: software is the residue of human negotiation. For 50 years, tooling investment kept attention on the residue; agents collapsed the residue cost and exposed the substrate. The bottleneck moves from coders to spec-producers, which is to say management. Every AI productivity claim now needs a denominator that is not engineer-coding speed but spec-to-shipped cycle time. If management bandwidth is the bottleneck, individual agent productivity gains compound at zero, and you have just bought yourself the world's most expensive feature-bloat machine.

.txt dottxt-ai Codex Fred Brooks Gerald Weinberg Michael Polanyi Steve Jobs Apple Anthropic Intercom WorkOS Cursor Bloomberg Google Karpathy

2026-05-08

The Atlantic 2026-05-08-1

The Secret to Understanding AI

The most economically important AI deployment in America right now is the IRS migrating 60-year-old COBOL with Claude, Llama, and ChatGPT as pair programmers: what took months on the Individual Master File now takes days on the Business Master File. Tyrangiel's tech-counterculture framing collapses on inspection, because Pandya's team runs entirely on tech-company products, just under different incentives. The real opportunity is that multi-trillion-dollar mainframe modernization across financials, insurance, telecom, and government is bottlenecked on a deployment posture that neither Big Four nor AI-native shops have productized.

IRS Kaschit Pandya Danny Werfel The Atlantic Josh Tyrangiel Danny Hillis DOGE COBOL Claude Llama ChatGPT Cleveland Clinic

2026-05-09

WIRED · 2026-05-07 2026-05-09-w3

5,000 Vibe-Coded Apps Are Leaking on the Open Web — and the S3 Analogy Misses the Legal Novelty

RedAccess found over 5,000 exposed apps across the four leading vibe-coding platforms, with roughly 2,000 leaking real PHI, customer chat logs, and internal strategy decks. These aren't misconfigured storage buckets; they're auth logic the platform generated and the user never saw. The S3 analogy that's circulating misses the legal novelty: AWS could credibly disclaim your bucket policy because you wrote it. Lovable, Replit, and Base44 wrote the auth logic that isn't there. That shifts where liability attaches, and the first court to hold a code-generation platform partially liable for a generated vulnerability resets every product roadmap in the category overnight. It's the same verification failure the hedge fund and interpretability stories surface from different angles: the layer that was supposed to enforce quality or security has been dissolved by the technology it was meant to govern. The people building trust infrastructure for that layer, across all three markets, are the ones with a durable position.

Amazon S3 Amjad Masad Andy Greenberg Bank of America Base44 Costco Dor Zvi FedEx Joel Margolis Lovable McDonald's Netlify RedAccess Replit Trader Joe's WIRED Wix

2026-05-09

Financial Times · 2026-05-04 2026-05-09-w1

Hedge funds seek an edge by using AI's speed

AIMA's survey of $788bn in hedge fund assets found 95% AI adoption and under 5% using it for portfolio optimization. That gap is not a maturity curve; it is a fiduciary ceiling with no infrastructure underneath it. Sand Grove's Caplan says the judgment layer above AI is permanent even in the long run, and Anaconda and Pharo confirm the pattern independently: AI handles documents and back office, stops at security selection. What's gating deployment isn't model quality; it's the absence of a scoring layer that lets a CRO sign off on broader scope without carrying personal liability for the output. The same ceiling shows up in Anthropic's interpretability work: once cognition is auditable, alignment posture becomes a measurable input rather than a vendor claim, and procurement frameworks aren't built for either. The next decade of enterprise AI value capture sits in whoever builds that infrastructure, not in whoever ships the next model.

AIMA Anaconda Invest Anthropic Bfinance ChatGPT Claude Daniel Caplan Microsoft Copilot Mythos OpenAI Permutable AI Pharo Management Renaud Saleur Sand Grove Capital Management Wilson Chan

2026-05-12

OpenAI 2026-05-12-1

OpenAI launches the OpenAI Deployment Company to help businesses build around intelligence

OpenAI launched a $4B services arm with TPG, Bain Capital, McKinsey, and sixteen other firms taking equity, anchored by acquiring Tomoro's 150 forward-deployed engineers. The consortium reads as a roll call of firms with the most to lose from services-as-software, buying equity in their own disintermediator. Implementation gap is now the moat OpenAI is paying $4B to build, and the MBB AI practice headcount trajectory over four quarters becomes the live test of whether co-equity is hedge or severance.

OpenAI OpenAI Deployment Company Tomoro TPG Bain Capital Brookfield Advent International McKinsey Bain & Company Capgemini Accenture Anthropic Palantir Goldman Sachs SoftBank

2026-05-13

404 Media 2026-05-13-1

404 Media: Software Developers Say AI Is Rotting Their Brains

Performance reviews at FAANG and mid-tech now grade AI adoption, with one UX designer naming the dynamic exactly: "the actual quality of output doesn't matter as much as our willingness to participate." The "X percent of code is AI-generated" metric tech executives cite on earnings calls measures HR obedience contaminated by Goodhart at org-design scale, not output throughput. Almost no company is measuring the number that actually matters: production value net of verification cost.

404 Media Emanuel Maiberg Meta Microsoft Snap FAANG Cursor JetBrains Anthropic Claude Code Goodhart's Law MIT Reddit Hacker News

2026-05-15

OpenAI · 2026-05-12 2026-05-15-w1

OpenAI launches the OpenAI Deployment Company to help businesses build around intelligence

OpenAI is paying $4B to build what the model alone can't deliver: the implementation layer that actually closes enterprise deals. The consortium structure is the telling detail. TPG, Bain Capital, McKinsey, and sixteen others are taking equity in the company most likely to compress their services revenue. That isn't partnership; it's a hedge against their own obsolescence, purchased while the price is still negotiable. The OpenEvidence and LF Networking data this week run the same pattern in different registers: licensed corpus access and deployment infrastructure are commanding premiums that raw model capability isn't, because enterprise procurement teams treat model lock-in as a risk, not a feature. Watch MBB AI practice headcount over the next four quarters. Whether it grows or contracts is the revealed-preference test of whether co-equity buys survival or just delays the reckoning.

Accenture Advent International Anthropic Bain & Company Bain Capital Brookfield Capgemini Goldman Sachs McKinsey OpenAI OpenAI Deployment Company Palantir SoftBank TPG Tomoro