MIT CSAIL

5 items

The Atlantic · 2026-03-31 2026-04-03-w3

How AI Is Creeping Into The New York Times

Five detection tools scored the same New York Times column between 0% and 60% AI-generated, which means the forensics produce more variance than the underlying question has resolution. The sharpest detail isn't the spread. It's that OpenAI built a watermarking tool accurate to 99.9% and shelved it because users would leave, which is a clean statement of where the incentives actually point. That calculus connects directly to what ICONIQ found in GTM: the accountability moment in software is shifting from contract signature to renewal, and every quarter a customer reconsiders is a quarter the provenance of the output they're paying for could matter. Private credit funds are classifying Inovalon as IT Services while Inovalon's own website says software company; institutions are trying to detect AI-written content with tools that disagree by 60 points. When the measurement layer is this unreliable, the risk isn't any single exposure; it's that the systems designed to flag concentration and authenticity are lagging the thing they're supposed to track.

tisram.ai 2026-03-31-m2

Scarcity Is Now a Product Decision

Commoditization theory predicted a race to the bottom; the Ramp data showed a race to the top. Anthropic's 70% first-time win rate against OpenAI, in a market where the cheaper option is abundant and the pricier option is supply-constrained, is the month's most structurally interesting data point. The MIT CSAIL finding that compute efficiency varies 40x within individual labs does more than complicate the scaling moat thesis: it suggests supply constraint at the frontier isn't purely a capacity planning accident. It may be baked into how frontier models get produced at all. Morningstar's 37 downgrades versus two upgrades landed the same week, and the ratio encodes the same logic: AI compresses output costs at the application layer and reconstitutes scarcity one layer down, in infrastructure that handles verification, security, and network complexity. What runs through all three weeks is a consistent falsification test the market hasn't fully priced: if Anthropic's growth sustains when GPU supply eases, the moat is product; if it collapses, scarcity was doing the work. That distinction matters for every enterprise vendor currently repricing around AI features. Every improvement AI delivers to a product is reproducible by the next vendor in six months. Defensibility lives below the application layer now.

tisram.ai 2026-03-31-m3

Evaluation Is the Layer Nobody Built

A $25 pipeline producing publishable economic theory and 700 experiments running in two days look like productivity stories. They're actually stress tests for organizations that still measure AI value by what gets generated rather than what gets used. The legibility piece named the terminal form of this problem: AI-for-science will produce discoveries faster than labs, regulators, and clinical infrastructure can absorb them, and the bottleneck was never generation. That dynamic was already visible in week one, where the BCG data showed cognitive load spiking as oversight demands increased. The human-in-the-loop model assumes a human with enough bandwidth to loop, and that assumption is failing in practice. The tokenmaxxing story closes the arc: when consumption volume becomes the proxy for productivity, every measurement framework in the organization is optimized for the wrong thing. What all three weeks surface, read together, is that the generation layer is effectively solved and the evaluation layer (scoring architecture, provenance infrastructure, translation tooling between machine output and institutional deployment) is where the next competitive advantage will be built. The companies that treat evaluation as an engineering problem now, rather than a governance afterthought, will hold a position in 18 months that no amount of inference spend can replicate.

MIT CSAIL · 2026-03-19 2026-03-20-w1

MIT CSAIL: 80-90% of Frontier AI Performance Is Just Compute

The week's most clarifying number wasn't a revenue figure or a benchmark score: it was 40x, the compute efficiency variance MIT CSAIL found within individual labs producing frontier models, meaning a single developer can't reliably reproduce its own results even when it controls the spending. That internal inconsistency quietly dissolves the moat thesis from both directions: if the frontier is a spending race and the spending doesn't produce consistent outcomes, neither scale nor safety restrictions reliably compound into durable advantage. That framing lands harder alongside Ramp's transaction data, where the more expensive, supply-constrained product is growing fastest precisely because product differentiation has become so hard to verify that buyers are using price as a trust proxy. And it reframes the Morningstar moat downgrades: if 37 application-layer moats narrowed because AI compresses the cost of performing expertise, the labs producing the underlying models face the same compression one layer down. Pre-training scale is now a commodity floor, not a ceiling; the differentiation that actually moves enterprise purchasing decisions has migrated to post-training alignment and inference-time compute, layers that don't appear in any scaling regression.

MIT CSAIL 2026-03-19-3

MIT CSAIL: 80-90% of Frontier AI Performance Is Just Compute

The study's headline finding confirms what everyone suspects: scale drives frontier performance. The buried finding inverts it: individual labs produce models with 40x compute efficiency variance, meaning they can't reliably reproduce their own results. If the frontier is a spending race and the spending doesn't produce consistent outcomes, the moat thesis weakens from both directions. The entire analysis is also blind to where differentiation actually moved: post-training alignment, tool use, and inference-time compute are now the layers where product quality diverges, and none of them show up in a pre-training scaling regression.