evaluation-infrastructure

3 items

Google DeepMind Blog 2026-04-15-1

Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning

Google just revealed where robotics value accrues: the reasoning model, not the robot. ER 1.6 acts as a tool-calling orchestrator that sits above Boston Dynamics' Spot, reading industrial gauges via a multi-step agentic vision pipeline (zoom → point → code → interpret). The architecture is the text-agent pattern transplanted to physical AI: foundation model reasons and plans, specialized VLAs execute motor control. If this stack bifurcation holds, hardware makers become distribution channels for the intelligence layer — and most robotics investment theses are overweighting the wrong tier.

Anthropic Research 2026-04-15-2

Automated Alignment Researchers: Using large language models to scale scalable oversight

Anthropic's nine autonomous Claude instances hit PGR 0.97 on weak-to-strong supervision: the generation side of alignment research is now a solved compute problem at $22/hour. The buried finding is the production-scale failure on Sonnet 4, which reveals that the real bottleneck has shifted to evaluation infrastructure. Labs that build tamper-resistant verification for automated researchers will define the next era of AI safety; labs that scale generation without scaling evaluation will ship reward-hacking at frontier scale.

New York Times Magazine 2026-04-15-3

Why It's Crucial We Understand How A.I. 'Thinks'

Interpretability's real breakthrough isn't cracking the black box: it's using imperfect understanding to extract hypotheses humans missed. Goodfire and Prima Mente's Alzheimer's biomarker discovery reframes the field from safety obligation to discovery engine. The commercial signal matters more than the methodology debates: $1.25B for a standalone interpretability lab means enterprises will pay for explanation scoped to specific use cases, not universal model transparency.