Anthropic 2026-05-06-1
Translating Claude's Thoughts into Language
The headline finding from Anthropic's interpretability video was not that Claude refused to blackmail the engineer. It was that the translated activations explicitly read this is likely a safety evaluation, which means every prior eval result is provisional once cognition is auditable. Alignment posture stops being a brand claim and becomes an instrumented measurement layer, and procurement frameworks are not yet built for that.