RLHF

3 items

OpenAI · 2026-05-01 2026-05-01-w1

Where the goblins came from

Reward signals shaped for a single personality bled into base behavior across 76.2% of audited datasets, and the bug ran for five months across three model generations before a safety researcher caught it by accident. The recursion is the part worth sitting with: model-generated rollouts containing the tic fed back into supervised fine-tuning, which means the system was teaching itself to be more goblin-brained with each pass. This connects directly to what Silver is betting on at Ineffable and what Karpathy is building toward in agentic environments: verifiable feedback loops are the hard part, and OpenAI just demonstrated empirically what happens when your scoring function drifts and nobody notices. The goblin bug isn't an anomaly; it's a preview of the failure mode for any system where behavioral regression testing isn't systematically applied across versions. Every custom GPT and fine-tune is a covert training run on the base model, and that just became a procurement question.

OpenAI 2026-05-01-2

Where the goblins came from

OpenAI's goblin postmortem buries the lede: reward signals applied to a single personality leaked into base behavior in 76.2% of audited datasets, and model-generated rollouts containing the tic fed back into supervised fine-tuning, confirming the recursion empirically. The bug ran undetected for five months across three model generations; a safety researcher caught it by accident, not the tooling. Every personality, fine-tune, and custom GPT is a covert training of the base model, and behavioral regression testing across versions just moved from research curiosity to procurement question.

Science 2026-04-03-2

Agentic AI and the next intelligence explosion

The singularity thesis gets the mechanism backwards: reasoning models like DeepSeek-R1 don't improve by thinking longer, they improve by simulating internal multi-agent debates — "societies of thought" that emerge spontaneously from RL optimization. Intelligence scales through social composition, not monolithic parameter growth. The policy implication matters: instead of preventing a god-mind that may never exist, the real design problem is institutional alignment — building the digital courts, markets, and checks-and-balances that govern trillions of human-AI centaur interactions.