ai-for-science

Google DeepMind · 2026-05-20 2026-05-22-w1

DeepMind Co-Scientist: A multi-agent AI partner to accelerate research

The detail that reorients the entire Co-Scientist paper: the majority of system compute goes to verifying hypotheses, not generating them. DeepMind didn't build a research assistant on top of Gemini — it built a verifier corpus (AlphaFold, ChEMBL, UniProt, the full literature stack) and wrapped a generator around it. That architectural choice is the same bet surfacing in the Bloomberg litigation data and the BBC manipulation piece: generation is cheap and increasingly generic, and the organizations that accumulated verification infrastructure before the model layer commoditized are holding the durable position. Every 'AI for vertical X' startup that priced the model layer priced the wrong thing. The moat was always the corpus that tells you whether the output is true.

# tags

agentic-ai-viability ai-1.0-defensibility ai-economics ai-for-science deepmind evalrig evalrig-adjacent evaluation-infrastructure gemini google harness-as-moat multi-agent-orchestration multi-model-strategy nature pharma-ai pickrig pilot-to-scale verification-infrastructure verifier-infrastructure

Google DeepMind 2026-05-20-1

DeepMind Co-Scientist: A multi-agent AI partner to accelerate research

DeepMind's Co-Scientist paper in Nature drops the actual bombshell in one sentence — the majority of system compute goes to verifying hypotheses, not generating them. The moat isn't Gemini; it's the verifier corpus that grounds each claim: AlphaFold, ChEMBL, UniProt, the literature stack Google has quietly accumulated. Every "AI for vertical X" startup pricing the model layer is pricing the wrong layer of the stack.

# tags

deepmind gemini ai-for-science multi-agent-orchestration verifier-infrastructure ai-1.0-defensibility evaluation-infrastructure pharma-ai ai-economics harness-as-moat google nature agentic-ai-viability verification-infrastructure evalrig evalrig-adjacent pickrig multi-model-strategy pilot-to-scale

OpenAI 2026-05-20-3

OpenAI Model Disproves Erdős Unit Distance Conjecture

An internal OpenAI model disproved Erdős's 1946 planar unit distance conjecture, with Princeton's Sawin extracting an explicit exponent delta = 0.014 in a constructive refinement, and Gowers calling it Annals-of-Mathematics quality. The bigger signal isn't the proof. It's Shankar's CoT observation: most of the model's reasoning attempted counterexamples to the conjecture, not validations of it. That's calibrated contrarianism: a scorable behavioral property and the math-grounded analogue to sycophancy detection. Verifier-rich domains are where autonomous AI lands first; counterexample-seeking is how we'll measure whether reasoning is real or performative.
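
One way to make "scorable" concrete is a simple ratio over labeled reasoning steps. This is a minimal sketch, not anything the source describes: the `refute` / `confirm` labels and the `counterexample_ratio` helper are hypothetical, and it assumes some upstream process has already tagged each step of a trace.

```python
from collections import Counter
from typing import Iterable

# Hypothetical step labels; assumes each reasoning step was tagged upstream.
REFUTE = "refute"    # step tries to break the conjecture
CONFIRM = "confirm"  # step tries to support the conjecture


def counterexample_ratio(step_labels: Iterable[str]) -> float:
    """Share of scored steps that seek a counterexample.

    Steps with any other label (setup, bookkeeping) are ignored.
    """
    counts = Counter(step_labels)
    scored = counts[REFUTE] + counts[CONFIRM]
    if scored == 0:
        raise ValueError("no refute/confirm steps to score")
    return counts[REFUTE] / scored


if __name__ == "__main__":
    trace = ["refute", "refute", "setup", "confirm", "refute"]
    print(counterexample_ratio(trace))  # 3 of 4 scored steps -> 0.75
```

A ratio like this says nothing about whether the attempts were any good; it only measures which direction the reasoning leaned.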

# tags

openai ai-for-science verifier-bottleneck agentic-ai-viability frontier-models automated-research evalrig recursive-self-improvement capability-overhang harness-as-moat research-methodology ai-economics ai-labor-displacement ai-1.0-defensibility

Nature 2026-05-07-2

How much of the scientific literature is generated by AI?

Three independent studies converge on the same finding: 30% of peer reviews at Organization Science, 1 in 8 top-tier biomedical papers, and 43% of arXiv CS review preprints now contain AI-generated text. The verifier and the verified are using the same tool. This is the fourth domain in 30 days where verification has emerged as the binding constraint on AI-era knowledge work, after enterprise dev, frontier math, and frontier physics. The investable thesis is no longer single-domain. The next moat in scientific publishing is detection-vendor integration; pre-2026 literature becomes a scarcity asset; mid-tier journals collapse.

# tags

ai-detection ai-for-science verifier-infrastructure evalrig ai-1.0-defensibility ai-content-markets publisher-economics evaluation research-methodology ai-cognitive-sovereignty nature evaluation-infrastructure ai-governance

New York Times Magazine 2026-04-15-3

Why It's Crucial We Understand How A.I. 'Thinks'

Interpretability's real breakthrough isn't cracking the black box: it's using imperfect understanding to extract hypotheses humans missed. Goodfire and Prima Mente's Alzheimer's biomarker discovery reframes the field from safety obligation to discovery engine. The commercial signal matters more than the methodology debates: $1.25B for a standalone interpretability lab means enterprises will pay for explanation scoped to specific use cases, not universal model transparency.

# tags

interpretability ai-governance ai-healthcare reliability ai-trust-signals ai-1.0-defensibility ai-for-science Anthropic alignment deep-learning-foundations evaluation-infrastructure

Quanta Magazine 2026-04-14-2

The AI Revolution in Math Has Arrived

AlphaEvolve found hypercube structures in permutation groups that mathematicians hadn't noticed in 50 years: not by answering the question posed, but by surfacing a pattern nobody thought to look for. The real capability shift isn't AI proving things faster; it's AI scanning combinatorial spaces too large for human intuition and returning structures that reframe entire research programs. Discovery is being commoditized; the scarce resource is now verification infrastructure and the human judgment to recognize which discoveries matter.

# tags

ai-for-science reliability agentic-ai-viability deepmind ai-and-human-capacity ai-cognitive-dependency ai-economics OpenAI Google

tisram.ai 2026-03-31-m3

Evaluation Is the Layer Nobody Built

A $25 pipeline producing publishable economic theory and 700 experiments running in two days look like productivity stories. They're actually stress tests for organizations that still measure AI value by what gets generated rather than what gets used. The legibility piece named the terminal form of this problem: AI-for-science will produce discoveries faster than labs, regulators, and clinical infrastructure can absorb them, and the bottleneck was never generation. That dynamic was already visible in week one, where the BCG data showed cognitive load spiking as oversight demands increased. The human-in-the-loop model assumes a human with enough bandwidth to loop, and that assumption is failing in practice. The tokenmaxxing story closes the arc: when consumption volume becomes the proxy for productivity, every measurement framework in the organization ends up optimized for the wrong thing. What all three weeks surface, read together, is that the generation layer is effectively solved and the evaluation layer (scoring architecture, provenance infrastructure, translation tooling between machine output and institutional deployment) is where the next competitive advantage will be built. The companies that treat evaluation as an engineering problem now, rather than a governance afterthought, will hold a position in 18 months that no amount of inference spend can replicate.

# tags

evaluation agentic-ai ai-for-science cognitive-load

◆ entities

Anthropic OpenAI BCG MIT CSAIL DeepMind Asimov Press

→ threads

agentic-ai-viability reliability AI-for-science legibility translation layer infrastructure

⟷ links

2026-03-27-w1 2026-03-27-w2 2026-03-27-w3 2026-03-13-w3 2026-03-20-w1

Scientific American 2026-03-29-3

AI Techniques Speed Up Forensic Analysis of Crucial Crime Scene Larvae

Two research teams replaced DNA sequencing with ML on cheaper instruments: mass spectrometry IDs species in under five minutes, handheld IR reads larval sex at 90% accuracy. The results are promising; the legal framework isn't. Courts require explainable, independently vetted forensic evidence, and DNA databases took decades to get there. Daubert-admissible AI is a different problem, and right now it's unfunded.

# tags

ai-for-science reliability regulation

SSRN · 2026-03-26 2026-03-27-w2

Can LLMs Discover Novel Economic Theories?

A $25 pipeline generated 257 economic theories and independently converged on the same mechanism a human researcher published months later. Read it less as a curiosity than as a stress test for every organization currently spending on AI-powered generation. When the cost of producing candidates collapses to noise, the constraint shifts entirely to knowing which candidates are good. That's the connection to tokenmaxxing: both stories are about the same missing layer, the scoring infrastructure that converts output volume into output value. The Karpathy Loop works precisely because it starts with a measurable metric and a stopping criterion; the constraint is the insight, not the generation. Organizations that build deterministic scoring architecture now, with LLM judgment in a minority role, will compound their lead; the ones optimizing for generation throughput are manufacturing commodities at scale.
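
The "measurable metric plus stopping criterion" shape can be sketched as a generic propose-and-keep loop. This is an illustration under stated assumptions, not the Karpathy Loop itself: `metric_gated_search`, the toy metric, and the toy proposer are all hypothetical stand-ins for a real experiment runner.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def metric_gated_search(
    start: T,
    propose: Callable[[T], T],
    metric: Callable[[T], float],
    target: float,
    max_iters: int,
) -> Tuple[T, float, int]:
    """Keep a candidate only if the metric improves.

    Stops when the target score is reached or the iteration budget runs out.
    """
    best, best_score = start, metric(start)
    iters = 0
    while best_score < target and iters < max_iters:
        candidate = propose(best)
        score = metric(candidate)
        if score > best_score:  # the metric, not the generator, decides
            best, best_score = candidate, score
        iters += 1
    return best, best_score, iters


# Toy stand-ins (assumptions): score peaks at x = 3, proposals are small nudges.
rng = random.Random(0)


def toy_metric(x: float) -> float:
    return -((x - 3.0) ** 2)


def toy_propose(x: float) -> float:
    return x + rng.uniform(-0.5, 0.5)


if __name__ == "__main__":
    best, score, iters = metric_gated_search(
        0.0, toy_propose, toy_metric, target=-0.01, max_iters=500
    )
    print(f"best={best:.3f} score={score:.4f} iterations={iters}")
```

The proposer here is deliberately dumb. All of the selection pressure comes from the metric and the stop rule, which is the point the paragraph makes about where the value sits.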

# tags

agentic-ai ai-economics ai-for-science evaluation

Asimov Press · 2026-03-27 2026-03-27-w3

The Legibility Problem

The legibility piece reframes the entire week's stakes: chess went from centaur to post-human in 20 years, and AI-for-science will follow the same arc, but every output still has to pass through labs, regulators, and clinical infrastructure that speak human. The bottleneck was never discovery — it's the translation layer between what AI generates and what human institutions can actually deploy. That gap is exactly what the measurement problem in tokenmaxxing and the $25 theory pipeline leave open: generation is solved, evaluation is partially solved, but operationalizing the output through organizations that weren't built for machine-speed science is unsolved. Whoever owns that translation infrastructure captures value from every breakthrough that needs to reach the physical world, regardless of which model or lab produced it. The capability race and the legibility race are running at different speeds, and the distance between them is where the real economic value will settle.

# tags

agentic-ai ai-for-science infrastructure reliability

Asimov Press 2026-03-27-3

The Legibility Problem

Everyone's racing to build AI that does science. Nobody's building infrastructure for humans to use what it discovers. The bottleneck isn't discovery: it's deployment through human institutions. Chess went from centaur to post-human in 20 years; science will follow the same arc, but the output must still pass through labs, regulators, and clinical infrastructure that speak human. The entity that owns the translation layer between AI-generated and human-implementable science captures value from every breakthrough that needs to reach the physical world.

# tags

ai-for-science agentic-ai reliability infrastructure

SSRN 2026-03-26-3

Can LLMs Discover Novel Economic Theories?

An automated pipeline generated 257 candidate economic theories for two open asset-pricing puzzles at a total cost of $25, and the system independently converged on the same limited-participation mechanism a human researcher published months later. The real finding isn't that LLMs can theorize; it's that when generation costs collapse toward zero, the only defensible position is evaluation infrastructure. Every org pouring money into AI-powered generation should be spending 10x more on scoring architecture: deterministic anchors carrying majority weight, LLM judgment in the minority.
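
"Deterministic anchors carrying majority weight" can be sketched as a weighted composite that refuses to run if the judged share is too large. Everything named here is an assumption for illustration: the checks, the 0.4 / 0.3 / 0.3 weights, and the `stub_llm_judge` placeholder are hypothetical and not taken from the paper's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Check:
    name: str
    weight: float
    score: Callable[[str], float]  # returns a value in [0, 1]


# Deterministic anchors (hypothetical): cheap, repeatable, no model call.
def states_prediction(theory: str) -> float:
    return 1.0 if "predicts" in theory.lower() else 0.0


def within_word_budget(theory: str) -> float:
    return 1.0 if len(theory.split()) <= 60 else 0.0


# Placeholder for a model-based judge; a real one would call an LLM.
def stub_llm_judge(theory: str) -> float:
    return 0.5


DETERMINISTIC = (
    Check("states_prediction", 0.4, states_prediction),
    Check("within_word_budget", 0.3, within_word_budget),
)
JUDGED = (Check("llm_plausibility", 0.3, stub_llm_judge),)


def composite_score(
    theory: str,
    deterministic: Sequence[Check] = DETERMINISTIC,
    judged: Sequence[Check] = JUDGED,
) -> float:
    """Weighted average in [0, 1]; deterministic checks must hold the majority."""
    det_weight = sum(c.weight for c in deterministic)
    judged_weight = sum(c.weight for c in judged)
    if det_weight <= judged_weight:
        raise ValueError("deterministic checks must carry majority weight")
    weighted = sum(c.weight * c.score(theory) for c in (*deterministic, *judged))
    return weighted / (det_weight + judged_weight)


if __name__ == "__main__":
    print(composite_score("Limited participation predicts a higher equity premium."))
```

Enforcing the weight split in code, rather than by convention, means a later config change cannot quietly hand the judge the majority.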

# tags

ai-economics agentic-ai evaluation ai-for-science

◆ entities

gpt-oss-120b SSRN Li and Lin DeepInfra

→ threads

ai-economics agentic-ai-viability reliability

⟷ links

2026-03-24-1 2026-03-25-2 2026-03-21-1 2026-03-20-3 2026-03-08-1 2026-03-13-w3 2026-03-21-2 2026-03-10-2 2026-03-20-w2 2026-03-20-w1
