A grounded AI agent that reasons through treatment decisions — every step traced to evidence.
Instead of answering in a single pass, ATHENA-R1 decides what evidence it needs, selects from a library of 212 biomedical tools, retrieves grounded evidence, and folds it into the next step — returning a final answer and a full reasoning trace. Built on a Qwen3-8B backbone.
Retrieves verified evidence through tools and grounds every conclusion in source data — each step traces to a drug label or annotation.
Requests tools via the ToolRAG model and applies the best candidate per step, instead of packing every tool into context.
Decomposes a task, chains tool calls, interprets evidence and revises when results are incomplete or unexpected.
Tools query continuously updated biomedical resources, so the agent can reason over evidence beyond its parametric knowledge.
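Per-step tool selection can be illustrated with a small retrieval routine. This is a minimal sketch, not the ToolRAG model: it ranks tool descriptions by plain token overlap with the query, which stands in for a learned retriever. The tool names and descriptions are hypothetical placeholders.

```python
import re

# Hypothetical tool catalog; names and descriptions are illustrative only.
TOOLS = {
    "get_drug_interactions": "list known drug interactions for a drug label",
    "get_contraindications": "list contraindications from a drug label",
    "get_disease_phenotypes": "list phenotypes annotated to a disease",
}


def tokenize(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_tools(query, tools, k=1):
    """Return the k tool names whose descriptions share the most tokens with the query.

    Token overlap is an assumed stand-in for a learned retriever.
    Ties are broken alphabetically by tool name.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        tools,
        key=lambda name: (-len(query_tokens & tokenize(tools[name])), name),
    )
    return ranked[:k]


best = retrieve_tools("which contraindications does this drug have", TOOLS)
print(best)
```

Only the top-ranked candidates would be passed to the model at each step, so the full catalog never has to fit in context.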
Treatment reasoning lies at the heart of medicine — weighing disease context, candidate therapies, comorbidities, contraindications and evolving evidence. It is inherently iterative: evidence must be gathered and revised across steps, not inferred in one pass.
We introduce ATHENA-R1, a grounded AI agent for treatment reasoning. At each step it identifies missing information, selects from 212 biomedical tools, retrieves evidence from curated sources, and folds it into subsequent reasoning. It is trained through two-level self-learning: first, multi-agent systems build tools, questions and reasoning traces for supervised fine-tuning; then reinforcement learning with rule-based scientific feedback refines how the agent acts.
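The step-by-step loop described here can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ATHENA-R1 implementation: the two tools are hypothetical stubs that return canned strings, and a fixed rule replaces the language model that would decide what evidence is still missing.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str      # which tool was called at this step
    evidence: str  # what the tool returned


@dataclass
class Trace:
    steps: list = field(default_factory=list)
    answer: str = ""


# Hypothetical stand-ins for real tools: each maps a drug name to a placeholder string.
TOOLS = {
    "label_warnings": lambda drug: f"{drug}: label lists warning W (placeholder)",
    "interactions": lambda drug: f"{drug}: interacts with drug Y (placeholder)",
}


def next_need(gathered):
    """Toy policy: request each kind of evidence once, then return None."""
    for tool in TOOLS:
        if tool not in gathered:
            return tool
    return None


def run_agent(drug, max_steps=5):
    """Gather evidence step by step and return the answer with its trace."""
    trace = Trace()
    gathered = {}
    for _ in range(max_steps):
        need = next_need(gathered)
        if need is None:  # nothing is missing, so stop gathering
            break
        evidence = TOOLS[need](drug)  # retrieve evidence with the selected tool
        gathered[need] = evidence     # fold it into the state the next step sees
        trace.steps.append(Step(tool=need, evidence=evidence))
    trace.answer = " | ".join(gathered.values())
    return trace


trace = run_agent("drug X")
for step in trace.steps:
    print(step.tool, "->", step.evidence)
```

Every piece of the final answer comes from an entry in `trace.steps`, which is the property the reasoning trace is meant to provide.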
Across five benchmarks (3,168 drug-reasoning tasks; 456 patient-specific scenarios) it leads language models and tool-use systems — 94.7% on open-ended drug reasoning (+17.8 pp over GPT-5, +25.9 pp over DeepSeek-R1) and 82.9% on TreatmentPC. Experts from 28 disease organizations prefer it across all eight criteria; in records from 5.4M patients, its adverse-event hypotheses track elevated risk in matched subpopulations.
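The percentage-point margins above imply baseline scores that are not stated directly. A short worked calculation, assuming the margins are simple differences from the 94.7% open-ended score:

```python
# Figures taken from the text above.
athena_open_ended = 94.7  # percent
margins_pp = {"GPT-5": 17.8, "DeepSeek-R1": 25.9}  # percentage-point leads

# Implied baseline = ATHENA-R1 score minus its lead, rounded to one decimal.
implied = {name: round(athena_open_ended - lead, 1) for name, lead in margins_pp.items()}
print(implied)
```

Under that assumption the implied open-ended scores are about 76.9% for GPT-5 and 68.8% for DeepSeek-R1.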
Tools span drug mechanisms, interactions, safety and disease annotations, and interface with openFDA (FDA labels since 1939), Open Targets and the Human Phenotype Ontology. The tool library is released as ToolUniverse: pip install tooluniverse
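To give a sense of what a label lookup involves, the sketch below builds a search URL for the public openFDA drug-label endpoint. It does not use the ToolUniverse API, whose interface is not described here. The `openfda.generic_name` search field is an assumption to check against the openFDA documentation, and the request itself is left out so the example runs offline.

```python
from urllib.parse import urlencode

# Public openFDA drug-label endpoint.
OPENFDA_LABEL_ENDPOINT = "https://api.fda.gov/drug/label.json"


def build_label_query(generic_name, limit=1):
    """Build an openFDA drug-label search URL for a generic drug name.

    The search field name is an assumption; verify it against the openFDA docs.
    """
    params = {
        "search": f'openfda.generic_name:"{generic_name}"',
        "limit": limit,
    }
    return f"{OPENFDA_LABEL_ENDPOINT}?{urlencode(params)}"


url = build_label_query("warfarin")
print(url)
```

Fetching this URL with any HTTP client should return JSON label records, from which fields such as warnings or contraindications can be read.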
Reasoning traces are too large and varied to annotate by hand. ATHENA-R1 instead learns from generated traces: first the structure of reasoning, then how to act within it.
Across five datasets, ATHENA-R1 leads in open-ended evaluation. DrugPC uses FDA drugs approved in 2024 — held out of training — to limit leakage.
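Holding out drugs by approval year amounts to a temporal split. A minimal sketch with synthetic records follows; the drug names and dates are placeholders, not real approvals, and the actual DrugPC construction may differ.

```python
from datetime import date

# Synthetic records; names and dates are placeholders, not real FDA approvals.
drugs = [
    {"name": "drug_a", "approved": date(2019, 6, 1)},
    {"name": "drug_b", "approved": date(2023, 12, 31)},
    {"name": "drug_c", "approved": date(2024, 1, 1)},
    {"name": "drug_d", "approved": date(2024, 9, 15)},
]


def split_by_approval_year(records, holdout_year=2024):
    """Split records into training (approved before holdout_year) and held-out sets.

    Records approved after holdout_year are placed in neither set.
    """
    train = [r for r in records if r["approved"].year < holdout_year]
    held_out = [r for r in records if r["approved"].year == holdout_year]
    return train, held_out


train, held_out = split_by_approval_year(drugs)
print([r["name"] for r in train])
print([r["name"] for r in held_out])
```

Because no held-out drug appears in the training pool, a correct answer about it cannot come from memorized training examples.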
With the Chan Zuckerberg Initiative Rare As One network (29 experts from 28 disease organizations), ATHENA-R1 was compared blind against Qwen3, GPT o3-mini, Gemini-2.0-Flash and DeepSeek-R1. In total, 23 evaluators gave 110 responses, and experts preferred it across all eight criteria.
ATHENA-R1 generated adverse-event hypotheses for disease + comorbidity + medication profiles, tested by retrospective cohort analyses on records from over 5.4 million patients at Clalit Health Services, adjusting for age, sex, socioeconomic status and utilization.
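The text does not say which statistical method was used for adjustment. As one standard illustration of adjusting a risk estimate for a confounder, the sketch below computes a Mantel-Haenszel pooled risk ratio across strata (for example, age bands). The counts are synthetic and chosen only to make the arithmetic easy to follow.

```python
def mantel_haenszel_rr(strata):
    """Mantel-Haenszel pooled risk ratio across strata.

    Each stratum is a tuple:
    (exposed_events, exposed_total, unexposed_events, unexposed_total).
    """
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total    # weighted events among the exposed
        denominator += c * n1 / total  # weighted events among the unexposed
    return numerator / denominator


# Synthetic counts for two strata; not real patient data.
strata = [
    (10, 100, 5, 100),
    (30, 200, 15, 200),
]

rr = mantel_haenszel_rr(strata)
print(rr)
```

A pooled ratio above 1 would indicate elevated risk in the exposed group after accounting for the stratifying variable.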
@article{gao2025athena,
  title  = {A grounded AI agent for treatment reasoning},
  author = {Gao, Shanghua and Noori, Ayush and Zhu, Richard and Kong, Zhenglun
            and Su, Xiaorui and Ginder, Curtis and Kauffman, Justin
            and Glicksberg, Benjamin and Lampert, Joshua and Sakhuja, Ankit
            and Sawant, Ashwin and Clifton, David A. and Dagan, Noa
            and Balicer, Ran and Zitnik, Marinka},
  year   = {2025}
}
Questions? Email Shanghua Gao and Marinka Zitnik.