Grounded treatment reasoning

ATHENA-R1

A grounded AI agent that reasons through treatment decisions — every step traced to evidence.

Shanghua Gao1, Ayush Noori1,2,3, Richard Zhu1, Zhenglun Kong1, Xiaorui Su1, Curtis Ginder1,4, Justin Kauffman5, Benjamin Glicksberg5,6,7, Joshua Lampert5,6,8, Ankit Sakhuja5,9,10, Ashwin Sawant5,9,11, ATHENA-R1 Evaluation Consortium12, David A. Clifton2,13, Noa Dagan3,14,15, Ran Balicer3,14,16, Marinka Zitnik1,3,17,18,19,†
† Correspondence — marinka@hms.harvard.edu
  1. Department of Biomedical Informatics, Harvard Medical School, Boston, MA
  2. Department of Engineering Science, University of Oxford, UK
  3. Berkowitz Family Living Laboratory, Harvard Medical School & Clalit Research Institute
  4. Cardiovascular Division, Brigham and Women's Hospital, Harvard Medical School
  5. Windreich Dept. of AI and Human Health, Icahn School of Medicine at Mount Sinai
  6. Hasso Plattner Institute for Digital Health at Mount Sinai
  7. Mindich Child Health Institute & Depts. of Pediatrics and Genetics, Mount Sinai
  8. Mount Sinai Fuster Heart Hospital, Icahn School of Medicine at Mount Sinai
  9. Mount Sinai AI Assurance Lab, Mount Sinai Health System
  10. Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai
  11. Division of Medicine, Icahn School of Medicine at Mount Sinai
  12. ATHENA-R1 Evaluation Group (members in Supplementary Information)
  13. Oxford Suzhou Centre for Advanced Research, University of Oxford, China
  14. Clalit Research Institute, Innovation Division, Clalit Health Services, Israel
  15. Faculty of Computer and Information Science, Ben Gurion University of the Negev
  16. Faculty of Health Sciences, Ben Gurion University of the Negev
  17. Kempner Institute, Harvard University, Cambridge, MA
  18. Broad Institute of MIT and Harvard, Cambridge, MA
  19. Harvard Data Science Initiative, Cambridge, MA
reasoning trace · illustrative
▸ treatment problem
Which treatment adjustments are safe for metformin in an elderly patient with Type 2 Diabetes, hypertension, and early CKD?
65y male · eGFR 52 · ACE + HCTZ · lactic-acidosis concern · once-daily
◂ evidence-grounded answer
01 Continue metformin: safe at eGFR 52 with monitoring (openFDA dosing label · HPO phenotype)
02 Reduce dose if eGFR drops below 45 mL/min (FDA dosing · renal-clearance reasoning)
03 Switch ACE inhibitor → amlodipine 5 mg once daily (Open Targets interaction · label safety)
04 Replace HCTZ → indapamide 1.25 mg (drug-label comparison)
05 Monitor eGFR + electrolytes every 3 months (integrated reasoning)
How it works

Reasoning, one grounded step at a time

Instead of answering in a single pass, ATHENA-R1 decides what evidence it needs, selects from a library of 212 biomedical tools, retrieves grounded evidence, and folds it into the next step — returning a final answer and a full reasoning trace. Built on a Qwen3-8B backbone.
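The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ATHENA-R1 implementation: the `Step` and `Trace` records, the `decide` policy callback, and the toy `label_lookup` tool are all hypothetical stand-ins for the model and its tool library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One entry in the reasoning trace (hypothetical structure)."""
    thought: str
    tool: Optional[str] = None
    evidence: Optional[str] = None


@dataclass
class Trace:
    question: str
    steps: list = field(default_factory=list)
    answer: Optional[str] = None


def run_agent(question, decide, tools, max_steps=10):
    """Iterate: decide what is missing, call one tool, fold the evidence back in."""
    trace = Trace(question)
    for _ in range(max_steps):
        thought, tool_name, arg = decide(trace)
        if tool_name is None:
            # The policy judges the gathered evidence sufficient and answers.
            trace.answer = thought
            break
        evidence = tools[tool_name](arg)
        trace.steps.append(Step(thought, tool_name, evidence))
    return trace


# Toy stand-ins for illustration only; a real policy would be the language model.
def toy_label_lookup(drug):
    return f"label text for {drug} (placeholder)"


def toy_decide(trace):
    if not trace.steps:
        return ("Need dosing guidance from the label.", "label_lookup", "metformin")
    return ("Answer grounded in: " + trace.steps[-1].evidence, None, None)


trace = run_agent(
    "Is metformin safe at eGFR 52?",
    toy_decide,
    {"label_lookup": toy_label_lookup},
)
```

Because every tool call is appended to `trace.steps` together with its evidence, the final answer can be audited step by step, which is the property the page calls a full reasoning trace.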

Knowledge grounding

Retrieves verified evidence through tools and grounds every conclusion in source data — each step traces to a drug label or annotation.

Goal-oriented tool use

Requests tools via the ToolRAG model and applies the best candidate per step, instead of packing every tool into context.
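Retrieving a few candidate tools per step, rather than listing the whole library in the prompt, might look like the sketch below. ToolRAG is a learned retrieval model; here a plain bag-of-words cosine similarity is swapped in as a stand-in, and the tool names and descriptions in `TOOLS` are hypothetical examples, not entries from the actual library.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical tool descriptions, for illustration only.
TOOLS = {
    "get_drug_label_dosing": "retrieve dosing and administration text from a drug label",
    "get_drug_interactions": "list known interactions between two drugs",
    "get_phenotype_terms": "look up phenotype terms associated with a disease",
}


def retrieve_tools(request, tools, k=1):
    """Rank tools by similarity between the request and each description."""
    query = bow(request)
    ranked = sorted(
        tools, key=lambda name: cosine(query, bow(tools[name])), reverse=True
    )
    return ranked[:k]


top = retrieve_tools("dosing text from the drug label for renal impairment", TOOLS)
```

Only the top `k` descriptions would then be placed in context for the current step, which keeps the prompt small regardless of how large the library grows.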

Multi-step reasoning

Decomposes a task, chains tool calls, interprets evidence and revises when results are incomplete or unexpected.

Real-time knowledge

Tools query continuously updated biomedical resources, so the agent can reason over evidence beyond its parametric knowledge.

Abstract

Treatment reasoning lies at the heart of medicine — weighing disease context, candidate therapies, comorbidities, contraindications and evolving evidence. It is inherently iterative: evidence must be gathered and revised across steps, not inferred in one pass.

We introduce ATHENA-R1, a grounded AI agent for treatment reasoning. At each step it identifies missing information, selects from 212 biomedical tools, retrieves evidence from curated sources, and folds it into subsequent reasoning. It is trained through two-level self-learning: first, multi-agent systems build the tools, questions and reasoning traces used for supervised fine-tuning; then reinforcement learning refines the agent with rule-based scientific feedback.

Across five benchmarks (3,168 drug-reasoning tasks; 456 patient-specific scenarios) it leads language models and tool-use systems — 94.7% on open-ended drug reasoning (+17.8 pp over GPT-5, +25.9 pp over DeepSeek-R1) and 82.9% on TreatmentPC. Experts from 28 disease organizations prefer it across all eight criteria; in records from 5.4M patients, its adverse-event hypotheses track elevated risk in matched subpopulations.

Tool library

A universe of 212 biomedical tools

Tools span drug mechanisms, interactions, safety and disease annotations, interfacing with openFDA (FDA labels since 1939), Open Targets and the Human Phenotype Ontology. Released as ToolUniverse — pip install tooluniverse

Training

Two-level self-learning

Reasoning traces are too large and varied to annotate by hand. ATHENA-R1 instead learns from generated traces: first the structure of reasoning, then how to act within it.

DataGen multi-agent pipeline · Level 1
ToolGen: 212 tools from APIs · QuestionGen: 85,340 tasks · TraceGen: Helper, Tool Provider, Solver · ATHENA-R1-Instruct: 378,027 SFT samples
The DataGen multi-agent system builds the tools, writes tasks from FDA labels, and composes step-by-step reasoning traces.
378,027 instruction samples · 85,340 reasoning traces · 177,626 reasoning steps · 281,695 tool calls
RL with scientific feedback · Level 2
Policy: ATHENA-R1 · Rollouts in 212-tool env · Scientific feedback (6 dimensions): 1 correctness, 2 format, 3 evidence, 4 multi-step, 5 arg grounding, 6 non-redundant · GRPO update
Reinforcement learning: rollouts are scored on six rule-based dimensions, and GRPO favors higher-scoring traces.
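To make the scoring step concrete, the sketch below turns six pass/fail checks into one scalar reward and then computes group-relative advantages, the normalization GRPO applies across rollouts sampled for the same prompt. The equal weighting of the six dimensions, the boolean form of each check, and all function names are assumptions for illustration; the source does not specify how the dimensions are combined.

```python
from statistics import mean, pstdev

# The six feedback dimensions named above; snake_case keys are illustrative.
DIMENSIONS = (
    "correctness",
    "format",
    "evidence",
    "multi_step",
    "arg_grounding",
    "non_redundant",
)


def score_rollout(checks):
    """Average six pass/fail checks into one reward (equal weights are an assumption)."""
    return sum(1.0 if checks[d] else 0.0 for d in DIMENSIONS) / len(DIMENSIONS)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for one prompt.
rollouts = [
    dict.fromkeys(DIMENSIONS, True),
    {**dict.fromkeys(DIMENSIONS, True), "evidence": False, "non_redundant": False},
    dict.fromkeys(DIMENSIONS, False),
]
rewards = [score_rollout(r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged, without needing a separate learned value model.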
TreatmentPC · ablation
Qwen3-8B base: 39.2%
+ Level 1 · SFT: 66.5% (+27.3)
+ Level 2 · RL: 74.8% (+8.3)
Both levels matter: supervised fine-tuning lifts the base by 27.3 pp and reinforcement learning adds a further 8.3 pp.
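The percentage-point gains quoted above follow directly from the three reported accuracies. The short check below recomputes them; the variable names are illustrative, and the scores are the ones stated in the ablation.

```python
# Accuracies (%) reported in the TreatmentPC ablation.
stages = [
    ("Qwen3-8B base", 39.2),
    ("+ Level 1 (SFT)", 66.5),
    ("+ Level 2 (RL)", 74.8),
]

# Gain of each stage over the previous one, in percentage points.
gains = [round(later[1] - earlier[1], 1) for earlier, later in zip(stages, stages[1:])]

# Total improvement over the base model.
total = round(stages[-1][1] - stages[0][1], 1)
```

Rounding to one decimal place avoids floating-point noise when subtracting the reported values.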
Benchmarks

Beating frontier reasoning models

Across five datasets, ATHENA-R1 leads in open-ended evaluation. DrugPC uses FDA drugs approved in 2024 — held out of training — to limit leakage.

DrugPC · open-ended accuracy
ATHENA-R1: 94.7
GPT-5: 76.9
DeepSeek-R1 671B: 68.8
Qwen3: 48.7
3,168 questions · 11 categories — +17.8 pp over GPT-5
Open-ended accuracy on DrugPC across 11 categories of drug information.
TreatmentPC · open-ended accuracy
ATHENA-R1: 82.9
GPT-5: 72.2
DeepSeek-R1 671B: 67.5
Qwen3-Next: 60.1
Qwen3: 39.2
456 patient-specific scenarios — +10.7 pp over GPT-5
On TreatmentPC the answer depends on patient context. Giving GPT-5 tool access doesn't close the gap — ATHENA-R1 uses tools on every problem.
Expert evaluation

Preferred by rare-disease experts

With the Chan Zuckerberg Initiative Rare As One network — 29 experts from 28 disease organizations — ATHENA-R1 was compared blind against Qwen3, GPT o3-mini, Gemini-2.0-Flash and DeepSeek-R1. 23 evaluators gave 110 responses; experts preferred it across all eight criteria.

Blinded preference · 8 criteria
Cognitive traceability: 95.5
Helpfulness of rationale: 94.5
Completeness: 66.4
Task success: 63.6
Harm avoided: 61.8
Alignment w/ consensus: 59.1
Accuracy of content: 58.2
Clinical relevance: 57.3
Mean rating: ATHENA-R1 4.16 ± 0.90 · reference models 2.44 ± 1.26
Share of evaluations preferring ATHENA-R1, and mean absolute ratings (1–5). All differences significant (P < 5×10⁻⁵). Physicians separately rated five real clinical vignettes — task success averaged 4.63 ± 0.52.
Population-scale validation

Risk hypotheses tested in 5.4M patients

ATHENA-R1 generated adverse-event hypotheses for disease + comorbidity + medication profiles, tested by retrospective cohort analyses on records from over 5.4 million patients at Clalit Health Services, adjusting for age, sex, socioeconomic status and utilization.

Adjusted odds ratios · 95% CI
Adjusted odds ratio, by hypothesis (comorbidity profile in parentheses)
β-blocker → acute kidney failure (hypertension + gout): 1.84
β-blocker → hyperkalemia (hypertension + gout): 1.78
DPP-4 inh. → hepatocellular carcinoma (diabetes + ischemic heart disease): 1.48
diuretic → squamous cell carcinoma (hypertension + gout): 1.08
statin → liver failure (hyperlipidemia + hypothyroidism): 1.04
metformin → respiratory failure (diabetes + chronic kidney disease): 1.00
Red = CI excludes 1 (significantly elevated risk); gray = not significant. Positive controls recovered known effects; negative controls stayed near null.
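The "CI excludes 1" criterion can be illustrated with a simplified calculation. The sketch below computes an unadjusted odds ratio and a Wald 95% confidence interval from a 2×2 table. This is a deliberate simplification: the analysis described above reports adjusted odds ratios (controlling for age, sex, socioeconomic status and utilization), which would normally come from a regression model. The counts are invented for illustration and are not study data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def excludes_one(lower, upper):
    """True when the interval lies entirely above or entirely below 1."""
    return lower > 1.0 or upper < 1.0


# Invented counts, for illustration only.
or_, lo, hi = odds_ratio_ci(90, 910, 50, 950)
null_or, null_lo, null_hi = odds_ratio_ci(50, 950, 50, 950)
```

In the first table the interval sits entirely above 1, so it would be drawn as a significant elevated risk; in the second the odds ratio is exactly 1 and the interval straddles it, matching the "not significant" case.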
Get started & cite

Run it. Cite it.

# install the tool library
pip install tooluniverse
# code · demos · weights
github.com/mims-harvard/ATHENA
huggingface.co/mims-harvard/ATHENA-R1-8B
@article{gao2025athena,
  title  = {A grounded AI agent for treatment reasoning},
  author = {Gao, Shanghua and Noori, Ayush and Zhu, Richard and Kong, Zhenglun
            and Su, Xiaorui and Ginder, Curtis and Kauffman, Justin
            and Glicksberg, Benjamin and Lampert, Joshua and Sakhuja, Ankit
            and Sawant, Ashwin and Clifton, David A. and Dagan, Noa
            and Balicer, Ran and Zitnik, Marinka},
  year   = {2025}
}

Questions? Email Shanghua Gao and Marinka Zitnik.