Grounded treatment reasoning

ATHENA-R1

An AI agent for treatment reasoning over a biomedical tool universe.

Shanghua Gao^1, Ayush Noori^{1,2,3}, Richard Zhu^1, Curtis Ginder^{1,4}, Zhenglun Kong^1, Xiaorui Su^1, Justin Kauffman^5, Benjamin S. Glicksberg^{5,6,7}, Joshua Lampert^{5,6,8}, Ankit Sakhuja^{5,9,10}, Ashwin Sawant^{5,9,11}, ATHENA-R1 Evaluation Consortium^{12}, David A. Clifton^{2,13}, Noa Dagan^{3,14,15}, Ran Balicer^{3,14,16}, Marinka Zitnik^{1,3,17,18,19,†}

1. Department of Biomedical Informatics, Harvard Medical School, Boston, MA
2. Department of Engineering Science, University of Oxford, Oxford, UK
3. The Ivan and Francesca Berkowitz Family Living Laboratory Collaboration at Harvard Medical School and Clalit Research Institute, Boston, MA, USA
4. Cardiovascular Division, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA
5. The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
6. The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and Mount Sinai Health System, New York City, NY, USA
7. Mindich Child Health and Development Institute and the Departments of Pediatrics and Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
8. Mount Sinai Fuster Heart Hospital, Icahn School of Medicine at Mount Sinai, New York, NY, USA
9. Mount Sinai AI Assurance Lab, Mount Sinai Health System, New York, NY, USA
10. Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
11. Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
12. ATHENA-R1 Evaluation Consortium (the list of members and their affiliations appears in the Supplementary Information)
13. Oxford Suzhou Centre for Advanced Research, University of Oxford, Suzhou, Jiangsu, China
14. Clalit Research Institute, Innovation Division, Clalit Health Services, Ramat Gan, Israel
15. Faculty of Computer and Information Science, Ben Gurion University of the Negev, Be'er Sheva, Israel
16. Faculty of Health Sciences, School of Public Health, Ben Gurion University of the Negev, Be'er Sheva, Israel
17. Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA
18. Broad Institute of MIT and Harvard, Cambridge, MA
19. Harvard Data Science Initiative, Cambridge, MA

† Correspondence — marinka@hms.harvard.edu


reasoning trace · illustrative example, not clinical advice

▸ treatment problem

Which treatment adjustments are safe for metformin in an elderly patient with type 2 diabetes, hypertension, and early CKD?

65y male · eGFR 52 · ACE + HCTZ · lactic-acidosis concern · once-daily


◂ evidence-grounded answer

01 Continue metformin: safe at eGFR 52 with monitoring (openFDA dosing label · HPO phenotype)

02 Reduce dose if eGFR drops below 45 mL/min/1.73 m² (FDA dosing · renal-clearance reasoning)

03 Switch ACE inhibitor → amlodipine 5 mg once daily (Open Targets interaction · label safety)

04 Replace HCTZ → indapamide 1.25 mg (drug-label comparison)

05 Monitor eGFR + electrolytes every 3 months (integrated reasoning)

Reasoning, one grounded step at a time

Instead of answering in a single pass, ATHENA-R1 decides what evidence it needs, selects from a library of 212 biomedical tools, retrieves grounded evidence, and folds it into the next step — returning a final answer and a full reasoning trace. Built on a Qwen3-8B backbone.

Knowledge grounding

Retrieves verified evidence through tools and grounds every conclusion in source data — each step traces to a drug label or annotation.

Goal-oriented tool selection

Selects the most relevant tools per step from the 212-tool library, instead of packing every tool into context.
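The idea of picking a few tools per step, rather than loading the whole library into context, can be sketched as follows. This is a minimal illustration that assumes a toy relevance score (token overlap between the current sub-question and each tool description). The tool names, descriptions, and the select_tools helper are hypothetical; they are not ATHENA-R1's actual selection mechanism or the ToolUniverse API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str          # hypothetical tool identifier
    description: str   # short natural-language description of the tool


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_tools(query: str, library: list, k: int = 2) -> list:
    """Return the k tools whose descriptions overlap most with the query.

    The score is Jaccard similarity over tokens, a stand-in for whatever
    learned relevance model a real agent would use.
    """
    q = tokenize(query)

    def score(tool: Tool) -> float:
        d = tokenize(tool.description)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their library order.
    return sorted(library, key=score, reverse=True)[:k]


# Hypothetical three-tool library standing in for the full 212-tool set.
LIBRARY = [
    Tool("label_dosing", "drug label dosing and renal dose adjustment"),
    Tool("drug_interactions", "drug drug interaction lookup"),
    Tool("phenotype_lookup", "disease phenotype annotation"),
]

top = select_tools("metformin renal dosing adjustment", LIBRARY, k=1)
print([t.name for t in top])
```

Only the selected tools would then be described to the model at that step, which keeps the prompt small even when the library is large.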

Multi-step reasoning

Decomposes a task, chains tool calls, interprets evidence and revises when results are incomplete or unexpected.
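A rough sketch of such a loop is shown below. It assumes a trivial planning policy (call each available tool once, then stop) and stub tools that return placeholder strings. In the real system the model itself decides what to call next; every name here (Step, Trace, plan_next, run_agent, the tool names) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str    # why this tool is being called
    tool: str       # which tool was called
    evidence: str   # what the tool returned


@dataclass
class Trace:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def stub_label_dosing(query: str) -> str:
    return f"[stub label evidence for: {query}]"


def stub_interactions(query: str) -> str:
    return f"[stub interaction evidence for: {query}]"


# Hypothetical tools: each maps a query string to an evidence string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "label_dosing": stub_label_dosing,
    "drug_interactions": stub_interactions,
}


def plan_next(trace: Trace) -> Optional[str]:
    """Toy policy: pick the first tool not yet used; None means 'done'."""
    used = {step.tool for step in trace.steps}
    for name in TOOLS:
        if name not in used:
            return name
    return None


def run_agent(question: str, max_steps: int = 5) -> Trace:
    """Alternate between planning and tool calls, recording every step."""
    trace = Trace(question)
    for _ in range(max_steps):
        tool = plan_next(trace)
        if tool is None:
            break
        evidence = TOOLS[tool](question)
        trace.steps.append(Step(f"need evidence from {tool}", tool, evidence))
    # The 'answer' here is just the collected evidence joined together.
    trace.answer = " | ".join(step.evidence for step in trace.steps)
    return trace


trace = run_agent("metformin dosing with reduced renal function")
for step in trace.steps:
    print(step.tool, "->", step.evidence)
```

The point of the structure is that the final answer and the full list of steps are returned together, so each conclusion can be traced back to the tool output that supported it.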

Real-time retrieval

Tools query continuously updated biomedical resources, so the agent reasons over evidence beyond its parametric knowledge.

Treatment reasoning lies at the heart of medicine — weighing disease context, candidate therapies, comorbidities, contraindications and evolving evidence. It is inherently iterative: evidence must be gathered and revised across steps, not inferred in one pass.

We introduce ATHENA-R1, a grounded AI agent for treatment reasoning. At each step it identifies missing information, selects from 212 biomedical tools, retrieves evidence from curated sources, and folds it into subsequent reasoning. It is trained through two-level self-learning: multi-agent systems build tools, questions and reasoning traces for supervised fine-tuning, followed by reinforcement learning with rule-based scientific feedback.

Across five benchmarks (3,168 drug-reasoning tasks; 456 patient-specific scenarios) it leads language models and tool-use systems — 94.7% on open-ended drug reasoning (+17.8 pp over GPT-5, +25.9 pp over DeepSeek-R1) and 82.9% on TreatmentPC. Experts from 28 disease organizations prefer it across all eight criteria; in records from 5.4M patients, its adverse-event hypotheses track elevated risk in matched subpopulations.

A universe of 212 biomedical tools

Tools span drug mechanisms, interactions, safety and disease annotations, interfacing with openFDA (FDA labels since 1939), Open Targets and the Human Phenotype Ontology. The biomedical tool library can be installed with pip install tooluniverse. See ToolUniverse for an expanded library of general-purpose tools for scientific discovery beyond treatment reasoning.


Two-level self-learning

Reasoning traces are too large and varied to annotate by hand. ATHENA-R1 learns them from generated traces — first the structure of reasoning, then how to act within it.

DataGen multi-agent pipeline · Level 1

The DataGen multi-agent system builds the tools, writes tasks from FDA labels, and composes step-by-step reasoning traces.

378,027 instruction samples
85,340 reasoning traces
177,626 reasoning steps
281,695 tool calls
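As a quick piece of arithmetic on the counts above, the snippet below derives average trace depth. It assumes the reported reasoning steps and tool calls are totals over the same 85,340 reasoning traces, which the page does not state explicitly.

```python
# Counts reported for the DataGen output (Level 1).
traces = 85_340
steps = 177_626
tool_calls = 281_695

# Assumption: steps and tool calls are totals over the same set of traces.
steps_per_trace = steps / traces
calls_per_step = tool_calls / steps
calls_per_trace = tool_calls / traces

print(f"steps per trace:      {steps_per_trace:.2f}")
print(f"tool calls per step:  {calls_per_step:.2f}")
print(f"tool calls per trace: {calls_per_trace:.2f}")
```

Under that assumption, a typical trace has roughly two reasoning steps and a little over three tool calls.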

RL with scientific feedback · Level 2

Reinforcement learning: rollouts are scored on six rule-based dimensions, and GRPO favors higher-scoring traces.
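The group-relative scoring at the core of GRPO can be sketched in a few lines. Each rollout's reward is compared with the mean and standard deviation of the rewards sampled for the same prompt. The six per-dimension scores below are unnamed placeholders averaged with equal weight; that combination rule, the example values, and the helper names are assumptions, since the page does not list the dimensions or their weighting.

```python
from statistics import mean, pstdev


def rollout_reward(dimension_scores: list) -> float:
    """Combine six rule-based scores into one reward (equal weights assumed)."""
    if len(dimension_scores) != 6:
        raise ValueError("expected six rule-based scores")
    return mean(dimension_scores)


def group_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within one group of rollouts for the same prompt.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for one prompt, each scored on six dimensions.
group = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
]
rewards = [rollout_reward(scores) for scores in group]
advantages = group_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Rollouts that score above their group's mean receive a positive advantage and are reinforced, while those below the mean are discouraged, so no separate value model is needed to provide a baseline.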

TreatmentPC · ablation

Qwen3-8B base: 39.2%
+ Level 1 · SFT: 66.5% (+27.3 pp)
+ Level 2 · RL: 74.8% (+8.3 pp)

Both levels matter: supervised fine-tuning lifts the base by 27.3 pp and reinforcement learning adds a further 8.3 pp.

Beating frontier reasoning models

Across five datasets, ATHENA-R1 leads in open-ended evaluation. DrugPC uses FDA drugs approved in 2024 — held out of training — to limit leakage.

DrugPC · open-ended accuracy

ATHENA-R1: 94.7
GPT-5: 76.9
DeepSeek-R1 671B: 68.8
Qwen3: 48.7

3,168 questions · 11 categories — +17.8 pp over GPT-5

Open-ended accuracy on DrugPC across 11 categories of drug information.

TreatmentPC · open-ended accuracy

ATHENA-R1: 82.9
GPT-5: 72.2
DeepSeek-R1 671B: 67.5
Qwen3-Next: 60.1
Qwen3: 39.2

456 patient-specific scenarios — +10.7 pp over GPT-5

On TreatmentPC the answer depends on patient context. Giving GPT-5 tool access doesn't close the gap — ATHENA-R1 uses tools on every problem.

Preferred by rare-disease experts

With the Chan Zuckerberg Initiative Rare As One network — 29 experts from 28 disease organizations — ATHENA-R1 was compared blind against Qwen3, GPT o3-mini, Gemini-2.0-Flash and DeepSeek-R1. 23 evaluators gave 110 responses; experts preferred it across all eight criteria.

Blinded preference · 8 criteria

Pairwise preference · % of 110 responses (ATHENA-R1 preferred · reference model preferred)

Cognitive traceability: 95.5 · 3.6
Helpfulness of rationale: 94.5 · 2.7
Completeness: 66.4 · 17.3
Task success: 63.6 · 19.1
Possibility of harm: 61.8 · 19.1
Alignment w/ consensus: 59.1 · 16.4
Accuracy of content: 58.2 · 18.2
Clinical relevance: 57.3 · 20.0

Ties and "neither did well" responses (the remainder for each criterion, about 1–25% combined) not shown on bars.

ATHENA-R1 · mean rating 4.16 ± 0.90
reference models · mean rating 2.44 ± 1.26

Share of evaluations preferring ATHENA-R1, and mean absolute ratings (1–5). All differences significant (P < 5×10⁻⁵). Physicians separately rated five real clinical vignettes — task success averaged 4.63 ± 0.52.

Risk hypotheses tested in 5.4M patients

These hypotheses are AI-generated, not literature-derived — ATHENA-R1 reasoned over disease + comorbidity + medication profiles to predict adverse events that should be elevated in specific patient subpopulations. The predictions were then tested retrospectively against records from over 5.4 million patients at Clalit Health Services, using matched cohort analyses adjusted for age, sex, socioeconomic status and outpatient utilization.

Adjusted odds ratios · 95% CI

Three of six AI-generated predictions reached statistical significance after adjusting for confounders, with odds ratios of 1.48–1.84 in the targeted subpopulations. Red = confirmed (95% CI excludes 1); gray = adjusted regression did not confirm in this cohort, though prevalence patterns aligned for two of the three.
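To make the "95% CI excludes 1" criterion concrete, here is a simplified sketch that computes an unadjusted odds ratio with a Wald confidence interval from a 2×2 table. It is not the study's method: the analysis described above uses matched cohorts and adjusted regression, which this example does not reproduce. The counts are invented purely for illustration, and the function names are hypothetical.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Unadjusted odds ratio and Wald 95% CI from a 2x2 table.

    a: subpopulation with the adverse event
    b: subpopulation without the event
    c: comparison group with the event
    d: comparison group without the event
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


def excludes_one(lower: float, upper: float) -> bool:
    """True when the whole interval lies on one side of 1."""
    return lower > 1.0 or upper < 1.0


# Invented counts, for illustration only.
or_, lo, hi = odds_ratio_ci(a=120, b=880, c=80, d=920)
print(f"OR={or_:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), excludes 1: {excludes_one(lo, hi)}")
```

An interval that lies entirely above 1 indicates elevated odds in the subpopulation, while an interval spanning 1 means the data are compatible with no difference.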

Run it. Cite it.

# install the tool library
pip install tooluniverse

# code · demos · weights
github.com/mims-harvard/ATHENA
huggingface.co/mims-harvard/ATHENA-R1-8B

@article{gao2026athena,
  title  = {An AI agent for treatment reasoning
             over a biomedical tool universe},
  author = {Gao, Shanghua and Noori, Ayush and
             Zhu, Richard and Ginder, Curtis and
             Kong, Zhenglun and Su, Xiaorui and
             Kauffman, Justin and Glicksberg, Benjamin S. and
             Lampert, Joshua and Sakhuja, Ankit and
             Sawant, Ashwin and
             {ATHENA-R1 Evaluation Consortium} and
             Clifton, David A. and Dagan, Noa and
             Balicer, Ran and Zitnik, Marinka},
  year   = {2026}
}

Questions? Email Shanghua Gao and Marinka Zitnik.