Overview of MedHELM. MedHELM provides a clinician-validated taxonomy of 121 medical tasks, a 37-benchmark evaluation suite, and a robust LLM-jury methodology for assessing the real-world performance of LLMs in healthcare.

Large language models (LLMs) achieve near-perfect scores on medical licensing exams, yet these benchmarks fail to capture the complexity of real-world clinical practice. MedHELM addresses this gap by introducing a clinician-validated taxonomy of five categories, 22 subcategories, and 121 medical tasks; a suite of 37 benchmarks (including real-world EHR datasets); and an improved evaluation methodology built on an LLM-jury approach.
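To give a sense of the taxonomy's shape, the minimal sketch below represents it as a nested mapping from the five categories (named in the results that follow) down to subcategories and tasks. The subcategory and task entries shown are illustrative placeholders, not the paper's actual 22 subcategories or 121 tasks.

```python
# Minimal sketch of a MedHELM-style taxonomy as a nested mapping.
# Category names come from this overview; every subcategory and task
# listed beneath them is a hypothetical placeholder for illustration.
from typing import Dict, List

Taxonomy = Dict[str, Dict[str, List[str]]]  # category -> subcategory -> tasks

taxonomy: Taxonomy = {
    "Clinical Decision Support": {
        "Placeholder subcategory": ["placeholder task A", "placeholder task B"],
    },
    "Clinical Note Generation": {
        "Placeholder subcategory": ["placeholder task C"],
    },
    "Patient Communication": {
        "Placeholder subcategory": ["placeholder task D"],
    },
    "Medical Research Assistance": {
        "Placeholder subcategory": ["placeholder task E"],
    },
    "Administration & Workflow": {
        "Placeholder subcategory": ["placeholder task F"],
    },
}

# Tagging each benchmark with the tasks it covers would let coverage of the
# full task list be audited programmatically.
n_categories = len(taxonomy)
n_subcategories = sum(len(subs) for subs in taxonomy.values())
n_tasks = sum(len(tasks) for subs in taxonomy.values() for tasks in subs.values())
print(n_categories, n_subcategories, n_tasks)
```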
Our evaluation of nine frontier LLMs reveals substantial variation in performance: reasoning models such as DeepSeek R1 and o3-mini achieve the highest win rates (66% and 64%, respectively), and conventional automated metrics (e.g., ROUGE, BERTScore) show weaker agreement with clinician ratings than the LLM-jury approach (ICC = 0.47). Models perform best on Clinical Note Generation and Patient Communication, moderately on Medical Research Assistance and Clinical Decision Support, and worst on Administration & Workflow tasks.
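To make the evaluation methodology concrete, the sketch below shows one way an LLM-jury score could be aggregated per response and its agreement with clinician ratings quantified with a two-way random-effects ICC. The jury size, rating scale, and data are invented for illustration; this is a sketch under those assumptions, not the paper's exact implementation.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random-effects, absolute agreement, single rater.

    `ratings` has shape (n_subjects, k_raters); here each column is one
    rating source (e.g., the aggregated LLM jury vs. a clinician).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical data: three LLM judges each score six responses on a 1-5 scale;
# the jury score is their mean, compared against a clinician's rating.
judge_scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
    [4, 4, 5],
], dtype=float)
jury_score = judge_scores.mean(axis=1)             # aggregate the jury
clinician = np.array([4, 2, 5, 3, 2, 4], dtype=float)

agreement = icc2_1(np.column_stack([jury_score, clinician]))
print(f"ICC(2,1) between jury and clinician: {agreement:.2f}")
```

The same routine could be run with ROUGE or BERTScore values in place of the jury scores to compare how well each metric tracks clinician judgment.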
MedHELM establishes a standardized, extensible, and clinically grounded framework for benchmarking LLMs, offering critical insights for healthcare deployment and future medical AI development.