<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Juan M. Banda | Hejie Cui</title><link>https://hejiecui.com/author/juan-m.-banda/</link><atom:link href="https://hejiecui.com/author/juan-m.-banda/index.xml" rel="self" type="application/rss+xml"/><description>Juan M. Banda</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 01 Dec 2025 00:00:00 -0800</lastBuildDate><image><url>https://hejiecui.com/images/logo_hu55b4809d0d762654adf09f4071918d91_2611179_300x300_fit_lanczos_2.png</url><title>Juan M. Banda</title><link>https://hejiecui.com/author/juan-m.-banda/</link></image><item><title>MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks</title><link>https://hejiecui.com/publication/medhelm/</link><pubDate>Mon, 01 Dec 2025 00:00:00 -0800</pubDate><guid>https://hejiecui.com/publication/medhelm/</guid><description>&lt;p>Large language models (LLMs) achieve near-perfect scores on medical licensing exams, yet these benchmarks fail to capture the complexity of real-world clinical practice. &lt;strong>MedHELM&lt;/strong> addresses this gap by introducing a clinician-validated taxonomy of five categories, 22 subcategories, and 121 medical tasks; a suite of 37 benchmarks (including real-world EHR datasets); and an improved evaluation methodology leveraging an &lt;strong>LLM-jury&lt;/strong> approach.&lt;/p>
&lt;p>Our evaluation of nine frontier LLMs shows substantial variation in performance: reasoning models such as DeepSeek R1 and o3-mini achieve the highest win rates (66% and 64%, respectively), while existing automated metrics (e.g., ROUGE, BERTScore) underperform relative to the clinician-aligned LLM-jury evaluation (ICC = 0.47). Models perform best on Clinical Note Generation and Patient Communication, moderately on Medical Research Assistance and Clinical Decision Support, and worst on Administration &amp;amp; Workflow tasks.&lt;/p>
&lt;p>MedHELM establishes a standardized, extensible, and clinically grounded framework for benchmarking LLMs, offering critical insights for healthcare deployment and future medical AI development.&lt;/p></description></item></channel></rss>