Healthcare AI

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Large language models (LLMs) achieve near-perfect scores on medical licensing exams, yet these benchmarks fail to capture the complexity of real-world clinical practice. MedHELM addresses this gap by introducing a clinician-validated taxonomy of 5 categories, 22 subcategories, and 121 medical tasks; a suite of 37 benchmarks (including real-world EHR datasets); and an improved evaluation methodology based on an LLM-jury approach.
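The LLM-jury idea is that several judge models each rate a response and their scores are aggregated, rather than trusting a single grader. A minimal sketch of such aggregation is below; the judge functions here are hypothetical stand-ins (in practice each would be a call to a different LLM grader), and the scoring rubric is an assumption, not MedHELM's actual implementation.

```python
from statistics import mean
from typing import Callable, List

def jury_score(response: str, judges: List[Callable[[str], float]]) -> float:
    """Aggregate per-judge ratings (each in [0, 1]) into a single mean score."""
    return mean(judge(response) for judge in judges)

# Hypothetical stand-in judges; real jurors would be LLM API calls
# prompted with a clinical grading rubric.
judges = [
    lambda r: 1.0 if "aspirin" in r.lower() else 0.0,  # crude fact check
    lambda r: min(len(r.split()) / 50, 1.0),           # completeness proxy
    lambda r: 0.8,                                     # fixed style-rating stub
]

score = jury_score("Recommend low-dose aspirin for secondary prevention.", judges)
```

Averaging is only one aggregation choice; majority vote or a weighted mean over juror reliabilities are common alternatives.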

CuraBench: A Benchmark Dataset Generation System for Healthcare AI Evaluation

Ensuring that artificial intelligence (AI) tools in healthcare operate safely and effectively requires robust evaluation within realistic clinical contexts. Traditional evaluation methods often rely on standardized benchmarks that fail to capture the full complexity of patient care, while manually curating a dataset for a specific deployment scenario is time-consuming and difficult to scale.