Large language models (LLMs) achieve near-perfect scores on medical licensing exams, yet these benchmarks fail to capture the complexity of real-world clinical practice. MedHELM addresses this gap by introducing a clinician-validated taxonomy of five categories, 22 subcategories, and 121 medical tasks; a suite of 37 benchmarks (including real-world EHR datasets); and an improved evaluation methodology leveraging an LLM-jury approach.
Electronic health records (EHRs) contain rich longitudinal information essential for clinical decision-making, yet large language models (LLMs) struggle to reason across patient timelines. We introduce \textbf{TIMER} (\textbf{T}emporal \textbf{I}nstruction \textbf{M}odeling and \textbf{E}valuation for Longitudinal Clinical \textbf{R}ecords), a method to improve LLMs' temporal reasoning over multi-visit EHRs through time-aware instruction tuning.
Objective: This comprehensive review aims to provide an overview of the current state of Healthcare Knowledge Graphs (HKGs), including their construction, utilization models, and applications across various healthcare and biomedical research domains.
Mapping the connectome of the human brain using structural or functional connectivity has become one of the most pervasive paradigms in neuroimaging analysis. Recently, Graph Neural Networks (GNNs) motivated by geometric deep learning have attracted broad interest due to their established power for modeling complex networked data.