Hejie Cui
Senior Research Scientist, Google

I am a Senior Research Scientist at Google, focusing on large language models and agentic AI systems. Previously, I was at Amazon Rufus (Foundation Models Team), where I worked on large-scale LLM post-training and alignment for Amazon's Shopping LLM. My research spans reinforcement learning, instruction fine-tuning, synthetic data generation, and evaluation methods for improving LLM reasoning, reliability, and controllability.

Before that, I was a Postdoctoral Researcher at Stanford University, advised by Prof. Nigam H. Shah and Prof. Sanmi Koyejo. I received my Ph.D. in Computer Science from Emory University, advised by Prof. Carl Yang, and my B.Eng. in Software Engineering from Tongji University as Valedictorian (GPA 4.9/5.0, Rank 1/164) with 3x National Scholarships. During undergrad, I worked with Prof. Tianwei Yu on machine learning research and interned at the Perk Lab with Prof. Gabor Fichtinger at Queen's University.

Curriculum Vitae

Education
  • Stanford University
    Postdoctoral Researcher
    2024 - 2025
  • Emory University
    Ph.D. in Computer Science
    2019 - 2024
  • Tongji University
    B.Eng. in Software Engineering
    2015 - 2019
Experience
  • Google
    Senior Research Scientist
    2026 - Present
  • Amazon Rufus (Foundation Models Team)
    Applied Scientist
    2025 - 2026
  • Microsoft Research
    Research Intern
    Summer 2023
  • Amazon
    Applied Scientist Intern
    Summer 2022
Honors & Awards
News
2026
T2PO was accepted to ICML 2026 as a Spotlight paper.
May 04
Joined Google as a Senior Research Scientist.
Apr 01
2025
OpenAI used MedHELM as a medical evaluation benchmark for clinical-scenario model evaluation.
MedHELM was published in Nature Medicine.
As proposal lead for CuraBench, I secured $100K in funding from Stanford RAISE Health and Stanford HAI.
2024
Served as Junior Chair for the Large Models and Multimodal AI Roundtable at ML4H 2024.
Selected as a Rising Star at the Michigan AI Symposium.
Oct 01
Our survey on LLM domain specialization was cited in the 2024 Economic Report of the President.
Selected Publications
T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Haixin Wang, Hejie Cui#, Chenwei Zhang, Xin Liu, Shuowei Jin, Shijie Geng, Xinyang Zhang, Nasser Zalmout, Zhenyu Shi, Yizhou Sun (# corresponding author)

The International Conference on Machine Learning (ICML) 2026 Spotlight

Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performance on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where po

CoMem: Context Management with A Decoupled Long-Context Model

Yuwei Zhang, Chengyu Dong, Shuowei Jin, Changlong Yu, Hejie Cui, Hongye Jin, Xinyang Zhang, Hamed Bonab, Colin Lockard, Jianshu Chen, Zhenyu Shi, Jingbo Shang, Xian Li, Bing Yin

The International Conference on Machine Learning (ICML) 2026

Context management enables agentic models to solve long-horizon tasks through iterative summarization of previous interaction histories. However, this process typically incurs substantial decoding overhead for the extra summarization tokens, which significantly affect the end-to-end response latency at deployment. In this paper, we introduce CoMem, a novel framework that decouples memory management from the primary a

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang

The ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Datasets and Benchmarks Track 2026 Oral

Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. EHRBench is an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. It constructs nearly 1M QA items spanning diagnosis, treatment, and prognosis, and benchmarks more than 30 representative LLMs to reveal actionable gaps toward clinically reliable LLM systems.

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Suhana Bedi*, Hejie Cui*, Miguel Fuentes*, Alyssa Unell*, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi Haredasht, Ivan Lopez, Asad Aali, Gabriel Tse, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, Ethan Goh, Dong-han Yao, Brian Soetikno, Eduardo Reis, Sergios Gatidis, Vasu Divi, Robson Capasso, Rachna Saralkar, Chia-Chun Chiang, Jenelle Jindal, Tho Pham, Faraz Ghoddusi, Steven Lin, Albert S. Chiou, Christy Hong, Mohana Roy, Michael F. Gensheimer, Hinesh Patel, Kevin Schulman, Dev Dash, Danton Char, Lance Downing, Francois Grolleau, Kameron Black, Bethel Mieso, Aydin Zahedivash, Wen-wai Yim, Harshita Sharma, Tony Lee, Hannah Kirsch, Jennifer Lee, Nerissa Ambers, Carlene Lugtu, Aditya Sharma, Bilal Mawji, Alex Alekseyev, Vicky Zhou, Vikas Kakkar, Jarrod Helzer, Anurang Revri, Yair Bannett, Roxana Daneshjou, Jonathan Chen, Emily Alsentzer, Keith Morse, Nirmal Ravi, Nima Aghaeepour, Vanessa Kennedy, Akshay Chaudhari, Thomas Wang, Sanmi Koyejo, Matthew P. Lungren, Eric Horvitz, Percy Liang, Michael A. Pfeffer, Nigam H. Shah (* equal contribution)

Nature Medicine (5-Year Impact Factor: 52.4) 2026

Large language models (LLMs) achieve near-perfect scores on medical licensing exams, yet these benchmarks fail to capture the complexity of real-world clinical practice. MedHELM addresses this gap by introducing a clinician-validated taxonomy of five categories, 22 subcategories, and 121 medical tasks; a suite of 37 benchmarks (including real-world EHR datasets); and an improved evaluation methodology leveraging an L

TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records

Hejie Cui*, Alyssa Unell*, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam H. Shah (* equal contribution)

npj Digital Medicine (5-Year Impact Factor: 17.0) 2025

Electronic health records (EHRs) contain rich longitudinal information essential for clinical decision-making, yet large language models (LLMs) struggle to reason across patient timelines. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a method to improve LLMs’ temporal reasoning over multi-visit EHRs through t

A Review on Knowledge Graphs for Healthcare: Resources, Applications, and Promises

Hejie Cui*, Jiaying Lu*, Ran Xu*, Shiyu Wang, Wenjing Ma, Yue Yu, Shaojun Yu, Xuan Kan, Chen Ling, Liang Zhao, Zhaohui S. Qin, Joyce C. Ho, Tianfan Fu, Jing Ma, Mengdi Huai, Fei Wang, Carl Yang (* equal contribution)

Journal of Biomedical Informatics (JBI) (IF: 4.5) 2025

Objective: This comprehensive review aims to provide an overview of the current state of Healthcare Knowledge Graphs (HKGs), including their construction, utilization models, and applications across various healthcare and biomedical research domains. Methods: We thoroughly analyzed existing literature on HKGs, covering their construction methodologies, utilization techniques, and applications in basic science researc

CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models

Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang

The International Conference on Machine Learning (ICML) 2025

Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLI

Biomedical Visual Instruction Tuning with Clinician Preference Alignment

Hejie Cui*, Lingjun Mao*, Xin Liang, Jieyu Zhang, Hui Ren, Quanzheng Li, Xiang Li, Carl Yang (* equal contribution)

The Conference on Neural Information Processing Systems (NeurIPS) 2024

Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not

Microstructures and Accuracy of Graph Recall by Large Language Models

Yanbang Wang, Hejie Cui, Jon Kleinberg

The Conference on Neural Information Processing Systems (NeurIPS) 2024 IC2S2 Oral

Graph data is crucial for many applications, and much of it exists in the relations described in textual format. As a result, being able to accurately recall and encode a graph described in earlier text is a basic yet pivotal ability that LLMs need to demonstrate if they are to perform reasoning tasks that involve graph-structured information. Human performance at graph recall has been studied by cognitive scient

Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting

Hejie Cui*, Xinyu Fang*, Zihan Zhang, Ran Xu, Xuan Kan, Xin Liu, Manling Li, Yangqiu Song, Carl Yang (* equal contribution)

The Conference on Neural Information Processing Systems (NeurIPS) 2023

Images contain rich relational knowledge that can help machines understand the world. Existing methods on visual knowledge extraction often rely on the pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we take a first exploration to a new paradigm of open visual knowledge extraction. To achieve this, we present

BrainGB: A Benchmark for Brain Network Analysis with Graph Neural Networks

Hejie Cui, Wei Dai, Yanqiao Zhu, Xuan Kan, Antonio Aodong Chen Gu, Joshua Lukemire, Liang Zhan, Lifang He, Ying Guo, Carl Yang

IEEE Transactions on Medical Imaging (5-Year Impact Factor: 12.3) 2022

Mapping the connectome of the human brain using structural or functional connectivity has become one of the most pervasive paradigms for neuroimaging analysis. Recently, Graph Neural Networks (GNNs) motivated from geometric deep learning have attracted broad interest due to their established power for modeling complex networked data. Despite their superior performance in many fields, there has not yet been a systemat

Brain Network Transformer

Xuan Kan, Wei Dai, Hejie Cui, Zilong Zhang, Ying Guo, Carl Yang

The Conference on Neural Information Processing Systems (NeurIPS) 2022 Spotlight

Human brains are commonly modeled as networks of Regions of Interest (ROIs) and their connections for the understanding of brain functions and mental disorders. Recently, Transformer-based models have been studied over different types of data, including graphs, shown to bring performance gains widely. In this work, we study Transformer-based models for brain network analysis. Driven by the unique properties of data,

On Positional and Structural Node Features for Graph Neural Networks on Non-attributed Graphs

Hejie Cui, Zijie Lu, Pan Li, Carl Yang

The ACM International Conference on Information and Knowledge Management (CIKM) 2022 Most Influential CIKM Paper of 2022

Graph neural networks (GNNs) have been widely used in various graph-related problems such as node classification and graph classification, where the superior performance is mainly established when natural node features are available. However, it is not well understood how GNNs work without natural node features, especially regarding the various ways to construct artificial ones. In this paper, we point out the two ty

All publications