CuraBench: A Benchmark Dataset Generation System for Healthcare AI Evaluation

Publication
KDD 2025 HealthDay Blue Sky Ideas Track

Ensuring that artificial intelligence (AI) tools in healthcare operate safely and effectively requires robust evaluation within realistic clinical contexts. Traditional evaluation methods often rely on standardized benchmarks that fail to capture the full complexity of patient care, while manually curating a dataset for a specific deployment scenario is time-consuming and limited in scope. We propose CuraBench, a configurable benchmark generation system that creates customized synthetic datasets tailored to specific clinical use cases. CuraBench’s taxonomy-driven, configurable approach supports diverse evaluation scenarios, from assessing how AI systems interpret longitudinal patient histories to evaluating clinical note summarization. By leveraging real-world healthcare data, CuraBench produces synthetic yet realistic scenarios configured to match the requirements of various medical settings, specialties, and patient demographics. Preliminary validation (TIMER) demonstrates that configurable benchmark generation can reveal evaluation biases that are undetectable with existing benchmarks. By streamlining the creation of comprehensive benchmark datasets, CuraBench represents a significant step toward responsible AI deployment, helping ensure that models are rigorously tested in environments that mirror their intended clinical use.
