TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records

Overview of TIMER. TIMER enhances model performance through instruction tuning with timestamp-linked instruction-response pairs generated across longitudinal EHR timelines. Our evaluation employs both clinician-curated benchmarks and a controlled sampling strategy to create instruction sets with varying temporal distributions, enabling assessment of how models reason across different time periods in patient histories.
Publication
In Submission

Electronic health records (EHRs) contain rich longitudinal information essential for clinical decision-making, yet large language models (LLMs) struggle to reason across patient timelines. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a method to improve LLMs’ temporal reasoning over multi-visit EHRs through time-aware instruction tuning. TIMER grounds LLMs in patient-specific temporal contexts by linking each instruction-response pair to specific timestamps, ensuring temporal fidelity throughout the training process. Evaluations show that TIMER-tuned models outperform conventional medical instruction-tuned approaches by 6.6% in completeness on clinician-curated benchmarks, with distribution-matched training demonstrating advantages up to 6.5% in temporal reasoning evaluation. Qualitative analyses reveal that using TIMER enhances temporal boundary adherence, trend detection, and chronological precision, which are necessary for applications such as disease trajectory modeling and treatment response monitoring. Overall, TIMER provides a methodological basis for developing LLMs that can effectively engage with the inherently longitudinal nature of data derived from patient care.
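To make the idea of timestamp-linked instruction-response pairs concrete, the following is a minimal illustrative sketch and not the TIMER implementation. The names `Visit`, `TimedPair`, `is_grounded`, and `relative_position` are hypothetical, as is the assumption that each pair records the visit dates it draws on. It shows one way a pair could be checked against a patient timeline and placed along it, which is the kind of information a temporally controlled sampling strategy would need.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Visit:
    """One dated encounter in a patient's record (hypothetical structure)."""
    when: date
    note: str


@dataclass(frozen=True)
class TimedPair:
    """An instruction-response pair tied to the visit dates it relies on."""
    instruction: str
    response: str
    evidence_dates: Tuple[date, ...]


def is_grounded(pair: TimedPair, timeline: Sequence[Visit]) -> bool:
    """True if every evidence date matches a visit that exists in the timeline."""
    visit_dates = {v.when for v in timeline}
    return all(d in visit_dates for d in pair.evidence_dates)


def relative_position(pair: TimedPair, timeline: Sequence[Visit]) -> float:
    """Position of the pair's latest evidence: 0.0 = first visit, 1.0 = last visit."""
    start = min(v.when for v in timeline)
    end = max(v.when for v in timeline)
    span_days = (end - start).days
    if span_days == 0:
        # Single-day timeline: every pair sits at the same point.
        return 0.0
    latest = max(pair.evidence_dates)
    return (latest - start).days / span_days
```

Under these assumptions, binning pairs by `relative_position` would give a simple way to build instruction sets skewed toward early, middle, or recent parts of a patient history.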
