Towards the Extraction of Event Logs from Dutch Medical Reports Using Large Language Models

Allmin Susaiyah, Natalia Sidorova

Eindhoven University of Technology

The Dutch Medical Treatments Contracts Act requires medical professionals to keep detailed records of their patients. These records include medical and nursing reports, which can be used as input for process analytics technologies to uncover useful insights and help optimize medical personnel workload, detect anomalies, and ensure compliance with regulatory standards. However, this input needs to come in the form of event logs instead of unstructured text data. An event log contains records of the activities carried out during a process, and the information about their order.
Various efforts have been made to extract event logs from textual data, drawing from information extraction and natural language processing disciplines. However, these approaches often rely on rule-based systems or supervised learning algorithms, each with limitations regarding adaptability and the need for extensive training data. Moreover, challenges persist due to varied mentions of activities, ambiguous resolutions of these mentions, and non-chronological ordering of mentions within text reports. Additionally, the scarcity of Dutch-language tools exacerbates the issue, as existing methodologies are predominantly tailored for English texts.
In this preliminary study, we explore the potential of large language models (LLMs) for extracting event logs from text data while addressing these challenges. LLMs have gained considerable attention for their strong performance across diverse natural language processing tasks in numerous languages. We propose leveraging LLMs for three main tasks: (1) detecting relevant medical activities in Dutch textual medical reports, (2) disambiguating multiple mentions of the same activities, and (3) establishing the partial order of these activities' occurrences.
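To make the notion of a partial order in task (3) concrete, the following minimal sketch represents precedence constraints between activities and derives one compatible sequence for an event log. The activity names and the `precedes` mapping are hypothetical illustrations, not data or output from this study, and the sketch assumes duplicate mentions have already been merged as in task (2).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical example: each activity maps to the set of activities
# that must occur before it. Activities not related by this mapping
# are left unordered, which is what makes the order partial.
precedes = {
    "administer_medication": {"measure_blood_pressure"},
    "change_dressing": {"measure_blood_pressure"},
    "write_report": {"administer_medication", "change_dressing"},
}


def linearize(predecessors):
    """Return one total order that respects every precedence constraint."""
    return list(TopologicalSorter(predecessors).static_order())


def must_precede(predecessors, first, second):
    """Return True if `first` is required, directly or transitively, before `second`."""
    stack, seen = [second], set()
    while stack:
        node = stack.pop()
        for pred in predecessors.get(node, ()):
            if pred == first:
                return True
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return False


trace = linearize(precedes)
```

Here `administer_medication` and `change_dressing` remain unordered with respect to each other, so more than one trace is consistent with the same report.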
Our proposed framework systematically guides large language models through various prompting methods, including zero-shot, one-shot, and multi-shot settings, as well as prompting them to generate coherent chains of thought. We outline our proposed methodology and validation strategy, and present preliminary results.
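The prompting settings named above differ mainly in how many labeled examples precede the query and in whether the model is asked to reason before answering. The sketch below shows one way such prompts could be assembled; the wording of the instructions, the function name `build_prompt`, and the class labels and example sentences are assumptions made for illustration and are not the prompts used in this study.

```python
def build_prompt(sentence, classes, examples=(), chain_of_thought=False):
    """Assemble a prompt for activity detection in one sentence.

    examples: sequence of (sentence, labels) pairs. An empty sequence gives
    a zero-shot prompt, one pair a one-shot prompt, several a multi-shot prompt.
    """
    lines = [
        "You are given a sentence from a Dutch nursing note.",
        "Decide which of these activity classes it mentions: "
        + ", ".join(classes) + ".",
        "Reply with a comma-separated list of classes, or 'none'.",
    ]
    if chain_of_thought:
        lines.append(
            "Explain your reasoning step by step before giving the final list."
        )
    for example_sentence, example_labels in examples:
        lines.append("")
        lines.append("Sentence: " + example_sentence)
        lines.append("Answer: " + (", ".join(example_labels) or "none"))
    lines.append("")
    lines.append("Sentence: " + sentence)
    # The closing cue tells the model what to produce first.
    lines.append("Reasoning:" if chain_of_thought else "Answer:")
    return "\n".join(lines)


# Hypothetical class labels and invented sentences, for illustration only.
CLASSES = ["medication", "wound_care"]
SHOTS = [
    ("Mevrouw heeft paracetamol gekregen.", ["medication"]),
    ("Verband verwisseld.", ["wound_care"]),
]
QUERY = "Wond gespoeld en pijnstilling gegeven."

zero_shot = build_prompt(QUERY, CLASSES)
one_shot = build_prompt(QUERY, CLASSES, SHOTS[:1])
multi_shot = build_prompt(QUERY, CLASSES, SHOTS)
with_reasoning = build_prompt(QUERY, CLASSES, SHOTS, chain_of_thought=True)
```

The resulting strings would be passed to whichever model is being evaluated; the call to the model is deliberately left out because it depends on the provider.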
We use a labeled set of sentences from Dutch nursing notes to assess our framework. Our evaluation centers on detecting mentions of five predefined classes of activities. We report the system's performance in terms of precision, recall, and F1-score. Additionally, we conduct ablation studies to assess framework performance with and without certain prompting methods. We also analyze qualitative examples of cases in which the model performed well and cases in which it performed poorly.
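For reference, precision, recall, and F1-score for sentence-level activity detection can be computed per class as sketched below. This assumes each sentence carries a set of gold labels and a set of predicted labels; the class names `A` and `B` and the toy labels are placeholders, not the study's classes or results.

```python
from collections import Counter


def per_class_scores(gold, predicted, classes):
    """Compute (precision, recall, F1) for each class.

    gold, predicted: lists of label sets, one set per sentence, aligned by index.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_labels, predicted_labels in zip(gold, predicted):
        for label in classes:
            in_gold = label in gold_labels
            in_pred = label in predicted_labels
            if in_gold and in_pred:
                tp[label] += 1
            elif in_pred:
                fp[label] += 1
            elif in_gold:
                fn[label] += 1

    scores = {}
    for label in classes:
        found = tp[label] + fp[label]      # everything the system predicted
        relevant = tp[label] + fn[label]   # everything it should have found
        precision = tp[label] / found if found else 0.0
        recall = tp[label] / relevant if relevant else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy placeholder labels for three sentences.
gold = [{"A"}, {"A", "B"}, set()]
predicted = [{"A"}, {"B"}, {"A"}]
scores = per_class_scores(gold, predicted, ["A", "B"])
```

In this toy input, class `A` has one true positive, one false positive, and one false negative, so its precision, recall, and F1-score are all 0.5, while class `B` is detected perfectly.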
The validation is currently limited to individual sentences rather than entire notes, and therefore misses contextual cues that could boost the language model's performance. Long contexts, however, present their own challenge, and finding the optimal amount of context is ongoing work. Additionally, this study is constrained by the time and cost of running a large language model, which exceed those of conventional methods.
Future work will focus on improving mention disambiguation and ordering. We envision our framework opening new possibilities for process analytics on Dutch textual data, with potential applications extending beyond nursing notes.