MedRoBERTa.nl: The Dutch Medical Language Model

Vrije Universiteit Amsterdam
2020-2025

Modeling Hospital Notes from Electronic Health Records

MedRoBERTa.nl is one of only two encoder models worldwide that have been pre-trained on free text from real-world hospital data and are publicly available. The anonymized model gives NLP researchers and medical professionals a solid base for building medical text-mining technology: it can be fine-tuned for any task. We have published several papers evaluating the model as well as our anonymization strategy; a list of publications can be found at the end of this page.

Project Description

Electronic Health Records (EHRs) contain notes written by medical professionals of all kinds about every aspect of a patient's well-being. When adequately processed with a Large Language Model (LLM), this enormous source of information can be analysed quantitatively, which can lead to new insights, for example in treatment development or in patterns of patient recovery. However, the language used in clinical notes is highly idiosyncratic, and available generic LLMs have not encountered it during pre-training. They have therefore not internalized an adequate representation of the semantics of this data, which is essential for building reliable Natural Language Processing (NLP) software.

In our project, a collaboration between the Vrije Universiteit Amsterdam and the Amsterdam Medical Centers, we developed the first domain-specific LLM for Dutch EHRs. When released in 2021, it was the first encoder LLM worldwide to be pre-trained on real-world hospital data and published open-source. In our research, we discuss in detail why and how we built the model, how we pre-trained it on EHR notes using different strategies, and how thorough anonymization allowed us to publish it publicly. Our papers report extensive evaluations of the model, comparing it to various other LLMs. Since its publication, the model has been implemented in projects funded by several Dutch hospitals and studied further by other academics; our latest paper synthesizes a subset of these studies as well.
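As a minimal sketch of what fine-tuning the published checkpoint looks like in practice: the snippet below loads the pre-trained encoder with a fresh classification head via Hugging Face `transformers`. The Hub identifier `CLTL/MedRoBERTa.nl`, the example sentence, and the two-label setup are assumptions for illustration, not details taken from this page.

```python
# Sketch: adapting MedRoBERTa.nl to a downstream classification task.
# Assumptions: the checkpoint is hosted on the Hugging Face Hub as
# "CLTL/MedRoBERTa.nl" and `transformers` + `torch` are installed.

MODEL_ID = "CLTL/MedRoBERTa.nl"  # assumed Hub identifier


def build_classifier(num_labels: int = 2):
    """Load the pre-trained encoder with a new (untrained) classification head."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = build_classifier(num_labels=2)
    # Hypothetical Dutch clinical-style sentence, tokenized for the model.
    batch = tokenizer(
        "Patient is mobiel en ADL-zelfstandig.", return_tensors="pt"
    )
    outputs = model(**batch)
    print(outputs.logits.shape)  # one row of logits per input sentence
```

From here, the usual `transformers` training loop (or the `Trainer` API) applies; only the labeled task data is specific to the clinical use case.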

Poster

Publications

Verkijk, S., & Vossen, P. (2021, December). MedRoBERTa.nl: A language model for Dutch electronic health records. Computational Linguistics in the Netherlands Journal, 11, 141-159.

Verkijk, S., & Vossen, P. (2022, June). Efficiently and thoroughly anonymizing a transformer language model for Dutch electronic health records: a two-step method. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 1098-1103).

Kim, J., Verkijk, S., Geleijn, E., van der Leeden, M., Meskers, C., Meskers, C., ... & Widdershoven, G. (2022, June). Modeling Dutch medical texts for detecting functional categories and levels of COVID-19 patients. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 4577-4585).

Verkijk, S., & Vossen, P. (2025). Creating, anonymizing and evaluating the first medical language model pre-trained on Dutch Electronic Health Records: MedRoBERTa.nl. Artificial Intelligence in Medicine, 103148.

BibTeX

@article{verkijk2025creating,
  title={Creating, anonymizing and evaluating the first medical language model pre-trained on Dutch Electronic Health Records: MedRoBERTa.nl},
  author={Verkijk, Stella and Vossen, Piek},
  journal={Artificial Intelligence in Medicine},
  pages={103148},
  year={2025},
  publisher={Elsevier}
}