GPT-NL: Ethical AI development for the Dutch language

Dominique Blok

TNO

Erik de Graaf

TNO

GPT-NL is a publicly funded initiative to develop a sovereign, transparent, and ethically driven Dutch Large Language Model (LLM). The project prioritizes ethical and legal compliance, striving to adhere to copyright law, the GDPR, and the upcoming AI Act at every stage of development, from data collection to model training. Many other LLM initiatives rely on scraped datasets such as Common Crawl, which often do not meet these stringent requirements; GPT-NL instead puts a strong focus on data cleaning, aiming for full compliance and ethical integrity. A further priority of GPT-NL is to ensure social responsibility and to minimize the model's bias. Given the limited availability of Dutch data, GPT-NL will leverage techniques such as oversampling, transfer learning, high-quality data selection, and data synthesis to make the most of the available resources. These approaches help overcome data scarcity and ensure that the resulting model is both robust and beneficial to Dutch society.