F(r)ietje: An efficient, transparent LLM for Dutch

Bram Vanroy

KU Leuven; Instituut voor de Nederlandse Taal

In this work, I introduce "F(r)ietje"*, a new large language model (LLM) tailored specifically to Dutch. Unlike state-of-the-art models such as GEITje 7B Ultra, which has 7 billion parameters, Fietje is based on the phi-2 architecture and has just 2.7 billion parameters. It was further pretrained on a filtered corpus of 28B Dutch tokens, then specialized to follow instructions, and finally aligned with AI feedback. It is intended to make Dutch LLMs more accessible while still scoring well on benchmarks.

Despite its more compact size, early versions of Fietje achieve impressive results on standard benchmarks, closely approaching the performance of models more than twice its size. For instance, in question answering (squad-nl) and sentiment analysis (dutch-social), Fietje performs on par with the highest-scoring 7B Dutch model (within the margin of error).

While these results are welcome, the goal of Fietje is not to top leaderboards. Its standout feature is its efficiency. Its smaller size should improve access to powerful Dutch LLMs for researchers, industry, and hobbyists alike, making it much easier to deploy LLMs on their own devices, even a laptop, without the need for expensive infrastructure. To this end, optimised GGUF versions are made available alongside integrations with tools such as ollama.
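As an illustration of how lightweight such deployment could be, the sketch below loads an instruction-tuned checkpoint with the Hugging Face transformers library and generates a Dutch reply. This is a minimal sketch rather than official usage instructions: the repository name BramVanroy/fietje-2-instruct is an assumption, and the final checkpoint names may differ (see the footnote on the model name).

```python
# Minimal sketch: running a Fietje checkpoint with Hugging Face transformers.
# The repository name "BramVanroy/fietje-2-instruct" is an assumption and
# may differ from the final released checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BramVanroy/fietje-2-instruct"  # hypothetical repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 2.7B parameters fit on a consumer GPU
    device_map="auto",           # requires the accelerate package
)

# Build a chat-style prompt and generate a short answer in Dutch.
messages = [{"role": "user", "content": "Wat is de hoofdstad van Nederland?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For fully local, CPU-only use, the GGUF builds can instead be run through tools such as ollama or llama.cpp, as mentioned above.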

Committed to accelerating research on Dutch LLMs and the adoption of open-source LLMs over closed ones, I release the Fietje models under the Apache 2.0 license. Furthermore, all training data, code, and logs are open, and all evaluations are reproducible.

---

* The working name is “Fietje” (from phi-2: fie-tje), but my phone once autocorrected “fietje” to “frietje”, so I have asked my network which name they prefer. The final model name is therefore subject to change.