Benchmarking Beyond Translation: The Intrinsically Dutch Compound Noun Challenge for LLMs
Rik van Noord
University of Groningen
Given the ever-increasing number and corresponding capabilities of large language models, accurately benchmarking them is becoming increasingly important. As usual, there are ample resources available for English, but surprisingly few for other languages. Dutch, for example, is currently benchmarked using translations of English benchmarks, such as MMLU, ARC and TruthfulQA. While this generally works well, these benchmarks conflate a model's ability to solve problems with its ability to understand and generate Dutch. We argue that there is a need for challenging benchmarks developed specifically for a single language, which meet the following conditions: (i) the task is inherently non-translatable to English (or any other language), (ii) it is challenging for state-of-the-art LLMs, (iii) it remains doable for humans, and (iv) it requires generating Dutch text from an open vocabulary.