Evaluating the State-of-the-Art Automatic Speech Recognition systems for Dutch

Dragoș Alexandru Bălan

University of Twente

Automatic Speech Recognition (ASR) technology has advanced rapidly over the past few years, to the point where a single model can be used for multiple languages without additional fine-tuning. However, no systematic research has yet evaluated the performance of these novel ASR models on Dutch under different speech conditions. This research addresses that gap by evaluating different versions of Whisper, MMS, and a fine-tuned version of XLS-R on Dutch. The models are compared with each other and with a baseline, Kaldi_NL, which has been fine-tuned on Dutch. The datasets used are the N-Best 2008 evaluation corpus, which contains broadcast news and conversational telephone speech from the Netherlands and Belgium, and JASMIN-CGN, a corpus of Dutch speech from native children and elderly speakers, as well as non-native children and adults. The results show that version "large-v2" of Whisper with Voice Activity Detection (VAD) performs best in nearly all categories. It achieves relative word error rate (WER) improvements over the baseline of 20%-38% for N-Best and 14%-65% for JASMIN-CGN. In contrast, version "large-v3" with VAD performs worse overall than "large-v2" with VAD, with relative WER deteriorations of 34%-147% and 13%-60% for the read and conversational speech subsets of JASMIN-CGN, respectively. On N-Best, Whisper "large-v3" shows a relative WER degradation of 5%-23%, with the exception of Flemish conversational telephone speech, where it performs a relative 2% better. Aside from Whisper, XLS-R, which has been fine-tuned on the Dutch subset of Common Voice, performs better overall than the baseline and MMS. MMS is among the worst-performing systems, together with Whisper "large-v3". These results showcase the adaptability and accuracy of Whisper "large-v2", which performs best overall when compared with similar state-of-the-art models and with the baseline.
Future research stemming from this work could fine-tune Whisper for a specific topic or for a group of speakers that is less well represented, making ASR more focused or more inclusive. Additionally, evaluating a large variety of ASR systems has established points of reference that can be used to develop and evaluate newer models for the Dutch language.
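
The relative WER improvements and deteriorations quoted above can be illustrated with a minimal sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and a relative change computed against a reference system's WER; the function names, the example sentences, and the numbers are hypothetical and are not taken from the evaluation itself, which may also apply text normalization not shown here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, system_wer: float) -> float:
    """Relative WER change in percent; positive means the system beats the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion in six words.
    print(wer("de kat zit op de mat", "de kat zat op mat"))
    # Hypothetical WERs (in percent) for a baseline and a candidate system.
    print(relative_wer_improvement(20.0, 15.0))
```

Under this convention, a negative value corresponds to a relative deterioration, which is how a system can be reported as more than 100% worse than its reference.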