Watch Your Vocabulary More Than Your Pre-training Data

R. Kinds, S. Abdi, D. Timmer, S. van Loon, T. Caselli

University of Groningen

Language Models (LMs) have revolutionized Natural Language Processing. It is known that the success of LMs largely depends on three main factors: the Transformer architecture and the self-attention mechanism; the availability of large amounts of data for pre-training; and the training objective(s). Additionally, the size of LMs plays a role. Empirical evidence has shown that a valid strategy to improve the performance of these models is scaling up: increasing the number of layers, the number of attention heads, the embedding size, the vocabulary size, and the total number of parameters. Nevertheless, there are still issues that are far from being solved. Bacco et al. (2023) observed an instability of encoder-based LMs when further pre-trained. More recently, Petty et al. (2024) investigated the impact of depth (i.e., number of layers) and width (i.e., number of attention heads) of encoder-based LMs when controlling for the total number of parameters, showing that when the total size of the model is constrained, increasing its depth is not an optimal way to improve its performance.
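
As a back-of-the-envelope illustration of this trade-off (our own approximation, not taken from Petty et al., 2024), the sketch below estimates the parameter count of a base-sized encoder under the usual default hyper-parameters; bias terms and layer norms are ignored. It makes visible how much of a fixed parameter budget the embedding matrix absorbs as the vocabulary grows.

```python
# Approximate parameter count of a BERT/RoBERTa-base style encoder, split into
# the embedding matrix (driven by vocabulary size) and the Transformer layers
# (driven by depth and width). Bias terms and layer norms are ignored, so the
# totals are rough estimates only.

def approx_encoder_params(vocab_size, hidden=768, layers=12, max_positions=512, ffn=3072):
    embeddings = vocab_size * hidden + max_positions * hidden  # token + position embeddings
    attention = 4 * hidden * hidden                            # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn                            # up- and down-projection
    return embeddings + layers * (attention + feed_forward)

# Under a fixed budget, a larger vocabulary leaves less room for extra depth or width.
for vocab in (30_000, 40_000, 50_000):
    print(f"vocab {vocab:,}: ~{approx_encoder_params(vocab) / 1e6:.0f}M parameters")
```

At hidden size 768, growing the vocabulary from 30k to 50k tokens adds roughly 15M embedding parameters, about the size of two extra Transformer layers.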
In this contribution we compare four monolingual LMs for Dutch on offensive and abusive language detection using the DALC corpus (Caselli et al., 2022). In particular, we compare a BERT-based architecture, BERTje, against three RoBERTa-based models, namely RobBERT-v2, RobBERT-2022, and RobBERT-2023. All the models correspond to the base versions, sharing the same depth and width. The differences concern: the pre-training objective(s), with the RoBERTa-based models using only MLM; the size of the vocabulary, with BERTje having only 30k tokens and the RobBERT LMs ranging from 40k to 50k; and the pre-training data, with the RobBERT LMs using different versions of the OSCAR corpus (of increasing size) and BERTje a more controlled dataset composed of Wikipedia, news articles, and books. BERTje, RobBERT-v2, and RobBERT-2023 are all trained from scratch, while RobBERT-2022 is a further pre-trained version (with a larger vocabulary) of RobBERT-v2. In addition, RobBERT-2023 adopts the Tik-to-Tok approach, where token embeddings are initialized using the English RoBERTa model. Given the commonalities of these models, we expected the LM with the biggest vocabulary and the largest pre-training data to be the best performing once fine-tuned. However, this is not the case. For offensive language, RobBERT-v2 and RobBERT-2022 outperform BERTje (macro-F1 0.816 and 0.818 vs. 0.799), while RobBERT-2023 lags behind (macro-F1 0.775). The same behavior is observed for abusive language, where RobBERT-2023 obtains a macro-F1 of 0.693, a negative Δ of 0.09 against RobBERT-v2, 0.11 against RobBERT-2022, and 0.02 against BERTje.
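
For reference, the vocabulary sizes mentioned above can be inspected directly with the Hugging Face transformers library. This is a minimal sketch; the checkpoint identifiers are the publicly released ones we assume correspond to the four models.

```python
# Minimal sketch: inspect the tokenizer vocabulary sizes of the four Dutch LMs.
# The checkpoint names are assumed Hugging Face identifiers for the public releases.
from transformers import AutoTokenizer

checkpoints = {
    "BERTje": "GroNLP/bert-base-dutch-cased",
    "RobBERT-v2": "pdelobelle/robbert-v2-dutch-base",
    "RobBERT-2022": "DTAI-KULeuven/robbert-2022-dutch-base",
    "RobBERT-2023": "DTAI-KULeuven/robbert-2023-dutch-base",
}

for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    print(f"{name:>13}: {len(tokenizer):,} tokens in the vocabulary")
```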
Contrary to the expectation that larger vocabularies and more extensive pre-training data yield superior results, our analysis reveals additional dynamics at play. While RobBERT-2023 has the largest vocabulary and pre-training corpus, it underperforms compared to RobBERT-v2 and RobBERT-2022. Surprisingly, BERTje, pre-trained on a smaller and more focused dataset, also competes favorably. These findings underscore the need for nuanced considerations in LM development, beyond sheer scale, to refine model design and enhance applicability across diverse linguistic contexts.