How Good is Your Wikipedia? Quality Estimation and Methods of Data Pruning for Non-English Wikipedias
Kushal Tatariya
KU Leuven
Artur Kulmizev
KU Leuven
Esther Ploeger
Aalborg University
Marcel Bollmann
Linköping University
Jiaming Luo
Johannes Bjerva
Aalborg University
Miryam de Lhoneux
KU Leuven
Heather Lent
Aalborg University
Efforts in low-resource and multilingual NLP have traditionally employed Wikipedia as a "high-quality" data source, whether for language model pre-training or downstream task annotation. However, insights about the data quality of Wikipedia stem largely from the English Wiki, which enjoys a disproportionately large community of writers, editors, and administrators. In this project, we conduct a comprehensive evaluation of non-English Wikipedia articles and demonstrate that quality varies considerably from Wiki to Wiki. For example, we show that approximately 10% of the Assamese Wiki contains non-Assamese text, with multiple articles written almost entirely in English. Going further, we experiment with a series of data selection measures that aim to cull high-quality articles from raw Wiki dumps. We demonstrate that models trained on such high-quality partitions perform comparably to, or better than, models trained on full Wikis, despite the partitions containing fewer than half as many articles. Ultimately, our experiments highlight the importance of principled data curation for low-resource and multilingual settings in NLP.