How Good is Your Wikipedia? Quality Estimation and Methods of Data Pruning for Non-English Wikipedias

Kushal Tatariya

KU Leuven

Artur Kulmizev

KU Leuven

Esther Ploeger

Aalborg University

Marcell Bollmann

Linköping University

Jiaming Luo

Google

Johannes Bjerva

Aalborg University

Miryam de Lhoneux

KU Leuven

Heather Lent

Aalborg University

Efforts in low-resource and multilingual NLP have traditionally employed Wikipedia as a "high quality" data source, either for language model pre-training or downstream task annotation. However, insights about the data quality of Wikipedia largely stem from the English Wiki, which enjoys a disproportionately large community of writers, editors, and administrators. In this project, we conduct a comprehensive evaluation of non-English Wikipedia articles and demonstrate that quality varies considerably across Wikis. For example, we show that approximately 10% of the Assamese Wiki contains non-Assamese text, with multiple articles written almost entirely in English. Going further, we experiment with a series of data selection measures that aim to cull high-quality articles from raw Wiki dumps. We demonstrate that models trained on such high-quality partitions perform similarly to, or better than, models trained on full Wikis, despite featuring fewer than half as many articles. Ultimately, our experiments highlight the importance of principled data curation for low-resource and multilingual settings in NLP.
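The non-Assamese measurement described above amounts to running a language identifier over each article and counting mismatches against the Wiki's expected language. The sketch below illustrates that idea using fastText's publicly available lid.176 language-ID model; the model path, the function names, and the article-level (rather than sentence-level) granularity are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: estimate the fraction of articles in a Wikipedia dump
# that are NOT written in the expected language. This is not the authors'
# pipeline; it uses fastText's public lid.176 model as a stand-in.
import fasttext

# Assumption: lid.176.bin has been downloaded from
# https://fasttext.cc/docs/en/language-identification.html
MODEL = fasttext.load_model("lid.176.bin")

def main_language(text: str) -> str:
    """Return the predicted ISO 639 code for an article's plain text."""
    # fastText predicts one line at a time, so collapse the article first.
    labels, _ = MODEL.predict(text.replace("\n", " "))
    return labels[0].removeprefix("__label__")

def non_target_fraction(articles: list[str], target: str = "as") -> float:
    """Fraction of articles whose predicted language differs from `target`
    (e.g. 'as' for Assamese)."""
    mismatches = sum(main_language(a) != target for a in articles)
    return mismatches / len(articles) if articles else 0.0
```

A finer-grained variant would score each sentence or paragraph separately, which catches articles that mix the target language with embedded English passages rather than being entirely off-language.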