Leveraging Linguistic Typology for Fairer Multilingual Language Technology
Esther Ploeger
Aalborg University
Wessel Poelman
KU Leuven
Andreas Holck Høeg-Petersen
Aalborg University
Anders Schlichtkrull
Aalborg University
Miryam de Lhoneux
KU Leuven
Johannes Bjerva
Aalborg University
The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. To extend them, a growing body of research aims to enhance the generalizability of multilingual models across languages. Language sampling plays a key role in this, from data selection to modeling and evaluation. Language sampling in NLP is done in several ways: based on language family, geographic area, or linguistic typology. All are commonly used to motivate 'diverse' language samples, which are then used to imply generalization of language models across a broad range of human languages. However, there are no set definitions, best practices, or common metrics that justify or validate these claims. Skewed language selection can lead to overestimated multilingual performance, so it is crucial to have a common understanding of these concepts to improve multilingual language modeling.
Our recent work addresses these issues in two ways: 1) we highlight shortcomings of 'typological diversity' claims in NLP research and 2) we develop a principled framework that allows NLP researchers to systematically select languages with diverse typological characteristics, supporting more informed decisions about data collection, training, and evaluation in multilingual NLP.