Leveraging Linguistic Typology for Fairer Multilingual Language Technology

Esther Ploeger

Aalborg University

Wessel Poelman

KU Leuven

Andreas Holck Høeg-Petersen

Aalborg University

Anders Schlichtkrull

Aalborg University

Miryam de Lhoneux

KU Leuven

Johannes Bjerva

Aalborg University

The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. To extend them, a growing body of research aims to enhance the generalizability of multilingual models across languages. Language sampling plays a key role in this, from data selection to modeling and evaluation. Language sampling in NLP is done in several ways: based on language family, geographic area or linguistic typology. All are commonly used to motivate 'diverse' language samples, which are then used to imply that language models generalize across a broad range of human languages. However, there are no set definitions, best practices or common metrics that justify or validate these claims. Skewed language selection can lead to overestimated multilingual performance, so a common understanding of these concepts is crucial for improving multilingual language modeling.

Our recent work addresses these issues in two ways: 1) we highlight shortcomings of 'typological diversity' claims in NLP research and 2) we develop a principled framework that allows NLP researchers to systematically select languages with diverse typological characteristics, supporting more informed decisions about data collection, training, and evaluation in multilingual NLP.
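
As a rough illustration of what systematic, typology-based language selection could look like, the following is a minimal sketch under stated assumptions, not the framework itself. It assumes each language is described by a binary vector of typological features and greedily picks the language that is farthest (by Hamming distance) from those already selected. The language names, feature vectors, and the function names `hamming`, `maxmin_sample` and `mean_pairwise_distance` are all hypothetical and chosen for illustration only.

```python
from itertools import combinations


def hamming(u, v):
    """Fraction of feature positions at which two vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def maxmin_sample(features, k):
    """Greedily select k languages that are far apart in feature space.

    Assumption: `features` maps a language name to an equal-length
    binary feature vector. This is a generic farthest-point heuristic,
    used here only as an example of a principled selection procedure.
    """
    names = sorted(features)
    if not 2 <= k <= len(names):
        raise ValueError("k must be between 2 and the number of languages")
    # Seed the sample with the most distant pair of languages.
    selected = list(
        max(
            combinations(names, 2),
            key=lambda pair: hamming(features[pair[0]], features[pair[1]]),
        )
    )
    # Repeatedly add the language whose nearest selected neighbour is farthest.
    while len(selected) < k:
        remaining = [n for n in names if n not in selected]
        best = max(
            remaining,
            key=lambda n: min(hamming(features[n], features[s]) for s in selected),
        )
        selected.append(best)
    return selected


def mean_pairwise_distance(features, sample):
    """Average distance over all pairs in a sample: one possible diversity score."""
    pairs = list(combinations(sample, 2))
    return sum(hamming(features[a], features[b]) for a, b in pairs) / len(pairs)


# Toy, made-up feature vectors (not real typological data).
toy_features = {
    "lang_a": [0, 0, 0, 0],
    "lang_b": [0, 0, 0, 1],
    "lang_c": [1, 1, 1, 1],
    "lang_d": [1, 1, 0, 0],
    "lang_e": [0, 1, 1, 1],
}

sample = maxmin_sample(toy_features, 3)
score = mean_pairwise_distance(toy_features, sample)
```

Reporting a score such as the mean pairwise distance alongside a language sample is one way a 'diversity' claim could be made checkable; the choice of features and distance measure remains an assumption that would need to be justified for any real study.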