Exploring the Lack of Cultural Diversity in Multilingual Datasets for Conversational AI
Lea Krause
Vrije Universiteit Amsterdam
Ana Valdivia
University of Oxford
In an era where AI and NLP are reshaping how people interact with technology worldwide, multilingual datasets matter more than ever. As AI is applied in domains ranging from healthcare to border control, the ability of these systems to understand, interpret, and respond in multiple languages has become essential. However, while multilingual datasets have become a cornerstone of NLP, their development often has a critical shortcoming: many datasets are created by simply translating content from English into other languages, disregarding the worldview inadvertently captured when that content was created (Rogers, 2021). This approach retains the English-centric cultural context, leading to datasets that are only superficially representative of other languages and to AI systems that might be linguistically diverse but culturally monolithic (Ponti et al., 2020; Ghafoor et al., 2021). Furthermore, this method underutilises AI's potential for accurate cross-cultural interaction, glossing over the rich cultural and linguistic nuances that define different speakers (Nguyen et al., 2020; Berger & Packard, 2021; Shwartz, 2022).
Research focusing on multilingual datasets often aims to bridge linguistic gaps by adapting existing datasets to new languages (Bornea et al., 2021); however, common sources like SQuAD (Rajpurkar et al., 2016) and TriviaQA (M. Joshi et al., 2017) are predominantly Anglo-centric. Although web crawling offers a path to more authentic data, issues with data quality persist (Kreutzer et al., 2022). The need for more comprehensive multilingual datasets is underscored by the current lack of coverage and the wide global applicability of language models (Esma Wali et al., 2020; P. Joshi et al., 2020; Yu et al., 2022). Cultural awareness, which is crucial for inclusive treatment in fields such as medicine (Seibert et al., 2002), also plays a significant role in NLP. Ignoring cultural context can degrade system performance (Goldfarb-Tarrant et al., 2023; Lee et al., 2023), especially in tasks requiring deep semantic and pragmatic understanding, such as machine translation or dialogue systems (Waseem et al., 2021). Recent work, such as Hershcovich et al. (2022) on challenges and strategies in Cross-Cultural NLP and the growing focus on cultural sensitivity in multimodal reasoning (F. Liu et al., 2021; Ponti et al., 2020; Yin et al., 2021), shows that the field is making progress. These studies emphasise practices like adaptation, a technique frequently used in machine translation (Nord, 1994), to generate data that accurately represents under-resourced cultures (Zhi Li & Yin Zhang, 2023).
Our research builds on these foundational works by examining 16 popular multilingual datasets used in question-answering and dialogue systems. We focus on whether and how these datasets incorporate cultural dimensions, and on how culture is defined in their development. Many of the papers introducing these datasets reveal a clear oversight: they either omit a definition of culture entirely or offer only vague, high-level descriptions without critically analysing cultural factors. We further analyse the datasets' topic coverage, comparing them with English-language sources as well as local data sources, to investigate whether they cover topics relevant to the speakers of the languages they include.