Comparing and Contrasting Modality Disparity in Multimodal Sarcasm Detection

Wu Chi Hsiu

University of Antwerp

Aaron Maladry

LT3, Ghent University

Veronique Hoste

LT3, Ghent University

Multimodality in NLP is lauded for its potential to enrich traditional text-based approaches in cases where audio or visual cues are salient, such as emotion and sarcasm detection. Yet, despite recent advances in visual and audio NLP, the textual modality often remains disproportionately dominant in multimodal settings. In the MUSTARD++ paper [1], for example, the audio and video modalities not only underperform as unimodal inputs, but also struggle to substantially sway their textual counterpart in trimodal emotion and sarcasm detection tasks. The heterogeneous nature of audio and video, however, is often overlooked, and a more fine-grained approach is needed to truly capitalize on the salience of multimodal data. This paper investigates the importance of the three modalities for conveying sarcasm in MUSTARD++ (video clips of sarcastic/non-sarcastic utterances from popular television shows), and how well each modality is captured by a multimodal system.

We start from the manual annotations of the relative importance of each modality in the test set, highlighting how each modality contributes to different facets of emotion and sarcasm within an utterance. We then examine this through a multimodal model that uses the self-supervised learning framework Data2Vec as feature extractor. Meta's Data2Vec applies a common learning strategy to the textual, audio, and visual modalities to generate modality-agnostic, contextualized representations [2]. We hypothesize that the modality gap of traditional approaches, which stems from combining features produced by modality-specific expert models with vastly different architectures, can be reduced by providing a shared basis on which cross-modal representations can be learned. We then perform ablation studies across the various combinations of the three modalities (audio, vision, and text); a minimal sketch of this setup is given below. Both standard performance metrics and a manual error analysis of individual samples are used to determine whether the unified learning objective yields a more encompassing representation of emotion and sarcasm, or whether certain modalities still fall short in particular cases.
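To illustrate this setup, the sketch below extracts utterance-level Data2Vec embeddings per modality with Hugging Face Transformers and fuses them for a downstream sarcasm classifier. The checkpoint names, mean-pooling, and fusion by concatenation are illustrative assumptions and do not reflect the exact pipeline used in our experiments; dropping entries from the `use` argument mimics the modality ablations.

    # Illustrative sketch only: utterance-level Data2Vec features per modality,
    # mean-pooled and concatenated for a downstream sarcasm classifier.
    import torch
    from transformers import (AutoTokenizer, Data2VecTextModel,
                              AutoProcessor, Data2VecAudioModel,
                              AutoImageProcessor, Data2VecVisionModel)

    text_tok = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
    text_enc = Data2VecTextModel.from_pretrained("facebook/data2vec-text-base")
    audio_proc = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")
    audio_enc = Data2VecAudioModel.from_pretrained("facebook/data2vec-audio-base-960h")
    image_proc = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")
    vision_enc = Data2VecVisionModel.from_pretrained("facebook/data2vec-vision-base")

    @torch.no_grad()
    def embed_utterance(transcript, waveform, frames, use=("text", "audio", "vision")):
        """Return a fused feature vector; drop entries from `use` to ablate modalities."""
        parts = []
        if "text" in use:
            tok = text_tok(transcript, return_tensors="pt", truncation=True)
            parts.append(text_enc(**tok).last_hidden_state.mean(dim=1))      # (1, hidden)
        if "audio" in use:
            aud = audio_proc(waveform, sampling_rate=16_000, return_tensors="pt")
            parts.append(audio_enc(**aud).last_hidden_state.mean(dim=1))     # (1, hidden)
        if "vision" in use:
            pix = image_proc(frames, return_tensors="pt")                    # sampled video frames
            per_frame = vision_enc(**pix).last_hidden_state.mean(dim=1)      # pool over patches
            parts.append(per_frame.mean(dim=0, keepdim=True))                # pool over frames
        return torch.cat(parts, dim=-1)  # input to a simple sarcasm classifier

Concatenation is only one simple fusion strategy; the unimodal and bimodal ablations correspond to restricting `use` to the respective subsets of modalities.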

[1] Anupama Ray, Shubham Mishra, Apoorva Nunna, and Pushpak Bhattacharyya. 2022. A Multimodal Corpus for Emotion Recognition in Sarcasm. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6992–7003, Marseille, France. European Language Resources Association.
[2] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. In Proceedings of the 39th International Conference on Machine Learning, pages 1298–1312. PMLR.