Comparing LLMs with Commercial Machine Translation Systems for Metaphor Translation

Aletta Dorst, Alina Karakanta

Leiden University Centre for Linguistics, Leiden University

Literary translation has been called the last bastion of human translation, long thought to resist machine translation (MT), despite a growing number of studies obtaining promising results for different genres and languages (e.g. Green et al., 2010; Voigt & Jurafsky, 2012; Besacier, 2014; Toral & Way, 2015a, 2015b, 2018). However, no studies have focused specifically on metaphor, which is pervasive in fiction and non-fiction alike (e.g. Steen et al., 2010a, 2010b). The studies by Steen et al. show that 13.7% of the words in a sample of approx. 40,000 words of fiction were related to metaphor, compared to 16.4% in news, 18.5% in academic texts, and 7.7% in casual conversation. Subsequently, Dorst (2011, 2015) showed that, while fiction may contain fewer metaphor-related words than news and academic discourse, it contains a higher percentage of explicit, deliberate and creative ones. In a follow-up study focusing on metaphor in literary machine translation, Dorst (2023) postulates that it is not the creative metaphors that are problematic for machines, but the conventional and idiomatic ones. Currently, it is unclear whether the perceived quality of literary MT or the errors observed are related to particular types or uses of metaphor.

Given recent developments in MT, not least those brought about by large language models (LLMs) (Kocmi et al., 2023), our current project investigates (i) how different types of metaphor are translated by different systems, (ii) how literary translators react to machine-translated metaphors during post-editing, and (iii) how readers react to machine-translated metaphors when reading short excerpts. This study presents the results of the first phase, in which we determine (1) how suitable commercial state-of-the-art MT systems (neural machine translation and LLMs) are for translating metaphors and (2) how different types of metaphor are translated.

A test set was created containing four English fiction texts from the VUAMC corpus (Steen et al., 2010c), manually aligned to their published Dutch translations (482 sentences, ~6,700 words). The test set was translated automatically using Google Translate, DeepL and ChatGPT (GPT-4). ChatGPT was given a simple prompt of the form: ‘Translate the following sentences from English into Dutch (NL)’. The outputs were evaluated for general translation quality against the human reference using automatic metrics (BLEU [Papineni et al., 2002], chrF [Popović, 2015], TER [Snover et al., 2006] and COMET [Rei et al., 2020]). In addition, the systems’ fluency and accuracy in translating different types of metaphor are assessed via human evaluation.
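For illustration, the prompting step could be scripted as in the following minimal sketch, assuming the openai Python package; the model identifier, decoding settings and the function name translate_batch are our own assumptions, not details reported in the study.

    # Hypothetical sketch of the plain translation prompt sent to ChatGPT.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def translate_batch(sentences):
        """Send a batch of English sentences with the simple prompt quoted above."""
        prompt = "Translate the following sentences from English into Dutch (NL):\n"
        prompt += "\n".join(sentences)
        response = client.chat.completions.create(
            model="gpt-4",  # assumed; the abstract only says "ChatGPT (GPT-4)"
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # assumption: deterministic output for reproducibility
        )
        return response.choices[0].message.content

The automatic evaluation can likewise be sketched with off-the-shelf implementations, assuming the sacrebleu and unbabel-comet packages; the checkpoint name corresponds to Rei et al. (2020), and the toy sentences below are our own example of a conventionalized metaphor translated literally, not items from the test set.

    from sacrebleu.metrics import BLEU, CHRF, TER
    from comet import download_model, load_from_checkpoint

    sources    = ["He kicked the bucket."]        # English source (toy example)
    hypotheses = ["Hij schopte tegen de emmer."]  # literal MT output
    references = ["Hij ging dood."]               # published translation

    # String-based metrics: corpus-level scores against the single reference.
    for metric in (BLEU(), CHRF(), TER()):
        print(metric.corpus_score(hypotheses, [references]))

    # Neural metric: COMET also conditions on the source sentence.
    model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-da"))
    data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]
    print(model.predict(data, batch_size=8, gpus=0).system_score)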
In terms of general translation quality, we observe a discrepancy between the string-based automatic MT quality metrics (BLEU, chrF and TER), which rate DeepL significantly higher than the other engines, and the neural metric COMET, which favours Google Translate. ChatGPT, despite high expectations for creative domains, performs significantly worse, at least without more sophisticated prompting. A preliminary manual analysis confirms that all three systems translate metaphors directly in most instances, leading to incorrect, unidiomatic and sometimes meaningless translations for many conventionalized metaphors. It also suggests that DeepL outperforms both Google Translate and ChatGPT when it comes to idiomatic metaphors, occasionally producing fluent translations that are indistinguishable from the published human translation.
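Significance claims of this kind are commonly established with paired bootstrap resampling (Koehn, 2004). As a hedged illustration, and not necessarily the exact test used in this study, such a test can be implemented on top of sacrebleu as follows; the function name paired_bootstrap and the variables deepl_out, google_out and dutch_refs are hypothetical.

    import random
    from sacrebleu.metrics import BLEU

    def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=12345):
        """Fraction of resampled test sets on which system A outscores system B (BLEU)."""
        rng = random.Random(seed)
        bleu = BLEU(effective_order=True)  # robust for short resampled corpora
        n = len(refs)
        wins_a = 0
        for _ in range(n_samples):
            # Resample sentence indices with replacement, keeping the pairing intact.
            idx = [rng.randrange(n) for _ in range(n)]
            sample_a = [hyps_a[i] for i in idx]
            sample_b = [hyps_b[i] for i in idx]
            sample_r = [refs[i] for i in idx]
            score_a = bleu.corpus_score(sample_a, [sample_r]).score
            score_b = bleu.corpus_score(sample_b, [sample_r]).score
            wins_a += int(score_a > score_b)
        return wins_a / n_samples

    # Hypothetical usage: p ~ 1 - paired_bootstrap(deepl_out, google_out, dutch_refs);
    # a value of p close to 0 would support "DeepL significantly higher" on BLEU.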