Metaphorically speaking: a clustering-based exploration of metaphor identification in Dutch using transformer architectures

Lisa Hubin

KU Leuven

Metaphors are an integral part of our everyday language. Therefore, a language model would ideally be able to identify and interpret this kind of figurative speech. The aim of this research was to explore how a transformer architecture represents the difference between embeddings of literally and metaphorically used words, and whether this distinction could be captured through clustering methods. The experiment focused specifically on Dutch, using contextual word embeddings produced by BERTje. For a set of eight words, sentences were collected and annotated for literal or metaphorical use. Based on these sentences, contextual embeddings were generated and visualised per word in a two-dimensional space. Then, several clustering algorithms were tested to see how well their clusters corresponded with the literal and metaphorical labels. Specifically, the k-means algorithm was used, as well as agglomerative clustering with Ward, complete, average and single linkage. In the visualised embeddings, two general areas could be distinguished in all eight cases, one for the literal and one for the metaphorical embeddings. For some words, the two labels showed some overlap where the two areas meet, but the majority of embeddings with the same label always tended to stay on one side of the plot. This is a good indication that the difference between literal and metaphorical language use is encoded in the contextualised embeddings. Attempts to capture this distinction through clustering, however, showed varying levels of success depending on the word. Distances between embeddings appear to be based most often on formal or syntactic features of the word itself or of the context in which it is used. In some cases these features happen to be a good indication of literal or metaphorical use and the clustering works well, but in other cases they are not.
When using two clusters, the average Adjusted Rand Index (ARI) of the best-performing algorithm for each word was 0.410 and the average V-measure was 0.420. Overall, k-means gave the highest results, achieving an average ARI of 0.366 and an average V-measure of 0.368 across all words. Additionally, a dendrogram was made for the agglomerative clusters, and a silhouette analysis was used to establish the ideal number of clusters for each dataset. This showed that in cases where two clusters did not separate the literal from the metaphorical embeddings very well, a slightly larger number of clusters would often help to produce more homogeneous clusters. In conclusion, a visual distinction can be seen between the metaphorical and literal embeddings, but this distinction does not necessarily take the form of separate clusters.
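
The evaluation procedure described above can be illustrated with a minimal sketch. It assumes scikit-learn as the toolkit, which the abstract does not specify, and it replaces the BERTje embeddings with synthetic data; the function names (`synthetic_embeddings`, `compare_algorithms`, `best_k_by_silhouette`) and all parameter values are hypothetical and chosen only for illustration. The sketch compares k-means and the four agglomerative linkages against gold labels using ARI and V-measure, then uses the mean silhouette score to suggest a number of clusters.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score, v_measure_score


def synthetic_embeddings(n_per_label=40, dim=16, seed=0):
    """Hypothetical stand-in for the contextual embeddings of one target word.

    Two Gaussian blobs play the role of literal (0) and metaphorical (1) uses.
    Real embeddings would come from a model such as BERTje instead.
    """
    rng = np.random.default_rng(seed)
    literal = rng.normal(loc=0.0, scale=1.0, size=(n_per_label, dim))
    metaphorical = rng.normal(loc=3.0, scale=1.0, size=(n_per_label, dim))
    X = np.vstack([literal, metaphorical])
    y = np.array([0] * n_per_label + [1] * n_per_label)
    return X, y


def compare_algorithms(X, y, n_clusters=2):
    """Return {algorithm name: (ARI, V-measure)} for the five algorithms."""
    models = {"k-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=0)}
    for linkage in ("ward", "complete", "average", "single"):
        models[f"agglomerative ({linkage})"] = AgglomerativeClustering(
            n_clusters=n_clusters, linkage=linkage
        )
    scores = {}
    for name, model in models.items():
        predicted = model.fit_predict(X)
        # Both metrics are invariant to how the cluster ids are numbered.
        scores[name] = (
            adjusted_rand_score(y, predicted),
            v_measure_score(y, predicted),
        )
    return scores


def best_k_by_silhouette(X, k_values=range(2, 7)):
    """Pick the k with the highest mean silhouette score (k-means only)."""
    silhouettes = {}
    for k in k_values:
        predicted = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        silhouettes[k] = silhouette_score(X, predicted)
    return max(silhouettes, key=silhouettes.get), silhouettes


if __name__ == "__main__":
    X, y = synthetic_embeddings()
    for name, (ari, v) in compare_algorithms(X, y).items():
        print(f"{name:28s} ARI={ari:.3f}  V-measure={v:.3f}")
    best_k, _ = best_k_by_silhouette(X)
    print("k with highest silhouette:", best_k)
```

The synthetic blobs are far easier to separate than real contextual embeddings, so the scores this sketch prints say nothing about the results reported above; it only shows how the comparison could be set up.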