Word Sense Discrimination using French Transformer Models
Stef Accou, Tim Van de Cruys
KU Leuven
We investigate unsupervised word sense discrimination techniques for French monolingual transformer models, specifically FlauBERT and CamemBERT. The study explores two families of methods: clustering of contextualized embeddings and lexical substitution-based approaches. The first approach applies Principal Component Analysis (PCA) to the representations of target sentences containing homonymous instances of a target word and then clusters the reduced representations. The second, inspired by the work of Amrami & Goldberg, builds sparse vectors from model-predicted substitutes and their probabilities, and clusters these vectors. Lastly, we apply an enhanced lexical substitution method, informed by Zhou (2019) and specifically designed for BERT models. This approach applies dropout to embeddings instead of masking the target word. It keeps substitutes semantically coherent with the original word and limits semantic change by comparing the contextualized embeddings of the substitute words with the original contextualized embedding of the target word in the sentence.
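The substitute-based representation described above can be illustrated with a minimal sketch. It assumes each target word instance has already been mapped to a dictionary of model-predicted substitutes and their probabilities. The substitutes, probabilities, and helper names (`substitute_vector`, `cosine`) are hypothetical and chosen for illustration only; they do not come from the study.

```python
from math import sqrt


def substitute_vector(substitutes, vocab_index):
    """Turn {substitute: probability} into a sparse {dimension: weight} vector."""
    vec = {}
    for word, prob in substitutes.items():
        if word in vocab_index:
            dim = vocab_index[word]
            vec[dim] = vec.get(dim, 0.0) + prob
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[dim] for dim, weight in u.items() if dim in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical substitutes for three instances of "avocat"
# (two in the "lawyer" sense, one in the "avocado" sense).
instances = [
    {"juriste": 0.5, "notaire": 0.3, "juge": 0.2},
    {"juge": 0.6, "juriste": 0.4},
    {"fruit": 0.7, "légume": 0.3},
]

# One dimension per distinct substitute seen across all instances.
all_words = sorted({word for inst in instances for word in inst})
vocab_index = {word: i for i, word in enumerate(all_words)}

vectors = [substitute_vector(inst, vocab_index) for inst in instances]
sim_same = cosine(vectors[0], vectors[1])  # instances sharing substitutes
sim_diff = cosine(vectors[0], vectors[2])  # instances with disjoint substitutes
```

Instances of the same sense tend to share substitutes and therefore end up close to each other in this sparse space, which is what makes the vectors suitable input for a clustering step.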
A set of 11 target words is defined to include nouns and verbs that differ in their degree of homonymy and polysemy. The evaluation uses two datasets. The first is a relatively small, curated dataset (approximately 100 sentences for each of the 11 target words) in which the different senses of the homonymous words are represented. It has been manually annotated with a sense label for each target word occurrence and serves as the gold standard across all methods. The second dataset augments the gold standard with sentences from a web crawl corpus; it is larger and noisier. We compare different methods for estimating the number of clusters and choose the Bayesian Information Criterion (BIC) to induce that number automatically. A Gaussian Mixture Model (GMM) is then fitted with the number of clusters determined by the BIC. Because there is no consensus on how best to compare word sense discrimination algorithms, all methods and models are evaluated against the gold standard using a variety of hard-clustering metrics.
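The BIC-based choice of the number of clusters can be sketched as follows. This is a minimal illustration using scikit-learn's `GaussianMixture`, not the study's actual pipeline: the synthetic two-dimensional data, the candidate range of 1 to 6 components, the default full covariance type, and the helper name `select_gmm_by_bic` are all assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, max_components=6, seed=0):
    """Fit GMMs with 1..max_components components; return the lowest-BIC model."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=seed)
        gmm.fit(X)
        bic = gmm.bic(X)  # lower BIC is better
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model


# Synthetic stand-in for (reduced) embeddings: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

model = select_gmm_by_bic(X)
labels = model.predict(X)  # hard cluster assignment for each instance
```

Taking `predict` rather than the posterior probabilities gives one cluster label per instance, which matches the hard-clustering evaluation against manually assigned sense labels.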
The findings reveal that FlauBERT generally outperforms CamemBERT on less noisy datasets when clustering contextualized embeddings. CamemBERT, however, is more robust to noisy data and shows less performance degradation across datasets. We find that integrating Zhou’s (2019) approach to lexical substitution into an otherwise unchanged replication of Amrami & Goldberg’s (2014) method yields the most promising results. Its drawback is a significantly higher computational cost, because multiple contextualized embeddings must be computed for each target sentence. In addition, the randomness introduced by the dropout of embeddings requires multiple runs to mitigate the effect of chance, which makes the method less suitable for larger datasets.