Multilingual Definition Modeling for Neologisms
Tanmay Khokle
KU Leuven
Tim Van de Cruys
Centre for Computational Linguistics, KU Leuven
Kris Heylen
Instituut voor de Nederlandse Taal
Dictionaries are essential resources for formally defining the fundamental elements of human language: words. Human lexicographers invest significant time manually crafting dictionary entries, and as language evolves, the task of defining new words and meanings continually expands. This work explores how transformer-based language models can automate the generation of semantically coherent definitions and thereby ease this labor-intensive process. A key focus is the automatic generation and evaluation of definitions for neologisms. This requires not only producing syntactically correct definitions but also drawing on reasoning and world knowledge to infer the meaning of unfamiliar words or symbols from context. In a multicultural and multilingual society, definition generation should be inclusive and applicable across languages. Languages often borrow elements from each other, a nuance that is difficult to capture with a restricted vocabulary. This thesis focuses on English and Dutch words, using a multilingual encoder-decoder model within a text-to-text framework known for capturing semantic information. Building on existing definition modeling research, this work applies state-of-the-art language models to the challenging subtask of modeling new words. It lays the foundation for curating relevant datasets and developing training and evaluation techniques, with the ultimate aim of creating a tool that makes lexicographers more efficient.