Examining Boundaries: Recontextualizing Dutch Syllabification Algorithms
Gus Lathouwers, Helmer Strik, Wieke Harmsen, Catia Cucchiarini
Radboud University
Syllabification is the task of dividing a word into syl-la-bles. Although the task seems simple, training an algorithm to perform it with high accuracy remains a challenge, because automation requires recognizing complex language patterns. However, the practical applications of syllable information, such as text-to-speech and language recognition (e.g., Diaz-Asper et al., 2022), make it a worthwhile pursuit.
For the Dutch language, a number of language-specific and non-language-specific algorithms have been developed over the years. However, research comparing the relative efficacy of these algorithms is generally dated or does not address the nuances of their behavior. Furthermore, researchers have occasionally failed to take features unique to the Dutch language into account when refining algorithms (e.g., Trogkanis and Elkan, 2010; Bartlett, 2007). One distinction that is often blurred in the literature is that between word hyphenation and syllabification: the former breaks words according to Dutch spelling rules, whereas the latter breaks them according to phonological conventions.
The current research seeks to (a) establish the accuracy of the different syllabification algorithms that exist for the Dutch language, (b) optimize algorithm parameters where possible, and (c) compare how well the algorithms are suited to different data types. Historically, algorithms have often been tested on dictionary word sets, yet results obtained on such sets may generalize poorly to real-life datasets. To that end, algorithm accuracy is here examined on three different data types, namely (1) traditional dictionary words (CELEX, n=29375), (2) loanwords from other languages (Van der Sijs, 2005; n=1135), and (3) Dutch pseudowords (CHOREC, n=99).
Algorithms of different origins were reimplemented, or their patterns were adjusted. They include a Dutch language-specific algorithm (Brandt Corstius, 1970), a popular data-driven algorithm that enjoys widespread use (Liang, 1983), and a modern algorithm built on sequence labeling (Conditional Random Fields, CRF; Trogkanis and Elkan, 2010). Data was also gathered on how a modern AI chatbot, namely ChatGPT, fares at the task of syllabification.
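To illustrate how a pattern-based approach in the style of Liang (1983) operates, the following is a minimal sketch, not the implementation used in this study. Patterns are letter strings with digits marking the gaps between letters; for each gap in a word the highest digit from any matching pattern wins, and an odd value permits a break while an even value inhibits it. The function names and the three toy patterns are hypothetical and chosen only for this example; real pattern sets are learned from data and also enforce minimum fragment lengths, which this sketch omits.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'a2s' into its letters and its gap scores.

    scores[k] is the score of the gap just before letters[k];
    the final entry is the gap after the last letter.
    """
    letters = ""
    scores = [0]
    for ch in pattern:
        if ch.isdigit():
            scores[-1] = int(ch)
        else:
            letters += ch
            scores.append(0)
    return letters, scores


def hyphenate(word, patterns):
    """Insert '-' at every gap whose highest matching score is odd."""
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    gap_scores = [0] * (len(padded) + 1)  # gap_scores[i]: gap before padded[i]
    for pattern in patterns:
        letters, scores = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, score in enumerate(scores):
                position = start + offset
                gap_scores[position] = max(gap_scores[position], score)
            start = padded.find(letters, start + 1)

    pieces = []
    current = ""
    for i, ch in enumerate(word):
        # The gap before word[i] is the gap before padded[i + 1].
        if i > 0 and gap_scores[i + 1] % 2 == 1:
            pieces.append(current)
            current = ""
        current += ch
    pieces.append(current)
    return "-".join(pieces)


# Hypothetical toy patterns, for illustration only.
TOY_PATTERNS = ["1s", "s1t", "a2s"]
print(hyphenate("kasten", TOY_PATTERNS))  # → kas-ten
```

In this toy run, "1s" proposes a break before the s, but "a2s" overrides it with a higher even score, so only the break proposed by "s1t" survives.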
Results show that the algorithms respond differently to different data types. Overall, the data-driven algorithms perform best (99.4% of words syllabified correctly for CRF, 98.5% for Liang). However, on one dataset (pseudowords), the Dutch-specific algorithm shares the best performance with CRF (98.0% of words syllabified correctly for both Brandt Corstius and CRF). All algorithms struggle significantly more with the syllabification of loanwords than of dictionary words (mean rate of words syllabified correctly: 80.7%). The AI chatbot shows considerable inconsistency in applying any ruleset when syllabifying the given words.
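The rate of words syllabified correctly reported above is a word-level measure: a word counts as correct only if every boundary matches the reference. A minimal sketch of such a measure follows, assuming syllabifications are written as hyphen-separated strings. The function name and the four example words are hypothetical and are not taken from the datasets used here.

```python
def word_accuracy(predicted, gold):
    """Return the fraction of words whose syllabification matches the reference exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty word list")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data, for illustration only.
gold = ["wa-ter", "lo-pen", "kas-ten", "com-pu-ter"]
predicted = ["wa-ter", "lop-en", "kas-ten", "com-pu-ter"]
print(word_accuracy(predicted, gold))  # → 0.75
```

Because a single misplaced boundary makes the whole word count as wrong, this measure is stricter than counting correct boundaries individually.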
The current research aims to provide a reference point for future syllabification efforts for the Dutch language. This exploratory study should offer guidance on the weaknesses of algorithms when applied to different data types, as well as general insight into how they can best be used in the analysis of the Dutch language.
References:
Bartlett, E. (2007). A discriminative approach to automatic syllabification [Unpublished master's thesis]. University of Alberta.
Brandt Corstius, H. (1970). Exercises in computational linguistics [Doctoral dissertation]. University of Amsterdam.
Diaz-Asper, M., Holmlund, T. B., Chandler, C., Diaz-Asper, C., Foltz, P. W., Cohen, A. S., & Elvevåg, B. (2022). Using automated syllable counting to detect missing information in speech transcripts from clinical settings. Psychiatry Research, 315, 114712.
Liang, F. M. (1983). Word hy-phen-a-tion by com-put-er [Doctoral dissertation]. Stanford University.
Van der Sijs, N. (2005). Van Dale Groot Leenwoordenboek. De invloed van andere talen op het Nederlands. Van Dale Lexicografie. https://www.dbnl.org/tekst/sijs002groo01_01/
Trogkanis, N., & Elkan, C. (2010). Conditional random fields for word hyphenation. In J. Hajič, S. Carberry, S. Clark, & J. Nivre (Eds.), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 366–374). Association for Computational Linguistics.