Automated Correction of Error Patterns in Dutch Child Speech: Leveraging Transformer Architecture for Enhanced Accuracy

Camille Lavigne, Alex Stasica

Utrecht University

The study of spontaneous speech has significant linguistic implications across various domains (Schmid et al., 2016). However, researchers face considerable challenges, such as the time-consuming process of manual transcription and analysis (Odijk, 2021) and the inherently untidy nature of spontaneous speech, particularly in language acquisition contexts. The SASTA project seeks to automate this process as far as possible while accommodating the diverse nature of spontaneous language. Although tools exist for subsequent analyses (e.g., syntactic parsers), their accuracy is hindered by deviantly pronounced, and therefore deviantly transcribed, words in speakers’ utterances.

In response, Stasica et al. (2024) began work combining BERTje (De Vries et al., 2019) with the Levenshtein distance to explore the automatic, context-sensitive correction of these deviant words in the Dutch CHILDES corpora (MacWhinney, 2000). That study reported 17% successful corrections, 3% incorrect corrections, and 80% of words left uncorrected. It thus preferred a small but reliable gain in accuracy over a larger gain accompanied by more wrong corrections (i.e., it favored precision over recall).
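The trade-off in these figures can be made concrete with a little arithmetic. The sketch below assumes that the three reported percentages partition all deviant words, and it uses "coverage" (the share of words for which any correction was attempted) as a stand-in for recall; both the function name and this reading of the figures are illustrative assumptions, not definitions taken from Stasica et al. (2024).

```python
def precision_and_coverage(correct, incorrect, untouched):
    """Summarise a correction run from three counts (or percentages).

    precision: share of attempted corrections that were right.
    coverage:  share of all deviant words for which a correction was attempted.
    Assumes the three inputs partition the full set of deviant words.
    """
    attempted = correct + incorrect
    total = attempted + untouched
    precision = correct / attempted if attempted else 0.0
    coverage = attempted / total if total else 0.0
    return precision, coverage


# Figures reported above: 17% correct, 3% incorrect, 80% left uncorrected.
precision, coverage = precision_and_coverage(17, 3, 80)
print(f"precision = {precision:.2f}, coverage = {coverage:.2f}")
```

Under these assumptions, the system attempts a correction for one word in five, and most of those attempts are right.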

The present study investigates correcting these deviations with a Transformer architecture (Vaswani et al., 2017), which has previously been used for translation and spelling correction. Whereas Stasica et al. (2024) employed an encoder-only architecture, our approach uses an encoder-decoder model. Our objective is the same as in that work: to transform error-ridden transcriptions into ones containing only valid Dutch words. Initial data analysis revealed that certain errors are too distant from the target words to be corrected by our model. Consequently, we concentrate on the most prevalent error patterns: the 100 most frequent patterns (out of the approximately 3000 found), constituting only 1% of patterns, account for over 80% of total errors. Furthermore, most errors involve deletions, followed in frequency by substitutions, which guides our focus toward enhanced reliability.
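The coverage claim above (a small set of frequent patterns accounting for most errors) is the kind of figure that can be checked with a simple frequency count. The following is a minimal sketch, assuming each observed error has already been labelled with a pattern string; the labels and the toy data are hypothetical and are not drawn from the CHILDES analysis.

```python
from collections import Counter


def coverage_of_top_patterns(patterns, k):
    """Fraction of all observed errors covered by the k most frequent patterns."""
    counts = Counter(patterns)
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(k))
    return top / total if total else 0.0


# Hypothetical toy data: one label per observed error.
observed = ["del:t"] * 6 + ["sub:k>t"] * 3 + ["ins:e"]
for k in (1, 2, 3):
    print(k, coverage_of_top_patterns(observed, k))
```

Applied to the real pattern inventory, the same function with `k=100` would give the share of errors covered by the 100 most frequent patterns.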

To identify these patterns, we employ a variant of the Levenshtein distance: we compute its distance matrix and apply Dijkstra's algorithm to find the shortest path through the matrix, from which the error patterns are reconstructed. We assessed our algorithm's accuracy by comparison with the CHILDES references, which encompass word replacements, explanations, and incomplete word forms. Preliminary testing on a small subset yielded 60% accuracy, an improvement over Stasica et al. (2024) that showcases the potential of the Transformer architecture. However, our current approach relies solely on the CHILDES dataset and does not yet use data augmentation. We plan to incorporate data augmentation to improve on these preliminary results, though obtaining suitable datasets, particularly ones featuring children's speech, poses a challenge. Among the datasets under consideration is ChiSCor (van Dijk et al., 2023), which contains fantasy stories freely told by 442 Dutch children.
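The pattern-identification step described above can be illustrated with a small sketch. It treats the Levenshtein grid as a graph whose edges are single-character deletions, insertions, substitutions, or free matches, and runs Dijkstra's algorithm from the top-left to the bottom-right cell to recover the cheapest sequence of edits. This is one plausible reading of the method, not the authors' implementation: the unit costs, the operation labels, and the example word pairs are assumptions made for illustration.

```python
import heapq


def edit_operations(target, produced):
    """Return (distance, ops) describing how `produced` deviates from `target`.

    Nodes are cells (i, j) of the Levenshtein grid; Dijkstra finds the
    cheapest path from (0, 0) to (len(target), len(produced)).
    Each op is a tuple (kind, target_char, produced_char).
    """
    n, m = len(target), len(produced)
    start, goal = (0, 0), (n, m)
    dist = {start: 0}
    prev = {}  # node -> (previous node, op or None for a match)
    heap = [(0, start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if d > dist[(i, j)]:
            continue  # stale heap entry
        steps = []
        if i < n and j < m:
            if target[i] == produced[j]:
                steps.append(((i + 1, j + 1), 0, None))  # match, no cost
            else:
                steps.append(((i + 1, j + 1), 1, ("sub", target[i], produced[j])))
        if i < n:
            steps.append(((i + 1, j), 1, ("del", target[i], "")))
        if j < m:
            steps.append(((i, j + 1), 1, ("ins", "", produced[j])))
        for node, cost, op in steps:
            nd = d + cost
            if nd < dist.get(node, float("inf")):
                dist[node] = nd
                prev[node] = ((i, j), op)
                heapq.heappush(heap, (nd, node))
    # Walk back from the goal, keeping only the non-match steps.
    ops = []
    node = goal
    while node != start:
        node, op = prev[node]
        if op is not None:
            ops.append(op)
    ops.reverse()
    return dist[goal], ops


# Hypothetical example: the target word "fiets" produced as "fies".
print(edit_operations("fiets", "fies"))
```

Counting the resulting operation tuples over a corpus would give a frequency table of error patterns. When several shortest paths exist, this sketch returns whichever one Dijkstra reaches first, so a real pipeline would need an explicit tie-breaking rule.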