Automatic Metadiscourse Classification in Learners’ Writing and Speaking with SpanBERT

Wenwen Guan, Marijn Alta, Jelke Bloem

University of Amsterdam

Metadiscourse (MD) is a rhetorical strategy used mainly to achieve interpersonal goals rather than to add to the propositional content of utterances. It encompasses verbal expressions that highlight textual structure (textual categories) and engage with the readers or listeners of the discourse (interactional categories). Its classification is in essence multi-label span classification, a challenging task in which span boundaries are flexible and spans may overlap. Automatic MD classification has rarely been investigated. A few studies on similar linguistic phenomena have explored supervised approaches, including supervised sequence models (Madnani et al., 2012), Support Vector Machines (Cotos & Pendar, 2016), a joint Continuous Bag-of-Words and Convolutional Neural Network model (Alharbi, 2016), and a combination of Support Vector Machines and Conditional Random Fields (Correia, 2018). In general, these approaches yielded satisfactory accuracy on textual categories but performed poorly on interactional categories. Large Language Models (LLMs) have not yet been tested on MD classification tasks. We therefore aim to improve automatic MD classification using a pretrained LLM, to benefit both corpus linguistics research and text analysis applications such as chatbots and online proofreaders.
We hypothesise that an LLM pretrained to process spans will perform well on span classification tasks. We therefore use SpanBERT, which is pretrained to better represent and predict text spans (Joshi et al., 2020). For the experiment, a dataset of 602,386 tokens was extracted from the International Corpus Network of Asian Learners of English (ICNALE) and manually annotated with 23 MD categories. The data comprise spoken and written production from participants ranging from low to high English proficiency, so the texts contain non-standard English and occasional errors. We test both the base-cased and large-cased SpanBERT models with different dataset splits and hyperparameter configurations.
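As an illustration of this modelling setup, a span classifier can be built on top of SpanBERT by pooling the contextual embeddings of a candidate span's boundary tokens and scoring all 23 categories independently. The sketch below is a minimal, assumed setup rather than our exact implementation; the HuggingFace Hub checkpoint name, the use of BERT's cased tokenizer (whose vocabulary SpanBERT shares), and the boundary-token pooling are assumptions of the sketch.

```python
# Minimal sketch (assumed setup, not necessarily the exact implementation described above):
# score a candidate span for each MD category by pooling SpanBERT's contextual
# embeddings of the span's boundary tokens.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

NUM_CATEGORIES = 23  # MD categories in the annotation scheme

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")          # shared cased vocabulary
encoder = AutoModel.from_pretrained("SpanBERT/spanbert-base-cased")   # assumed checkpoint name

class SpanClassifier(nn.Module):
    """Classifies a span via the concatenated embeddings of its first and last tokens."""
    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(2 * encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, span_start, span_end):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0))
        boundary = torch.cat([hidden[batch_idx, span_start],
                              hidden[batch_idx, span_end]], dim=-1)
        return self.head(boundary)  # one logit per MD category

model = SpanClassifier(encoder, NUM_CATEGORIES)
enc = tokenizer("In my opinion, smoking should be banned.", return_tensors="pt")
# Token indices 1..3 cover "In my opinion" after the [CLS] token.
logits = model(enc["input_ids"], enc["attention_mask"],
               span_start=torch.tensor([1]), span_end=torch.tensor([3]))
probs = torch.sigmoid(logits)  # multi-label: categories are scored independently
```

Fine-tuning such a model would then minimise a binary cross-entropy loss over the 23 categories for each annotated or candidate span, which naturally accommodates overlapping spans and multiple labels per span.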
We use spaCy’s SpanCategorizer as a baseline, which obtains an overall F1 score of 0.77. Compared to this baseline, the base-cased SpanBERT model shows a clear improvement, achieving an accuracy of 0.89 and an F1 score of 0.88. The large-cased model does not further improve accuracy, and neither the data splits nor hyperparameter optimisation significantly affect model performance. Inspecting category-wise performance, we find that overfitting due to the imbalanced category distribution is a critical issue: five textual categories obtain F1 scores above 0.90, while four categories score below 0.70, some as low as zero. These category-wise results highlight the need for a more balanced data representation, which would likely yield better model performance.
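For reference, a spaCy SpanCategorizer baseline of the kind described above can be set up along the following lines. This is a minimal sketch under assumed settings: the spans key and the label names are illustrative placeholders, not our full 23-category scheme.

```python
# Minimal sketch of a spaCy SpanCategorizer ("spancat") baseline; the spans key
# and the label names below are illustrative, not the full 23-category scheme.
import spacy

nlp = spacy.blank("en")
# The spancat component proposes candidate spans with an n-gram suggester by default
# and scores every candidate against each label independently, so spans may overlap;
# predictions are stored in doc.spans["md"].
spancat = nlp.add_pipe("spancat", config={"spans_key": "md"})
for label in ["transition", "frame_marker", "hedge", "booster"]:  # illustrative subset
    spancat.add_label(label)
```

Such a pipeline is then trained on annotated examples in the usual spaCy fashion before being evaluated against the fine-tuned SpanBERT models.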