The Strange Case of AI-generated and human-authored texts: A preliminary study comparing features from different levels of linguistic analysis and languages

Veronica Juliana Schmalz

KU Leuven

In recent times, the release of conversational interfaces powered by large language models (LLMs) has profoundly influenced society. Generative AI tools such as ChatGPT, Perplexity, Gemini, HuggingChat, Copilot, and others can provide effortless assistance to anyone wishing to write a text. Based solely on a prompt, which may also include indications of a desired author's writing style, any type of text can be generated with minimal effort. Given their accessibility and ease of use, these generative language technologies are being adopted across various fields, including education. However, while they may offer benefits in certain scenarios, there are also situations where it becomes essential to clearly distinguish between texts authored by humans, each with their own unique linguistic abilities and characteristics, and texts generated by AI tools.
In this regard, many researchers have worked in recent years on detecting AI-generated texts or on evaluating automatically generated output, drawing on different methodologies and on linguistic features derived from the texts. However, as recent studies have shown, these systems can often be inaccurate and unreliable, especially when the AI-generated texts have undergone content obfuscation, or when the underlying metrics are insufficiently inspected and justified, or focus solely on superficial linguistic aspects. Moreover, in some cases, results are biased against non-native English writers and minority groups.
To address these issues and shed more light on a detailed linguistic analysis of human-authored and AI-generated texts, we propose a novel methodology for distinguishing AI-authored from human-authored texts in educational domains. We primarily focus on three corpora: one of English and one of Dutch essays written by university students, and one of Spanish texts collected from adult native speakers. To these, we add texts automatically generated with five open-source conversational interfaces powered by different LLMs. First, we analyze the linguistic features contained in the texts at the levels of characters, lexicon, semantics, and syntax. We then compare which features are more distinctly marked in either the student texts or the AI-generated texts. Next, we select the best features to conduct three comparative experiments with different white-box classification algorithms. In this way, we depart from the majority of approaches used so far: we not only limit the amount of text data and the number of domains, but also provide a more detailed and linguistically justified analysis to achieve a sound differentiation of author types.
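As an illustration of this kind of workflow, the sketch below extracts shallow character- and lexicon-level features, selects the most discriminative ones, and trains a white-box classifier whose rules can be printed and inspected directly. The feature set, toy data, and choice of a shallow decision tree are purely hypothetical assumptions for demonstration; they are not the actual features, corpora, or algorithms of this project.

```python
# Hypothetical sketch only: a minimal feature-based pipeline for separating
# human-authored from AI-generated texts with an inspectable (white-box) model.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURE_NAMES = [
    "char_count",        # character level
    "mean_word_length",  # character/lexical level
    "type_token_ratio",  # lexical diversity
    "mean_sent_length",  # shallow proxy for syntactic complexity
]

def extract_features(text: str) -> list[float]:
    """Character- and lexicon-level features only; semantic and syntactic
    features would require a full NLP pipeline (e.g., a tagger and parser)."""
    words = text.split()
    sentences = [s for s in text.replace("?", ".").split(".") if s.strip()]
    n_words = max(len(words), 1)
    return [
        float(len(text)),
        sum(len(w) for w in words) / n_words,
        len({w.lower() for w in words}) / n_words,
        n_words / max(len(sentences), 1),
    ]

# Toy stand-ins for the student essays (label 0) and LLM outputs (label 1).
texts = [
    "I missed the bus again. Wrote my essay at midnight, half asleep.",
    "Honestly the lecture was confusing but the lab made things click.",
    "My draft is messy. Still, the argument about dialects feels right.",
    "In conclusion, the aforementioned considerations comprehensively demonstrate the multifaceted nature of the topic.",
    "This essay systematically examines the salient factors underpinning the phenomenon in a rigorous manner.",
    "The analysis elucidates several significant dimensions, thereby providing a holistic perspective on the subject.",
]
labels = [0, 0, 0, 1, 1, 1]

X = np.array([extract_features(t) for t in texts])
y = np.array(labels)

# Univariate feature selection keeps the most discriminative features,
# mirroring the "select the best features" step described above.
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = [FEATURE_NAMES[i] for i in selector.get_support(indices=True)]

# A shallow decision tree is one example of a white-box classifier:
# its learned decision rules can be read off directly.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(selector.transform(X), y)
print("selected features:", kept)
print(export_text(tree, feature_names=kept))
```

The printed tree makes the classifier's reasoning transparent, which is the main appeal of white-box algorithms in this setting compared with opaque neural detectors.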
With this work-in-progress research project, our goal is to develop a robust method for efficiently discerning between human-authored and AI-authored texts in educational contexts. Through it, we aim to provide students and instructors with a more transparent and easily interpretable way of identifying the linguistic characteristics of AI-authored texts. Moreover, we hope this will pave the way toward more easily interpretable computational linguistic techniques in this ever-growing area of research.