In-Corpus and Cross-Corpus Analysis of Native Language Identification Using Machine Learning

Shima Rahimi, Ehsan Lotfi, Walter Daelemans

University of Antwerp

Native Language Identification (NLI) has gained prominence as a method for identifying an individual's native language (L1) from their use of a second language (L2) in writing and speech. This study has two main objectives: (1) a comparative evaluation of how well machine learning algorithms perform in both in-corpus and cross-corpus settings for NLI, and (2) an examination of the influence of topic overlap on model performance. To this end, we used subsets of the TOEFL and ICLEv3 corpora. Our classifiers used content-based text complexity and content-independent POS tags as features. The SVM model showed the most promising results, particularly with POS tags, in both in-corpus and cross-corpus evaluation. Only eight languages are common to the two corpora, and the resulting F1-scores varied from language to language, prompting an investigation into the potential impact of topic familiarity. However, an analysis of F1-scores across pivotal topics revealed no straightforward correlation, highlighting the need for further exploration in this area.