Sentiment Analysis of Arabic-English Code-Switched Data: A Comparative Study of Machine Learning and Lexicon-Based Approaches
Nadine Donia
École Polytechnique Fédérale de Lausanne
Weiwei Sun
Cambridge University
Arun Kumar Rajasekaran
Cambridge University
Despite the prevalence of Arabic-English code-switching on social media among Arabic-speaking users, there are no dedicated sentiment analysis models that account for this linguistic complexity. This paper addresses that gap with a comparative study of two approaches to sentiment analysis of code-switched Arabic-English data: a machine learning approach and a lexicon-based approach. The machine learning approach uses a logistic regression model trained on word embeddings of the dataset. The lexicon-based approach uses a function that performs morphological analysis of the words, together with an Arabic-English lexicon obtained by fusing separate monolingual Arabic and English lexicons. Both approaches were tested on three manually labeled datasets: a monolingual Arabic dataset, a monolingual English dataset, and a code-switched Arabic-English Twitter dataset. In the experiments, the lexicon-based approach achieved higher accuracy than the machine learning approach, with an accuracy score of 0.74, and outperformed simple code-switched sentiment analysis models reported for other languages. In addition, this paper introduces an Arabic-English code-switching corpus of 10,000 tweets with sentiment labels, along with an Arabic-English lexicon.
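To make the lexicon-based idea concrete, the following is a minimal sketch of fusing two monolingual lexicons and scoring a code-switched sentence by summing word polarities. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the toy lexicon entries, the polarity values, and the names `fuse_lexicons`, `tokenize`, and `classify` are all hypothetical, and the morphological analysis step is omitted.

```python
import re

# Hypothetical toy lexicons (word -> polarity). The lexicons used in the
# paper are not reproduced here; these entries are for illustration only.
ENGLISH_LEXICON = {"love": 1.0, "great": 1.0, "bad": -1.0, "hate": -1.0}
ARABIC_LEXICON = {"جميل": 1.0, "رائع": 1.0, "سيء": -1.0, "حزين": -1.0}


def fuse_lexicons(*lexicons):
    """Merge monolingual lexicons into a single bilingual lookup table."""
    fused = {}
    for lexicon in lexicons:
        fused.update(lexicon)
    return fused


# Match runs of Latin letters or runs of Arabic letters (U+0621 to U+064A).
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[\u0621-\u064A]+")


def tokenize(text):
    """Split mixed-script text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def classify(text, lexicon):
    """Sum the polarities of known words and map the total to a label."""
    score = sum(lexicon.get(token, 0.0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


FUSED_LEXICON = fuse_lexicons(ENGLISH_LEXICON, ARABIC_LEXICON)
print(classify("I love this movie, it is جميل", FUSED_LEXICON))
```

Because the fused table is keyed by surface form, a single lookup handles tokens in either script without a separate language identification step. A fuller version would normalize or morphologically analyze each token before the lookup.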