Retrieval-augmented chatbot for an IT helpdesk (WIP)

Michael Wheeler, Suzan Verberne

LIACS, Leiden University

Large language models (LLMs) are remarkably capable at comprehending and generating language. Yet these models still struggle when asked a question whose answer is not contained in their training data. Our aim in this paper is to assist Leiden University’s IT helpdesk in answering questions from students and employees by combining an LLM with retrieval-augmented generation (RAG). RAG models have been shown to be effective when linked to an internal knowledge base with simple question-and-answer schemes. Leiden University also has an internal knowledge base, but it is not sufficient to answer every question a user might ask. Therefore, we use old tickets (previously asked questions) and their solutions as our knowledge base. Our database consists of 90K tickets that have been solved by an expert. Using old tickets poses significant challenges for the LLM, retrieval, and evaluation: the previously answered questions may contain irrelevant information, vary widely in length, and often have their answer provided in a multi-turn dialogue between the user and the helpdesk employee. Our preliminary results show that the quality of the answer generated by the LLM is strongly influenced by the quality of the retrieved context. As a baseline, we tried a zero-shot setting in which we prompt the LLM with only the question and no retrieved context. This setting results in a ROUGE-L score of 0.11, a BLEU score of 0.05, and a BERTScore F1 of 0.65. In our retrieval pipeline we use BM25, which reaches a Recall@10 of 0.24, but many tickets are similar to one another, which creates redundancy. We doubt whether the LLM can distinguish the solution from the problem in a ticket when it is fed the complete dialogue. We test our implementation on tickets that are not guaranteed to have a solution in the database. We also try summarizing the problems first to reduce noise before feeding them into the LLM.
Our experiments indicate that RAG alone is not sufficient to solve LLM-based question answering if the data is of mixed quality and length, and that more effort is needed to properly support helpdesk employees with a RAG-based chatbot.
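
To make the retrieval step concrete, the following is a minimal sketch of BM25 ranking over ticket texts together with a Recall@k computation. It is not the implementation used in this work: the whitespace tokenizer, the parameter values k1=1.5 and b=0.75, the class and function names, and the definition of Recall@k as the fraction of a query's relevant tickets found in the top k are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline
    # would likely use a proper tokenizer and stop-word handling.
    return text.lower().split()


class BM25:
    """Illustrative Okapi BM25 ranker over a small in-memory collection."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # k1 and b are common default values, not values from the paper.
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.doc_len = [len(d) for d in self.docs]
        self.avgdl = sum(self.doc_len) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        # Smoothed IDF variant that stays non-negative.
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        # BM25 score of document i for the given query string.
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            length_norm = 1 - self.b + self.b * self.doc_len[i] / self.avgdl
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def top_k(self, query, k=10):
        # Indices of the k highest-scoring documents, best first.
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return ranked[:k]


def recall_at_k(retrieved, relevant, k=10):
    # Assumed definition: share of relevant tickets that appear in the top k.
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

With a collection of past tickets, `BM25(tickets).top_k(question, k=10)` would return the indices of the ten tickets to pass to the LLM as context, and averaging `recall_at_k` over a set of test questions gives a collection-level Recall@10 under the assumed definition.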