Evaluating Humor Generation in an Improvisational Comedy Setting
Thomas Winters
KU Leuven
Stijn Van der Stockt
DPG Media
While computational humor generation has long been considered a challenging task, recent large language models have significantly improved the quality of generated jokes. Evaluating humor quality remains difficult: the quality itself is subjective, and the delivery also affects how a joke lands. Comparisons between human and computer-generated humor are further complicated by the difference in writing time between the two. In this study, we compare the quality of humor generated by GPT-4 with human-written jokes in a Dutch-language improvisational comedy setting. In a live performance on national TV, nine different audience suggestions were used over three improvisational comedy games. For each suggestion, three human comedians each performed their own improvised joke and an AI-generated joke, resulting in a total of 54 improvised jokes. The AI-generated jokes were selected from outputs generated by prompting GPT-4 with a few-shot chain-of-thought prompt specific to each game, together with the audience suggestion. An audience of 40 people then rated all jokes on a 4-point scale, resulting in 2160 ratings. This design allows us to compare the quality of AI and human-created jokes delivered by the same comedian for the same audience suggestion. We found that humor generated by the AI and by the human comedians was rated similarly: in 34.6% of comparisons the human joke was preferred, in 29.7% the AI joke was preferred, and in 35.7% both were rated equally. Human-created jokes also scored slightly higher on average (2.67 vs 2.59). Interestingly, AI jokes received more “best joke” votes, suggesting that while humans are more consistent overall, AI can occasionally create standout humorous content. These results imply that current language models can generate relatively high-quality humor that closely rivals human comedians in an improvisational context, highlighting the potential for new forms of real-time collaborative humor generation between AI and humans.
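The counts and the paired comparison described above can be sketched as follows. This is only a minimal illustration, not the analysis code used in the study: the function name `preference_shares`, the constant names, and the toy ratings are hypothetical, and the sketch assumes that each audience member's rating of a human joke is paired with the same person's rating of the AI joke performed by the same comedian for the same suggestion.

```python
from collections import Counter

# Design parameters as stated in the abstract.
N_SUGGESTIONS = 9   # audience suggestions
N_COMEDIANS = 3     # human comedians
N_SOURCES = 2       # one human-written and one AI-generated joke each
N_RATERS = 40       # audience members

n_jokes = N_SUGGESTIONS * N_COMEDIANS * N_SOURCES   # 54 jokes
n_ratings = n_jokes * N_RATERS                      # 2160 ratings
n_pairs = n_ratings // N_SOURCES                    # 1080 human/AI rating pairs


def preference_shares(pairs):
    """Return the share of pairs where the human joke, the AI joke,
    or neither was rated higher.

    `pairs` is an iterable of (human_score, ai_score) tuples on the
    4-point scale (hypothetical input format, assumed for this sketch).
    """
    counts = Counter()
    for human_score, ai_score in pairs:
        if human_score > ai_score:
            counts["human"] += 1
        elif ai_score > human_score:
            counts["ai"] += 1
        else:
            counts["equal"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one rating pair is required")
    return {key: counts[key] / total for key in ("human", "ai", "equal")}


# Toy ratings for illustration only; these are NOT data from the study.
toy_pairs = [(3, 2), (2, 2), (1, 4), (4, 3)]
shares = preference_shares(toy_pairs)
```

Applied to all rating pairs of such a design, a function like this would produce the three preference percentages of the kind reported above, which by construction sum to 100%.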