The Riddle Experiment: two groups are trying to solve a black story behind a screen; only one group is alive.
Nikki Rademaker, Linthe van Rooij, Yanna Smid, Tessa Verhoef, Tom Kouwenhoven
LIACS, Leiden University
Investigating the cognitive capabilities of large language models (LLMs), such as GPT-4, has shed light on their performance in areas like Theory of Mind (ToM) and problem-solving. Various studies have demonstrated GPT's potential in problem-solving, often matching human performance. This study explores these capabilities by comparing GPT-4's performance with human performance in understanding and solving riddles embedded in black stories. Black stories require solvers, in our case GPT-4 and humans, to unravel mysterious, dark stories by asking yes/no questions, starting from only a brief description of the story's ending. We prompted both groups to ask questions in order to arrive at the solution of the riddle. The study used a set of 12 existing black stories with altered details such as genders, vehicles, objects, or locations (a train swapped for a bus, a suitcase for a duffle bag, etc.). This was done to prevent GPT-4 from recognizing existing riddles from its training data. Each black story was tested twice within both the human and the GPT-4 group. In total, 23 rounds of riddles were played with people and 24 with GPT-4. To ensure fairness and maintain conditions similar to those experienced by GPT-4, the experiment was conducted through text messaging for the human test group, eliminating any influence of tone of voice or body language.
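The abstract does not publish the exact prompting setup; the following is a minimal sketch of how such a yes/no questioning loop with GPT-4 could be run through the OpenAI API. The system prompt, the question budget, and the game-master interface are illustrative assumptions, not the authors' actual implementation.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are solving a 'black story' riddle. You will receive a short "
    "description of how a story ends. Ask one yes/no question per turn "
    "to uncover what happened. When confident, state the full solution."
)

def play_round(story_ending, answer_question, max_questions=40):
    # Let GPT-4 ask yes/no questions until the game master confirms the
    # solution. answer_question is a callable (here: a human game master)
    # returning 'Yes', 'No', a hint, or 'solved' for each question.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"The story ends like this: {story_ending}"},
    ]
    for turn in range(max_questions):
        response = client.chat.completions.create(model="gpt-4", messages=messages)
        question = response.choices[0].message.content
        print(f"Q{turn + 1}: {question}")
        messages.append({"role": "assistant", "content": question})
        reply = answer_question(question)
        if reply == "solved":  # game master confirms the full solution
            return turn + 1
        messages.append({"role": "user", "content": reply})
    return None  # riddle not solved within the question budget

def console_game_master(question):
    # Mirrors the text-only setup used with the human group: the game
    # master answers each question in writing, with no tone or body language.
    return input("Game master (Yes/No/hint/solved): ")

n_questions = play_round(
    "A man lies dead in a field next to an unopened package.",  # a classic example of the genre
    console_game_master,
)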
The primary measure of performance was the number of questions it took a participant to solve the riddle, taking into account the number of hints needed to reach the solution. Results indicated no significant difference between the groups, with humans performing slightly better overall. Qualitative results showed that GPT-4 excelled in precise questioning and a creative approach: for example, it would keep asking about the type of vehicle until it received confirmation, suggesting uncommon vehicles like hot air balloons and submarines. However, GPT-4 often fixated on a single detail of the story, missing the bigger picture and summarizing the solution too quickly without proposing the full solution: “[...] The reason they play this brutal game isn't for a vendetta or threat, but their exact motivations aren't specified”. Humans tended to cover more topics and to switch focus between multiple aspects of the story faster, while struggling more with the story's uncommon details. This research suggests that while GPT-4 can closely mimic human performance in problem-solving and reasoning, subtle nuances in riddle comprehension still favor humans.
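For illustration only, the per-round score and group comparison could be computed as below. The hint penalty, the choice of a Mann-Whitney U test, and all numbers are assumptions made for this sketch: the abstract specifies neither how hints were weighted nor which statistical test was applied, and the real round data are not published here.

from scipy.stats import mannwhitneyu

HINT_PENALTY = 2  # assumed cost of one hint, in question-equivalents

def round_score(n_questions, n_hints):
    # Questions needed to solve the riddle, penalized for hints used.
    return n_questions + HINT_PENALTY * n_hints

# Hypothetical (questions, hints) counts standing in for the real
# 23 human rounds and 24 GPT-4 rounds.
human_rounds = [(18, 1), (22, 0), (15, 2), (25, 1)]
gpt4_rounds = [(20, 1), (24, 2), (17, 0), (26, 1)]

human_scores = [round_score(q, h) for q, h in human_rounds]
gpt4_scores = [round_score(q, h) for q, h in gpt4_rounds]

stat, p = mannwhitneyu(human_scores, gpt4_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")  # a non-significant p would match the reported null result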