Researchers test the trustworthiness of AI by playing sudoku
Artificial intelligence tools called large language models (LLMs), such as OpenAI’s ChatGPT or Google’s Gemini, can do a lot these days: dispensing relationship advice, crafting texts to get you out of social obligations and even writing science articles.
But can they also solve your morning sudoku?
In a new study, a team of computer scientists from the University of Colorado Boulder decided to find out. The group created nearly 2,300 original sudoku puzzles, which require players to enter numbers into a grid following certain rules, then asked several AI tools to fill them in.
The results were a mixed bag. While some of the AI models could solve easy sudokus, even the best struggled to explain how they solved them, giving garbled, inaccurate or even surreal descriptions of how they arrived at their answers. The results raise questions about the trustworthiness of AI-generated information, said study co-author Maria Pacheco.
“For certain types of sudoku puzzles, most LLMs still fall short, particularly in producing explanations that are in any way usable for humans,” said Pacheco, assistant professor in the Department of Computer Science. “Why did it come up with that solution? What are the steps you need to take to get there?”
She and her colleagues published their results in Findings of the Association for Computational Linguistics.

The researchers aren’t trying to cheat at puzzles. Instead, they’re using these logic exercises to explore how AI platforms think. The results could one day lead to more reliable and trustworthy computer programs, said study co-author Fabio Somenzi, professor in the Department of Electrical, Computer and Energy Engineering.
“Puzzles are fun, but they’re also a microcosm for studying the decision-making process in machine learning,” he said. “If you have AI prepare your taxes, you want to be able to explain to the IRS why the AI wrote what it wrote.”
Daily puzzle
Somenzi, a self-described sudoku fan, noted that the puzzles tap into a very human way of thinking. Filling out a sudoku grid requires puzzlers to learn and follow a set of logical rules. For example, you can’t enter a two in an empty square if there’s already a two in the same row or column.
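To make the rule concrete, here is a minimal Python sketch of the check a puzzler performs before writing a digit. It is an illustration only, not code from the study. The function name `can_place`, the use of 0 for an empty square, and the two-by-three box shape (the usual layout for a six-by-six grid) are assumptions made for this example. Standard sudoku also forbids repeating a digit inside a box, so the sketch checks that as well.

```python
from typing import List

Grid = List[List[int]]  # assumption: 0 marks an empty square


def can_place(grid: Grid, row: int, col: int, digit: int,
              box_rows: int = 2, box_cols: int = 3) -> bool:
    """Return True if `digit` may be written in the square at (row, col)."""
    # The square must be empty to begin with.
    if grid[row][col] != 0:
        return False
    # Rule from the article: no repeat in the same row.
    if digit in grid[row]:
        return False
    # Rule from the article: no repeat in the same column.
    if any(r[col] == digit for r in grid):
        return False
    # Standard sudoku also forbids a repeat inside the enclosing box.
    top = row - row % box_rows
    left = col - col % box_cols
    for r in range(top, top + box_rows):
        for c in range(left, left + box_cols):
            if grid[r][c] == digit:
                return False
    return True


# A mostly empty six-by-six grid with two 2s already placed.
example = [
    [0, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 2, 0, 0],
]
```

With this grid, `can_place(example, 0, 0, 2)` is rejected because row 0 already holds a two, while `can_place(example, 3, 0, 2)` is allowed because no two appears in that row, column or box.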
Most LLMs today struggle at that kind of thinking, in large part because of how they鈥檙e trained.
To build ChatGPT, for example, programmers first fed the AI almost everything that had ever been written on the internet. When ChatGPT responds to a question, it predicts the most likely response based on all that data, almost like a computer version of rote memory.
“What they do is essentially predict the next word,” Pacheco said. “If you have the start to a sentence, what word comes next? They do that by referring to every sentence in the English language that they can get their hands on.”
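The idea of predicting the next word from previously seen text can be shown with a toy Python sketch. This is a deliberate simplification: real LLMs use neural networks over tokens rather than a table of word counts, and the function names and the tiny `corpus` below are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words were seen immediately after it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text, standing in for "everything on the internet".
corpus = [
    "the puzzle is hard",
    "the puzzle is fun",
    "the grid is small",
    "the puzzle has rules",
]
model = train_bigrams(corpus)
```

Here `predict_next(model, "the")` returns `"puzzle"` simply because that pairing was seen most often. Nothing in the table encodes the rules of sudoku, which is the gap the researchers are probing.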
Pacheco, Somenzi and their colleagues have joined a growing effort in computer science to merge those two ways of thinking, combining the memory of an LLM with a human brain’s capacity for logic, a pursuit known as “neurosymbolic” AI.
Anirudh Maiya and Razan Alghamdi, both former graduate students at CU Boulder, were also co-authors of the new paper.
How’s the weather?
To begin, the researchers created sudoku puzzles of varying difficulty using a six-by-six grid (a simpler version of the nine-by-nine puzzles you usually find online).
They then gave the puzzles to a series of AI models, including the preview of OpenAI’s o1 model, which in 2024 represented the state of the art for its kind of LLM.
The o1 model led the pack, solving roughly 65% of the sudoku puzzles correctly. Then the team asked the AI platforms to explain how they got their answers. That’s when the results got really wild.
“Sometimes, the AI explanations made up facts,” said Ashutosh Trivedi, a co-author of the study and associate professor of computer science at CU Boulder. “So it might say, 'There cannot be a two here because there’s already a two in the same row,' but that wasn’t the case.”
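A claim like the one Trivedi describes can be checked mechanically against the grid. The Python sketch below is a hypothetical illustration of such a check, not the evaluation method used in the study. The function name `claim_holds`, the `unit` argument and the sample grid are assumptions, with 0 again standing for an empty square.

```python
from typing import List

Grid = List[List[int]]  # assumption: 0 marks an empty square


def claim_holds(grid: Grid, row: int, col: int, digit: int, unit: str) -> bool:
    """Check a claim of the form: `digit` cannot go at (row, col)
    because it is already in the same `unit` ("row" or "column")."""
    if unit == "row":
        return digit in grid[row]
    if unit == "column":
        return any(r[col] == digit for r in grid)
    raise ValueError(f"unknown unit: {unit!r}")


# Hypothetical partial grid: row 0 has no 2, but column 1 does (in row 4).
puzzle = [
    [1, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# "There cannot be a two at (0, 1) because there's already a two in the row."
fabricated = not claim_holds(puzzle, 0, 1, 2, "row")
```

For this grid `fabricated` is `True`: the stated reason is false, even though a two really is ruled out at that square by the column.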
In a telling example, the researchers were talking to one of the AI tools about solving sudoku when it, for unknown reasons, responded with a weather forecast.
“At that point, the AI had gone berserk and was completely confused,” Somenzi said.
The researchers hope to design their own AI system that can do it all: solving complicated puzzles and explaining how. They’re starting with another type of puzzle called hitori, which, like sudoku, involves a grid of numbers.
“People talk about the emerging capabilities of AI where they end up being able to solve things that you wouldn’t expect them to solve,” Pacheco said. “At the same time, it’s not surprising that they’re still bad at a lot of tasks.”