Researchers test the trustworthiness of AI—by playing sudoku

Artificial intelligence tools called large language models (LLMs), such as OpenAI’s ChatGPT or Google’s Gemini, can do a lot these days: dispensing relationship advice, crafting texts to get you out of social obligations and even writing science articles.

But can they also solve your morning sudoku?

In a new study, a team of computer scientists from the University of Colorado Boulder decided to find out. The group created nearly 2,300 original sudoku puzzles, which require players to enter numbers into a grid following certain rules, then asked several AI tools to fill them in.

The results were a mixed bag. While some of the AI models could solve easy sudokus, even the best struggled to explain how they solved them, giving garbled, inaccurate or even surreal descriptions of how they arrived at their answers. The results raise questions about the trustworthiness of AI-generated information, said study co-author Maria Pacheco.

“For certain types of sudoku puzzles, most LLMs still fall short, particularly in producing explanations that are in any way usable for humans,” said Pacheco, assistant professor in the Department of Computer Science. “Why did it come up with that solution? What are the steps you need to take to get there?”

She and her colleagues published their results in Findings of the Association for Computational Linguistics.

The researchers aren’t trying to cheat at puzzles. Instead, they’re using these logic exercises to explore how AI platforms think. The results could one day lead to more reliable and trustworthy computer programs, said study co-author Fabio Somenzi, professor in the Department of Electrical, Computer and Energy Engineering.

“Puzzles are fun, but they’re also a microcosm for studying the decision-making process in machine learning,” he said. “If you have AI prepare your taxes, you want to be able to explain to the IRS why the AI wrote what it wrote.”

Daily puzzle

Somenzi, who is a self-described sudoku fan, noted that the puzzles tap into a very human way of thinking. Filling out a sudoku grid requires puzzlers to learn and follow a set of logical rules. For example, you can’t enter a two in an empty square if there’s already a two in the same row or column.
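The rule Somenzi describes can be written as a short check. The sketch below is illustrative only and is not taken from the study: the function name `can_place`, the use of 0 for an empty square, and the two-by-three box shape (a common layout for six-by-six sudoku) are all assumptions.

```python
from typing import List

Grid = List[List[int]]  # assumption: 0 marks an empty square


def can_place(grid: Grid, row: int, col: int, value: int,
              box_rows: int = 2, box_cols: int = 3) -> bool:
    """Return True if `value` can go in (row, col) without breaking a rule."""
    if grid[row][col] != 0:
        return False  # the square is already filled
    if value in grid[row]:
        return False  # same value already in this row
    if any(r[col] == value for r in grid):
        return False  # same value already in this column
    # Assumed box rule: 2x3 boxes, as in a typical six-by-six sudoku.
    top = (row // box_rows) * box_rows
    left = (col // box_cols) * box_cols
    for r in range(top, top + box_rows):
        for c in range(left, left + box_cols):
            if grid[r][c] == value:
                return False
    return True


# A mostly empty six-by-six grid with a single two in the top row.
example = [
    [0, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(can_place(example, 0, 4, 2))  # False: a two already sits in row 0
print(can_place(example, 3, 4, 2))  # True: no two in that row, column or box
```

A solver built on a check like this never needs to guess whether a move is legal, which is the kind of explicit rule-following the article contrasts with how LLMs work.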

Most LLMs today struggle with that kind of thinking, in large part because of how they’re trained.

To build ChatGPT, for example, programmers first fed the AI almost everything that had ever been written on the internet. When ChatGPT responds to a question, it predicts the most likely response based on all that data—almost like a computer version of rote memory.

“What they do is essentially predict the next word,” Pacheco said. “If you have the start to a sentence, what word comes next? They do that by referring to every sentence in the English language that they can get their hands on.”
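To make "predict the next word" concrete, here is a deliberately tiny sketch. It is not how ChatGPT or any real LLM is built: real models use neural networks trained on vast amounts of text, while this toy simply counts which word most often follows another in three made-up sentences. The corpus and the function name `predict_next` are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Optional

# Made-up training text; a real model sees vastly more.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# For every word, count the words that come directly after it.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        follow_counts[current][following] += 1


def predict_next(word: str) -> Optional[str]:
    """Return the word seen most often after `word`, or None if unseen."""
    counts = follow_counts.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(predict_next("sat"))  # on
print(predict_next("the"))  # cat
```

Nothing in this toy knows any rules. It only echoes patterns it has seen, which hints at why a purely predictive approach can struggle with a puzzle that demands step-by-step logic.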

Pacheco, Somenzi and their colleagues have joined a growing effort in computer science to merge those two ways of thinking—combining the memory of an LLM with a human brain’s capacity for logic, a pursuit known as “neurosymbolic” AI.

Anirudh Maiya and Razan Alghamdi, both former graduate students at CU Boulder, were also co-authors of the new paper.

How’s the weather?

To begin, the researchers created sudoku puzzles of varying difficulty using a six-by-six grid (a simpler version of the nine-by-nine puzzles you usually find online).

They then gave the puzzles to a series of AI models, including the preview of OpenAI’s o1 model, which, in 2024, represented the state of the art for its kind of LLM.

The o1 model led the pack, solving roughly 65% of the sudoku puzzles correctly. Then the team asked the AI platforms to explain how they got their answers. That’s when the results got really wild.

“Sometimes, the AI explanations made up facts,” said Ashutosh Trivedi, a co-author of the study and associate professor of computer science at CU Boulder. “So it might say, ‘There cannot be a two here because there’s already a two in the same row,’ but that wasn’t the case.”
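A made-up fact of the kind Trivedi describes is easy to catch once the claim is stated in a structured form. The sketch below assumes a hypothetical `Claim` record ("this row or column already contains this value") and checks it against the grid. The article does not describe how the researchers actually evaluated the explanations, so treat this purely as an illustration.

```python
from typing import List, NamedTuple

Grid = List[List[int]]  # assumption: 0 marks an empty square


class Claim(NamedTuple):
    """Hypothetical structured form of 'there is already a V in this row/column'."""
    unit: str   # "row" or "column"
    index: int  # zero-based row or column number
    value: int


def claim_is_true(grid: Grid, claim: Claim) -> bool:
    """Check whether the stated value really appears in the stated row or column."""
    if claim.unit == "row":
        cells = grid[claim.index]
    elif claim.unit == "column":
        cells = [row[claim.index] for row in grid]
    else:
        raise ValueError(f"unknown unit: {claim.unit!r}")
    return claim.value in cells


puzzle = [
    [1, 0, 0, 0, 0, 6],
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# "There's already a two in the same row" -- but row 0 holds no two.
print(claim_is_true(puzzle, Claim("row", 0, 2)))     # False
# "There's already a three in column 2" -- this one checks out.
print(claim_is_true(puzzle, Claim("column", 2, 3)))  # True
```

The hard part in practice is turning free-form AI prose into checkable claims like these. The check itself is trivial once that is done.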

In a telling example, the researchers were talking to one of the AI tools about solving sudoku when it, for unknown reasons, responded with a weather forecast.

“At that point, the AI had gone berserk and was completely confused,” Somenzi said.

The researchers hope to design their own AI system that can do it all—solving complicated puzzles and explaining how. They’re starting with another type of puzzle called hitori, which, like sudoku, involves a grid of numbers.

“People talk about the emerging capabilities of AI where they end up being able to solve things that you wouldn’t expect them to solve,” Pacheco said. “At the same time, it’s not surprising that they’re still bad at a lot of tasks.”
