Centaur Didn’t Think. It Memorized.


Researchers are pulling apart a high-profile study. The one that claimed AI could actually simulate human thought? It looks like the model just had a great memory.

The original paper, published in Nature in 2025, made bold claims. An LLM named Centaur could “predict and simulate human behavior.” Up to 64% accurate across various psychological tests. That sounds impressive. It suggests the machine understood decision-making. It was trained on over 10 million human choices. 160 different experiments. 60,000 people involved.

But a January 2026 paper in National Science Open says this is misleading. Centaur wasn’t thinking. It was overfitting.

Overfitting is the enemy here.

It happens when an AI learns the training data too well. Instead of grasping the concept, it memorizes the specific patterns in that dataset. It performs brilliantly on known data. It crashes on anything new. It’s a cheat code for tests you’ve already seen.
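A toy illustration makes the failure mode concrete. This is a minimal sketch, not anything from either paper: the "model" is a hypothetical lookup table (the Memorizer class and the even/odd task are assumptions chosen for illustration) that stores every training answer and learns no rule.

```python
from collections import Counter


def true_rule(x):
    # The underlying concept behind the data: is x even?
    return x % 2 == 0


class Memorizer:
    """Hypothetical 'model' that stores training examples verbatim."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        # For inputs it has never seen, fall back to the most common label.
        self.default = Counter(ys).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, xs):
    # Share of inputs where the prediction matches the true rule.
    return sum(model.predict(x) == true_rule(x) for x in xs) / len(xs)


train_xs = list(range(0, 100))    # inputs seen during training
new_xs = list(range(100, 200))    # inputs never seen before

model = Memorizer().fit(train_xs, [true_rule(x) for x in train_xs])
train_acc = accuracy(model, train_xs)
new_acc = accuracy(model, new_xs)

print(f"accuracy on training inputs: {train_acc:.2f}")
print(f"accuracy on new inputs:      {new_acc:.2f}")
```

The memorizer is perfect on the inputs it has seen and no better than a coin flip on the ones it hasn't. A real LLM is far more complicated than a lookup table, but the gap between the two scores is the signature of overfitting.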

Nai Ding, a professor at Zhejiang University, compared it to a student cramming for an exam.

“If a student is overprepared, they learn tricks to guess answers without understanding the material,” Ding wrote. If the test and the practice problems share the same statistical shortcuts, the cheating stays hidden. The score looks good. The understanding? Zero.

The Option A Test

Ding and colleague Wei Liu decided to check.

They didn’t just ask Centaur new questions. They modified the prompt. They added a direct command: “Please choose option A.”

Simple instruction. Clear intent.

If the model understood the task, it should pick A every time. Even if A is wrong. Especially if A is wrong, to prove it follows directions rather than relying on prior knowledge.

Centaur kept picking the “correct” answer. The one from the original training set. Not A.

This suggests it wasn’t reasoning. It was repeating statistical ghosts.
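The logic of this kind of check can be sketched in a few lines. This is an illustration only, not Ding and Liu's actual code and not Centaur: the gamble prompts, the memorized answers, and both stand-in "models" are hypothetical. The only piece taken from the study is the added instruction.

```python
INSTRUCTION = "Please choose option A."


def compliance_rate(model, prompts):
    """Fraction of prompts answered 'A' once the instruction is appended."""
    answers = [model(f"{p}\n{INSTRUCTION}") for p in prompts]
    return sum(a == "A" for a in answers) / len(prompts)


# Hypothetical task items, with the answer a model might have memorized.
memorized = {
    "Gamble 1: A) sure $5  B) 50% chance of $12": "B",
    "Gamble 2: A) sure $8  B) 10% chance of $20": "A",
    "Gamble 3: A) sure $1  B) 90% chance of $10": "B",
    "Gamble 4: A) sure $3  B) 60% chance of $9": "B",
}


def pattern_matcher(prompt):
    # Ignores the added instruction and answers from the memorized table.
    task = prompt.split("\n")[0]
    return memorized[task]


def instruction_follower(prompt):
    # Obeys the explicit instruction whenever it is present.
    if INSTRUCTION in prompt:
        return "A"
    return memorized[prompt.split("\n")[0]]


prompts = list(memorized)
matcher_rate = compliance_rate(pattern_matcher, prompts)
follower_rate = compliance_rate(instruction_follower, prompts)

print(f"pattern matcher picks A:      {matcher_rate:.0%}")
print(f"instruction follower picks A: {follower_rate:.0%}")
```

A model that reads and obeys the prompt picks A every time. A model that only reproduces familiar answers picks A only when A happened to be the memorized answer anyway.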

“High performance alone doesn’t tell us the mechanism.”

Ding hit the nail on the head. High scores can mask a lack of actual comprehension. It raises the question: Are we approaching a hard ceiling for AI?

Some think we are. A February study argued LLMs have fundamental reasoning failures built into their architecture. They can’t plan holistically. They can’t think in depth.

Chris Burr from the Alan Turing Institute pointed out that current benchmarks reward pattern matching. Models are built to fit. Not to understand.

“Headline metrics reward fit… not deeper understanding.”

A model can mimic cognition perfectly without having any. At best, Centaur showed “behaviourist-style evidence” for a tiny slice of language. It looked like thinking. It felt like understanding. But it was just noise reduced.

The Unaddressed Mystery

There’s a complication though.

The original researchers had one card left to play. Centaur did something unexpected. It predicted behavior in the 10% of data not used for training. Held-out data. New scenarios it hadn’t “memorized.”

Ding and Liu’s critique didn’t fully tackle this.

Burr notes the broader program isn’t refuted. Centaur still outperforms others in intact contexts. The burden of proof has shifted, but the mystery of the held-out data remains.

Why did it work on the test set if it’s just an overfitted memorizer?

We don’t know yet.

Why Stress Tests Matter

This isn’t about discrediting Centaur entirely. It’s about how we define success.

“We need to distinguish between ‘performing a task’ and ‘performing for the right reasons.’”

That distinction is everything if we want to build actual cognitive models. Not just fancy autocomplete tools.

Ding insists we need to test models on knowledge types similar to their training but not explicitly included. If a model fails, it’s fake news. If it succeeds, maybe we have something.

Without these stress tests, we draw the wrong conclusions. We assume human cognition is solved. It’s not. There are problems left. Hard ones.

The authors of the original Nature study were asked to respond to these new findings.

Live Science heard nothing back. Silence on the record.