LLMs Are Currently Not Helpful At All For Math Research, Give Garbage Answers: Mathematician Joel David Hamkins

Some experts say they have used AI to solve Erdős problems and to help with mathematics research, but not everyone is on board yet.

Joel David Hamkins, a prominent mathematician and professor of logic at the University of Notre Dame, recently shared his unvarnished assessment of large language models as tools for mathematical research during an appearance on the Lex Fridman podcast. His experience stands in sharp contrast to the optimistic narratives surrounding AI’s potential in scientific discovery, and his critique centers on a fundamental issue: mathematical correctness.

“I guess I would draw a distinction between what we have currently and what might come in future years,” Hamkins began, acknowledging the possibility of future progress. “I’ve played around with it and I’ve tried experimenting, but I haven’t found it helpful at all. Basically zero. It’s not helpful to me. And I’ve used various systems and so on, the paid models and so on.”

His experience with current AI systems has been consistently disappointing. “My typical experience interacting with AI on a mathematical question is that it gives me garbage answers that are not mathematically correct, and so I find that not helpful and also frustrating,” he explained. The frustration, for Hamkins, goes beyond mere incorrectness: it is the nature of the interaction itself that proves problematic.

“The frustrating thing is when you have to argue about whether or not the argument that they gave you is right. And you point out exactly the error,” Hamkins said, describing exchanges where he identifies specific flaws in the AI’s reasoning. The AI’s response? “Oh, it’s totally fine.” This pattern of confident incorrectness followed by dismissal of legitimate criticism mirrors a type of human interaction that Hamkins finds untenable: “If I were having such an experience with a person, I would simply refuse to talk to that person again.”

Hamkins allows that these limitations may not be permanent, but his verdict on today’s systems is blunt. “One has to overlook these kind of flaws and so I tend to be a kind of skeptic about the value of the current AI systems. As far as mathematical reasoning is concerned, it seems not reliable.”

Hamkins’ assessment highlights a tension within the AI community. While some researchers have reported breakthroughs, such as claims of AI assistance in tackling problems from the Erdős problem collection, working mathematicians like Hamkins find current systems fundamentally unreliable for serious research. Mathematician Terence Tao has made a related observation: AI can generate mathematical proofs that look flawless on the surface yet contain subtle mistakes of a kind human mathematicians rarely make. The issue is not just that LLMs make mistakes, but that they make them with confidence and resist correction, breaking the collaborative trust essential to mathematical discourse.

As AI companies continue to invest heavily in reasoning capabilities and mathematical problem-solving, Hamkins’ experience serves as a sobering reminder that impressive benchmarks don’t always translate into practical utility for domain experts. For some mathematicians, the gap between AI performance on standardized tests and its ability to serve as a genuine research partner remains wide, at least for now.