First Proof is AI’s toughest math test yet. The results are mixed

14 February 2026

4 min read

AI gets its toughest math test yet. The results are mixed

Experts gave AI 10 math problems to solve in a week. OpenAI, researchers and hobbyists all gave it their best shot.

By Joseph Howlett, edited by Claire Cameron

Black and white photo of a room full of teenage students taking exams. — Interim Archive/Contributor via Getty Images

The verdict, it seems, is in: artificial intelligence is not about to replace mathematicians.

That is the immediate takeaway from the “First Proof” challenge, perhaps the most robust test yet of the ability of large language models (LLMs) to conduct mathematical research. The challenge was set by 11 top mathematicians on February 5, and the results were released early in the morning on Valentine’s Day. It is too early to say how many of the 10 math problems in the challenge were solved by AI without human assistance. But one thing is clear: no single LLM came close to solving them all.

The mathematicians behind First Proof posed 10 “lemmas” to AI, a math term for minor theorems that pave the way to larger results. These problems are the stock-in-trade of the working mathematician, the kind of small problem one might assign to a talented graduate student. According to Mohammed Abouzaid, a mathematics professor at Stanford University and a member of the First Proof team, the mathematicians were aiming for problems that would require some originality to solve, not just a mixture of standard techniques.

While the challenge exposed the limitations of AI, it also revealed an emerging subculture of AI enthusiasts within the mathematics community. Online discussion boards and social media accounts devoted to mathematics filled up with purported proofs from top mathematicians and rogue graduate students alike. And it underlined how seriously AI companies, including ChatGPT creator OpenAI, are taking the challenge of teaching mathematics to LLMs.

“We didn’t expect there would be so much activity,” says Abouzaid. “We didn’t expect AI companies to take it so seriously and put so much effort into it.”

The First Proof team revealed the solutions to the 10 challenges early Saturday and posted about their own experiences trying to get LLMs to solve the problems. They found that the AI could produce plausible-looking proofs for every problem, but only two were correct: the ninth and 10th problems. And a proof nearly identical to the one for the ninth problem already existed. The first problem was also “contaminated”: a sketch of a proof had been archived from the website of its author, team member and 2014 Fields Medal winner Martin Hairer. Even so, the LLMs still failed to fill in the gaps.

Abouzaid says the style of the proofs that the LLMs produced was particularly surprising. “The true solutions I’ve seen from AI systems have the flavor of 19th-century mathematics,” he says. “But we’re trying to create 21st-century mathematics.”

External submissions didn’t look much better. Some appear to have used varying amounts of human input, and many seem to be the result of week-long conversations vetted by mathematicians. Importantly, the First Proof rules do not allow human mathematical input or checking.

“Once humans get involved, how do we decide how much is human and how much is AI?” says Lauren Williams, Dwight Parker Robinson Professor of Mathematics at Harvard University and one of the mathematicians who founded First Proof.

OpenAI posted its work on Saturday, the result of a week-long sprint using its latest in-house AI models working with “expert feedback” from human mathematicians. Jakub Pachocki, the company’s chief scientist, said in a social media post that he believes six of the company’s 10 solutions “have a high probability of being correct.” Mathematicians have already pointed to a possible hole in at least one of those six.

Setting aside how much human assistance the AI received, most submissions appear to be largely nonsense. Even before the challenge ended, many of the purported solutions that initially seemed credible were already being questioned by experts.

It will take several days for the submissions to be properly vetted by experts. And deciding whether a proof is actually “original” is even more difficult than deciding whether it is correct. “Nothing in mathematics is completely without precedent,” says University of Toronto mathematician Daniel Litt, who was not part of the First Proof team.

“We’re thinking of it as an experiment. Our goal was to get feedback,” says Abouzaid. The team writes that they are planning a second round with tighter controls, and more details will be released on March 14.

For some mathematicians who have been tracking AI progress, the lackluster results match their expectations. “I expected maybe two to three obviously correct solutions from publicly available models,” Litt says. “Ten would have been very surprising to me.”

Still, getting even a few valid solutions to research-level problems from AI was probably impossible until a few months ago. “I’ve already heard from colleagues that they’re in shock,” says Scott Armstrong, a mathematician at Sorbonne University in France. “These tools are coming to change mathematics, and it’s happening now.”

But for others who closely follow AI achievements, it was not a good showing.

“The models seem to have struggled,” says Kevin Barreto, a graduate student at the University of Cambridge who was not part of the First Proof team. Barreto recently used AI to solve one of the Erdős problems, a series of challenges posed by Hungarian mathematician Paul Erdős. “To be honest, yes, I’m somewhat disappointed.”
