The Google DeepMind team has introduced Aletheia, a specialized AI agent designed to bridge the gap between competition-level mathematics and professional research. While models achieved gold-medal standard at the 2025 International Mathematical Olympiad (IMO), research requires navigating a vast literature and constructing long-horizon proofs. Aletheia addresses this by iteratively generating, verifying, and revising solutions in natural language.

Architecture: Agentic Loop
Aletheia is powered by an advanced version of Gemini Deep Think. It uses a three-part 'agentic harness' to improve reliability:
- Generator: Proposes a candidate solution to a research problem.
- Verifier: An informal natural-language mechanism that checks the candidate for flaws or hallucinations.
- Reviser: Corrects errors identified by the Verifier until the final output is accepted.
This separation of duties is important: the researchers observed that explicitly separating verification from generation helps the model recognize flaws that it initially overlooks while generating.
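The generate, verify, and revise cycle described above can be sketched as a simple control loop. This is a minimal illustration, not DeepMind's implementation: the function name `run_harness`, the `max_rounds` cap, and the toy `generate`, `verify`, and `revise` callables are all assumptions made for the example. In the real system each role would be played by a model call.

```python
from typing import Callable, List, Optional

def run_harness(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], List[str]],
    revise: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,  # hypothetical cap on revision rounds
) -> Optional[str]:
    """Return an accepted candidate, or None if none passes verification."""
    candidate = generate(problem)
    for _ in range(max_rounds):
        issues = verify(problem, candidate)
        if not issues:
            return candidate  # the verifier found nothing to fix
        candidate = revise(problem, candidate, issues)
    # Check the last revision once more before giving up.
    return candidate if not verify(problem, candidate) else None

# Toy stand-ins for the three roles (purely illustrative).
def toy_generate(problem: str) -> str:
    return "2 + 2 = 5"

def toy_verify(problem: str, candidate: str) -> List[str]:
    return ["arithmetic error"] if candidate.endswith("5") else []

def toy_revise(problem: str, candidate: str, issues: List[str]) -> str:
    return candidate[:-1] + "4"

def always_reject(problem: str, candidate: str) -> List[str]:
    return ["unconvinced"]

accepted = run_harness("What is 2 + 2?", toy_generate, toy_verify, toy_revise)
rejected = run_harness("What is 2 + 2?", toy_generate, always_reject, toy_revise)
print(accepted, rejected)
```

Keeping `verify` as a separate callable mirrors the design point above: the check is made by a distinct step rather than folded into generation, and the loop can decline to return anything if no candidate passes.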
Key Technical Findings
The development of Aletheia revealed several insights into how AI handles complex reasoning:
- Inference-Time Scaling: Allowing the model to spend more compute at query time ('thinking longer') significantly increases accuracy. The January 2026 version of Deep Think reduced the compute required for IMO-level problems by 100x compared to the 2025 version.
- Performance: Aletheia achieved 95.1% accuracy on IMO-ProofBench Advanced, a large jump over the previous record of 65.7%. It also demonstrated state-of-the-art results on FutureMath Basic, an internal benchmark of PhD-level exercises.
- Tool Use: To prevent hallucinated citations, Aletheia uses Google Search and web browsing. This helps it synthesize real-world mathematical literature.
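To put the IMO-ProofBench Advanced figures quoted above in perspective, the short calculation below compares the two accuracy numbers. The error-rate framing is an additional reading of the same two percentages rather than a figure reported in the article, and the helper name `compare_accuracy` is hypothetical.

```python
def compare_accuracy(previous_pct: float, current_pct: float) -> dict:
    """Summarize the change between two accuracy percentages."""
    previous_error = 100.0 - previous_pct
    current_error = 100.0 - current_pct
    return {
        # Absolute gain in percentage points.
        "gain_points": round(current_pct - previous_pct, 1),
        # Share of problems not solved, before and after.
        "previous_error_pct": round(previous_error, 1),
        "current_error_pct": round(current_error, 1),
        # How many times smaller the error rate became.
        "error_reduction_factor": round(previous_error / current_error, 1),
    }

summary = compare_accuracy(65.7, 95.1)
print(summary)
```

Read this way, the move from 65.7% to 95.1% is a gain of 29.4 percentage points, and the share of unsolved problems falls from 34.3% to 4.9%, roughly a sevenfold reduction.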
Research Milestones
Aletheia has already contributed to several peer-reviewed milestones:
- Fully Autonomous (Feng26): Aletheia produced a paper calculating structure constants called eigenweights without any human intervention.
- Collaborative (LeeSeo26): The agent provided a high-level roadmap and "big picture" strategy for proving bounds on independent sets, which the human authors then turned into a rigorous proof.
- Erdős Conjectures: Deployed against 700 open problems, Aletheia found 63 technically correct solutions and resolved 4 open questions autonomously.
A Taxonomy for AI Autonomy
DeepMind proposed a standard for classifying AI math contributions, similar to the levels used for autonomous vehicles.
| Level | Autonomy Description | Significance (Example) |
| Level 0 | Primarily human | Negligible novelty (Olympiad level) |
| Level 1 | Human-AI collaboration | Minor novelty (Erdős-1051) |
| Level 2 | Essentially autonomous | Publishable research (Feng26) |
The paper Feng26 is classified as Level A2, meaning that it is essentially autonomous and of publishable quality.
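A label such as A2 combines two axes: a letter for autonomy and a digit for mathematical significance. The sketch below shows one way such labels could be parsed and described. It is an illustration only: the names `Classification` and `parse_label` are hypothetical, the reading of "H" as "primarily human" is an assumption, and only the descriptions that appear in this article are filled in.

```python
from dataclasses import dataclass

# Autonomy letters: "A" is described in the text; "H" as "primarily human"
# is an assumption. Intermediate letters are not named in this article.
AUTONOMY = {
    "H": "Primarily human",
    "A": "Essentially autonomous",
}

# Significance descriptions taken from the table; higher levels are not
# described in this article, so they are left out.
SIGNIFICANCE = {
    0: "Negligible novelty",
    1: "Minor novelty",
    2: "Publishable research",
}

@dataclass(frozen=True)
class Classification:
    autonomy: str       # single letter, e.g. "A"
    significance: int   # integer from 0 to 4

    def describe(self) -> str:
        autonomy_text = AUTONOMY.get(self.autonomy, "not described here")
        significance_text = SIGNIFICANCE.get(self.significance, "not described here")
        return f"Level {self.autonomy}{self.significance}: {autonomy_text}, {significance_text}"

def parse_label(label: str) -> Classification:
    """Parse a two-character label such as 'A2'."""
    if len(label) != 2 or not label[0].isalpha() or not label[1].isdigit():
        raise ValueError(f"expected a letter followed by a digit, got {label!r}")
    significance = int(label[1])
    if not 0 <= significance <= 4:
        raise ValueError(f"significance must be between 0 and 4, got {significance}")
    return Classification(label[0].upper(), significance)

feng26 = parse_label("A2")
print(feng26.describe())
```

Storing the two axes as separate fields keeps them independent, so a result can be highly autonomous but of minor novelty, or the reverse.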
Key Takeaways
- Introduction of a research-grade AI agent: Aletheia is a mathematics research agent that moves beyond competition-level solutions to autonomously generate, verify, and revise mathematical proofs in natural language. It is powered by an advanced version of Gemini Deep Think and an agentic loop consisting of a Generator, a Verifier, and a Reviser.
- Significant gains from inference-time scaling: DeepMind researchers found that giving models more 'thinking time' during inference produced substantial gains in accuracy. The January 2026 version of Deep Think reduced the compute required for Olympiad-level performance by 100x, and Aletheia achieved a record 95.1% accuracy on IMO-ProofBench Advanced.
- Milestones in autonomous research: The system achieved several 'firsts', including a research paper in arithmetic geometry (Feng26) produced entirely without human intervention. It also autonomously resolved 4 open questions from the Erdős Conjectures database.
- Important role of tool use and verification: To combat 'hallucinations' such as fabricated paper citations, Aletheia relies heavily on Google Search and web browsing. Additionally, separating the verification phase from the generation phase proved essential for identifying flaws that the model had initially overlooked.
- Proposal for a new autonomy taxonomy: The paper suggests a standardized framework for documenting AI-assisted results, with axes for autonomy (Level H to Level A) and mathematical significance (Level 0 to Level 4). Its purpose is to provide transparency and close the "evaluation gap" between AI claims and professional mathematical standards.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.

