Hi Friends,

Even as I launch this today ( my 80th Birthday ), I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now as I approach my 90th birthday ( 27 June 2023 ) , I invite you to visit my Digital Avatar ( www.hemenparekh.ai ) – and continue chatting with me , even when I am no more here physically


Tuesday, 20 January 2026

When AI Proofs Don't Add Up


Introduction

I’ve been following the surge of excitement — and alarm — around AI-generated mathematics for a few years now. On the one hand, we’re seeing tools that can sketch proofs, suggest lemmas, and surface overlooked literature. On the other, many mathematicians tell me that AI’s solutions to "knotty" problems often look convincing while quietly being wrong. That tension matters for researchers, educators, and anyone who wants to treat machine output as trustworthy.

Background: the debate in brief

AI systems have moved from symbolic theorem provers to large language models (LLMs) trained on vast corpora of human-written mathematics. Recent papers and press pieces document both progress and failures: models now clear many benchmark problems and can even assist formalization efforts, yet a recurring theme is the gap between a correct final answer and a valid, checkable proof (see, for example, the recent survey "Formal Mathematical Reasoning: A New Frontier in AI").

I’ve noted related themes in my earlier pieces — for example, in my post "Is Math the Path to Chatbots that Don't Make Stuff Up?", on using math to reduce chatbot hallucinations — and I keep coming back to the same conclusion: math gives us a path toward accountability, but only if we pair models with verification.

Examples: convincing but incorrect outputs

Here are representative failure modes that have been observed across systems and benchmarks:

  • LLMs that produce a plausible multi-step proof but include an invalid inference late in the argument, so the conclusion is unsupported despite the polished prose (documented across recent benchmark analyses).
  • Claims of "solutions" to open problems later revealed to be restatements of known results, or dependent on unstated assumptions found only after careful literature review.
  • AI systems that output formal proof-looking code (e.g., Lean fragments) that type-check superficially but rely on omitted lemmas or undeclared axioms.
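The third failure mode is easy to make concrete. Below is a toy Lean 4 fragment written purely for illustration (not taken from any model's output, and assuming a recent Lean 4 where the `omega` tactic is available): the first theorem "compiles" only because the crucial step is deferred with `sorry`, something a superficial "it type-checks" glance can miss, while the second is genuinely machine-checked.

```lean
-- A "proof" that Lean accepts only with a warning: the hard step is deferred.
-- A superficial check ("the file compiles") can miss this gap entirely.
theorem sum_of_two_evens_is_even (a b : Nat)
    (ha : a % 2 = 0) (hb : b % 2 = 0) : (a + b) % 2 = 0 := by
  sorry  -- the crucial argument is missing; Lean flags this as unproved

-- A genuinely checked version leaves nothing deferred:
theorem sum_of_two_evens_is_even' (a b : Nat)
    (ha : a % 2 = 0) (hb : b % 2 = 0) : (a + b) % 2 = 0 := by
  omega  -- linear-arithmetic tactic closes the goal; no warnings, no gaps
```

The point is not the tactic; it is that only the second theorem gives the verifier nothing to complain about.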

These aren’t hypothetical: researchers regularly post model-generated proofs that are later corrected or withdrawn after verification or community scrutiny.

Expert perspective

I want to quote a leading mathematician whose caution is widely echoed across the field:

  • Terence Tao: "AI-written proofs have improved to the point of being human-readable and technically correct in many places, but they can still feel off: stressing trivial steps while skimming or skipping subtle, crucial arguments." (paraphrase of public remarks)

I’ll also include a short, labeled fictionalized quote that captures many practicing mathematicians’ anxieties:

  • (Fictionalized) — "The output reads like a mathematician who skimmed two textbooks and then guessed the middle part of the proof." — a working mathematician (fictionalized for illustration).

Why LLMs confidently give wrong answers

When an AI produces a mathematically incorrect but plausible-seeming solution, several technical reasons usually play a role:

  • Overconfidence: models are trained to produce fluent, final-form answers and will present confident, detailed text even when internal uncertainty is high.
  • Hallucination: generative models can invent lemmas, citations, or intermediate constructions that look legitimate but have no grounding in the axioms or literature.
  • Training-data limits and contamination: LLMs can recompose fragments of prior proofs seen during training, producing outputs that are familiar-sounding but not logically valid for the new statement.
  • Lack of formal verification: natural-language proofs are inherently informal. Without translation into a proof assistant (Lean, Coq, Isabelle), subtle logical gaps remain invisible to the model.
  • Self-evaluation blind spots: models are often poor at reliably checking proofs they themselves generated; cross-model verification or formal checkers are needed instead.
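One cheap mitigation follows from the last two points: before reading a model's "proof" of an identity, spot-check the claim itself against brute force. A minimal Python sketch (the two closed forms below are illustrative stand-ins for model output, not real transcripts):

```python
# Minimal sketch: never trust a stated identity on fluency alone;
# spot-check it numerically before reading the accompanying "proof".
def spot_check(claimed_formula, ground_truth, cases):
    """Return the cases where the claimed closed form disagrees with brute force."""
    return [n for n in cases if claimed_formula(n) != ground_truth(n)]

# Ground truth by brute force: the sum of the first n cubes.
def sum_of_cubes(n):
    return sum(k**3 for k in range(1, n + 1))

# A correct closed form (the classical identity) and a plausible-looking wrong one.
correct = lambda n: (n * (n + 1) // 2) ** 2
wrong   = lambda n: n**2 * (n + 1) ** 2 // 2   # off by a factor of 2, yet "looks" algebraic

print(spot_check(correct, sum_of_cubes, range(1, 50)))      # -> []  (no counterexamples)
print(len(spot_check(wrong, sum_of_cubes, range(1, 50))))   # -> 49  (reject the "proof")
```

A numeric check is not a proof, but a single counterexample is enough to stop you from being persuaded by polished prose.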

Formal verification and automated theorem proving: partial solutions

Formal proof assistants provide a clear path: encode statements and proofs in a machine-checkable language and let the verifier confirm every logical step. Recent research shows promising hybrids:

  • Autoformalization (translating informal proofs into Lean or Coq) can bootstrap provers with human-style intuition.
  • Neural theorem provers combined with symbolic search and verification close many hallucination gaps.

But formalization is expensive, and current autoformalizers aren’t perfect. The pragmatic middle ground is a human+AI loop where AI proposes steps and humans or a formal checker validate them. This is augmentation, not automation.
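That human+AI loop can be sketched in a few lines. In the Python sketch below, `propose_step` and `check_step` are hypothetical placeholders: the former stands in for any generator (an LLM), the latter for any validator (a proof assistant or a human reviewer). The key design choice is that nothing enters the proof unless the checker accepts it.

```python
# Sketch of a propose-and-check loop: the generator suggests steps,
# but only checker-approved steps accumulate into the argument.
def prove(goal, propose_step, check_step, max_steps=10):
    """Accumulate only steps the checker accepts; fail loudly otherwise."""
    accepted = []
    for _ in range(max_steps):
        step = propose_step(goal, accepted)
        if step is None:                      # generator believes the goal is reached
            return accepted
        if not check_step(goal, accepted, step):
            raise ValueError(f"rejected step: {step!r}")
        accepted.append(step)
    raise TimeoutError("no proof found within step budget")

# Toy demo: "prove" the goal of counting to 3, one checked step at a time.
steps = prove(
    3,
    propose_step=lambda goal, acc: None if len(acc) >= goal else len(acc) + 1,
    check_step=lambda goal, acc, step: step == len(acc) + 1,
)
print(steps)  # -> [1, 2, 3]
```

In a real pipeline the checker would be Lean, Coq, or Isabelle rather than a lambda, but the division of labour is the same: generation is cheap and fallible, acceptance is strict.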

Implications for researchers, educators, and the public

  • Researchers: AI can accelerate exploration and suggest conjectures, but claims of new theorems still require independent verification. Treat model output as hypotheses rather than finished results.
  • Educators: Students can benefit from AI as a tutor, but instructors must teach critical verification skills: how to spot gaps, demand explicit assumptions, and translate arguments into checkable steps.
  • Public and media: Sensational headlines about "AI solving open math problems" should be approached skeptically; many such claims collapse on peer review and verification.

Recommendations: responsible use and collaboration

  • Pair LLMs with formal verifiers: whenever possible, translate important outputs into a proof assistant for checking.
  • Encourage toolchains that separate generation from verification: different models or systems should generate and then critique/verify one another to reduce self-critique blind spots.
  • Benchmark honestly: report both final-answer correctness and full-proof validity; the discrepancy between the two matters greatly.
  • Build interdisciplinary teams: mathematicians, formal-methods experts, and AI engineers should co-design datasets, interfaces, and pipelines.
  • Teach verification literacy: make formal thinking and basic proof-checking a core part of how we teach students to use AI tools.
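The "benchmark honestly" point can be made concrete. Assuming a hypothetical record format with one flag for final-answer correctness and one for full-proof validity, the sketch below reports both rates and the gap between them, which is exactly the "plausible but wrong" quantity that headline numbers tend to hide:

```python
# Sketch (hypothetical record format): report final-answer accuracy and
# full-proof validity separately; the gap between them is the real story.
records = [
    {"answer_correct": True,  "proof_valid": True},
    {"answer_correct": True,  "proof_valid": False},   # right answer, broken proof
    {"answer_correct": True,  "proof_valid": False},
    {"answer_correct": False, "proof_valid": False},
]

def rate(records, key):
    return sum(r[key] for r in records) / len(records)

answer_acc = rate(records, "answer_correct")
proof_acc  = rate(records, "proof_valid")
print(f"final-answer accuracy:   {answer_acc:.2f}")   # -> 0.75
print(f"full-proof validity:     {proof_acc:.2f}")    # -> 0.25
print(f"plausible-but-wrong gap: {answer_acc - proof_acc:.2f}")  # -> 0.50
```

A leaderboard that reported only the first number would make this imaginary system look three times better than it is.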

Conclusion

AI is changing how we explore mathematics. The good news is that models assist with search, drafting, and idea-generation in ways that were unimaginable a few years ago. The cautionary news is equally clear: many AI-generated mathematical outputs remain plausibly wrong. The path forward is collaborative and pragmatic: use AI as a copilot, demand verification, and invest in toolchains that bridge human intuition and machine rigor.

I’ve been tracking this space and writing about how math can reduce AI hallucination before — see my earlier piece "Is Math the Path to Chatbots that Don't Make Stuff Up?". We need both optimism and discipline: build powerful tools, and build the verification scaffolding that makes their claims trustworthy.


Regards,
Hemen Parekh


Any questions / doubts / clarifications regarding this blog? Just ask (by typing or talking) my Virtual Avatar on the website embedded below. Then "Share" that to your friend on WhatsApp.


Get correct answer to any question asked by Shri Amitabh Bachchan on Kaun Banega Crorepati, faster than any contestant


Hello Candidates :

  • For UPSC – IAS – IPS – IFS etc. exams, you must prepare to answer essay-type questions which test your General Knowledge / sensitivity to current events
  • If you have read this blog carefully, you should be able to answer the following question:
"Why does translating an informal proof into a formal proof assistant like Lean reduce the risk of hallucinated or incorrect steps in AI-generated mathematics?"
  • Need help? No problem. Following are two AI AGENTS where we have PRE-LOADED this question in their respective Question Boxes. All that you have to do is just click SUBMIT
    1. www.HemenParekh.ai { an SLM, powered by my own Digital Content of more than 50,000 documents, written by me over the past 60 years of my professional career }
    2. www.IndiaAGI.ai { a consortium of 3 LLMs which debate and deliver a CONSENSUS answer – and each gives its own answer as well ! }
  • It is up to you to decide which answer is more comprehensive / nuanced ( For sheer amazement, click both SUBMIT buttons quickly, one after another ). Then share any answer with yourself / your friends ( using WhatsApp / Email ). Nothing stops you from submitting ( just copy / paste from your resource ) all those questions from last year’s UPSC exam paper as well !
  • Maybe there are other online resources which too provide answers to UPSC "General Knowledge" questions – but only I provide them in 26 languages !




