Blog: The Data Problem in AI

Monday, 26 January 2026

The Data Problem in AI

I’ve been watching the AI conversation for years: the training races, the GPU land grabs, the breathless demos. So when Larry Ellison recently said that the biggest problem with today’s large AI models is that “they’re all trained on the same public internet data,” it struck me as both obvious and profound.[^1]

Why his point matters (and why I agree)

Ellison’s observation — that so many flagship models share largely the same public training corpus — speaks to a structural truth: public data alone gives you generality, not business differentiation. Public corpora build capable general-purpose models, but the real competitive value for enterprises comes when models can reason securely over private, contextual data.

I’ve argued along similar lines before. In my post “A Case of 900 Million Orphans” I warned that the people who generate the raw behavioral trails that power models are often left out of the conversation about value and ownership, and that real value will accrue to private, high-quality data rather than to another round of public-data training.

Public models give you language and pattern recognition. Private data gives you decisions that matter.

Three implications every leader should hear

Enterprises that think “buying the best foundation model is enough” are mistaken. The frontier is not only model size; it is the secure connection between reasoning engines and private operational data.
Whoever solves secure, auditable inference on private data will capture disproportionate economic value — and not just from selling models, but from selling dependable, regulated decision-making within industries like healthcare, finance, and supply chain.
The debate shifts from “who built the biggest model” to “who can make AI reliably and privately useful for mission-critical operations.”

Practical steps I advise for companies today

Treat data as an asset to be read safely, not a commodity you dump into third-party models.

Vectorize and index critical records behind your control plane; use retrieval-augmented generation (RAG) patterns rather than indiscriminate retraining on private data.
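To make the RAG pattern concrete, here is a minimal sketch of the retrieval half, under loud assumptions: the records, the `vectorize` function (a toy bag-of-words count, standing in for a real embedding model), and every name below are hypothetical illustrations, not a description of any particular product. The point it shows is that the private records stay in an index you control, and only the few retrieved snippets are placed into the prompt.

```python
import math
import re
from collections import Counter

# Hypothetical private records; in practice these stay behind your own control plane.
RECORDS = {
    "claim-101": "Water damage claim approved after adjuster inspection of the kitchen",
    "claim-102": "Auto collision claim denied because the policy had lapsed",
    "contract-7": "Supplier contract renews annually with a ninety day termination notice",
}


def vectorize(text):
    """Toy bag-of-words vector (assumption: a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The index is built once, inside your own environment.
INDEX = {rid: vectorize(text) for rid, text in RECORDS.items()}


def retrieve(question, k=2):
    """Return the ids of the k records most similar to the question."""
    q = vectorize(question)
    ranked = sorted(INDEX, key=lambda rid: cosine(q, INDEX[rid]), reverse=True)
    return ranked[:k]


def build_prompt(question, k=2):
    """Assemble a prompt that carries only the retrieved snippets, not the whole corpus."""
    context = "\n".join(f"[{rid}] {RECORDS[rid]}" for rid in retrieve(question, k))
    return f"Answer using only the context below.\n{context}\n\nQuestion: {question}"
```

With these sample records, `retrieve("Why was the auto collision claim denied?", k=1)` returns `['claim-102']`, and the prompt built from it contains that one record and nothing else. No model weights are touched, which is the contrast with retraining on private data.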

Build an inference-first architecture.

Low latency, audit trails, and policy enforcement belong at inference time. Train less publicly; serve more privately.
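As a rough sketch of what "policy enforcement and audit trails at inference time" could mean in code, the wrapper below checks a role-based policy before any model call and records every attempt, allowed or not. Everything here is an assumption made for illustration: the `POLICY` table, the role and data-class names, the in-memory `AUDIT_LOG`, and `fake_model`, which merely stands in for a real model endpoint.

```python
import time

# Hypothetical policy: which roles may query which classes of private data.
POLICY = {
    "claims_adjuster": {"claims"},
    "legal_counsel": {"claims", "contracts"},
}

# Assumption: a real deployment would use an append-only store, not a list in memory.
AUDIT_LOG = []


def fake_model(prompt):
    """Stand-in for a real model call (hypothetical)."""
    return f"(model answer to: {prompt})"


def guarded_inference(user, role, data_class, prompt, model=fake_model):
    """Enforce the policy, call the model only if allowed, and log every attempt."""
    allowed = data_class in POLICY.get(role, set())
    started = time.perf_counter()
    answer = model(prompt) if allowed else None
    AUDIT_LOG.append({
        "user": user,
        "role": role,
        "data_class": data_class,
        "allowed": allowed,
        "latency_ms": round((time.perf_counter() - started) * 1000, 3),
    })
    if not allowed:
        raise PermissionError(f"role {role!r} may not query {data_class!r}")
    return answer
```

The design choice worth noticing is that the denied request is logged before the error is raised, so the audit trail covers refusals as well as answers.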

Invest in governance and consent.

Data contracts, provenance, and user/subject consent must be embedded; otherwise trust erodes and regulation follows.

Start small with high-value use cases.

Focus on narrow, measurable problems (claims adjudication, contract summarization, clinical decision support) before scaling horizontally.

Prepare for hybrid models and marketplaces.

Don’t assume you will be locked into a single vendor; design your stack so that specialized models can query private data securely via APIs or isolation layers.

The open cautions I’ll keep repeating

Security and privacy are not checkboxes. Exposing private data — even in vectorized form — without provable controls invites risk.
Synthetic data and on-device learning will change the economics, but they won’t remove the need for strong governance and business-context signal.
Concentration of private data is a double-edged sword. Firms that centralize valuable enterprise datasets can enable breakthroughs — and also create monopolies that invite scrutiny.

My mental model going forward

Think of modern AI as having two phases:

Phase 1 — Foundation models trained on public data: broad capability, rapid innovation, commoditization risk.
Phase 2 — Inference and private-data reasoning: where business value, differentiation, and regulatory tension converge.

Phase 2 is the one we should be designing for now.

A short checklist for executives (two weeks to start)

Identify 2 high-impact workflows that fail today for lack of contextual data.
Prototype a RAG-powered pilot with strict access controls and audit logs.
Appoint a cross-functional owner for data governance and model behavior monitoring.
Map compliance risks and draft a minimal consent and redaction plan.
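For the last checklist item, a "minimal consent and redaction plan" might begin as something like the sketch below: drop any record without explicit consent, and mask obvious identifiers before a record reaches an index or a prompt. The record shape (`id`, `text`, `consent`) and the two patterns are assumptions for illustration; simple regular expressions like these will miss many forms of personal data and are no substitute for a reviewed redaction process.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Mask email addresses and simple 10-digit phone numbers."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def prepare_for_indexing(records):
    """Keep only records with explicit consent, with their text redacted."""
    return [
        {"id": r["id"], "text": redact(r["text"])}
        for r in records
        if r.get("consent") is True
    ]
```

A record without a `consent` field is treated the same as one that refused: it is left out, so the default is exclusion rather than inclusion.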

I don’t think the conversation is about replacing models; it’s about connecting them to the right data, with the right guardrails. As I’ve written before in “A Case of 900 Million Orphans”, the people who produce the data (customers, employees, citizens) deserve both protection and a voice in how that value is realized.

If you’re building AI for the enterprise, start with the question: what private knowledge does the model need to do useful work for us? Then build the pipelines, contracts, and controls that let the model reason — securely — against that knowledge.

Regards, Hemen Parekh

[^1]: See reporting on Larry Ellison’s remarks and Oracle’s positioning on enterprise AI and private-data inference: Times of India and Moneycontrol.


