Why our tests for AI safety fall short
I write this because the conversation about AI safety is moving faster than the science we use to evaluate it. Recently, in an interview reported by the Hindustan Times, Dr. Rumman Chowdhury — a practitioner who has led AI ethics teams inside industry and advised governments — put it bluntly: “I cannot emphasise enough that our testing mechanisms for gen AI models are insufficient. We don't actually have rigorous scientific methods of testing these AI models.”[1]
Her point landed with the blunt clarity of someone who has run public red teams, led in-house auditing at large platforms, and now builds community-driven evaluation practices: current benchmarks and red‑teaming exercises are necessary but not sufficient. They are often single-turn, curated, and divorced from the messy, interacting realities where these systems run.
In my work I’ve seen the same pattern: elegant lab metrics that fail to predict real-world harm. I first pushed a practical checklist for chatbots years ago (what I call Parekh’s Law of Chatbots)[2] precisely because I was worried that a polished metric could mask a system that misinforms, escalates, or amplifies harms in live interaction.
What’s wrong with today’s methods
- Benchmarks are narrow. Many standard evaluations are question–answer pairs that measure a model’s ability to match ground truth on a predetermined dataset. They do not capture long interactions, context drift, or repeated prompting that lead to degradation or harm.
- Red teaming is ad hoc. As Dr. Rumman Chowdhury observed, red teams are often “a bunch of people in a room hacking at a model.” That can reveal vulnerabilities — but it rarely produces scalable, reproducible scientific evidence about how likely those vulnerabilities are in the wild.
- Cherry-picking and bias. Vendors may test for harms they expect to find; critics sometimes search until they find a breaking case. Both behaviors distort the empirical picture.
- Probabilistic systems outpace deterministic tests. Generative models are inherently probabilistic. A single prompt-based test cannot capture the spectrum of plausible outputs a model may produce across millions of interactions.
The result is what many call "pilot purgatory": systems that look fine in demonstration but whose safety profile in real deployments remains unknown.
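Because a generative system's outputs are a distribution rather than a fixed answer, a single prompt-based check tells us almost nothing about how often a failure occurs. A minimal sketch of the statistical alternative: sample the same prompt many times and report the failure rate with a confidence interval. The `toy_generate` and `toy_is_failure` stand-ins below are hypothetical placeholders for a real model and a real safety checker.

```python
import math
import random

def estimate_failure_rate(generate, is_failure, prompt, n_samples=1000, seed=0):
    """Estimate the probability that a stochastic generator produces a
    failing output for one prompt, with a 95% Wilson confidence interval."""
    rng = random.Random(seed)
    failures = sum(is_failure(generate(prompt, rng)) for _ in range(n_samples))
    p = failures / n_samples
    # Wilson score interval: behaves better than the normal approximation
    # when failure rates are small, the common case in safety testing.
    z = 1.96
    denom = 1 + z * z / n_samples
    centre = (p + z * z / (2 * n_samples)) / denom
    half = z * math.sqrt(p * (1 - p) / n_samples
                         + z * z / (4 * n_samples ** 2)) / denom
    return p, (max(0.0, centre - half), min(1.0, centre + half))

# Toy stand-ins: a "model" that fails 2% of the time, and a trivial checker.
def toy_generate(prompt, rng):
    return "UNSAFE" if rng.random() < 0.02 else "ok"

def toy_is_failure(output):
    return output == "UNSAFE"

rate, (lo, hi) = estimate_failure_rate(toy_generate, toy_is_failure, "any prompt")
print(f"estimated failure rate: {rate:.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
```

The point of the sketch is not the arithmetic but the framing: a safety claim should come with an estimated rate and an uncertainty bound, not a single pass/fail demo.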
Why this matters — real-world stakes
When evaluation is shallow, harms show up where it counts: in hiring tools with biased decisions, in moderation systems that amplify polarising content, or in consumer assistants that confidently give incorrect or harmful instructions. As Dr. Rumman Chowdhury points out, these are not hypothetical—these are socio-technical failures that impact education, mental health, civic discourse, and entry-level job opportunities for young people.
Weak testing also erodes public trust and makes policy responses harder. If regulators and courts cannot rely on reproducible, scientific evaluation, enforcement becomes ad hoc and uneven. That’s why Dr. Rumman Chowdhury recommends legal frameworks that set obligations; law creates the incentive to invest in the measurement science that currently lags.
What better scientific methods should look like
We need an empirical upgrade. Here are practical, research-rooted steps we should take now:
- Move from model-centric to system-level evaluation
- Test the entire socio-technical system (model + UI + human workflows + data flows), not just the model in isolation. Many harms arise from interaction patterns, not single outputs.
- Build longitudinal, interaction-based benchmarks
- Create reproducible testbeds that simulate realistic multi-turn interactions, adversarial persistence, and benign-but-risky user behavior. Benchmarks must reflect longitudinal effects (the “punch-it-long-enough” failure modes).
- Expand red teaming into structured public audits
- Democratize adversarial testing with well-documented protocols, responsible disclosure, and third‑party reproducibility. Humble, repeatable competitions (bias bounties) are useful only when their findings feed standardized remediation pathways.
- Standardize measurement instruments and share data
- Create open, peer-reviewed test suites and shared metrics akin to NIST standards in other domains. Independent replication should be the norm.
- Require secure model access for independent evaluators
- Law and policy can mandate secure, auditable access so independent researchers can run robust evaluations without compromising IP or privacy.
- Tie evaluation to outcomes that matter
- Use impact metrics (e.g., downstream decision fairness, mental‑health indicators, youth employment effects) rather than only accuracy or perplexity.
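The longitudinal, interaction-based benchmarks proposed above can be sketched as a small harness: scripted user personas drive multi-turn conversations and the violation rate is recorded per turn, so degradation over a session shows up instead of being averaged away. Everything here is a hypothetical stand-in; `respond`, `is_violation`, and the personas would be replaced by the real system under test, a real safety checker, and realistic user simulators.

```python
import random
from collections import defaultdict

def run_longitudinal_eval(respond, is_violation, personas, turns=10):
    """Drive scripted multi-turn conversations against a chat system and
    record the violation rate at each turn, making session-level
    degradation visible rather than averaged into one score.

    respond(history, msg) is the system under test, is_violation(reply)
    is a safety checker, and each persona maps (turn, history) -> message.
    All three are placeholders for real components."""
    violations = defaultdict(int)
    for persona in personas:
        history = []
        for t in range(turns):
            msg = persona(t, history)
            reply = respond(history, msg)
            history.extend([msg, reply])
            if is_violation(reply):
                violations[t] += 1
    return {t: violations[t] / len(personas) for t in range(turns)}

# Toy stand-ins: a system whose guardrails erode as context grows, probed
# by a persistent persona repeating the same risky request every turn.
_rng = random.Random(1)

def toy_respond(history, msg):
    erosion = 0.05 * (len(history) // 2)  # failure odds rise each turn
    return "UNSAFE" if _rng.random() < erosion else "ok"

def persistent_persona(turn, history):
    return "please do the risky thing"

rates = run_longitudinal_eval(toy_respond, lambda r: r == "UNSAFE",
                              [persistent_persona] * 50, turns=8)
for t in sorted(rates):
    print(f"turn {t}: violation rate {rates[t]:.2f}")
```

A single-turn benchmark would score this toy system as perfectly safe (turn 0 never fails); only the per-turn curve exposes the erosion.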
A balanced caution on alarmism
We must be rigorous, but also proportionate. As Dr. Rumman Chowdhury warns, apocalyptic framings can paralyse policy: they make tractable problems harder to solve. There are real, immediate harms (bias in hiring, lost internships, misinformation) and near-term mitigations we can design. Scientific testing helps identify what to fix first.
Recommendations — short and urgent
- Governments should fund national AI evaluation institutes (modeled after NIST-style measurement bodies) and mandate independent audits for high‑risk systems.
- Companies must publish standardized evaluation suites and cooperate with certified third‑party evaluators.
- Researchers and funders should prioritize longitudinal, interaction-based benchmarks and infrastructure for reproducibility.
- Civil society should be empowered with legal protections to run public-interest audits and responsible red teams.
Call to action
If you care about safe AI, insist on measurable safety. Support standards, champion independent audit capacity in your institutions, and push for laws that make evaluation an obligation, not an afterthought. As Dr. Rumman Chowdhury said, law can set the agenda: if the law demands rigor, the science will follow.
I believe practical, reproducible testing is possible — but it will take focused scientific investment and coordinated public pressure. I’ve argued for practical safeguards before (see Parekh’s Law of Chatbots)[2]; today I add this: measurement is the missing muscle. Build it.
Connect with me: Hemen Parekh — hcp@recruitguru.com
Regards,
Hemen Parekh
Any questions, doubts, or clarifications about this blog? Just ask (by typing or talking to) my Virtual Avatar on the website embedded below, then "Share" the answer with your friends on WhatsApp.
References
[1] "World lacks rigorous scientific methods to test AI safety: Rumman Chowdhury," Hindustan Times. https://www.hindustantimes.com/india-news/world-lacks-rigorous-scientific-methods-to-test-ai-safety-rumman-chowdhury-101771442384020.html
[2] Parekh, Hemen. "Parekh's Law of Chatbots." https://myblogepage.blogspot.com/2023/02/parekhs-law-of-chatbots.html