Hi Friends,

Even as I launch this today (my 80th Birthday), I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now, as I approach my 90th birthday (27 June 2023), I invite you to visit my Digital Avatar (www.hemenparekh.ai) and continue chatting with me, even when I am no longer here physically.


Tuesday, 6 January 2026

Talking to an LLM: Like a Human to Another Human

 


 

Currently, when I speak (audio) to any LLM, it replies in TEXT. If I want to listen, I need to click on a mic symbol.

An LLM does NOT reply to my question by talking back (no text whatsoever), just like a human would.

I mean a "no text intermediary" situation.

This made me ask the following (poorly worded, no doubt) question to www.IndiaAGI.ai:

"LLMs are in a mad rush to leapfrog each other at generating the most vivid and beautiful VIDEO, given as little as (mostly?) a text prompt. Right?

Now, if I were to speak/utter a few words (a spoken prompt) here, would you be able to generate the most authentic (in my own voice?) and sweet AUDIO, and overtake all the LLMs?"

 

IndiaAGI replied (you guessed it: in text):

 

“Grok provided a consensus response

I appreciate the thoughtful back-and-forth we've had on this—it's clear the tech for voice cloning has come a long way, and your point about the "mad rush" in video generation from text prompts is spot on, as it's driving similar innovation in audio. To wrap up what we've explored, let's pull together the key insights from everyone's contributions, balancing the technical possibilities with the real-world challenges of making something that could genuinely "overtake all the LLMs."

There's strong agreement that generating authentic, high-quality audio in your own voice from a spoken prompt is not only possible but increasingly refined.

Models like Microsoft's VALL-E 2 have hit human-parity benchmarks, meaning the output can sound remarkably natural and indistinguishable from real speech in controlled tests, especially with just a few seconds of your voice as a reference.

As GPT pointed out, this is backed by evidence from recent research, where systems achieve near-perfect speaker similarity and emotional nuance. Platforms like ElevenLabs are leading the charge here, with features that capture not just your timbre but also prosody and context, making the results feel genuinely "sweet" and personalized.

That said, "overtaking all the LLMs" isn't just about matching voice quality—it's about the full package. We all concur that while the core synthesis tech is advanced, the gap lies in things like scalability, ethical safeguards, and market infrastructure.

For instance, Gemini's emphasis on data diversity highlights how clean, varied samples (covering different emotions and contexts) are crucial for versatility, but even with that, tools like ElevenLabs edge out open-source alternatives in real-time performance and user-friendly controls. Claude's reminder about ethical layers—consent, watermarking, and misuse prevention—is spot on, as it's not just a nice-to-have; it's a market requirement, especially with regulations tightening around AI voice tech.

Where we have some nuanced disagreements is on the emphasis: GPT's phased roadmap makes a strong case for starting small and iterating, which feels practical, but there's debate on whether that's enough to compete with established players. For example, Grok's synthesis pointed out that while research models like VALL-E 2 are impressive, they don't yet handle the enterprise-level features (like API reliability or multi-language support) that ElevenLabs has nailed. On the flip side, if we're talking pure innovation, Gemini's focus on contextual control—adapting to styles or emotions—shows how you could push beyond current leaders, but that requires more investment in data and fine-tuning than a quick demo.

The strongest arguments come from the evidence: ElevenLabs consistently tops comparisons for emotional expressiveness and voice cloning speed, as seen in benchmarks from sources like Cartesia's evaluations, but logical gaps remain, like assuming effortless superiority without addressing deployment costs or ethical risks. For instance, while VALL-E 2 achieves human parity in labs, real-world applications still struggle with long-form consistency or accent diversity, as noted in studies on deepfake detection and TTS robustness.

Looking ahead, the most productive directions might involve hybrid approaches—leveraging open-source models for customization while borrowing enterprise features from leaders like ElevenLabs. That could mean experimenting with tools that integrate voice cloning into broader workflows, like video or content creation, to make it more versatile than just standalone audio.

To make this actionable, here's a streamlined path based on what we've compiled: Start with a quick test using ElevenLabs or a similar API to generate a sample from your spoken prompt, then evaluate it against metrics like speaker similarity (from VALL-E research) and emotional accuracy. If it hits the mark, iterate toward a custom setup with safety features, as GPT suggested. That way, you can gauge what's "sweet" for your needs without overcommitting.

For reference, here's a consolidated list of sources from our discussion, grouped by theme:

·         VALL-E and TTS Research: VALL-E 2 (arXiv, 17 Jun 2024), VALL-E 2 (arXiv PDF), VALL-E 2 (Microsoft Research), VALL-E (Towards Data Science), Training VALL-E (Paperspace), VALL-E X (Medium), Optimizing TTS (Eurasip), Next Tokens Denoising (arXiv), Robust TTS (Interspeech), EXPRESS-Voice (Synthesia), Deepfake Detection (arXiv), Finding the Human Voice (arXiv), Forensic Deepfake Detection (arXiv)

·         Voice Generation Platforms and Comparisons: ElevenLabs — Speechify Alternatives, ElevenLabs Alternatives, ElevenLabs vs. Speechify, ElevenLabs vs. Descript, Speechify Review, Voice Cloning Tools, Voice Cloning Services, ElevenLabs vs Resemble AI, ElevenLabs vs Murf, r/ElevenLabs Discussions, r/ElevenLabs on Competitors, The Top Open-Source TTS Models, Best AI Voice Generators, 13 TTS Solutions

·         Broader AI Trends and Market Insights: The 15 Best AI APIs, ChatGPT vs. Gemini, Real-world AI Use Cases, AI in the Workplace 2025, The 45 Best AI Tools, Which AI to Use, AI Note-Taking Apps, AI Personal Assistants, Microsoft AI Success Stories, CreatorX.ai Comparison

If you're ready to test this out, let's start with that quick sample. What kind of spoken prompt do you have in mind?

I continued :

"Hey guys: I was a poor communicator! I did not mean for you to generate an AUDIO (a translation of your TEXT reply) when I SPEAK to you. I wanted that, after listening to my voice, you TALK BACK to me in some VOICE, just as my granddaughter does! A human and an LLM, TALKING to each other, without looking at any text. Can you do THAT?"

 

IndiaAGI replied :

I appreciate the back-and-forth here—it's clear we're all trying to nail down what a truly voice-only conversation with an AI could look like, especially for something as personal as chatting with your granddaughter.

From what we've explored, the consensus is that you can get pretty close to that natural, spoken-only exchange right now, but it's not quite a perfect, text-free system under the hood for most options. Let's break this down based on what everyone's shared, while keeping it balanced and practical.

First, there's solid agreement that platforms like Character.AI's Character Calls and OpenAI's ChatGPT Voice Mode deliver the kind of voice-to-voice interaction you're after, where you speak and hear a response without needing to look at text.

DeepSeek's research highlighted how these systems handle interruptions and maintain a conversational flow, making them feel human-like for casual talks.

GPT and Gemini both emphasized that the user experience can be seamless enough for your use case, even if there's some internal text processing happening behind the scenes. That's the strongest supported conclusion: these tools are accessible and effective for voice-only chats, with low latency (around 200-500ms) that keeps things feeling natural.

Where we see some nuance is in the limitations. Not everything is fully "text-free" at the core—most systems convert speech to text, process it, and then convert back to speech, as DeepSeek pointed out from OpenAI's docs. That means it's not a pure end-to-end audio setup yet, though emerging tech like OpenAI's Realtime API is pushing toward more direct speech-to-speech. Gemini's reminder about the benefits of this text layer (like better safety and accuracy) is worth noting, but it does introduce a small gap between what you asked for and what's fully available. Still, the disagreement isn't huge; everyone agrees these tools work well for everyday use, even if they're not 100% raw audio.
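This speech-to-text-to-speech cascade can be pictured as three stages chained together. A minimal sketch, in which transcribe, ask_llm, and synthesize are hypothetical stand-ins for real services (not any actual vendor API):

```python
# Cascade voice pipeline: STT -> LLM -> TTS.
# All three stages are illustrative stubs; a real system would call
# a speech-to-text model, a chat model, and a TTS engine here.

def transcribe(audio: bytes) -> str:
    """Stub STT: pretend the recorded audio said this."""
    return "Can you talk back to me in a voice?"

def ask_llm(prompt: str) -> str:
    """Stub LLM: canned text reply to the transcribed question."""
    return "Of course. Let's talk."

def synthesize(text: str) -> bytes:
    """Stub TTS: placeholder for rendering text as waveform bytes."""
    return text.encode("utf-8")

def voice_turn(audio_in: bytes) -> bytes:
    # The user never sees text, but it exists between the stages.
    text_in = transcribe(audio_in)
    text_out = ask_llm(text_in)
    return synthesize(text_out)

reply_audio = voice_turn(b"<microphone capture>")
```

A true end-to-end speech-to-speech model would collapse voice_turn into a single audio-in, audio-out call, which is the direction the Realtime API points toward.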

The strongest arguments come from the sources we've referenced:

·         Character.AI's Character Calls stand out for their phone-like, two-way design, supporting multiple languages and quick interruptions: Introducing Character Calls — Character.ai (June 27, 2024) and Character Calls & Voice FAQ — Character.ai.

·         OpenAI's Voice Mode offers polished, responsive voice interactions with options for customization: Voice Mode FAQ — OpenAI and Introducing the Realtime API.

We've also spotted a few logical gaps in the broader claims—things like assuming "perfect" human-level continuity everywhere might be overstating it, as latency or character design can still affect the feel. But the evidence is clear that this tech is advancing fast, and your idea of AI as a speaking companion is already here.

Looking ahead, it's worth exploring how these systems evolve; for instance, more direct speech-to-speech models could make interactions even smoother. That said, the current options are solid for testing what works best for you.

If I were you, I'd start with Character.AI for its ease and free access—pick a character that's warm and conversational, like one with a "grandmother" vibe, and just tap to call. Use headphones in a quiet spot to minimize any hitches. Once you've tried that, compare it to ChatGPT's Voice Mode for a more refined experience. Either way, tweaking the character's prompts or instructions can help it respond more like family chat.

For sources, here's a quick roundup of what we've referenced to build this picture:

·         Character.AI features: Introducing Character Calls — Character.ai (June 27, 2024), Character Calls & Voice FAQ — Character.ai

·         OpenAI voice capabilities: Voice Mode FAQ — OpenAI, Introducing the Realtime API

·         Community insights: Various Reddit discussions on r/CharacterAI and r/OpenAI, like r/CharacterAI on calls, which show real user experiences.

Let me know if you want step-by-step setup for your device or help crafting a custom character prompt to make it feel even more personal.

With Regards,

Hemen Parekh

Related Reading :

 

Extract :

Polish maker Nikodem Bartnik has created a robot head that can answer questions like an ancient Greek philosopher.

The robot has a metal mask with 3D-printed moving eyes, LED lights behind its mouth, and an artificial brain running on Bartnik’s computer. The head responds in real time, creating the feeling of a live conversation that could last for hours.

Bartnik’s idea started with a quote from Steve Jobs about putting Aristotle into a computer. He wanted to see if it was possible to combine a voice, a face, and a brain into one device, all running independently on his own machine.

 

His goal was to create a self-contained system that didn’t rely on cloud services.

Lifelike eyes and a custom mask

The robot’s moving eyes were designed by Will Cogley, who shares animatronic designs online. Six small motors make the eyes wobble naturally as they follow whoever is speaking.

A Raspberry Pi handles the signals to the motors and keeps the robot focused on the user. Bartnik spent just over an hour assembling the eyes and tuning the servos for smooth motion.
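The "natural wobble" described above is typically just a small random offset added to each servo's target angle, with the servo eased a fraction of the way there on every update. A minimal sketch; the gain and jitter values are invented for illustration, not taken from Cogley's or Bartnik's designs:

```python
import random

def eased_step(current: float, target: float, gain: float = 0.2) -> float:
    """Move a servo angle a fraction of the way toward its target,
    which smooths out jerky jumps between positions."""
    return current + gain * (target - current)

def wobble(target: float, jitter: float = 1.5) -> float:
    """Add a small random offset so the gaze never sits perfectly
    still -- the 'lifelike' micro-movement."""
    return target + random.uniform(-jitter, jitter)

# Simulate one eye servo settling on a speaker at 30 degrees.
angle = 0.0
for _ in range(50):
    angle = eased_step(angle, wobble(30.0))
```

Easing toward a jittered target, rather than jumping straight to it, is what keeps six servos from twitching in lockstep.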

For the face, Bartnik used a pre-made 3D-printable mask. He modified the eye sockets to achieve the right look and kept the design thin so the internal components remained hidden. A small cutout behind the mouth allows LEDs to shine through, giving the robot a subtle glow as it speaks.

How the robot talks

The robot listens through a microphone connected to the Raspberry Pi. The audio is converted to text and sent via Wi-Fi to Bartnik’s computer.

 

Using open-source software, the computer processes the text through Google’s ‘Gemma 3’ model, which generates responses based on Greek philosopher logic. The responses are then sent to ElevenLabs, where a lifelike voice brings the robot to life.

Originally, Bartnik planned to display a moving waveform on a small screen, but the screen did not work well. He replaced it with a ring of programmable LEDs controlled by a Raspberry Pi Pico. The lights flash in time with the robot’s speech, enhancing the illusion of a living presence.
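Flashing the LEDs "in time with speech" usually comes down to mapping the short-term loudness of the outgoing audio to LED brightness. A minimal sketch; the frame values and 16-bit full-scale are assumptions, since the article does not give Bartnik's actual mapping:

```python
# Map audio loudness to an LED brightness value, the trick behind
# lights that pulse with speech. Sample values are illustrative.

def rms(frame):
    """Root-mean-square loudness of one audio frame (a list of samples)."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def brightness(frame, full_scale=32768.0):
    """Scale loudness to a 0-255 LED duty value (16-bit audio assumed)."""
    level = min(rms(frame) / full_scale, 1.0)
    return int(level * 255)

quiet = [100, -120, 90, -80]          # near-silence between words
loud = [20000, -21000, 19500, -20500] # a loud syllable
```

On a Pico, the resulting 0-255 value would be written to the programmable LED ring once per audio frame, so the lights rise and fall with the voice.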

Customizable personality and open-source design

Bartnik prefers running the system on his own hardware. This avoids the cost and restrictions of cloud-based services while allowing the robot to answer a wide range of questions. The system can shift from giving calm lectures on philosophy to joking about world domination.

As reported by Techeblog, he also built a simple web interface that lets him switch the robot’s personality on demand. The hardware remains the same, but the instructions change, turning the robot from a Greek scholar into an ordinary grumpy person.

 
