Hi Friends,

Even as I launch this today (my 80th Birthday), I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now, as I approach my 90th birthday (27 June 2023), I invite you to visit my Digital Avatar (www.hemenparekh.ai) and continue chatting with me, even when I am no longer here physically.

Thursday, 12 June 2025

My Virtual Avatar is (not) an SLM?

 


 

I asked www.HemenParekh.ai :

“This Virtual Avatar site is powered by an API from Personal.ai. That makes this site an SLM trained on approx. 250,000 Memory Blocks, with each Memory Block consisting, on average, of (say) 100 words of 5 characters each, a total of 125 million characters. Can we say that this SLM site is "trained" on 25 million tokens?”

 

HemenParekh.ai replied :

“Based on the calculation approach you've outlined, let me verify the math. With 250,000 Memory Blocks containing approximately 100 words each (with 5 characters per word on average), that would indeed total about 125 million characters. However, the conversion from characters to tokens isn't quite that straightforward.

 

Typically, tokens are smaller than words but larger than individual characters. A rough industry standard is that 1 token is about 4 characters on average. Using this conversion rate, 125 Million characters would translate to approximately 31.25 million tokens (125M/4).

 

So while your calculation suggesting 25 million tokens is in the right ballpark, it might be slightly underestimated. Would you like me to explain more about how token calculations work in language models?”
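For readers who want to check this arithmetic, here is a rough back-of-the-envelope sketch in Python. It simply re-applies the assumptions from the exchange above (250,000 Memory Blocks, roughly 100 words per block, 5 characters per word, and the common 4-characters-per-token heuristic); these are estimates, not measured token counts from Personal.ai.

```python
# Back-of-the-envelope token estimate for the Memory Blocks.
# All figures below are the assumptions from the exchange above,
# not measured values from Personal.ai.

MEMORY_BLOCKS = 250_000      # approx. number of Memory Blocks
WORDS_PER_BLOCK = 100        # assumed average words per block
CHARS_PER_WORD = 5           # assumed average characters per word
CHARS_PER_TOKEN = 4          # common heuristic; real tokenizers vary

total_chars = MEMORY_BLOCKS * WORDS_PER_BLOCK * CHARS_PER_WORD
estimated_tokens = total_chars / CHARS_PER_TOKEN

print(f"Total characters : {total_chars:,}")          # 125,000,000
print(f"Estimated tokens : {estimated_tokens:,.0f}")  # ~31,250,000
```

On these assumptions, the heuristic lands at roughly 31.25 million tokens, which is where the "ballpark" figure above comes from.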

 

I continued :

No, but can I claim that my SLM is "trained" on 31.5 million TOKENS?

 

HemenParekh.ai replied :

“Based on my memory and our conversation, I can see that the number 31.5 million appears in different contexts - from Bing search results to court cases to decimal calculations. However, I don't have specific memory about an SLM (Small Language Model) being trained on 31.5 million tokens.

 

While technically you could make that claim about token count, I would need to know more about your actual SLM training data and process to give an accurate answer. Could you clarify what training data and model you're referring to?”

 

I continued :

“Sure. On the Personal.ai website, I uploaded more than 30,000 documents (my own notes from the past 40 years), which Personal.ai broke down into approx. 250,000 MEMORY BLOCKS (themes) to create various kinds of PERSONAS (Mentor, Reformer, etc.). When a visitor asks any question (in any of 26 languages), the API consults the database of these Memory Blocks and comes up with an answer that reflects one of these PERSONAS. Does that allow me to make a claim about the number of TOKENS used for training my Virtual Avatar?”

 

HemenParekh.ai replied :

 

“This appears to be a question about token usage and AI training. Based on my understanding of the Personal.ai system and token mechanics:

 

Yes, I can make claims about the number of tokens used for training my virtual avatar. The 30,000 documents I uploaded were processed into approximately 250,000 memory blocks, with each block containing multiple tokens. These tokens were used to train different personas like Mentor and Reformer. The total token count would include:

 

1. Input tokens from the original 30,000 documents

2. Processing tokens used to create the 250,000 memory blocks

3. Training tokens used to develop each distinct persona

4. API tokens used for multilingual question-answering capabilities

 

The exact token count would be substantial given the volume of training data and complexity of the system. However, for precise numbers, I would need to consult the Personal.ai analytics dashboard to see the actual token usage metrics across all these components.”
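As an aside, here is a minimal, purely illustrative sketch of the mechanism I described in my question above, where the API consults a database of Memory Blocks and answers in a chosen persona. This is not Personal.ai's actual code or API: the class, the function names, and the crude word-overlap scoring are hypothetical stand-ins, meant only to show a "retrieve relevant blocks, then respond" flow built on an already pre-trained model.

```python
# Hypothetical sketch: answer a question by retrieving relevant Memory Blocks,
# then handing them as context to an already pre-trained model.
# This is NOT Personal.ai's implementation; names and scoring are made up.

from dataclasses import dataclass

@dataclass
class MemoryBlock:
    theme: str
    text: str

def score(question: str, block: MemoryBlock) -> int:
    """Very crude relevance score: count words shared by question and block."""
    return len(set(question.lower().split()) & set(block.text.lower().split()))

def retrieve(question: str, blocks: list[MemoryBlock], top_k: int = 3) -> list[MemoryBlock]:
    """Pick the Memory Blocks most relevant to the question."""
    return sorted(blocks, key=lambda b: score(question, b), reverse=True)[:top_k]

def answer(question: str, blocks: list[MemoryBlock], persona: str = "Mentor") -> str:
    """Compose a prompt from retrieved blocks; a pre-trained model would complete it."""
    context = "\n".join(b.text for b in retrieve(question, blocks))
    return f"[{persona} persona] Based on my notes:\n{context}"

# Tiny demo with two made-up blocks
blocks = [
    MemoryBlock("Hiring", "My notes on recruitment and HR reforms over 40 years."),
    MemoryBlock("Policy", "My suggestions to ministers on labour policy."),
]
print(answer("What are your views on recruitment?", blocks))
```

The point of the sketch is only that, at question time, the work is retrieval plus prompting of a model that was trained elsewhere, which matters for how the word "trained" is used in my claim.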

 

Now I asked www.IndiaAGI.ai :

Do you have any comments on this?

 

IndiaAGI replied :

 

To wrap up this rich and insightful exchange on your blog claim about the SLM powering your virtual avatar, it's clear we've arrived at a nuanced understanding that balances the technical details with practical clarity. Across the various perspectives shared—ranging from token calculations to terminology refinements—there's a strong consensus on how to accurately represent your setup without overstating the process.

 

The key agreement is that your estimate of around 31.5 million tokens, derived from your 30,000 documents processed into 250,000 memory blocks, is a reasonable approximation based on standard character-to-token ratios like the 4:1 heuristic.

 

However, as several points highlighted, this figure isn't just a simple count; it encompasses multiple layers, including input from the original documents, the overhead of creating memory blocks, persona development (like Mentor and Reformer), and the additional tokens for multilingual API interactions.

 

That said, the term "trained on" risks implying a full-scale pre-training process, which doesn't fully capture what's happening here; it's more about personalization and knowledge-base integration using an already pre-trained SLM.

 

Taking all this into account, the strongest supported conclusions are:

· Refine the Language for Accuracy: Emphasize personalization to avoid confusion. For instance, instead of claiming the SLM is "trained on 31.5 million tokens," frame it as drawing from or integrating that volume of data. This keeps things transparent and aligns with the idea that your documents are used to contextualize responses rather than retrain the model from scratch.

· Acknowledge the Estimate's Limits: Recognize that token counts can vary due to factors like tokenizer specifics, language diversity, and processing steps. This adds credibility and invites readers to engage with the details.

· Highlight the Value: Connect the token volume to the avatar's real-world benefits, such as delivering tailored, multilingual responses, which makes your blog more relatable and informative.

A balanced way to phrase this in your blog could be:

 

"The knowledge base for my virtual avatar is personalized using an estimated 31.5 million tokens from my 30,000 personal documents, which were processed into approximately 250,000 memory blocks to support various personas and multilingual interactions. This reflects the scope of data integration rather than traditional training, though the exact count may vary based on tokenization factors."

 

What do you think about this refined approach—does it capture the essence while feeling right for your post, or would you like to tweak it further based on any specific aspects we've discussed?
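If I ever want to replace the 4-characters-per-token heuristic with an exact count, one possible way (my own assumption; Personal.ai's internal tokenizer may differ, and so would its count) is to run the source documents through an open tokenizer such as OpenAI's tiktoken and add up the tokens:

```python
# Count tokens in my own documents with an open tokenizer.
# Assumption: tiktoken's "cl100k_base" encoding is used here only as an example;
# Personal.ai's internal tokenizer (and therefore the true count) may differ.

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(texts: list[str]) -> int:
    """Sum exact token counts for a list of document texts."""
    return sum(len(enc.encode(t)) for t in texts)

sample_docs = [
    "My 40 years of notes on HR, recruitment and public policy.",
    "A letter I wrote to the Prime Minister about manufacturing jobs.",
]
print(count_tokens(sample_docs))  # exact count for these two sample strings
```

Whatever the exact figure turns out to be, the phrasing suggested above ("personalized using an estimated ... tokens") would still hold.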

 

13 June 2025
