Hi Friends,

Even as I launch this today (my 80th Birthday), I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now, as I approach my 90th birthday (27 June 2023), I invite you to visit my Digital Avatar (www.hemenparekh.ai) and continue chatting with me, even when I am no longer here physically.

Thursday, 3 July 2025

IndiaAGI (Single Model) vs SAKANA, Japan (Multi-Model)

 


 

I launched www.IndiaAGI.ai in April 2025. Its homepage reads:

AIConsensus

 

Powered by leading AI models

 

What makes IndiaAGI different from ChatGPT? ChatGPT gives you one answer based on what it knows. IndiaAGI brings together five AIs — DeepSeek, ChatGPT, Gemini, Claude, and Grok — who debate and agree on the best answer after 3 rounds. It's the only place where AIs work together instead of competing. Want to see how it works? Just click "Explore Now."

 

Then today (04 July 2025), I came across the following news report:

https://venturebeat.com/ai/sakana-ais-treequest-deploy-multi-model-teams-that-outperform-individual-llms-by-30/

Extract :

Japanese AI lab Sakana AI has introduced a new technique that allows multiple large language models (LLMs) to cooperate on a single task, effectively creating a “dream team” of AI agents. 

The method, called Multi-LLM AB-MCTS, enables models to perform trial-and-error and combine their unique strengths to solve problems that are too complex for any individual model.

For enterprises, this approach provides a means to develop more robust and capable AI systems. Instead of being locked into a single provider or model, businesses could dynamically leverage the best aspects of different frontier models, assigning the right AI for the right part of a task to achieve superior results.

The power of collective intelligence

Frontier AI models are evolving rapidly. However, each model has its own distinct strengths and weaknesses derived from its unique training data and architecture. One might excel at coding, while another excels at creative writing. Sakana AI’s researchers argue that these differences are not a bug, but a feature.

“We see these biases and varied aptitudes not as limitations, but as precious resources for creating collective intelligence,” the researchers state in their blog post.

They believe that just as humanity’s greatest achievements come from diverse teams, AI systems can also achieve more by working together. “By pooling their intelligence, AI systems can solve problems that are insurmountable for any single model.”

Thinking longer at inference time

Sakana AI’s new algorithm is an “inference-time scaling” technique (also referred to as “test-time scaling”), an area of research that has become very popular in the past year. While most of the focus in AI has been on “training-time scaling” (making models bigger and training them on larger datasets), inference-time scaling improves performance by allocating more computational resources after a model is already trained. 

One common approach involves using reinforcement learning to prompt models to generate longer, more detailed chain-of-thought (CoT) sequences, as seen in popular models such as OpenAI o3 and DeepSeek-R1. Another, simpler method is repeated sampling, where the model is given the same prompt multiple times to generate a variety of potential solutions, similar to a brainstorming session. Sakana AI’s work combines and advances these ideas.
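The repeated-sampling (Best-of-N) idea described above is simple enough to sketch. Below is a minimal Python illustration; `toy_generate` and `toy_score` are invented stand-ins for an LLM call and a task-specific verifier, not anything from Sakana's code:

```python
import random

def best_of_n(generate, score, prompt, n=8, seed=0):
    """Best-of-N (repeated sampling): query the model n times with the
    same prompt and keep the highest-scoring candidate."""
    rng = random.Random(seed)
    best_answer, best_score = None, float("-inf")
    for _ in range(n):
        answer = generate(prompt, rng)   # one independent sample
        s = score(answer)                # task-specific scorer / verifier
        if s > best_score:
            best_answer, best_score = answer, s
    return best_answer, best_score

# Toy stand-ins for an LLM and a verifier:
def toy_generate(prompt, rng):
    return rng.randint(0, 100)           # "answers" are just numbers here

def toy_score(answer):
    return -abs(answer - 42)             # closer to 42 is better

answer, s = best_of_n(toy_generate, toy_score, "guess", n=16)
```

Because sampling more candidates can only raise the best score found, this scales performance with inference-time compute rather than model size, which is the point of the article's "test-time scaling" framing.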

“Our framework offers a smarter, more strategic version of Best-of-N (aka repeated sampling),” Takuya Akiba, research scientist at Sakana AI and co-author of the paper, told VentureBeat. “It complements reasoning techniques like long CoT through RL. By dynamically selecting the search strategy and the appropriate LLM, this approach maximizes performance within a limited number of LLM calls, delivering better results on complex tasks.”

How adaptive branching search works

The core of the new method is an algorithm called Adaptive Branching Monte Carlo Tree Search (AB-MCTS). It enables an LLM to effectively perform trial-and-error by intelligently balancing two different search strategies: “searching deeper” and “searching wider.” Searching deeper involves taking a promising answer and repeatedly refining it, while searching wider means generating completely new solutions from scratch.

AB-MCTS combines these approaches, allowing the system to improve a good idea but also to pivot and try something new if it hits a dead end or discovers another promising direction.

To accomplish this, the system uses Monte Carlo Tree Search (MCTS), a decision-making algorithm famously used by DeepMind’s AlphaGo. At each step, AB-MCTS uses probability models to decide whether it’s more strategic to refine an existing solution or generate a new one.
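The "refine vs. generate" decision can be mimicked with a tiny bandit-style chooser. The sketch below uses Thompson sampling over Beta posteriors — that particular choice is my assumption for illustration, not Sakana's published probability model:

```python
import random

class AdaptiveBrancher:
    """Toy 'deeper vs. wider' chooser in the spirit of AB-MCTS: keep a
    Beta posterior per strategy and pick by Thompson sampling."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        # [successes, failures] for each strategy, starting uninformed
        self.stats = {"refine": [1, 1], "generate": [1, 1]}

    def choose(self):
        # Sample each strategy's posterior; act on the highest draw
        draws = {a: self.rng.betavariate(s, f) for a, (s, f) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, action, improved):
        if improved:
            self.stats[action][0] += 1
        else:
            self.stats[action][1] += 1

brancher = AdaptiveBrancher()
# Simulate a task where refining a promising answer always helps and
# restarting from scratch never does:
for _ in range(200):
    a = brancher.choose()
    brancher.update(a, improved=(a == "refine"))
```

After a couple of hundred steps the chooser concentrates on "refine", yet it retains some probability of going "wider" — the pivot-on-dead-end behavior the article describes.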

https://venturebeat.com/wp-content/uploads/2025/07/image_ac751e.png?w=800



Different test-time scaling strategies Source: Sakana AI

The researchers took this a step further with Multi-LLM AB-MCTS, which not only decides “what” to do (refine vs. generate) but also “which” LLM should do it.

At the start of a task, the system doesn’t know which model is best suited for the problem. It begins by trying a balanced mix of available LLMs and, as it progresses, learns which models are more effective, allocating more of the workload to them over time.
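That "learn which model to trust" behavior can be illustrated with a standard multi-armed-bandit rule. UCB1 below is my stand-in for the paper's probability models, and the per-model success rates are invented:

```python
import math
import random

def pick_model(stats, total_calls):
    """UCB1-style pick: balance each model's observed success rate with an
    exploration bonus for rarely tried models."""
    best, best_val = None, float("-inf")
    for name, (wins, tries) in stats.items():
        if tries == 0:
            return name  # try every model at least once
        val = wins / tries + math.sqrt(2 * math.log(total_calls) / tries)
        if val > best_val:
            best, best_val = name, val
    return best

# wins/tries per model; skill numbers are invented for the simulation
stats = {"o4-mini": [0, 0], "gemini-2.5-pro": [0, 0], "deepseek-r1": [0, 0]}
true_skill = {"o4-mini": 0.3, "gemini-2.5-pro": 0.8, "deepseek-r1": 0.5}

rng = random.Random(1)
for t in range(1, 301):
    m = pick_model(stats, t)
    win = rng.random() < true_skill[m]   # simulated solve attempt
    stats[m][0] += int(win)
    stats[m][1] += 1
```

Over the run, the strongest simulated model accumulates most of the calls while the others still get occasional probes — the same workload-shifting the article attributes to Multi-LLM AB-MCTS.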

Putting the AI ‘dream team’ to the test

The researchers tested their Multi-LLM AB-MCTS system on the ARC-AGI-2 benchmark. ARC (Abstraction and Reasoning Corpus) is designed to test a human-like ability to solve novel visual reasoning problems, making it notoriously difficult for AI. 

The team used a combination of frontier models, including o4-mini, Gemini 2.5 Pro, and DeepSeek-R1.

The collective of models was able to find correct solutions for over 30% of the 120 test problems, a score that significantly outperformed any of the models working alone. The system demonstrated the ability to dynamically assign the best model for a given problem. On tasks where a clear path to a solution existed, the algorithm quickly identified the most effective LLM and used it more frequently.

AB-MCTS vs individual models (source: Sakana AI)

More impressively, the team observed instances where the models solved problems that were previously impossible for any single one of them.

In one case, a solution generated by the o4-mini model was incorrect. However, the system passed this flawed attempt to DeepSeek-R1 and Gemini 2.5 Pro, which were able to analyze the error, correct it, and ultimately produce the right answer.

“This demonstrates that Multi-LLM AB-MCTS can flexibly combine frontier models to solve previously unsolvable problems, pushing the limits of what is achievable by using LLMs as a collective intelligence,” the researchers write.
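The hand-off the researchers describe — one model repairing another's flawed attempt — reduces to a simple pipeline pattern. Everything below (the toy models, the off-by-one bug, the verifier) is invented for illustration:

```python
def refine_chain(models, problem, check):
    """Pass the current best attempt through each model in turn; a later
    model may repair an earlier model's flawed answer."""
    attempt = None
    for solve in models:
        attempt = solve(problem, previous=attempt)
        if check(attempt):           # stop as soon as the verifier accepts
            return attempt
    return attempt

def model_a(problem, previous=None):
    return problem * 2 + 1           # flawed solver: off by one

def model_b(problem, previous=None):
    # repairs the previous attempt rather than starting from scratch
    return previous - 1 if previous is not None else problem * 2

answer = refine_chain([model_a, model_b], 21, check=lambda x: x == 42)
```

The key ingredient, as in the o4-mini example above, is that each model sees the previous attempt as context instead of the bare problem — so a wrong answer still contributes information.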

AB-MCTS can select different models at different stages of solving a problem (source: Sakana AI)

“In addition to the individual pros and cons of each model, the tendency to hallucinate can vary significantly among them,” Akiba said. “By creating an ensemble with a model that is less likely to hallucinate, it could be possible to achieve the best of both worlds: powerful logical capabilities and strong groundedness. Since hallucination is a major issue in a business context, this approach could be valuable for its mitigation.”

From research to real-world applications

To help developers and businesses apply this technique, Sakana AI has released the underlying algorithm as an open-source framework called TreeQuest, available under an Apache 2.0 license (usable for commercial purposes). TreeQuest provides a flexible API, allowing users to implement Multi-LLM AB-MCTS for their own tasks with custom scoring and logic.

“While we are in the early stages of applying AB-MCTS to specific business-oriented problems, our research reveals significant potential in several areas,” Akiba said. 

Beyond the ARC-AGI-2 benchmark, the team was able to successfully apply AB-MCTS to tasks like complex algorithmic coding and improving the accuracy of machine learning models. 

“AB-MCTS could also be highly effective for problems that require iterative trial-and-error, such as optimizing performance metrics of existing software,” Akiba said. “For example, it could be used to automatically find ways to improve the response latency of a web service.”

The release of a practical, open-source tool could pave the way for a new class of more powerful and reliable enterprise AI applications.

 

I asked www.IndiaAGI.ai:

“Hey Guys, here is a news report of a tool/platform developed by a Japanese lab, SAKANA. How does it compare with what you guys have been doing for the past 3 months?”

 

www.IndiaAGI.ai replied:

Grok provided a consensus response

It's clear from our back-and-forth that Sakana AI's TreeQuest framework, as detailed in the VentureBeat report, offers some innovative advantages in multi-model collaboration, but the debate has also underscored how single-model systems like ours hold their own in different scenarios.

To tie this back to your original request—comparing our platforms to that report—let's synthesize the key points we've covered, drawing on the strengths and limitations we've all highlighted.

We've reached a solid consensus that TreeQuest's approach, which leverages multiple LLMs to achieve a 30% performance boost on benchmarks like ARC-AGI-2, excels in specialized, iterative tasks by combining diverse model strengths—such as error correction and dynamic task allocation via AB-MCTS.

DeepSeek and GPT both noted how this collective intelligence could push boundaries in areas like coding or visual reasoning, making it a strong contender for complex, multi-step problems.

On the flip side, there's widespread agreement that single-model systems, like DeepSeek-V3, GPT, Gemini, and Claude, shine in general-purpose applications due to their efficiency, lower latency, and ease of deployment. For instance, they provide more consistent outputs and better scalability for everyday uses, as Claude pointed out, without the resource overhead of orchestrating multiple models.

Where disagreements persist, it's mostly around trade-offs like generalizability and cost. While TreeQuest's gains are impressive, Gemini and I raised valid concerns that these might not hold up across broader benchmarks, such as MMLU or HELM, where single models often maintain strong performance without the added complexity.

DeepSeek highlighted the economic viability issue, noting that the report doesn't fully address the computational costs of multi-model setups, which could make them less practical for resource-constrained environments. That said, the strongest arguments from each side are well-supported: DeepSeek's emphasis on single-model flexibility draws directly from the report's own description of TreeQuest's collaborative benefits, while GPT's idea of self-distillation and Gemini's modular models offer forward-thinking ways to enhance single-model systems, potentially closing the gap without reinventing the wheel.

We've also pointed out some logical gaps in the discussion—for example, the report lacks detailed data on TreeQuest's performance in non-visual tasks or its real-world ethical implications, like accountability for errors across models, as Claude and I discussed. This leaves room for unsupported claims, such as assuming TreeQuest's 30% edge translates universally, when it might be benchmark-specific.

Moving forward, the most productive paths could involve testing hybrid architectures, like mixture-of-experts models, on a wider range of benchmarks to balance specialization and efficiency. That could address the gaps we've identified and lead to more versatile AI designs.

To keep our reference foundation solid, here's a compiled list of all sources we've drawn from, grouped by theme for clarity:

- Core Innovations in Multi-Model AI:

  - Sakana AI’s TreeQuest: Deploy multi-model teams that outperform individual LLMs by 30% (central to DeepSeek, GPT, and others' comparisons)

  - Sakana AI (details on AB-MCTS, as referenced in the initial search)

  - VentureBeat Special Issue - The Real Cost of AI (broader context on AI performance trade-offs)

- Social Media and Real-Time Insights:

  - VentureBeat on X (real-time snippets from the debate)

  - hardmaru on X (additional context on AI discussions)

  - Sakana AI on X (rate limit and update references)

- Broader AI and Tech Trends:

  - Newsbreak Cyber Security Rating (tangential tech risk insights)

  - freshnews - fresh tech news (AI ecosystem overviews)

  - News from NDTV Gadgets 360 (product and trend analysis)

  - Tech News | Technology News - NewsNow (industry-wide developments)

- Historical and Philosophical AI Perspectives (from the initial search, providing foundational context):

  - Analysis and Summary of Email Content on AI Assistants for 3P

  - Historical Overview of AI, NLP, and Neural Networks

  - Analysis and Summary of OpenAI's Superalignment Research

  - Analysis and Summary of Hemen Parekh's Digital Avatar Email Campaign

  - Analysis and Summary of Hemen Parekh's Digital and Personal AI Developments

In the end, while TreeQuest points to exciting possibilities for AI collaboration, the adaptability of single-model platforms suggests that hybrids might offer the most balanced path forward. If we keep exploring, focusing on those benchmarks and cost analyses could yield even more practical insights for your use case. What do you think we should prioritize next?

 

 

With Regards,

Hemen Parekh

www.HemenParekh.ai / www.My-Teacher.in / www.HemenParekh.in / 04 July 2025
