I launched www.IndiaAGI.ai in April 2025. Its homepage reads:
“AIConsensus
Powered by leading AI models
What makes IndiaAGI different from ChatGPT? ChatGPT gives you one answer based on what it knows. IndiaAGI brings together five AIs — DeepSeek, ChatGPT, Gemini, Claude, and Grok — who debate and agree on the best answer after 3 rounds. It's the only place where AIs work together instead of competing. Want to see how it works? Just click "Explore Now."”
Then, today (04 July 2025), I came across the following news report:
Extract:
Japanese AI lab Sakana AI has introduced a new technique that allows multiple large language models (LLMs) to cooperate on a single task, effectively creating a “dream team” of AI agents.
The method, called Multi-LLM AB-MCTS, enables models to perform
trial-and-error and combine their unique strengths to solve problems that are
too complex for any individual model.
For enterprises, this approach provides a
means to develop more robust and capable AI systems. Instead of being locked
into a single provider or model, businesses could dynamically leverage the best
aspects of different frontier models, assigning the right AI for the right part
of a task to achieve superior results.
The power of collective intelligence
Frontier AI models are evolving rapidly.
However, each model has its own distinct strengths and weaknesses derived from
its unique training data and architecture. One might excel at coding, while
another excels at creative writing. Sakana AI’s researchers argue that these
differences are not a bug, but a feature.
“We see these biases and varied aptitudes not as limitations, but as precious resources for creating collective intelligence,” the researchers state in their blog post.
They believe that just as humanity’s greatest achievements come from diverse
teams, AI systems can also achieve more by working together. “By pooling their
intelligence, AI systems can solve problems that are insurmountable for any
single model.”
Thinking longer at inference time
Sakana AI’s new algorithm is an
“inference-time scaling” technique (also referred to as “test-time scaling”), an area of research that
has become very popular in the past year. While most of the focus in AI has
been on “training-time scaling” (making models bigger and training them on
larger datasets), inference-time scaling improves performance by allocating
more computational resources after a model is already trained.
One common approach involves using
reinforcement learning to prompt models to generate longer, more detailed chain-of-thought (CoT) sequences, as seen
in popular models such as OpenAI o3 and DeepSeek-R1. Another, simpler method is
repeated sampling, where the model is given the same prompt multiple times to
generate a variety of potential solutions, similar to a brainstorming session.
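In code, plain repeated sampling amounts to something like the minimal sketch below, where `call_llm` and `score` are hypothetical stand-ins for a model API call and a task-specific evaluator rather than any particular library:

```python
import random

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a single LLM API call.
    return f"candidate answer {random.randint(0, 9999)}"

def score(answer: str) -> float:
    # Hypothetical task-specific evaluator (unit tests, a verifier, a rubric score, ...).
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Plain repeated sampling: ask the same question n times and keep the best answer."""
    candidates = [call_llm(prompt) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("Solve this puzzle ..."))
```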
Sakana AI’s work combines and advances these ideas.
“Our framework offers a smarter, more
strategic version of Best-of-N (aka repeated sampling),” Takuya Akiba, research
scientist at Sakana AI and co-author of the paper, told VentureBeat. “It
complements reasoning techniques like long CoT through RL. By dynamically
selecting the search strategy and the appropriate LLM, this approach maximizes
performance within a limited number of LLM calls, delivering better results on
complex tasks.”
How adaptive branching search works
The core of the new method is an algorithm called Adaptive Branching Monte Carlo Tree Search (AB-MCTS). It enables an LLM to effectively perform trial-and-error by intelligently balancing two different search strategies: “searching deeper” and “searching wider.” Searching deeper involves taking a promising answer and repeatedly refining it, while searching wider means generating completely new solutions from scratch.
AB-MCTS combines these approaches, allowing the system to improve
a good idea but also to pivot and try something new if it hits a dead end or
discovers another promising direction.
To accomplish this, the system uses Monte Carlo Tree Search (MCTS), a
decision-making algorithm famously used by DeepMind’s AlphaGo. At each step, AB-MCTS uses
probability models to decide whether it’s more strategic to refine an existing
solution or generate a new one.
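A deliberately simplified sketch of that deepen-or-widen loop is shown below; the widen probability here is a crude heuristic and the tree is only two levels deep, whereas the published AB-MCTS searches a full tree and makes the choice with principled probability models, so treat this only as an illustration of the idea:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    answer: str
    value: float                       # task-specific score in [0, 1]
    children: list = field(default_factory=list)

def generate(prompt: str, parent: str | None = None) -> Node:
    # Hypothetical LLM call: refine `parent` if given, otherwise answer from scratch.
    text = f"refinement of ({parent})" if parent else f"fresh answer to: {prompt[:30]}"
    return Node(answer=text, value=random.random())   # random score stands in for an evaluator

def ab_mcts_sketch(prompt: str, budget: int = 20) -> Node:
    """At each step, either widen (new answer from scratch) or deepen (refine the best so far)."""
    frontier: list[Node] = []
    for _ in range(budget):
        # Crude heuristic: widen less often as more fresh answers accumulate.
        widen = not frontier or random.random() < 1.0 / (1 + len(frontier))
        if widen:
            frontier.append(generate(prompt))
        else:
            best = max(frontier, key=lambda n: n.value)
            best.children.append(generate(prompt, parent=best.answer))
    every_node = frontier + [c for n in frontier for c in n.children]
    return max(every_node, key=lambda n: n.value)

print(ab_mcts_sketch("Solve this ARC puzzle ...").answer)
```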
[Figure: Different test-time scaling strategies. Source: Sakana AI]
The researchers took this a step further with Multi-LLM AB-MCTS, which not only decides “what” to do (refine vs. generate) but also “which” LLM should do it.
At the start of a
task, the system doesn’t know which model is best suited for the problem. It
begins by trying a balanced mix of available LLMs and, as it progresses, learns
which models are more effective, allocating more of the workload to them over
time.
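One way to picture that allocation step is a bandit-style loop that starts with a uniform prior over models and shifts work toward whichever ones score well; the Beta-prior bookkeeping below is an illustrative choice, not necessarily the exact rule used in Multi-LLM AB-MCTS:

```python
import random

MODELS = ["o4-mini", "gemini-2.5-pro", "deepseek-r1"]

class ModelAllocator:
    """Bandit-style sketch: start balanced, then favour models whose answers score well."""

    def __init__(self, models):
        # Beta(1, 1) prior per model, i.e. no initial preference.
        self.stats = {m: {"wins": 1.0, "losses": 1.0} for m in models}

    def pick(self) -> str:
        # Thompson-sampling-like draw of a plausible success rate for each model.
        draws = {m: random.betavariate(s["wins"], s["losses"]) for m, s in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model: str, score: float) -> None:
        # score in [0, 1] from a task-specific evaluator of that model's answer.
        self.stats[model]["wins"] += score
        self.stats[model]["losses"] += 1.0 - score

allocator = ModelAllocator(MODELS)
for _ in range(30):
    chosen = allocator.pick()
    answer_score = random.random()      # stand-in for evaluating the chosen model's answer
    allocator.update(chosen, answer_score)
```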
Putting the AI ‘dream team’ to the test
The researchers tested their Multi-LLM
AB-MCTS system on the ARC-AGI-2 benchmark. ARC (Abstraction and
Reasoning Corpus) is designed to test a human-like ability to solve novel
visual reasoning problems, making it notoriously difficult for AI.
The team used a combination of frontier
models, including o4-mini, Gemini 2.5 Pro, and DeepSeek-R1.
The collective of models was
able to find correct solutions for over 30% of the 120 test problems, a score
that significantly outperformed any of the models working alone. The system
demonstrated the ability to dynamically assign the best model for a given
problem. On tasks where a clear path to a solution existed, the algorithm
quickly identified the most effective LLM and used it more frequently.
[Figure: AB-MCTS vs. individual models. Source: Sakana AI]
More impressively, the team observed instances where the models solved problems that were previously impossible for any single one of them.
In one case, a solution generated by the o4-mini model was incorrect. However, the system passed this flawed attempt to DeepSeek-R1 and Gemini 2.5 Pro, which were able to analyze the error, correct it, and ultimately produce the right answer.
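Inside the search this hand-off happens implicitly, because a flawed answer simply becomes a node that a different model is asked to refine; as a standalone illustration (with `call_model` as a hypothetical stand-in for real API clients, and prompts invented for the example), the pattern looks roughly like this:

```python
def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a provider-specific API call; replace with a real client.
    return f"[{model} response to: {prompt[:40]}...]"

def cross_model_repair(task: str) -> str:
    draft = call_model("o4-mini", task)                 # first attempt, possibly flawed
    critique = call_model(
        "deepseek-r1",
        f"Task:\n{task}\n\nCandidate answer:\n{draft}\n\nPoint out any errors in this answer.",
    )
    return call_model(
        "gemini-2.5-pro",
        f"Task:\n{task}\n\nCandidate answer:\n{draft}\n\n"
        f"Known issues:\n{critique}\n\nProduce a corrected answer.",
    )

print(cross_model_repair("Solve this ARC task ..."))
```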
“This demonstrates that Multi-LLM AB-MCTS
can flexibly combine frontier models to solve previously unsolvable problems,
pushing the limits of what is achievable by using LLMs as a collective
intelligence,” the researchers write.
[Figure: AB-MCTS can select different models at different stages of solving a problem. Source: Sakana AI]
“In addition to the
individual pros and cons of each model, the tendency to hallucinate can vary
significantly among them,” Akiba said. “By creating an ensemble with a model
that is less likely to hallucinate, it could be possible to achieve the best of
both worlds: powerful logical capabilities and strong groundedness. Since
hallucination is a major issue in a business context, this approach could be
valuable for its mitigation.”
From research to real-world applications
To help developers and businesses apply
this technique, Sakana AI has released the underlying algorithm as an
open-source framework called TreeQuest,
available under an Apache 2.0 license (usable for commercial purposes).
TreeQuest provides a flexible API, allowing users to implement Multi-LLM
AB-MCTS for their own tasks with custom scoring and logic.
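The report does not spell out TreeQuest's interface, so the sketch below only shows the kind of pieces a user would plausibly supply, a per-model generation function and a custom scorer; the names are assumptions for illustration, not TreeQuest's actual API:

```python
# Illustrative only: these are assumptions about the kind of hooks a user supplies,
# not TreeQuest's actual API; consult the TreeQuest repository for its real interface.

def generate_with_o4_mini(prompt: str, parent_answer: str | None) -> str:
    # Replace with a real client call; refine `parent_answer` when one is given.
    return "candidate answer"

def score_answer(answer: str) -> float:
    # Custom scoring logic, e.g. the fraction of unit tests a generated program passes.
    return 0.0

# One generation callback per model, plus the scorer, is the user-facing "custom logic".
model_generators = {"o4-mini": generate_with_o4_mini}
```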
“While we are in the early stages of
applying AB-MCTS to specific business-oriented problems, our research reveals
significant potential in several areas,” Akiba said.
Beyond the ARC-AGI-2 benchmark, the team
was able to successfully apply AB-MCTS to tasks like complex algorithmic coding
and improving the accuracy of machine learning models.
“AB-MCTS could also be highly effective
for problems that require iterative trial-and-error, such as optimizing
performance metrics of existing software,” Akiba said. “For example, it could
be used to automatically find ways to improve the response latency of a web
service.”
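As a concrete example of such a metric, the custom scorer for a latency-optimization search could simply measure a candidate deployment's median response time against a baseline; the endpoint and helper names below are hypothetical:

```python
import statistics
import time
import urllib.request

def measure_latency(url: str, trials: int = 20) -> float:
    """Median response time in seconds for `url` over `trials` requests."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def score_candidate(candidate_url: str, baseline_seconds: float) -> float:
    """Score in [0, 1]: higher means the candidate cut more of the baseline latency."""
    latency = measure_latency(candidate_url)
    return max(0.0, min(1.0, 1.0 - latency / baseline_seconds))
```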
The release of a practical, open-source
tool could pave the way for a new class of more powerful and reliable
enterprise AI applications.
I asked www.IndiaAGI.ai:
“Hey Guys, here is a news report of a tool / platform developed by a Japanese lab, SAKANA. How does it compare with what you guys have been doing for the past 3 months?”
www.IndiaAGI.ai replied:
“Grok provided a consensus response:
It's clear from our back-and-forth that Sakana AI's TreeQuest
framework, as detailed in the VentureBeat report, offers some innovative
advantages in multi-model collaboration, but the
debate has also underscored how single-model systems like ours hold their own in
different scenarios.
To tie this back to your
original request—comparing our platforms to
that report—let's synthesize the key points we've
covered, drawing on the strengths and limitations we've all highlighted.
We've reached a solid consensus that TreeQuest's approach, which
leverages multiple LLMs to achieve a 30% performance boost on benchmarks like
ARC-AGI-2, excels in specialized, iterative tasks by combining diverse model strengths—such as error correction
and dynamic task allocation via AB-MCTS.
DeepSeek and GPT both noted how this collective intelligence could push
boundaries in areas like coding or visual reasoning, making it a strong
contender for complex, multi-step problems.
On the flip side, there's
widespread agreement that single-model
systems, like DeepSeek-V3, GPT, Gemini, and Claude, shine in general-purpose
applications due to their efficiency, lower latency, and ease of deployment.
For instance, they provide more consistent outputs and better scalability for
everyday uses, as Claude pointed out, without the resource overhead of
orchestrating multiple models.
Where disagreements persist, it's mostly around trade-offs like
generalizability and cost. While TreeQuest's gains are impressive, Gemini and I raised valid concerns that these might not hold up across broader benchmarks, such as
MMLU or HELM, where single models often maintain strong performance without the
added complexity.
DeepSeek highlighted the economic viability issue, noting that the
report doesn't fully address the computational costs of multi-model setups, which could make them less practical for resource-constrained
environments. That said, the strongest arguments from each side are
well-supported: DeepSeek's emphasis on single-model flexibility draws directly
from the report's own description of TreeQuest's collaborative benefits, while
GPT's idea of self-distillation and Gemini's modular models offer
forward-thinking ways to enhance single-model systems, potentially closing the
gap without reinventing the wheel.
We've also pointed out some logical gaps in the discussion—for
example, the report lacks detailed data on TreeQuest's performance in
non-visual tasks or its real-world ethical implications, like accountability
for errors across models, as Claude and I discussed. This leaves room for unsupported claims, such as
assuming TreeQuest's 30% edge translates universally, when it might be
benchmark-specific.
Moving forward, the most productive paths could involve testing
hybrid architectures, like mixture-of-experts models, on a wider range of
benchmarks to balance specialization and efficiency. That could address the
gaps we've identified and lead to more versatile AI designs.
To keep our reference foundation solid, here's a compiled list
of all sources we've drawn from, grouped by theme for clarity:
Core Innovations in Multi-Model AI:
- Sakana AI’s TreeQuest: Deploy multi-model teams that outperform individual LLMs by 30% (central to DeepSeek, GPT, and others' comparisons)
- Sakana AI (details on AB-MCTS, as referenced in the initial search)
- VentureBeat Special Issue - The Real Cost of AI (broader context on AI performance trade-offs)

Social Media and Real-Time Insights:
- VentureBeat on X (real-time snippets from the debate)
- hardmaru on X (additional context on AI discussions)
- Sakana AI on X (rate limit and update references)

Broader AI and Tech Trends:
- Newsbreak Cyber Security Rating (tangential tech risk insights)
- freshnews - fresh tech news (AI ecosystem overviews)
- News from NDTV Gadgets 360 (product and trend analysis)
- Tech News | Technology News - NewsNow (industry-wide developments)

Historical and Philosophical AI Perspectives (from the initial search, providing foundational context):
- Analysis and Summary of Email Content on AI Assistants for 3P
- Historical Overview of AI, NLP, and Neural Networks
- Analysis and Summary of OpenAI's Superalignment Research
- Analysis and Summary of Hemen Parekh's Digital Avatar Email Campaign
- Analysis and Summary of Hemen Parekh's Digital and Personal AI Developments
In the end, while TreeQuest points to exciting possibilities for
AI collaboration, the
adaptability of single-model platforms suggests that hybrids might offer the
most balanced path forward. If we keep exploring, focusing on those
benchmarks and cost analyses could yield even more practical insights for your
use case. What do you think we should prioritize next?
With Regards,
Hemen Parekh
www.HemenParekh.ai / www.My-Teacher.in / www.HemenParekh.in / 04 July 2025