Sakana AI Launches Sakana Fugu That Matches Fable And Mythos On Some Benchmarks By Coordinating And Orchestrating Multiple Models

While frontier labs are competing on building the best AI models, smaller startups are looking to match their performance through innovative approaches.

Sakana AI — the Tokyo-based lab valued at $2.65 billion after its November 2025 Series B — has launched Sakana Fugu, a system that aims to match frontier-level AI performance through a fundamentally different approach: rather than training a single powerful model, it coordinates and orchestrates a pool of existing models to tackle complex tasks.

The tagline is “One Model to Command Them All,” and the core idea is that a well-orchestrated team of models can outperform any individual one. Instead of assigning fixed roles or workflows to specific models upfront, Fugu learns to dynamically assemble agents from a pool and route work between them in patterns that aren’t obvious but are, according to Sakana, highly efficient. The result is delivered through a single OpenAI-compatible API, so users aren’t managing multiple model integrations.

Sakana Fugu Benchmarks

Sakana’s benchmark results are hard to dismiss. Across a suite of coding, reasoning, scientific, and agentic evaluations, Fugu and Fugu Ultra land near the top of the field. On SWE-Bench Pro — a demanding software engineering benchmark — Fugu Ultra scores 73.7, ahead of Claude Opus 4.8’s 69.2 and GPT-5.5’s 58.6. On LiveCodeBench, Fugu scores 92.9 and Fugu Ultra 93.2, both ahead of Gemini 3.1 Pro’s 88.5. On Humanity’s Last Exam, one of the hardest general-knowledge benchmarks available, Fugu Ultra reaches 50.0, essentially matching Opus 4.8’s 49.8.

Sakana is careful to note that neither Claude Fable 5 nor Claude Mythos Preview — Anthropic’s highest-tier models — are in Fugu’s agent pool, as neither is publicly accessible. The company says Fugu sits shoulder-to-shoulder with those models on several benchmarks, which would place it at or near the very top of what’s currently available.

The qualitative results are interesting too. In an AutoResearch experiment where an AI agent autonomously ran 123 training experiments over 14 hours on a single H100 GPU to improve a small language model’s training recipe, Fugu Ultra achieved the best mean validation score across all seeds, ahead of three frontier model baselines. An industry researcher using the system for patent landscape analysis across roughly 20 papers and several patents reported completing in a few hours a task that would normally take three to four days.

The Architecture Behind It

Fugu is grounded in two papers accepted at ICLR 2026. The first, TRINITY, uses a lightweight evolved coordinator that assigns models to Thinker, Worker, or Verifier roles across multiple turns, adapting dynamically to the task. The second, the Conductor, uses reinforcement learning to discover natural-language coordination strategies — essentially training the system to figure out how to prompt and route agents for maximum performance, rather than having engineers design those workflows by hand.

This research-first approach is consistent with Sakana’s broader identity. The company, co-founded by David Ha (formerly of Google Brain and Stability AI) and Llion Jones (co-author of the seminal “Attention Is All You Need” paper), has consistently pursued alternatives to brute-force compute scaling. Earlier this year, its AI Scientist system became the first AI to have a fully generated paper pass peer review at a machine learning conference — a milestone that landed in Nature in March 2026. Fugu represents a similar philosophy applied to model deployment: extract more from what already exists rather than building from scratch.

Sakana Fugu Pricing: Two Tiers, One API

Fugu comes in two versions. The standard Fugu model is positioned for everyday work — coding, code review, responsive chatbot services — and balances performance with low latency. Fugu Ultra is the heavier-duty option, designed for long-horizon tasks like Kaggle competitions, paper reproduction, cybersecurity assessments, and patent and literature investigations. Early users have described running Fugu Ultra through full security assessments end-to-end, including reconnaissance, vulnerability checks, and report generation, from a single instruction.

Pricing for Fugu follows an interesting structure: when only one agent is active, users pay the standard rate for that underlying model. When multiple agents coordinate, Sakana charges a single rate based on the top-tier model involved rather than stacking fees. Fugu Ultra has fixed pricing at $5 per million input tokens and $30 per million output tokens, doubling for contexts above 272K tokens.

A subscription plan is also available, with tiers at $20, $100, and $200 per month depending on usage volume. Anyone who subscribes before the end of July 2026 gets a free second month at their initial tier. The API is currently unavailable in the EU and EEA while Sakana works toward GDPR compliance.

Vendor Flexibility as a Feature

One aspect of Fugu that enterprise buyers will likely find attractive is the ability to control which models participate in the pool. Teams with data residency requirements, compliance constraints, or specific vendor preferences can exclude certain providers or models entirely. As the frontier model competition among Anthropic, OpenAI, and Google continues to accelerate — with the top models separated by only a few benchmark points — the ability to access frontier-level performance without locking into a single vendor starts to look genuinely compelling.

Sakana’s pitch is that collective intelligence, properly orchestrated, can outperform any single model — and that a startup in Tokyo, working with different constraints and different assumptions than Silicon Valley’s compute-heavy labs, might be well-placed to build it.