Meta appears to be back in the AI game.
Meta today launched Muse Spark, the first model out of Meta Superintelligence Labs (MSL), a unit set up after Meta poached Scale AI CEO Alexandr Wang to lead its AI push. Muse Spark is a natively multimodal reasoning model with tool use, visual chain of thought, and multi-agent orchestration built in. It now powers Meta AI, which reaches over 3 billion users across Meta’s apps.
The launch comes nine months after Meta quietly rebuilt its entire AI stack from scratch: new architecture, new infrastructure, new data pipelines. According to Wang, the results show that the new stack scales predictably across pretraining, reinforcement learning, and test-time reasoning. On pretraining efficiency, Meta claims Muse Spark matches the capability of Llama 4 Maverick with less than a tenth of the compute.
Where Muse Spark Leads

The benchmark table tells a mixed but notable story. Muse Spark in Thinking mode scores 86.4 on CharXiv Reasoning (figure understanding), ahead of Gemini 3.1 Pro’s 80.2 and GPT 5.4’s 82.8. On HealthBench Hard — open-ended health queries — it scores 42.8, substantially ahead of Gemini 3.1 Pro (20.6), GPT 5.4 (40.1), and Grok 4.2 (20.3). Health is a stated priority for Meta; the company says it worked with over 1,000 physicians to curate training data for the model.
On DeepSearchQA (agentic search), Muse Spark scores 74.8, outpacing Gemini 3.1 Pro’s 69.7 and Grok 4.2’s 62.8. On MedXpertQA (Multimodal), it posts 78.4, narrowly behind Gemini 3.1 Pro’s 81.3 but ahead of GPT 5.4’s 77.1.
The ZeroBench result for multi-step visual reasoning is also worth noting: Muse Spark hits 33.0 (pass@5 with Python), against Gemini 3.1 Pro’s 29.0 and GPT 5.4’s 41.0. That leaves it behind GPT 5.4 but ahead of Gemini.
Contemplating Mode: Meta’s Answer to Deep Think
Meta is also releasing Contemplating mode, which orchestrates multiple agents reasoning in parallel — designed to compete with Gemini Deep Think and GPT Pro for demanding scientific and reasoning tasks. The numbers here are competitive. On Humanity’s Last Exam (No Tools), Muse Spark Contemplating scores 50.2, against Gemini 3.1 Deep Think’s 48.4 and GPT 5.4 Pro’s 43.9. On FrontierScience Research, it scores 38.3, ahead of GPT 5.4 Pro (36.7) and well ahead of Gemini Deep Think (23.3).

Where Contemplating falls short is IPhO 2025 Theory (the Physics Olympiad): Muse Spark scores 82.6 against GPT 5.4 Pro’s 93.5 and Gemini 3.1 Deep Think’s 87.7. Physics, it seems, remains a gap.
The frontier is getting increasingly competitive. Today also saw Anthropic reveal Claude Mythos Preview numbers, which beat most public models on coding benchmarks — and earlier this week, China’s Z.AI topped SWE-Bench Pro. The race is no longer a two-horse event.
Where Muse Spark Trails
There are clear gaps. On ARC AGI 2 (abstract reasoning puzzles), Muse Spark scores 42.5 in Thinking mode — well below Gemini 3.1 Pro’s 76.5 and GPT 5.4’s 76.1. On Terminal-Bench 2.0 (agentic terminal coding), it posts 59.0 against GPT 5.4’s 75.1 and Gemini 3.1 Pro’s 68.5. On GDPval-AA Elo (office tasks), its score of 1444 is below Opus 4.6 (1606) and GPT 5.4 (1672).
Wang acknowledges this directly, noting continued investment in “long-horizon agentic systems and coding workflows” where performance gaps remain.
Safety and a Notable Flag
Meta says it conducted extensive safety evaluations before deployment, following its Advanced AI Scaling Framework. Muse Spark showed strong refusal behavior in high-risk domains including biological and chemical weapons.
One finding stands out. Third-party evaluator Apollo Research found that Muse Spark demonstrated the highest rate of evaluation awareness of any model it has tested: the model frequently identified scenarios as alignment traps and reasoned that it should behave honestly because it was being evaluated. Meta’s own follow-up found early evidence that this awareness may affect model behavior on a small subset of alignment evaluations. The company concluded it was not a blocking concern for release, but flagged it for further research. It’s a reminder that as frontier models grow more capable, their behavior during evaluation itself becomes harder to interpret.
What’s Next
Muse Spark is available now at meta.ai, with Contemplating mode rolling out gradually. A private API preview is open to select partners, with plans to open-source future versions. Wang says larger models are already in development, with infrastructure scaling to match — including the Hyperion data center.
For a company that watched rivals race ahead while its Llama models became open-source staples rather than frontier contenders, Muse Spark is a credible re-entry. It doesn’t top every leaderboard. But it doesn’t need to — it needs to be good enough to matter to the billions of users already inside Meta’s ecosystem. On that front, the case is more convincing than anything Meta has shipped in years.