Meta is well and truly back in the AI game.
Muse Spark, the first frontier model out of Meta Superintelligence Labs, scores 52 on the Artificial Analysis Intelligence Index v4.0 — placing it fourth among all models benchmarked, behind only Gemini 3.1 Pro Preview (57), GPT-5.4 (57), and Claude Opus 4.6 (53). Meta gave Artificial Analysis early access so it could benchmark the model independently.

For context, Llama 4 Maverick scored just 18 on the same index at launch — as a non-reasoning model. Muse Spark essentially closes the gap to the frontier in a single release.
The model is proprietary — notably, Meta’s first frontier release that is not open weights. There is no public API at the time of writing, though Meta has indicated one is coming. Muse Spark is already integrated into Meta AI, Facebook, Instagram, and Threads.
Token Efficiency: A Standout
One of the more striking findings from Artificial Analysis is how efficiently Muse Spark reaches its intelligence score. It used just 58M output tokens to complete the full Intelligence Index — comparable to Gemini 3.1 Pro Preview (57M), and significantly lower than Claude Opus 4.6 (157M), GPT-5.4 (120M), and GLM-5.1 (110M). On the intelligence-vs-tokens scatter plot, Muse Spark sits squarely in the attractive quadrant: high capability, low token usage.

This matters for API cost and latency at scale. A model that reasons well without burning through tokens is more practical to deploy than one that needs extended thinking chains to hit the same scores.
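As a rough illustration, here is a minimal sketch of what those token counts imply for the cost of a full Intelligence Index run. The $10 per million output tokens rate is a hypothetical placeholder — Muse Spark has no public pricing yet, and real per-token rates differ across providers — but holding the price constant isolates the effect of token volume:

```python
# Rough cost comparison for a full Intelligence Index run.
# PRICE_PER_M_TOKENS is a hypothetical placeholder rate:
# Muse Spark has no public API pricing at the time of writing,
# and actual rates vary by provider.
PRICE_PER_M_TOKENS = 10.00  # USD per 1M output tokens, assumed

output_tokens_m = {
    "Muse Spark": 58,
    "Gemini 3.1 Pro Preview": 57,
    "GLM-5.1": 110,
    "GPT-5.4": 120,
    "Claude Opus 4.6": 157,
}

for model, tokens_m in output_tokens_m.items():
    cost = tokens_m * PRICE_PER_M_TOKENS
    print(f"{model:24s} {tokens_m:4d}M tokens -> ${cost:,.2f}")
```

At these volumes, a Claude Opus 4.6 run would cost nearly three times a Muse Spark run at the same per-token rate — before accounting for the latency of generating those extra tokens.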
Reasoning and Vision
Muse Spark performs strongly on reasoning benchmarks. On Humanity’s Last Exam — one of the hardest publicly available multidisciplinary evaluations — it scores 39.9%, trailing only Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%). On GPQA Diamond (scientific reasoning), it ranks near the top of the field. On CritPt, a physics research benchmark, it scores 11% — fifth highest overall, well above Claude Sonnet 4.6 (3%) and narrowly ahead of Gemini 3 Flash (9%).
On vision, Muse Spark scores 80.5% on MMMU-Pro, making it the second-most capable multimodal model benchmarked — behind only Gemini 3.1 Pro Preview (82.4%). For a company whose products are built around images, video, and visual content, strong vision performance is as much a product requirement as a benchmark win.
On instruction-following (IFBench), Muse Spark sits in the upper tier. On AA-LCR (long-context reasoning), it is competitive with the top models. On AA-Omniscience, it holds its own on knowledge accuracy.
Where It Falls Slightly Short
Agentic performance is the gap. On GDPval-AA — Artificial Analysis’s evaluation of real-world work tasks — Muse Spark posts an Elo of 1,427, behind Claude Sonnet 4.6 (1,648) and GPT-5.4 (1,676), though ahead of Gemini 3.1 Pro Preview (1,320). On Terminal-Bench Hard, it trails Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro Preview. These are the benchmarks that matter most for enterprise agentic use cases — autonomous coding, multi-step task completion, and tool use in production environments.
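To put those ratings in perspective, here is a minimal sketch assuming GDPval-AA behaves like a standard Elo system (an assumption on our part; Artificial Analysis may compute its ratings differently), where a rating gap maps to an expected head-to-head preference rate:

```python
# Expected head-to-head win rate under the standard Elo model.
# Assumes GDPval-AA ratings behave like conventional Elo scores;
# the actual rating methodology may differ.
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

muse_spark = 1427
print(f"vs GPT-5.4 (1676):           {expected_win_rate(muse_spark, 1676):.0%}")  # ~19%
print(f"vs Claude Sonnet 4.6 (1648): {expected_win_rate(muse_spark, 1648):.0%}")  # ~22%
print(f"vs Gemini 3.1 Pro (1320):    {expected_win_rate(muse_spark, 1320):.0%}")  # ~65%
```

Under that assumption, the 249-point deficit against GPT-5.4 means evaluators would prefer GPT-5.4’s output on roughly four out of five work tasks — a meaningful gap, even for a model that ranks fourth on overall intelligence.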
Meta has acknowledged this directly, flagging long-horizon agentic systems and coding workflows as areas of continued investment.
The Bigger Picture
Muse Spark’s debut follows Meta’s reorganisation of its AI efforts under Meta Superintelligence Labs, led by Alexandr Wang. It is the first serious signal that the restructuring has produced something competitive. The frontier is crowded right now — Anthropic revealed Claude Mythos Preview numbers today, and Google’s Gemini 3.1 Pro Preview still tops this index — but fourth place, on an independently run benchmark, with strong token efficiency, is a credible result.
The open question is what comes next. Meta has said larger models are in development, with the Hyperion data center scaling to match. If the gap between Llama 4 Maverick (18) and Muse Spark (52) is any indication of the trajectory, the next release will be worth watching closely.