xAI has weathered significant churn in recent months, with nearly all of its twelve co-founders leaving the company, yet it has shipped a reasonably capable new model.
xAI has launched Grok 4.3, which scores 53 on the Artificial Analysis Intelligence Index v4.0, a composite of 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, and CritPt. That puts Grok 4.3 just ahead of Muse Spark and Claude Sonnet 4.6, and 4 points clear of its predecessor, Grok 4.20 0309 v2.

The release also brings meaningful price cuts: input token prices are down ~37.5% and output token prices ~58.3% compared to Grok 4.20, making Grok 4.3 one of the more cost-efficient models at its intelligence tier.
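How much those percentage cuts save in practice depends on a workload's input/output token mix, since output tokens got the deeper discount. A quick sketch of the blended savings; the dollar figures are placeholders for illustration, not xAI's actual list prices, which the announcement does not state:

```python
# Placeholder per-million-token prices for illustration only; the release
# specifies only the percentage cuts (~37.5% input, ~58.3% output).
OLD_INPUT, OLD_OUTPUT = 3.00, 15.00   # hypothetical Grok 4.20 prices ($/M tokens)
NEW_INPUT = OLD_INPUT * (1 - 0.375)   # ~37.5% input price cut
NEW_OUTPUT = OLD_OUTPUT * (1 - 0.583) # ~58.3% output price cut

def request_cost(in_toks: int, out_toks: int, in_price: float, out_price: float) -> float:
    """Cost in dollars for one request, with prices quoted in $/M tokens."""
    return (in_toks * in_price + out_toks * out_price) / 1_000_000

# Example mix: a prompt-heavy request with 10k input and 2k output tokens.
old = request_cost(10_000, 2_000, OLD_INPUT, OLD_OUTPUT)
new = request_cost(10_000, 2_000, NEW_INPUT, NEW_OUTPUT)
print(f"blended savings: {1 - new / old:.0%}")  # roughly 48% for this mix
```

Output-heavy workloads (long generations, agentic loops) would see blended savings closer to the ~58% output cut; prompt-heavy workloads closer to the ~37.5% input cut.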

Intelligence Index Performance
Grok 4.3 improves on Grok 4.20 0309 v2 both in score and in cost-to-run. It costs $395 to run the full Artificial Analysis Intelligence Index benchmark suite — roughly 20% less than Grok 4.20 0309 v2 — placing it comfortably on the Pareto frontier for intelligence vs. cost. Despite using ~44% more output tokens than Grok 4.20 0309 v2, its verbosity remains in line with models like MiniMax-M2.7 and well below the most verbose frontier models.

Grok 4.3 Benchmark Highlights
GDPval-AA (Agentic Tasks): The standout improvement. Grok 4.3 scores an Elo of 1500 on GDPval-AA, up 321 points from Grok 4.20 0309 v2’s 1179, surpassing Gemini 3.1 Pro Preview, Muse Spark, GPT-5.4 mini (xhigh), and Kimi K2.5. That said, it still trails GPT-5.5 (xhigh) by 276 Elo points, which corresponds to an expected win rate of roughly 17% under the standard Elo formula. The gap to the leader has narrowed, but it remains substantial.

τ²-Bench Telecom (Instruction Following & Agentic Customer Support): Grok 4.3 gains 5 points to reach 98%, putting it in line with GLM-5.1. This is a strong result and reflects well on the model’s practical utility in structured, real-world agentic workflows.

IFBench: Maintains an 81% score from Grok 4.20 0309 v2 — no regression on general instruction following.

AA-Omniscience Accuracy: Up 8 points from the previous version, showing improved factual performance across domains.

AA-Omniscience Non-Hallucination Rate: Down 8 points, meaning Grok 4.20 0309 v2 still leads on this metric, followed by MiMo-V2.5-Pro and then Grok 4.3. The accuracy gain has come at a cost to reliability — a tradeoff worth watching in production deployments.

GPQA Diamond, SciCode, Humanity’s Last Exam: The charts in the Artificial Analysis release show Grok 4.3 performing competitively across these graduate-level reasoning and scientific benchmarks, though GPT-5.5 (xhigh) holds the top spot overall on the composite index with a score of 60.
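The ~17% expected win rate quoted for the GDPval-AA gap follows directly from the standard Elo expectation formula; a minimal sketch:

```python
def elo_expected_win_rate(player: float, opponent: float) -> float:
    """Probability that `player` beats `opponent` under the standard
    Elo logistic formula: E = 1 / (1 + 10^((opponent - player) / 400))."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - player) / 400.0))

# Grok 4.3 at 1500 vs GPT-5.5 (xhigh) 276 points higher on GDPval-AA:
p = elo_expected_win_rate(1500, 1500 + 276)
print(f"{p:.1%}")  # prints 17.0%
```

The formula is symmetric (the two expected scores always sum to 1), so the same 276-point gap gives the leader an expected win rate of roughly 83%.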

Context: Where xAI Stands
Grok 4 — released in July 2025 — briefly topped most benchmarks and marked a high point for the lab. Grok 4.3 is not a leap of that magnitude. It’s an incremental update: more cost-efficient, better at agentic tasks, and marginally stronger on the composite index. For developers and enterprises evaluating models on a cost-per-intelligence basis, though, those are exactly the improvements that matter.
The co-founder departures — nearly all of the original twelve have now left — have raised questions about institutional continuity at xAI. Elon Musk has framed the restructuring as a deliberate rebuild. Grok 4.3 doesn’t resolve those questions, but it does suggest the model development pipeline remains intact.
The broader competitive picture is unchanged: GPT-5.5 (xhigh) leads the Artificial Analysis Intelligence Index at 60, with Claude Opus 4.7 and Gemini 3.1 Pro Preview also scoring above Grok 4.3. But at $395 to run the full benchmark suite — compared to thousands of dollars for the top-ranked models — Grok 4.3 makes a credible value case for workloads that don’t require frontier-level performance.