xAI Releases Grok4, Beats OpenAI And Google On Many Benchmarks

Elon Musk might’ve split from OpenAI after initially funding it, but he seems to have created a viable competitor just years later.

xAI, Elon Musk’s AI company, has released the latest version of its Grok AI. Dubbed Grok4, the model is now the best AI model on the market according to the Artificial Analysis Intelligence Index. The index incorporates seven evaluations: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME and Math-500. Grok4 scored 73 on the index, ahead of OpenAI’s o3-pro, which had scored 71, and Google’s Gemini 2.5 Pro, which had scored 70.

Grok4 results on Artificial Analysis Intelligence Index

On almost all of these individual benchmarks, Grok4 and Grok4 Heavy did better than the best models from OpenAI and Google.

Musk talked up the abilities of Grok4 in a livestream. “Grok4 scores perfect scores on SAT, and near-perfect scores on graduate exams like the GRE,” he said. “It is smarter than all graduate students in all fields, simultaneously.”

Musk said that his company had invested heavily in training the model. Grok3 had 10x more pre-training compute than Grok2. Grok4 has nearly the same amount of pre-training compute as Grok3, adding yet another datapoint to the idea that pre-training scaling is coming to an end, but it has far more reasoning compute than Grok3. In a slide, xAI dubbed this a “ludicrous rate of progress”, and Musk said that Grok4 can reason at “superhuman levels”.

This reasoning ability shone through in many of the benchmarks. On Humanity’s Last Exam, which has post-graduate level questions in fields such as math, chemistry and linguistics, Grok4 scored 26.9 percent, far ahead of the 21 percent that Google’s Gemini 2.5 Pro had managed. When the model was given tools to use, it scored an even more impressive 41 percent. Musk predicted that PhD-level intelligence across so many fields could enable Grok4 to discover new technologies this year, and said he’d be “shocked” if it didn’t discover them by next year.

Grok4 also smashed the ARC-AGI benchmark, scoring nearly 16 percent, while other models, including top models from OpenAI, Google and Anthropic, were clustered at around 10 percent. ARC-AGI is a general intelligence benchmark that tests how well models can solve simple puzzles that computers have typically struggled with.

xAI also showed off Grok4’s voice mode, in which the model was able to compose an opera about Diet Coke and sing it. In an interesting twist, xAI compared the latency of ChatGPT’s voice mode side-by-side with Grok’s, showing how Grok answered questions faster.

In the Vending-Bench benchmark, which requires AI models to run a vending machine business, including sourcing inventory, deciding prices and discounts, and making sales, Grok4 again did much better than the competition. It managed sales worth $4,694, compared to $2,077 for Claude Opus 4.
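To put the reported margins on a common footing, here is a minimal sketch that computes how many times larger Grok4’s figure is than the runner-up’s on three of the benchmarks quoted in this article. It uses only the numbers reported above; the function name `relative_lead` is illustrative, and the ARC-AGI values are the article’s approximations (“nearly 16” and “around 10”), so that ratio is rough.

```python
# Head-to-head figures as quoted in this article: (Grok4, runner-up).
# ARC-AGI values are approximations taken from the article's wording.
reported = {
    "Humanity's Last Exam (%, no tools)": (26.9, 21.0),
    "ARC-AGI (%, approximate)": (16.0, 10.0),
    "Vending-Bench (sales, $)": (4694.0, 2077.0),
}


def relative_lead(leader: float, runner_up: float) -> float:
    """Return how many times larger the leader's figure is than the runner-up's."""
    return leader / runner_up


for name, (leader, runner_up) in reported.items():
    print(f"{name}: {relative_lead(leader, runner_up):.2f}x")
```

The ratios are not comparable across benchmarks, since two are percentage scores and one is a dollar amount, but they do show that the Vending-Bench gap is proportionally the widest of the three.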

Musk’s xAI has managed to produce a very capable model, but he still stressed the importance of aligning an AI that is rapidly becoming more powerful. “We should create AI that’s maximally truth-seeking. It needs to be truthful and honourable. It’s like a child which will ultimately grow up to be super smart, but you can still instill values in it,” Musk said.