xAI released Grok 4 today, and the model beat offerings from OpenAI and Google on many benchmarks. But while Grok 4 itself is very capable, there is also a bigger version that does even better: Grok 4 Heavy.
Grok 4 Heavy runs several agents in parallel, which then discuss among themselves to determine the best solution. “With Grok 4 Heavy, what it does is it forms multiple agents in parallel, and all these agents work independently,” Musk said on the livestream. “And then they compare their work and they decide which one (has the correct solution), like a study group,” he added.

“It’s not as simple as a majority vote because often only one of the agents actually figures out the trick or figures out the solution,” Musk explained. “But once they share the trick or figure out what the real nature of the problem is, they share that solution with the other agents. And then they essentially compare notes and then yield an answer. So that’s the heavy part of Grok 4, where we scale up the test time compute by roughly an order of magnitude, have multiple agents tackle the task, and then they compare their work. And they put forward what they think is the best result,” he said.
Grok 4 Heavy did deliver solid results. On most benchmarks, Grok 4 Heavy not only beat Grok 4 but also the top models from OpenAI and Google.
And using multiple agents running in parallel is an interesting new approach. Given that most people seem to agree that the era of pre-training is coming to an end, companies are coming up with new approaches to improve the performance of their models. Thus far, AI labs have made models think for longer in order to get better answers. xAI, though, seems to be creating many agents in parallel, getting them all to think, and then having them compare notes and arrive at the right answer. This approach will likely use more computational resources, but given how the cost of compute could eventually converge to the cost of electricity, it might be a viable way to solve difficult problems.
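Based on Musk's description, the pattern is roughly: sample several independent attempts at the problem in parallel, then have the agents share their candidate solutions and reconcile them into a single answer. The Python sketch below is only an illustration of that two-stage structure; the function names, the thread-pool parallelism, and the placeholder reconciliation step are all assumptions made for clarity, and xAI has not published how Grok 4 Heavy actually orchestrates its agents.

```python
# A minimal sketch of the parallel-agents-that-compare-notes pattern described above.
# Every name here is a hypothetical illustration, not xAI's actual implementation.
from concurrent.futures import ThreadPoolExecutor


def solve_attempt(task: str, agent_id: int) -> str:
    """Placeholder for one agent's independent attempt at the task.
    In a real system this would be a call to a reasoning model."""
    return f"candidate answer from agent {agent_id}"


def compare_notes(task: str, candidates: list[str]) -> str:
    """Placeholder for the reconciliation step: each agent would read all
    candidate solutions, adopt whichever one actually found the trick, and
    the group would converge on a final answer (not a simple majority vote).
    Returning the first candidate is just a stand-in."""
    return candidates[0]


def heavy_solve(task: str, num_agents: int = 4) -> str:
    # Stage 1: run several independent attempts in parallel.
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        candidates = list(pool.map(lambda i: solve_attempt(task, i), range(num_agents)))
    # Stage 2: share the candidates across agents and settle on one answer.
    return compare_notes(task, candidates)


if __name__ == "__main__":
    print(heavy_solve("What is the trick in this puzzle?"))
```

The trade-off this sketch makes visible is the one Musk alluded to: the parallel stage multiplies test-time compute by roughly the number of agents, and the value comes from the reconciliation stage, where one agent's correct insight can be shared with the rest.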