GPT-5’s presentation showed off one of the most powerful AI models ever built, but some viewers weren’t impressed with the charts the company used in its demonstrations.
Some users have trolled OpenAI for “chart crimes” in its GPT-5 presentation. One graph that came under particular criticism compared the results of OpenAI’s models on the SWE-bench Verified benchmark. GPT-5 scored 74.9 ‘with thinking’ and 52.8 ‘without thinking’ on the benchmark, but the graph didn’t quite represent those numbers. The bar for the 52.8 figure was drawn taller than the bar for the 69.1 achieved by o3, possibly in a bid to make GPT-5’s results appear much better than they were. More bizarrely, the bars for o3 and GPT-4o were the same height, even though the values were markedly different at 69.1 and 30.8 respectively. This made GPT-5’s progress over previous models appear more dramatic, when in reality the jump was much smaller than the chart suggested.
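To make the mismatch concrete, here is a minimal sketch of what the bars would look like if their heights were simply proportional to the scores quoted above. The `bar_heights` helper and the 100-unit scale are illustrative assumptions, not anything OpenAI used; only the four scores come from the presentation.

```python
# SWE-bench Verified scores as quoted in the article.
scores = {
    "GPT-5 (with thinking)": 74.9,
    "GPT-5 (without thinking)": 52.8,
    "o3": 69.1,
    "GPT-4o": 30.8,
}


def bar_heights(values, max_height=100.0):
    """Hypothetical helper: scale values linearly from a zero baseline
    so that the largest value gets a bar of max_height units."""
    top = max(values.values())
    return {name: max_height * value / top for name, value in values.items()}


heights = bar_heights(scores)

# Print a rough text bar chart, tallest bar first.
for name, height in sorted(heights.items(), key=lambda item: -item[1]):
    print(f"{name:26s} {scores[name]:5.1f}  {'#' * round(height / 2)}")
```

With heights scaled this way, the o3 bar comes out taller than the bar for GPT-5 without thinking, and the GPT-4o bar is less than half the height of the o3 bar, which is the opposite of what the presentation's chart showed.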
“Im sorry but what is this graph,” said a user.
“Worst graph crime ive ever seen, what the fuck is this,” said another.
Stability AI founder Emad Mostaque also called it a “chart crime”.
Another user attributed the mis-sized graphs to OpenAI’s marketing team.
And even rivals at Anthropic took note. “Before my current gig, I was (among other things) chief of the chart crime police for the reasoning team at OAI,” said Nat McAleese, who now works at Anthropic. “Moderately proud how many people have texted me today saying I would never have let that plot happen lol,” he said.
Now the AI community is obsessed with benchmarks, and GPT-5’s benchmarks were eagerly awaited for weeks. While OpenAI has given out the benchmark results, the way it presented them felt odd. It’s inconceivable that no one at OpenAI, which is full of academics and researchers, caught the oddly sized bars. Viewers who weren’t paying attention would have been misled into believing that GPT-5 was a bigger jump over previous models than it actually was. And in a world where progress is everything, misrepresenting these details tends to elicit an immediate negative reaction from the AI community.