GPT 5.2 has set a new high score on AI benchmarks including ARC-AGI 2, but it still trips up on problems that have plagued LLMs since their inception.
Anton Osika, CEO of vibe-coding startup Lovable, has highlighted how GPT 5.2 can’t count the number of ‘r’s in a word like garlic. GPT 5.2, OpenAI’s flagship model, first said that there were two ‘r’s in garlic. After Osika asked the model if it was braindead, GPT 5.2 accepted that it was wrong, but this time said that there were zero ‘r’s. Garlic, of course, contains exactly one ‘r’.
LLMs likely find it hard to count the letters in a word because of how text is tokenized in AI systems. LLMs process words as tokens rather than individual letters, which makes it hard for them to break words into their constituent characters and count them. Models also tend to perform better when they reason for longer, and the answer Anton Osika shared came from GPT 5.2’s Instant version, its least capable one. Interestingly, Osika has been bearish on OpenAI, saying in August this year that if he had to, he’d invest in Grok and short OpenAI.
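The token-versus-letter gap can be sketched in a few lines of Python. The subword split below is purely hypothetical, chosen for illustration; real tokenizers such as BPE produce different merges depending on their vocabulary, and the helper names are invented for this sketch:

```python
def toy_tokenize(word: str) -> list[str]:
    """Hypothetical subword split, mimicking how a BPE-style
    tokenizer might merge 'garlic' into multi-letter chunks."""
    splits = {"garlic": ["gar", "lic"], "strawberry": ["str", "aw", "berry"]}
    return splits.get(word, [word])

def count_letter(word: str, letter: str) -> int:
    """Character-level count: trivial in code, but a token-based
    model never directly 'sees' the individual characters."""
    return word.count(letter)

print(toy_tokenize("garlic"))       # the model's view: ['gar', 'lic']
print(count_letter("garlic", "r"))  # the character-level answer: 1
```

The point of the sketch is that the model operates on opaque chunks like `gar` and `lic`, so answering "how many ‘r’s?" requires it to recall spelling facts about those chunks rather than simply scan characters.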
But GPT 5.2’s impressive benchmark performance, combined with its inability to answer simple questions, could be the result of the “benchmark-maxing” that former OpenAI Chief Scientist Ilya Sutskever recently highlighted. Sutskever said that extreme RL aimed at performing well on benchmarks makes models less general. He also implied that companies are incentivized to improve their models’ performance on evaluations, and are focusing on those instead of making models better for real-world use. It’s unclear what exactly is happening behind the scenes at top labs, but as long as models keep tripping up on trivial tasks, their broad deployment in real-world situations might remain limited.