AI models can now code better than nearly all humans and win gold medals at Math Olympiads, but there are still some relatively trivial tasks that trip them up.
A new AI benchmark named ClockBench assesses AI models on a deceptively simple skill: reading analog clocks. Clocks have been a bugbear for AI for a while; even the best systems currently struggle to read analog time or to generate images showing a specific time on a clock face. ClockBench puts top models through their paces and finds that AI still has a long way to go before it understands how analog clocks work.
ClockBench’s creators built 36 custom clock faces, with 5 sample clocks per face, for 180 clocks in total. Each clock came with 4 questions, for a total of 720 questions. The researchers tested 11 vision-capable models from 6 labs, and also administered the test to 5 human participants.
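For concreteness, here is a minimal sketch of the test set's arithmetic, using the figures above; the constant names are hypothetical, not ClockBench's actual data format:

```python
# Hypothetical reconstruction of ClockBench's question count.
NUM_FACES = 36           # custom clock face designs
CLOCKS_PER_FACE = 5      # sample clocks rendered per face
QUESTIONS_PER_CLOCK = 4  # questions asked about each clock

total_clocks = NUM_FACES * CLOCKS_PER_FACE            # 36 * 5 = 180
total_questions = total_clocks * QUESTIONS_PER_CLOCK  # 180 * 4 = 720
print(f"{total_clocks} clocks, {total_questions} questions")
```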

Humans were much better than AI at reading clocks. Not only did they get the time right 89% of the time, but when they were wrong, they were off by a median of only 3 minutes. The best AI model, by contrast, was right only 13% of the time, and when it was wrong, its median error was about an hour. The worst AI model's median error was 3 hours, suggesting it had little idea of how analog clocks work at all.
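That median-error figure deserves a note on how it might be computed: a 12-hour dial wraps around, so the natural way to score a wrong reading is the shortest distance around the dial. Below is a minimal Python sketch of that idea; ClockBench's actual scoring code isn't described here, so treat the function and the example readings as illustrative assumptions:

```python
from statistics import median

def clock_error_minutes(predicted: str, actual: str) -> int:
    """Shortest distance, in minutes, between two times on a 12-hour dial."""
    def to_minutes(t: str) -> int:
        h, m = map(int, t.split(":"))
        return (h % 12) * 60 + m

    diff = abs(to_minutes(predicted) - to_minutes(actual))
    # The dial wraps every 720 minutes, so take the shorter way around:
    # reading 1:05 off a clock showing 12:55 is a 10-minute error, not 710.
    return min(diff, 720 - diff)

# Hypothetical model readings paired with ground truth.
pairs = [("1:05", "12:55"), ("4:30", "3:30"), ("10:10", "10:12")]
errors = [clock_error_minutes(p, a) for p, a in pairs]
print(median(errors))  # -> 10 (minutes)
```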
The benchmark’s creators also found that clocks with Roman numerals were harder for AI models to read, as were clocks with a prominent second hand and those with colorful backgrounds.
Among AI models, Gemini 2.5 Pro topped the benchmark, reading clocks correctly 13.3% of the time. It was followed by another Google model, Gemini Flash, at 10.5%. OpenAI’s latest GPT-5 model, which CEO Sam Altman likened to having a PhD in your pocket, managed only 8.4%. Another frontier model, Grok, fared worse still, coming in dead last at 0.7%.

There’s something about clocks that trips models up: they don’t seem to fully grasp how the minute and hour hands relate, despite many clocks likely appearing in their training data. AI models also have an interesting quirk: when asked to generate a clock, they tend to show 10:10, the time favored in watch advertisements because the hands frame the brand logo. This looks like an example of the so-called jagged frontier: AI models are undeniably capable, yet they still fail at tasks a 6-year-old would find easy.
And benchmarks like ClockBench put these shortcomings in sharp relief, giving AI researchers a concrete target to improve against. Interestingly, models seem to score roughly as low on ClockBench as they do on ARC-AGI, a benchmark designed to measure progress toward general intelligence. It remains to be seen how quickly models will improve at reading clocks; while the task itself isn’t particularly useful in the real world, it can serve as an indicator of whether models have picked up some common sense.