AI models have been performing well on math benchmarks for a while, but they’re now making their presence felt in real classrooms as well.
Carnegie Mellon University math professor Po-Shen Loh has said that GPT-o1, OpenAI’s reasoning model, earned a perfect score on a math exam he’d set for his undergraduate class. The AI model took less than a minute to solve each problem, while the fastest student in his class took thirty minutes. The cost of running the model was just 5 cents per question, or 0.25 USD (Rs. 20) for the entire test, which requires an advanced understanding of mathematics.

“Oh my goodness. GPT-o1 got a perfect score on my Carnegie Mellon undergraduate math exam, taking less than a minute to solve each problem,” he wrote on X. “I freshly design non-standard problems for all of my exams, and they are open-book, open-notes. I showed the exam to one of our math Ph.D. students (a former International Mathematical Olympiad Gold Medalist from Belarus), and he said ‘Hmm. Non-Trivial. Good.’”
“Our undergraduate students are also very good. This exam was not easy for them, as the score distribution shows. Today is the 2-year anniversary of the public release of GPT-4. Two years ago, it caught my eye because it exhibited sparks of insight, similar to what I would see when I talked to clever kids who learned quickly. That gave me the instinct and urgency to start warning people. Today’s observation of GPT-o1 being able to ace my hard college exam makes me feel like we’re close to the tipping point of being able to do moderately-non-routine technical jobs. I was impressed by every student in my class who got a perfect score. The fastest such person took 30 minutes. And GPT-o1 only costs $60 per million words output, which means that each problem cost about 5 cents to solve. A total of around 25 cents, for work that most people can’t complete in 1 hour,” he added.
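Loh’s cost figures are easy to sanity-check. Here is a minimal back-of-the-envelope sketch in Python, using only the numbers quoted in his post; the five-problem exam size and the output length per solution are derived from those numbers, not stated directly in the source:

```python
# Back-of-the-envelope check of the quoted costs (figures from Loh's post;
# the problem count and words-per-solution are derived, not stated directly).
PRICE_PER_MILLION_WORDS = 60.00   # USD, quoted output rate for GPT-o1
COST_PER_PROBLEM = 0.05           # USD, "about 5 cents to solve"
TOTAL_COST = 0.25                 # USD, "a total of around 25 cents"

# Implied number of problems on the exam.
problems = TOTAL_COST / COST_PER_PROBLEM  # -> 5.0

# Implied output length per solution at the quoted rate.
words_per_problem = COST_PER_PROBLEM / (PRICE_PER_MILLION_WORDS / 1_000_000)

print(f"{problems:.0f} problems, ~{words_per_problem:.0f} output words each")
# -> 5 problems, ~833 output words each
```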

There had been indications that AI models capable of fully solving advanced math exams were coming. OpenAI’s o3 model, announced in December last year, had managed an accuracy of 96.7 percent on Competition Math (AIME), a feeder exam for the US Math Olympiad team. On GPQA Diamond, a benchmark of PhD-level science questions, o3 scored 87.7 percent, the highest ever achieved by an AI model. And on EpochAI’s FrontierMath, a benchmark of research-level math problems, o3 managed an accuracy of 25.2 percent, roughly 13 times better than the previous highest accuracy achieved by an AI model on the test.
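To put the FrontierMath jump in perspective, the quoted 13x improvement implies the previous best score was tiny. A quick sketch in Python, where the previous-best figure is derived from the article’s numbers rather than stated in the source:

```python
# The "13 times better" claim implies the previous best score (derived, not stated).
O3_FRONTIERMATH = 25.2     # percent, quoted accuracy for o3
IMPROVEMENT_FACTOR = 13    # quoted multiple over the previous best

previous_best = O3_FRONTIERMATH / IMPROVEMENT_FACTOR
print(f"Implied previous best: ~{previous_best:.1f}%")  # -> ~1.9%
```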
And these results could end up changing how students are tested in schools and colleges. If tests can be solved completely by a computer program, that could disincentivize students from studying hard for them, and also call into question the value of testing students on skills that computers can now handle; nobody, for instance, asks students to solve long multiplication problems by hand after the advent of calculators. These are thorny questions, and teachers and administrators will have to answer them as AI keeps getting better and more powerful in the coming years.