OpenAI Announces o3 Model That Smashes AI Benchmarks: Is AGI Here?

There have been many predictions about when humanity will reach AGI, or Artificial General Intelligence, but OpenAI’s latest release is leading some to speculate that it’s already here.

OpenAI has announced its latest model, o3. The model isn’t yet available to the public, as it is still being tested by researchers for safety, but it has blown AI benchmarks out of the water.

On the SWE Bench benchmark, which consists of real-world software problems, the o3 model had an accuracy of 71.7 percent. This was nearly 23 percentage points higher than OpenAI’s previous-best model, o1, which had an accuracy of 48.9 percent. On Competition Code, which tests models on their real-world coding ability, o3 scored an Elo rating of 2727, again significantly higher than the 1891 scored by o1. For context, this means that o3 is currently as good as the 175th-ranked competitive human coder in the world. These are the highest numbers ever achieved by an AI model.
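To give a feel for what the gap between 2727 and 1891 means, here is a minimal Python sketch using the standard Elo expected-score formula. This is an illustration only: the article does not say how the Competition Code rating is computed, and the platform behind it may use a modified rating system, so treating the two numbers as plain Elo ratings is an assumption. The function name `elo_expected_score` is made up for this example.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo formula.

    Assumption: both ratings are on the classic Elo scale, where a
    400-point gap corresponds to 10-to-1 odds.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the article.
O3_RATING = 2727
O1_RATING = 1891

gap = O3_RATING - O1_RATING
expected = elo_expected_score(O3_RATING, O1_RATING)

print(f"Rating gap: {gap} points")
print(f"Expected score of o3 against o1: {expected:.3f}")
```

Under that assumption, the 836-point gap works out to an expected score of roughly 0.99 for o3 in a head-to-head contest with o1, which is why the jump is described as significant.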

On mathematics benchmarks, o3 was again significantly better than previous iterations of OpenAI’s models. On Competition Math, which is a feeder exam for the US Math Olympiad, o3 had an accuracy of 96.7 percent, well above the 83.3 percent that o1 had managed. On a benchmark of PhD-level science questions, o3 had an accuracy of 87.7 percent, which was again higher than the 78 percent that the o1 model had achieved. The o3 scores are again the highest ever achieved by an AI model.

AI models are now getting so good that they’re saturating traditional benchmarks, getting close to perfect scores on these tests. So researchers have made newer tests that are harder for AIs to solve. On one such test, called the Research Math (EpochAI Frontier Math) test, o3 managed an accuracy of 25.2 percent. This was nearly 13 times better than the previous highest accuracy achieved by an AI model on the test, which was just 2 percent. The test consists of novel, unpublished, and extremely hard math questions, which would typically take the best human professional mathematicians hours or days to solve.
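The jumps quoted above mix two kinds of comparison: percentage-point gains (new score minus old score) and relative factors (new score divided by old score). The short Python sketch below recomputes both from the figures in the article, so the two can be told apart. The benchmark labels and the helper name `summarize` are just for this illustration, and the scores are taken as reported, not independently verified.

```python
# (benchmark, previous best score, o3 score), all in percent, as quoted in the article.
RESULTS = [
    ("SWE Bench", 48.9, 71.7),
    ("Competition Math", 83.3, 96.7),
    ("PhD-level science", 78.0, 87.7),
    ("Research Math (EpochAI Frontier Math)", 2.0, 25.2),
]


def summarize(results):
    """Return (name, gain in percentage points, relative factor) for each benchmark."""
    summary = []
    for name, previous, new in results:
        point_gain = round(new - previous, 1)       # absolute gain in percentage points
        relative_factor = round(new / previous, 2)  # how many times the old score
        summary.append((name, point_gain, relative_factor))
    return summary


for name, point_gain, relative_factor in summarize(RESULTS):
    print(f"{name}: +{point_gain} points, {relative_factor}x the previous score")
```

On the saturated benchmarks the relative factor stays close to 1 even when the point gain is large, while on the Research Math test a gain of about 23 points corresponds to a factor of about 12.6, since the starting score was so low.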

But perhaps most interestingly, o3 also smashed the ARC-AGI benchmark. ARC is a non-profit that builds benchmarks to test and guide the development of AI. The benchmark dates back to 2019, and a $1 million prize has been offered for an AI that can beat it. The test shows patterns that computers must learn from, and then asks them to predict outputs for new test inputs. OpenAI’s o3 model managed a new high score on this test as well. Over the last 5 years, the best AI models had only gone from an accuracy of 0 percent to 5 percent on this test. o3’s high-compute version scored a stunning 87.5 percent, basically blowing the competition out of the water. Humans score around 85 percent on this benchmark, which led many to speculate that AGI was finally here. “This is new territory in ARC-AGI world. I need to fix my worldview and change my intuitions about what AI is capable of,” a representative from the ARC foundation said.

These stunning results caused many X users to speculate that AGI was already here. “ok well o3 is AGI hope everyone had fun,” wrote Jacob Andreou, a VC at Greylock.

“Independent evaluations of OpenAI’s o3 suggest that it passed benchmarks that were previously considered far out of reach for AI including achieving a score on ARC-AGI that was associated with actually achieving AGI (though the creators of the benchmark don’t think o3 is AGI),” wrote AI academic Ethan Mollick.

“o3: 87.5%. Humans: 85%. AGI confirmed,” wrote Postmates VP Ben South.

While AGI doesn’t have a specific definition (it’s usually described as artificial intelligence that’s better than humans at most tasks, but the exact meaning can vary), the latest results do seem to suggest that a major breakthrough has been made. Not only is the o3 model significantly better than previous AI models, it’s also better than humans on a prominent AGI benchmark. The model is yet to be tested by the public at large, but OpenAI, and humanity, might’ve crossed a major milestone today.