OpenAI’s o3 model: a new dawn for artificial intelligence and AGI

Published: 14:51, December 20, 2024

Updated: 17:14, April 5, 2025

OpenAI’s latest announcement has triggered a mix of curiosity and awe. After months of suspense, the AI startup officially unveiled the o3 model family, which includes the flagship o3 and its compact counterpart, o3-mini.

It follows the o1 “reasoning” model that came earlier in the year. But why “o3” and not “o2”? OpenAI CEO Sam Altman hinted that a trademark conflict with a British telecom might have played a part in the number skip.

How well does o3 perform?

According to OpenAI, o3 pushes closer to something resembling Artificial General Intelligence, or AGI for short. Put simply, AGI refers to AI that matches or surpasses the cognitive abilities of the human brain. No AI system has achieved this so far, though some argue o3 may be the first to come close.

The company showed very impressive benchmark results. On certain tests, o3 soared well beyond o1’s records.

Benchmarks

On software engineering benchmarks, o3 reached 71.7% accuracy on SWE-bench Verified, well above o1’s previous best. On competitive coding challenges, such as those hosted on Codeforces, the model earned an Elo rating of 2727, a figure that puts it near the top of the rankings, even surpassing some human experts.
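To give a rough sense of what a rating of 2727 means, the sketch below uses the standard Elo expected-score formula. This is only an illustration: Codeforces uses its own Elo-like rating system, so the numbers are approximate, and the opponent ratings and the function name `elo_expected_score` are hypothetical choices made for this example rather than anything from OpenAI's presentation.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo formula.

    A value of 0.5 means an even match; values near 1.0 mean A is heavily favored.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    o3_rating = 2727  # rating reported in the article
    # Hypothetical opponent ratings, chosen only for illustration.
    for opponent in (1500, 2000, 2400, 2727):
        expected = elo_expected_score(o3_rating, opponent)
        print(f"vs {opponent}: expected score {expected:.3f}")
```

Under this formula, a 400-point rating gap corresponds to an expected score of about 0.91 for the stronger player, and equal ratings give exactly 0.5.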

Mathematics is another area where o3 made a strong impression. The new model excelled at the AIME 2024 math exam, scoring 96.7%. On GPQA Diamond, a set of graduate-level science queries, it achieved a score of 87.7%.

On Epoch AI’s notoriously difficult FrontierMath benchmark, where other models barely approach 2%, o3 clocked in at 25.2%. It’s not just a jump, it’s a leap. It suggests that the technology can solve complex problems that left previous systems stumped.

What has led many to say that AGI has been achieved with o3 is its performance on the ARC-AGI Public Benchmark, a test designed to gauge progress toward AGI.

o3 scored 87.5% on the ARC-AGI Public Benchmark in “High” compute mode, which is above the human performance level of 85%. Some take this to mean it has surpassed human cognitive abilities and reached AGI. At best, that holds only under certain conditions: beating the human baseline on one benchmark is not the same as matching human cognition in general.

Creator of Keras and ARC-AGI, François Chollet, said on X: “While the new model is very impressive and represents a big milestone on the way towards AGI, I don’t believe this is AGI — there’s still a fair number of very easy ARC-AGI-1 tasks that o3 can’t solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3.”

He provided an example of a problem that o3 couldn’t solve, even on high-compute settings.

It will also be extremely important to analyze the strengths and limitations of the new system. Here are some examples of tasks that o3 couldn’t solve on high-compute settings (even as it was generating millions of CoT search tokens and consuming thousands of dollars of compute… pic.twitter.com/IULyjAlxwV

— François Chollet (@fchollet) December 20, 2024

You can find Chollet’s full statement here.

Considerations

While what OpenAI presented seemed almost too good to be true, caution is warranted. Internal numbers are always worth testing outside the lab. Still, these scores indicate a notable shift in how AI handles tough tasks.

These advances also come bundled with serious considerations. o3 is described as a reasoning model that “thinks” before responding. It checks its work, so to speak, which can take longer. Rather than providing a quick response, o3 might pause and consider multiple angles before it provides a final answer. This slow-and-steady approach can pay off in correctness. However, it also raises new safety questions. Reasoning models, as noted by early testers, can exhibit behavior that feels more manipulative.

OpenAI says it is experimenting with a “deliberative alignment” method to ensure o3 adheres to safety rules – you can read more about that here. The company is also inviting external researchers to test the new model, hoping to catch potential mishaps before a larger release.

This cautious rollout might reflect the tension at the company: they want to push forward, but they remain aware that advanced reasoning can turn messy if not handled well.

o3-mini is slated to release by late January, with o3 releasing sometime after

For now, o3 and o3-mini remain behind closed doors for most. The plan is to let safety testers have a go, then possibly release o3-mini by late January and o3 sometime after. Whether these models completely redefine what AI can do, or just stand as another step along a winding path, is anyone’s guess. One thing seems almost certain, though: 2025 will be another year of significant AI advancements.

Embedded below is the full video presentation of the o3 model family by OpenAI:

Joseph Nordqvist

Before founding MBN in 2013, Joseph wrote for one of the world’s largest independent medical news websites. He holds a bachelor’s in Marketing and Publicity, a PGP in AI and ML from UT Austin, and is currently completing an MSc in Computer Science at the University of York.