Poetiq Goes Past Gemini 3 Pro On ARC-AGI At Half The Cost With Third-Party LLMs

It turns out that reaching the top of AGI benchmarks doesn't necessarily require building your own model: clever use of existing models can get you there too.

A six-person startup called Poetiq has achieved state-of-the-art results on the ARC-AGI-2 benchmark without training its own frontier AI model. Instead, the company’s meta-system orchestrates existing models from Google, OpenAI, Anthropic, and xAI to solve complex reasoning tasks more accurately and at lower cost than the underlying models themselves.

On December 5, Poetiq announced that its system achieved 54% accuracy on the ARC-AGI-2 Semi-Private Evaluation Set at a cost of $30.57 per problem. This surpasses Google’s Gemini 3 Deep Think, which scored 45.1% at $77.16 per problem. The result represents the first system to break through the 50% accuracy barrier on this benchmark, which tests abstract reasoning capabilities considered crucial for artificial general intelligence.

The ARC-AGI benchmark challenges AI systems with visual reasoning puzzles that require understanding core concepts such as objects, counting, and basic physics. These tasks are deliberately designed to be easily solvable by humans but extremely difficult for current AI systems, making them a rigorous test of genuine intelligence rather than sophisticated pattern matching.
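For readers unfamiliar with the benchmark, ARC tasks are distributed as JSON objects: a few train input/output grid pairs from which the solver must infer a transformation rule, plus test inputs to apply it to. The toy task below (rule: recolor 1s to 2s) is purely illustrative; real ARC-AGI-2 tasks are far harder.

```python
# Illustrative ARC-style task in the benchmark's JSON structure.
# The transformation rule here is invented for demonstration.

task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 1]], "output": [[2, 2], [0, 2]]},
    ],
    "test": [{"input": [[1, 0], [0, 1]]}],
}

def apply_rule(grid):
    """The hidden rule a solver must infer from the train pairs:
    recolor every 1 to a 2, leave other cells unchanged."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

# The inferred rule must reproduce every training pair exactly.
assert all(apply_rule(p["input"]) == p["output"] for p in task["train"])
print(apply_rule(task["test"][0]["input"]))  # [[2, 0], [0, 2]]
```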

A Meta-System Approach

Rather than developing proprietary models, Poetiq built what it calls a “meta-system” that automatically discovers optimal strategies for extracting knowledge from existing large language models. The company’s approach treats the prompt as an interface rather than the intelligence itself, using an iterative problem-solving loop where models generate solutions, receive feedback, analyze results, and refine their approaches.
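The iterative loop described above can be sketched in a few lines. Poetiq has not published its system in this form, so everything below is a hypothetical stand-in: `call_model` substitutes a cycle of hard-coded guesses for a real LLM API call, and the task is a toy grid puzzle.

```python
# Minimal sketch of a generate -> feedback -> analyze -> refine loop.
# `call_model` is a stand-in for an LLM call; a real system would send
# the task, prior attempts, and failure feedback to a hosted model.

def call_model(task, feedback):
    """Propose a candidate transformation; here, naively cycle guesses."""
    guesses = [
        lambda g: g,                      # identity
        lambda g: g[::-1],                # reverse row order
        lambda g: [r[::-1] for r in g],   # reverse each row
    ]
    return guesses[len(feedback) % len(guesses)]

def solve(train_pairs, max_iters=5):
    """Generate a candidate, score it on the train pairs, feed failures
    back as refinement signal, and stop once all pairs are reproduced."""
    feedback = []
    for _ in range(max_iters):
        candidate = call_model(train_pairs, feedback)
        failures = [(x, y) for x, y in train_pairs if candidate(x) != y]
        if not failures:
            return candidate          # all training examples reproduced
        feedback.append(failures)     # the "analyze and refine" signal
    return None

# Toy task whose hidden rule is "reverse each row of the grid".
train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
program = solve(train)
print(program is not None)  # True: the loop converged on row reversal
```

The key design point the article describes is that the loop, not any single model call, carries the problem-solving strategy.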

The system demonstrated remarkable flexibility by producing strong results across more than a dozen different models within hours of Gemini 3’s release on November 18, 2025. Notably, Poetiq’s adaptation work was completed before both Gemini 3 and GPT-5.1 became available, using only open-source models for development. The resulting systems then transferred successfully to newer, larger models across different families.

Establishing New Efficiency Frontiers

Poetiq’s results establish what the company calls “entirely new Pareto frontiers” on both the ARC-AGI-1 and ARC-AGI-2 benchmarks, meaning its systems achieve better accuracy-cost tradeoffs than any previous entry. The company demonstrated this across multiple configurations, made possible by aggressive efficiency optimizations: the system averages fewer than two model requests per task and completes tasks in a single attempt rather than the two attempts ARC-AGI permits.

Improving Every Model It Touches

In tests across 12 different models from various families, Poetiq’s meta-system consistently improved both accuracy and cost compared to using the models directly. The company applied its technique to popular recent models from Google DeepMind, OpenAI, Anthropic, and xAI, demonstrating what it calls “substantial transference and generalization” across model versions, families, and sizes.

The ARC-AGI benchmark provides an ideal testing ground for this approach because it requires complex reasoning rather than simple knowledge retrieval. While LLMs contain extensive knowledge, their stochastic nature makes extracting that knowledge reliably difficult. Poetiq’s system addresses this by discovering adaptive reasoning strategies rather than following predetermined approaches.
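The stochasticity problem is concrete: sampling the same prompt twice can yield different answers. Poetiq's discovered strategies are not public, but the simplest well-known mitigation, self-consistency (sample several times, take the majority answer), illustrates the general idea. The "model" below is a random stub, not a real API.

```python
# Self-consistency voting over a stochastic answer source.
# This is a generic textbook technique, not Poetiq's actual method.

import random
from collections import Counter

def noisy_model(prompt, rng):
    """Stand-in for one sampled LLM answer: correct 80% of the time."""
    return "42" if rng.random() < 0.8 else "wrong"

def majority_vote(prompt, samples=101, seed=0):
    """Sample the model many times and return the most common answer."""
    rng = random.Random(seed)
    votes = Counter(noisy_model(prompt, rng) for _ in range(samples))
    return votes.most_common(1)[0][0]

print(majority_vote("what is 6 * 7?"))  # "42"
```

Voting trades cost (more samples) for reliability, which is exactly the accuracy-cost tradeoff the benchmark's Pareto framing measures.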

The Road Ahead

The company, founded by six researchers and engineers with a combined 53 years of experience from Google DeepMind, positions this result as just the beginning. Poetiq indicates it has tackled several other benchmarks with similarly strong results and plans to demonstrate how its system can optimize AI components inside existing larger systems.

The approach raises intriguing questions about the future of AI development. If meta-systems can orchestrate existing models to outperform those same models individually, the path to advanced AI capabilities may not require every company to train its own frontier models. Instead, intelligence might emerge from discovering better ways to extract and combine knowledge already present in existing systems.

For now, Poetiq has open-sourced the code for its ARC-AGI systems, inviting the research community to build on its approach. Whether this meta-system strategy scales to more complex real-world tasks remains to be seen, but the benchmark results suggest that clever orchestration may prove as valuable as raw model capability.
