OpenAI Launches GPT-5.6 Sol, Beats Mythos On TerminalBench

OpenAI has announced the GPT-5.6 series — Sol, Terra, and Luna — in a limited preview beginning today. Sol is the flagship, Terra is pitched as a capable mid-tier option at half the cost of Sol, and Luna is the economy option for high-volume, cost-sensitive workloads. Broad availability across ChatGPT, Codex, and the API is promised “in the coming weeks,” though OpenAI has chosen a phased rollout coordinated with the U.S. government rather than an open launch.

The benchmark that will draw the most attention is TerminalBench 2.1, which tests command-line workflows requiring multi-step planning, tool coordination, and iteration. GPT-5.6 Sol scores 88.8% on that benchmark, behind only GPT-5.6 Sol Ultra, a new compute-intensive mode that hits 91.9%. More significant for the competitive picture: Claude Mythos 5, Anthropic’s restricted frontier model, sits at 88.0% on the same benchmark, and Claude Fable 5, Anthropic’s current publicly available flagship, scores 84.3%, tied with GPT-5.6 Terra. That puts Sol 0.8 percentage points ahead of Mythos, a narrow margin but a meaningful result given how much of Mythos’s reputation has rested on coding and agentic capability.

On biology, OpenAI says GPT-5.6 Sol outperforms GPT-5.5 on GeneBench v1, a genomics and quantitative biology benchmark, while using fewer tokens. GPT-5.5 had already set a strong mark when it launched in April, and GPT-5.6 Sol improving on it with greater efficiency points to a model that is running smarter rather than just longer.

Cybersecurity is where OpenAI is being most deliberate in its framing. On ExploitBench, GPT-5.6 Sol is described as competitive with Claude Mythos while using roughly a third of the output tokens. On ExploitGym, a benchmark developed with UC Berkeley researchers, all three GPT-5.6 models show “strong improvements” in cyber capabilities as reasoning effort increases. OpenAI does clarify that Sol does not cross what it calls the “Cyber Critical” threshold — in tests against Chromium and Firefox, it found bugs and exploitation primitives but did not autonomously produce a functional full-chain exploit. Whether that’s a limitation or a deliberate design choice is hard to tell from the outside, but OpenAI is clearly aware of the optics of putting a model with Mythos-level cyber capability into general distribution.

The safeguard stack OpenAI describes for GPT-5.6 is the most layered it has shipped. There are model-level refusals, real-time output classifiers for cyber and biology misuse, a “pause and review” mechanism where a larger reasoning model can evaluate flagged outputs before they reach the user, and account-level review that looks across conversations rather than just individual prompts. OpenAI says it dedicated over 700,000 A100-equivalent GPU hours to automated red-teaming specifically aimed at finding universal jailbreaks — attacks that generalize across many prompts rather than exploiting one narrow pattern. Human red-teaming is ongoing through the preview period, with a rapid-response process to turn discovered weaknesses into updated evaluations.

GPT-5.6 also introduces two operational modes. The first, max reasoning effort, gives Sol more time to reason deeply before responding, a toggle comparable to what competitors have called “extended thinking.” The second, ultra mode, goes further, coordinating multiple subagents in parallel to tackle complex work. GPT-5.6 Sol Ultra’s 91.9% on TerminalBench was achieved in that mode.

On pricing, Sol comes in at $5 per million input tokens and $30 per million output tokens — matching GPT-5.5’s pricing exactly. Terra is $2.50/$15, and Luna is $1/$6. OpenAI has also redesigned how prompt caching works: cache writes now cost 1.25x the base input rate, cache reads remain at a 90% discount, and there’s a minimum 30-minute cache lifetime with support for explicit cache breakpoints — which should make costs more predictable for developers running long agentic sessions. In July, OpenAI is also launching Sol on Cerebras at up to 750 tokens per second for select customers, which would be a significant speed advantage for latency-sensitive deployments.
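To make the pricing concrete, here is a rough Python sketch of how a session's cost might be estimated from the rates quoted above. The per-tier prices, the 1.25x cache-write multiplier, and the 90% cache-read discount come from the announcement as reported; the function name `estimate_cost`, its parameters, and the assumption that the four token categories are billed separately and simply summed are illustrative, not OpenAI's documented billing logic.

```python
# USD per million tokens (input, output), as quoted in the article.
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}

CACHE_WRITE_MULTIPLIER = 1.25  # cache writes cost 1.25x the base input rate
CACHE_READ_MULTIPLIER = 0.10   # cache reads keep a 90% discount


def estimate_cost(tier, uncached_input, cache_write, cache_read, output):
    """Estimate session cost in USD from token counts.

    Hypothetical helper: assumes each token category is billed
    independently and the charges are added together.
    """
    input_rate, output_rate = PRICES[tier]
    # Convert every category to "base input token equivalents" first.
    billable_input = (
        uncached_input
        + cache_write * CACHE_WRITE_MULTIPLIER
        + cache_read * CACHE_READ_MULTIPLIER
    )
    return (billable_input * input_rate + output * output_rate) / 1_000_000


# Example: a 200k-token prefix written to cache once and read 20 times,
# plus 50k uncached input tokens and 100k output tokens on Sol.
session = estimate_cost(
    "sol",
    uncached_input=50_000,
    cache_write=200_000,
    cache_read=200_000 * 20,
    output=100_000,
)
print(f"${session:.2f}")
```

Under these assumptions the example session comes to $6.50. The same traffic with no caching at all (4.25 million input tokens billed at the full rate) would come to $24.25, which illustrates why the cache terms matter for long agentic sessions.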

The naming system is new. GPT-5.6 is the generation identifier, while Sol, Terra, and Luna are intended as durable capability tiers that can each advance on their own release cadence — meaning future updates to any tier may not require bumping the generation number.

The release arrives at a competitive moment. Chinese models have been gaining ground on US labs across several key benchmarks, and Anthropic has held the coding model leadership position for much of the past several months. TerminalBench 2.1 gives OpenAI a specific, credible result to point to — one that puts Sol ahead of Mythos on a test that matters to the agentic developer market. Whether that translates into benchmark leadership on the broader Artificial Analysis Intelligence Index will become clearer once those results are published alongside the full model release.

For now, GPT-5.6 Sol is available to a limited group of partners and organizations through the API and Codex. General availability is expected in the coming weeks.
