By mid-2026, the choice for most teams building serious AI features comes down to two models: Anthropic's Claude Fable 5 — released June 9 as the generally available half of a two-model launch — and OpenAI's GPT-5.5, which shipped April 23 and is now the model a large share of engineering teams already run inside Codex. This is the head-to-head: who wins on capability, who wins on cost, and which one fits the work you are actually trying to ship.
The short version. On Anthropic's own published benchmark table, the only one that pits the two directly, Fable 5 leads GPT-5.5 on every row, often by wide margins. But GPT-5.5 costs half as much on input and 40% less on output ($5/$30 vs $10/$50 per million tokens), ties or beats Opus-class models on terminal coding through its own Codex harness, leads on parts of long-context retrieval and abstract reasoning, and carries an enormous deployed ecosystem. And in a telling convergence, both labs now treat cybersecurity and biology as high-risk and gate those capabilities behind vetted-access programs.
For the full picture on each model, see our Claude Fable 5 & Mythos 5 release breakdown and our GPT-5.5 complete guide. This post is the comparison, with the benchmark caveats stated plainly and a routing guide at the end.
- 01. Fable 5 leads every row of the only direct benchmark table. On Anthropic's published comparison, Fable 5 beats GPT-5.5 on agentic coding (SWE-Bench Pro 80.3% vs 58.6%; FrontierCode Diamond 29.3% vs 5.7%), knowledge work (GDPval-AA 1932 vs 1769), computer use (85.0% vs 78.7%), legal (13.3% vs 2.1%), tool use, vision, and multidisciplinary reasoning. It is Anthropic's table, so read it as Anthropic-reported, but where it overlaps with OpenAI's own numbers (SWE-Bench Pro, OSWorld, Humanity's Last Exam), the figures match.
- 02. GPT-5.5 costs about half as much, and that is the whole argument. GPT-5.5 is $5/$30 per million input/output tokens against Fable 5's $10/$50: half the input rate and 60% of the output rate. On a typical 100K-in/20K-out task that is roughly $1.10 versus $2.00 before caching. GPT-5.5 Pro, the highest-accuracy tier, is $30/$180. The catch on GPT-5.5: a long-context surcharge kicks in above 272K input tokens (2x input, 1.5x output, applied to the whole session), whereas Fable 5 has no published surcharge.
- 03. Coding leadership depends on the harness you measure with. GPT-5.5's headline is Terminal-Bench 2.0 at 82.7% (its strongest agentic-coding result), and it reaches 83.4% on Terminal-Bench 2.1 via its own Codex CLI harness, ahead of Opus 4.8 and close to Fable 5's 88.0%. But on SWE-Bench Pro, which tests end-to-end GitHub issue resolution, Fable 5's 80.3% is far ahead of GPT-5.5's 58.6%. The lesson the prior comparisons keep repeating: terminal coding favors GPT-5.5's Codex integration; codebase resolution favors Claude. Test both on your real pipeline.
- 04. Long-context retrieval splits by benchmark, with no clean winner. GPT-5.5 posts strong scores on OpenAI's MRCR (74.0% at 512K-1M tokens), a genuine long-context strength. But on GraphWalks BFS at 1M tokens, Claude leads (Opus 4.8 at 68.1% vs GPT-5.5's 45.4%, Anthropic-reported). Context-window parity (both reach 1M) is not retrieval parity, and the two labs lead on different long-context evals. Match the benchmark to your actual retrieval pattern before deciding.
- 05. Both labs gate cyber and biology: the safety posture has converged. Fable 5 routes cybersecurity, biology, and distillation queries to Opus 4.8, with the unlocked Mythos 5 restricted to Project Glasswing partners. GPT-5.5 rates cyber and bio as High under OpenAI's Preparedness Framework, ships stricter classifiers, and offers Trusted Access for Cyber to verified defenders. If your work lives near those domains, expect more refusals or fallbacks from both, and a vetted-access path from both.
01 - The Matchup
Two flagships, two philosophies, and a seven-week age gap.
These models come from different release strategies. GPT-5.5 shipped on April 23, 2026 (API access the next day) as OpenAI's single flagship, positioned around Codex and computer use — a model built to take a messy, multi-part task and carry it across tools until it is finished. By June it is mature, widely deployed, and the default for a large share of engineering teams.
Claude Fable 5 shipped seven weeks later, on June 9, 2026, as the generally available half of a two-model launch. Its twin, Mythos 5, is the same underlying model with safeguards lifted, restricted to vetted partners. Fable 5 is the newer model on paper, and Anthropic's benchmarks reflect that — but newer is not automatically better for your use case, which is the entire point of running the comparison rather than reading the headline.
The honest framing: Fable 5 is the capability leader on the published numbers, and GPT-5.5 is the value-and-ecosystem play that has had about seven weeks to embed itself in real workflows. For the deeper Opus-versus-GPT history that informs this matchup, see our Claude Opus 4.8 vs GPT-5.5 comparison.
Newest flagship · capability leader
API ID claude-fable-5. Generally available with safeguards that route cyber/bio/distillation to Opus 4.8. $10/$50 per million tokens. Built for long-horizon agentic work: planning across stages, sub-agent delegation, self-validation at high effort.
Incumbent · value & ecosystem
API IDs gpt-5.5 and gpt-5.5-pro. $5/$30 (standard) and $30/$180 (Pro). 1M context in the API, 400K in Codex. Positioned around Codex, computer use, and token-efficient agentic coding. Mature and widely deployed by June.
Fable 5 costs up to double GPT-5.5 per token
$10/$50 vs $5/$30. For reference, Claude Opus 4.8 — Fable 5's own safeguard fallback — is $5/$25, undercutting both on output. Price is the single biggest lever in this comparison.
Cyber & bio gated on each side
Anthropic gates via the Fable/Mythos split + Glasswing. OpenAI rates cyber/bio High under its Preparedness Framework with stricter classifiers and Trusted Access for Cyber. The frontier-lab safety posture has converged.
02 - Benchmark Analysis
Head to head, with the harness caveats spelled out.
The table below is Anthropic's published comparison — the only source that benchmarks Fable 5 and GPT-5.5 under the same conditions. Read it as Anthropic-reported. The reassuring part: where these numbers overlap with OpenAI's own published table (SWE-Bench Pro 58.6%, OSWorld-Verified 78.7%, Humanity's Last Exam 41.4% / 52.2%, Terminal-Bench via Codex CLI 83.4%), the two labs agree. Starred (*) rows are where Anthropic shows the restricted Mythos 5 score — the Fable 5 you can deploy falls back toward Opus 4.8 there.
| Category | Benchmark | Claude Fable 5 | GPT-5.5 | Edge |
|---|---|---|---|---|
| Agentic coding | SWE-Bench Pro | 80.3% | 58.6% | Fable 5 (+21.7) |
| Agentic coding | FrontierCode (Diamond, xhigh) | 29.3% | 5.7% | Fable 5 (+23.6) |
| Agentic coding | Terminal-Bench 2.1 | 88.0%* | 83.4% | Fable 5; close on GPT-5.5's Codex CLI |
| Knowledge work | GDPval-AA (ELO) | 1932 | 1769 | Fable 5 (+163) |
| Knowledge work (vision) | GDP.pdf, no tools | 29.8% | 24.9% | Fable 5 (+4.9) |
| Computer use | OSWorld-Verified | 85.0% | 78.7% | Fable 5 (+6.3) |
| Tool use | AutomationBench | 17.4% | 12.9% | Fable 5 (+4.5) |
| Legal | Legal Agent Benchmark | 13.3% | 2.1% | Fable 5 (+11.2) |
| Multidisciplinary reasoning | Humanity's Last Exam, with tools | 64.5%* | 52.2% | Fable 5 (+12.3) |
| Cybersecurity | ExploitBench (Cap%) | 78.0%* | 34.0% | Mythos 5; Fable 5 ≈ Opus 4.8 (40%) |
| Health | HealthBench Professional | 66.0%* | 51.8% | Fable 5 (+14.2) |
Source: Anthropic Claude Fable 5 / Mythos 5 benchmark table, June 9, 2026. Starred (*) rows show the restricted Mythos 5 score; on those, the deployable Fable 5 performs closer to Claude Opus 4.8 because of its safeguard fallbacks. GPT-5.5's Terminal-Bench 2.1 figure is from its own Codex CLI harness and is not directly comparable to a public-harness run.
Chart: SWE-Bench Pro, Fable 5 vs GPT-5.5. Source: Anthropic Fable 5 benchmark table. Deltas are vs GPT-5.5. SWE-Bench Pro tests end-to-end GitHub issue resolution, one of Claude's widest leads over GPT-5.5.
Coding benchmarks are harness-sensitive, and this is where naive comparisons go wrong. GPT-5.5 leads Terminal-Bench 2.0 at 82.7% and reaches 83.4% on Terminal-Bench 2.1 through its own Codex CLI, a genuinely strong terminal-coding result that beats Opus 4.8's 82.7% on the same family. But SWE-Bench Pro, which measures end-to-end resolution of real GitHub issues, is where Fable 5's 80.3% pulls far ahead of GPT-5.5's 58.6%. The pattern every prior comparison has found holds here: GPT-5.5's Codex integration shines on terminal workflows; Claude leads on codebase-wide resolution. The only number that settles it for your team is the one you measure on your own pipeline. Source: Anthropic Fable 5 table and OpenAI's GPT-5.5 announcement.
03 - The Case for GPT-5.5
Where GPT-5.5 wins: price, Codex, and a few real benchmarks.
A capability table that Fable 5 sweeps can make GPT-5.5 look beaten. It is not. GPT-5.5 wins the arguments that decide most real deployments.
Price, decisively. At $5/$30 per million tokens against Fable 5's $10/$50, GPT-5.5 costs half as much on input and 40% less on output. For high-volume production traffic, that gap compounds into real money, and it means the capability difference has to actually matter for your task before the premium is worth paying.
Terminal coding and the Codex loop. Terminal-Bench 2.0 at 82.7% is GPT-5.5's headline, and its Codex CLI delivers 83.4% on Terminal-Bench 2.1. OpenAI reports more than 85% of its own staff use Codex weekly, and the model is tuned to finish the same Codex tasks with fewer tokens than GPT-5.4. If your team already lives in Codex, that integration is a real, compounding advantage.
Abstract reasoning and research math. OpenAI reports GPT-5.5 at 85.0% on ARC-AGI-2 and 35.4% on FrontierMath Tier 4 (39.6% for GPT-5.5 Pro) — strong results on the hardest reasoning and research-math evals, areas Anthropic's Fable 5 table does not directly cover.
Long-context retrieval, on its own benchmarks. GPT-5.5 posts 74.0% on OpenAI's MRCR v2 at 512K-1M tokens, a genuine long-context strength. The honest caveat is in the next section: on GraphWalks, Claude leads. The two labs win different long-context evals, so this is a strength for GPT-5.5, not a settled victory.
GPQA near-parity. On GPQA Diamond, GPT-5.5 scores 93.6% — within noise of the near-saturated frontier (Opus 4.7 94.2%, Gemini 3.1 Pro 94.3%). Anthropic did not include GPQA in the Fable 5 table, so there is no direct figure to beat, but GPT-5.5 is clearly competitive on graduate-level STEM reasoning. For the deeper cost modeling behind all of this, see our agentic coding cost analysis.
A benchmark table that one model sweeps does not settle a deployment decision. GPT-5.5 costs half as much, owns the Codex loop, and is competitive on reasoning and long-context retrieval. The premium for Fable 5 has to earn itself on your specific work.
Digital Applied analysis, June 9, 2026
04 - The Case for Fable 5
Where Fable 5 wins: the broad agentic lead.
Fable 5's advantages cluster around the hardest, longest, most multi-step work — exactly where the extra capability justifies the extra cost.
Codebase resolution. The 80.3% vs 58.6% SWE-Bench Pro gap is the headline. For end-to-end resolution of real GitHub issues — read the codebase, find the fix, carry it through related files, not just operate a terminal — Fable 5 is in a different tier. On the harder FrontierCode Diamond set the gap is starker still: 29.3% vs 5.7%.
Knowledge work and computer use. GDPval-AA 1932 vs 1769, OSWorld-Verified 85.0% vs 78.7%, AutomationBench 17.4% vs 12.9%, and a Legal Agent Benchmark gap of 13.3% vs 2.1%. For multi-stage analysis, document-heavy work, and agents that operate real software, Fable 5 leads consistently.
Long-context resolution, on GraphWalks. Where GPT-5.5 leads MRCR, Claude leads GraphWalks: at 1M tokens, Opus 4.8 scores 68.1% on GraphWalks BFS against GPT-5.5's 45.4% (Anthropic-reported). That is a 22.7-point gap: at those rates, the weaker model can miss roughly one in three of the answers the stronger one finds at the same context size (45.4 / 68.1 ≈ 0.67). If your retrieval pattern looks like graph traversal over a large corpus, that gap matters.
Long-horizon autonomy. Fable 5 is built to run for days, plan across stages, delegate to sub-agents, and validate its own work at high effort. That maps directly onto the kind of sustained agentic delivery — large migrations, multi-day build-and-test loops — where a higher per-token price is dwarfed by the senior hours it replaces. See the full capability picture in our Fable 5 release breakdown.
05 - Price & Cost Per Task
The real lever: cost per task, with the long-context fine print.
Capability decides whether a model can do the job; price decides whether you can afford to run it at scale. Here is the rate card for both, with Opus 4.8 included because it is Fable 5's own safeguard fallback and undercuts both on output.
| Model | Input / 1M | Output / 1M | Context | Notes |
|---|---|---|---|---|
| Claude Fable 5 | $10 | $50 | Not published | 90% input discount via prompt caching; US-only inference at 1.1x; no published long-context surcharge |
| GPT-5.5 | $5 | $30 | 1M (API) / 400K (Codex) | Batch & Flex at 50%; Priority at 2.5x; surcharge above 272K input (2x in / 1.5x out, whole session) |
| GPT-5.5 Pro | $30 | $180 | 1M (API) | Highest-accuracy tier; for hard math, deep research, high-stakes analysis |
| Claude Opus 4.8 | $5 | $25 | 1M | Reference: Fable 5's safeguard fallback; cheapest output of the four |
The per-task math. Take a representative 100K-input / 20K-output task. GPT-5.5: $0.50 input + $0.60 output = about $1.10. Fable 5: $1.00 input + $1.00 output = about $2.00 — roughly 80% more before caching. Prompt caching narrows the input side (Fable 5's 90% input discount on repeated context is strong), but the $50 vs $30 output rate keeps GPT-5.5 ahead on any output-heavy workload.
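The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch, not vendor tooling: the `Rate` class, the `RATES` dictionary, and the `task_cost` function are names invented here for illustration. The prices are copied from the rate card, and caching, batch discounts, and surcharges are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """List price in US dollars per one million tokens."""
    input_per_m: float
    output_per_m: float


# Prices copied from the rate card; the keys are display names, not API IDs.
RATES = {
    "Claude Fable 5": Rate(10.0, 50.0),
    "GPT-5.5": Rate(5.0, 30.0),
    "GPT-5.5 Pro": Rate(30.0, 180.0),
    "Claude Opus 4.8": Rate(5.0, 25.0),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in dollars of one task, before any discounts."""
    rate = RATES[model]
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # The representative 100K-input / 20K-output task from the text.
    for name in RATES:
        print(f"{name}: ${task_cost(name, 100_000, 20_000):.2f}")
```

Swapping in your own token counts is the quickest way to see whether your workload is input-heavy (where the 2x gap applies) or output-heavy (where the gap is $50 vs $30).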
Where the math flips. Above 272K input tokens in a session, GPT-5.5 applies a surcharge — 2x input and 1.5x output — to the entire session, not just the tokens over the line. Fable 5 has no published long-context surcharge. So for genuinely large-context work (whole-repository passes, long document sets), GPT-5.5's headline price advantage erodes and Fable 5 becomes more competitive than the rate card first suggests. Model your real token distribution — the crossover point is workload-specific. Our agentic coding cost breakdown walks through the methodology.
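To make the surcharge effect concrete, here is a rough sketch of both pricing rules. Several things are assumptions rather than published detail: the threshold is taken as exactly 272,000 input tokens, the surcharge applies only when input is strictly above it, Fable 5 caching is modeled as a flat 90% discount on whatever fraction of input is cached (cache-write costs are ignored), and no caching is modeled for GPT-5.5. The function and constant names are hypothetical.

```python
GPT55_IN, GPT55_OUT = 5.0, 30.0        # $/1M tokens, from the rate card
FABLE_IN, FABLE_OUT = 10.0, 50.0       # $/1M tokens, from the rate card
SURCHARGE_THRESHOLD = 272_000          # input tokens; exact cutoff is an assumption
SURCHARGE_IN, SURCHARGE_OUT = 2.0, 1.5  # multipliers applied to the whole session
CACHE_DISCOUNT = 0.90                  # Fable 5 discount on cached input tokens


def gpt55_session_cost(input_tokens: int, output_tokens: int) -> float:
    """GPT-5.5 list cost in dollars, with the whole-session long-context surcharge."""
    over = input_tokens > SURCHARGE_THRESHOLD
    in_mult = SURCHARGE_IN if over else 1.0
    out_mult = SURCHARGE_OUT if over else 1.0
    total = input_tokens * GPT55_IN * in_mult + output_tokens * GPT55_OUT * out_mult
    return total / 1_000_000


def fable5_session_cost(input_tokens: int, output_tokens: int,
                        cached_fraction: float = 0.0) -> float:
    """Fable 5 list cost in dollars; cached_fraction of input gets the 90% discount."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = fresh * FABLE_IN + cached * FABLE_IN * (1 - CACHE_DISCOUNT)
    return (input_cost + output_tokens * FABLE_OUT) / 1_000_000


if __name__ == "__main__":
    for tokens_in in (200_000, 300_000):
        gpt = gpt55_session_cost(tokens_in, 20_000)
        fable = fable5_session_cost(tokens_in, 20_000)
        fable_cached = fable5_session_cost(tokens_in, 20_000, cached_fraction=0.5)
        print(f"{tokens_in:>7} in: GPT-5.5 ${gpt:.2f} | Fable 5 ${fable:.2f} "
              f"| Fable 5, half cached ${fable_cached:.2f}")
```

Under these assumptions, a 300K-input / 20K-output session comes to about $3.90 on GPT-5.5 against $4.00 on Fable 5 uncached, and about $2.65 on Fable 5 with half the input cached. Past the threshold, GPT-5.5's effective rates are $10/$45, so the input price matches Fable 5's and only a $5 per million output difference remains.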
The two releases tell the same safety story from different angles. Anthropic split Fable 5 (safeguarded, general) from Mythos 5 (unlocked, restricted to Project Glasswing), routing cyber, biology, and distillation queries to Opus 4.8. OpenAI rates GPT-5.5's cyber and biology capabilities as High under its Preparedness Framework, ships deliberately stricter classifiers (which it admits some users will find annoying while they are tuned), and opens Trusted Access for Cyber so verified defenders get fewer refusals. Both labs concluded the same thing: gate the dangerous capabilities, and build a vetted path for legitimate high-capability work. If your use case sits near those domains, plan for refusals or fallbacks on both — and an application process on both. Sources: Anthropic Fable 5 announcement and OpenAI GPT-5.5 announcement.
06 - Decision Guide
Which to choose: route by the job, not by the headline.
There is no single winner, and standardizing on one model for everything leaves value on the table. The disciplined approach is to route by task shape — and, for anything high-stakes, to test both on your own pipeline before committing. Here is the practical map.
Hardest long-horizon work: Claude Fable 5
Large codebase migrations and end-to-end issue resolution (SWE-Bench Pro 80.3%), multi-day autonomous agentic sessions, complex multi-stage knowledge work, GraphWalks-style retrieval over large corpora, and legal/health analysis. When the capability gap changes the outcome, the 2x price is easy to justify.
Codex coding & high-volume work: GPT-5.5
Terminal-centric agentic coding inside Codex (Terminal-Bench 83.4% via Codex CLI), high-volume production traffic where $5/$30 vs $10/$50 compounds, abstract-reasoning and research-math tasks, and teams already invested in OpenAI tooling. The default value play.
Highest-accuracy deep work: GPT-5.5 Pro
Reserve the $30/$180 tier for the hardest math, deep multi-pass research, and high-stakes business, legal, or data-science analysis where accuracy is the only variable. BrowseComp jumps to 90.1% and FrontierMath Tier 4 to 39.6% at this tier.
Route by task shape: a mix of models
The mature answer for most teams: GPT-5.5 (or Opus 4.8 at $5/$25) as the cost-efficient default, Fable 5 for the hardest long-horizon work, GPT-5.5 Pro for accuracy-critical deep dives. Measure quality and token spend on a sample of your real tasks, then wire the routing.
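The routing map above could be wired up along these lines. Treat it as a sketch under stated assumptions: the `Task` fields and the `route` function are hypothetical, the two boolean flags are a crude stand-in for whatever task classification you actually use, and the precedence (long-horizon before accuracy-critical) is an arbitrary choice made here, not a recommendation from either lab. The model IDs are the ones quoted earlier in this post.

```python
from dataclasses import dataclass

# API IDs as quoted in this post.
FABLE_5 = "claude-fable-5"
GPT_55 = "gpt-5.5"
GPT_55_PRO = "gpt-5.5-pro"


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; these fields belong to no vendor API."""
    long_horizon: bool = False       # multi-day or multi-stage agentic work
    accuracy_critical: bool = False  # hard math, deep research, high-stakes analysis


def route(task: Task) -> str:
    """Pick a model ID by task shape, falling back to the cost-efficient default."""
    if task.long_horizon:
        return FABLE_5
    if task.accuracy_critical:
        return GPT_55_PRO
    return GPT_55


if __name__ == "__main__":
    print(route(Task()))
    print(route(Task(long_horizon=True)))
    print(route(Task(accuracy_critical=True)))
```

In practice the flags would come from measured results on your own sample tasks, and the thresholds for flipping a task class to a pricier model should be set from observed quality and token spend rather than from published benchmarks.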
Fable 5 is the more capable model; GPT-5.5 is the better-value one. The right answer is the shape of your work.
On the only benchmark table that pits them directly, Claude Fable 5 beats GPT-5.5 on every row — and where it overlaps with OpenAI's own numbers, the two labs agree. For the hardest codebase resolution, the longest agentic sessions, and multi-stage knowledge work, Fable 5 is the stronger tool, and its capability lead is real rather than a benchmark artifact.
But GPT-5.5 is roughly half the price, owns the Codex coding loop, holds its own on abstract reasoning and parts of long-context retrieval, and has spent about seven weeks embedding itself in real workflows. For high-volume production work and terminal-centric coding, it is the sharper value, and Opus 4.8 at $5/$25 sits underneath both as the cheapest broadly-capable option.
Two things carry across both models. Coding leadership is harness-dependent, so the only number that decides it is the one you measure on your own pipeline. And both labs now gate cybersecurity and biology behind vetted access — a convergence that tells you the frontier has moved past raw capability into who is allowed to use it. Route by task, test before you commit, and treat the asterisks as the line between the demo and what you will ship.