GPT-5.5 Complete Guide: Thinking, Pro & 1M Context
OpenAI's GPT-5.5 ships April 23, 2026 with 1M context, Thinking and Pro variants, 82.7% Terminal-Bench, and same latency as GPT-5.4. Pricing inside.
Terminal-Bench 2.0 (SOTA): 82.7%
API Context Tokens: 1M
OSWorld-Verified: 78.7%
GDPval (wins or ties): 84.9%
Key Takeaways
OpenAI released GPT-5.5 on April 23, 2026, the next default frontier model in ChatGPT and Codex and the first OpenAI model to ship with a 1M-token API context window. GPT-5.5 leads agentic-coding benchmarks at 82.7% on Terminal-Bench 2.0, hits 73.1% on the internal Expert-SWE long-horizon benchmark, and reaches 84.9% on GDPval — all while matching GPT-5.4 per-token latency in real-world serving and using significantly fewer tokens to complete the same Codex tasks.
Alongside the standard model, OpenAI shipped GPT-5.5 Pro for the hardest research, math, and retrieval work, and said API availability on the Responses and Chat Completions endpoints will follow shortly. The release also moves cybersecurity into the model's High risk tier under OpenAI's Preparedness Framework, with new safeguards and an expanded Trusted Access for Cyber program. For teams that already deployed GPT-5.4, the migration story is straightforward — same API surface, lower token spend per task, and a meaningful jump in agentic capability.
Release snapshot: GPT-5.5 launched April 23, 2026 in ChatGPT (Plus, Pro, Business, Enterprise) and Codex. API model IDs: gpt-5.5 and gpt-5.5-pro, with rollout to the Responses and Chat Completions endpoints coming shortly. Based on the official OpenAI announcement.
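Once the Responses endpoint rollout lands, a call against the new model IDs might look like the sketch below. The `gpt-5.5` and `gpt-5.5-pro` IDs come from the announcement; the request shape mirrors the existing Responses API and could differ at general availability, so treat this as an assumption, not documentation.

```python
# Sketch of a Responses API request payload for GPT-5.5.
# Model IDs (gpt-5.5, gpt-5.5-pro) are from the announcement; the payload
# shape follows the current Responses API and may change at API launch.

def build_responses_request(prompt: str, pro: bool = False) -> dict:
    """Assemble a request payload for the /v1/responses endpoint."""
    return {
        "model": "gpt-5.5-pro" if pro else "gpt-5.5",
        "input": prompt,
        # Cap output spend — useful on long agentic tasks.
        "max_output_tokens": 4096,
    }

payload = build_responses_request("Summarize the open PRs in this repo.")
print(payload["model"])  # gpt-5.5
```

The payload would then be sent with any HTTP client or the official SDK once the endpoint is live.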
GPT-5.5 Release Overview
GPT-5.5 is positioned as a step change in agentic capability rather than a pure benchmark refresh. OpenAI's framing is consistent across the launch materials: the model understands user intent faster, uses tools more efficiently, and stays coherent across long multi-step tasks — coding, browsing, computer operation, document and spreadsheet work, and early scientific research. On Artificial Analysis's Coding Agent Index, OpenAI reports GPT-5.5 delivering state-of-the-art intelligence at roughly half the cost of competing frontier coding models on a token-spend basis.
Per-token latency parity: Larger, more capable models are typically slower to serve. GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while operating at a higher level of intelligence — a result of co-design with NVIDIA GB200/GB300 NVL72 systems and inference improvements landed with help from Codex itself.
The release lands one month after the GPT-5.4 family rollout covered in our GPT-5.4 complete guide, and continues the cadence of frequent frontier-model updates documented in our twelve-models-in-a-week analysis. GPT-5.5 is also available immediately in Codex with a 400K-token window across Plus, Pro, Business, Enterprise, Edu, and Go plans, and a Fast mode that generates tokens 1.5x faster at 2.5x the cost.
Variants: Thinking, Pro, and Fast Mode
GPT-5.5 ships in two API SKUs and a few different surface configurations. In ChatGPT, the standard model is exposed as GPT-5.5 Thinking for Plus, Pro, Business, and Enterprise users — the general-purpose tier optimized for everyday work that benefits from reasoning. GPT-5.5 Pro is reserved for Pro, Business, and Enterprise users in ChatGPT and is targeted at the hardest questions and highest-accuracy outputs. In Codex, GPT-5.5 is available across all paid plans with a 400K context window, with an optional Fast mode for users who want lower latency at higher cost.
GPT-5.5 (Thinking)
$5 / $30 per 1M
Default frontier model. Faster, more concise answers than GPT-5.4 with state-of-the-art agentic coding and computer use. The right default for most production workloads.
GPT-5.5 Pro
$30 / $180 per 1M
Maximum-accuracy variant. Leads BrowseComp at 90.1% and FrontierMath Tier 4 at 39.6%. Best when the cost of a wrong answer dwarfs the cost of the call.
Fast mode
1.5× speed · 2.5× cost
Available in Codex for tight feedback loops in interactive coding sessions where latency dominates the developer experience and per-task cost is bounded.
For teams already running GPT-5.4 or earlier, the practical default is to swap to standard GPT-5.5 and reserve Pro for specific high-stakes pipelines — research synthesis, deep BrowseComp-style retrieval, multi-step math, or complex legal/financial reasoning. Pro shows clear gains on those evals: 90.1% BrowseComp (vs 84.4% standard), 52.4% FrontierMath Tier 1–3 (vs 51.7%), and 39.6% FrontierMath Tier 4 (vs 35.4%). For everything else, the standard model already leads on agentic coding, GDPval, OSWorld, and CyberGym.
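That split — standard GPT-5.5 by default, Pro only where the eval gap justifies the 6× price premium — reduces to a one-function routing rule. The task categories below are our own illustrative labels, not an OpenAI taxonomy.

```python
# Illustrative standard-vs-Pro router. Category names are placeholders;
# map them to whatever task labels your pipeline already emits.

PRO_CATEGORIES = {
    "deep_research",       # BrowseComp-style multi-source retrieval
    "frontier_math",       # multi-step math where Pro leads Tier 4
    "legal_reasoning",
    "financial_reasoning",
}

def pick_model(task_category: str) -> str:
    """Route high-stakes categories to Pro, everything else to standard."""
    return "gpt-5.5-pro" if task_category in PRO_CATEGORIES else "gpt-5.5"

print(pick_model("agentic_coding"))  # gpt-5.5
print(pick_model("frontier_math"))   # gpt-5.5-pro
```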
Agentic Coding: 82.7% Terminal-Bench Lead
Agentic coding is where GPT-5.5 separates most clearly from prior generations and from competitors. On Terminal-Bench 2.0 — which tests complex command-line workflows requiring planning, iteration, and tool coordination — GPT-5.5 hits 82.7%, well ahead of GPT-5.4 at 75.1%, Claude Opus 4.7 at 69.4%, and Gemini 3.1 Pro at 68.5%. On the internal Expert-SWE eval, which targets long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT-5.5 reaches 73.1% versus 68.5% for GPT-5.4. Across all three coding evals OpenAI publishes, GPT-5.5 improves on GPT-5.4 while using fewer tokens.
GPT-5.5 leads agentic coding; Opus 4.7 retains SWE-Bench Pro
| Benchmark | GPT-5.5 | GPT-5.4 | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | 69.4% | 68.5% |
| Expert-SWE (internal) | 73.1% | 68.5% | — | — |
| SWE-Bench Pro | 58.6% | 57.7% | 64.3%* | 54.2% |
*Anthropic reported signs of memorization on a subset of SWE-Bench Pro problems for Claude Opus 4.7.
OpenAI's leadership framed the agentic-coding gains as the headline story on the launch call.
"It's definitely our strongest model yet on coding, both measured by benchmarks and based on the feedback we've gotten from trusted partners, as well as our own experience."
Amelia Glaese·VP of Research, OpenAI
"It's way more intuitive to use. It can look at an unclear problem and figure out what needs to happen next."
Greg Brockman·President, OpenAI
The qualitative reports from early testers are consistent with the numbers. Dan Shipper of Every called GPT-5.5 "the first coding model I've used that has serious conceptual clarity," after using it to reproduce the kind of architectural rewrite a senior engineer had previously needed days to land. Pietro Schirano of MagicPath described GPT-5.5 merging a branch with hundreds of frontend and refactor changes into a substantially changed main branch in one shot in about 20 minutes. Senior engineers who tested the model said GPT-5.5 was noticeably stronger than GPT-5.4 and Claude Opus 4.7 at reasoning, autonomy, and predicting testing and review needs without explicit prompting.
For agencies and product teams comparing options, the broader agentic-coding picture is well covered in our Claude Opus 4.7 vs GPT-5.4 agentic coding analysis — the GPT-5.5 numbers extend that gap further on Terminal-Bench and Expert-SWE while Anthropic's SWE-Bench Pro lead remains the main counterpoint, with the memorization caveat noted in Anthropic's own release.
Computer Use and Knowledge Work
GPT-5.5 extends its lead beyond pure coding into the broader knowledge-work loop: finding information, understanding what matters, using tools, checking outputs, and turning raw material into something useful. On OSWorld-Verified, the standard evaluation for computer-use agents, GPT-5.5 reaches 78.7% — up from 75.0% on GPT-5.4 and ahead of Claude Opus 4.7 at 78.0%. On GDPval, which measures agents producing well-specified knowledge work across 44 occupations, GPT-5.5 hits 84.9% (vs 83.0% GPT-5.4, 80.3% Opus 4.7, 67.3% Gemini 3.1 Pro).
BrowseComp · Pro 90.1%
Browse and retrieve
Web research, multi-source synthesis, and citation chains with full tool support across Search, URL Context, Code Execution, and File Search.
OSWorld-Verified 78.7%
Operate software
Native ability to see the screen, click, type, and navigate browser and desktop interfaces. Brings reliable computer use into production-viable territory for many internal workflows.
Tau2-bench Telecom 98.0% · Toolathlon 55.6%
Tool use at scale
Achieved without prompt tuning: the model understands task intent better and is meaningfully more token-efficient than its predecessors on customer-service and tool-orchestration workflows.
Domain-specific knowledge work shows the same pattern. On OfficeQA Pro — Databricks' benchmark for office-software tasks — GPT-5.5 scores 54.1% versus 43.6% for Opus 4.7 and 18.1% for Gemini 3.1 Pro. On OpenAI's internal Investment Banking Modeling Tasks eval, GPT-5.5 hits 88.5%. FinanceAgent v1.1 comes in at 60.0%. These are the workloads where agentic AI actually displaces an analyst hour rather than just summarizing a report — the kind of work agencies and consulting teams are now operationalizing through Codex-driven internal tools.
OpenAI shared concrete internal usage examples that frame the knowledge-work upside. Their Comms team built an automated Slack agent for low-risk speaking-request triage after using GPT-5.5 in Codex to analyze six months of historical data and design a scoring framework. Finance used Codex to review 24,771 K-1 tax forms (71,637 pages) — excluding personal information from the workflow — accelerating the task by two weeks compared to the prior year. A Go-to-Market employee automated weekly business reports, saving 5–10 hours per week. OpenAI states that more than 85% of the company uses Codex weekly across software engineering, finance, comms, marketing, data science, and product.
The 1M-token context window is what makes many of these workflows tractable. On long-context retrieval evals, GPT-5.5 jumps to 74.0% on OpenAI MRCR v2 8-needle 512K-1M (up from 36.6% on GPT-5.4 and 32.2% on Claude Opus 4.7), 81.5% at 256K-512K, and 87.5% at 128K-256K. For agencies running AI transformation programs, the implication is direct: full-codebase analysis, entire-policy-corpus reasoning, and multi-document research start to behave like normal model calls rather than exotic capabilities that need careful chunking strategies.
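Before abandoning a chunking strategy, it is worth sanity-checking whether a corpus actually fits the 1M-token window. The sketch below uses the common ~4-characters-per-token heuristic as a rough assumption — for production budgeting, count with the real tokenizer.

```python
# Rough context-budget check against the 1M-token window.
# The chars/4 estimate is a heuristic, not a tokenizer; real token counts
# vary by language and content.

CONTEXT_LIMIT = 1_000_000  # GPT-5.5 API context window (tokens)

def fits_in_context(documents: list[str], reserve_for_output: int = 50_000) -> bool:
    """Return True if the estimated prompt tokens plus an output reserve fit."""
    est_tokens = sum(len(d) for d in documents) // 4
    return est_tokens + reserve_for_output <= CONTEXT_LIMIT

corpus = ["x" * 400_000, "y" * 400_000]  # ~200K estimated tokens
print(fits_in_context(corpus))           # True
```

Corpora that fail this check still need chunking or retrieval; those that pass can be sent as a single long-context call.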
Scientific Research Capabilities
Scientific research is the most surprising area of progress in the GPT-5.5 release. The model shows clear gains on multi-stage data analysis workflows that look more like multi-day research projects than standalone Q&A. On GeneBench — a new evaluation of multi-stage scientific data analysis in genetics and quantitative biology — GPT-5.5 scores 25.0% (Pro: 33.2%) versus 19.0% on GPT-5.4. On BixBench, designed around real-world bioinformatics and data analysis, GPT-5.5 reaches 80.5% (vs 74.0%), the leading published score among major frontier models.
Where GPT-5.5 sits across the hardest evals
- FrontierMath Tier 1–3: 51.7% · Pro 52.4% · GPT-5.4 47.6%
- FrontierMath Tier 4: 35.4% · Pro 39.6% · GPT-5.4 27.1%
- GPQA Diamond: 93.6%
- Humanity's Last Exam (with tools): 52.2% · Pro 57.2%
- ARC-AGI-1: 95.0%
- ARC-AGI-2: 85.0% · GPT-5.4 73.3% · Opus 4.7 75.8%
OpenAI also disclosed that an internal version of GPT-5.5 with a custom harness contributed to a new asymptotic proof about off-diagonal Ramsey numbers — a longstanding combinatorics result later verified in Lean. Outside the lab, Bartosz Naskręcki, an assistant professor of mathematics at Adam Mickiewicz University, used GPT-5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model. Derya Unutmaz of the Jackson Laboratory used GPT-5.5 Pro to analyze a 62-sample, ~28,000-gene expression dataset and produce a research report he said would have taken his team months.
Brandon White, co-founder of Axiom Bio, summarized the shift: "It's incredibly energizing to use OpenAI's new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals." For agencies and operators in regulated, knowledge-dense industries, the implication is that GPT-5.5 Pro becomes a credible co-analyst on structured technical work — not just a copywriter or chatbot.
Inference Efficiency: NVIDIA GB200/GB300 Co-Design
The headline production fact about GPT-5.5 is that it serves at GPT-5.4 per-token latency despite being a more capable model. That isn't an accident of compiler tuning — it's the result of co-designing the model for, training it with, and serving it on NVIDIA GB200 and GB300 NVL72 systems. OpenAI describes inference here as an integrated system rather than a set of point optimizations, and explicitly credits Codex and GPT-5.5 itself with helping land a number of the key improvements in the serving stack.
Token generation: +20%
Codex-authored serving heuristics
Before GPT-5.5, OpenAI split requests on an accelerator into a fixed number of chunks so big and small requests could share the same GPU. A static split isn't optimal for all traffic shapes — so Codex analyzed weeks of production traffic and wrote custom partitioning heuristics, lifting token-generation speed by over 20%.
The reflexive loop is the more interesting story: GPT-5.5 helped optimize the infrastructure that serves it.
From an integration standpoint, this matters for cost modeling. GPT-5.5 at $5/$30 per 1M tokens is more expensive than GPT-5.4 on paper, but the company reports that in Codex it has carefully tuned the experience so GPT-5.5 delivers better results with fewer tokens than GPT-5.4 for most users — meaning real per-task spend often falls. For teams building on top of the API, the practical advice is to A/B test on representative tasks rather than extrapolate from per-token list price.
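The A/B comparison reduces to simple arithmetic once you have measured token counts per task. The list prices below are from this release; the token counts are placeholders for your own measurements, not claims about typical usage.

```python
# Per-task spend comparison. Prices are published list rates; token counts
# per task are placeholders — measure them on representative workloads.

PRICES = {  # (input, output) in USD per 1M tokens
    "gpt-5.5":     (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at list price."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Hypothetical task: 80K input tokens, 12K output tokens.
print(round(task_cost("gpt-5.5", 80_000, 12_000), 2))  # 0.76
```

If GPT-5.5 completes the same task in fewer tokens than a cheaper predecessor, the per-task figure — not the per-token price — decides which model is actually cheaper.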
Cybersecurity Capabilities and Safeguards
OpenAI is treating GPT-5.5's biological/chemical and cybersecurity capabilities as High under its Preparedness Framework. While GPT-5.5 didn't reach the framework's Critical level for cyber, the evaluation results show meaningful step-ups: 81.8% on CyberGym (vs 79.0% on GPT-5.4 and 73.1% on Claude Opus 4.7) and 88.1% on the internal expanded Capture-the-Flag challenge tasks (vs 83.7%). The company shipped tighter classifiers around higher-risk activity, sensitive cyber requests, and protections against repeated misuse, with the explicit acknowledgement that some users will find the stricter posture noticeable as it's tuned over time.
Trusted Access for Cyber: Verified defenders meeting trust signals can access cyber-permissive capabilities through Codex with fewer restrictions for legitimate defensive work. Organizations defending critical infrastructure can apply to access cyber-permissive models like GPT-5.4-Cyber under stricter security requirements. The aim is to democratize defensive capability while keeping the most dual-use workflows behind verification.
For agencies and platforms building security products, the architectural takeaway is that GPT-5.5 is now genuinely useful for triage, vulnerability scanning, fix-suggestion, and SOC workflows — but expect more refusals on edge-case prompts that resemble offensive testing, especially in the early weeks as classifiers are tuned. Teams with legitimate defensive use cases should evaluate Trusted Access via Codex rather than fighting against standard-tier guardrails.
Pricing, 1M Context, and Availability
GPT-5.5's API pricing positions it as the premium standard frontier tier rather than a cost-leader. At $5 per 1M input tokens and $30 per 1M output tokens, it's 2x the input price of GPT-5.4 Standard and 2x the output price. Pro at $30/$180 per 1M tokens sits at the same headline rate as GPT-5.4 Pro. Both ship with a 1M-token context window, Batch and Flex pricing at half the standard rate, and Priority processing at 2.5x.
Standard, Pro, and Codex tiers · April 23, 2026
| Model | Input / Output (per 1M) | Context | Best fit |
|---|---|---|---|
| gpt-5.5 | $5.00 / $30.00 | 1M | Default for agentic coding, computer use, knowledge work |
| gpt-5.5-pro | $30.00 / $180.00 | 1M | Deep research, math, BrowseComp-style retrieval |
| Codex (GPT-5.5) | Subscription tiers | 400K | Interactive coding; Fast mode 1.5× speed at 2.5× cost |
Today, GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. GPT-5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. API access on the Responses and Chat Completions endpoints is coming shortly — OpenAI cited additional safety and security work needed before serving the model at API scale, especially for partners integrating it into agent platforms. For Codex specifically, the new model is available across Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K-token context window.
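The service-tier multipliers stated above (Batch and Flex at half the standard rate, Priority at 2.5×) compose with list price as below. The tier names follow this article's wording; the exact API parameter names are not confirmed here.

```python
# Effective per-1M-token rate under the stated service-tier multipliers:
# Batch/Flex at 0.5x list, Priority at 2.5x. Tier names per the article.

TIER_MULTIPLIER = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.5}

def effective_rate(list_price: float, tier: str) -> float:
    """Per-1M-token price after applying the tier multiplier."""
    return list_price * TIER_MULTIPLIER[tier]

print(effective_rate(30.00, "batch"))    # 15.0 — gpt-5.5 output via Batch
print(effective_rate(5.00, "priority"))  # 12.5 — gpt-5.5 input via Priority
```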
For developers already integrated with Codex, the recommended starting point is documented in our Codex for almost everything release guide — same surface, lower token spend per task with GPT-5.5, and the option to flip Fast mode on for tight feedback loops.
Choosing GPT-5.5 vs. Alternatives
The frontier-model choice is increasingly task-shaped rather than vendor-shaped. GPT-5.5 leads on agentic coding, computer use, and cybersecurity; Claude Opus 4.7 remains strong on SWE-Bench Pro and certain autonomy-heavy refactors; Gemini 3.1 Pro leads on raw ARC-AGI-1 and competes hard on price for large-context workloads. Our broader frontier-model comparison is documented in the GPT-5.4 vs Opus 4.6 vs Gemini 3.1 Pro analysis — most of those decisions hold, with GPT-5.5 strengthening OpenAI's position on the agentic and computer-use axes.
Agentic coding and Codex
Where GPT-5.5 belongs by default.
State-of-the-art on Terminal-Bench + Expert-SWE
82.7% Terminal-Bench 2.0 (vs 75.1% on GPT-5.4), 73.1% Expert-SWE on 20-hour median tasks.
Lower per-task token spend than GPT-5.4
Same per-token latency, fewer tokens to complete the same Codex tasks.
Codex Fast mode available
1.5× token speed at 2.5× cost — for tight interactive feedback loops.
Deep research and hard math
Where Pro earns the 6× price premium.
SOTA BrowseComp at 90.1%
Deepest research-grade retrieval published among generally available frontier models.
Hardest math tier
39.6% FrontierMath Tier 4 (vs 22.9% for Opus 4.7, 16.7% for Gemini 3.1 Pro).
Regulated-domain decisions
57.2% HLE with tools, 33.2% GeneBench. Use when error cost ≫ call cost.
Computer use automation
Where the operating-system loop becomes shippable.
78.7% OSWorld-Verified
Native browser + desktop operation. Strongest tool-orchestration scores OpenAI has published.
Tau2-bench Telecom 98.0% (no prompt tuning)
Customer-service workflows working out of the box.
Sandbox + human-in-the-loop required
Production rollout still needs guardrails — model capability outpaced ops practice in early 2026.
Multi-vendor routing
The pattern most production stacks land on.
- Default: GPT-5.5 — agentic coding, computer use, long-context retrieval.
- Refactor: Opus 4.7 — second opinion on SWE-Bench-style and MCP-heavy work.
- Bulk: Gemini 3.1 Pro — cost-sensitive long-context retrieval.
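The default/refactor/bulk split above amounts to a small routing table. The workload labels and the vendor model IDs here are illustrative — substitute the identifiers your providers actually expose.

```python
# Minimal multi-vendor routing table following the default/refactor/bulk
# pattern. Workload labels and model IDs are illustrative placeholders.

ROUTES = {
    "agentic_coding":       "gpt-5.5",
    "computer_use":         "gpt-5.5",
    "long_context":         "gpt-5.5",
    "refactor_second_pass": "claude-opus-4.7",
    "bulk_retrieval":       "gemini-3.1-pro",
}

def route(workload: str, default: str = "gpt-5.5") -> str:
    """Return the model for a workload, falling back to the default."""
    return ROUTES.get(workload, default)

print(route("refactor_second_pass"))  # claude-opus-4.7
print(route("unknown_task"))          # gpt-5.5
```

Keeping this table explicit — rather than hard-coding model IDs at call sites — is what makes the next frontier release a one-line change.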
For teams currently on GPT-5.4, the migration is straightforward — same API contract, same Codex surface, lower per-task token spend on most workflows, and a meaningful jump in agentic capability. For teams primarily on Claude or Gemini, the question is whether GPT-5.5's lead on Terminal-Bench, Expert-SWE, GDPval, and OSWorld translates to lift on your specific evals — the answer is usually yes for agentic coding and computer use, often more nuanced for code generation and long-context retrieval where individual model strengths and price points still matter.
Conclusion
GPT-5.5 is the most consequential frontier-model release of the quarter. State-of-the-art agentic-coding scores, a 1M-token context window with strong long-context retrieval, native computer use that competes with the best published numbers, and per-token latency parity with GPT-5.4 add up to a model that materially changes what production agent systems can do — without changing their cost shape much for most workflows. Codex itself helped land the inference improvements that make this possible, which is increasingly the pattern: frontier models built and served with help from the previous generation of frontier models.
For most teams, the practical move is simple: standardize on GPT-5.5 for agentic coding, computer use, and knowledge-work agents, reserve GPT-5.5 Pro for deep research and the hardest evaluation-grade tasks, and keep a multi-model router in place to cover edge cases where Claude or Gemini still win on a specific metric. The cybersecurity posture change — High under Preparedness, stricter classifiers, Trusted Access for Cyber — is worth flagging to security teams now so they can route legitimate defensive use through the right channel rather than fight standard-tier guardrails. For an extended look at OpenAI's recent direction, our GPT-5.4 complete guide and Claude Opus 4.7 complete guide give the surrounding context.
One forward-looking note worth taking seriously came from OpenAI chief scientist Jakub Pachocki on the launch call.
"We actually still have headroom to train significantly smarter models than this."
Jakub Pachocki·Chief Scientist, OpenAI
For agencies and platforms architecting agentic stacks today, that's the planning signal. The right move is not to wait — it's to ship the routing layer, the evals, and the production observability now, so that when the next jump lands, the only thing that changes is which model the router picks.
Ready to Deploy Frontier AI in Production?
Choosing the right frontier model — and routing the right tasks to it — is now an architecture decision with measurable cost and capability impact. Our team helps businesses evaluate, integrate, and operate frontier models for agentic coding, computer use, and knowledge-work automation.
Frequently Asked Questions
Related Guides
Continue exploring frontier AI releases and agentic coding.