The pendulum swing
For most of the last two quarters Anthropic owned the coding frontier. Two weeks ago the new Codex landed. Eight days ago GPT 5.5 followed. We spent the week working on both, and the verdict in our hands is unambiguous. OpenAI has the throne back.
What OpenAI shipped
Two releases, two weeks, one direction. On April 16, the same day Anthropic shipped Opus 4.7, OpenAI rolled out the “Codex for almost everything” update. Computer use on macOS. An in-app browser. gpt-image-2 native inside the desktop app. Memory preview. Ninety-odd plugins covering Jira, Linear, Notion, Slack, Salesforce, Microsoft 365, GitHub, and the rest of the work software you actually live in. Two weeks earlier, on April 2, they had killed per-message billing and gone to per-token.
Then on April 23, GPT 5.5 went live. ChatGPT and Codex on day one, the API the next morning. The pitch was specific. Better at coding, better at using a computer, dramatically more efficient at working through problems. Same bill, more work done.
The narrative on launch day was that this was OpenAI playing catch-up after a quarter of getting outshipped by Claude Code. It is not catch-up. The numbers, the tooling, and the workflow shift on our team this week all point the same way. The lead has changed hands.
Where the numbers actually moved
The benchmark table tells one story if you only look at SWE-bench, and a very different story if you look at the work agents actually do.
GPT 5.5 vs Opus 4.7
Terminal-Bench 2.0 (long horizon shell work)
GPT 5.5 █████████████████ 82.7% <-- SOTA
Mythos Preview █████████████████ 82.0%
Opus 4.7 ██████████████ 69.4%
OSWorld-Verified (computer use)
GPT 5.5 █████████████████ 78.7%
Opus 4.7 █████████████████ 78.0%
CyberGym (offensive security agents)
GPT 5.5 █████████████████ 81.8%
Opus 4.7 ███████████████ 73.1%
MRCR v2 (long context retrieval)
GPT 5.5 ███████████████ 74.0%
Opus 4.7 (no clean number)
SWE-bench Pro
Opus 4.7 █████████████ 64.3%
GPT 5.5 ███████████ 58.6%
Terminal-Bench 2.0 is the headline. 82.7 percent is state of the art and it edges out the unreleased Mythos preview. More importantly it is more than thirteen points clear of Opus 4.7 on the benchmark that most directly maps to what an autonomous coding agent has to do all day. Plan, run a tool, recover from failure, stay coherent across hundreds of turns. This is the fight that matters and it is not close.
CyberGym at 81.8 versus 73.1 says the same thing in a different domain. OSWorld at 78.7 versus 78.0 says it for browser and GUI work, if only by a nose. MRCR at 74 says it for long context, where Opus 4.7 has no clean number to set against it. Four agentic benchmarks, four wins.
Anthropic’s remaining wins are on SWE-bench Pro and SWE-bench Verified, both of which Anthropic itself flagged for memorization concerns in the 4.7 release notes. The cleanest reads on real agentic work, the ones nobody has had time to overfit on yet, all favour GPT 5.5. That is the actual data.
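For readers who want the gaps as numbers rather than bars, here is a minimal sketch that computes the point difference on each benchmark. The scores are copied from the chart above; the names `SCORES`, `gap`, and `GAPS` are ours, introduced only for illustration.

```python
# Scores in percent, copied from the comparison chart: (GPT 5.5, Opus 4.7).
# None marks the benchmark where no clean Opus 4.7 number is reported.
SCORES = {
    "Terminal-Bench 2.0": (82.7, 69.4),
    "OSWorld-Verified": (78.7, 78.0),
    "CyberGym": (81.8, 73.1),
    "MRCR v2": (74.0, None),
    "SWE-bench Pro": (58.6, 64.3),
}


def gap(gpt, opus):
    """Point gap (GPT 5.5 minus Opus 4.7), or None when not comparable."""
    if gpt is None or opus is None:
        return None
    return round(gpt - opus, 1)


GAPS = {name: gap(*pair) for name, pair in SCORES.items()}

for name, delta in GAPS.items():
    label = "n/a" if delta is None else f"{delta:+.1f} pts"
    print(f"{name:<20} {label}")
```

A positive gap favours GPT 5.5 and a negative one favours Opus 4.7. The sketch only restates the published scores; it says nothing about how each benchmark was run.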
The 72 percent number
The most important number from the release is not on any leaderboard. GPT 5.5 ships roughly 72 percent fewer output tokens than Opus 4.7 on equivalent work. Same task, same outcome, roughly a third of the output bill even at the higher per-token price.
Stack that against the tokenizer tax that Opus 4.7 quietly introduced ten days ago, where the new tokenizer made the same prompt cost thirty to forty percent more on input alone, and the gap on a real agent loop is enormous. A conservative comparison on our own workloads this week had Opus 4.7 costing roughly two and a half to three times as much on tasks where GPT 5.5 produced equivalent or better output.
Opus 4.7’s output is priced at 25 dollars a million, GPT 5.5 at 30. On paper Anthropic looks cheaper. In practice, when you multiply by tokens actually consumed, GPT 5.5 is the cheaper model by a wide margin on agentic work. This is the part that has stopped being theoretical and started showing up on bills.
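The multiplication is worth doing by hand once. The sketch below uses only the two output prices and the 72 percent figure quoted above. The one-million-token workload is a hypothetical round number, and input tokens (including the tokenizer change) are deliberately left out because no input prices are given here.

```python
# Output-token list prices in dollars per million, as quoted above.
OPUS_OUTPUT_PRICE = 25.0
GPT_OUTPUT_PRICE = 30.0

# The claim under test: GPT 5.5 emits roughly 72 percent fewer output tokens.
TOKEN_REDUCTION = 0.72


def output_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` output tokens at a per-million price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: a batch of agent runs where Opus 4.7 emits 1M tokens.
opus_tokens = 1_000_000
gpt_tokens = round(opus_tokens * (1 - TOKEN_REDUCTION))

opus_cost = output_cost(opus_tokens, OPUS_OUTPUT_PRICE)
gpt_cost = output_cost(gpt_tokens, GPT_OUTPUT_PRICE)
ratio = gpt_cost / opus_cost

print(f"Opus 4.7 output cost: ${opus_cost:.2f}")
print(f"GPT 5.5 output cost:  ${gpt_cost:.2f}")
print(f"GPT 5.5 / Opus 4.7:   {ratio:.1%}")
```

On output alone, the higher list price eats a little of the token saving, so the ratio lands at about a third rather than at 28 percent. Swap in token counts from your own logs before trusting it for a budget.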
Codex stopped being a CLI
GPT 5.5 is the engine. Codex is what makes it dangerous. The April 16 update was the first version of Codex that does not feel like a research artefact, and the gap to where Claude Code was sitting closed in a single release.
The pieces that genuinely shifted our workflow:
- Computer use, on macOS. The agent can drive native applications, exercise simulator flows, and run GUI heavy QA paths that previously needed a human. Until last month this category did not have a serious tool. It does now.
- In-app browser. The agent has its own browser context inside the desktop app. No more handing it links and praying it scraped the right thing. It just opens the page and reads.
- Native image generation. gpt-image-2 inside Codex, no separate API key, no separate bill. For anyone whose work touches design at all, this collapses an entire round trip.
- Ninety-plus plugins on day one. Jira, Confluence, Microsoft 365, Notion, Slack, HubSpot, Salesforce, Google Workspace, GitHub, Linear, Zendesk. The integration story OpenAI used to lose on is now the integration story they are winning on, decisively.
- Memory preview. Persistent per-project memory the agent can pin and forget. Closer to how a senior who knows your repo actually behaves.
- Honest billing. On April 2 they killed per-message billing and went to per-token. Usage is finally legible. You can budget. You can compare. You can build a real cost model.
Codex now has more than four million weekly active users. A year ago it was a niche tool that had effectively lost the developer mindshare war. Not anymore.
GPT 5.5 in the hand
Three observations from the week.
First, on long horizon work it is dramatically better than Opus 4.7. The Terminal-Bench gap translates immediately. We had a database migration that Claude Code chased for forty minutes before getting confused about which step it had completed. Codex with GPT 5.5 finished it in fourteen minutes on the first attempt with cleaner commits.
Second, the long context behaviour is the best we have used. The 1M window is real, not nominal. We pasted in the bulk of a mid sized backend, asked for a dead code report, and got a list that was correct on the second try. Opus 4.7 with the same input started losing the thread around 400k tokens. This is where the MRCR number is coming from. It is not a benchmark trick.
Third, the model is more efficient and more direct. Tight brief, clear acceptance criteria, and it lands the change in a fraction of the tokens. The flip side, the one some users on X are calling laziness, is real if you give it a vague prompt and walk away. The model rewards specificity. If you write the brief properly it ships. If you do not, it ships less than you wanted. That is not a bug, it is a different model of collaboration than Claude Code trained you on.
What the working developers are saying
The most repeated take across X this week, almost verbatim from a dozen different builders, is the one Eric Provencher posted: use Codex if you have a detailed plan and want to walk away for thirty minutes. Use Claude Code if you are not sure where you are going and want to iterate.
Read that twice. The autonomous coding workflow, the one every team is actually trying to build their stack around, is the one Codex now wins. Claude Code has been pushed back into the iterative pair programming role, which is a respectable role, but it is not the role that defines the frontier in 2026. The frontier is the agent loop. The agent loop is now Codex.
Other patterns from the week:
- Cost screenshots are flooding the timelines. People who were paying real money on Claude Code are reporting their Codex bills coming in at a third or less for equivalent agent runs. The 72 percent token efficiency number is showing up in real wallets.
- Computer use on macOS is the feature being talked about most by people building consumer apps. It collapses a class of QA work that nothing else can do today.
- The 1M context behaviour is what is winning over the long context skeptics. People who had given up on big context windows because they degraded past 200k are reporting clean retrieval well past 700k on GPT 5.5.
- The plugin ecosystem is the quiet kingmaker. Codex shipped with the integrations enterprise teams need, on day one. Claude Code is going to spend the next two quarters catching up.
The vibes have moved. Not the polite split decision the comparison blogs are writing. Actually moved.
How we are using it now
- Codex with GPT 5.5 is the default. Greenfield work, migrations, refactors, repo wide changes, anything where the agent runs longer than ten minutes. This was the Claude Code seat for most of the last year. It is not anymore.
- Computer use for QA flows. Native app smoke tests, simulator runs, form heavy paths, anything where you used to need a human clicker. Codex does it now and the bill is small.
- Plug into your work systems on day one. Linear, Slack, Notion, GitHub. The plugin layer is the value, not the model. Wire it up properly and the agent stops asking you for context.
- Write detailed briefs. GPT 5.5 rewards specificity more than any model we have used. Spell out scope, list acceptance criteria, name the files, hand it the constraints up front. The tighter the brief, the better the output, by a wider margin than on Opus 4.7.
- Keep Claude Code for iteration. When you do not yet know what you want, when the conversation is the work, Opus 4.7 is still the right tool. It is just no longer the dominant tool. It is one role in a stack now.
- Skip GPT 5.5 Pro. Six times the output cost for a marginal capability bump on a narrow band of tasks. Stay on base. The efficiency story is on the base tier.
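One way to make the "write detailed briefs" habit stick is to give the brief a fixed shape. The sketch below is our own illustration, not anything Codex requires: the `Brief` class, its field names, and the example file paths are all hypothetical, chosen to mirror the four parts named above (scope, acceptance criteria, files, constraints).

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    """Hypothetical container for the four parts of a brief named above."""
    scope: str
    acceptance_criteria: list[str]
    files: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the brief as plain text to paste into the agent prompt."""
        sections = [
            ("Scope", [self.scope]),
            ("Acceptance criteria", self.acceptance_criteria),
            ("Files", self.files),
            ("Constraints", self.constraints),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # An empty section is printed explicitly so the gap is visible.
            lines.extend(f"- {item}" for item in (items or ["(none given)"]))
        return "\n".join(lines)


# Example with made-up paths and criteria.
brief = Brief(
    scope="Migrate the orders table from integer to UUID primary keys",
    acceptance_criteria=[
        "All existing tests pass",
        "Migration is reversible",
    ],
    files=["db/migrations/", "app/models/order.py"],
    constraints=["No downtime", "One commit per migration step"],
)
print(brief.render())
```

The point of rendering empty sections as "(none given)" is that a missing constraint or file list shows up before the agent starts, not after it has shipped less than you wanted.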
The bigger picture
Anthropic spent the last six months in the lead. They earned it. Opus 4.6 in October and 4.7 two weeks ago were the two best public coding models on earth at the moment they shipped. The Mythos preview implied the lead was about to get bigger. Then Mythos got pulled for safety, 4.7 came out instead, and a tokenizer change made every Anthropic bill 30 to 40 percent worse on the same work.
OpenAI walked through the open door. They shipped a frontier model that uses 72 percent fewer output tokens, dominates the agentic benchmarks that are not memorization tainted, and arrived inside a desktop product with computer use, ninety plugins, and four million weekly users. The lead changed hands in a fortnight.
Pendulums swing. They have swung roughly every six to nine months for three years. What is different this time is that the swing is not narrative. It is in the bills, in the benchmarks that matter, and in the muscle memory of the people doing the work. By Friday next week, every serious engineering team that has not put Codex with GPT 5.5 in the rotation will be doing more expensive, slower, and worse work than the team next to them that has.
The pendulum will swing again. It always does. The job is not to predict the swing. It is to be the kind of team that picks up the new tool the week it ships and beats every team that did not.