WritingApril 20, 20268 min read

We tried Opus 4.7 so you didn't have to

Four days with Anthropic's new flagship, a week of our real workload, and a mountain of token receipts. Here is what is actually different, what is worth the money, and what to turn off before your bill arrives.

What Anthropic shipped on April 16

Opus 4.7 went live last Thursday across the Claude apps, the API, Bedrock, Vertex, and Microsoft Foundry. Same price card as 4.6. Five dollars per million input tokens, twenty five for output. On paper a routine point release.

It is not a routine point release. Opus 4.7 is the most capable publicly available model right now on the coding benchmarks that people actually use to evaluate these things. It is also a quiet concession that the real frontier, Claude Mythos, is not coming out. Anthropic described 4.7 in the launch as the part of Mythos they felt safe putting in your hands.

We put it in our hands for four days. Claude Code, agentic harnesses, real client repos, a few deliberately nasty debugging sessions. The headline: it is meaningfully smarter, it costs noticeably more to use, and the shape of how you get value out of it has changed.

The benchmark jumps are real

The numbers that matter for anyone shipping software:

     Opus 4.7 vs Opus 4.6

  SWE-bench Verified
    4.7  ████████████████████████████████████  87.6%
    4.6  █████████████████████████████████     80.8%

  SWE-bench Pro
    4.7  ██████████████████████████            64.3%
    4.6  █████████████████████                 53.4%

  CursorBench (Cursor's internal eval)
    4.7  ████████████████████████████          70%
    4.6  ███████████████████████               58%

  Visual acuity (pentest screenshots)
    4.7  ████████████████████████████████████  98.5%
    4.6  █████████████████████                 54.5%

The SWE-bench Verified number is the one we care about most. Opus 4.6 was already a serious engineer. 4.7 closes most of the remaining gap to a competent senior on real GitHub issues. Seven points in one release, on a benchmark that usually moves in ones and twos.

SWE-bench Pro, the harder contamination resistant variant, moved almost eleven points. That is the more honest read. When you strip out the benchmarks the model might have memorised, it still got dramatically better.

The vision jump is the sleeper. Images up to 2,576 pixels on the long edge, roughly 3.75 megapixels, more than three times the resolution Claude could previously see. The visual acuity benchmark nearly doubled in one release. In practice this is the difference between Claude reading a Figma frame and Claude understanding your actual production screenshot.

The silent price increase

Here is the part that did not make the keynote. Opus 4.7 ships with a new tokenizer. Same prompt, more tokens. Independent measurements are converging on the same numbers.

     Token inflation on identical inputs (4.7 vs 4.6)

  Typical Claude Code context   ████████████████  1.33x
  CLAUDE.md file                █████████████████  1.44x
  Technical docs                █████████████████  1.47x
  Some prompts in the wild                     up to 1.35x input,
                                                  more at output

Pricing is unchanged. Cost is not. Per request, expect a thirty to forty percent increase on the same work, before you factor in that 4.7 also thinks harder at higher effort levels, which pushes output tokens up further.

Stack all of that together and the numbers users on X are sharing, one and a half to three times the real cost of Opus 4.6 for the same agent loop, are not exaggerations. They are what happens when a new tokenizer, a new default effort level, and a more verbose reasoning trace all land in the same release.

This is the first frontier release where we think you actually have to budget for it. Our recommendation: before you flip your production agents to 4.7, run a day of shadow traffic and read the bill.

xhigh and /ultrareview are the real features

The launch buried two changes that matter more than the benchmark table.

First, a new effort level called xhigh, sitting between high and max. It is now the default in Claude Code. In practice this means every session starts the model in a more deliberative mode than 4.6 ever ran in. If your old intuition was “turn up effort for the hard stuff,” throw it out. The hard stuff is now the default.

Second, /ultrareview. A new Claude Code command that boots a dedicated review session and behaves like a genuinely skeptical senior engineer reading a PR. It catches design issues, not just typos. We pointed it at a month old client codebase that had passed human review and it found two logic bugs and one subtle race condition in the first ten minutes.

Both features change the unit economics in the same direction. Better output, more tokens. The question is no longer whether Claude can do the task. It is whether you are comfortable paying for the extra thinking.

The backlash, and why we think it is wrong

Within 48 hours the power user backlash was louder than we have ever seen for a Claude release. Token burn was the loudest complaint. A smaller but vocal group said 4.7 had “lost its spark,” felt combative on creative work, and regressed on brainstorming. An AMD engineering post calling it untrustworthy got picked up everywhere. Several developers rolled back to 4.6.

We think the token critique is fair and the vibe critique is mostly operator error.

The model rewards a different kind of prompting than 4.6 did. The 4.6 pattern was chatty pair programming. Many short turns, tight feedback, lots of back and forth. The 4.7 pattern is delegation. Long, specific task briefs with acceptance criteria, relevant paths, and constraints up front, then you let it run. Every user turn adds reasoning overhead. At xhigh, that overhead is not cheap.

When we rewrote our prompts in that shape, the combative edge disappeared. When we did not, it felt exactly like the complaints on X said it felt.

How we actually use it now

Four days of real work, condensed to what we would tell a friend.

Write briefs, not prompts. Intent, constraints, acceptance criteria, relevant files, known gotchas. The worst dollar you will spend on 4.7 is the one that pays it to figure out context you could have pasted in.
Give it a way to verify itself. Tests it can run. A linter. A type checker. A script that exercises the path. 4.7 genuinely uses them. 4.6 mostly performed using them.
Batch your questions. Each user turn costs real thinking tokens now. Ten little nudges is two or three times the cost of one well written handoff.
Drop effort for routine work. Not everything needs xhigh. Renames, boilerplate, docs updates. Medium is fine. Save the reasoning budget for the stuff that actually needs it.
Run /ultrareview on anything going to production. It is the highest leverage feature in the release. Treat it as the last gate before merge.
Feed it your real screenshots. The vision upgrade is not hype. Production dashboards, error states, Figma exports at full resolution. This is the first Claude that can actually see them.

Should you switch

If you are building agents, doing serious code review, shipping to a real codebase, or working with screenshots and visual context, yes. The capability gap between 4.6 and 4.7 is large enough that the extra token spend pays for itself on any task that previously required a second pass by a human.

If you are using Claude as a chat buddy, a brainstorming partner, or for creative writing, the answer is less obvious. Some of what people loved about 4.6 got sanded off. We would stay on 4.6 for that work, and use 4.7 where the extra rigor actually matters.

And if you are budget sensitive, do not switch blind. Shadow traffic it. Measure the bill. Turn effort down on anything that does not need xhigh. The model rewards the teams that tune it.

The bigger picture

Opus 4.7 is the first release in a while where the dominant complaint is price, not capability. Two years ago the debate was whether these things could do real engineering. Last year it was whether they could do it reliably. This year we are arguing about the electricity bill.

That is a tell. It means the frontier has crossed a line where the question for most teams is no longer “is this good enough to use.” It is “are we organised to use it well.”

And that is the gap we keep seeing in the field. The teams that get value out of 4.7 are the teams that treat it like a new hire you are ramping up. Context docs. Clear tasks. Ways to verify. Real tooling. The teams that complain the loudest are the ones still using it like a slightly better autocomplete.

Mythos is still sitting behind glass somewhere. When the next one lands, the gap between those two groups is going to get very uncomfortable, very fast.

GET IN TOUCH

Let's talk.

30 minutes is enough to scope most builds. Pick whichever feels easier. Book a call, or drop us an email.

BOOK A CALL

30 min, no decks

A working conversation. Bring the problem, leave with the scope.

BOOK ON CALENDLY

EMAIL US

Got more to say first?

Drop us a line. We read everything and reply within a day.

SEND AN EMAIL

← ALL WRITING