Model Drop

Claude Opus 4.8

May 28, 2026

Anthropic dropped Claude Opus 4.8 an hour ago, and on paper it’s the kind of release that’s easy to shrug at. Benchmarks tick up. Price stays flat ($5 per million input tokens, $25 per million out, same as 4.7). Anthropic’s own post calls it “a modest but tangible improvement on its predecessor,” which is refreshingly honest. Nobody’s claiming this thing cures cancer.

But: Opus 4.8 is roughly four times less likely than 4.7 to let a flaw in its own code pass without flagging it, and for anyone who already works with Claude, that’s the massive upgrade.

Model: Claude Opus 4.8 (claude-opus-4-8 on the Claude API). Effort levels: high (default), xhigh (extra), and max. Adaptive thinking is the only thinking mode.

Model type: Text + vision input, text output. Same multimodal input stack as Opus 4.7. No native image, audio, or video output.

Ship date: May 28, 2026

Maker: Anthropic

Pricing: $5 / $25 per million input / output tokens, flat from Opus 4.7. Fast mode (research preview on the API) generates output at up to 2.5x the tokens per second for $10 / $50 per million, which Anthropic says is roughly a third of what fast inference cost on previous models.

Available on: claude.ai, Claude Code (Team, Enterprise, and Max plans for the dynamic-workflows preview), Cowork, the Claude API, Amazon Bedrock, Google Vertex AI, and Microsoft Foundry (200k context there). GitHub Copilot added it the same morning for Pro+, Business, and Enterprise (15x premium multiplier until usage-based billing lands June 1), and Cursor shipped it in the model picker at launch.

Headline benchmarks: SWE-Bench Pro at 69.2% (Opus 4.7: 64.3%, GPT-5.5: 58.6%, Gemini 3.1 Pro: 54.2%), the agentic-coding lead. OSWorld-Verified computer use at 83.4% (Opus 4.7: 82.8%, GPT-5.5: 78.7%). GDPval-AA knowledge work at 1890 Elo (GPT-5.5: 1769, Opus 4.7: 1753). Online-Mind2Web web agents at 84%, what Anthropic calls a meaningful jump over both Opus 4.7 and GPT-5.5. Where it trails: Terminal-Bench 2.1 at 74.6%, behind GPT-5.5’s 78.2% (still well ahead of Opus 4.7’s 66.1%). Humanity’s Last Exam at 49.8% without tools, 57.9% with.

Other info: 1M token context window by default on the API, Bedrock, and Vertex (200k on Microsoft Foundry), 128k max output. Adaptive thinking only, no extended-thinking budgets. Trained to flag uncertainty in its own work and roughly four times less likely than 4.7 to let a flaw in code it wrote pass without comment. Alignment assessment hits new highs on prosocial traits with misaligned-behavior rates close to Claude Mythos Preview, Anthropic’s best-aligned model. Full system card published with pre-deployment safety testing (Anthropic has deployed every Opus 4.x model under its ASL-3 standard).

More details: Introducing Claude Opus 4.8 (Anthropic)
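
To make the pricing concrete, here is a rough back-of-the-envelope sketch in Python. The per-million rates are the ones quoted above; the `Rate` class, the `estimate_cost` helper, and the 200k-in / 20k-out workload are made up for illustration and are not part of any Anthropic SDK. It also ignores anything the card doesn't mention (caching discounts, batch pricing, and so on), so treat it as an estimate, not a bill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per million tokens (hypothetical helper type, not an SDK class)."""
    input_per_million: float
    output_per_million: float


# Rates as quoted in this post.
STANDARD = Rate(input_per_million=5.0, output_per_million=25.0)
FAST = Rate(input_per_million=10.0, output_per_million=50.0)


def estimate_cost(input_tokens: int, output_tokens: int, rate: Rate) -> float:
    """Return the estimated USD cost of one request at the given rate."""
    total = (
        input_tokens * rate.input_per_million
        + output_tokens * rate.output_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a 200k-token prompt with a 20k-token answer.
    for name, rate in (("standard", STANDARD), ("fast", FAST)):
        cost = estimate_cost(200_000, 20_000, rate)
        print(f"{name}: ${cost:.2f}")
```

On those assumed numbers, fast mode simply doubles the bill, so the question is whether up to 2.5x the output speed is worth paying twice as much for a given job.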


What shipped

Anthropic released Claude Opus 4.8 on Thursday morning and, in a move that’s almost disorienting from a model lab, undersold it. The post calls it “a modest but tangible improvement on its predecessor.” Price holds flat at $5 / $25 per million tokens, the architecture and tool surface carry over from 4.7, and the benchmark gains are real but incremental. The story Anthropic actually leans on is behavioral: Opus 4.8 is trained to flag uncertainty about its own work, push back on plans that don’t hold up, and stop claiming progress it can’t support. The headline number is that it’s around four times less likely than 4.7 to let a flaw in code it produced slip by unremarked. Ben Sherry at Inc. ran with the framing the launch was built to earn, calling it Anthropic’s “most honest” model yet.

What’s new

Opus 4.8 reads as a post-training and behavior upgrade on the 4.7 base rather than a new model from scratch, so the benchmark deltas are narrow. What changed is how it behaves under load and what you can do with it, and five things stand out.

It tells you when it’s not sure. Anthropic trained 4.8 to surface uncertainty about its own output instead of papering over it, and it’s roughly four times less likely than 4.7 to let a flaw in code it wrote pass without flagging it. The same training shows up in the alignment numbers, where misaligned behavior lands near Mythos Preview (Anthropic’s “best-behaved” model).
Effort is a dial you control. Effort defaults to high on every surface, including Claude Code, with xhigh and max available for hard problems and long async runs. Anthropic says high spends about the same tokens as 4.7’s default on coding tasks while scoring better, with adaptive thinking deciding per turn whether to reason at all.