Anthropic dropped a new AI model and—plot twist—they admitted it’s not their best work. That’s either the most honest PR move ever or the smartest marketing move in AI. Let’s go with “both.”
Claude Opus 4.7 is what happens when a company builds something powerful enough to genuinely concern the FBI, decides it’s too scary to release, and gives us the “safe” version instead. It’s the model equivalent of your mom taking away your car keys because you “drive too fast” but also won’t let you take the bus because it’s “unsafe.” You’re just… stuck.
I’ve been using it for a week. Here’s the deal: it’s better at coding, actually reads your garbage 2 AM screenshots now, and will absolutely break your old prompts in ways that’ll make you question your sanity.
- SWE-bench Pro: 64.3% — Up from 53.4% on 4.6. It’s not benchmark hacking; it’s actually useful now.
- Vision: 2,576px / 3.75MP — Triple what it was. Your blurry screenshots of error messages finally make sense.
- Prompts are dead — “Make it better” does nothing now. 4.7 needs a spec, not vibes.
- Same price — $5/$25 per million tokens. But the tokenizer uses 0-35% more, so your bill going up isn’t a bug—it’s a feature.
- Cyber superpowers locked — They’ve intentionally neutered security features. Pentesters need to apply to a waitlist now.
The Glasswing Thing (aka Why This Model Exists)
Last week, Anthropic announced Project Glasswing—which is their fancy way of saying “we built something that worries us.” Claude Mythos Preview is so good at cybersecurity (the offensive kind) that they’re only giving it to 11 organizations. Everyone else gets Opus 4.7.
The Scenario: Imagine you’re a game developer who created a weapon so powerful it can crash servers. Your company says “cool, but we’re only giving the beta to esports organizations and locking it from public release because we’re scared what random teenagers will do with it.” That’s Anthropic with Claude Mythos and Opus 4.7.
They call this “differentially reducing” cyber capabilities—which is a fancy way of saying they trained it to forget certain things. The model has automatic guardrails now. Ask it to help with an exploit and it’ll either refuse or give you the “I can’t help with that” response. It’s like having a coworker who silently judges everything you do but only speaks up when you’re about to do something dumb.
The Numbers (That Actually Matter)
Let’s skip the fluff and look at what matters for actual work.
SWE-bench Pro — The Only Benchmark That Counts
| Model | SWE-bench Pro | SWE-bench Verified |
|---|---|---|
| Claude Opus 4.7 | 64.3% | 87.6% |
| GPT-5.4 | 57.7% | — |
| Gemini 3.1 Pro | 54.2% | 80.6% |
| Claude Opus 4.6 | 53.4% | 80.8% |
The gap between 4.6 and 4.7 is 10.9 points. That's more than double the 4.3-point gap separating 4.6 from GPT-5.4. On SWE-bench Verified (human-checked real GitHub issues), 4.7 hits 87.6%—nearly 7 points ahead of its predecessor.
SWE-bench matters because it’s not some abstract test. It’s real GitHub issues from Django, scikit-learn, matplotlib. The model has to understand a codebase, find bugs, write fixes, and verify they work. This is the “can this actually help me ship code” benchmark.
The Benchmark That Everyone “Solved”
GPQA Diamond—graduate-level science questions—has every frontier model at ~94%. Opus 4.7 hits 94.2%, GPT-5.4 gets 94.4%, Gemini 3.1 Pro gets 94.3%. The differences are basically noise at this point.
What this tells us: raw reasoning is maxed out. The real competition is now applied work—coding, agentic workflows, multi-step tasks that actually matter for your job.
Knowledge Work (aka “Don’t Use It For Your Essay”)
On GDPval-AA (fancy speak for “does actual professional work”), Opus 4.7 scores 1753 Elo vs GPT-5.4’s 1674 and Gemini 3.1 Pro’s 1314. For financial analysis specifically—the “Finance Agent” test—4.7 produces more rigorous analyses and presentations that don’t look like they were made by a chain-smoking analyst at 3 AM.
Can It Actually Code For Hours Without You Watching?
This is the main event. 4.7 claims a 14% improvement in multi-step agentic tasks with one-third the tool errors. It can now coordinate multiple workstreams in parallel instead of doing everything sequentially, and it passes “implicit-need tests” where it figures out what tools it needs without you explicitly telling it.
The Scenario: It’s 11 PM on a Tuesday. You’ve been avoiding refactoring your auth system for three months because it’s 200+ files of spaghetti that nobody understands—not even the original developer who left last year. You fire up Claude Code with Opus 4.7, describe what you want the architecture to look like, and go to bed.
With 4.6, you’d wake up to one of three outcomes: (1) a half-finished disaster, (2) the model got confused and started rewriting your database schema for some reason, or (3) it seemed done but introduced subtle bugs you’d find in production at the worst possible moment.
With 4.7, it sustains focus across the entire codebase, coordinates changes across multiple files at once, catches its own mistakes before reporting back, and—most importantly—keeps going through tool failures that would have stopped 4.6 dead. You wake up to code that actually passes your tests.
The “sustained focus” claim is the big one. Previous models would hallucinate APIs that don’t exist, forget context from earlier in the session, or just wander off-task after 20 minutes. Anthropic says 4.7 is “engineered to sustain focus over hours-long workflows.” I tested a 4-hour session—it actually held coherent context the entire time. The degradation that used to happen? Gone.
This isn’t just Anthropic’s marketing. Caitlin Colgrove, CTO of Hex, put it bluntly: “Correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.” That’s the change—4.7 knows when it doesn’t know, rather than confidently inventing nonsense.
Cursor, Warp, and Notion all verified this in pre-release testing. The difference isn’t just that 4.7 produces results—it’s that 4.7 produces results customers can actually ship. Previous models would generate code that looked right but fell apart under real use. 4.7 is the first Claude model where the “it works” rate matches the “it looks right” rate.
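That resilience to tool failures lives inside the model, but you can approximate the same robustness on the client side of your agent loop. A minimal retry wrapper, purely my sketch (the helper name and signature are not from any Anthropic SDK):

```python
import time

def call_tool_with_retry(tool_fn, *args, max_attempts=3, base_delay=0.0, **kwargs):
    """Retry a flaky tool call with exponential backoff.

    Illustrative only: a long-running agent loop can wrap each tool
    invocation like this so one transient failure doesn't kill an
    overnight session.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return tool_fn(*args, **kwargs)
        except Exception as exc:  # real code should catch specific tool errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error
```

Wrap every tool invocation in your harness with this and a transient network blip becomes a logged retry instead of a dead session.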
Will It Finally Read My Screenshots Properly?
Previous Claude models capped at 1,568 pixels—about 1.15 megapixels. Opus 4.7 handles 2,576 pixels / 3.75 megapixels. That’s dense enough to read fine print in diagrams, extract data from complex charts, and work with screenshots where the text is actually small.
The Scenario: It’s 2 AM. Production is on fire. Your error logs are a wall of red text cascading across three monitors. You’re running on caffeine and desperation. You screenshot the error, paste it to Claude, and pray.
At 4.6’s resolution, it might miss the one critical line buried in the noise—the database timeout causing the cascade, not the auth errors that look more prominent. At 4.7’s resolution, it reads the dense screenshot properly, identifies the actual root cause, and tells you exactly which config to change to fix it.
There’s a nice side effect: coordinate mapping is now 1:1 with actual pixels. No more scale-factor math.
The Scenario: You’re building an automation that clicks through a web app. With previous models, you’d need to calculate coordinates with some weird scaling factor because the model’s internal representation didn’t match reality. Now you just say “click the submit button” and it works. It’s the difference between a GPS that requires you to do math in your head versus one that just says “turn left.”
Fair warning: higher resolution = more tokens. Downsample if you don’t need the detail.
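If you want to downsample proactively, the arithmetic is simple. Here's a sketch that computes the largest scale factor fitting an image within both the long-edge and megapixel limits quoted above (treat those default numbers as assumptions and verify them against current docs):

```python
import math

def fit_scale(width, height, max_edge=2576, max_pixels=3_750_000):
    """Largest scale factor (capped at 1.0) that fits an image within
    both a long-edge cap and a total-pixel cap.

    The default limits are the ones quoted for Opus 4.7 in this
    article; treat them as assumptions, not gospel.
    """
    edge_scale = max_edge / max(width, height)          # fit the long edge
    pixel_scale = math.sqrt(max_pixels / (width * height))  # fit the MP budget
    return min(1.0, edge_scale, pixel_scale)
```

A 4000×3000 screenshot (12MP) comes back with a factor around 0.56, landing right at the 3.75MP budget; smaller images pass through untouched.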
It Actually Remembers Things Now
This one’s subtle but important for long-running work. Opus 4.7 is better at using file-system-based memory. If your agent maintains a scratchpad, notes file, or structured memory store across conversation turns, the model is now noticeably better at jotting down notes to itself and leveraging those notes in future tasks.
The Scenario: You’re working on a complex feature that spans three days. With previous models, you’d start each session by re-explaining the architecture, the decisions you’d made yesterday, and what you were planning to do next. By day three, you’d given up and just accepted that the model had no memory of day one.
With 4.7, you can tell it to maintain a notes.md file. It actually writes useful context to that file—architecture decisions, open questions, next steps—and reads it at the start of each new session. When you return on day three, it picks up exactly where you left off. No re-explaining. No “what were we doing again?”
Anthropic also offers a client-side memory tool if you don’t want to build your own scratchpad system. Either way, this is the difference between a model that pretends to remember and one that actually writes things down.
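The scratchpad half of this pattern is plain file I/O on your side. A minimal sketch (the helper names are mine, not part of any SDK): read the notes at session start, append decisions as the run progresses.

```python
from pathlib import Path

def load_notes(path="notes.md"):
    """Read the agent's scratchpad at session start (empty on first run)."""
    p = Path(path)
    return p.read_text(encoding="utf-8") if p.exists() else ""

def append_note(note, path="notes.md"):
    """Append a decision or open question for the next session to pick up."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")
```

Prepend `load_notes()` to the context when a session starts, and call `append_note()` whenever the model records an architecture decision or open question.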
Why Your Old Prompts Are Now Broken
Here’s the change that’s going to annoy everyone initially. Opus 4.7 follows instructions literally. Previous models interpreted prompts loosely—they filled in gaps, inferred what you probably meant, generalized instructions. 4.7 does exactly what you say. Nothing more. Nothing less.
The Scenario: You have a prompt that’s worked for months: “Make this code better.”
With 4.6, that vague instruction triggered a comprehensive refactoring—breaking up functions, adding comments, improving variable names, optimizing imports. It “knew” what better meant.
With 4.7, that same prompt does… nothing useful. Because “better” isn’t a specification. It’s a judgment call. 4.7 won’t make judgments it wasn’t asked to make.
Now you need to write: “Refactor this 200-line function into smaller functions under 40 lines each, add JSDoc comments explaining parameters and return values, rename variables to describe their purpose rather than their type, and remove unused imports.”
It’s more work upfront. You actually have to think about what you want. But the payoff is exactly what you asked for—no creative additions you didn’t request, no hallucinated “improvements” that break your code.
The Scenario: You used to write “fix the bug” and 4.6 would do a full security audit, refactor the surrounding code, add tests, and write documentation. Now with 4.7, “fix the bug” might just… fix the bug. And nothing else. Your “comprehensive review” prompt now produces exactly what you said—and you’re either thrilled or frustrated depending on what you actually wanted.
Anthropic explicitly warns that old prompts might produce unexpected results. They’re not kidding. Budget time to re-tune your prompt library if you’re upgrading from 4.6.
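One way to make that re-tuning systematic is to generate spec-style prompts instead of hand-writing vague ones. A purely illustrative sketch (the function, defaults, and checklist are mine):

```python
def refactor_spec(max_fn_lines=40, doc_style="JSDoc"):
    """Build an explicit refactoring spec instead of 'make it better'.

    Every requirement is stated outright, since 4.7 won't infer
    unstated ones. Adjust the checklist to your codebase.
    """
    requirements = [
        f"Split any function longer than {max_fn_lines} lines into smaller functions.",
        f"Add {doc_style} comments documenting parameters and return values.",
        "Rename variables to describe purpose, not type.",
        "Remove unused imports.",
        "Do not change public interfaces or observable behavior.",
    ]
    return "Refactor this code. Requirements:\n" + "\n".join(
        f"{i}. {r}" for i, r in enumerate(requirements, 1)
    )
```

The same idea scales to review prompts, test-writing prompts, and anything else that used to lean on the model filling in blanks.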
The Cyber Thing (aka “Why Pentesters Are Frustrated”)
This is the interesting part. Anthropic didn’t just reduce cyber capabilities—they built 4.7 specifically to test safety guardrails for eventual Mythos release.
The Intentional Limitations
During training, Anthropic “experimented with efforts to differentially reduce” 4.7’s cyber capabilities. The model has automatic guardrails that detect and block requests for:
- Vulnerability exploitation assistance
- Automated attack generation
- Social engineering content creation
- Detailed security bypass instructions
For regular users, this is invisible protection. The guardrails work automatically. Attempts at harmful use get refused with explanation. Anthropic claims the model can’t be reliably jailbroken—but smart attackers will always find edge cases.
How Regular Users Stay Safe
If you’re using Claude for legitimate development, these restrictions make the model safer, not less useful. The literal instruction following actually helps here—4.7 is less likely to hallucinate dangerous code or suggest security anti-patterns that previous models might have “helpfully” invented.
The Scenario: You accidentally type “make auth optional for testing” at 2 AM. With 4.6, it might have “helpfully” commented out all your auth checks and created a security nightmare. With 4.7, it either refuses or asks for clarification. It’s the difference between a helpful-but-reckless coworker and one who actually has your back.
Benefits For Defensive Security
Security teams get genuine value from 4.7 within the guardrails:
- Code review and vulnerability detection (SQL injection, XSS, insecure dependencies, CVE patterns)
- Security architecture planning and threat modeling
- Incident response documentation and playbook generation
- Security policy writing and compliance documentation
- Log analysis (the vision upgrade helps read dense SIEM dashboards)
- Security awareness training materials
The vision upgrade specifically helps security analysts. Reading dense dashboards, parsing network diagrams, extracting indicators of compromise from screenshots—all work better at 2,576px.
The Dark Side (What Attackers Might Try)
Even with reduced capabilities:
- Open-source intelligence gathering (analyzing public data)
- High-res vision could read accidentally leaked credentials from blurry screenshots
- Long-running task capabilities could support extended malicious research
- Multi-agent coordination could parallelize recon across multiple targets
- Jailbreak attempts specifically targeting 4.7’s restrictions
- Simply switching to older models (4.5, 4.6) or competitors that don’t carry these restrictions
The Reality Check
Here’s the uncomfortable truth: capability restrictions affect legitimate security researchers more than threat actors. The pentester with a Friday deadline gets blocked. The actual criminals use their own tools and don’t care about Claude’s restrictions.
The Scenario: You’re a penetration tester. Your client needs a report by Friday. You used Claude to speed up recon and document findings. Now half your workflow gets blocked, you’re filling out forms for a waitlist that won’t approve in time, and the actual attackers couldn’t care less because they’re not using Claude in the first place.
The Cyber Verification Program exists for this reason. Security professionals can apply at claude.com/form/cyber-use-case with institutional affiliation and use case documentation. It’s a pain, but it’s the trade-off.
Is this security theater? Partially. The restrictions inconvenience the good guys more than they stop bad guys. But it’s also a genuine attempt at responsible deployment. I don’t love the friction, but I get why they’re doing it.
The Effort Levels (Explained Properly)
Opus 4.7 introduces “xhigh” effort between high and max. Claude Code now defaults to xhigh for all plans. The old extended thinking budgets are gone—replaced with adaptive thinking that allocates reasoning based on task complexity.
| Effort Level | When to Use |
|---|---|
| Low | Quick questions, simple transforms, when speed matters more |
| High | Most coding tasks, general reasoning, good quality/speed balance |
| Xhigh | Default in Claude Code. Complex multi-step tasks where thoroughness matters |
| Max | When you need the absolute best reasoning and don’t care about latency |
Adaptive thinking is off by default. To enable it:
```python
thinking = {"type": "adaptive"}
```

Task Budgets (Beta)
New feature: tell the model how many tokens to target for a full agentic loop. It sees a countdown and prioritizes work to finish gracefully within budget.
```python
response = client.beta.messages.create(
    model="claude-opus-4-7",
    output_config={
        "effort": "high",
        "task_budget": {"type": "tokens", "total": 128000},
    },
    betas=["task-budgets-2026-03-13"],
)
```

Minimum is 20k tokens. It’s not a hard cap—it’s guidance. Use it for scoped work where you need to control costs, but skip it for open-ended creative tasks.
New Claude Code Features
If you use Claude Code (Anthropic’s agentic coding tool), 4.7 brings two genuinely useful additions:
/ultrareview — The Code Review You Actually Want
Type /ultrareview and Claude Code spins up a dedicated review session that reads through your changes and flags bugs, design issues, and problems a careful human reviewer would catch. It’s not a surface-level lint check—it’s looking for logic errors, edge cases you missed, and architectural problems.
The scenario: You’ve been coding for six hours. You’re too tired to review your own work properly, but you know if you push now, you’ll regret it tomorrow. Instead of shipping broken code or staying up another two hours, run /ultrareview. It finds the off-by-one error you introduced, the missing error handling, and the function you forgot to update after renaming a parameter. You fix them, sleep, and ship clean code in the morning.
Pro and Max users get three free ultrareviews to try it out. After that, it’s part of your normal usage.
Auto Mode — Trust, But Verify (Less)
Auto mode is a new permissions setting where Claude makes decisions on your behalf without asking every time. This lets you run longer tasks with fewer interruptions—but unlike skipping all permissions (which is terrifying), auto mode maintains guardrails. It won’t delete production databases or push to main without your approval, but it will handle the hundred minor decisions that normally clog up your workflow.
The scenario: You’re refactoring a large codebase. Normally, every file move, every import update, every test fix requires a “yes/no” prompt. With auto mode, Claude handles the routine stuff and only pauses for genuinely consequential decisions. You get the speed of “yes to all” with the safety of “ask when it matters.”
Max users get auto mode now. It’s rolling out to other plans over time.
Breaking Changes (What Developers Need to Know)
Upgrading from 4.6 to 4.7 isn’t drop-in. Four things matter:
1. Extended Thinking Budgets Are Gone
The `thinking = {"type": "enabled", "budget_tokens": N}` pattern returns a 400 error. Use this instead:

```python
# Before (Opus 4.6)
thinking = {"type": "enabled", "budget_tokens": 32000}

# After (Opus 4.7)
thinking = {"type": "adaptive"}
output_config = {"effort": "high"}
```

2. Sampling Parameters Are Gone
Setting temperature, top_p, or top_k to non-default values returns a 400 error. Remove them entirely and use prompting to guide behavior instead.
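If your call sites pass sampling parameters dynamically, a small shim can strip them before the request leaves your code. A sketch (the helper name is mine; adapt it to whatever client wrapper you use):

```python
REMOVED_PARAMS = ("temperature", "top_p", "top_k")

def migrate_request(params):
    """Drop sampling parameters that Opus 4.7 rejects with a 400.

    Returns a cleaned copy plus the keys removed, so you can log where
    prompting now has to do the steering instead.
    """
    cleaned = {k: v for k, v in params.items() if k not in REMOVED_PARAMS}
    removed = [k for k in params if k in REMOVED_PARAMS]
    return cleaned, removed
```

Logging the `removed` list during rollout tells you exactly which call sites relied on sampling knobs and need prompt-level replacements.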
3. Thinking Content Is Hidden By Default
The model still thinks, but you won’t see the reasoning stream unless you opt in:
```python
thinking = {
    "type": "adaptive",
    "display": "summarized",  # or "omitted" (default)
}
```

If your product streams reasoning to users, this looks like a long pause before output starts.
4. Tokenizer Changed
Opus 4.7 uses a new tokenizer. Equivalent content maps to 1.0-1.35× as many tokens (up to 35% more). Same price per token, but your bill might creep up. Update your max_tokens for headroom.
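Two back-of-the-envelope helpers for sizing the migration: pad max_tokens for the worst case and estimate bill drift. The 1.35× ratio and the $5/$25 prices are the figures from this article; verify them against your own traffic before budgeting.

```python
def scaled_max_tokens(old_max_tokens, worst_case_ratio=1.35):
    """Resize a 4.6-era max_tokens for the new tokenizer's worst case."""
    return int(old_max_tokens * worst_case_ratio)

def estimated_monthly_cost(input_tokens, output_tokens, ratio=1.35,
                           input_price=5.0, output_price=25.0):
    """Worst-case monthly bill after migration (prices in $/million tokens)."""
    old = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return old * ratio
```

A workload of 10M input / 1M output tokens a month goes from $75 to roughly $101 in the worst case, which is the kind of drift you want to know about before finance does.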
Behavior Changes (Not API, But Matters)
- More literal instruction following at lower effort levels
- Response length calibrates to task complexity
- Fewer tool calls by default (raising effort increases them)
- More direct, opinionated tone with fewer emoji
- More regular progress updates throughout long tasks
- Fewer subagents spawned by default (prompt to steer this)
Pricing and Availability
- Price: Same as 4.6 — $5/million input, $25/million output
- Where: Claude.ai, Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry
- GitHub Copilot: Rolling out for Pro+, replaces 4.5/4.6
- Context: 1M tokens (no premium)
- Caching: Up to 90% savings
- Batch: 50% discount
Competitors are cheaper—Gemini 3.1 Pro is $2/$12. The premium is justified if you’re doing serious engineering work where the SWE-bench improvements translate into saved hours.
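For budgeting, the levers above (list price, prompt caching, batch) combine into a simple per-request estimate. A sketch under this article's assumptions, namely cached input billed at 10% (the "up to 90% savings") and batch at half price; check the current pricing page before relying on it:

```python
def request_cost(input_tokens, output_tokens, cached_fraction=0.0, batch=False,
                 input_price=5.0, output_price=25.0):
    """Estimate one request's cost in dollars at Opus 4.7 list pricing.

    Assumptions: $5/$25 per million tokens, cached input billed at 10%,
    batch requests discounted 50%. All figures from the article.
    """
    per_tok_in = input_price / 1_000_000
    per_tok_out = output_price / 1_000_000
    cached = input_tokens * cached_fraction          # tokens served from cache
    fresh = input_tokens - cached                    # tokens billed at full rate
    cost = fresh * per_tok_in + cached * per_tok_in * 0.10 + output_tokens * per_tok_out
    if batch:
        cost *= 0.5  # 50% batch discount
    return cost
```

A 1M-input/100k-output request drops from $7.50 to about $1.73 with a 90% cache hit rate plus batching, which is why agent harnesses obsess over cache-friendly prompt prefixes.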
Using Opus 4.7 on AWS Bedrock
If you’re running on AWS infrastructure, Bedrock offers zero operator access—meaning your prompts and responses are never visible to Anthropic or AWS operators. Here’s how to invoke the model:
Python SDK (AnthropicBedrockMantle)
```python
from anthropic import AnthropicBedrockMantle

# Initialize the Bedrock Mantle client (uses SigV4 auth automatically)
mantle_client = AnthropicBedrockMantle(aws_region="us-east-1")

# Create a message using the Messages API
message = mantle_client.messages.create(
    model="us.anthropic.claude-opus-4-7",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": "Design a distributed AWS architecture that supports 100k requests per second across multiple geographic regions.",
        }
    ],
)

print(message.content[0].text)
```

AWS CLI
```bash
aws bedrock-runtime invoke-model \
  --model-id us.anthropic.claude-opus-4-7 \
  --region us-east-1 \
  --body '{"messages": [{"role": "user", "content": "Design a distributed AWS architecture that supports 100k requests per second across multiple geographic regions."}], "max_tokens": 512}' \
  --cli-binary-format raw-in-base64-out \
  invoke-model-output.txt
```

Capacity and scaling: Bedrock’s new inference engine dynamically allocates capacity and queues requests during high demand rather than rejecting them. You get 10,000 requests per minute per account per Region immediately, with more available upon request.
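Even with server-side queueing, it's polite (and cheaper in retries) to pace yourself client-side. A sliding-window limiter sketch; the 10,000/min figure is the article's quoted quota, so check your account's actual limits:

```python
import time
from collections import deque

class MinuteRateLimiter:
    """Client-side pacing against a requests-per-minute quota.

    Sliding-window sketch: before each request, expire timestamps older
    than 60s and wait if the window is full. The default quota is the
    10,000/min figure quoted for Bedrock; verify yours.
    """

    def __init__(self, max_per_minute=10_000, clock=time.monotonic):
        self.max = max_per_minute
        self.clock = clock      # injectable for testing
        self.window = deque()   # timestamps of recent requests

    def acquire(self):
        now = self.clock()
        while self.window and now - self.window[0] >= 60.0:
            self.window.popleft()  # expire requests outside the window
        if len(self.window) >= self.max:
            # sleep until the oldest request ages out of the window
            time.sleep(60.0 - (now - self.window[0]))
            return self.acquire()
        self.window.append(now)
```

Call `limiter.acquire()` before each `invoke-model` request; under the quota it's a no-op, over it you block briefly instead of burning retries.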
FAQ
Is upgrading from 4.6 worth it?
Yes if you do long-running coding tasks, vision-heavy work, or complex agentic workflows. The 10+ point SWE-bench jump translates to fewer “please try again” moments. No if you’re happy with 4.6 for simple chat—you won’t notice much difference.
Why is Anthropic releasing a “less capable” model than Mythos?
Mythos is too powerful to release safely. 4.7 is the test case—Anthropic is validating guardrails and deployment procedures before considering broader release of stronger systems. It’s responsible caution, even if it’s frustrating.
Will my old prompts work on 4.7?
Probably not. 4.7 is substantially more literal. Prompts that relied on “filling in the blanks” may produce unexpected or minimal results. Budget time to re-tune your prompt library before migrating production systems.
How much will 4.7 cost vs 4.6?
Same per-token price, but the new tokenizer means 0-35% more tokens for equivalent content. Your costs may increase proportionally. Monitor usage after migration and adjust prompts for conciseness if needed.
Can I use 4.7 for penetration testing?
Only through the Cyber Verification Program. Standard 4.7 has intentionally reduced cyber capabilities and automatic guardrails blocking high-risk requests. Apply at claude.com/form/cyber-use-case with institutional affiliation.
Is 4.7 safer than 4.6?
For general users, yes—reduced cyber capabilities and literal instruction following make it less likely to generate harmful content. But security professionals face access restrictions that may impede legitimate work.
Summary
- Coding is actually better: 64.3% on SWE-bench Pro means it can actually help refactor your production code overnight instead of giving up halfway.
- Vision finally works: 2,576px means screenshots actually make sense now.
- Your prompts are dead: Rewrite them. “Make it better” does nothing now.
- Cyber restrictions are real: They affect security researchers more than attackers.
- Pricing unchanged: $5/$25 per million tokens, but the new tokenizer uses up to 35% more of them.
- This is the test run: What they learn here determines if Mythos ever sees broad release.
What to Read Next
- Claude vs ChatGPT for Developers — Picking the right tool for your workflow
- AI Context Window Tricks — Getting more from every conversation without hitting limits
- Claude Code vs Cursor — How Anthropic’s assistant compares to the popular AI editor