SWE-bench Pro Just Killed the "Proprietary Models Are Better" Argument

Kimi K2.6 (58.6%) and GLM-5.1 (58.4%) now match or beat GPT-5.4 (57.7%) on the only uncontaminated coding benchmark. OpenAI itself stopped reporting SWE-bench Verified. The 6-point gap to Opus 4.7 comes at 10x the price.

The Benchmark Nobody Can Game

For two years, the AI industry pointed to SWE-bench Verified as proof that proprietary models were better at coding. Claude Opus 4.5 scored 80.9%. GPT-5.2 hit 79.4%. Open-source models trailed at 70-75%. The narrative wrote itself: pay more, get more.

Then OpenAI itself broke the story. On February 23, 2026, OpenAI published "Why SWE-bench Verified no longer measures frontier coding capabilities." The findings were devastating: frontier models had memorized the test data. GPT-5.2, given only a hint and a task ID, reproduced the exact gold patch verbatim — including specific function names and inline comments. Claude Opus 4.5 quoted inline comments word-for-word. Gemini 3 Flash reproduced complete patches from task IDs alone.

"Improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time." — OpenAI, "Why SWE-bench Verified no longer measures frontier coding capabilities" (February 2026)

OpenAI's recommendation: stop using SWE-bench Verified. Use SWE-bench Pro instead.

Why SWE-bench Pro Changes Everything

SWE-bench Pro was designed specifically to address Verified's failures. The differences are not incremental:

  • 1,865 tasks across 41 repositories — vs. 500 tasks in 12 repos on Verified
  • 107.4 average lines changed per task — vs. median 4 lines on Verified
  • 4.1 average files changed per task — vs. ~1 on Verified

Verified tested whether a model could fix a single Python file by changing a few lines. Pro tests whether a model can coordinate changes across multiple files in multiple languages — Python, Go, TypeScript, JavaScript — in real codebases with real complexity. Where Verified's easiest tasks require changing just 1–2 lines, Pro's minimum is 10.

Most importantly, Pro is contamination-resistant. It draws from GPL-licensed repositories (creating a legal barrier to inclusion in training data) and proprietary codebases that definitively weren't in any training set. An 858-task held-out set detects overfitting. No model tested on Pro has ever reproduced a verbatim gold patch — the exact failure that discredited Verified.
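OpenAI's memorization findings also suggest a simple way to audit any benchmark for contamination: prompt the model with nothing but the task ID and a hint, then measure how close its output comes to the gold patch. Below is a minimal sketch of that probe, assuming a generic generate(prompt) callable for the model under test; the helper names and the 0.95 "verbatim" threshold are illustrative, not part of OpenAI's or Scale AI's tooling.

```python
import difflib


def normalize(patch: str) -> str:
    """Drop blank lines and leading/trailing whitespace so only content is compared."""
    return "\n".join(line.strip() for line in patch.strip().splitlines() if line.strip())


def contamination_score(model_output: str, gold_patch: str) -> float:
    """Similarity ratio between the model's patch and the gold patch (1.0 = verbatim)."""
    return difflib.SequenceMatcher(
        None, normalize(model_output), normalize(gold_patch)
    ).ratio()


def probe(task_id: str, hint: str, gold_patch: str, generate) -> dict:
    """Ask the model to 'recall' a patch from the task ID and a hint, then compare.

    `generate` is any callable that sends a prompt to the model under test and
    returns its text output -- an assumption here, not a real benchmark API.
    """
    prompt = (
        f"SWE-bench task {task_id}. Hint: {hint}\n"
        "Produce the patch that resolves this issue."
    )
    output = generate(prompt)
    score = contamination_score(output, gold_patch)
    return {"task_id": task_id, "similarity": score, "verbatim": score > 0.95}
```

On a contaminated benchmark, near-verbatim matches appear even though the prompt never contained the issue text; on Pro's held-out tasks, similarity scores staying low is exactly the behavior the held-out set is designed to confirm.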

The Numbers That Matter

On SWE-bench Pro, the leaderboard tells a very different story than the one the industry has been selling:

Claude Opus 4.7 — 64.3%

The proprietary leader, at $5/$25 per million input/output tokens. The score is Anthropic-reported, using Anthropic's own scaffolding. Under Scale AI's standardized SEAL harness, the predecessor Opus 4.5 scored 45.9% — suggesting the 64.3% reflects scaffolding advantage as much as model capability.

Kimi K2.6 — 58.6%

Open-weight, $0.60/$2.50 per million tokens. Moonshot AI's 1T parameter MoE model with 32B active parameters. Within 5.7 points of Opus 4.7 at 1/10th the per-token cost. Runs on a single Mac Studio.

GLM-5.1 — 58.4%

MIT-licensed, $0.95/$3.15 per million tokens. Z.AI's 754B parameter MoE with 40B active. Essentially tied with Kimi K2.6. The most permissive commercial license of any frontier model.

GPT-5.4 — 57.7%

Proprietary, $2.50/$15.00 per million tokens. OpenAI's flagship. Now behind both Kimi K2.6 and GLM-5.1 on the benchmark OpenAI itself recommends.

The gap between the best proprietary model (Opus 4.7 at 64.3%) and the best open models (~58.5%) is roughly 6 points. The gap between GPT-5.4 and the open models is negative — open is ahead. And the open models achieve this at roughly 1/5th to 1/10th the per-token cost.

The 35-Point Reality Check

The gap between SWE-bench Verified and Pro scores reveals how inflated the old benchmark was:

SWE-bench Verified (Contaminated)

  • Claude Opus 4.5: 80.9%
  • Claude Opus 4.6: 80.8%
  • Kimi K2.6: 80.2%
  • Qwen 3.6 Plus: 78.8%
  • GLM-5.1: 77.8%
  • Top models packed within roughly 3 points of each other

SWE-bench Pro (Uncontaminated)

  • Claude Opus 4.7: 64.3%
  • Kimi K2.6: 58.6%
  • GLM-5.1: 58.4%
  • GPT-5.4: 57.7%
  • Qwen 3.6 Plus: 56.6%
  • Real differentiation, no memorization

On Verified, the top models were packed within roughly 3 percentage points of each other — a "statistical noise zone" where the ranking was essentially random. On Pro, models spread across a 7.7-point range, creating meaningful separation. The 35-point drop from Verified to Pro (for Claude Opus 4.5: 80.9% vs. 45.9% under standardized testing) is largely the contamination premium — the credit models received for tasks they'd already memorized.

What Scaffolding Reveals

SWE-bench Pro also exposes a truth that Verified obscured: scaffolding — the agent framework wrapped around the model — matters enormously. The same model can score very differently depending on how it's deployed:

  • Claude Opus 4.5 on SWE-bench Pro with Scale AI's standardized SEAL harness: 45.9%
  • The same model running through Claude Code's custom agent scaffolding: 55.4%
  • Performance swing from scaffolding alone: +9.5pp — more than the gap between most models

This means the 6-point gap between Opus 4.7 and the open models could narrow further with better agent frameworks for local models. Open-source projects like OpenCode and OpenClaw are already building exactly this — agentic coding tools that run open models locally with sophisticated scaffolding, at zero per-token cost.

For Faraday Machines customers, this is the key insight: on-premises deployment gives you control over both the model and the scaffolding. You can optimize the entire stack for your specific workload, rather than accepting whatever agent design a cloud provider ships.

The Cost-Quality Gap Is Now Absurd

Benchmarks only matter in the context of what you pay. Here's the cost to process 10 million input tokens and 2 million output tokens — roughly a day of heavy agentic coding for a small team:

Proprietary Models

  • Claude Opus 4.7: $100/day
  • GPT-5.4: $55/day
  • Annual (250 days): $13,750–$25,000
  • Data leaves your network
  • Usage capped by plan limits

Open Models (API)

  • Kimi K2.6: $11/day
  • GLM-5.1: $15.80/day
  • Annual (250 days): $2,750–$3,950
  • Optional: self-host for $0 per token
  • Unlimited usage on-premises

On-premises deployment eliminates per-token costs entirely. A Faraday Machines cluster running Kimi K2.6 or GLM-5.1 pays only for hardware amortization and electricity. For a team currently paying $100/day for Opus 4.7 API access, the annual savings exceed $20,000 — and you get unlimited usage, zero data leakage, and the freedom to switch models whenever a better one is released.
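The arithmetic behind these figures is simple enough to check. Here is a minimal sketch that reproduces the daily and annual costs above from the per-token list prices quoted earlier in this article; the workload assumption (10M input and 2M output tokens per day, 250 working days per year) comes from the scenario in this section, and the on-prem side is left unpriced because hardware amortization and electricity depend on the cluster.

```python
# Per-million-token prices (input, output) in USD, as quoted in this article.
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.4": (2.50, 15.00),
    "Kimi K2.6": (0.60, 2.50),
    "GLM-5.1": (0.95, 3.15),
}

# Assumed daily agentic-coding workload for a small team (from the scenario above).
INPUT_TOKENS_PER_DAY = 10_000_000
OUTPUT_TOKENS_PER_DAY = 2_000_000
WORKING_DAYS_PER_YEAR = 250


def daily_cost(input_price: float, output_price: float) -> float:
    """API cost for one day of the assumed workload."""
    return (
        INPUT_TOKENS_PER_DAY / 1_000_000 * input_price
        + OUTPUT_TOKENS_PER_DAY / 1_000_000 * output_price
    )


for model, (inp, out) in PRICES.items():
    per_day = daily_cost(inp, out)
    per_year = per_day * WORKING_DAYS_PER_YEAR
    print(f"{model:<16} ${per_day:>7.2f}/day   ${per_year:>10,.2f}/year")

# API spend displaced by moving the Opus 4.7 workload to a self-hosted open model.
# What remains is hardware amortization plus electricity, which this article
# does not price, so it is left out of the calculation.
opus_annual = daily_cost(*PRICES["Claude Opus 4.7"]) * WORKING_DAYS_PER_YEAR
print(f"Annual Opus 4.7 API spend displaced by on-prem: ${opus_annual:,.2f}")
```

Running it reproduces the numbers above ($100/day and $25,000/year for Opus 4.7, $11/day for Kimi K2.6, and so on); the savings claim holds as long as amortized hardware and electricity come in under roughly $5,000 per year, a figure this article does not itself quantify.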

Where Proprietary Still Leads

Honesty matters more than narrative. Claude Opus 4.7 still leads SWE-bench Pro by 6 points, and it maintains edges on certain reasoning benchmarks:

  • GPQA Diamond: Opus 4.7 scores 94.2%, vs. Kimi K2.6's 90.5% — a 3.7-point gap on graduate-level science reasoning
  • AIME 2026: GPT-5.4 scores 99.2%, vs. K2.6's 96.4% — mathematical reasoning still favors proprietary
  • Safety and compliance controls: Anthropic's Constitutional AI approach provides more transparent safety architecture than most open models
  • Ecosystem maturity: Claude Code, Cursor integration, and established enterprise tooling

But these advantages are narrowing, not widening. Every benchmark cycle closes the gap further. And for the specific task that matters most to engineering teams — writing and reviewing code in real codebases — the gap has effectively closed. A 6-point benchmark difference does not justify a 10x price difference for most use cases.

The Practical Implication

The "proprietary = better" assumption drove two years of purchasing decisions. Teams paid premium API prices because they believed the quality gap was large and permanent. SWE-bench Pro has shown that neither is true.

The smart play for engineering teams in 2026 is the hybrid approach:

  • Default to open models for the 80% of coding tasks where quality is indistinguishable — refactoring, test writing, code review, documentation, bug fixing
  • Reserve proprietary APIs for the rare tasks that genuinely need Opus-level reasoning — architectural decisions, security audits, complex cross-system debugging
  • Run open models on-premises to eliminate per-token costs, data leakage, and usage caps for the majority of daily work
  • Reinvest the savings into better agent scaffolding, fine-tuning on your codebase, and 24/7 automated batch workloads

SWE-bench Pro didn't just change the leaderboard. It changed the economics. When open models match proprietary performance at 1/10th the cost — and run on hardware you own — the case for paying cloud AI premiums collapses for all but the most specialized workloads.

References

[1] OpenAI. (2026). "Why SWE-bench Verified no longer measures frontier coding capabilities." February 23, 2026. Available at: openai.com

[2] Scale AI. (2026). SWE-bench Pro Leaderboard (Public Dataset). Available at: scale.com

[3] Morph Labs. (2026). "SWE-bench Pro Leaderboard (2026): Why 46% Beats 81%." Available at: morphllm.com

[4] BenchLM.ai. (2026). SWE-bench Pro Benchmark 2026. Available at: benchlm.ai

[5] Grey Newell. (2026). "SWE-bench Verified Is Broken: 5 Things I Found in the Source Code." Available at: greynewell.com

[6] Verdent AI. (2026). "Kimi K2.6 vs Claude Opus 4.6 vs GPT-5.4: Agentic Coding Benchmarks." Available at: verdent.ai

[7] OfficeChai. (2026). "China's Z.AI Releases GLM-5.1, Beats All US Models On SWE-Bench Pro." Available at: officechai.com

Run the Numbers Yourself

See how open models perform on your codebase. Faraday Machines clusters run Kimi K2.6 and GLM-5.1 at full speed with zero per-token costs. Benchmark them against your current cloud AI — the results may surprise you.

Schedule a Benchmark Session
Free benchmark comparison and cost analysis