SWE-bench, GPQA, BrowseComp: What AI Benchmarks Really Measure (and What They Hide)

Every model release comes with a benchmark table. But what do SWE-bench Pro, GPQA Diamond, CursorBench, and BrowseComp actually test? And why might a model improve in coding while regressing in research in the same version?

On April 16, 2026, Anthropic released Claude Opus 4.7 with a benchmark table that tells two stories at once: the model improved by 10.9 points on SWE-bench Pro and dropped 4.4 points on BrowseComp.

How can the same model advance and regress in the same version? The answer lies in what each benchmark actually measures—and what none of them measure correctly. If you read the news about Opus 4.7 and were left wondering what these names mean, this article is the glossary you were missing.

SWE-bench Verified — the standard that almost died

SWE-bench Verified is a human-validated subset of the original SWE-bench, which was created by Princeton researchers. Released in 2024, it contains 500 tasks taken from real GitHub issues. The model receives the bug description and the repository as it stood when the issue was filed, and must produce a patch that fixes the problem without breaking anything that already worked.

The score is the percentage of patches that pass both types of tests: those that should pass after the fix and those that were already passing and must continue to pass.
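
The scoring rule can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the function names, the dictionary layout of the test results, and the test names are all assumptions made for the example.

```python
from typing import Dict, List


def is_resolved(results: Dict[str, bool],
                fail_to_pass: List[str],
                pass_to_pass: List[str]) -> bool:
    """Return True only if every required test passes after the patch.

    results maps a test name to True (passed) or False (failed).
    A test that is missing from results is treated as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


def benchmark_score(outcomes: List[bool]) -> float:
    """Percentage of tasks whose patch counted as resolved."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical task: one test the fix should turn green,
# two tests that were green before and must stay green.
results = {"test_bug_fixed": True, "test_login": True, "test_logout": False}
resolved = is_resolved(results, ["test_bug_fixed"], ["test_login", "test_logout"])
print(resolved)  # False: the patch fixed the bug but broke test_logout
```

A patch that fixes the reported bug but breaks one previously passing test scores zero for that task; there is no partial credit in this sketch.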

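The benchmark's weak point is contamination: its tasks come from public GitHub repositories, and models have been shown to reproduce the correct solution from the task ID alone. Even so, Anthropic and other labs continue to report the number. Claude Opus 4.7 reached 87.6%, but that figure should be read with skepticism.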

SWE-bench Pro — the version that tries to fix the problem

Released by Scale AI in September 2025, SWE-bench Pro was created specifically to address its predecessor's flaws. It has 1,865 tasks across 41 repositories, includes code in Python, Go, TypeScript, and JavaScript, and each task requires at least 10 modified lines—in practice, the average solution changes 107 lines distributed across 4 files.

The anti-contamination differentiator is the three-layer structure:

— Part of the dataset uses code under a copyleft license (GPL), which legally disincentivizes inclusion in proprietary training.

— A permanently private portion uses code from early-stage startups that was never public—inaccessible to any training crawler.

Figure: SWE-bench Verified vs SWE-bench Pro, comparison

SWE-bench Verified: 500 tasks from public GitHub; 1-2 lines changed on average; Python only; contamination confirmed; Opus 4.7: 87.6%.
SWE-bench Pro: 1,865 tasks across 41 repositories; 107 lines across 4 files on average; Python, Go, TypeScript, JS; private anti-contamination dataset; Opus 4.7: 64.3%.

CursorBench — the real-world benchmark

CursorBench is the most honest proposal: instead of creating synthetic tasks, it uses code that the Cursor engineering team itself produced during real work in the IDE.

The mechanism is Cursor Blame: it tracks the code that was committed and associates each block with the prompt that generated it. The resulting tasks have an average of 352 lines across 8 files, with short and underspecified descriptions—exactly how developers actually ask models for things on a daily basis.

Four dimensions are evaluated: correctness, code quality, efficiency, and interaction behavior. An agentic grader recognizes multiple valid solutions for the same problem.

GPQA Diamond — the benchmark that is dying of success

GPQA Diamond is the most difficult subset of the Graduate-Level Google-Proof Q&A Benchmark: 198 multiple-choice questions in PhD-level biology, physics, and chemistry. The construction methodology is its strong point. Each question is written by a PhD expert and validated by other experts, and it only enters the Diamond subset if two expert validators get it right and non-experts with access to Google get it wrong.

The problem: the benchmark is almost dead due to saturation. The trajectory tells the story:

Figure: Evolution of GPQA Diamond, from challenging to saturated

GPT-4: 39% · Claude 3: 55% · DeepSeek-R1: 72% · o3: 83% · Gemini 3: 94% · Opus 4.7: 94.2% (saturated zone)

When three different models reach 94% on 198 questions, the difference between them is statistical noise, not real performance. GPQA Diamond has stopped discriminating between models at the top. That is why successors like Humanity's Last Exam (where even the best model scores only 41%) and FrontierMath have emerged.
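
A rough calculation shows why. The sketch below treats each question as an independent pass/fail trial and uses the normal approximation to the binomial, which is only approximate this close to the ceiling; the function name is invented for the example.

```python
import math


def score_interval(accuracy: float, n_questions: int, z: float = 1.96):
    """Normal-approximation confidence interval for an accuracy score.

    z = 1.96 corresponds to a 95% interval.
    """
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    margin = z * se
    return accuracy - margin, accuracy + margin


low, high = score_interval(0.94, 198)
print(f"{low:.3f} to {high:.3f}")  # 0.907 to 0.973
```

With 198 questions, a single question is worth about 0.5 percentage points, and the 95% interval around a 94% score spans roughly 3.3 points in each direction. A gap of a few tenths of a point between two models falls well inside that band.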

BrowseComp — where Opus 4.7 worsened

Created by OpenAI in April 2025, BrowseComp tests the ability of agents to find information that is adversarially difficult on the open web. There are 1,266 questions built by the 'inverted question' method: the writer starts from a verifiable fact and creates a question that combines multiple restrictive characteristics in a huge search space.

Example of the logic: instead of asking 'who is the CEO of company X', the question would be 'which tech company founded in 2019, headquartered in Austin, that underwent an acquisition in 2023, has a CEO who studied at a specific university?'. Hard to find, easy to verify once you find it.
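
The asymmetry between finding and verifying can be illustrated with a toy check. Everything below is hypothetical: the `Company` fields, the two fictional companies, and "Example University" stand in for the constraints in the example question. The point is that once a candidate is in hand, checking it is a handful of comparisons, while the agent's real difficulty is that no such candidate list exists up front.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Company:
    name: str
    founded: int
    headquarters: str
    acquired_in: int
    ceo_university: str


def satisfies(c: Company) -> bool:
    """Check one candidate against every constraint in the question."""
    return (c.founded == 2019
            and c.headquarters == "Austin"
            and c.acquired_in == 2023
            and c.ceo_university == "Example University")


# Fictional candidates; a real agent has to discover these on the open web.
candidates: List[Company] = [
    Company("Hypothetical Co", 2019, "Austin", 2023, "Example University"),
    Company("Placeholder Inc", 2019, "Austin", 2022, "Example University"),
]

matches = [c.name for c in candidates if satisfies(c)]
print(matches)  # ['Hypothetical Co']
```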

GDPVal-AA — measuring real economic value

GDPVal is the most conceptually ambitious benchmark: instead of academic tests, it measures the ability of models to produce real work deliverables. There are 1,320 tasks across 44 occupations in the 9 sectors that contribute most to the US GDP—legal memos, Excel financial models, engineering blueprints, nursing care plans.

The evaluation is blind head-to-head: experienced human judges compare the model's deliverable with that of a real expert and classify it as win, tie, or loss. GDPVal-AA is the reimplementation by Artificial Analysis using Elo ranking.
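
To make the two scoring styles concrete, here is a minimal sketch of how win/tie/loss judgments can be turned into a win rate and into an Elo-style rating. It uses the textbook Elo formulas; counting a tie as half a win and the K-factor of 32 are assumptions, and the exact procedure used by Artificial Analysis is not described here.

```python
def win_rate(wins: int, ties: int, losses: int) -> float:
    """Share of comparisons won, counting a tie as half a win (an assumption)."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("no comparisons to score")
    return (wins + 0.5 * ties) / total


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> float:
    """A's rating after one comparison; score_a is 1 (win), 0.5 (tie) or 0 (loss)."""
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))


print(win_rate(40, 20, 40))            # 0.5
print(elo_update(1500.0, 1500.0, 1.0)) # 1516.0
```

A win rate compares the model against the expert directly, while an Elo rating places many models on one scale from pairwise outcomes, so the two numbers are not interchangeable.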

MCP-Atlas — agents using real tools

MCP-Atlas evaluates tool use via the Model Context Protocol, the open standard that became the reference in 2025. There are 1,000 tasks executed against 36 real MCP servers with 220 tools (GitHub, Slack, Notion, Airtable, MongoDB, etc.). The agent receives a prompt without knowing which servers are available and must discover which tools to call among 10 to 25 exposed options, including distractors.

The scoring uses partial credit: the answer must contain specific verifiable factual claims. The dominant errors in the models are incorrect server selection and incorrect parameterization—precisely the failures of agents in real production.
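
Claim-based partial credit can be sketched as the fraction of expected claims that appear in the answer. The substring matching below is a deliberately naive placeholder and an assumption of this example; the benchmark's real matching procedure is not described here, and the sample answer and claims are invented.

```python
from typing import List


def partial_credit(answer: str, expected_claims: List[str]) -> float:
    """Fraction of expected claims found in the answer.

    Matching is a case-insensitive substring test, used here only
    as a stand-in for a real claim-verification step.
    """
    if not expected_claims:
        return 0.0
    text = answer.lower()
    hits = sum(1 for claim in expected_claims if claim.lower() in text)
    return hits / len(expected_claims)


answer = "The repo has 42 open issues and the latest release is v2.1.0."
claims = ["42 open issues", "v2.1.0", "MIT license"]
score = partial_credit(answer, claims)
print(round(score, 2))  # 0.67
```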

Terminal-Bench 2.0 and OSWorld — what models do on a computer

Terminal-Bench 2.0 measures execution in a Linux terminal: 89 carefully curated tasks in software engineering, bioinformatics, security, and gaming, each with a Docker environment and verification via automated tests. The model must solve the complete problem autonomously in the terminal.

OSWorld goes further: it evaluates multimodal agents on real operating systems. The model receives screenshots and must produce mouse and keyboard actions. Verification is by the final state of the system—without an LLM as a judge. The human baseline is 72%. In April 2026, Opus 4.7 reached 78%—frontier models have already surpassed humans on this benchmark.

Finance Agent v1.1 — where Opus 4.7 advanced

The Finance Agent Benchmark from Vals AI contains 537 questions, with quality control by analysts from Goldman Sachs and Citadel. The agent has access to real tools: EDGAR via SEC_API, web search, and an HTML parser. Tasks range from extracting data from 10-Ks to M&A financial modeling.

Of the 537 questions, 337 are permanently private: they were never public and never will be. Opus 4.7 reached 64.4% on this benchmark, 4.7 percentage points above Opus 4.6. The reason is structural. Finance Agent rewards methodical execution of a predictable pipeline, where Opus 4.7's stricter literal instruction-following and self-verification pay off, which is exactly the opposite of what BrowseComp requires.

Why numbers lie (partially)

Three structural problems affect all AI benchmarks, and the industry has started talking about them openly.

Data contamination. Public benchmarks end up in training corpora. Research shows that 144 exposures during training already produce detectable overfitting. OpenAI detected that models could reproduce the correct solution for SWE-bench Verified just with the task ID—without seeing the code.

Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. An analysis of 2 million Chatbot Arena battles revealed that Meta tested 27 private variants of Llama-4 before publishing the best one, and that fine-tuning with Arena data produced gains of 112% on ArenaHard without improving any other capability.

Gap between benchmark and real world. METR demonstrated that about half of the patches that pass automated tests on SWE-bench would not be accepted by real maintainers. And in a randomized controlled trial with open-source developers, AI tools made developers 19% slower.

What really matters when choosing a model

In practice, professionals who use AI in production look at variables that public benchmarks rarely capture:

Cost per token and latency. A model that is 10% better on a benchmark but costs 3x more may be the wrong choice for your use case.
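
One way to put this trade-off into numbers is cost per successful task. The sketch below assumes failed attempts are simply wasted spend; the prices and success rates are invented for illustration and are not measurements of any real model.

```python
def cost_per_success(cost_per_task: float, success_rate: float) -> float:
    """Average spend needed to get one successful result."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_task / success_rate


# Hypothetical numbers: model B is 10% better (relative) but costs 3x more.
model_a = cost_per_success(cost_per_task=1.00, success_rate=0.60)
model_b = cost_per_success(cost_per_task=3.00, success_rate=0.66)

print(f"A: {model_a:.2f} per success")  # A: 1.67 per success
print(f"B: {model_b:.2f} per success")  # B: 4.55 per success
```

Under these made-up figures the cheaper model delivers a successful result for well under half the price, so the better benchmark score only wins when each failure is expensive enough to outweigh the extra spend.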

Reliability in your specific domain. Finance and legal experts report that the domain ranking is different from the general ranking—Sonnet 4.5 outperforms Opus in some specific legal tasks.

Real behavior with your data. The only way to know is to test. Teams like Cursor, Perplexity, and Windsurf perform dynamic routing between models with A/B testing on production telemetry—they don't choose by leaderboard.

Figure: Benchmark map, what each one actually measures

Each benchmark measures a different dimension:
— Coding: SWE-bench Verified, SWE-bench Pro, CursorBench, Terminal-Bench 2.0
— Reasoning: GPQA Diamond, Humanity's Last Exam, MMLU
— Real Work: GDPVal-AA, Finance Agent v1.1
— Computer Use: OSWorld, OSWorld-Verified
— Tool Use: MCP-Atlas
— Web Research: BrowseComp

Benchmarks are maps, not territories. A map that shows only roads says nothing about the terrain. What the 2026 benchmarks are learning—at the cost of much public embarrassment—is that no number captures everything, and that the gap between 'passes the benchmark' and 'works in your code' will never go to zero.

The best benchmark remains your own use case.