OpenAI’s GDPval Benchmark Shows AI Models Are Now Performing Nearly As Well As Experts At Economically Valuable Tasks

There are plenty of benchmarks measuring how AI models perform in specific fields like math and programming, but a new benchmark from OpenAI aims to see how well AI models can perform the knowledge-work tasks that are vital to the economy.

OpenAI has introduced GDPval, a groundbreaking evaluation framework that measures how well AI models perform on real-world, economically valuable tasks across 44 professional occupations. The results reveal that today’s frontier AI models are rapidly approaching expert-level performance on the kinds of knowledge work that drives economic productivity.

Beyond Academic Benchmarks

While previous AI evaluations have focused on academic tests and coding challenges, GDPval takes a different approach by measuring performance on actual work products from experienced professionals. The evaluation spans nine major industries contributing significantly to U.S. GDP, including healthcare, finance, manufacturing, legal services, and software development.

The benchmark includes 1,320 specialized tasks designed and vetted by professionals with an average of 14 years of experience. Unlike traditional text-based prompts, these tasks mirror real workplace scenarios, complete with reference files, context documents, and deliverables ranging from legal briefs and engineering blueprints to customer support conversations and nursing care plans.

Expert-Level Performance Within Reach

In blind evaluations where industry experts compared AI-generated outputs against human professional work, OpenAI found that the best models are producing outputs rated as good as or better than experts in a substantial portion of tasks. Claude Opus 4.1 led the pack, with outputs matching or exceeding expert quality in just under half of the 220 tasks in the public gold set. The model particularly excelled in aesthetic elements like document formatting and slide layout.

GPT-5, released in summer 2025, demonstrated the strongest performance on accuracy-focused tasks requiring domain-specific knowledge. The progression from GPT-4o to GPT-5 over roughly one year showed more than a tripling of performance on these economically valuable tasks, following a clear linear improvement trend.

Speed and Cost Advantages

Beyond quality, frontier models completed GDPval tasks approximately 100 times faster and 100 times cheaper than industry experts, based on pure inference time and API costs. However, OpenAI notes these figures don’t account for the human oversight, iteration, and workplace integration required in real-world applications.

A Representative Cross-Section of Knowledge Work

The occupations included in GDPval were selected through a rigorous process. OpenAI started with industries contributing over five percent to U.S. GDP, then identified the five highest-wage occupations within each industry that qualified as predominantly knowledge work. An occupation made the cut if at least 60 percent of its component tasks didn’t require physical labor.
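The selection process described above amounts to a two-stage filter. The sketch below illustrates it in Python, using made-up data structures; the field names and the exact form of the inputs are assumptions for illustration, with only the thresholds (over 5 percent of GDP, five highest-wage occupations, at least 60 percent non-physical tasks) taken from the article.

```python
def select_occupations(industries):
    """Illustrative sketch of GDPval's occupation-selection filter.

    `industries` is a hypothetical list of dicts; the real pipeline's
    data sources and field names are not public in this form.
    """
    selected = []
    for ind in industries:
        # Keep only industries contributing over 5% of U.S. GDP
        if ind["gdp_share"] <= 0.05:
            continue
        # Take the five highest-wage occupations in the industry
        ranked = sorted(ind["occupations"], key=lambda o: o["wage"], reverse=True)
        for occ in ranked[:5]:
            # Qualify as knowledge work: >= 60% of tasks need no physical labor
            non_physical = sum(1 for t in occ["tasks"] if not t["physical"])
            if non_physical / len(occ["tasks"]) >= 0.60:
                selected.append(occ["name"])
    return selected
```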

The resulting mix spans software developers, lawyers, accountants, registered nurses, mechanical engineers, financial analysts, pharmacists, producers, journalists, and dozens more. Each task underwent an average of five rounds of expert review to ensure it accurately represented real professional work.

The Grading Challenge

Evaluating open-ended professional work products presents unique challenges compared to multiple-choice tests. OpenAI addressed this by recruiting expert graders from each occupation who blindly compared AI and human deliverables without knowing which was which. These experts ranked outputs and classified AI work as better than, as good as, or worse than human benchmarks.
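From those three-way judgments, the headline numbers reported earlier (e.g. "as good as or better than experts in just under half of tasks") reduce to a simple win-or-tie rate over graded tasks. A minimal sketch, assuming judgments are recorded as the strings "better", "as_good", and "worse" (the label encoding is an assumption, not OpenAI's actual schema):

```python
from collections import Counter

def win_or_tie_rate(judgments):
    """Fraction of tasks where the AI deliverable was rated as good as
    or better than the human expert's work (hypothetical label names)."""
    counts = Counter(judgments)
    return (counts["better"] + counts["as_good"]) / len(judgments)
```

For example, `win_or_tie_rate(["better", "worse", "as_good", "worse"])` yields 0.5.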

Task writers also created detailed scoring rubrics to add consistency. OpenAI developed an experimental automated grading system that attempts to predict human expert judgments, though the company acknowledges it’s not yet reliable enough to replace human graders.

Implications for the Future of Work

The results suggest AI is becoming capable of handling certain routine, well-specified tasks that previously required expert attention. OpenAI frames this as an opportunity for human workers to focus on creative, judgment-intensive aspects of their roles while delegating more repetitive work to AI systems.

The company emphasizes that most jobs involve complexities beyond discrete tasks that can be easily specified, and that its goal is to ensure broad access to these productivity tools while supporting workers through transitions in the job market.

Current Limitations and Next Steps

OpenAI acknowledges GDPval represents an early step. The current version uses one-shot evaluations that don’t capture iterative workflows common in real professional settings, such as revising documents after feedback or refining analyses. It also doesn’t measure how models handle the ambiguity professionals often face when deciding what deliverable to create in the first place. But with top AI models already scoring nearly as well as experts on many tasks, it appears that AI-driven job losses among knowledge workers will only accelerate in the coming years.
