New OpenAI Benchmark Finds AI Can Complete Upwork Software Engineering Tasks Worth 40% Of Total Payouts

While most people agree that AI is rapidly getting better at coding, a company at the forefront of building AI models wants to measure how good it is really getting.

OpenAI has launched a new benchmark named SWE-Lancer, which seeks to test the real-world coding abilities of LLMs. “Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts,” OpenAI said on X.

“SWE-Lancer tasks span the full engineering stack, from UI/UX to systems design, and include a range of task types, from $50 bug fixes to $32,000 feature implementations. SWE-Lancer includes both independent engineering tasks and management tasks, where models choose between technical implementation proposals,” it added.

The benchmark has 1,488 software engineering tasks from the popular freelancing site Upwork. Together these tasks carried $1 million in payouts, meaning real clients were willing to pay that much to have them completed. Human freelancers took an average of 21 days to complete them. OpenAI then tested top LLMs on the same tasks.

Surprisingly, the best performance came from Claude 3.5 Sonnet, made by OpenAI’s rival Anthropic. It completed tasks worth $403,325 of the $1 million total and failed the rest. Next came OpenAI’s o1, which completed tasks worth $380,350, followed by GPT-4o at $303,525.
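
To make the arithmetic behind the "40 percent" figure explicit, here is a minimal sketch that converts each model's reported earnings into a share of the total payout. The dollar amounts are the ones quoted above. The exact total of $1,000,000 and the helper name `payout_share` are assumptions for illustration, since the article gives the total only as "$1 million."

```python
# Earnings per model as reported in the article, in US dollars.
EARNED_USD = {
    "Claude 3.5 Sonnet": 403_325,
    "o1": 380_350,
    "GPT-4o": 303_525,
}

# Assumption: the "$1 million" total is taken as exactly 1,000,000.
TOTAL_PAYOUT_USD = 1_000_000


def payout_share(earned_usd, total_usd=TOTAL_PAYOUT_USD):
    """Return earnings as a percentage of the total payout value."""
    return 100 * earned_usd / total_usd


# Share of total payout value earned by each model.
shares = {model: payout_share(usd) for model, usd in EARNED_USD.items()}

for model, pct in shares.items():
    print(f"{model}: {pct:.1f}% of total payout value")
```

Under these assumptions the shares come to roughly 40, 38, and 30 percent. These are shares of dollar value, not of task count, so a model that solves a few expensive tasks can score higher than one that solves many cheap ones.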


This benchmark suggests that OpenAI believes LLMs will soon get quite good at finishing the kind of programming tasks posted on freelancing websites. Sam Altman has previously said that OpenAI has internally developed an AI system that ranks 50th in the world at coding, and that he expects the best programmer in the world to be an AI by the end of the year. With OpenAI now measuring how many real-world software engineering tasks LLMs can do, and the best model already completing tasks worth about 40 percent of the total payout, the job market for software engineers could look very different by the end of the year.
