With Evals, OpenAI hopes to crowdsource AI model testing

Alongside GPT-4, OpenAI has open-sourced a software framework to evaluate the performance of its AI models. OpenAI says the tooling, called Evals, will allow anyone to report shortcomings in its models to help guide improvements.

It’s a sort of crowdsourcing approach to model testing, OpenAI explains in a blog post.

“We use Evals to guide development of our models (both identifying shortcomings and preventing regressions), and our users can apply it for tracking performance across model versions and evolving product integrations,” OpenAI writes. “We are hoping Evals becomes a vehicle to share and crowdsource benchmarks, representing a maximally wide set of failure modes and difficult tasks.”

OpenAI created Evals to develop and run benchmarks for evaluating models like GPT-4 while inspecting their performance. With Evals, developers can use datasets to generate prompts, measure the quality of completions provided by an OpenAI model and compare performance across different datasets and models.
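To make that workflow concrete, here is a minimal sketch of the same idea in plain Python. It does not use the Evals library or its API; the prompt template, the exact-match scoring rule, the toy dataset and the two stub "models" are all assumptions made up for illustration, standing in for real model calls.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]       # {"input": question, "ideal": expected answer}
Model = Callable[[str], str]  # takes a prompt, returns a completion


def build_prompt(sample: Sample) -> str:
    """Turn one dataset row into a prompt (hypothetical template)."""
    return f"Answer with a single word or number.\nQ: {sample['input']}\nA:"


def exact_match(completion: str, ideal: str) -> bool:
    """Score a completion: case-insensitive match after trimming whitespace."""
    return completion.strip().lower() == ideal.strip().lower()


def run_eval(model: Model, samples: List[Sample]) -> float:
    """Return the fraction of samples the model answers correctly."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(exact_match(model(build_prompt(s)), s["ideal"]) for s in samples)
    return correct / len(samples)


def compare(models: Dict[str, Model], samples: List[Sample]) -> Dict[str, float]:
    """Run the same dataset against several models and collect their scores."""
    return {name: run_eval(model, samples) for name, model in models.items()}


# Toy dataset (illustrative only, not taken from any real benchmark).
SAMPLES: List[Sample] = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {"input": "What is the capital of France?", "ideal": "Paris"},
    {"input": "What color is a clear daytime sky?", "ideal": "Blue"},
    {"input": "How many days are in a week?", "ideal": "7"},
]


def model_a(prompt: str) -> str:
    """Stub model that only knows two of the four answers."""
    if "2 + 2" in prompt:
        return "4"
    if "France" in prompt:
        return " paris "
    return "unknown"


def model_b(prompt: str) -> str:
    """Stub model that knows three of the four answers."""
    if "2 + 2" in prompt:
        return "4"
    if "France" in prompt:
        return "Paris"
    if "week" in prompt:
        return "7"
    return "unknown"


if __name__ == "__main__":
    scores = compare({"model_a": model_a, "model_b": model_b}, SAMPLES)
    for name, accuracy in scores.items():
        print(f"{name}: {accuracy:.2f}")
```

Swapping a stub for a function that calls a real model would leave the rest unchanged, which is roughly what makes it possible to track the same benchmark across model versions.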

Evals, which is compatible with several popular AI benchmarks, also supports writing new classes to implement custom evaluation logic. As an example to follow, OpenAI created a logic puzzles evaluation that contains 10 prompts where GPT-4 fails.

It’s all unpaid work, unfortunately. But to incentivize Evals usage, OpenAI plans to grant GPT-4 access to those who contribute “high-quality” benchmarks.

“We believe that Evals will be an integral part of the process for using and building on top of our models, and we welcome direct contributions, questions, and feedback,” the company wrote.

With Evals, OpenAI, which recently said it would stop using customer data to train its models by default, is following in the footsteps of others who’ve turned to crowdsourcing to make AI models more robust.

In 2017, the Computational Linguistics and Information Processing Laboratory at the University of Maryland launched a platform dubbed Break It, Build It, which let researchers submit models to users tasked with coming up with examples to defeat them. And Meta maintains a platform called Dynabench that has users “fool” models designed to analyze sentiment, answer questions, detect hate speech and more.

Kyle Wiggers

AI Editor

Kyle Wiggers was TechCrunch’s AI Editor until June 2025. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Manhattan with his partner, a music therapist.
