Millions of homes have voice-enabled devices, but when was the last time you heard a piece of synthesized speech longer than a handful of seconds? WellSaid Labs has pushed the field ahead with a voice engine that can easily and quickly generate hours of voice content that sounds as good as or better than the snippets we hear every day from Siri and Alexa.
The company has been working since its public debut last year to advance its tech from impressive demo to commercial product, and in the process found a lucrative niche that it can build from.
CTO Michael Petrochuk explained that early on, the company had essentially based its technology on prior research — Google’s Tacotron project, which established a new standard for realism in artificial speech.
“Despite being released two years ago, Tacotron 2 is still state of the art. But it has a couple issues,” explained Petrochuk. “One, it’s not fast — it takes three minutes to produce one second of audio. And it’s built to model 15 seconds of audio. Imagine that in a workflow where you’re generating 10 minutes of content — it’s orders of magnitude off where we want to be.”
WellSaid completely rebuilt its model with a focus on speed, quality and length. That may sound like “focusing” on everything at once, but there are always plenty more parameters to optimize for. The result is a model that can generate extremely high-quality speech in any of 15 voices (and several languages) at about half real time, so a minute-long clip would take about 36 seconds to generate instead of a couple of hours.
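To make the speed comparison concrete, here is a rough back-of-the-envelope sketch of the arithmetic. The figures come from the numbers quoted above (three minutes of compute per second of audio for Tacotron 2, about 36 seconds for a minute-long clip for WellSaid); the function name and the term "real-time factor" are illustrative choices, not anything WellSaid has published.

```python
def generation_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Compute time needed to synthesize a clip of the given length.

    real_time_factor is seconds of compute per second of audio
    (a hypothetical name used here for illustration).
    """
    return audio_seconds * real_time_factor


# Figures taken from the article, not measured independently.
TACOTRON_RTF = 3 * 60.0      # "three minutes to produce one second of audio"
WELLSAID_RTF = 36.0 / 60.0   # about 36 s of compute for a 60 s clip

clip_seconds = 60.0
slow = generation_seconds(clip_seconds, TACOTRON_RTF)  # 10800 s, i.e. 3 hours
fast = generation_seconds(clip_seconds, WELLSAID_RTF)  # about 36 s
speedup = slow / fast                                  # roughly 300x

print(f"Tacotron 2: {slow / 3600:.1f} h, WellSaid: {fast:.0f} s, speedup: {speedup:.0f}x")
```

On these numbers the gap is a factor of roughly 300, which squares with Petrochuk's "orders of magnitude" remark.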
This seemingly basic capability has plenty of benefits. Not only is generation faster, but working with the results is simpler too. As a producer of audio content, you can just drop in a script hundreds of words long, listen to what it puts out, then tweak its pronunciation or cadence with a few keystrokes. Tacotron changed the synthetic speech space, but it has never really been a product. WellSaid combines Tacotron’s advances with its own to create both a usable piece of software and arguably a better speech system overall.
As evidence, clips generated by the model (15-second ones, so they can be compared with Tacotron and others) reached the milestone of being rated as highly as human voices in tests organized by WellSaid. There’s no objective measure for this kind of thing, but asking lots of humans to weigh in on how human something sounds is a good place to start.
As part of the team’s work to achieve “human parity” under these conditions, they also released a number of audio clips demonstrating how the model can produce much more demanding content.