The emergence in the last week of a particularly effective voice synthesis machine learning model called VALL-E has prompted a new wave of concern over the possibility of deepfake voices made quick and easy: quickfakes, if you will. But VALL-E is more iterative than breakthrough, and its capabilities aren’t as new as you might think. Whether that means you should be more or less worried is up to you.
Voice replication has been a subject of intense research for years, and the results have been good enough to power plenty of startups, like WellSaid, Papercup and Respeecher. The last of these is even being used to create authorized voice reproductions of actors like James Earl Jones. Yes: from now on Darth Vader will be AI generated.
VALL-E, posted on GitHub by its creators at Microsoft last week, is a “neural codec language model” that uses a different approach to rendering voices than many before it. Its larger training corpus and some new methods allow it to create “high-quality personalized speech” using just three seconds of audio from a target speaker.
That is to say, all you need is an extremely short clip like the following (all clips from Microsoft’s paper):