
Microsoft’s New AI Can Clone Your Voice In Just 3 Seconds

Image Credit: Shutterstock.com/Tancha

AI is being used to generate everything from images to text to artificial proteins, and now another thing has been added to the list: speech. Last week researchers from Microsoft released a paper on a new AI called VALL-E that can accurately simulate anyone’s voice based on a sample just three seconds long. VALL-E isn’t the first speech simulator to be created, but it’s built in a different way than its predecessors—and could carry a greater risk for potential misuse.

This article was written by Vanessa Bates Ramirez and originally published by Singularity Hub.

Most existing text-to-speech models use waveforms (graphical representations of sound waves as they move through a medium over time) to create fake voices, tweaking characteristics like tone or pitch to approximate a given voice. VALL-E, though, takes a sample of someone’s voice and breaks it down into components called tokens, then uses those tokens to create new sounds based on the “rules” it already learned about this voice. If a voice is particularly deep, or a speaker pronounces their A’s in a nasal-y way, or they’re more monotone than average, these are all traits the AI would pick up on and be able to replicate.

The model is based on a technology called EnCodec by Meta, which was released just this past October. The tool uses a three-part system to compress audio to 10 times smaller than MP3s with no loss in quality; one of the uses its creators intended was improving the quality of voice and music on calls made over low-bandwidth connections.

To train VALL-E, its creators used an audio library called LibriLight, whose 60,000 hours of English speech is primarily made up of audiobook narration. The model yields its best results when the voice being synthesized is similar to one of the voices from the training library (of which there are over 7,000, so that shouldn’t be too tall of an order).

Besides recreating someone’s voice, VALL-E also simulates the audio environment from the three-second sample. A clip recorded over the phone would sound different than one made in person, and if you’re walking or driving while talking, the unique acoustics of those scenarios are taken into account.

Some of the samples sound fairly realistic, while others are still very obviously computer-generated. But there are noticeable differences between the voices; you can tell they’re based on people who have different speaking styles, pitches, and intonation patterns.

The team that created VALL-E knows it could very easily be used by bad actors; from faking sound bites of politicians or celebrities to using familiar voices to request money or information over the phone, there are countless ways to take advantage of the technology. They’ve wisely refrained from making VALL-E’s code publicly available, and included an ethics statement at the end of their paper (which won’t do much to deter anyone who wants to use the AI for nefarious purposes).

It’s likely just a matter of time before similar tools spring up and fall into the wrong hands. The researchers suggest the risks that models like VALL-E will present could be mitigated by building detection models to gauge whether audio clips are real or synthesized. If we need AI to protect us from AI, how do we know whether these technologies are having a net positive impact? Time will tell.
