Microsoft’s New AI Can Clone Your Voice In Just 3 Seconds

AI is being used to generate everything from images to text to artificial proteins, and now another thing has been added to the list: speech. Last week researchers from Microsoft released a paper on a new AI called VALL-E that can accurately simulate anyone’s voice based on a sample just three seconds long. VALL-E isn’t the first speech simulator to be created, but it’s built in a different way than its predecessors—and could carry a greater risk for potential misuse.

This article was written by Vanessa Bates Ramirez and originally published by Singularity Hub.

Most existing text-to-speech models use waveforms (graphical representations of sound waves as they move through a medium over time) to create fake voices, tweaking characteristics like tone or pitch to approximate a given voice. VALL-E, though, takes a sample of someone’s voice and breaks it down into components called tokens, then uses those tokens to create new sounds based on the “rules” it already learned about this voice. If a voice is particularly deep, or a speaker pronounces their A’s in a nasal-y way, or they’re more monotone than average, these are all traits the AI would pick up on and be able to replicate.
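To make the idea of audio "tokens" concrete, here is a toy sketch in Python. It is not VALL-E's method: the real model gets its tokens from a learned neural audio codec, while this example uses plain vector quantization with a tiny hand-written codebook. The names `CODEBOOK`, `tokenize`, and `detokenize` are hypothetical and chosen only for illustration.

```python
# Toy vector quantization: map short chunks of a waveform to integer tokens.
# Assumption: a hand-made codebook of 4 entries, each 2 samples long.
# A real codec learns its codebook from data; this one is invented for the demo.
CODEBOOK = [
    [-0.75, -0.75],
    [-0.25, -0.25],
    [0.25, 0.25],
    [0.75, 0.75],
]
FRAME_SIZE = 2


def split_frames(signal, size=FRAME_SIZE):
    """Split the signal into non-overlapping frames, dropping any short tail."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def nearest_token(frame, codebook=CODEBOOK):
    """Return the index of the codebook entry closest to the frame."""
    def squared_distance(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def tokenize(signal, codebook=CODEBOOK, size=FRAME_SIZE):
    """Turn a list of samples into a list of integer tokens."""
    return [nearest_token(frame, codebook) for frame in split_frames(signal, size)]


def detokenize(tokens, codebook=CODEBOOK):
    """Rebuild an approximate signal by concatenating codebook entries."""
    samples = []
    for token in tokens:
        samples.extend(codebook[token])
    return samples


if __name__ == "__main__":
    signal = [0.8, 0.7, -0.3, -0.2]
    tokens = tokenize(signal)
    print(tokens)              # [3, 1]
    print(detokenize(tokens))  # [0.75, 0.75, -0.25, -0.25]
```

The reconstruction is only an approximation of the input, which is the trade-off any token-based representation makes: a short list of integers stands in for the full waveform.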

The model is based on a technology called EnCodec by Meta, which was just released this past October. The tool uses a three-part system to compress audio to 10 times smaller than MP3s with no loss in quality; its creators meant for one of its uses to be improving the quality of voice and music on calls made over low-bandwidth connections.

To train VALL-E, its creators used an audio library called LibriLight, whose 60,000 hours of English speech is primarily made up of audiobook narration. The model yields its best results when the voice being synthesized is similar to one of the voices from the training library (of which there are over 7,000, so that shouldn’t be too tall an order).

Besides recreating someone’s voice, VALL-E also simulates the audio environment from the three-second sample. A clip recorded over the phone would sound different than one made in person, and if you’re walking or driving while talking, the unique acoustics of those scenarios are taken into account.

Some of the samples sound fairly realistic, while others are still very obviously computer-generated. But there are noticeable differences between the voices; you can tell they’re based on people who have different speaking styles, pitches, and intonation patterns.

The team that created VALL-E knows it could very easily be used by bad actors; from faking sound bites of politicians or celebrities to using familiar voices to request money or information over the phone, there are countless ways to take advantage of the technology. They’ve wisely refrained from making VALL-E’s code publicly available, and included an ethics statement at the end of their paper (which won’t do much to deter anyone who wants to use the AI for nefarious purposes).

It’s likely just a matter of time before similar tools spring up and fall into the wrong hands. The researchers suggest the risks that models like VALL-E will present could be mitigated by building detection models to gauge whether audio clips are real or synthesized. If we need AI to protect us from AI, how do we know if these technologies are having a net positive impact? Time will tell.
