cryptogon.com

AI Clones Your Voice After Listening for 5 Seconds

November 13th, 2019

Must-listen examples at the link.

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
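To make the data flow in that abstract concrete, here is a toy sketch of the three stages chained together. None of this is the authors' code: every function name, constant, and piece of arithmetic below is a hypothetical stand-in, and no neural network is involved. It only shows the shapes: variable-length reference audio goes in, a fixed-size speaker embedding comes out, the embedding conditions a spectrogram generated from text, and a vocoder step turns that spectrogram into waveform samples.

```python
import math
import random

# Illustrative sizes only; these are assumptions, not values from the abstract.
EMBED_DIM = 8   # length of the fixed-dimensional speaker embedding
N_MELS = 4      # channels per spectrogram frame
HOP = 16        # waveform samples produced per spectrogram frame


def speaker_encoder(reference_audio):
    """Stand-in for component (1): audio of any length -> fixed-size unit vector."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    emb = []
    for i in range(EMBED_DIM):
        part = reference_audio[i * chunk:(i + 1) * chunk] or [0.0]
        emb.append(sum(part) / len(part))
    norm = math.sqrt(sum(x * x for x in emb)) or 1.0
    return [x / norm for x in emb]


def synthesizer(text, speaker_embedding):
    """Stand-in for component (2): text -> spectrogram frames, shifted by the embedding."""
    bias = sum(speaker_embedding) / len(speaker_embedding)
    return [[(ord(ch) % 32) / 32.0 + bias for _ in range(N_MELS)] for ch in text]


def vocoder(mel):
    """Stand-in for component (3): spectrogram frames -> time-domain samples."""
    wave = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend(level * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wave


def clone_voice(text, reference_audio):
    """Chain the three stand-ins in the order the abstract lists them."""
    embedding = speaker_encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


rng = random.Random(0)
reference = [rng.uniform(-1, 1) for _ in range(500)]  # pretend "5 seconds" of audio
audio = clone_voice("hello", reference)
```

The point of the sketch is the separation: the speaker encoder never sees the text, and the synthesizer never sees the raw reference audio, only the embedding. That is what lets the abstract's three components be trained independently.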

Posted in [???], Media, Rise of the Machines, Technology

2 Responses to “AI Clones Your Voice After Listening for 5 Seconds”

dale says:

November 13, 2019 at 5:13 pm

I am speechless. Amazing/Frightening. How can this power not be abused?

Kevin says:

November 13, 2019 at 8:06 pm

Even the best current text-to-speech systems are pretty bad. But this…

What if, instead of training for five seconds, it trained for a minute or an hour?

I wonder if they have some other AI that is context-aware, so that this thing can add different qualities to the synthesis? For example, angry, happy, frightened, etc.

The deep fake thing is going to go into afterburner.

Hmm. How about convincing point-and-click audiobook productions? Possible?

The New Zealand Copyright Act 1994 specifies certain circumstances where all or a substantial part of a copyright work may be used without the copyright owner's permission. A "fair dealing" with copyright material does not infringe copyright if it is for the following purposes: research or private study; criticism or review; or reporting current events. If you are a legal copyright holder, or a designated agent for such, and you believe a post on this website falls outside the boundaries of "fair dealing," and legitimately infringes on your or your client's copyright, please contact Kevin Flaherty. Cryptogon contains both original material and material from external sources. Original material: Copyright Kevin Flaherty. Material from external sources: Copyright the respective owners / authors.
