Problem-agnostic speech embeddings for multi-speaker text-to-speech with SampleRNN

dc.contributor
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors
dc.contributor
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.contributor
Universitat Politècnica de Catalunya. IDEAI-UPC - Intelligent Data sciEnce and Artificial Intelligence Research Group
dc.contributor.author
Álvarez Robert, David
dc.contributor.author
Pascual de la Puente, Santiago
dc.contributor.author
Bonafonte Cávez, Antonio
dc.date.accessioned
2026-02-11T06:23:49Z
dc.date.available
2026-02-11T06:23:49Z
dc.date.issued
2019
dc.identifier
Alvarez, D.; Pascual, S.; Bonafonte, A. Problem-agnostic speech embeddings for multi-speaker text-to-speech with SampleRNN. In: Speech Synthesis Workshop. «10th ISCA Speech Synthesis Workshop, SSW 2019: Vienna, Austria, September 20-22, 2019». Baixas: International Speech Communication Association (ISCA), 2019, p. 35-39. DOI 10.21437/SSW.2019-7.
dc.identifier
https://hdl.handle.net/2117/454391
dc.identifier
10.21437/SSW.2019-7
dc.identifier.uri
http://hdl.handle.net/2117/454391
dc.description.abstract
Text-to-speech (TTS) acoustic models map linguistic features into an acoustic representation from which an audible waveform is generated. The latest and most natural TTS systems, such as SampleRNN, build a direct mapping between the linguistic and waveform domains. This avoids possible losses in signal naturalness, because intermediate acoustic representations are discarded. Apart from naturalness, another important dimension of study is the adaptability of these systems to generate voices for new speakers that were unseen during training. In this paper we first propose the use of problem-agnostic speech embeddings in a multi-speaker acoustic model for TTS based on SampleRNN. This way, we feed the acoustic model with acoustically dependent speaker representations that enrich the waveform generation more than embeddings unrelated to these factors. Our first results suggest that the proposed embeddings lead to better quality voices than those obtained with one-hot embeddings. Furthermore, as we can use any speech segment as an encoded representation during inference, the model is able to generalize to new speaker identities without retraining the network. We finally show that, with a small increase of speech duration in the embedding extractor, we dramatically reduce the spectral distortion to close the gap towards the target identities.
dc.description.abstract
This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).
dc.description.abstract
Peer Reviewed
dc.description.abstract
Postprint (published version)
dc.format
5 p.
dc.format
application/pdf
dc.language
eng
dc.publisher
International Speech Communication Association (ISCA)
dc.relation
https://www.isca-archive.org/ssw_2019/alvarez19_ssw.html
dc.relation
info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/
dc.rights
Open Access
dc.subject
UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing
dc.subject
UPC subject areas::Computer science::Artificial intelligence::Machine learning
dc.subject
Speech synthesis
dc.subject
Text-to-speech
dc.subject
Problem-agnostic speech embeddings
dc.subject
Speaker adaptation
dc.title
Problem-agnostic speech embeddings for multi-speaker text-to-speech with SampleRNN
dc.type
Conference report