Multi-speaker neural vocoder

dc.contributor
Universitat Politècnica de Catalunya. Doctorat en Intel·ligència Artificial
dc.contributor
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.contributor
Universitat Politècnica de Catalunya. ROBiri - Grup de Percepció i Manipulació Robotitzada de l'IRI
dc.contributor
Universitat Politècnica de Catalunya. IDEAI-UPC - Intelligent Data sciEnce and Artificial Intelligence Research Group
dc.contributor.author
Barbany Mayor, Oriol
dc.contributor.author
Bonafonte Cávez, Antonio
dc.contributor.author
Pascual de la Puente, Santiago
dc.date.accessioned
2026-02-13T08:05:44Z
dc.date.available
2026-02-13T08:05:44Z
dc.date.issued
2018
dc.identifier
Barbany, O.; Bonafonte, A.; Pascual, S. Multi-speaker neural vocoder. A: International Conference on Advances in Speech and Language Technologies for Iberian Languages. «Fourth International Conference, IberSPEECH 2018: Barcelona, Spain, 21-23 November 2018: proceedings». Baixas: International Speech Communication Association (ISCA), 2018, p. 30-34. DOI: 10.21437/IberSPEECH.2018-7.
dc.identifier
https://hdl.handle.net/2117/454510
dc.identifier
10.21437/IberSPEECH.2018-7
dc.identifier.uri
http://hdl.handle.net/2117/454510
dc.description.abstract
Statistical Parametric Speech Synthesis (SPSS) offers more flexibility than unit-selection speech synthesis, which was the dominant commercial technology during the 2000s. However, classical SPSS systems generate speech with lower naturalness than unit-selection methods. Deep-learning-based SPSS, thanks to recurrent architectures, surpasses the limits of classical SPSS: these architectures deliver high-quality speech while preserving the desired flexibility in the choice of parameters such as the speaker, the intonation, etc. This paper presents two proposals to improve deep-learning-based text-to-speech systems. First, a baseline model is obtained by adapting SampleRNN into a speaker-independent neural vocoder that generates the speech waveform from acoustic parameters. Then, two approaches are proposed to improve quality: speaker-dependent normalization of the acoustic features, and look-ahead, which consists of feeding acoustic features of future frames to the network so as to better model the present waveform and avoid possible discontinuities. Human listeners prefer the system that combines both techniques, which reaches a mean opinion score (MOS) of 4 with the balanced dataset and outperforms the other models.
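The two techniques named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function names, the z-score formulation of the per-speaker normalization, and the frame-stacking form of the look-ahead are assumptions chosen for illustration.

```python
import numpy as np

def speaker_dependent_normalize(features, speaker_stats):
    """Z-score acoustic features using per-speaker statistics.

    features: (T, D) array of acoustic frames for one utterance.
    speaker_stats: (mean, std) pair, each of shape (D,), computed over
    that speaker's training data (an assumed formulation of the paper's
    speaker-dependent normalization).
    """
    mean, std = speaker_stats
    return (features - mean) / (std + 1e-8)  # epsilon avoids division by zero

def add_look_ahead(features, n_future=1):
    """Append the features of the next n_future frames to each frame.

    The end of the utterance is padded by repeating the last frame, so the
    output keeps T rows but grows to D * (1 + n_future) columns.
    """
    parts = [features]
    for k in range(1, n_future + 1):
        # Shift the feature matrix k frames into the future.
        shifted = np.vstack([features[k:], np.repeat(features[-1:], k, axis=0)])
        parts.append(shifted)
    return np.hstack(parts)
```

In this sketch, a conditioning frame fed to the vocoder at time t would be the normalized features of frame t concatenated with those of frames t+1, ..., t+n_future, which is one plausible reading of "feeding acoustic features of future frames to the network".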
dc.description.sponsorship
This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).
dc.description
Peer Reviewed
dc.description
Postprint (published version)
dc.format
5 p.
dc.format
application/pdf
dc.language
eng
dc.publisher
International Speech Communication Association (ISCA)
dc.relation
https://www.isca-archive.org/iberspeech_2018/barbany18_iberspeech.html
dc.relation
info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/
dc.rights
Open Access
dc.subject
Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic
dc.subject
Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
dc.subject
Deep learning
dc.subject
Speech synthesis
dc.subject
Recurrent neural networks
dc.subject
Text-to-speech
dc.subject
SampleRNN
dc.subject
Time series
dc.title
Multi-speaker neural vocoder
dc.type
Conference report

