Multi-speaker neural vocoder

dc.contributor
Universitat Politècnica de Catalunya. Doctorat en Intel·ligència Artificial
dc.contributor
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.contributor
Universitat Politècnica de Catalunya. ROBiri - Grup de Percepció i Manipulació Robotitzada de l'IRI
dc.contributor
Universitat Politècnica de Catalunya. IDEAI-UPC - Intelligent Data sciEnce and Artificial Intelligence Research Group
dc.contributor.author
Barbany Mayor, Oriol
dc.contributor.author
Bonafonte Cávez, Antonio
dc.contributor.author
Pascual de la Puente, Santiago
dc.date.accessioned
2026-02-13T08:05:44Z
dc.date.available
2026-02-13T08:05:44Z
dc.date.issued
2018
dc.identifier
Barbany, O.; Bonafonte, A.; Pascual, S. Multi-speaker neural vocoder. A: International Conference on Advances in Speech and Language Technologies for Iberian Languages. «Fourth International Conference, IberSPEECH 2018: Barcelona, Spain, 21-23 November 2018: proceedings». Baixas: International Speech Communication Association (ISCA), 2018, p. 30-34. DOI: 10.21437/IberSPEECH.2018-7.
dc.identifier
https://hdl.handle.net/2117/454510
dc.identifier
10.21437/IberSPEECH.2018-7
dc.identifier.uri
http://hdl.handle.net/2117/454510
dc.description.abstract
Statistical Parametric Speech Synthesis (SPSS) offers more flexibility than unit-selection speech synthesis, which was the dominant commercial technology during the 2000s. However, classical SPSS systems generate speech with lower naturalness than unit-selection methods. Deep-learning-based SPSS, thanks to recurrent architectures, surpasses the limits of classical SPSS: these architectures deliver high-quality speech while preserving the desired flexibility in the choice of parameters such as the speaker, the intonation, etc. This paper presents two proposals to improve deep-learning-based text-to-speech systems. First, a baseline model is obtained by adapting SampleRNN into a speaker-independent neural vocoder that generates the speech waveform from acoustic parameters. Then, two approaches are proposed to improve quality: speaker-dependent normalization of the acoustic features, and look-ahead, which consists of feeding acoustic features of future frames to the network so as to better model the present waveform and avoid possible discontinuities. Human listeners prefer the system that combines both techniques, which reaches a mean opinion score (MOS) of 4 with the balanced dataset and outperforms the other models.
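The two techniques named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function names, the z-score formulation of the per-speaker normalization, and the frame-stacking form of the look-ahead are assumptions chosen for illustration.

```python
import numpy as np

def speaker_dependent_normalize(features, speaker_stats):
    """Z-score acoustic features using per-speaker statistics.

    features: (T, D) array of acoustic frames for one utterance.
    speaker_stats: (mean, std) pair, each of shape (D,), computed over
    that speaker's training data (an assumed formulation of the paper's
    speaker-dependent normalization).
    """
    mean, std = speaker_stats
    return (features - mean) / (std + 1e-8)  # epsilon avoids division by zero

def add_look_ahead(features, n_future=1):
    """Append the features of the next n_future frames to each frame.

    The end of the utterance is padded by repeating the last frame, so the
    output keeps T rows but grows to D * (1 + n_future) columns.
    """
    parts = [features]
    for k in range(1, n_future + 1):
        # Shift the feature matrix k frames into the future.
        shifted = np.vstack([features[k:], np.repeat(features[-1:], k, axis=0)])
        parts.append(shifted)
    return np.hstack(parts)
```

In this sketch, a conditioning frame fed to the vocoder at time t would be the normalized features of frame t concatenated with those of frames t+1, ..., t+n_future, which is one plausible reading of "feeding acoustic features of future frames to the network".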
dc.description.sponsorship
This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).
dc.description
Peer Reviewed
dc.description
Postprint (published version)
dc.format
5 p.
dc.format
application/pdf
dc.language
eng
dc.publisher
International Speech Communication Association (ISCA)
dc.relation
https://www.isca-archive.org/iberspeech_2018/barbany18_iberspeech.html
dc.relation
info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/
dc.rights
Open Access
dc.subject
Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic
dc.subject
Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
dc.subject
Deep learning
dc.subject
Speech synthesis
dc.subject
Recurrent neural networks
dc.subject
Text-to-speech
dc.subject
SampleRNN
dc.subject
Time series
dc.title
Multi-speaker neural vocoder
dc.type
Conference report

