Multi-speaker neural vocoder

Barbany Mayor, Oriol; Bonafonte Cávez, Antonio; Pascual de la Puente, Santiago

To access the full-text documents, please follow this link: http://hdl.handle.net/2117/454510

Author

Barbany Mayor, Oriol

Bonafonte Cávez, Antonio

Pascual de la Puente, Santiago

Other authors

Universitat Politècnica de Catalunya. Doctorat en Intel·ligència Artificial

Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions

Universitat Politècnica de Catalunya. ROBiri - Grup de Percepció i Manipulació Robotitzada de l'IRI

Universitat Politècnica de Catalunya. IDEAI-UPC - Intelligent Data sciEnce and Artificial Intelligence Research Group

Publication date

2018

Abstract

Statistical Parametric Speech Synthesis (SPSS) offers more flexibility than unit-selection based speech synthesis, which was the dominant commercial technology during the 2000s. However, classical SPSS systems generate speech with lower naturalness than unit-selection methods. Deep learning based SPSS, thanks to recurrent architectures, overcomes the limits of classical SPSS. These architectures offer high-quality speech while preserving the desired flexibility in choosing parameters such as the speaker and the intonation. This paper presents two proposals conceived to improve deep learning-based text-to-speech systems. First, a baseline model is obtained by adapting SampleRNN to act as a speaker-independent neural vocoder that generates the speech waveform from acoustic parameters. Then two approaches are proposed to improve the quality: speaker-dependent normalization of the acoustic features, and look-ahead, which consists of feeding acoustic features of future frames to the network with the aim of better modeling the present waveform and avoiding possible discontinuities. Human listeners prefer the system that combines both techniques, which reaches a score of 4 on the mean opinion score (MOS) scale with the balanced dataset and outperforms the other models.
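
The two conditioning techniques named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the form of the normalization, so per-speaker z-scoring is an assumption, as are the function names `speaker_normalize` and `add_look_ahead`, the parameter `n_future`, and the choice to pad the end of the utterance by repeating the last frame.

```python
import numpy as np


def speaker_normalize(features, speaker_ids):
    """Z-score each feature dimension with per-speaker statistics (assumed form).

    features: (n_frames, n_dims) array of acoustic parameters.
    speaker_ids: (n_frames,) array with the speaker label of each frame.
    Returns the normalized features and a dict of (mean, std) per speaker.
    """
    features = np.asarray(features, dtype=np.float64)
    speaker_ids = np.asarray(speaker_ids)
    normalized = np.empty_like(features)
    stats = {}
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mean = features[mask].mean(axis=0)
        std = features[mask].std(axis=0)
        std = np.where(std > 0, std, 1.0)  # guard against constant dimensions
        normalized[mask] = (features[mask] - mean) / std
        stats[spk] = (mean, std)
    return normalized, stats


def add_look_ahead(features, n_future=1):
    """Concatenate each frame with its next n_future frames.

    The last frame is repeated as padding (an assumption), so the output
    keeps n_frames rows and has n_dims * (n_future + 1) columns.
    """
    features = np.asarray(features)
    n_frames = features.shape[0]
    padding = np.repeat(features[-1:], n_future, axis=0)
    padded = np.concatenate([features, padding], axis=0)
    windows = [padded[k:k + n_frames] for k in range(n_future + 1)]
    return np.concatenate(windows, axis=1)
```

In this sketch, the conditioning input for a frame would be the corresponding row of `add_look_ahead(normalized)`, so the network sees the current frame together with the upcoming ones.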

This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).

Peer Reviewed

Postprint (published version)

Document type

Conference report

Language

English

Subjects and keywords

UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing; UPC subject areas::Computer science::Artificial intelligence::Machine learning; Deep learning; Speech synthesis; Recurrent neural networks; Text-to-speech; SampleRNN; Time series

Published by

International Speech Communication Association (ISCA)

Related documents

https://www.isca-archive.org/iberspeech_2018/barbany18_iberspeech.html

info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/

Rights

Open Access

This item appears in the following collection(s)

E-prints [72263]
