<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/style.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-05T10:47:32Z</responseDate><request verb="GetRecord" identifier="oai:www.recercat.cat:2117/123852" metadataPrefix="rdf">https://recercat.cat/oai/request</request><GetRecord><record><header><identifier>oai:recercat.cat:2117/123852</identifier><datestamp>2026-01-21T04:13:13Z</datestamp><setSpec>com_2072_1033</setSpec><setSpec>col_2072_452950</setSpec></header><metadata><rdf:RDF xmlns:rdf="http://www.openarchives.org/OAI/2.0/rdf/" xmlns:ow="http://www.ontoweb.org/ontology/1#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ds="http://dspace.org/ds/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:doc="http://www.lyncode.com/xoai" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/rdf/ http://www.openarchives.org/OAI/2.0/rdf.xsd">
   <ow:Publication rdf:about="oai:recercat.cat:2117/123852">
      <dc:title>Spanish statistical parametric speech synthesis using a neural vocoder</dc:title>
      <dc:creator>Bonafonte Cávez, Antonio</dc:creator>
      <dc:creator>Pascual de la Puente, Santiago</dc:creator>
      <dc:creator>Dorca, G.</dc:creator>
      <dc:subject>Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic</dc:subject>
      <dc:subject>Automatic speech recognition</dc:subject>
      <dc:subject>Neural vocoder</dc:subject>
      <dc:subject>SampleRNN</dc:subject>
      <dc:subject>Spanish TTS</dc:subject>
      <dc:subject>SPSS</dc:subject>
      <dc:subject>Deep learning</dc:subject>
      <dc:subject>Linguistics</dc:subject>
      <dc:subject>Signal processing</dc:subject>
      <dc:subject>Speech communication</dc:subject>
      <dc:subject>Speech synthesis</dc:subject>
      <dc:subject>Vocoders</dc:subject>
      <dc:subject>Commercial technology</dc:subject>
      <dc:subject>Research communities</dc:subject>
      <dc:subject>Statistical parametric speech synthesis</dc:subject>
      <dc:subject>Subjective evaluations</dc:subject>
      <dc:subject>Waveform generation</dc:subject>
      <dc:subject>Recurrent neural networks</dc:subject>
      <dc:subject>Reconeixement automàtic de la parla</dc:subject>
      <dc:description>During the 2000s, unit-selection-based text-to-speech was the dominant commercial technology. Meanwhile, the TTS research community made a considerable effort to push statistical parametric speech synthesis toward similar quality and greater flexibility in the synthetically generated voice. In recent years, advances in deep learning applied to speech synthesis have closed the gap, especially when neural vocoders replace traditional signal-processing-based vocoders. In this paper we propose to replace the waveform generation vocoder of MUSA, our Spanish TTS system, with SampleRNN, a neural vocoder recently proposed as a deep autoregressive raw-waveform generation model. MUSA uses recurrent neural networks to predict vocoder parameters (MFCC and logF0) from linguistic features; the Ahocoder vocoder is then used to recover the speech waveform from the predicted parameters. In the first system, SampleRNN is extended to generate speech conditioned on the Ahocoder parameters (MFCC and logF0), and two configurations are considered for training. First, the parameters derived from the signal using Ahocoder are used. Second, the system is trained with the parameters predicted by MUSA, where SampleRNN and MUSA are jointly optimized. The subjective evaluation shows that the second system outperforms both the original Ahocoder and SampleRNN as an independent neural vocoder.</dc:description>
      <dc:description>Peer Reviewed</dc:description>
      <dc:description>Postprint (published version)</dc:description>
      <dc:date>2018</dc:date>
      <dc:type>Conference report</dc:type>
      <dc:relation>https://www.isca-speech.org/archive/Interspeech_2018/pdfs/2417.pdf</dc:relation>
      <dc:relation>info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/</dc:relation>
      <dc:rights>Open Access</dc:rights>
      <dc:publisher>International Speech Communication Association (ISCA)</dc:publisher>
   </ow:Publication>
</rdf:RDF></metadata></record></GetRecord></OAI-PMH>