<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/style.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-05T10:47:32Z</responseDate><request verb="GetRecord" identifier="oai:www.recercat.cat:2117/123852" metadataPrefix="rdf">https://recercat.cat/oai/request</request><GetRecord><record><header><identifier>oai:recercat.cat:2117/123852</identifier><datestamp>2026-01-21T04:13:13Z</datestamp><setSpec>com_2072_1033</setSpec><setSpec>col_2072_452950</setSpec></header><metadata><rdf:RDF xmlns:rdf="http://www.openarchives.org/OAI/2.0/rdf/" xmlns:ow="http://www.ontoweb.org/ontology/1#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ds="http://dspace.org/ds/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:doc="http://www.lyncode.com/xoai" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/rdf/ http://www.openarchives.org/OAI/2.0/rdf.xsd">
   <ow:Publication rdf:about="oai:recercat.cat:2117/123852">
      <dc:title>Spanish statistical parametric speech synthesis using a neural vocoder</dc:title>
      <dc:creator>Bonafonte Cávez, Antonio</dc:creator>
      <dc:creator>Pascual de la Puente, Santiago</dc:creator>
      <dc:creator>Dorca, G.</dc:creator>
      <dc:subject>Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic</dc:subject>
      <dc:subject>Automatic speech recognition</dc:subject>
      <dc:subject>Neural vocoder</dc:subject>
      <dc:subject>SampleRNN</dc:subject>
      <dc:subject>Spanish TTS</dc:subject>
      <dc:subject>SPSS</dc:subject>
      <dc:subject>Deep learning</dc:subject>
      <dc:subject>Linguistics</dc:subject>
      <dc:subject>Signal processing</dc:subject>
      <dc:subject>Speech communication</dc:subject>
      <dc:subject>Speech synthesis</dc:subject>
      <dc:subject>Vocoders</dc:subject>
      <dc:subject>Commercial technology</dc:subject>
      <dc:subject>Research communities</dc:subject>
      <dc:subject>Statistical parametric speech synthesis</dc:subject>
      <dc:subject>Subjective evaluations</dc:subject>
      <dc:subject>Waveform generation</dc:subject>
      <dc:subject>Recurrent neural networks</dc:subject>
      <dc:subject>Reconeixement automàtic de la parla</dc:subject>
      <dc:description>During the 2000s, unit-selection-based text-to-speech was the dominant commercial technology. Meanwhile, the TTS research community made a considerable effort to push statistical parametric speech synthesis toward similar quality and greater flexibility in the synthetically generated voice. In recent years, advances in deep learning applied to speech synthesis have closed the gap, especially when neural vocoders replace traditional signal-processing-based vocoders. In this paper we propose to replace the waveform generation vocoder of MUSA, our Spanish TTS system, with SampleRNN, a neural vocoder recently proposed as a deep autoregressive raw-waveform generation model. MUSA uses recurrent neural networks to predict vocoder parameters (MFCC and logF0) from linguistic features; the Ahocoder vocoder is then used to recover the speech waveform from the predicted parameters. In the first system, SampleRNN is extended to generate speech conditioned on the Ahocoder parameters (MFCC and logF0), and two configurations are considered for training. First, the parameters derived from the signal using Ahocoder are used. Second, the system is trained with the parameters predicted by MUSA, where SampleRNN and MUSA are jointly optimized. The subjective evaluation shows that the second system outperforms both the original Ahocoder and SampleRNN as an independent neural vocoder.</dc:description>
      <dc:description>Peer Reviewed</dc:description>
      <dc:description>Postprint (published version)</dc:description>
      <dc:date>2018</dc:date>
      <dc:type>Conference report</dc:type>
      <dc:relation>https://www.isca-speech.org/archive/Interspeech_2018/pdfs/2417.pdf</dc:relation>
      <dc:relation>info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/</dc:relation>
      <dc:rights>Open Access</dc:rights>
      <dc:publisher>International Speech Communication Association (ISCA)</dc:publisher>
   </ow:Publication>
</rdf:RDF></metadata></record></GetRecord></OAI-PMH>