Quality-driven synthetic text generation for multilingual speech translation with audio large language models

;
Quality-Driven Synthetic Text Generation for Multilingual Speech Translation with Audio Large Language Models;

Otros/as autores/as

Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions

Hernando Pericás, Francisco Javier

Fecha de publicación

2025-10-28



Resumen

.


This thesis explores quality-driven synthetic data generation as a scalable solution for mul- tilingual speech-to-text translation (S2TT), focusing on Iberian languages with limited natural resources. Leveraging large language models (LLMs) and rigorous reference-free quality filtering via BLASER 2.0, an end-to-end pipeline was implemented to generate millions of high-quality synthetic translations. The approach demonstrates substantial improvements in translation quality and semantic similarity for low-resource languages such as Asturian and Occitan, while enabling efficient scaling to diverse linguistic do- mains. Experimental results reveal that models trained on filtered synthetic data achieve competitive and often state-of-the-art performance in S2TT tasks, and narrow the gap between direct and Chain-of-Thought cascade architectures. This work lays foundational evidence that scalable, quality-centric synthetic data pipelines are powerful enablers for inclusive, robust multilingual speech technologies, especially where manual annotation remains costly or infeasible.

Tipo de documento

Master thesis

Lengua

Inglés

Publicado por

Universitat Politècnica de Catalunya

Citación recomendada

Esta citación se ha generado automáticamente.

Derechos

S'autoritza la difusió de l'obra mitjançant la llicència Creative Commons o similar 'Reconeixement-NoComercial- SenseObraDerivada'

Open Access

Este ítem aparece en la(s) siguiente(s) colección(ones)