Exploring the integration of large language models for automatic emotion labeling in speech

dc.contributor.author
Yun Chien, Yi
dc.date.accessioned
2025-10-22T19:46:56Z
dc.date.available
2025-10-22T19:46:56Z
dc.date.issued
2025-10-20T16:01:56Z
dc.date.issued
2025
dc.identifier
http://hdl.handle.net/10230/71584
dc.identifier.uri
http://hdl.handle.net/10230/71584
dc.description.abstract
Master's thesis for the Master in Intelligent Interactive Systems
dc.description.abstract
Supervisor: Prof. María Inés Torres Barañano
dc.description.abstract
In this work, we present a comprehensive comparison of methodologies for speech emotion recognition (SER), focusing on evaluating the effectiveness of large language models (LLMs) in this domain. Our study is structured in three parts. First, we extract audio embeddings using models such as WavLM, HuBERT, and Dasheng, and use classical machine learning classifiers, Support Vector Machine (SVM) and Multilayer Perceptron (MLP), for emotion prediction. This approach serves as a baseline for comparison. Second, we investigate the capacity of LLMs such as GPT-4o, Qwen2-Audio, and Amazon Nova Sonic to analyze audio features, including speaker attributes such as gender, thereby extending their application beyond traditional natural language processing. Third, we explore a more integrated approach that feeds raw audio directly into an audio-capable LLM, such as Qwen2-Audio-7B-Instruct, for end-to-end emotion classification, without the need for traditional signal-processing-based feature extraction. We evaluate and compare the performance of these methodologies on metrics such as accuracy, precision, recall, and F1-score, with a primary focus on the results obtained from LLM-based models. Our results reveal several key insights: (1) data distribution significantly affects classifier performance; (2) different audio embeddings yield different results even with the same classifier and dataset; and (3) despite their capability, current LLMs still underperform classical classifiers such as SVM and MLP in emotion prediction tasks.
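The baseline described in the abstract (pretrained speech embeddings fed to a classical classifier) can be sketched as follows. This is a minimal illustration, not the thesis's actual setup: the random vectors stand in for WavLM/HuBERT/Dasheng utterance embeddings, and the embedding dimension, class count, and split are illustrative assumptions.

```python
# Baseline sketch: speech-emotion classification from fixed-size audio
# embeddings with an SVM. Placeholder random vectors stand in for real
# WavLM/HuBERT/Dasheng embeddings; all sizes below are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_clips, emb_dim = 200, 768                 # 768 = typical base-model hidden size
X = rng.normal(size=(n_clips, emb_dim))     # placeholder utterance embeddings
y = rng.integers(0, 4, size=n_clips)        # 4 emotion classes (e.g. angry/happy/neutral/sad)

# Hold out a test split, fit the SVM baseline, and score with macro F1,
# one of the metrics the study reports alongside accuracy/precision/recall.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
preds = clf.predict(X_te)
print("macro F1:", f1_score(y_te, preds, average="macro"))
```

In the real pipeline, `X` would come from mean-pooling the hidden states of a pretrained speech encoder over each utterance; swapping the encoder while keeping the classifier fixed is what exposes insight (2) above.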
dc.format
application/pdf
dc.language
eng
dc.rights
CC Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)
dc.rights
https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Emotions
dc.title
Exploring the integration of large language models for automatic emotion labeling in speech
dc.type
info:eu-repo/semantics/masterThesis

