Comparison of Audio Encoders for Audio-Text Contrastive Learning Representations

Cárdenas Gracia, Sergio

dc.contributor.author

Cárdenas Gracia, Sergio

dc.date.accessioned

2026-02-07T20:30:31Z

dc.date.available

2026-02-07T20:30:31Z

dc.date.issued

2026-02-06T13:11:55Z

dc.date.issued

2025

dc.identifier

https://hdl.handle.net/10230/72481

dc.identifier.uri

https://hdl.handle.net/10230/72481

dc.description.abstract

Master's thesis for: Master in Sound and Music Computing

dc.description.abstract

Supervisor: Pablo Alonso Jiménez

dc.description.abstract

Co-Supervisor: Dmitry Bogdanov

dc.description.abstract

This project investigates contrastive learning techniques for aligning audio and text representations in the music domain, focusing on scenarios with limited data and computational resources. We provide a comprehensive review of existing methods relevant to music-text contrastive learning. Two audio encoders, HTSAT and MAEST, both initialized with pretrained weights, are each paired with a frozen RoBERTa text encoder within the LAION-AI CLAP framework and fine-tuned on the MTG-Jamendo dataset. Model performance is evaluated on three tasks: zero-shot genre classification on the GTZAN dataset, multi-label tag classification on the MagnaTagATune dataset, and text-to-music retrieval on the Song Describer dataset. Results show that HTSAT generalizes better in low-data settings, while MAEST tends to overfit, highlighting the impact of encoder complexity in resource-constrained environments. Attempts to mitigate MAEST’s overfitting with weight decay and learning rate decay were unsuccessful. Additionally, the study highlights the critical role of data volume and batch size in contrastive learning effectiveness. The source code for this work is publicly available at https://github.com/SerX610/smc-master-thesis
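As a rough illustration of the audio-text contrastive objective and the zero-shot classification step mentioned in the abstract, the following is a minimal sketch in plain Python. It is not taken from the thesis code or from the LAION-AI CLAP implementation. The function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i (audio_embs[i], text_embs[i]) is the positive; every other
    item in the batch acts as a negative. The temperature is an
    assumed, illustrative value.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    # Scaled cosine similarities: rows are audio items, columns are texts.
    logits = [[dot(ai, tj) / temperature for tj in t] for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio view
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_t))


def zero_shot_classify(audio_emb, label_embs):
    """Return the index of the label embedding most similar to the audio."""
    a = l2_normalize(audio_emb)
    sims = [dot(a, l2_normalize(label)) for label in label_embs]
    return max(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(audio, aligned_text))  # small: pairs agree
    print(contrastive_loss(audio, swapped_text))  # large: pairs disagree
    print(zero_shot_classify([0.9, 0.1], aligned_text))
```

Because every other item in the batch serves as a negative, the number of negatives per positive pair grows with the batch size, which is one way to read the abstract's remark about batch size.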

dc.format

application/pdf

dc.language

eng

dc.rights

Creative Commons license Attribution-NonCommercial-NoDerivs 4.0 International

dc.rights

Attribution-NonCommercial-NoDerivatives 4.0 International

dc.rights

https://creativecommons.org/licenses/by-nc-nd/4.0/

dc.rights

info:eu-repo/semantics/openAccess

dc.subject

Computer music

dc.title

Comparison of Audio Encoders for Audio-Text Contrastive Learning Representations

dc.type

info:eu-repo/semantics/masterThesis

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following collection(s)

Treballs d'estudiants [4946]