Exploring morphology-aware tokenization: a case study on Spanish language modeling

Táboas García, Alba; Przybyła, Piotr; Wanner, Leo

To access the full-text documents, please follow this link: https://hdl.handle.net/10230/72720

Author

Táboas García, Alba

Przybyła, Piotr

Wanner, Leo

Publication date

2026-03-06T15:20:12Z

2025

Abstract

This paper investigates to what extent the integration of morphological information can improve subword tokenization and thus also language modeling performance. We focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), we explore a linguistically grounded approach: training a tokenizer on morphologically segmented data. To do so, we develop a semi-supervised segmentation model for Spanish, building gold-standard datasets to guide and evaluate it. We then use this tokenizer to pre-train a masked language model and assess its performance on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.

The work presented in this paper has been partially supported by the European Commission in the framework of the Horizon Europe program (contract number 101070278). We sincerely thank the anonymous reviewers for their valuable suggestions and thoughtful feedback, which greatly contributed to enhancing the quality of this paper. We also acknowledge the use of the MareNostrum 5 supercomputer at the Barcelona Supercomputing Center (BSC) for model training.

Document type

Book chapter or part of a book

Published version

Language

Catalan

Subjects and keywords

Morphology-aware tokenization; Spanish language modeling

Published by

ACL (Association for Computational Linguistics)

Related documents

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2025 Nov 4-9. Suzhou, China. Kerrville: ACL; 2025.

Recommended citation

This citation was generated automatically.

Rights

http://creativecommons.org/licenses/by/4.0/

This item appears in the following collection(s)

Research: articles, conferences, books [20987]
