In-depth evaluation of cross-domain language identification methods

Altres autors/es

Universitat Politècnica de Catalunya

Københavns universitet

Van Der Goot, Rob

Data de publicació

2025-01-16

Resum

Language identification is a fundamental Natural Language Processing (NLP) task with wide-ranging applications, from machine translation to preprocessing training data for Large Language Models. While existing models demonstrate high performance within specific domains, their effectiveness significantly deteriorates when applied across different linguistic contexts. This study presents an in-depth analysis of cross-domain language identification methods, evaluating their performance, capabilities, and inherent limitations. Our research investigates the challenges posed by the linguistic diversity found in modern communication. Experiments conducted on a dataset spanning 2,034 languages reveal significant performance variations across domains. Models trained on specific domains like wiki, news, and religious texts show high in-domain accuracy but struggle to maintain performance when applied to different linguistic contexts. Our analysis highlights the need for more adaptable, context-aware language identification systems that can effectively handle the complexity of modern language use. Key findings include the limited transferability of domain-specific features, the nuanced challenges of advanced tokenization, and the complex error patterns arising from language similarities and data inconsistencies. This research contributes to the ongoing dialogue about developing more robust language identification technologies that can adapt to our increasingly diverse linguistic landscape.

Tipus de document

Bachelor thesis

Llengua

Anglès

Publicat per

Universitat Politècnica de Catalunya

Citació recomanada

Aquesta citació s'ha generat automàticament.

Drets

Open Access

Aquest element apareix en la col·lecció o col·leccions següent(s)