oai:recercat.cat:2117/429635 2025-07-23T02:56:38Z com_2072_1033 col_2072_452951

Cabré Guerrero, Víctor (author)
2025-01-16

Language identification is a fundamental Natural Language Processing (NLP) task with wide-ranging applications, from machine translation to preprocessing training data for Large Language Models. While existing models demonstrate high performance within specific domains, their effectiveness significantly deteriorates when applied across different linguistic contexts. This study presents an in-depth analysis of cross-domain language identification methods, evaluating their performance, capabilities, and inherent limitations. Our research investigates the challenges posed by the linguistic diversity found in modern communication. Experiments conducted on a dataset spanning 2,034 languages reveal significant performance variations across domains. Models trained on specific domains like wiki, news, and religious texts show high in-domain accuracy but struggle to maintain performance when applied to different linguistic contexts. Our analysis highlights the need for more adaptable, context-aware language identification systems that can effectively handle the complexity of modern language use. Key findings include the limited transferability of domain-specific features, the nuanced challenges of advanced tokenization, and the complex error patterns arising from language similarities and data inconsistencies. This research contributes to the ongoing dialogue about developing more robust language identification technologies that can adapt to our increasingly diverse linguistic landscape.
UPC subject areas::Computer science::Artificial intelligence::Machine learning
Natural language processing (Computer science); Machine learning; Linguistics
langid; language identification; langid.py; glot500; artificial intelligence; fasttext; textcat; intelligence

In-depth evaluation of cross-domain language identification methods