Universitat Politècnica de Catalunya
Københavns universitet
Van Der Goot, Rob
2025-01-16
Language identification is a fundamental Natural Language Processing (NLP) task with wide-ranging applications, from machine translation to preprocessing training data for Large Language Models. While existing models demonstrate high performance within specific domains, their effectiveness significantly deteriorates when applied across different linguistic contexts. This study presents an in-depth analysis of cross-domain language identification methods, evaluating their performance, capabilities, and inherent limitations. Our research investigates the challenges posed by the linguistic diversity found in modern communication. Experiments conducted on a dataset spanning 2,034 languages reveal significant performance variations across domains. Models trained on specific domains like wiki, news, and religious texts show high in-domain accuracy but struggle to maintain performance when applied to different linguistic contexts. Our analysis highlights the need for more adaptable, context-aware language identification systems that can effectively handle the complexity of modern language use. Key findings include the limited transferability of domain-specific features, the nuanced challenges of advanced tokenization, and the complex error patterns arising from language similarities and data inconsistencies. This research contributes to the ongoing dialogue about developing more robust language identification technologies that can adapt to our increasingly diverse linguistic landscape.
Bachelor thesis
English
UPC subject areas::Computer science::Artificial intelligence::Machine learning; Natural language processing (Computer science); Machine learning; Linguistics; langid; language identification; langid.py; glot500; fasttext; textcat; artificial intelligence
Universitat Politècnica de Catalunya
Open Access