A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles
Barrón-Cedeño, Alberto; Lestari Paramita, Monica; Clough, Paul; Rosso, Paolo
Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics; Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural
Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language links. However, the extent to which these articles are similar is highly variable, and this may affect the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. The measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (character n-grams, outlinks and word count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis).
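As a rough illustration of two of the language-independent measures named in the abstract, the sketch below computes a character n-gram cosine similarity and a word count ratio for a pair of texts. The abstract does not specify the definitions used in the paper, so everything here is an assumption: the function names, the choice of n = 3, raw-count weighting, the normalisation step, and the min/max form of the ratio are all hypothetical. Cognateness and the outlink-based approach are not covered.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed n=3, not taken from the paper)."""
    # Assumed normalisation: lowercase, collapse runs of non-word characters to a space.
    normalised = re.sub(r"[\W_]+", " ", text.lower()).strip()
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_count_ratio(text_a, text_b):
    """Shorter length over longer length, in whitespace tokens (assumed definition)."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    if max(len_a, len_b) == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)
```

Calling `cosine(char_ngrams(text_en), char_ngrams(text_es))` on two article texts gives a score between 0 and 1. Scores of this kind are only informative for language pairs that share a script and some surface vocabulary.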
Peer Reviewed
UPC subject areas::Computer science::Artificial intelligence::Natural language
Similarity (Language learning)
Natural language processing (Computer science)
Cross-Lingual Similarity
