The entropy of words: learnability and expressivity across more than 1000 languages

Other authors

Universitat Politècnica de Catalunya. Departament de Ciències de la Computació

Universitat Politècnica de Catalunya. LARCA - Laboratori d'Algorísmia Relacional, Complexitat i Aprenentatge

Publication date

2017-06-01

Abstract

The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory provides tools to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings. Firstly, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Secondly, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across languages of the world.
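As a rough illustration of the unigram word entropy mentioned in the abstract, the following minimal sketch computes the plug-in (maximum-likelihood) estimate in bits/word from a list of tokens. It is an assumption-laden example, not the estimation procedure used in the article: the function name `unigram_entropy` and the whitespace tokenization are hypothetical choices made here, and the plug-in estimator is known to underestimate entropy on small texts, which is one of the text-size problems the abstract refers to.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Plug-in estimate of unigram word entropy in bits/word.

    Hypothetical helper for illustration: relative frequencies are used
    directly as probabilities, with no bias correction.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one token")
    # H = -sum_w p(w) * log2 p(w), with p(w) = count(w) / total
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy input; naive whitespace tokenization is an assumption of this sketch.
tokens = "the cat saw the dog and the dog saw the cat".split()
print(f"{unigram_entropy(tokens):.3f} bits/word")
```

A text in which every word type occurs equally often gives the maximum value, the base-2 logarithm of the number of types; a text that repeats a single word gives zero.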


Peer Reviewed


Postprint (published version)

Document type

Article

Language

English

Related documents

http://www.mdpi.com/1099-4300/19/6/275

Rights

http://creativecommons.org/licenses/by/3.0/es/

Open Access

Attribution 3.0 Spain

This item appears in the following collection(s)

E-prints [73026]