To access the full text documents, please follow this link: http://hdl.handle.net/2117/77862

Zipf's law for word frequencies: Word forms versus lemmas in long texts
Corral, Álvaro; Boleda Torrent, Gemma; Ferrer Cancho, Ramon
Universitat Politècnica de Catalunya. Departament de Ciències de la Computació; Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural; Universitat Politècnica de Catalunya. LARCA - Laboratori d'Algorísmia Relacional, Complexitat i Aprenentatge
Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf's law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
Peer Reviewed
UPC subject areas::Computer science::Artificial intelligence::Natural language
Computational linguistics
Power laws
Distributions
Languages
http://creativecommons.org/licenses/by/3.0/es/
info:eu-repo/semantics/publishedVersion
Article
         
