Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans

dc.contributor.author
Dentella, Vittoria
dc.contributor.author
Guenther, Fritz
dc.contributor.author
Leivada, Evelina
dc.date.accessioned
2025-04-09T19:08:38Z
dc.date.available
2025-04-09T19:08:38Z
dc.date.issued
2024
dc.identifier
https://ddd.uab.cat/record/310414
dc.identifier
urn:10.48550/arXiv.2404.14883
dc.identifier
urn:oai:ddd.uab.cat:310414
dc.identifier
urn:oai:egreta.uab.cat:publications/2cf16d83-94df-4d34-b45f-0a0d388bde35
dc.identifier
urn:pure_id:475866216
dc.identifier.uri
http://hdl.handle.net/2072/483461
dc.description.abstract
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences narrow with model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans in only one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality in the same way that humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
dc.format
application/pdf
dc.language
eng
dc.publisher
dc.rights
open access
dc.rights
This document is subject to a Creative Commons use license. Total or partial reproduction, distribution, public communication of the work, and the creation of derivative works are permitted, even for commercial purposes, provided that the authorship of the original work is acknowledged.
dc.rights
https://creativecommons.org/licenses/by/4.0/
dc.subject
Large Language Models
dc.subject
Grammaticality
dc.subject
Language
dc.subject
Scaling
dc.title
Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans
dc.type
Working paper
