Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification

Inurrieta, Uxoa; Aduriz, Itziar; Diaz de Ilarraza, Arantza; Labaka, Gorka; Sarasola, Kepa

Author

Inurrieta, Uxoa

Aduriz, Itziar

Diaz de Ilarraza, Arantza

Labaka, Gorka

Sarasola, Kepa

Publication date

2021-03-11T11:21:57Z

2020-08-27

Abstract

Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs are extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than that reported in related work.

Document Type

Article

Published version

Language

English

Subjects and keywords

Morphology (Grammar); Semantics; Machine learning

Publisher

Public Library of Science (PLoS)

Related items

Reproduction of the document published at: https://doi.org/10.1371/journal.pone.0237767

PLoS One, 2020, vol. 15, num. 8, p. e0237767

Rights

cc-by (c) Inurrieta, Uxoa et al., 2020

http://creativecommons.org/licenses/by/3.0/es

This item appears in the following Collection(s)

Filologia Catalana i Lingüística General [949]

ISGlobal - Institut de Salut Global de Barcelona [60807]
