Discovery of semantic non-syntactic joins

Other authors

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació

Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering

Publication date

2024

Abstract

Data discovery is an essential step in the data integration pipeline: it involves finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, an assessment that is highly sensitive to the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, by performing comparisons based on the data values or on direct transformations of them (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, where the underlying datasets present high heterogeneity and a lack of standardization. To address it, in this paper we propose an empirical approach to detect semantic non-syntactic joins that simultaneously leverages syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting this kind of join.
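
To make the distinction between syntactic and semantic non-syntactic joins concrete, the following is a minimal Python sketch and not the method of the paper. It assumes Jaccard overlap of distinct values as the syntactic measurement and cosine similarity between mean value vectors as the semantic measurement. The names `classify_pair` and `TOY_EMBEDDINGS`, the hand-made two-dimensional vectors, and both thresholds are hypothetical and chosen only for illustration; a real setting would obtain vectors from a trained embedding model.

```python
from math import sqrt

# Hypothetical toy vectors standing in for a real embedding model (assumption).
TOY_EMBEDDINGS = {
    "Spain": [0.90, 0.10], "France": [0.80, 0.20],
    "ES": [0.85, 0.15], "FR": [0.90, 0.05],
    "red": [0.10, 0.90], "blue": [0.05, 0.95],
}


def jaccard(a, b):
    """Syntactic measurement: overlap of the distinct values of two columns."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is null)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def column_embedding(values, embed):
    """Mean vector of the distinct values that have an entry in `embed`."""
    vecs = [embed[v] for v in set(values) if v in embed]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(vec[i] for vec in vecs) / len(vecs) for i in range(dim)]


def classify_pair(a, b, embed, syn_threshold=0.3, sem_threshold=0.8):
    """Label a column pair using both measurements (thresholds are assumptions)."""
    syn = jaccard(a, b)
    ea, eb = column_embedding(a, embed), column_embedding(b, embed)
    sem = cosine(ea, eb) if ea and eb else 0.0
    if syn >= syn_threshold:
        label = "syntactic join"
    elif sem >= sem_threshold:
        label = "semantic non-syntactic join candidate"
    else:
        label = "no join"
    return label, syn, sem


names = ["Spain", "France"]
codes = ["ES", "FR"]
colours = ["red", "blue"]

for left, right in [(names, names), (names, codes), (names, colours)]:
    print(classify_pair(left, right, TOY_EMBEDDINGS)[0])
```

In this toy setup, country names and country codes share no values, so a purely syntactic comparison would miss the pair, while the semantic measurement flags it as a candidate. Looking at both measurements together is what separates this case from an ordinary syntactic join.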


Marc Maynou is supported by the EU’s Horizon Programme call under Grant Agreement No. 101093164 (ExtremeXP), and Sergi Nadal is partially supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under the funding scheme PID2020-117191RB-I00 / AEI / 10.13039/501100011033.


Peer Reviewed


Postprint (published version)

Document type

Conference lecture

Language

English

Published by

CEUR-WS.org

Related documents

https://ceur-ws.org/Vol-3653/short3.pdf

info:eu-repo/grantAgreement/EC/HE/101093164/EU/EXPeriment driven and user eXPerience oriented analytics for eXtremely Precise outcomes and decisions/ExtremeXP

info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-117191RB-I00/ES/DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO/

Rights

http://creativecommons.org/licenses/by/4.0/

Open Access

Attribution 4.0 International

This item appears in the following collection(s)

E-prints [72608]