Discovery of semantic non-syntactic joins

Other authors

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació

Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering

Publication date

2024

Abstract

Data discovery is an essential step in the data integration pipeline: it involves finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, an assessment that is highly sensitive to the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, by performing comparisons based on the data values or on direct transformations of them (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, where the underlying datasets are highly heterogeneous and lack standardization. To that end, in this paper we propose an empirical approach to detect semantic non-syntactic joins that simultaneously leverages syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting this kind of join.
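
The limitation of purely syntactic comparison described in the abstract can be illustrated with a minimal sketch. It is not the method proposed in the paper. The column values, the function names `jaccard` and `containment`, and the choice of these two set-overlap measures are assumptions made only for illustration. The sketch shows that value-overlap metrics score well when two attributes share a representation, and drop to zero when the same concept (here, countries) is written as names in one column and as codes in the other.

```python
def jaccard(a, b):
    """Jaccard similarity of the distinct values of two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of the distinct values of `a` that also appear in `b`."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical columns: the first two share a syntactic representation,
# the third encodes the same concept (countries) with different values.
country_names = ["Spain", "France", "Germany", "Italy"]
other_names = ["Spain", "France", "Portugal", "Italy"]
country_codes = ["ES", "FR", "DE", "IT"]

# Same representation: the syntactic measures detect the overlap.
same_repr_jaccard = jaccard(country_names, other_names)
same_repr_containment = containment(country_names, other_names)

# Different representation: no value overlaps, so both measures are zero
# even though the two columns describe the same semantic concept.
diff_repr_jaccard = jaccard(country_names, country_codes)
diff_repr_containment = containment(country_names, country_codes)

print(same_repr_jaccard, same_repr_containment)
print(diff_repr_jaccard, diff_repr_containment)
```

A pair such as `country_names` and `country_codes` is what the abstract calls a semantic non-syntactic join: detecting it requires some semantic signal in addition to value overlap.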


Marc Maynou is supported by the EU’s Horizon Programme call, under Grant Agreement No. 101093164 (ExtremeXP), and Sergi Nadal is partially supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under the funding scheme PID2020-117191RB-I00 / AEI / 10.13039/501100011033.


Peer Reviewed


Postprint (published version)

Document Type

Conference lecture

Language

English

Publisher

CEUR-WS.org

Related items

https://ceur-ws.org/Vol-3653/short3.pdf

info:eu-repo/grantAgreement/EC/HE/101093164/EU/EXPeriment driven and user eXPerience oriented analytics for eXtremely Precise outcomes and decisions/ExtremeXP

info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-117191RB-I00/ES/DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO/

Rights

http://creativecommons.org/licenses/by/4.0/

Open Access

Attribution 4.0 International
