Discovery of semantic non-syntactic joins

Other authors

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació

Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering

Publication date

2024

Abstract

Data discovery is an essential step in the data integration pipeline that involves finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, an assessment that is highly sensitive to, and dependent on, the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, by performing comparisons based on the data values or on direct transformations of them (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, where the underlying datasets are highly heterogeneous and lack standardization. To that end, in this paper, we propose an empirical approach to detect semantic non-syntactic joins, which simultaneously leverages syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting such joins.
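To illustrate the gap the abstract describes, the following minimal sketch compares two value-overlap measures that are commonly used for syntactic joinability (Jaccard similarity and containment). It is not the method proposed in the paper; the function names and the sample columns are hypothetical and chosen only for illustration. Two columns that both represent countries score zero when one stores names and the other stores codes, even though a join between them would be semantically meaningful.

```python
def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the distinct values of `a` that also appear in `b`."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0


# Hypothetical sample columns (illustrative data, not from the paper).
country_names = ["Spain", "France", "Italy", "Germany"]
country_names_other = ["France", "Spain", "Portugal", "Italy"]
country_codes = ["ES", "FR", "IT", "DE"]

# Same syntactic representation: value overlap signals a join candidate.
same_repr_jaccard = jaccard(country_names, country_names_other)
same_repr_containment = containment(country_names, country_names_other)

# Same concept, different representation: value overlap is empty,
# so a purely syntactic measure misses this semantic non-syntactic join.
diff_repr_jaccard = jaccard(country_names, country_codes)
diff_repr_containment = containment(country_names, country_codes)

print(same_repr_jaccard, same_repr_containment)
print(diff_repr_jaccard, diff_repr_containment)
```

In this sketch the name/code pair is indistinguishable from two unrelated columns, which is the kind of case that motivates combining syntactic measurements with semantic ones.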


Marc Maynou is supported by the EU’s Horizon Programme call, under Grant Agreement No. 101093164 (ExtremeXP), and Sergi Nadal is partially supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under the funding scheme PID2020-117191RB-I00 / AEI / 10.13039/501100011033.


Peer Reviewed


Postprint (published version)

Document type

Conference lecture

Language

English

Published by

CEUR-WS.org

Related documents

https://ceur-ws.org/Vol-3653/short3.pdf

info:eu-repo/grantAgreement/EC/HE/101093164/EU/EXPeriment driven and user eXPerience oriented analytics for eXtremely Precise outcomes and decisions/ExtremeXP

info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-117191RB-I00/ES/DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO/

Rights

http://creativecommons.org/licenses/by/4.0/

Open Access

Attribution 4.0 International

This item appears in the following collection(s)

E-prints [72608]