dc.contributor
Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.contributor
Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering
dc.contributor.author
Maynou Yelamos, Marc
dc.contributor.author
Nadal Francesch, Sergi
dc.identifier
Maynou, M.; Nadal, S. Discovery of semantic non-syntactic joins. In: International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data. "Proceedings of the 26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2024): co-located with the 27th International Conference on Extending Database Technology and the 27th International Conference on Database Theory (EDBT/ICDT 2024): Paestum, Italy, March 25, 2024". CEUR-WS.org, 2024, p. 73-77. ISSN 1613-0073.
dc.identifier
https://hdl.handle.net/2117/409838
dc.description.abstract
Data discovery is an essential step in the data integration pipeline: it involves finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, an assessment that is highly sensitive to the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, by performing comparisons based on the data values or on direct transformations of them (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, where the underlying datasets are highly heterogeneous and lack standardization. To address it, in this paper we propose an empirical approach to detect semantic non-syntactic joins that simultaneously leverages syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting this kind of join.
dc.description.abstract
Marc Maynou is supported by the EU’s Horizon Programme call, under Grant Agreement No. 101093164 (ExtremeXP), and Sergi Nadal is partially supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under the funding scheme PID2020-117191RB-I00 / AEI / 10.13039/501100011033.
dc.description.abstract
Peer Reviewed
dc.description.abstract
Postprint (published version)
dc.format
application/pdf
dc.relation
https://ceur-ws.org/Vol-3653/short3.pdf
dc.relation
info:eu-repo/grantAgreement/EC/HE/101093164/EU/EXPeriment driven and user eXPerience oriented analytics for eXtremely Precise outcomes and decisions/ExtremeXP
dc.relation
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-117191RB-I00/ES/DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO/
dc.rights
http://creativecommons.org/licenses/by/4.0/
dc.rights
Attribution 4.0 International
dc.subject
UPC subject areas::Computer science::Information systems
dc.subject
Data discovery
dc.subject
Semantic similarity
dc.subject
Syntactic similarity
dc.subject
Profile comparison
dc.subject
Distribution comparison
dc.subject
Data sets
dc.subject
Big data
dc.title
Discovery of semantic non-syntactic joins
dc.type
Conference lecture