Evaluating quality of disparate data sources: A discord-driven approach

Other authors

Universitat Politècnica de Catalunya. Doctorat en Computació

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació

Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering

Publication date

2025



Abstract

Among other measures of data quality, determining the reliability of conflicting values from different sources is especially challenging. Traditional data fusion approaches often infer correct values in simple cases, but they struggle to handle variations in data granularity (such as differences in temporal, spatial, or categorical aggregations) and offer limited insight into the nature of disagreements. We therefore propose a new source evaluation approach for numerical attributes that measures discordance (i.e., the extent to which sources differ from each other). Unlike existing methods that focus solely on point estimation, our approach supports both fine-grained and coarse-grained analysis, enabling more sophisticated data quality assessments. We employ a linear programming solver that transparently adapts to any data alignment expressed in a set of operators resembling relational algebra. Extensive experiments on real-world datasets demonstrate that our method generalizes existing truth discovery techniques that measure differences with Mean Absolute Error (MAE) or Root Mean Square Error (RMSE), and that it can adapt to diverse and complex scenarios.
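
As a rough illustration of the two error measures named in the abstract, the sketch below compares two sources that report the same numerical attribute at different temporal granularities. It is not the authors' linear programming method: it simply aligns the finer source to the coarser one by summation and then computes MAE and RMSE. The data values and the helper names (`aggregate`, `mae`, `rmse`) are hypothetical and chosen only for this example.

```python
from math import sqrt


def aggregate(values, group_size):
    """Sum consecutive groups of values, e.g. months into quarters."""
    return [sum(values[i:i + group_size])
            for i in range(0, len(values), group_size)]


def mae(a, b):
    """Mean Absolute Error between two aligned lists of numbers."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def rmse(a, b):
    """Root Mean Square Error between two aligned lists of numbers."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical data: source A reports monthly values, source B quarterly ones.
monthly_a = [10.0, 12.0, 11.0, 9.0, 10.0, 14.0]
quarterly_b = [30.0, 34.0]

# Align the finer-grained source to the coarser one (assumed here to be a sum).
quarterly_a = aggregate(monthly_a, 3)

print("MAE: ", mae(quarterly_a, quarterly_b))
print("RMSE:", rmse(quarterly_a, quarterly_b))
```

With these made-up values, the two sources disagree by 3 in the first quarter and by 1 in the second, so RMSE weights the larger disagreement more heavily than MAE does.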


Y. A. Akter is funded by the EC Horizon 2020 research and innovation programme (DEDS: grant agreement No 955895). A. Abelló and P. Jovanovic are funded by the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00/AEI/10.13039/501100011033 (DOGO4ML) and the EC Horizon Europe programme (ExtremeXP: grant agreement No 101093164).


Peer Reviewed


Postprint (author's final draft)

Document Type

Conference report

Language

English

Publisher

Springer

Related items

https://link.springer.com/chapter/10.1007/978-3-032-05281-0_10

info:eu-repo/grantAgreement/EC/H2020/955895/EU/Data Engineering for Data Science/DEDS

info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-117191RB-I00/ES/DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO/

info:eu-repo/grantAgreement/EC/HE/101093164/EU/EXPeriment driven and user eXPerience oriented analytics for eXtremely Precise outcomes and decisions/ExtremeXP

Rights

Restricted access - publisher's policy