To access the full text documents, please follow this link: http://hdl.handle.net/2117/175495

Keeping the data lake in form: DS-kNN datasets categorization using proximity mining
Al-serafi, Ayman Mounir Mohamed; Abelló Gamazo, Alberto; Romero Moral, Óscar; Calders, Toon
Universitat Politècnica de Catalunya. Doctorat Erasmus Mundus en Tecnologies de la Informació per a la Intel·ligència Empresarial; Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació; Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering; Universitat Politècnica de Catalunya. IMP - Information Modeling and Processing
As the number of datasets stored in data repositories grows, Data Lakes (DLs) are increasingly used to store them. DLs keep datasets in their raw formats, without any transformation or preprocessing, and make them accessible through schema-on-read. This makes it difficult for analysts to find datasets that belong to the same topic and can be crossed with one another. To support analysts in this DL governance challenge, this paper proposes an algorithm that categorizes the datasets in a DL into pre-defined, topic-wise categories of interest. The algorithm uses a k-NN approach in which a proximity score, computed from dataset metadata, measures the similarity between datasets. We test the algorithm on a real-life DL with a known ground-truth categorization. The approach detects the correct categories of datasets, as well as outliers, with a precision of more than 90% and recall rates exceeding 75% in specific settings.
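The idea described in the abstract, assigning a dataset to the category of its most similar neighbours and flagging it as an outlier when nothing is similar enough, can be illustrated with a minimal Python sketch. This is not the authors' DS-kNN implementation: the proximity score here is a plain Jaccard overlap of attribute names, chosen only for illustration, and the names `jaccard`, `categorize`, `LABELLED`, `k` and `min_proximity` are all hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Stand-in proximity score (an assumption, not the paper's score):
    overlap between two sets of attribute names, in the range 0..1."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def categorize(new_meta, labelled, k=3, min_proximity=0.5, proximity=jaccard):
    """Return the majority category among the k most similar labelled
    datasets, or None (outlier) if no dataset reaches min_proximity."""
    # Score the new dataset against every dataset with a known category.
    scored = [(proximity(new_meta, meta), category) for meta, category in labelled]
    # Keep only sufficiently similar neighbours, most similar first, at most k.
    near = sorted(
        (pair for pair in scored if pair[0] >= min_proximity),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    if not near:
        return None  # nothing similar enough: treat the dataset as an outlier
    # Majority vote over the categories of the retained neighbours.
    votes = Counter(category for _, category in near)
    return votes.most_common(1)[0][0]

# Hypothetical metadata: each dataset is reduced to its attribute names.
LABELLED = [
    (frozenset({"patient_id", "age", "diagnosis"}), "health"),
    (frozenset({"patient_id", "age", "blood_pressure"}), "health"),
    (frozenset({"ticker", "open", "close"}), "finance"),
]

health_guess = categorize(frozenset({"patient_id", "age", "heart_rate"}), LABELLED)
outlier_guess = categorize(frozenset({"latitude", "longitude"}), LABELLED)
```

In this toy setting the first query shares two of four attribute names with each health dataset and is assigned to that category, while the second query shares nothing with any labelled dataset and is returned as an outlier (`None`). The threshold and the value of k are illustrative knobs, not settings reported in the paper.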
Peer Reviewed
-UPC subject areas::Computer science::Information systems
-Metadata
-Information storage and retrieval systems
-Data mining
-Data lake categorization
-k-Nearest-Neighbour
-Metadata management
-Proximity mining
Article - Submitted version
Conference Object
Springer
         
