Automatically configuring parallelism for hybrid layouts

To access the full text documents, please follow this link: http://hdl.handle.net/2117/175616

Title:	Automatically configuring parallelism for hybrid layouts
Author:	Munir, Rana Faisal; Abelló Gamazo, Alberto; Romero Moral, Óscar; Thiele, Maik; Lehner, Wolfgang
Other authors:	Universitat Politècnica de Catalunya. Doctorat Erasmus Mundus en Tecnologies de la Informació per a la Intel·ligència Empresarial; Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació; Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering; Universitat Politècnica de Catalunya. IMP - Information Modeling and Processing
Abstract:	Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is always determined by the total file size. However, with hybrid layouts this can launch more tasks than needed, because such layouts allow less data to be read for certain operations (i.e., projection and selection). Over-provisioning tasks may increase job execution time and waste significant computing resources, because each task introduces extra overhead (e.g., initialization and garbage collection). To use resources more efficiently and reduce job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can be used in a multi-objective approach to decide both the number of tasks and the number of machines for execution.
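The contrast drawn in the abstract, between sizing the number of tasks by total file size and sizing it by the data being read, can be illustrated with a minimal sketch. This is not the authors' cost model, which the abstract does not specify. The function names, the per-column size map, the `selectivity` fraction, and the 128 MiB split size are all hypothetical values chosen for illustration.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def tasks_from_file_size(file_size_bytes, split_size_bytes):
    """Baseline: one task per split of the whole file."""
    return max(1, math.ceil(file_size_bytes / split_size_bytes))


def tasks_from_data_read(column_sizes_bytes, projected_columns,
                         selectivity, split_size_bytes):
    """Illustrative alternative: size the task count by the bytes a query
    is expected to read from a hybrid (column-grouped) layout.

    Assumptions (not from the source): only projected columns are read,
    and `selectivity` is the fraction of their data that survives
    selection-based skipping.
    """
    projected_bytes = sum(column_sizes_bytes[c] for c in projected_columns)
    bytes_read = projected_bytes * selectivity
    return max(1, math.ceil(bytes_read / split_size_bytes))


# Hypothetical 16 GiB file with three columns.
columns = {"id": 1 * GIB, "name": 3 * GIB, "payload": 12 * GIB}
split = 128 * MIB

baseline = tasks_from_file_size(sum(columns.values()), split)
informed = tasks_from_data_read(columns, ["id", "name"], 0.5, split)
print(baseline, informed)  # 128 16
```

In this made-up case the file-size rule would launch 128 tasks, while the query reads only 2 GiB, which fits in 16 tasks at the same split size. A real cost model would also have to weigh the per-task overhead the abstract mentions.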
Note:	Peer Reviewed
Subject(s):	-UPC subject areas::Computer science::Information systems -Big data -Electronic data processing -- Distributed processing -Hybrid storage layouts -Parallelism -Parquet -Spark -Distributed data processing
Rights:
Document type:	Article - Submitted version; Conference Object
Published by:	Springer