dc.contributor |
Barcelona Supercomputing Center |
dc.contributor.author |
Poggi, Nicolas |
dc.contributor.author |
Montero, Alejandro |
dc.contributor.author |
Carrera, David |
dc.date |
2017-12-30 |
dc.identifier.citation |
Poggi, N.; Montero, A.; Carrera, D. Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments. In: "TPCTC 2017: Performance Evaluation and Benchmarking for the Analytics Era. Lecture Notes in Computer Science". Springer Verlag, 2017, p. 55-74. |
dc.identifier.citation |
978-3-319-72400-3 |
dc.identifier.citation |
10.1007/978-3-319-72401-0_5 |
dc.identifier.uri |
http://hdl.handle.net/2117/114812 |
dc.language.iso |
eng |
dc.publisher |
Springer Verlag |
dc.relation |
https://link.springer.com/chapter/10.1007/978-3-319-72401-0_5 |
dc.relation |
info:eu-repo/grantAgreement/EC/H2020/639595/EU/Holistic Integration of Emerging Supercomputing Technologies/Hi-EST |
dc.relation |
info:eu-repo/grantAgreement/ES/PE2013-2016/TIN2015-65316-P |
dc.rights |
Attribution-NonCommercial-NoDerivs 3.0 Spain |
dc.rights |
info:eu-repo/semantics/openAccess |
dc.rights |
http://creativecommons.org/licenses/by-nc-nd/3.0/es/ |
dc.subject |
Àrees temàtiques de la UPC::Informàtica |
dc.subject |
High performance computing |
dc.subject |
Big Data Analytics Systems (BDAS) |
dc.subject |
BigBench |
dc.subject |
Supercomputers |
dc.title |
Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments |
dc.type |
info:eu-repo/semantics/submittedVersion |
dc.type |
info:eu-repo/semantics/conferenceObject |
dc.description.abstract |
BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases (queries) which require a broad combination of data extraction techniques, including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning, to fulfill them. However, there is currently no widespread knowledge of the different resource requirements and expected performance of each query, as is the case for more established benchmarks. Moreover, over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the stable release of v2. It is our intent to compare the current state of Spark to Hive's base implementation, which can use either the legacy M/R engine and Mahout or the current Tez and MLlib frameworks. At the same time, cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. This study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud, while also comparing popular PaaS offerings from Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc in terms of reliability, data scalability (1 GB to 10 TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries are the most resource-consuming in terms of CPU, memory, and I/O. Scalability results show that configuration tuning is needed in most cloud providers as the data scale grows, especially for Spark's memory usage. These results can help practitioners quickly test systems by picking a subset of the queries that stresses each of the categories. At the same time, the results show how Hive and Spark compare and what performance can be expected of each in PaaS. |
dc.description.abstract |
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and by the Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). |
dc.description.abstract |
Peer Reviewed |