Abstract:
|
Big data has become the cornerstone of modern knowledge-based systems. However, taking advantage of the knowledge found in big data sets requires advanced solutions to store, access, and analyze data in a feasible way, whether online, offline, or both. Such solutions require, on the one hand, a better understanding of the computational needs of big data and, on the other, the design of new computational infrastructures for that purpose. This paper evaluates the performance, in terms of CPU, load, and memory utilization, and the scalability of several clustering and collaborative filtering algorithms in Apache Spark MLlib, which provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. The aim is to characterize the performance of these algorithms and to draw conclusions about their application to real-life problems. To that end, the performance evaluations use a large-scale Google cluster usage trace dataset. |