Data preparation as a service based on Apache Spark

Abstract

Data preparation is the process of collecting, cleaning and consolidating raw datasets into clean data of a certain quality. It is an important part of almost every data analysis process, yet it remains tedious and time-consuming. The complexity of the process is further increased by the recent tendency to derive knowledge from very large datasets. Existing data preparation tools provide limited capabilities for processing such large volumes of data effectively. On the other hand, frameworks and software libraries that do address the requirements of big data require expert knowledge in various technical areas. In this paper, we propose a dynamic, service-based, scalable data preparation approach that aims to solve the challenges of large-scale data preparation while retaining the accessibility and flexibility provided by data preparation tools. Furthermore, we describe its implementation and integration with an existing data preparation framework, Grafterizer. Our solution is based on Apache Spark and exposes application programming interfaces (APIs) for integration with external tools. Finally, we present experimental results that demonstrate the improvements to the scalability of Grafterizer.

Client

EC/H2020 / 732590
EC/H2020 / 732003
EC/H2020 / 644497

Language

English

Author(s)

Nivethika Mahasivam
Nikolay Nikolov
Dina Sukhobok
Titi Roman

Affiliation

SINTEF Digital / Sustainable Communication Technologies

Year

2017

Published in

Lecture Notes in Computer Science (LNCS)

ISSN

0302-9743

Publisher

Springer

Volume

10465

Page(s)

125–139

DOI

https://doi.org/10.1007/978-3-319-67262-5_10

Read fulltext

https://hdl.handle.net/11250/2478902
