Big Data: from collection to visualization
Authors: Mohammed Ghesmoune, Hanene Azzag, Salima Benbernou, Mustapha Lebbah, Tarn Duong, Mourad Ouziri
Abstract
Organisations increasingly rely on Big Data to discover correlations and patterns that would previously have remained hidden, and to use this new information to improve the quality of their business activities. In this paper we present a ‘story’ of Big Data, from initial data collection to final visualization, by way of data fusion, analysis, and clustering. To this end, we present a complete workflow covering (a) how to represent heterogeneous collected data using the high-performance RDF language, how to fuse Big Data in RDF by resolving the issue of entity disambiguation, and how to query those data to provide more relevant and complete knowledge; and (b) since the data arrive as data streams, batchStream, a micro-batching version of the growing neural gas (GNG) approach, which clusters data streams in a single pass over the data. The batchStream algorithm discovers clusters of arbitrary shape without any assumption on the number of clusters. This Big Data workflow is implemented on the Spark platform, and we demonstrate it on synthetic and real data.
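To make the micro-batching idea concrete, the following is a minimal single-machine sketch of a growing-neural-gas-style update applied one micro-batch at a time. It is not the authors' batchStream implementation and does not use Spark; the function names (`micro_batches`, `process_batch`), the parameter names, and their default values are assumptions chosen for illustration, and node removal is omitted for brevity.

```python
import math
from itertools import islice


def micro_batches(stream, size):
    """Cut an iterable of points into consecutive lists of at most `size` points."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def nearest_two(nodes, x):
    """Return the indices of the closest and second-closest nodes to x."""
    order = sorted(range(len(nodes)), key=lambda i: math.dist(nodes[i], x))
    return order[0], order[1]


def process_batch(nodes, edges, errors, batch,
                  eps_b=0.1, eps_n=0.01, max_age=25, decay=0.9):
    """Update the graph in place with one micro-batch (simplified, hypothetical step).

    nodes  : list of prototype vectors (at least two)
    edges  : dict mapping frozenset({i, j}) to the age of that edge
    errors : list of accumulated errors, one per node
    """
    for x in batch:
        b, s = nearest_two(nodes, x)
        errors[b] += math.dist(nodes[b], x) ** 2
        # Move the winner towards the point.
        nodes[b] = [w + eps_b * (xi - w) for w, xi in zip(nodes[b], x)]
        # Age the winner's edges and move its neighbours slightly.
        for e in list(edges):
            if b in e:
                edges[e] += 1
                (n,) = e - {b}
                nodes[n] = [w + eps_n * (xi - w) for w, xi in zip(nodes[n], x)]
        # Connect (or refresh) the edge between the two nearest nodes.
        edges[frozenset((b, s))] = 0
        # Drop edges that have become too old.
        for e in [e for e, age in edges.items() if age > max_age]:
            del edges[e]

    # Once per micro-batch: insert a node between the highest-error node
    # and its highest-error neighbour, if it has any neighbour.
    q = max(range(len(nodes)), key=errors.__getitem__)
    neighbours = [next(iter(e - {q})) for e in edges if q in e]
    if neighbours:
        f = max(neighbours, key=errors.__getitem__)
        nodes.append([(a + c) / 2 for a, c in zip(nodes[q], nodes[f])])
        r = len(nodes) - 1
        del edges[frozenset((q, f))]
        edges[frozenset((q, r))] = 0
        edges[frozenset((f, r))] = 0
        errors[q] *= 0.5
        errors[f] *= 0.5
        errors.append(errors[q])

    # Let old errors fade so that recent data dominates.
    errors[:] = [e * decay for e in errors]
```

Each point is read once, and the graph grows by at most one node per micro-batch, so the number of clusters is never fixed in advance. A caller would start from two nodes, an empty `edges` dict, and zero errors, then call `process_batch` on each list yielded by `micro_batches(stream, size)`.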
Keywords: Data fusion, RDF, Semantic, Entity resolution, Big data, Map-Reduce, Spark, Data stream clustering, Micro-Batch streaming, GNG, Topological structure, Visualization
Paper URL: https://doi.org/10.1007/s10994-016-5622-4