Distributed real-time ETL architecture for unstructured big data
Authors: Erum Mehmood, Tayyaba Anees
Abstract
Real-time extract-transform-load (ETL) is integral to meeting the growing demand for faster business decisions across a large number of modern applications. Because of the volume and velocity of the data, the building blocks of real-time ETL are the extraction of multi-source unstructured data streams and their transformation using disk data in a distributed environment. Designing an architecture for these building blocks therefore remains a major challenge. In this paper, we focus primarily on expediting stream-disk joins during the transformation phase of ETL; this join is considered the most expensive operator in stream processing because of frequent disk access. We propose an architecture for real-time ETL that ingests unstructured data streams from multiple sources, without having to worry about the structure of the data sources, and transforms them after joining with distributed disk data. We also present a novel data-pipeline stream-disk join that uses partition-based input and a best-effort in-memory database technique to reduce frequent disk access. The proposed architecture addresses the challenges of stream data loss, ignored unmatched streams, disk overhead, and real-time processing in a distributed environment. Experimental results obtained using a stream generator and real-world datasets on local and distributed machines show that the proposed architecture yields significantly improved throughput, especially for large numbers of stream tuples with large datasets.
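The idea of a stream-disk join that avoids repeated disk access can be illustrated with a minimal sketch. This is not the authors' pipeline: the class `CachedDiskJoin`, the helper `partition`, the LRU eviction policy, and the plain dictionary standing in for the disk-resident collection are all assumptions made for illustration. The sketch groups stream tuples into partitions by join key, looks up each key through a bounded in-memory cache, and keeps unmatched tuples in a separate list instead of dropping them.

```python
from collections import OrderedDict


class CachedDiskJoin:
    """Hypothetical stream-disk join with a bounded LRU cache in front of the disk."""

    def __init__(self, disk_lookup, cache_size=1024):
        self.disk_lookup = disk_lookup  # callable: key -> record dict, or None if absent
        self.cache = OrderedDict()      # key -> record (None is cached for absent keys)
        self.cache_size = cache_size
        self.disk_reads = 0             # counts how often the slow path is taken

    def _fetch(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        self.disk_reads += 1
        record = self.disk_lookup(key)
        self.cache[key] = record
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return record

    def join_batch(self, batch, key_field):
        """Return (joined, unmatched); unmatched tuples are kept, not discarded."""
        joined, unmatched = [], []
        for tup in batch:
            record = self._fetch(tup[key_field])
            if record is None:
                unmatched.append(tup)
            else:
                joined.append({**record, **tup})
        return joined, unmatched


def partition(stream, key_field, num_partitions):
    """Group stream tuples by hash of the join key (assumed partitioning scheme)."""
    parts = [[] for _ in range(num_partitions)]
    for tup in stream:
        parts[hash(tup[key_field]) % num_partitions].append(tup)
    return parts


# A dict stands in for the disk-resident data in this sketch.
disk = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
join = CachedDiskJoin(disk.get, cache_size=2)
stream = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 3, "v": 12}]
joined, unmatched = join.join_batch(stream, "id")
parts = partition(stream, "id", 2)
```

In this toy run the second tuple with `id` 1 is served from the cache, so only two lookups reach the simulated disk, and the tuple with `id` 3 ends up in `unmatched` rather than being lost. A real deployment would replace the dictionary with a database query and would have to decide how long cached misses remain valid.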
Keywords: Real-time stream, Distributed big data, Spark Structured Streaming, Apache Kafka, MongoDB, Semi-structured, Unstructured data, ETL, Data pipeline
Paper URL: https://doi.org/10.1007/s10115-022-01757-7