Compression and aggregation of Bayesian estimates for data intensive computing

作者:Ruibin Xi, Nan Lin, Yixin Chen, Youngjin Kim

摘要

Bayesian estimation is a major and robust estimator for many advanced statistical models. Being able to incorporate prior knowledge in statistical inference, Bayesian methods have been successfully applied in many different fields such as business, computer science, economics, epidemiology, genetics, imaging, and political science. However, due to its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation schemes (C&A) that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming partitioning of a large dataset, the C&A scheme compresses each partition into a synopsis and aggregates the synopsis into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves tremendous computing time since it processes each partition only once, enabling fast incremental update, and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches to zero when the data size increases. The results show that the proposed C&A scheme can make feasible OLAP of Bayesian estimates in a data cube. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically save time and space costs with almost no degradation of the modeling accuracy.

论文关键词:Bayesian estimation, Data cubes, OLAP, Stream data mining, Compression, Aggregation

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-011-0459-4