Complete tolerance relation based parallel filling for incomplete energy big data
Authors:
Abstract
As the era of cloud and big data computing approaches, renewable energy such as solar energy is increasingly integrated into data center power provisioning systems. However, because renewable energy supply is intermittent and time-varying (e.g., shortages or failures), power statistics cannot always be collected, which results in missing data. In this paper, we propose a filling algorithm based on the complete tolerance class to address missing values in energy big data. Traditional rough-set-based methods are likely to fail when data are severely missing, and their computation of the tolerance relation and tolerance class is more complex, which makes them unsuitable for large-scale, time-varying energy big data. Our proposed algorithm extends the tolerance relation to a complete tolerance relation, from which the complete tolerance classes are partitioned. The algorithm then fills the missing attribute values of the energy big data in the data center, which ensures data integrity and improves classification accuracy. We further parallelize and optimize our algorithm on the state-of-the-art Spark cluster computing framework. In addition, we propose an adaptive management architecture that handles incomplete energy big data in green data centers. The architecture integrates techniques for preprocessing energy data, filling incomplete energy data, and building a decision model. It increases the efficiency of power assignment between solar power and utility power while enhancing load performance and service availability, and can therefore provide better service for green data centers. We perform comprehensive experiments on an energy data set, and the results show that the Completing Incomplete Big Data (CIBD) algorithm guarantees the completeness of the data while improving filling accuracy by 10% compared with general filling algorithms such as MEAN or ERS. The benefit of the proposed algorithm and architecture grows as the data missing rate increases. We further use the filled data to build a random forest model, which yields desirable results. Compared with the Hadoop-based filling algorithm, the CIBD algorithm improves processing speed by 50% with 4 GB of data.
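To make the idea of tolerance-class-based filling concrete, the following is a minimal single-machine sketch. It is not the CIBD algorithm: the abstract does not define the complete tolerance relation, so the sketch assumes the classical tolerance relation from rough set theory, in which two records are tolerant if, on every attribute, their values agree or at least one of them is missing. The function names, the use of `None` for a missing value, and the choice to fill with the most frequent value in the tolerance class are all illustrative assumptions.

```python
from collections import Counter
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]  # None marks a missing attribute value (assumption)


def tolerant(x: Row, y: Row) -> bool:
    """Classical tolerance relation: on every attribute the two values
    are equal, or at least one of them is missing."""
    return all(a is None or b is None or a == b for a, b in zip(x, y))


def tolerance_class(rows: List[Row], i: int) -> List[int]:
    """Indices of all other records that are tolerant with record i."""
    return [j for j, r in enumerate(rows) if j != i and tolerant(rows[i], r)]


def fill_missing(rows: List[Row]) -> List[List[Optional[str]]]:
    """Fill each missing value with the most frequent known value of that
    attribute among the record's tolerance class. A value stays None when
    no tolerant record carries a known value for the attribute."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        neighbours = tolerance_class(rows, i)
        for k, value in enumerate(row):
            if value is not None:
                continue
            candidates = [rows[j][k] for j in neighbours if rows[j][k] is not None]
            if candidates:
                filled[i][k] = Counter(candidates).most_common(1)[0][0]
    return filled


if __name__ == "__main__":
    # Hypothetical records: (load level, weather, solar supply state)
    records = [
        ("high", "sunny", None),
        ("high", "sunny", "on"),
        ("low", "cloudy", "off"),
        ("high", None, "on"),
    ]
    for r in fill_missing(records):
        print(r)
```

The pairwise comparison is quadratic in the number of records, which illustrates why the abstract treats the cost of computing tolerance classes as an obstacle for large-scale data and why the authors parallelize their method on Spark.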
Keywords: Green data center, Incomplete energy big data, Parallel filling on Spark, Complete toleration class, Adaptive management architecture of incomplete energy big data
Review history: Received 23 May 2016, Revised 15 June 2017, Accepted 20 June 2017, Available online 23 June 2017, Version of Record 24 July 2017.
Official paper URL: https://doi.org/10.1016/j.knosys.2017.06.027