Heterogeneous data release for cluster analysis with differential privacy

Authors:

Highlights:

Abstract

Many models have been proposed to preserve data privacy in different data publishing scenarios. Among these models, ϵ-differential privacy has drawn increasing attention in recent years because of its rigorous privacy guarantees. Many existing ϵ-differentially private solutions handle relational data and set-valued data separately, yet most real-life data, such as electronic health records, are heterogeneous, combining both forms. Privacy protection for heterogeneous data has not been widely studied. Furthermore, many existing works on privacy protection aim to preserve utility for frequent itemset mining or classification analysis, while few have focused on data publication for cluster analysis. In this paper, we propose the first differentially private solution for releasing heterogeneous data for cluster analysis. The main challenge is how to mask raw data without any explicit guidance, since cluster analysis provides no class labels to steer the masking. Our approach addresses this challenge by converting the clustering problem into a classification problem, in which class labels encode the cluster structure of the raw data and guide the masking process. The approach generalizes the raw data probabilistically and adds noise to the result to satisfy ϵ-differential privacy. We validate the performance of our approach through extensive experiments on real-life datasets.
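The noise-adding step mentioned in the abstract can be illustrated with a minimal sketch of the standard Laplace mechanism applied to counts over generalized groups. This is not the authors' algorithm: the abstract gives no algorithmic detail, so the function names (`laplace_noise`, `noisy_counts`, `age_band`), the fixed age-band generalization, and the toy records are all hypothetical. The sketch assumes that each record already carries a class label encoding its cluster, that the generalization function and the list of cells are fixed independently of the data, and that neighboring datasets differ by adding or removing one record, so each count has sensitivity 1. How the label-generation step is accounted for in the privacy budget is not described in the abstract and is not modeled here.

```python
import random
from collections import Counter


def laplace_noise(epsilon, rng):
    """Sample Laplace(0, 1/epsilon) as the difference of two Exp(epsilon) draws."""
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)


def noisy_counts(records, generalize, domain, epsilon, seed=None):
    """Return a noisy count for every (generalized value, label) cell in `domain`.

    Assumption: the cells are disjoint and `domain` is chosen without looking
    at the data, so adding or removing one record changes exactly one count
    by 1 and Laplace noise with scale 1/epsilon suffices.
    """
    rng = random.Random(seed)
    true_counts = Counter((generalize(value), label) for value, label in records)
    return {
        cell: true_counts.get(cell, 0) + laplace_noise(epsilon, rng)
        for cell in domain
    }


def age_band(age):
    """Hypothetical fixed generalization: map an exact age to a coarse band."""
    return "<40" if age < 40 else ">=40"


# Toy records: (age, cluster label). The labels are assumed to be given.
records = [(23, "c1"), (35, "c1"), (41, "c2"), (67, "c2"), (52, "c2")]
domain = [(band, label) for band in ("<40", ">=40") for label in ("c1", "c2")]

released = noisy_counts(records, age_band, domain, epsilon=1.0, seed=0)
for cell in domain:
    print(cell, round(released[cell], 2))
```

Noise is added to every cell in `domain`, including empty ones, because releasing only the non-empty cells would reveal which combinations occur in the raw data.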

Keywords: Data publishing, Heterogeneous data, Differential privacy, Cluster analysis

Article history: Received 26 June 2019, Revised 15 May 2020, Accepted 16 May 2020, Available online 20 May 2020, Version of Record 22 May 2020.

Paper URL: https://doi.org/10.1016/j.knosys.2020.106047