Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costs
作者:Lingling Zhang, Fang Wang, Hong Jiang, Dan Feng, Yanwen Xie, Zhiwei Zhang, Guoren Wang
摘要
Random-walk-based sampling is an efficient way to extract and analyze the properties of large and complex graphs representing social networks. However, it is almost impractical for existing random-walk-based sampling schemes to reach the desired node distribution because of the indeterministic sampling budget (i.e., the number of samples or sampling steps) required for doing so with large volumes of data in graphs. On the other hand, under a small sampling budget, these methods produce low-quality samples with many repeats and high correlations (i.e., many common attributes), which leads to a large deviation from the desired node distribution and large estimation errors. In this paper, we propose a new random-walk sampling scheme based on node cliques (a subset of cliques), called node-clique random walk, or NCRW, to strike a good balance between the estimation error and the sampling budget, by producing unique samples with low correlations. Meanwhile, both the deviation from the desired node distribution and the estimation errors under the constraint of the sampling budget are reduced both theoretically and experimentally. Thus, the sampling costs which are closely related to the sampling budget are reduced. Our extensive experimental evaluation driven by real-world datasets further confirms that NCRW significantly increases the quality of samples and accuracy of estimations with much lower costs than those of existing random-walk-based sampling schemes especially in estimating the higher-order node attributes.
论文关键词:Random walk, Sampling, Estimating, Graph mining
论文评审过程:
论文官网地址:https://doi.org/10.1007/s10115-022-01691-8