logo

SCIENTIA SINICA Informationis, Volume 51 , Issue 7 : 1199(2021) https://doi.org/10.1360/SSI-2020-0076

Data-adaptive privacy-preserving mechanism for data stream publishing in real-time

More info
  • ReceivedMar 31, 2020
  • AcceptedJun 8, 2020
  • PublishedJun 22, 2021

Abstract


Funded by

国家自然科学基金(61772410,61802298,U1811461)

中国博士后基金(2017M623177)

中央高校基本科研业务费(xjj2018237)


References

[1] Guo B, Wang Z, Yu Z. Mobile Crowd Sensing and Computing. ACM Comput Surv, 2015, 48: 1-31 CrossRef Google Scholar

[2] Zhang X, Hamm J, Reiter M K, et al. Statistical privacy for streaming traffic. In: Proceedings of the Network and Distributed System Security Symposium, San Diego, 2019. 1--15. Google Scholar

[3] Cao Y, Yoshikawa M. Differentially Private Real-Time Data Publishing over Infinite Trajectory Streams. IEICE Trans Inf Syst, 2016, E99.D: 163-175 CrossRef ADS Google Scholar

[4] Wang J Y, Liu C, Fu X C, et al. Crucial patterns mining with differential privacy over data streams. J Soft, 2019, 30: 158--176. Google Scholar

[5] Li M, Zhu H, Gao Z, et al. All your location are belong to us: breaking mobile social networks for automated user location tracking. In: Proceedings of ACM International Symposium on Mobile Ad Hoc Networking and Computing, Philadelphia, 2014. 43--52. Google Scholar

[6] Ji S, Li W, Srivatsa M. General Graph Data De-Anonymization. ACM Trans Inf Syst Secur, 2016, 18: 1-29 CrossRef Google Scholar

[7] Sun L, Zhang L, Ge C. An algorithm for differential privacy streaming data publication based on matrix mechanism under exponential decay mode. Sci Sin-Inf, 2017, 47: 1493-1509 CrossRef Google Scholar

[8] Cao Y, Yoshikawa M, Xiao Y. Quantifying Differential Privacy in Continuous Data Release Under Temporal Correlations. IEEE Trans Knowl Data Eng, 2019, 31: 1281-1295 CrossRef Google Scholar

[9] Cao Y, Yoshikawa M, Xiao Y, et al. Quantifying differential privacy under temporal correlations. In: Proceedings of IEEE International Conference on Data Engineering, San Diego, 2017. 821--832. Google Scholar

[10] Dwork C, Roth A. The Algorithmic Foundations of Differential Privacy. FNT Theor Comput Sci, 2014, 9: 211-407 CrossRef Google Scholar

[11] Zhang X J, Meng X F. Differential privacy in data publication and analysis. J Comput, 2014, 4: 197--219. Google Scholar

[12] Dwork C, Naor M, Pitassi T, et al. Differential privacy under continual observation. In: Proceedings of ACM Symposium on Theory of Computing, Cambridge, 2010. 715--724. Google Scholar

[13] Dwork C. Differential privacy in new settings. In: Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA19), Austin, 2010. 174--183. Google Scholar

[14] Kellaris G, Papadopoulos S, Xiao X. Differentially private event sequences over infinite streams. Proc VLDB Endow, 2014, 7: 1155-1166 CrossRef Google Scholar

[15] Chen R, Shen Y, Jin H. Private analysis of infinite data streams via retroactive grouping. In: Proceedings of ACM International Conference on Information and Knowledge Management, Melbourne, 2015. 1061--1070. Google Scholar

[16] Liyue Fan , Li Xiong . An Adaptive Approach to Real-Time Aggregate Monitoring With Differential Privacy. IEEE Trans Knowl Data Eng, 2014, 26: 2094-2106 CrossRef Google Scholar

[17] Wang Q, Zhang Y, Lu X, et al. RescueDP: real-time spatio-temporal crowd-sourced data publishing with differential privacy. In: Proceedings of International Conference on Computer Communications, San Francisco, 2016. 1--9. Google Scholar

[18] Chen Y, Machanavajjhala A, Hay M, et al. PeGaSus: Data-adaptive differentially private stream processing. In: Proceedings of ACM Conference on Computer and Communications Security, Dallas, 2017. 1375--1388. Google Scholar

[19] Papadimitriou S, Li F, Kollios G, et al. Time series compressibility and privacy. In: Proceedings of Very Large Data Bases, Vienna, 2007. 459--470. Google Scholar

[20] Wang H, Xu Z. CTS-DP: Publishing correlated time-series data via differential privacy. Knowledge-Based Syst, 2017, 122: 167-179 CrossRef Google Scholar

[21] Wang T, Yang X Y, Ren X B, et al. Adaptive differentially private data stream publishing in spatio-temporal monitoring of IoT. In: Proceedings of IEEE International Performance Computing and Communications Conference, London, 2019. 1--8. Google Scholar

[22] Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Proceedings of Theory of Cryptography Conference, New York, 2006. 265--284. Google Scholar

[23] Kifer D, Lin B R. Towards an axiomatization of statistical privacy and utility. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Indianapolis, 2010. 147--158. Google Scholar

  • Figure 1

    (Color online) A general architecture for privacy-preserving stream aggregation and publishing

  • Figure 2

    (Color online) An example of hierarchical aggregated stream publishing

  • Figure 3

    (Color online) The framework of the data-adaptive privacy-preserving mechanism

  • Figure 4

    (Color online) A visualized example of DimParti

  • Figure 5

    (Color online) Visualizations of the protection for Flu stream ($\epsilon=1$). (a) $\omega=10$; (b) $\omega=100$; (c) $\omega=10$

  • Table 1   Comparisons of related methods with our proposed mechanism
    Algorithm Privacy level Dimension Stream scenario Learn spatial/temporal correlation Data-adaptive
    FAST [16] User-level DP Single Finite No/Yes No
    BD/BA [14] $\omega$-event DP Multiple Infinite No/Yes
    RescueDP [17] $\omega$-event DP Multiple Infinite Yes/Yes No
    PeGaSus [18] Event-level DP Multiple Infinite No/Yes No
    AdaPub $\omega$-event DP Multiple Infinite Yes/Yes Yes
  •   

    Algorithm 1 AdaPub: privately adaptive stream publishing in real-time

    Require:$\textit{\textbf{X}}^d~=~\{X_1^d,X_2^d,~\ldots~\}$: $d$-dimensional infinite stream;

    Output:$\textit{\textbf{R}}^d~=~\{R_1^d,R_2^d,~\ldots~\}$: sanitized $d$-dimensional infinite stream;

    for each timestamp $t$

    Execute multiple hash-based dimension partition: ${{\mathcal~P}_t}~=~{\mathop{\rm~{DimParti}}\nolimits}~(R_{t~-~1}^d,g)$;

    for each partition ${p}$ in ${{\mathcal~P}_t}$

    Compute the sum statistics of stream in ${p}$ as ${X}_t^{p}~=~\sum\nolimits_{k~\in~p}~{x_t^k}$;

    Add Laplace noise: ${\hat~X}_t^{p}~\leftarrow~{X}_t^{p}{\rm{~+~}}{\rm~Lap}(\Delta~\omega~/{\epsilon_p})$;

    Average to each dimension in $p$, that is, $\hat~x_t^k~=~{\hat~X}_t^{p}/\left|~{p}~\right|$;

    end for

    Perform adaptive cumulative backtracking time clustering: ${\mathcal~C}_t^d~\leftarrow~{\mathop{\rm~\mbox{{AdaCluster}}}\nolimits}~({X}_t^d,{\epsilon~_c},{\rm{thre}}_t^d,{\mathcal~C}_{t-1}^d)$;

    $r_t^k~=~{\rm~median}\{~\hat~x_j^k\left|~{j~\in~{\mathcal~C}_t^k}~\right.\}$ for each dimension $k\in[1,d]$;

    Publish $R_t^d~=~(r_t^1,r_t^2,\ldots,r_t^d)^{\rm~T}$;

    end for

  •   

    Algorithm 2 DimParti: multiple hash-based dimension partition

    Require:The last released stream ${R}_{t~-~1}^d$, space mapping functions ${\mathcal~M}$, the number of hash functions $g$;

    Output:Partition results ${{\mathcal~P}_t}~=~\left\{~{{p_1},{p_2},~\ldots~,{p_{\left|~{\mathcal~P}~\right|}}}~\right\}$ of $d$-dimensional stream;

    Obtain the prior estimation: ${\bar~X}_t^d~=~{(~{\bar~x_t^1,\bar~x_t^2,~\ldots~,\bar~x_t^d}~)^{\rm~T}~}~=~{R}_{t~-~1}^d$;

    Perform space mapping and obtain $d$-dimensional $g$-bit binary vector matrix ${{V}_{d~\times~g}}$, that is, ${(~{{\bar~X}_t^d}~)_{d~\times~1}}\xrightarrow{{{{\mathcal~M}_{1~\times~g}}}}{{V}_{d~\times~g}}$;

    Obtain the initial hash table ${\mathcal~T}$ by selecting unique value in ${{V}_{d~\times~g}}$;

    for each $g$-bit vector ${{v}_i}$ ($i~\in~[1,d]$) in ${{V}_{d~\times~g}}$

    Store vector $v_i$ on corresponding bucket of ${\mathcal~T}$ based on hash value ${{\mathcal~H}_g}\left(~{{v}_i}~\right)$;

    Obtain partition result ${{\mathcal~P}_t}~=~\left\{~{{p_1},{p_2},~\ldots~,{p_{\left|~{\mathcal~P}~\right|}}}~\right\}$ from hash table ${\mathcal~T}$;

    end for

    return ${{\mathcal~P}_t}$;

  • Table 2   Datasets used in experiments
    Dataset Length Dimension Total count Average count Value level Sparsity Fluctuation
    Flu 427 1 9787 22.92 Median 0.0679 Low
    SateFlu 389 51 5599045 282.22 High 0.0364 Median
    Uber 212 32 2312991 340.95 High 0.0015 High
    OnlineRetail 374 1298 4423432 9.11 Low 0.6841 High
    Syn 500 100 24580115 491.60 High 0.0097 High
  •   

    Algorithm 3 AdaCluster: adaptive cumulative backtracking time clustering

    Require:$d$-dimensional stream ${X}_t^d~=~{(x_t^1,x_t^2,~\ldots~,x_t^d)^{\rm~T}~}$ at time $t$, privacy budget ${\epsilon~_c}$;

    Output:Clustering result ${\mathcal~C}_t^d$ of $d$-dimensional stream at time $t$;

    for each dimension $k~\in~[1,d]$

    Compute the corrected threshold ${\rm{thre}}_t^k$;

    if $t=1$ then

    Compute ${\mathcal~C}_t^k~=~\{~t\}~$ and set ${\mathcal~C}_{\rm~state}^k~=~{\rm~open}$;

    else

    if ${\mathcal~C}_{\rm~state}^k~=~{\rm~open}$ then

    Compute deviation value ${\rm~dev}_t^k~=~f_{\rm~dev}({X}[{\mathcal~C}_{t~-~1}^k~\cup~\{~t\}~])$;

    if ${{\rm~dev}}_t^k+{\rm~Lap}(\Delta_{\rm~dev}\cdot\omega/\epsilon_c)<{\rm{thre}}_t^k$ then

    Compute ${\mathcal~C}_t^k~=~{\mathcal~C}_{t~-~1}^i~\cup~\{~t\}~$ and set ${\mathcal~C}_{\rm~state}^k~=~{\rm~close}$;

    else

    Compute ${\mathcal~C}_t^k~=~\{~t\}~$ and set ${\mathcal~C}_{\rm~state}^k~=~{\rm~close}$;

    end if

    else

    Compute ${\mathcal~C}_t^k~=~\{~t\}~$ and set ${\mathcal~C}_{\rm~state}^k~=~{\rm~open}$;

    end if

    end if

    end for

    ${\mathcal~C}_t^d~=~{\{~{\mathcal~C}_t^k\}~^{\rm~T}~},k~\in~[1,d]$;

    return ${\mathcal~C}_t^d$;

  • Table 3   The hierarchical aggregated streams datasets with binary-tree structure
    Dataset Dimension # of Aggregation Height Weight Sparsity
    SF32 32 31 5 [1,~2,~4,~8,~16] 0.0036
    OR1024 1024 1023 10 [1,~2,~4,~8,~16,~32,~64,~128,~256,~512] 0.3715
  • Table 4   The hierarchical aggregated streams datasets with quad-tree structure
    Dataset Dimension # of Aggregation Height Weight Sparsity
    SF32 32 21 3 [1,~4,~16] 0.0053
    OR1024 1024 341 5 [1,~4,~16,~64,~256] 0.2851
  •   

    Algorithm 4 HierAdaPub: data-adaptive framework of hierarchical streams publishing in real-time

    Require:Raw hierarchical streams $\textit{\textbf{X}}^A=~\{X_1^A,X_2^A,~\ldots~\}$, privacy budget $\epsilon~_{A}$, the weight $\mu$ of hierarchical tree;

    Output:Sanitized hierarchical streams $\textit{\textbf{R}}^{A}~=~\{~{R}_1^{A},{R}_2^{A},~\ldots~\}$;

    Compute the optimal privacy budget allocation ${\epsilon~^i},i~\in~\left[~{1,H}~\right]$;

    for each time $t$

    Initialize ${R}_t^{A}~=~\emptyset$;

    for each level $i~\in~H$

    Obtain $\mu~_i$-dimensional stream ${X}_t^{\mu~_i}$ of the $i$th level based on weight $\mu$;

    ${\epsilon~_p^i}~=~{\epsilon~_i}$, $\epsilon~_c^i~=~\epsilon~_c~/~H$;

    Perform ${R}_t^{\mu~_i}$=AdaPub$({X}_t^{\mu~_i},{\epsilon~_p^i},{\epsilon~_c^i},\omega~)$;

    ${R}_t^{A}~=~{R}_t^{A}~\cup~{R}_t^{\mu~_i}$;

    end for

    Publish ${R}_t^{A}$;

    end for

  • Table 5   Runtime (s) of privacy-preserving mechanisms on different datasets
    BD BA RescueDP PeGaSus AdaPub
    Time complexity ${O}(d)$ ${O}(d)$ ${O}(d^2)$ ${O}(d|\mathcal~C|_m)$ ${O}(d|\mathcal~C|_m~+~dg)$
    Flu ($d=1$) $1\times{10^{-5}}$ $1\times{10^{-5}}$ $1\times{10^{-3}}$ $6\times{10^{-5}}$ $6\times{10^{-5}}$
    StateFlu ($d=51$) $2\times{10^{-5}}$ $2\times{10^{-5}}$ $8\times{10^{-3}}$ $6\times{10^{-3}}$ $6\times{10^{-3}}$
    Uber ($d=32$) $1\times{10^{-5}}$ $1\times{10^{-5}}$ $4\times{10^{-4}}$ $1\times{10^{-3}}$ $1\times{10^{-3}}$
    OnlineRetail ($d=1298$) $2\times{10^{-4}}$ $2\times{10^{-4}}$ $3\times{10^{-2}}$ $7\times{10^{-2}}$ $7\times{10^{-2}}$
    Syn ($d=100$) $5\times{10^{-5}}$ $5\times{10^{-5}}$ $1\times{10^{-3}}$ $9\times{10^{-3}}$ $9\times{10^{-3}}$
  • Table 6   Runtime (s) of privacy-preserving mechanisms on hierarchical streams with binary and quad-tree structures
    Algorithm Time complexity
    SF32 (Binary)
    ($|A|=31$)
    OR1024 (Binary)
    ($|A|=1023$)
    SF32 (Quad)
    ($|A|=21$)
    OR1024 (Quad)
    ($|A|=341$)
    PeGaSus ${O}(|\mathcal~C|_m\left|A\right|)$ $4\times{10^{-3}}$ $7\times{10^{-2}}$ $2\times{10^{-3}}$ $6\times{10^{-5}}$
    HierAdaPub ${O}(H({\mu_m}|\mathcal~C|_m+{\mu_m}g))$ $9\times{10^{-3}}$ $6\times{10^{-2}}$ $7\times{10^{-3}}$ $4\times{10^{-2}}$
qqqq

Contact and support