
SCIENCE CHINA Information Sciences, Volume 64, Issue 5: 152101 (2021) https://doi.org/10.1007/s11432-020-2910-1

Differential identifiability clustering algorithms for big data analysis

  • Received: Jan 18, 2020
  • Accepted: Apr 18, 2020
  • Published: Mar 31, 2021

Abstract


Acknowledgment

This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFC1000307) and the National Natural Science Foundation of China (Grant Nos. 61971021, 61571024).


References

[1] Mendes R, Vilela J P. Privacy-preserving data mining: methods, metrics, and applications. IEEE Access, 2017, 5: 10562-10582

[2] Yu S. Big privacy: challenges and opportunities of privacy study in the age of big data. IEEE Access, 2016, 4: 2751-2763

[3] Dinur I, Nissim K. Revealing information while preserving privacy. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, 2003. 202--210

[4] Dwork C, Nissim K. Privacy-preserving data mining on vertically partitioned databases. In: Advances in Cryptology -- CRYPTO 2004. Berlin: Springer, 2004. 528--544

[5] Blum A, Dwork C, McSherry F, et al. Practical privacy: the SuLQ framework. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, 2005. 128--138

[6] Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography. Berlin: Springer, 2006. 265--284

[7] Dwork C. Differential privacy. In: Automata, Languages, and Programming. Berlin: Springer, 2006. 1--12

[8] Kifer D, Machanavajjhala A. No free lunch in data privacy. In: Proceedings of ACM SIGMOD International Conference on Management of Data, Athens, 2011. 193--204

[9] Cormode G. Personal privacy vs population privacy: learning to attack anonymization. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, 2011. 1253--1261

[10] Lee J, Clifton C. Differential identifiability. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, 2012. 1041--1049

[11] Dwork C. A firm foundation for private data analysis. Commun ACM, 2011, 54: 86-95

[12] Li L, Dong Y, Wang J. Differential privacy data protection method based on clustering. In: Proceedings of Cyber-Enabled Distributed Computing and Knowledge Discovery, Nanjing, 2017. 11--16

[13] Zhao Z, Shang T, Liu J W, et al. Clustering algorithm for privacy preservation on MapReduce. In: Cloud Computing and Security. Cham: Springer, 2018. 622--632

[14] Bugata P, Drotar P. On some aspects of minimum redundancy maximum relevance feature selection. Sci China Inf Sci, 2020, 63: 112103

[15] Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of IEEE Symposium on Security and Privacy, Oakland, 1998. 1--19

[16] Samarati P. Protecting respondents' identities in microdata release. IEEE Trans Knowl Data Eng, 2001, 13: 1010-1027

[17] Sweeney L. k-anonymity: a model for protecting privacy. Int J Unc Fuzz Knowl Based Syst, 2002, 10: 557-570

[18] Machanavajjhala A, Gehrke J, Kifer D, et al. l-diversity: privacy beyond k-anonymity. In: Proceedings of International Conference on Data Engineering, Atlanta, 2006. 24--35

[19] Li N, Li T, Venkatasubramanian S. t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of International Conference on Data Engineering, Istanbul, 2007. 106--115

[20] Li N, Li T, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Trans Knowl Data Eng, 2010, 22: 943-956

[21] Li N H, Qardaji W, Su D, et al. Membership privacy: a unifying framework for privacy definitions. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security, New York, 2013. 889--900

[22] Backes M, Berrang P, Humbert M, et al. Membership privacy in microRNA-based studies. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security, Vienna, 2016. 319--330

[23] Aggarwal G, Feder T, Kenthapadi K, et al. Achieving anonymity via clustering. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Chicago, 2006. 153--162

[24] Byun J W, Kamra A, Bertino E, et al. Efficient k-anonymization using clustering techniques. In: Proceedings of International Conference on Database Systems for Advanced Applications, Bangkok, 2007. 188--200

[25] Nissim K, Raskhodnikova S, Smith A. Smooth sensitivity and sampling in private data analysis. In: Proceedings of ACM Symposium on Theory of Computing, San Diego, 2007. 75--84

[26] Jafer Y, Matwin S, Sokolova M. Using feature selection to improve the utility of differentially private data publishing. Procedia Comput Sci, 2014, 37: 511-516

[27] Huang Z. Clustering large data sets with mixed numeric and categorical values. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 1997. 21--34

[28] Huang Y. Understanding of Big Data: Big Data Processing and Programming Practices. Beijing: Machinery Industry Press, 2014

[29] McSherry F, Talwar K. Mechanism design via differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Providence, 2007. 94--103

[30] Dwork C, Rothblum G N, Vadhan S. Boosting and differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Las Vegas, 2010. 51--60

[31] Bkakria A, Cuppens-Boulahia N, Cuppens F. Linking differential identifiability with differential privacy. In: Proceedings of International Conference on Information and Communications Security, Cham, 2018. 232--247

[32] McSherry F. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun ACM, 2010, 53(9): 89--97. doi: 10.1145/1559845.1559850

[33] Xiong J, Ren J, Chen L. Enhancing privacy and availability for data clustering in intelligent electrical service of IoT. IEEE Internet Things J, 2019, 6: 1530-1540

[34] A differential privacy protecting K-means clustering algorithm based on contour coefficients. PLoS ONE, 2018, 13: e0206832

[35] Gao Z, Sun Y, Cui X, et al. Privacy-preserving hybrid $k$-means. Int J Data Warehous Min, 2018, 14(2): 1--17. doi: 10.4018/978-1-5225-7113-1.ch049

[36] Ren J, Xiong J, Cui X, et al. DPLK-means: a novel differential privacy $k$-means mechanism. In: Proceedings of IEEE International Conference on Data Science in Cyberspace, Shenzhen, 2017. 133--139

[37] Su D, Cao J, Li N. Differentially private k-means clustering and a hybrid approach to private optimization. ACM Trans Priv Secur, 2017, 20: 1-33

  • Figure 1

    MapReduce architecture.

  • Figure 6

    (Color online) F-measure: DI $k$-prototypes algorithm.

  • Table 1

    Table 1 Main notations

    Notation | Meaning
    $U$ | Universe
    $D \subset U$ | Database to be queried
    $D' = D - i^*$ | Subset of $D$ missing one individual
    $t \in U$ | Data associated with an individual
    $I(t)$ | Identity of the individual corresponding to $t$
    $\mathcal{I}_D = \{I(t) \mid t \in D\}$ | Set of individuals who belong to $D$
    $\Psi$ | All possible databases
    $f$ | Query function
    $\Delta f$ | Sensitivity of $f$
    $sf$ | Score function
    $M$ | Randomized mechanism
    $\Gamma(t)$ | Identifiability risk for an individual $t$
  • Table 2

    Table 2 Analysis of DP and DI $k$-means algorithms$^{\rm a)}$

  • Table 3

    Table 3 Description of MAGIC data set

    Data set characteristic | Multivariate
    Number of instances | 19020
    Area | Physical
    Attribute characteristics | Real
    Number of attributes | 11
    Classification | $g =$ gamma (signal): 12332, $h =$ hadron (background): 6688
  • Table 4

    Table 4 Attribute description of heart data set

    Attribute | Characteristic | Information
    Age | Numerical |
    Sex | Categorical | 1: male; 2: female
    Chest pain type | Categorical | 1: typical angina; 2: atypical angina; 3: non-angina pectoris; 4: asymptomatic
    Resting blood pressure | Numerical |
    Serum | Numerical | mg/dl
    Fasting blood sugar | Categorical | 0: abnormal ($>120$ mg/dl); 1: normal
    Resting electrocardiographic results | Categorical | 0: normal; 1: ST-T abnormal; 2: left ventricular hypertrophy
    Maximum heart rate achieved | Numerical |
    Exercise induced angina | Categorical | 0: no; 1: yes
    ST depression induced by exercise relative to rest | Numerical |
    Slope of the peak exercise ST segment | Categorical | 1: rise; 2: steady; 3: decline
    Number of major vessels | Numerical |
    Thalassemia | Categorical | 3: normal; 6: fixed defect; 7: reversible defect
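The DP and DI $k$-means algorithms compared in Table 2 both rest on the randomized mechanism $M$, the query function $f$, and its sensitivity $\Delta f$ listed in Table 1. The following is a minimal sketch, not the paper's implementation, of one differentially private $k$-means iteration via the Laplace mechanism: it assumes attribute values are normalized to $[0, 1]$ and splits the per-iteration privacy budget evenly between the cluster-count and cluster-sum queries; the function names and budget split are illustrative assumptions.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Randomized mechanism M: perturb f(D) with Laplace(sensitivity/epsilon) noise."""
    return value + rng.laplace(0.0, sensitivity / epsilon, size=np.shape(value))

def dp_kmeans_step(X, centers, epsilon, rng):
    """One Lloyd iteration with noisy per-cluster counts and sums.

    Assumes each attribute of X lies in [0, 1], so one record changes a
    cluster count by at most 1 and a d-dimensional cluster sum by at most d.
    """
    k, d = centers.shape
    # Assign each point to its nearest center (exact, non-private step).
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    new_centers = np.empty_like(centers)
    for j in range(k):
        members = X[labels == j]
        # Spend half the per-iteration budget on the count, half on the sum.
        noisy_count = laplace_mechanism(len(members), 1.0, epsilon / 2, rng)
        noisy_sum = laplace_mechanism(members.sum(axis=0), float(d), epsilon / 2, rng)
        # Guard against tiny or negative noisy counts before dividing.
        new_centers[j] = noisy_sum / max(noisy_count, 1.0)
    return np.clip(new_centers, 0.0, 1.0), labels

rng = np.random.default_rng(0)
X = rng.random((200, 2))        # toy data in [0, 1]^2
centers = rng.random((2, 2))    # two initial centers
centers, labels = dp_kmeans_step(X, centers, epsilon=1.0, rng=rng)
```

A DI variant in the sense of [10] would keep the same query structure but calibrate the noise scale from an identifiability bound on $\Gamma(t)$ rather than directly from $\epsilon$.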