logo

SCIENTIA SINICA Informationis, Volume 50 , Issue 8 : 1239-1254(2020) https://doi.org/10.1360/SSI-2019-0110

Clustering method based on sample's stability

More info
  • ReceivedMay 30, 2019
  • AcceptedNov 1, 2019
  • PublishedAug 6, 2020

Abstract


Funded by

国家重点研发计划(2018YFB1004300)

山西省重点研发计划(201903D421003)

国家自然科学基金(61672332,U1805263,61872226,61802238,61673249)

山西省海外归国人员研究项目(2017023,2016004)


References

[1] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521: 436-444 CrossRef PubMed ADS Google Scholar

[2] Jain A K, Murty M N, Flynn P J. Data clustering: a review. ACM Comput Surv, 1999, 31: 264-323 CrossRef Google Scholar

[3] Zheng L, Yang Y, Tian Q. SIFT Meets CNN: A Decade Survey of Instance Retrieval.. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 1224-1244 CrossRef PubMed Google Scholar

[4] Dhillon I S, Modha D S. Concept decompositions for large sparse text data using clustering. Machine Learning, 2001, 42: 143-175 CrossRef Google Scholar

[5] Lu D, Tripodis Y, Gerstenfeld L C, et al. Clustering of temporal gene expression data with mixtures of mixed effects models with a penalized likelihood. Bioinformatics, 2018, 35: 778-786. Google Scholar

[6] Zechao Li , Jing Liu , Yi Yang . Clustering-Guided Sparse Structural Learning for Unsupervised Feature Selection. IEEE Trans Knowl Data Eng, 2014, 26: 2138-2150 CrossRef Google Scholar

[7] Erhan D, Bengio Y, Courville A, et al. Why does unsupervised pre-training help deep learning? J Mach Learn Res, 2010, 11: 625-660 DOI: 10.1145/1756006.1756025. Google Scholar

[8] Frey B J, Dueck D. Clustering by Passing Messages Between Data Points. Science, 2007, 315: 972-976 CrossRef PubMed ADS Google Scholar

[9] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344: 1492-1496 CrossRef PubMed ADS Google Scholar

[10] Otto C, Wang D, Jain A K. Clustering Millions of Faces by Identity.. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 289-303 CrossRef PubMed Google Scholar

[11] Dan Zhang , Fei Wang , Luo Si . Maximum margin multiple instance clustering with applications to image and text clustering.. IEEE Trans Neural Netw, 2011, 22: 739-751 CrossRef PubMed Google Scholar

[12] Brockmeier A J, Mu T, Ananiadou S. Self-Tuned Descriptive Document Clustering Using a Predictive Network. IEEE Trans Knowl Data Eng, 2018, 30: 1929-1942 CrossRef Google Scholar

[13] Qian Y, Liang J, Pedrycz W. Positive approximation: An accelerator for attribute reduction in rough set theory. Artificial Intelligence, 2010, 174: 597-618 CrossRef Google Scholar

[14] Qian Y, Li F, Liang J. Space Structure and Clustering of Categorical Data.. IEEE Trans Neural Netw Learning Syst, 2016, 27: 2047-2059 CrossRef PubMed Google Scholar

[15] Fahy C, Yang S, Gongora M. Ant Colony Stream Clustering: A Fast Density Clustering Algorithm for Dynamic Data Streams.. IEEE Trans Cybern, 2019, 49: 2215-2228 CrossRef PubMed Google Scholar

[16] Blomstedt P, Tang J, Xiong J. A Bayesian Predictive Model for Clustering Data of Mixed Discrete and Continuous Type.. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 489-498 CrossRef PubMed Google Scholar

[17] Douzas G, Bacao F. Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning. Expert Syst Appl, 2017, 82: 40-52 CrossRef Google Scholar

[18] Xu J, Han J, Nie F. Re-Weighted Discriminatively Embedded K -Means for Multi-View Clustering. IEEE Trans Image Process, 2017, 26: 3016-3027 CrossRef PubMed ADS Google Scholar

[19] Yao X, Han J, Zhang D. Revisiting Co-Saliency Detection: A Novel Approach Based on Two-Stage Multi-View Spectral Rotation Co-clustering. IEEE Trans Image Process, 2017, 26: 3196-3209 CrossRef PubMed ADS Google Scholar

[20] Li F, Qian Y, Wang J. Clustering ensemble based on sample's stability. Artificial Intelligence, 2019, 273: 37-55 CrossRef Google Scholar

[21] Strehl A L, Ghosh J. Cluster ensembles---a knowledge reuse framework for combining multiple partitions. Journal ofMachine Learning Research, 2003, 3: 583-617. Google Scholar

[22] Li F, Qian Y, Wang J. Multigranulation information fusion: A Dempster-Shafer evidence theory-based clustering ensemble method. Inf Sci, 2017, 378: 389-409 CrossRef Google Scholar

[23] Shannon C E. A Mathematical Theory of Communication. Bell Syst Technical J, 1948, 27: 379-423 CrossRef Google Scholar

[24] Hu Q H, Guo M Z, Yu D R. Information entropy for ordinal classification. Sci China Inf Sci, 2010, 53: 1188-1200 CrossRef Google Scholar

[25] Otsu N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans Syst Man Cybern, 1979, 9: 62-66 CrossRef Google Scholar

[26] Chan T F, Vese L A. Active contours without edges. IEEE Trans Image Process, 2001, 10: 266-277 CrossRef PubMed ADS Google Scholar

[27] Malinen M I, Mariescu-Istodor R, Fr?nti P. K-means?: Clustering by gradual data transformation. Pattern Recognition, 2014, 47: 3376-3386 CrossRef Google Scholar

[28] Likas A, Vlassis N, J. Verbeek J. The global k-means clustering algorithm. Pattern Recognition, 2003, 36: 451-461 CrossRef Google Scholar

[29] Fan Z, Jiang J, Weng S. Adaptive density distribution inspired affinity propagation clustering. Neural Comput Applic, 2019, 31: 435-445 CrossRef Google Scholar

[30] Xu X, Ding S, Shi Z. An improved density peaks clustering algorithm with fast finding cluster centers. Knowledge-Based Syst, 2018, 158: 65-74 CrossRef Google Scholar

[31] Averbuch-Elor H, Bar N, Cohen-Or D. Border-peeling clustering. IEEE Trans Pattern Anal Mach Intell, 2019. doi: 10.1109/TPAMI.2019.2924953. Google Scholar

[32] Hubert L, Arabie P. Comparing partitions. J Classification, 1985, 2: 193-218 CrossRef Google Scholar

  • Figure 1

    An example with $\mu=3$ and $\sigma=3$. (a) Conditional probability; (b) posterior probability

  • Figure 2

    The cuver of (a) $H_2$ and (b) ${\rm~fe}$ ($t=0.7$)

  • Figure 3

    Experiments on synthetic data sets. (a) 2d2k; (b) Flame; (c) Jain; (d) WingNut

  • Figure 4

    Experiments on image segmentation data sets

  • Figure 5

    (Color online) Nemenyi post-hoc test based on (a) Table 3and (b) Table 4

  • Table 1   Benchmark data sets
    Number Data set Number of samples Number of attributes Number of clusters
    1 Iris 150 4 3
    2 Wine recognition data 178 13 3
    3 Seeds data set 210 7 3
    4 Glass identification database 214 9 6
    5 Ecoli 336 7 8
    6 Cardiotocography data set 2126 40 10
    7 Image segmentation data 2310 19 7
    8 Waveform database generator 5000 21 3
    9 tr45 690 8261 10
    10 tr41 878 7454 10
    11 wap 1560 8460 20
    12 re1 1657 3758 25
  •   

    Algorithm 1 基于样本稳定性的聚类方法

    输入: 数据集${{X}}=\{{{\boldsymbol~x}_1},{{\boldsymbol~x}_2},\ldots,{{\boldsymbol~x}_n}\}$, 预期类个数$k$.

    输出: 聚类结果$C=\{c_1,c_2,\ldots,c_k\}$.

    基于式(6)得到样本对的共现概率矩阵$\boldsymbol{P}=\{p_{ij}|1\leq~i\leq~n,~1\leq~j\leq~n\}$;

    根据式(5)计算$n$个样本的稳定性$\boldsymbol{S}=\{s_1,s_2,\ldots,s_n\}$;

    根据式(7)和(8)得到稳定样本集${\rm~SS}$和不稳定样本集${\rm~NS}$;

    利用AP算法对稳定样本集${\rm~SS}$聚类:$C_{ss}=\{c_1^{\rm ss},c_2^{\rm ss},\ldots,c_{k'}^{\rm ss}\}\leftarrow {\rm AP}({\rm SS})$;

    while $|{\rm~NS}|\neq~0$ do

    根据式(9)得${\rm~NS}^>$;

    更新稳定样本集和不稳定样本集: ${\rm~SS}={\rm~SS}\cup~{\rm~NS}^>$, ${\rm~NS}={\rm~NS}-{\rm~NS}^>$;

    根据式(10)扩充$C_{\rm~ss}$中的每个类簇;

    end while

    if $k' >k$ then

    根据式(11)和层次聚类算法HC合并$C_{{\rm~SS}}$中相似类簇直至类个数为$k$: $C\leftarrow~{\rm~HC}(C_{{\rm~SS}},k)$;

    else

    $C\leftarrow~C_{{\rm~SS}}$.

    end if

  • Table 2   The cross tabulation of $C^b$ and $C^d$
    $C^d$ $C^b$
    $c_1^b$ $c_2^b$ $\cdots$ $c_k^b$ SUM
    $c_1^d$ $n_{11}$ $n_{12}$ $\cdots$ $n_{1k}$ $n_{1\cdot}$
    $c_2^d$ $n_{21}$ $n_{22}$ $\cdots$ $n_{2k}$ $n_{2\cdot}$
    $\vdots$ $\vdots$ $\vdots$ $\ddots$ $\vdots$ $\vdots$
    $c_k^d$ $n_{k1}$ $n_{k2}$ $\cdots$ $n_{kk}$ $n_{k~\cdot}$
    SUM $n_{\cdot~1}$ $n_{\cdot~2}$ $\cdots$ $n_{\cdot~k}$ $n$
  • Table 3   Index NMI from the nine clustering algorithms
    Number K-means K-means* GK-means AP AD-AP DP GDPC BPC SSC
    1 0.7116$\pm$0.0622 0.7309 $\pm$0.0142 0.7419 0.7777 0.7777 0.6586 0.6586 0.7630 _0.8031
    2 0.8414$\pm$0.0118 _0.8423$\pm$0.0184 0.8347 0.7507 0.7315 0.5885 0.7415 0.7955 0.8364
    3 0.6743$\pm$0.0000 0.6725$\pm$0.0037 0.6654 0.6873 0.6790 0.6797 0.5246 0.6959 _0.7043
    4 0.4107$\pm$0.0292 0.3409$\pm$0.0313 0.4008 0.3392 0.3065 0.3265 0.2757 0.3918 _0.4481
    5 0.5939$\pm$0.0283 0.5970$\pm$0.0185 0.6224 0.6311 0.5653 0.4985 0.4773 0.6333 _0.6768
    6 0.8972$\pm$0.0227 0.9177$\pm$0.0368 0.9478 0.9270 0.8118 0.8020 0.8199 _1.0000 0.9797
    7 0.6114$\pm$0.0159 0.6065$\pm$0.0112 0.6109 0.6116 0.6095 0.6599 0.6666 0.6664 _0.6683
    8 0.3642$\pm$0.0000 0.3642$\pm$0.0000 0.3642 0.2989 0.2760 0.2214 0.2128 0.3592 _0.3728
    9 0.4669$\pm$0.0406 0.2708$\pm$0.0244 0.4245 0.4062 0.3681 0.2673 0.3209 0.4491 _0.4752
    10 0.4829$\pm$0.0389 0.3160$\pm$0.0312 0.4292 0.4003 0.3917 0.3507 0.4268 0.4311 _0.5160
    11 _0.5058$\pm$0.0211 0.4668$\pm$0.0163 0.4792 0.3603 0.3603 0.2728 0.2608 0.4523 0.4995
    12 0.4360$\pm$0.0170 0.3031$\pm$0.0148 0.4194 0.4464 0.4461 0.3628 0.3524 0.4192 _0.4477
    Average rank 4.0000 5.9167 4.3333 4.6667 6.2500 7.5000 7.4167 3.5833 1.3333
  • Table 4   Index ARI from the nine clustering algorithms
    Number K-means K-means* GK-means AP AD-AP DP GDPC BPC SSC
    1 0.6589$\pm$0.1179 0.7102$\pm$0.0080 0.7163 0.7445 0.7445 0.4531 0.4531 0.7184 _0.7874
    2 0.8451$\pm$0.0140 0.8477$\pm$0.0196 0.8471 0.7262 0.7144 0.4956 0.7567 0.8025 _0.8516
    3 0.7049$\pm$0.0000 0.7026$\pm$0.0048 0.6934 0.7064 0.6952 0.7076 0.4726 0.7020 _0.7406
    4 _0.2572$\pm$0.0149 0.1859$\pm$0.0278 0.2471 0.1862 0.1583 0.2267 0.2294 0.2397 0.2481
    5 0.4238$\pm$0.0811 0.4180$\pm$0.0334 0.4150 0.4551 0.3412 0.3086 0.2655 0.6142 _0.7080
    6 0.7779$\pm$0.0654 0.8049$\pm$0.0905 0.8521 0.8558 0.6529 0.6323 0.6372 _1.0000 0.9771
    7 0.4711 $\pm$0.0446 0.4736$\pm$0.0354 0.5034 0.5100 0.5117 0.5301 _0.5318 0.5159 0.5260
    8 0.2535$\pm$0.0000 0.2535$\pm$0.0000 0.2536 0.2167 0.1980 0.1897 0.1875 0.2542 _0.2592
    9 _0.3336$\pm$0.0627 0.1855$\pm$0.0421 0.2512 0.2498 0.2217 0.0822 0.1275 0.2848 0.2354
    10 0.3024$\pm$0.0691 0.2698$\pm$0.0305 0.2493 0.2296 0.2349 0.1388 0.2725 0.2695 _0.3066
    11 0.2025$\pm$0.0857 0.1408$\pm$0.0390 0.1792 0.1126 0.1126 0.1756 0.1964 0.1157 _0.2035
    12 0.2451$\pm$0.0181 0.1224$\pm$0.0147 0.1598 0.2066 0.2178 0.1303 0.1660 0.2427 _0.2709
    Average rank 3.9167 5.7500 4.9167 5.3333 6.4167 7.0000 6.2500 3.7500 1.6667