SCIENTIA SINICA Informationis, Volume 49, Issue 12: 1572-1585 (2019). https://doi.org/10.1360/SSI-2019-0117

Maximum average entropy-rate based correlation clustering for big data

  • Received: Jun 4, 2019
  • Accepted: Jul 10, 2019
  • Published: Dec 16, 2019



[1] Kolosnjaji B, Zarras A, Webster G, et al. Deep learning for classification of malware system call sequences. In: Proceedings of Australasian Joint Conference on Artificial Intelligence. Berlin: Springer, 2016. 137-149

[2] Bühlmann P, Rütimann P, van de Geer S. Correlated variables in regression: clustering and sparse estimation. J Statistical Planning Inference, 2013, 143: 1835-1858

[3] He Z, Liu H, Wang Y W. Generative adversarial networks-based semi-supervised learning for hyperspectral image classification. Remote Sens, 2017, 9: 1042

[4] Kanungo T, Mount D M, Netanyahu N S, et al. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Machine Intell, 2002, 24: 881-892

[5] Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Machine Intell, 2002, 24: 603-619

[6] Estrada F, Jepson A, Chennubhotla C, et al. Spectral embedding and min cut for image segmentation. In: Proceedings of BMVC, Kingston, 2004. 1-10

[7] Shi J B, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Machine Intell, 2000, 22: 888-905

[8] Bansal N, Blum A, Chawla S. Correlation clustering. Mach Learn, 2004, 56: 89-113

[9] Felzenszwalb P F, Huttenlocher D P. Efficient graph-based image segmentation. Int J Comput Vision, 2004, 59: 167-181

[10] Mohebi A, Aghabozorgi S, Wah T Y. Iterative big data clustering algorithms: a review. Softw Pract Exper, 2016, 46: 107-129

[11] Zhao W Z, Ma H F, He Q. Parallel k-means clustering based on MapReduce. In: Proceedings of IEEE International Conference on Cloud Computing. Berlin: Springer, 2009. 674-679

[12] Bu Y Y, Howe B, Balazinska M, et al. HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endow, 2010, 3: 285-296

[13] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of NSDI, San Jose, 2012. 2

[14] He Y B, Tan H Y, Luo W M, et al. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of ICPADS, Tainan, 2011. 473-480

[15] Chen W Y, Song Y Q, Bai H J, et al. Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell, 2011, 33: 568-586

[16] Chopra S, Rao M R. The partition problem. Math Programming, 1993, 59: 87-115

[17] Nowozin S, Jegelka S. Solution stability in linear programming relaxations: graph partitioning and unsupervised learning. In: Proceedings of ICML, Montreal, 2009. 769-776

[18] Vitaladevuni S N, Basri R. Co-clustering of image segments using convex optimization applied to EM neuronal reconstruction. In: Proceedings of CVPR, San Francisco, 2010. 2203-2210

[19] Tsochantaridis I, Joachims T, Hofmann T, et al. Large margin methods for structured and interdependent output variables. J Mach Learn Res, 2005, 6: 1453-1484

[20] Bagon S, Galun M. Large scale correlation clustering optimization. arXiv preprint, 2011

[21] Chierichetti F, Dalvi N, Kumar R. Correlation clustering in MapReduce. In: Proceedings of KDD, New York, 2014. 641-650

[22] Blondel V D, Guillaume J L, Lambiotte R, et al. Fast unfolding of communities in large networks. J Stat Mech, 2008: P10008

[23] Liu M Y, Tuzel O, Ramalingam S, et al. Entropy-rate clustering: cluster analysis via maximizing a submodular function subject to a matroid constraint. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 99-112

[24] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344: 1492-1496

[25] Frey B J, Dueck D. Clustering by passing messages between data points. Science, 2007, 315: 972-976

[26] Ng A Y, Jordan M I, Weiss Y. On spectral clustering: analysis and an algorithm. In: Proceedings of NIPS, Vancouver, 2002. 849-856

[27] Bouguettaya A, Yu Q, Liu X M, et al. Efficient agglomerative hierarchical clustering. Expert Syst Appl, 2015, 42: 2785-2797

[28] Fränti P, Sieranoja S. K-means properties on six clustering benchmark datasets. Appl Intell, 2018, 48: 4743-4759

[29] King D E. Dlib-ml: a machine learning toolkit. J Mach Learn Res, 2009, 10: 1755-1758

[30] Deng J K, Guo J, Xue N N, et al. ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of CVPR, Long Beach, 2019. 4690-4699

  • Figure 1

    (Color online) Illustration of correlation clustering

  • Figure 2

    (Color online) Comparison of neighbors under the original $\sigma_{0}$ and the optimized $\hat{\sigma}$. Taking the Aggregation dataset as an example, neighbors under the original $\sigma_{0}=10$ are shown in (a) and neighbors under the optimized $\hat{\sigma}=0.4$ are shown in (b).

  • Figure 3

    (Color online) Clustering results comparison on standard 2D datasets

  • Figure 4

    (Color online) Running-time comparison of ECC, $k$-means and $k$-NN+Louvain: each running time is the average of 3 runs; values of $k$ for $k$-means are $1003$, $2119$, $3141$, $4207$, $5181$, $6209$, $7232$, $8252$ and $9306$; parameters for ECC are $\sigma_{0}=0.5$, ${\rm iter}_{\rm max}=10$, $l=1$; parameters for $k$-NN+Louvain are $l=0$, $k=10$.

  • Table 1   Clustering metrics comparison on standard 2D datasets, including F1-measure ($F1$) and purity ($P$)

    Dataset      Metric  ECC     $k$-means [4]  AP [25]  Mean-shift [5]  Spectral [26]  HAC [27]  Ncut [7]
    Flame        $F1$    1.0000  0.8406         0.8243   0.8528          0.9832         0.7732    0.9917
    Flame        $P$     1.0000  0.8647         0.8488   0.8586          0.8787         0.8595    0.9917
    Spiral       $F1$    1.0000  0.3502         0.3225   0.3562          1.0000         0.3917    0.5823
    Spiral       $P$     1.0000  0.3447         0.3566   0.3475          1.0000         0.3337    0.5992
    Aggregation  $F1$    1.0000  0.8504         0.8618   0.9595          0.7491         0.9097    0.9925
    Aggregation  $P$     1.0000  0.9453         0.9415   0.9685          0.8242         0.9646    0.9930
    Face         $F1$    0.9669  0.8637         0.8867   0.6883          0.3478         0.9605    0.1644
    Face         $P$     0.9727  0.7134         0.7583   0.5445          0.1738         0.8335    0.1952

    Algorithm 1 Neighboring by greedy algorithm-${{\rm nbr}_{\bar{\mathcal{H}}}}(\cdot,\cdot)$

    Require: Node $i$, initial edge set of $i$: $\boldsymbol{E}_{i}=\{e_{ij}\in\boldsymbol{E}\}$;
    Output: Edge set based on maximum average entropy rate $\mathcal{N}_{i}$;
    $\boldsymbol{E}'_{i}\Leftarrow$ order edges in $\boldsymbol{E}_{i}$ by weight in descending order;
    $\mathcal{N}_{i}\Leftarrow\emptyset$; $\bar{\mathcal{H}}_{\rm max}\Leftarrow 0$; % Maximum average entropy rate seen so far.
    for edge $e_{ij}$ in $\boldsymbol{E}'_{i}$ do
    $\bar{\mathcal{H}}_{\rm temp}\Leftarrow\bar{\mathcal{H}}(\mathcal{N}_{i}\cup\{e_{ij}\})$; % Average entropy rate after adding the candidate edge.
    if $\bar{\mathcal{H}}_{\rm temp}\ge\bar{\mathcal{H}}_{\rm max}$ then
    Add $e_{ij}$ to $\mathcal{N}_{i}$; $\bar{\mathcal{H}}_{\rm max}\Leftarrow\bar{\mathcal{H}}_{\rm temp}$;
    else
    Return $\mathcal{N}_{i}$; % Edges are sorted, so later edges cannot raise the rate.
    end if
    end for
    Return $\mathcal{N}_{i}$;
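Algorithm 1 admits a compact sketch. This is a minimal Python rendering, assuming edges are (weight, neighbor) pairs; the average entropy rate $\bar{\mathcal{H}}(\cdot)$ is supplied by the caller, since its definition lives in the main text:

```python
def nbr_greedy(edges, avg_entropy_rate):
    # Scan edges by descending weight and keep an edge only while the
    # average entropy rate of the kept set does not drop below the best
    # value seen so far.
    kept = []
    h_max = 0.0
    for e in sorted(edges, key=lambda e: e[0], reverse=True):
        h_temp = avg_entropy_rate(kept + [e])
        if h_temp >= h_max:
            kept.append(e)
            h_max = h_temp
        else:
            # Edges are sorted, so no later edge can raise the rate: stop.
            break
    return kept
```

Because the candidate edges are pre-sorted by weight, the scan terminates at the first edge that lowers the average rate.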


    Algorithm 2 Correlation clustering on single cluster-${\rm Cc_{s}}(\cdot,\cdot)$

    Require: Graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in\boldsymbol{V}$, $v.{\rm cluster}=$ `unset', $e\in\boldsymbol{E}$, $e.{\rm label}=-$; maximum iteration ${\rm iter}_{\rm max}$;
    Output: Node set of cluster $\boldsymbol{c}$;
    $\boldsymbol{V}_{\rm temp}\Leftarrow$ nodes from $\boldsymbol{V}$ with .cluster = `unset'; % Unclustered node set.
    $i\Leftarrow$ node from $\boldsymbol{V}_{\rm temp}$ with maximum degree;
    Set each edge in $\boldsymbol{E}_{i}$: .label = $+$; ${\rm iter}\Leftarrow 0$;
    while ${\rm iter}<{\rm iter}_{\rm max}$ do
    for each $v$ in $\boldsymbol{V}$ do
    if ${\boldsymbol f}^{+}(v)\ge{\boldsymbol f}^{-}(v)$ then
    Set $v.{\rm cluster}=i$, set edge labels in $\boldsymbol{E}_{v}$ to $+$; % $v$ is contained in cluster $\boldsymbol{c}$.
    else
    Set $v.{\rm cluster}=$ `unset', set edge labels in $\boldsymbol{E}_{v}$ to $-$; % $v$ is not contained in cluster $\boldsymbol{c}$.
    end if
    end for
    ${\rm iter}\Leftarrow{\rm iter}+1$;
    end while
    $\boldsymbol{c}\Leftarrow$ nodes from $\boldsymbol{V}$ with .cluster = $i$;
    Return $\boldsymbol{c}$.
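Algorithm 2's relabeling loop can be sketched as follows. This is a simplified rendering in which cluster membership stands in for the $\pm$ edge labels, and the forces ${\boldsymbol f}^{+}(v)$, ${\boldsymbol f}^{-}(v)$ are assumed to sum the weights of $v$'s edges into and out of the current cluster; both assumptions are of this sketch, not the paper:

```python
def cc_single(adj, iter_max):
    # `adj` maps node -> {neighbor: weight}. Cluster membership plays the
    # role of the +/- edge labels: f+(v) / f-(v) are taken here as the total
    # weight of v's edges into / out of the current cluster.
    seed = max(adj, key=lambda v: len(adj[v]))   # start from the max-degree node
    cluster = {seed} | set(adj[seed])            # seed's edges labeled '+'
    for _ in range(iter_max):
        nxt = set()
        for v in adj:
            f_pos = sum(w for u, w in adj[v].items() if u in cluster)
            f_neg = sum(w for u, w in adj[v].items() if u not in cluster)
            if f_pos >= f_neg:
                nxt.add(v)                       # v joins the cluster
        cluster = nxt
    return cluster
```

The sweep is synchronous: all nodes are relabeled against the cluster of the previous iteration, matching the pseudocode's per-iteration pass over $\boldsymbol{V}$.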


    Algorithm 3 Correlation clustering on multiple clusters-${\rm Cc_{m}}(\cdot,\cdot,\cdot)$

    Require: Dataset $\boldsymbol{X}$, initial neighbor parameter $\sigma_{0}$, maximum iteration ${\rm iter}_{\rm max}$;
    Output: Clustering result $\boldsymbol{C}$;
    for each $x$ in $\boldsymbol{X}$ do
    $\boldsymbol{E}_{x}\Leftarrow\mathcal{N}_{\sigma_{0}}(x)$; % $\mathcal{N}_{\sigma_{0}}(\cdot)$ returns the initial neighbors with parameter $\sigma_{0}$.
    $\mathcal{N}_{x}\Leftarrow{\rm nbr}_{\bar{\mathcal{H}}}(x,\boldsymbol{E}_{x})$; % Refine neighbors by Algorithm 1.
    Add $\mathcal{N}_{x}$ to $\boldsymbol{E}$;
    end for
    Generate graph $\boldsymbol{G}(\boldsymbol{V}=\boldsymbol{X},\boldsymbol{E})$;
    Initialize $\boldsymbol{G}$: $v\in\boldsymbol{V}$, $v.{\rm cluster}=$ `unset', $e\in\boldsymbol{E}$, $e.{\rm label}=-$;
    while $\{v\in\boldsymbol{V},~v.{\rm cluster}=$ `unset'$\}\ne\emptyset$ do
    $\boldsymbol{c}\Leftarrow{\rm Cc_{s}}(\boldsymbol{G},{\rm iter}_{\rm max})$; % Extract one cluster by Algorithm 2.
    Add $\boldsymbol{c}$ to $\boldsymbol{C}$;
    Reset $\boldsymbol{G}$: $e\in\boldsymbol{E}$, $e.{\rm label}=-$;
    end while
    Return $\boldsymbol{C}$.
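Algorithm 3 reduces to repeatedly peeling one cluster off the unclustered remainder. A sketch under the assumption that the graph construction and the single-cluster step are supplied as callables (stand-ins for the neighboring step and Algorithm 2):

```python
def cc_multiple(build_graph, cc_single, X, sigma0, iter_max):
    # Peel off one cluster at a time from the still-unclustered nodes until
    # every node is assigned. `build_graph` and `cc_single` are parameters
    # because only their interfaces are assumed here.
    adj = build_graph(X, sigma0)
    clustered, clusters = set(), []
    while len(clustered) < len(adj):
        # Restrict the graph to unclustered nodes (the reset/'unset' step).
        rest = {v: {u: w for u, w in nb.items() if u not in clustered}
                for v, nb in adj.items() if v not in clustered}
        c = cc_single(rest, iter_max)
        if not c:                    # guard against a degenerate empty cluster
            c = {next(iter(rest))}
        clusters.append(c)
        clustered |= c
    return clusters
```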


    Algorithm 4 Start points initialization-${\rm Initspts}(\cdot)$

    Require: Graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in\boldsymbol{V}$, $v.{\rm cluster}=$ `unset', $e\in\boldsymbol{E}$, $e.{\rm label}=$ `unset';
    Output: Initialized graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}')$;
    for each $e_{ij}$ in $\boldsymbol{E}$ do
    if $e_{ij}.{\rm label}$ == `unset' then
    $e_{ij}.{\rm label}\Leftarrow$ node with maximum degree between nodes $i$, $j$;
    else
    $e_{ij}.{\rm label}\Leftarrow$ node with maximum degree among nodes $i$, $j$ and $e_{ij}.{\rm label}$;
    end if
    Add $e_{ij}$ to $\boldsymbol{E}'$;
    end for
    Return $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}')$.
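Algorithm 4 can be sketched directly; in this sketch degree is taken as the neighbor count, and undirected edges are keyed as ordered pairs:

```python
def init_start_points(adj):
    # Each edge is labeled with the id of its higher-degree endpoint, so
    # every edge ends up 'voting' for a locally dominant start point.
    deg = {v: len(nb) for v, nb in adj.items()}
    labels = {}
    for v in adj:
        for u in adj[v]:
            e = (min(u, v), max(u, v))  # undirected edge key
            if e not in labels:
                labels[e] = v if deg[v] >= deg[u] else u
            else:
                # Edge already labeled: keep the highest-degree candidate.
                labels[e] = max(labels[e], v, u, key=lambda n: deg[n])
    return labels
```

With an undirected adjacency map each edge is visited from both endpoints; the second visit falls into the `else` branch but compares the same three candidates, so the result is unchanged, matching the pseudocode's idempotent update.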


    Algorithm 5 Clusters generation-${\rm Gencluster}(\cdot,\cdot,\cdot)$

    Require: Edge-initialized graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$, overlapping parameter $l$, maximum iteration ${\rm iter}_{\rm max}$;
    Output: Node set of clusters $\boldsymbol{C}$;
    Initialize $\boldsymbol{V}$ of $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in\boldsymbol{V}$, $v.{\rm cluster}\Leftarrow$ empty list of length $l$; ${\rm iter}\Leftarrow 0$;
    while ${\rm iter}<{\rm iter}_{\rm max}$ do
    for each $v$ in $\boldsymbol{V}$ do
    $\boldsymbol{K}\Leftarrow$ collection of .label values of edges in $\boldsymbol{E}_{v}$;
    ${\rm templist}\Leftarrow$ empty list;
    for each $k$ in $\boldsymbol{K}$ do
    if ${\boldsymbol f}_{k}(v)\ge\sum\nolimits_{i\ne k}{\boldsymbol f}_{i}(v)$ then
    Add $k$ to templist;
    end if
    end for
    $v.{\rm cluster}\Leftarrow$ top $l$ items of templist ordered by their force ${\boldsymbol f}_{k}(v)$;
    end for
    ${\rm iter}\Leftarrow{\rm iter}+1$;
    end while
    $\boldsymbol{C}\Leftarrow$ aggregate nodes by their labels;
    Return $\boldsymbol{C}$.
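Algorithm 5's per-node vote can be sketched as below. Two assumptions of this sketch: the force ${\boldsymbol f}_{k}(v)$ is taken as the total weight of $v$'s incident edges labeled $k$, and the relabeling of edges between sweeps is omitted, so with static edge labels a single pass already reaches the fixed point:

```python
def gen_clusters(adj, edge_label, l, iter_max):
    # For each node, accumulate a 'force' per start-point label from the
    # incident edge weights, keep the labels whose force dominates all the
    # others combined, and retain at most the top-l of them (overlap).
    node_labels = {}
    for _ in range(iter_max):
        for v in adj:
            force = {}
            for u, w in adj[v].items():
                k = edge_label[(min(u, v), max(u, v))]
                force[k] = force.get(k, 0.0) + w
            total = sum(force.values())
            winners = [k for k, f in force.items() if f >= total - f]
            winners.sort(key=lambda k: force[k], reverse=True)
            node_labels[v] = winners[:l]
    # Aggregate nodes by the labels they kept.
    clusters = {}
    for v, ks in node_labels.items():
        for k in ks:
            clusters.setdefault(k, set()).add(v)
    return clusters
```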


    Algorithm 6 Correlation clustering on big data-${\rm Cc_{b}}(\cdot,\cdot,\cdot,\cdot)$

    Require: Dataset $\boldsymbol{X}$, initial neighbor parameter $\sigma_{0}$, maximum iteration ${\rm iter}_{\rm max}$, overlapping parameter $l$;
    Output: Node set of clusters $\boldsymbol{C}$;
    for each $x$ in $\boldsymbol{X}$ do
    $\boldsymbol{E}_{x}\Leftarrow\mathcal{N}_{\sigma_{0}}(x)$; % Initial neighbors with parameter $\sigma_{0}$.
    $\mathcal{N}_{x}\Leftarrow{\rm nbr}_{\bar{\mathcal{H}}}(x,\boldsymbol{E}_{x})$; % Refine neighbors by Algorithm 1.
    Add $\mathcal{N}_{x}$ to $\boldsymbol{E}$;
    end for
    Generate graph $\boldsymbol{G}(\boldsymbol{V}=\boldsymbol{X},\boldsymbol{E})$;
    $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}')\Leftarrow{\rm Initspts}(\boldsymbol{G})$; % Initialize start points by Algorithm 4.
    $\boldsymbol{C}\Leftarrow{\rm Gencluster}(\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}'),l,{\rm iter}_{\rm max})$; % Generate clusters by Algorithm 5.
    Return $\boldsymbol{C}$.
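Algorithm 6 is the composition of the preceding steps. A pipeline sketch, with the three stages passed as callables since only their interfaces, not their bodies, are fixed here:

```python
def cc_big(X, sigma0, iter_max, l, build_graph, init_start_points, gen_clusters):
    # Pipeline: build the entropy-rate neighbor graph, label edges with
    # start points (Algorithm 4), then grow up to l overlapping clusters
    # per node (Algorithm 5).
    adj = build_graph(X, sigma0)
    labels = init_start_points(adj)
    return gen_clusters(adj, labels, l, iter_max)
```

Passing the stages as parameters keeps the skeleton testable in isolation, which mirrors how the paper evaluates the neighboring, initialization, and generation steps as separate subroutines.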