SCIENTIA SINICA Informationis, Volume 49, Issue 12: 1572-1585 (2019). https://doi.org/10.1360/SSI-2019-0117

Maximum average entropy-rate based correlation clustering for big data

  • Received: Jun 4, 2019
  • Accepted: Jul 10, 2019
  • Published: Dec 16, 2019



[1] Kolosnjaji B, Zarras A, Webster G, et al. Deep learning for classification of malware system call sequences. In: Proceedings of Australasian Joint Conference on Artificial Intelligence. Berlin: Springer, 2016. 137-149

[2] Bühlmann P, Rütimann P, van de Geer S. Correlated variables in regression: clustering and sparse estimation. J Statistical Planning Inference, 2013, 143: 1835-1858

[3] He Z, Liu H, Wang Y W. Generative adversarial networks-based semi-supervised learning for hyperspectral image classification. Remote Sens, 2017, 9: 1042

[4] Kanungo T, Mount D M, Netanyahu N S, et al. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Machine Intell, 2002, 24: 881-892

[5] Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Machine Intell, 2002, 24: 603-619

[6] Estrada F, Jepson A, Chennubhotla C, et al. Spectral embedding and min cut for image segmentation. In: Proceedings of BMVC, Kingston, 2004. 1-10

[7] Shi J B, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Machine Intell, 2000, 22: 888-905

[8] Bansal N, Blum A, Chawla S. Correlation clustering. Mach Learn, 2004, 56: 89-113

[9] Felzenszwalb P F, Huttenlocher D P. Efficient graph-based image segmentation. Int J Comput Vision, 2004, 59: 167-181

[10] Mohebi A, Aghabozorgi S, Wah T Y. Iterative big data clustering algorithms: a review. Softw Pract Exper, 2016, 46: 107-129

[11] Zhao W Z, Ma H F, He Q. Parallel k-means clustering based on MapReduce. In: Proceedings of IEEE International Conference on Cloud Computing. Berlin: Springer, 2009. 674-679

[12] Bu Y Y, Howe B, Balazinska M, et al. HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endow, 2010, 3: 285-296

[13] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of NSDI, San Jose, 2012. 2

[14] He Y B, Tan H Y, Luo W M, et al. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of ICPADS, Tainan, 2011. 473-480

[15] Chen W Y, Song Y Q, Bai H J, et al. Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell, 2011, 33: 568-586

[16] Chopra S, Rao M R. The partition problem. Math Programming, 1993, 59: 87-115

[17] Nowozin S, Jegelka S. Solution stability in linear programming relaxations: graph partitioning and unsupervised learning. In: Proceedings of ICML, Montreal, 2009. 769-776

[18] Vitaladevuni S N, Basri R. Co-clustering of image segments using convex optimization applied to EM neuronal reconstruction. In: Proceedings of CVPR, San Francisco, 2010. 2203-2210

[19] Tsochantaridis I, Joachims T, Hofmann T, et al. Large margin methods for structured and interdependent output variables. J Mach Learn Res, 2005, 6: 1453-1484

[20] Bagon S, Galun M. Large scale correlation clustering optimization. arXiv preprint, 2011

[21] Chierichetti F, Dalvi N, Kumar R. Correlation clustering in MapReduce. In: Proceedings of KDD, New York, 2014. 641-650

[22] Blondel V D, Guillaume J L, Lambiotte R, et al. Fast unfolding of communities in large networks. J Stat Mech, 2008: P10008

[23] Liu M Y, Tuzel O, Ramalingam S, et al. Entropy-rate clustering: cluster analysis via maximizing a submodular function subject to a matroid constraint. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 99-112

[24] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344: 1492-1496

[25] Frey B J, Dueck D. Clustering by passing messages between data points. Science, 2007, 315: 972-976

[26] Ng A Y, Jordan M I, Weiss Y. On spectral clustering: analysis and an algorithm. In: Proceedings of NIPS, Vancouver, 2002. 849-856

[27] Bouguettaya A, Yu Q, Liu X M, et al. Efficient agglomerative hierarchical clustering. Expert Syst Appl, 2015, 42: 2785-2797

[28] Fränti P, Sieranoja S. K-means properties on six clustering benchmark datasets. Appl Intell, 2018, 48: 4743-4759

[29] King D E. Dlib-ml: a machine learning toolkit. J Mach Learn Res, 2009, 10: 1755-1758

[30] Deng J K, Guo J, Xue N N, et al. ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of CVPR, Long Beach, 2019. 4690-4699

  • Figure 1

    (Color online) Illustration of correlation clustering

  • Figure 2

    (Color online) Comparison of neighbors under the original $\sigma_{0}$ and the optimized $\hat{\sigma}$. Taking the Aggregation dataset as an example, neighbors under the original $\sigma_{0}=10$ are shown in (a) and neighbors under the optimized $\hat{\sigma}=0.4$ are shown in (b).

  • Figure 3

    (Color online) Clustering results comparison on standard 2D datasets

  • Figure 4

    (Color online) Running-time comparison of ECC, $k$-means and $k$-NN+Louvain: each running time is the average of 3 runs; values of $k$ for $k$-means are $1003$, $2119$, $3141$, $4207$, $5181$, $6209$, $7232$, $8252$ and $9306$; parameters for ECC are $\sigma_{0}=0.5$, ${\rm iter}_{\rm max}=10$, $l=1$; parameters for $k$-NN+Louvain are $l=0$, $k=10$.

  • Table 1   Clustering metrics comparison on standard 2D datasets, including F1-measure ($F1$) and purity ($P$)

    Dataset      Metric  ECC     $k$-means [4]  AP [25]  Mean-shift [5]  Spectral [26]  HAC [27]  Ncut [7]
    Flame        $F1$    1.0000  0.8406         0.8243   0.8528          0.9832         0.7732    0.9917
    Flame        $P$     1.0000  0.8647         0.8488   0.8586          0.8787         0.8595    0.9917
    Spiral       $F1$    1.0000  0.3502         0.3225   0.3562          1.0000         0.3917    0.5823
    Spiral       $P$     1.0000  0.3447         0.3566   0.3475          1.0000         0.3337    0.5992
    Aggregation  $F1$    1.0000  0.8504         0.8618   0.9595          0.7491         0.9097    0.9925
    Aggregation  $P$     1.0000  0.9453         0.9415   0.9685          0.8242         0.9646    0.9930
    Face         $F1$    0.9669  0.8637         0.8867   0.6883          0.3478         0.9605    0.1644
    Face         $P$     0.9727  0.7134         0.7583   0.5445          0.1738         0.8335    0.1952

    Algorithm 1 Neighboring by greedy algorithm-${{\rm nbr}_{\bar{\mathcal{H}}}}(\cdot,\cdot)$

    Require: Node $i$, initial edge set of $i$: $\boldsymbol{E}_{i}=\{e_{ij}\in\boldsymbol{E}\}$;
    Output: Edge set based on maximum average entropy rate $\mathcal{N}_{i}$;
    $\boldsymbol{E}'_{i}\Leftarrow$ order edges in $\boldsymbol{E}_{i}$ by weight in descending order;
    $\mathcal{N}_{i}\Leftarrow\emptyset$; $\bar{\mathcal{H}}_{\rm max}\Leftarrow 0$; % Maximum average entropy rate seen so far.
    for edge $e_{ij}$ in $\boldsymbol{E}'_{i}$ do
    $\bar{\mathcal{H}}_{\rm temp}\Leftarrow\bar{\mathcal{H}}(\mathcal{N}_{i}\cup\{e_{ij}\})$; % Average entropy rate after adding the candidate edge.
    if $\bar{\mathcal{H}}_{\rm temp}\ge\bar{\mathcal{H}}_{\rm max}$ then
    Add $e_{ij}$ to $\mathcal{N}_{i}$; $\bar{\mathcal{H}}_{\rm max}\Leftarrow\bar{\mathcal{H}}_{\rm temp}$;
    else
    Return $\mathcal{N}_{i}$; % Edges are sorted, so later edges cannot raise the rate.
    end if
    end for
    Return $\mathcal{N}_{i}$;
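Algorithm 1 admits a compact sketch. This is a minimal Python rendering, assuming edges are (weight, neighbor) pairs; the average entropy rate $\bar{\mathcal{H}}(\cdot)$ is supplied by the caller, since its definition lives in the main text:

```python
def nbr_greedy(edges, avg_entropy_rate):
    # Scan edges by descending weight and keep an edge only while the
    # average entropy rate of the kept set does not drop below the best
    # value seen so far.
    kept = []
    h_max = 0.0
    for e in sorted(edges, key=lambda e: e[0], reverse=True):
        h_temp = avg_entropy_rate(kept + [e])
        if h_temp >= h_max:
            kept.append(e)
            h_max = h_temp
        else:
            # Edges are sorted, so no later edge can raise the rate: stop.
            break
    return kept
```

Because the candidate edges are pre-sorted by weight, the scan terminates at the first edge that lowers the average rate.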


    Algorithm 2 Correlation clustering on single cluster-${\rm Cc_{s}}(\cdot,\cdot)$

    Require: Graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in\boldsymbol{V}$, $v.{\rm cluster}=$ `unset', $e\in\boldsymbol{E}$, $e.{\rm label}=-$; maximum iteration ${\rm iter}_{\rm max}$;
    Output: Node set of cluster $\boldsymbol{c}$;
    $\boldsymbol{V}_{\rm temp}\Leftarrow$ nodes from $\boldsymbol{V}$ with .cluster = `unset'; % Unclustered node set.
    $i\Leftarrow$ node from $\boldsymbol{V}_{\rm temp}$ with maximum degree;
    Set each edge in $\boldsymbol{E}_{i}$: .label = $+$; ${\rm iter}\Leftarrow 0$;
    while ${\rm iter}<{\rm iter}_{\rm max}$ do
    for each $v$ in $\boldsymbol{V}$ do
    if ${\boldsymbol f}^{+}(v)\ge{\boldsymbol f}^{-}(v)$ then
    Set $v.{\rm cluster}=i$, set edge labels in $\boldsymbol{E}_{v}$ to $+$; % $v$ is contained in cluster $\boldsymbol{c}$.
    else
    Set $v.{\rm cluster}=$ `unset', set edge labels in $\boldsymbol{E}_{v}$ to $-$; % $v$ is not contained in cluster $\boldsymbol{c}$.
    end if
    end for
    ${\rm iter}\Leftarrow{\rm iter}+1$;
    end while
    $\boldsymbol{c}\Leftarrow$ nodes from $\boldsymbol{V}$ with .cluster = $i$;
    Return $\boldsymbol{c}$.
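Algorithm 2's relabeling loop can be sketched as follows. This is a simplified rendering in which cluster membership stands in for the $\pm$ edge labels, and the forces ${\boldsymbol f}^{+}(v)$, ${\boldsymbol f}^{-}(v)$ are assumed to sum the weights of $v$'s edges into and out of the current cluster; both assumptions are of this sketch, not the paper:

```python
def cc_single(adj, iter_max):
    # `adj` maps node -> {neighbor: weight}. Cluster membership plays the
    # role of the +/- edge labels: f+(v) / f-(v) are taken here as the total
    # weight of v's edges into / out of the current cluster.
    seed = max(adj, key=lambda v: len(adj[v]))   # start from the max-degree node
    cluster = {seed} | set(adj[seed])            # seed's edges labeled '+'
    for _ in range(iter_max):
        nxt = set()
        for v in adj:
            f_pos = sum(w for u, w in adj[v].items() if u in cluster)
            f_neg = sum(w for u, w in adj[v].items() if u not in cluster)
            if f_pos >= f_neg:
                nxt.add(v)                       # v joins the cluster
        cluster = nxt
    return cluster
```

The sweep is synchronous: all nodes are relabeled against the cluster of the previous iteration, matching the pseudocode's per-iteration pass over $\boldsymbol{V}$.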


    Algorithm 3 Correlation clustering on multiple clusters-${\rm Cc_{m}}(\cdot,\cdot,\cdot)$

    Require: Dataset $\boldsymbol{X}$, initial neighbor parameter $\sigma_{0}$, maximum iteration ${\rm iter}_{\rm max}$;
    Output: Clustering result $\boldsymbol{C}$;
    for each $x$ in $\boldsymbol{X}$ do
    $\boldsymbol{E}_{x}\Leftarrow\mathcal{N}_{\sigma_{0}}(x)$; % $\mathcal{N}_{\sigma_{0}}(\cdot)$ returns the initial neighbors with parameter $\sigma_{0}$.
    $\mathcal{N}_{x}\Leftarrow{\rm nbr}_{\bar{\mathcal{H}}}(x,\boldsymbol{E}_{x})$; % Refine neighbors by Algorithm 1.
    Add $\mathcal{N}_{x}$ to $\boldsymbol{E}$;
    end for
    Generate graph $\boldsymbol{G}(\boldsymbol{V}=\boldsymbol{X},\boldsymbol{E})$;
    Initialize $\boldsymbol{G}$: $v\in\boldsymbol{V}$, $v.{\rm cluster}=$ `unset', $e\in\boldsymbol{E}$, $e.{\rm label}=-$;
    while $\{v\in\boldsymbol{V},~v.{\rm cluster}=$ `unset'$\}\ne\emptyset$ do
    $\boldsymbol{c}\Leftarrow{\rm Cc_{s}}(\boldsymbol{G},{\rm iter}_{\rm max})$; % Extract one cluster by Algorithm 2.
    Add $\boldsymbol{c}$ to $\boldsymbol{C}$;
    Reset $\boldsymbol{G}$: $e\in\boldsymbol{E}$, $e.{\rm label}=-$;
    end while
    Return $\boldsymbol{C}$.
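Algorithm 3 reduces to repeatedly peeling one cluster off the unclustered remainder. A sketch under the assumption that the graph construction and the single-cluster step are supplied as callables (stand-ins for the neighboring step and Algorithm 2):

```python
def cc_multiple(build_graph, cc_single, X, sigma0, iter_max):
    # Peel off one cluster at a time from the still-unclustered nodes until
    # every node is assigned. `build_graph` and `cc_single` are parameters
    # because only their interfaces are assumed here.
    adj = build_graph(X, sigma0)
    clustered, clusters = set(), []
    while len(clustered) < len(adj):
        # Restrict the graph to unclustered nodes (the reset/'unset' step).
        rest = {v: {u: w for u, w in nb.items() if u not in clustered}
                for v, nb in adj.items() if v not in clustered}
        c = cc_single(rest, iter_max)
        if not c:                    # guard against a degenerate empty cluster
            c = {next(iter(rest))}
        clusters.append(c)
        clustered |= c
    return clusters
```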


    Algorithm 4 Start points initialization-${\rm Initspts}(\cdot)$

    Require: Graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in\boldsymbol{V}$, $v.{\rm cluster}=$ `unset', $e\in\boldsymbol{E}$, $e.{\rm label}=$ `unset';
    Output: Initialized graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}')$;
    for each $e_{ij}$ in $\boldsymbol{E}$ do
    if $e_{ij}.{\rm label}$ == `unset' then
    $e_{ij}.{\rm label}\Leftarrow$ node with maximum degree between nodes $i$, $j$;
    else
    $e_{ij}.{\rm label}\Leftarrow$ node with maximum degree among nodes $i$, $j$ and $e_{ij}.{\rm label}$;
    end if
    Add $e_{ij}$ to $\boldsymbol{E}'$;
    end for
    Return $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}')$.
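Algorithm 4 can be sketched directly; in this sketch degree is taken as the neighbor count, and undirected edges are keyed as ordered pairs:

```python
def init_start_points(adj):
    # Each edge is labeled with the id of its higher-degree endpoint, so
    # every edge ends up 'voting' for a locally dominant start point.
    deg = {v: len(nb) for v, nb in adj.items()}
    labels = {}
    for v in adj:
        for u in adj[v]:
            e = (min(u, v), max(u, v))  # undirected edge key
            if e not in labels:
                labels[e] = v if deg[v] >= deg[u] else u
            else:
                # Edge already labeled: keep the highest-degree candidate.
                labels[e] = max(labels[e], v, u, key=lambda n: deg[n])
    return labels
```

With an undirected adjacency map each edge is visited from both endpoints; the second visit falls into the `else` branch but compares the same three candidates, so the result is unchanged, matching the pseudocode's idempotent update.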


    Algorithm 5 Clusters generation-${\rm Gencluster}(\cdot,\cdot,\cdot)$

    Require: Edge-initialized graph $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$, overlapping parameter $l$, maximum iteration ${\rm iter}_{\rm max}$;
    Output: Node set of clusters $\boldsymbol{C}$;
    Initialize $\boldsymbol{V}$ of $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E})$: $v\in\boldsymbol{V}$, $v.{\rm cluster}\Leftarrow$ empty list of length $l$; ${\rm iter}\Leftarrow 0$;
    while ${\rm iter}<{\rm iter}_{\rm max}$ do
    for each $v$ in $\boldsymbol{V}$ do
    $\boldsymbol{K}\Leftarrow$ collection of .label values of edges in $\boldsymbol{E}_{v}$;
    ${\rm templist}\Leftarrow$ empty list;
    for each $k$ in $\boldsymbol{K}$ do
    if ${\boldsymbol f}_{k}(v)\ge\sum\nolimits_{i\ne k}{\boldsymbol f}_{i}(v)$ then
    Add $k$ to templist;
    end if
    end for
    $v.{\rm cluster}\Leftarrow$ top $l$ items of templist ordered by their force ${\boldsymbol f}_{k}(v)$;
    end for
    ${\rm iter}\Leftarrow{\rm iter}+1$;
    end while
    $\boldsymbol{C}\Leftarrow$ aggregate nodes by their labels;
    Return $\boldsymbol{C}$.
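Algorithm 5's per-node vote can be sketched as below. Two assumptions of this sketch: the force ${\boldsymbol f}_{k}(v)$ is taken as the total weight of $v$'s incident edges labeled $k$, and the relabeling of edges between sweeps is omitted, so with static edge labels a single pass already reaches the fixed point:

```python
def gen_clusters(adj, edge_label, l, iter_max):
    # For each node, accumulate a 'force' per start-point label from the
    # incident edge weights, keep the labels whose force dominates all the
    # others combined, and retain at most the top-l of them (overlap).
    node_labels = {}
    for _ in range(iter_max):
        for v in adj:
            force = {}
            for u, w in adj[v].items():
                k = edge_label[(min(u, v), max(u, v))]
                force[k] = force.get(k, 0.0) + w
            total = sum(force.values())
            winners = [k for k, f in force.items() if f >= total - f]
            winners.sort(key=lambda k: force[k], reverse=True)
            node_labels[v] = winners[:l]
    # Aggregate nodes by the labels they kept.
    clusters = {}
    for v, ks in node_labels.items():
        for k in ks:
            clusters.setdefault(k, set()).add(v)
    return clusters
```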


    Algorithm 6 Correlation clustering on big data-${\rm Cc_{b}}(\cdot,\cdot,\cdot,\cdot)$

    Require: Dataset $\boldsymbol{X}$, initial neighbor parameter $\sigma_{0}$, maximum iteration ${\rm iter}_{\rm max}$, overlapping parameter $l$;
    Output: Node set of clusters $\boldsymbol{C}$;
    for each $x$ in $\boldsymbol{X}$ do
    $\boldsymbol{E}_{x}\Leftarrow\mathcal{N}_{\sigma_{0}}(x)$; % Initial neighbors with parameter $\sigma_{0}$.
    $\mathcal{N}_{x}\Leftarrow{\rm nbr}_{\bar{\mathcal{H}}}(x,\boldsymbol{E}_{x})$; % Refine neighbors by Algorithm 1.
    Add $\mathcal{N}_{x}$ to $\boldsymbol{E}$;
    end for
    Generate graph $\boldsymbol{G}(\boldsymbol{V}=\boldsymbol{X},\boldsymbol{E})$;
    $\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}')\Leftarrow{\rm Initspts}(\boldsymbol{G})$; % Initialize start points by Algorithm 4.
    $\boldsymbol{C}\Leftarrow{\rm Gencluster}(\boldsymbol{G}(\boldsymbol{V},\boldsymbol{E}'),l,{\rm iter}_{\rm max})$; % Generate clusters by Algorithm 5.
    Return $\boldsymbol{C}$.
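Algorithm 6 is the composition of the preceding steps. A pipeline sketch, with the three stages passed as callables since only their interfaces, not their bodies, are fixed here:

```python
def cc_big(X, sigma0, iter_max, l, build_graph, init_start_points, gen_clusters):
    # Pipeline: build the entropy-rate neighbor graph, label edges with
    # start points (Algorithm 4), then grow up to l overlapping clusters
    # per node (Algorithm 5).
    adj = build_graph(X, sigma0)
    labels = init_start_points(adj)
    return gen_clusters(adj, labels, l, iter_max)
```

Passing the stages as parameters keeps the skeleton testable in isolation, which mirrors how the paper evaluates the neighboring, initialization, and generation steps as separate subroutines.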