
SCIENTIA SINICA Informationis, Volume 51, Issue 5: 779 (2021). https://doi.org/10.1360/SSI-2019-0120

A cross-media search method for social networks based on adversarial learning and semantic similarity

More info
  • Received: Jun 5, 2019
  • Accepted: Sep 4, 2019
  • Published: Apr 13, 2021

Abstract


Funded by

National Natural Science Foundation of China (61532006, 61772083, 61802028, 61877006)


References

[1] Bai Y. Research on tag-topic identification and community mining in social network. Dissertation for Ph.D. Degree. Dalian: Dalian University of Technology, 2018.

[2] Li X H, Kong W W, Ma Y Y, et al. Microblog hot topic detection using lexical semantic co-occurrence and community partition. Appl Res Comput, 2019, 37: 1--5.

[3] Bai H, Lin X G. Sina Weibo disaster information detection based on Chinese short text classification. J Catastrophology, 2016, 31: 19--23.

[4] Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge: MIT Press, 2016.

[5] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014. 2672--2680.

[6] Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.

[7] Li J W, Monroe W, Shi T L, et al. Adversarial learning for neural dialogue generation. arXiv preprint, 2017.

[8] Donahue J, Krähenbühl P, Darrell T. Adversarial feature learning. arXiv preprint, 2016.

[9] Wang K F, Gou C, Duan Y J, et al. Generative adversarial networks: the state of the art and beyond. Acta Autom Sin, 2017, 43: 321--332.

[10] Feng F X, Wang X J, Li R F. Cross-modal retrieval with correspondence autoencoder. In: Proceedings of the 22nd ACM International Conference on Multimedia, 2014. 7--16.

[11] Hardoon D R, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods. Neural Comput, 2004, 16: 2639--2664.

[12] Peng Y X, Huang X, Qi J W. Cross-media shared representation by hierarchical learning with multiple deep networks. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence, 2016. 3846--3853.

[13] Wang K, He R, Wang L. Joint feature selection and subspace learning for cross-modal retrieval. IEEE Trans Pattern Anal Mach Intell, 2016, 38: 2010--2023.

[14] Wang K Y, He R, Wang W, et al. Learning coupled feature spaces for cross-modal matching. In: Proceedings of IEEE International Conference on Computer Vision, 2013. 2088--2095.

[15] Li D G, Dimitrova N, Li M K, et al. Multimedia content processing through cross-modal association. In: Proceedings of the 11th ACM International Conference on Multimedia, 2003. 604--611.

[16] Yao T, Mei T, Ngo C W. Learning query and image similarities with ranking canonical correlation analysis. In: Proceedings of IEEE International Conference on Computer Vision, 2015. 28--36.

[17] Andrew G, Arora R, Bilmes J, et al. Deep canonical correlation analysis. In: Proceedings of International Conference on Machine Learning, 2013. 1247--1255.

[18] Zhuang Y T, Wang Y F, Wu F, et al. Supervised coupled dictionary learning with group structures for multi-modal retrieval. In: Proceedings of the 27th AAAI Conference on Artificial Intelligence, 2013.

[19] Ngiam J, Khosla A, Kim M, et al. Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011. 689--696.

[20] Yan F, Mikolajczyk K. Deep correlation for matching images and text. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3441--3450.

[21] Song J K, Yang Y, Yang Y, et al. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of ACM SIGMOD International Conference on Management of Data, 2013. 785--796.

[22] Li Z, Tang J, Mei T. Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell, 2019, 41: 2070--2083.

[23] Wang B K, Yang Y, Xu X, et al. Adversarial cross-modal retrieval. In: Proceedings of the 25th ACM International Conference on Multimedia, 2017. 154--162.

[24] Xu X, He L, Lu H. Deep adversarial metric learning for cross-modal retrieval. World Wide Web, 2019, 22: 657--672.

[25] He L, Xu X, Lu H M, et al. Unsupervised cross-modal retrieval through adversarial learning. In: Proceedings of IEEE International Conference on Multimedia and Expo (ICME), 2017. 1153--1158.

[26] Li C, Deng C, Li N, et al. Self-supervised adversarial hashing networks for cross-modal retrieval. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 4242--4251.

[27] Wu L, Wang Y, Shao L. Cycle-consistent deep generative hashing for cross-modal retrieval. IEEE Trans Image Process, 2019, 28: 1602--1612.

[28] Zhang J, Peng Y X, Yuan M K. Unsupervised generative adversarial cross-modal hashing. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.

[29] Su W H, Yuan Y, Zhu M. A relationship between the average precision and the area under the ROC curve. In: Proceedings of International Conference on the Theory of Information Retrieval, 2015. 349--352.

[30] Flach P, Kull M. Precision-recall-gain curves: PR analysis done right. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015. 838--846.

[31] Zhen L, Hu P, Wang X, et al. Deep supervised cross-modal retrieval. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 10394--10403.

  • Figure 1

    (Color online) The general flowchart of the proposed SSACR method

  • Figure 2

(Color online) PR curves of image-to-text (a) and text-to-image (b) search on the Wikipedia dataset

  • Figure 3

(Color online) PR curves of image-to-text (a) and text-to-image (b) search on the Weibo dataset

  • Figure 4

(Color online) The performance of SSACR with different values of the model parameter $k$ on the Wikipedia dataset (a) and the Weibo dataset (b)

  • Figure 5

(a) The performance of image-to-text search, (b) text-to-image search, and (c) the average performance with different model parameters on the Wikipedia dataset

  • Figure 6

(a) The performance of image-to-text search, (b) text-to-image search, and (c) the average performance with different model parameters on the Weibo dataset

  • Figure 7

The embedding and domain losses during the training process

  • Figure 8

    (Color online) Example of cross-media retrieval on Weibo dataset

  • Table 1   Statistics of the datasets in our experiments
    Dataset              Training instances   Test instances   Labels   Image feature   Text feature
    Wikipedia dataset    2173                 693              10       4096-d VGG19    5000-d BoW
    Weibo dataset        4262                 473              5        4096-d VGG19    6739-d BoW
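
    The features in Table 1 are standard off-the-shelf representations: 4096-dimensional activations from the last hidden fully connected layer of VGG19, and bag-of-words text vectors. Below is a minimal sketch of producing such features, assuming torchvision's pretrained VGG19 and scikit-learn's CountVectorizer; the paper's own preprocessing pipeline is not specified here.

        # Sketch only: torchvision VGG19 + scikit-learn CountVectorizer are
        # assumptions; the paper does not publish its extraction code.
        import torch
        from torchvision import models
        from sklearn.feature_extraction.text import CountVectorizer

        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        vgg.classifier = vgg.classifier[:5]   # stop after fc7: 4096-d output
        vgg.eval()

        with torch.no_grad():
            image_feat = vgg(torch.randn(1, 3, 224, 224))  # stand-in for a real image
        print(image_feat.shape)                            # torch.Size([1, 4096])

        texts = ["flood reported downtown", "museum exhibit opens this weekend"]
        bow = CountVectorizer(max_features=5000)           # 5000-d cap, as in Table 1
        text_feat = bow.fit_transform(texts).toarray()     # one BoW row per text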
  • Algorithm 1   Training process of SSACR

    Require: Mini-batch features of the current batch: image features $\{v_1,\ldots,v_n\}$, text features $\{t_1,\ldots,t_n\}$, and semantic distributions $\{l_1,\ldots,l_n\}$; number of inner iterations $k$ per round of feature-projection training; mini-batch size $m$; learning rate $\mu$; loss weight $\lambda$;

    Output: Trained $\theta_V$ and $\theta_T$;

    Randomly initialize the parameters of the model;

    while not converged do

    while $k > 0$ do

    Update $\theta_V$, $\theta_T$, and $\theta_{\rm imd}$ in the direction of the descending gradient:
    $\theta_V = \theta_V - \mu \cdot \nabla_{\theta_V} \frac{1}{m}(L_{\rm emb} - L_{\rm adv})$;
    $\theta_T = \theta_T - \mu \cdot \nabla_{\theta_T} \frac{1}{m}(L_{\rm emb} - L_{\rm adv})$;
    $\theta_{\rm imd} = \theta_{\rm imd} - \mu \cdot \nabla_{\theta_{\rm imd}} \frac{1}{m}(L_{\rm emb} - L_{\rm adv})$;

    $k \Leftarrow k - 1$;

    end while

    Update $\theta_D$ in the direction of the ascending gradient:
    $\theta_D = \theta_D + \mu \cdot \lambda \cdot \nabla_{\theta_D} \frac{1}{m}(L_{\rm emb} - L_{\rm adv})$;

    end while

    Return $\theta_V$ and $\theta_T$. (A runnable sketch of this loop follows.)
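
    A minimal PyTorch sketch of this min-max loop. The layer sizes, the squared-distance stand-in for $L_{\rm emb}$, and the binary cross-entropy stand-in for $L_{\rm adv}$ are illustrative assumptions, not the paper's exact loss terms, and $\lambda$ is folded into the discriminator's learning rate.

        import torch
        import torch.nn as nn

        class Projector(nn.Module):
            # Maps one modality's features into the shared semantic space.
            def __init__(self, in_dim, out_dim=256):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                         nn.Linear(1024, out_dim))
            def forward(self, x):
                return self.net(x)

        img_proj = Projector(4096)   # 4096-d VGG19 image features (Table 1)
        txt_proj = Projector(5000)   # 5000-d BoW text features (Table 1)
        disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

        opt_proj = torch.optim.Adam(list(img_proj.parameters()) +
                                    list(txt_proj.parameters()), lr=1e-4)
        opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)  # lambda folded in
        bce = nn.BCEWithLogitsLoss()

        def modality_loss(pv, pt):
            # Discriminator tries to label images 1 and texts 0 (L_adv stand-in).
            logits = disc(torch.cat([pv, pt]))
            labels = torch.cat([torch.ones(len(pv), 1), torch.zeros(len(pt), 1)])
            return bce(logits, labels)

        def train_step(v, t, k=3):
            # Inner loop: k descent steps on (L_emb - L_adv) for the projectors,
            # aligning paired embeddings while trying to fool the discriminator.
            for _ in range(k):
                pv, pt = img_proj(v), txt_proj(t)
                l_emb = (pv - pt).pow(2).sum(dim=1).mean()  # paired-alignment stand-in
                opt_proj.zero_grad()
                (l_emb - modality_loss(pv, pt)).backward()
                opt_proj.step()
            # Outer step: ascend the same objective w.r.t. theta_D, i.e. make the
            # discriminator better at telling the two modalities apart.
            opt_disc.zero_grad()
            modality_loss(img_proj(v).detach(), txt_proj(t).detach()).backward()
            opt_disc.step()

        train_step(torch.randn(8, 4096), torch.randn(8, 5000))  # one toy batch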

  • Algorithm 2   Cross-modal search process

    Require: Search item $x$, image data $V$, text data $T$; notation: $s$ is the projection vector of the search item, $R$ is the projection matrix of the data in the other modality, and $S$ is the similarity matrix between $s$ and $R$;

    Output: Search result list res;

    if $x$ is text then

    $s = f_T(x; \theta_T)$;

    $R = f_V(V; \theta_V)$;

    else

    $s = f_V(x; \theta_V)$;

    $R = f_T(T; \theta_T)$;

    end if

    $S = {\rm sim}(s, R)$;

    ${\rm res} = {\rm argsort}(S)[:\text{top-}K]$;

    Return res. (A runnable sketch follows.)
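
    A minimal sketch of this search step, reusing the projectors from the training sketch above. Cosine similarity is an assumed instantiation of ${\rm sim}(s, R)$; the algorithm only requires a similarity score followed by argsort.

        import torch
        import torch.nn.functional as F

        def search(x, gallery, query_proj, gallery_proj, top_k=10):
            # Return indices of the top_k gallery items most similar to query x.
            with torch.no_grad():
                s = query_proj(x.unsqueeze(0))       # 1 x d query embedding
                R = gallery_proj(gallery)            # N x d cross-modal gallery
                S = F.cosine_similarity(s, R)        # N similarity scores
            return S.argsort(descending=True)[:top_k]  # res = argsort(S)[:top-K]

        # Text query against the image gallery (x is text, so R comes from images):
        res = search(torch.randn(5000), torch.randn(100, 4096), txt_proj, img_proj)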

  • Table 2   Comparison of cross-media retrieval performance on the Wikipedia dataset
    Method     mAP@5                          mAP@20                         mAP@50
               txt2img  img2txt  Average      txt2img  img2txt  Average      txt2img  img2txt  Average
    CCA        0.2685   0.2151   0.2418       0.2831   0.2209   0.2520       0.2543   0.2178   0.2361
    JFSSL      0.4406   0.3473   0.3940       0.4264   0.3576   0.3920       0.4146   0.3454   0.3800
    CMDN       0.5094   0.4125   0.4609       0.4895   0.4102   0.4498       0.4624   0.3956   0.4290
    ACMR       0.6225   0.4987   0.5606       0.6109   0.4986   0.5548       0.5732   0.4835   0.5284
    DSCMR      0.6342   0.4982   0.5662       0.6421   0.5012   0.5716       0.6347   0.4901   0.5624
    Ours       0.6604   0.4964   0.5784       0.6647   0.5052   0.5850       0.6436   0.4964   0.5700
  • Table 3   Comparison of cross-media retrieval performance on the Weibo dataset
    Method     mAP@5                          mAP@20                         mAP@50
               txt2img  img2txt  Average      txt2img  img2txt  Average      txt2img  img2txt  Average
    CCA        0.3885   0.3251   0.3568       0.3583   0.3239   0.3411       0.3213   0.3199   0.3206
    JFSSL      0.6478   0.5351   0.5915       0.6128   0.5181   0.5655       0.5197   0.5282   0.5239
    CMDN       0.7183   0.5814   0.6499       0.6799   0.5843   0.6321       0.5906   0.5729   0.5817
    ACMR       0.8653   0.7133   0.7893       0.8238   0.7071   0.7655       0.7065   0.6992   0.7029
    DSCMR      0.8742   0.7177   0.7960       0.8192   0.7246   0.7719       0.7410   0.7267   0.7339
    Ours       0.8792   0.7344   0.8068       0.8161   0.7361   0.7761       0.7493   0.7410   0.7452
  • Table 4   Comparison of SSACR performance on the Wikipedia dataset with different loss components
    Loss                    mAP@5                          mAP@20                         mAP@50
                            txt2img  img2txt  Average      txt2img  img2txt  Average      txt2img  img2txt  Average
    ${L_{\rm imd}}$ only    0.6591   0.4859   0.5725       0.6677   0.4946   0.5811       0.6468   0.4916   0.5692
    ${L_{\rm imi}}$ only    0.5118   0.5031   0.5075       0.5287   0.5055   0.5171       0.5100   0.4974   0.5037
    Both                    0.6604   0.4964   0.5784       0.6647   0.5052   0.5850       0.6436   0.4964   0.5700
  • Table 5   Comparison of SSACR performance on the Weibo dataset with different loss components
    Loss                    mAP@5                          mAP@20                         mAP@50
                            txt2img  img2txt  Average      txt2img  img2txt  Average      txt2img  img2txt  Average
    ${L_{\rm imd}}$ only    0.6082   0.6842   0.6462       0.7094   0.6476   0.6785       0.7819   0.6577   0.7198
    ${L_{\rm imi}}$ only    0.6853   0.5031   0.5942       0.6891   0.5055   0.5973       0.6905   0.4974   0.5940
    Both                    0.8792   0.7344   0.8068       0.8161   0.7361   0.7761       0.7493   0.7410   0.7452
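
    For reference, a sketch of the mAP@$K$ metric reported in Tables 2-5: each query's average precision over its top-$K$ ranked results, then the mean over all queries. This is the standard definition, not code from the paper.

        import numpy as np

        def average_precision_at_k(relevant, k):
            # relevant: booleans over the ranked list (True = shares the query's label).
            rel = np.asarray(relevant[:k], dtype=float)
            if rel.sum() == 0:
                return 0.0
            prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision at each rank
            return float((prec * rel).sum() / rel.sum())       # averaged over hits only

        def map_at_k(relevance_lists, k):
            return float(np.mean([average_precision_at_k(r, k) for r in relevance_lists]))

        # Two toy queries with ranked relevance judgments:
        print(map_at_k([[1, 0, 1, 0, 0], [0, 1, 1, 1, 0]], k=5))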