
SCIENTIA SINICA Informationis, Volume 50, Issue 6: 813-823 (2020). https://doi.org/10.1360/SSI-2019-0284

A semantic relation preserved word embedding reuse method

Article history
  • Received: Dec 24, 2019
  • Accepted: Apr 14, 2020
  • Published: Jun 1, 2020

Abstract


Funded by

National Key R&D Program of China (2018YFB1004300)

National Natural Science Foundation of China (61773198, 61632004)



  • Figure 1

    (Color online) Illustration of the deep NLP framework

  • Figure 2

    (Color online) Comparison between SrpWer and other conventional methods

  • Figure 3

    (Color online) Learning curves of different usages on the Yelp13 task with WIKI embeddings. (a) Training loss; (b) test loss

  • Algorithm 1   Semantic relation preserved word embedding reuse (SrpWer)

    Input:

    pretrained word embeddings $\boldsymbol{W}_{\rm P} \in \mathbb{R}^{n_{\rm p} \times d}$;

    vocabulary of the pretraining corpus $V_{\rm P} = \{v_i\}_{i=1}^{n_{\rm p}}$;

    current corpus $\mathcal{D}$;

    current vocabulary $V = \{v_j\}_{j=1}^{n}$;

    regularization coefficient $\lambda$.

    Main procedure:

    $\text{Skipgram}(\mathcal{D}) \rightarrow \boldsymbol{W} \in \mathbb{R}^{n \times d}$;

    $V_{\rm I} = V_{\rm P} \cap V$, $V_{\rm O} = V \setminus V_{\rm I}$;

    $\boldsymbol{W}_{\rm I} = \text{lookup}(\boldsymbol{W}, V_{\rm I})$, $\boldsymbol{W}_{\rm O} = \text{lookup}(\boldsymbol{W}, V_{\rm O})$;

    $\boldsymbol{W}_{\rm PI} = \text{lookup}(\boldsymbol{W}_{\rm P}, V_{\rm I})$;

    $\boldsymbol{Z}^{\star} = \arg\min_{\boldsymbol{Z}} \left\Vert \boldsymbol{Z}^{\rm T}\boldsymbol{W}_{\rm I} - \boldsymbol{W}_{\rm O} \right\Vert_{\rm F}^2 + \lambda \Vert \boldsymbol{Z} \Vert_{\rm F}^2$;

    $\boldsymbol{W}_{\rm PO} = {\boldsymbol{Z}^{\star}}^{\rm T}\boldsymbol{W}_{\rm PI}$;

    $\hat{\boldsymbol{W}} = \left[ \boldsymbol{W}_{\rm PI}; \boldsymbol{W}_{\rm PO} \right]$.

    Output: $\hat{\boldsymbol{W}} \in \mathbb{R}^{n \times d}$.

    An illustrative NumPy sketch of this procedure is given after Table 4.

  • Table 1   Vocabulary statistics between the pretrained corpora and the current tasks ($n_{\rm I}$: overlapping words; $n_{\rm O}$: new words)
                    IMDB                      News20                    Yelp13
                    $n_{\rm I}$  $n_{\rm O}$  $n_{\rm I}$  $n_{\rm O}$  $n_{\rm I}$  $n_{\rm O}$
    WIKI-Glove      19911        89           17902        2038         18965        1035
    IMDB-SG         19976        24           13616        6384         18856        1144
  • Table 2   Performance comparison of different usages on three NLP tasks based on WIKI-Glove embeddings
                    CNN                        GRU
                    IMDB    News20  Yelp13     IMDB    News20  Yelp13
    NoPT            0.318   0.416   0.508      0.429   0.684   0.596
    PT-NoFT         0.309   0.444   0.553      0.476   0.777   0.609
    PT-FT           0.336   0.741   0.587      0.473   0.816   0.603
    PT-FT-Mu        0.321   0.665   0.565      0.471   0.809   0.615
    SrpWer-NoFT     0.322   0.648   0.598      0.472   0.806   0.623
    SrpWer-FT       0.369   0.719   0.626      0.469   0.805   0.612
    SrpWer-FT-Mu    0.350   0.677   0.624      0.480   0.809   0.631
    Improvement    +0.033  -0.022  +0.039     +0.004  -0.007  +0.016
  • Table 3   Performance comparison of different usages on three NLP tasks based on IMDB-SG embeddings
                    CNN                        GRU
                    IMDB    News20  Yelp13     IMDB    News20  Yelp13
    NoPT            0.293   0.553   0.541      0.450   0.703   0.614
    PT-NoFT         0.330   0.578   0.551      0.499   0.734   0.628
    PT-FT           0.338   0.745   0.578      0.466   0.819   0.605
    PT-FT-Mu        0.340   0.652   0.574      0.485   0.812   0.618
    SrpWer-NoFT     0.353   0.566   0.598      0.481   0.786   0.613
    SrpWer-FT       0.350   0.686   0.595      0.469   0.819   0.642
    SrpWer-FT-Mu    0.373   0.652   0.634      0.503   0.802   0.641
    Improvement    +0.033  -0.059  +0.056     +0.004  +0.000  +0.014
  • Table 4   Performance comparison under varying ratios of new words on the News20 task, based on IMDB-SG embeddings and a GRU network
    New-word ratio    0.01    0.05    0.10    0.15    0.20    0.30    0.40
    SrpWer-NoFT       0.811   0.807   0.816   0.812   0.803   0.798   0.794
    SrpWer-FT         0.822   0.832   0.828   0.817   0.819   0.812   0.817
    SrpWer-FT-Mu      0.816   0.821   0.826   0.815   0.812   0.802   0.805
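
Algorithm 1 reduces to a ridge regression with the closed-form solution $\boldsymbol{Z}^{\star} = (\boldsymbol{W}_{\rm I}\boldsymbol{W}_{\rm I}^{\rm T} + \lambda \boldsymbol{I})^{-1}\boldsymbol{W}_{\rm I}\boldsymbol{W}_{\rm O}^{\rm T}$. The snippet below is a minimal NumPy sketch of that mapping step, assuming the skip-gram embeddings $\boldsymbol{W}$ on the current corpus have already been trained elsewhere; the function name srpwer and its signature are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def srpwer(W_p, vocab_p, W, vocab, lam=1.0):
    """Sketch of the SrpWer mapping step (Algorithm 1).

    W_p     : (n_p, d) pretrained embedding matrix
    vocab_p : list of the n_p words in the pretraining vocabulary V_P
    W       : (n, d) skip-gram embeddings trained on the current corpus D
    vocab   : list of the n words in the current vocabulary V
    lam     : regularization coefficient lambda (illustrative default)
    Returns (W_hat, words): the reused (n, d) embeddings and their row order,
    with overlapping words first and new words last.
    """
    idx_p = {w: i for i, w in enumerate(vocab_p)}
    idx = {w: j for j, w in enumerate(vocab)}

    # Split the current vocabulary into overlapping words (V_I) and new words (V_O).
    V_I = [w for w in vocab if w in idx_p]
    V_O = [w for w in vocab if w not in idx_p]

    W_I = W[[idx[w] for w in V_I]]        # current-corpus vectors of overlapping words
    W_O = W[[idx[w] for w in V_O]]        # current-corpus vectors of new words
    W_PI = W_p[[idx_p[w] for w in V_I]]   # pretrained vectors of overlapping words

    # Ridge step: Z* = argmin_Z ||Z^T W_I - W_O||_F^2 + lam ||Z||_F^2.
    # Instead of inverting the n_I x n_I matrix (W_I W_I^T + lam I), solve the
    # algebraically equivalent d x d system: Z* = W_I (W_I^T W_I + lam I)^{-1} W_O^T.
    d = W.shape[1]
    K = W_I.T @ W_I + lam * np.eye(d)      # (d, d)
    Z = W_I @ np.linalg.solve(K, W_O.T)    # (n_I, n_O)

    # Transfer the learned relation into the pretrained space: W_PO = Z*^T W_PI.
    W_PO = Z.T @ W_PI                      # (n_O, d)

    # Concatenate: W_hat = [W_PI; W_PO].
    return np.vstack([W_PI, W_PO]), V_I + V_O
```

Solving the $d \times d$ system rather than the $n_{\rm I} \times n_{\rm I}$ one keeps the cost modest even when the overlapping vocabulary is on the order of 20000 words, as in Table 1.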