SCIENTIA SINICA Informationis, Volume 50, Issue 7: 1003-1018 (2020) https://doi.org/10.1360/SSI-2019-0243

Attribute identification for Chinese inter-personal relation knowledge graph based on encyclopedic text

  • Received: Nov 1, 2019
  • Accepted: Mar 23, 2020
  • Published: Jul 13, 2020

Abstract


Funded by

National Natural Science Foundation of China (61525205, 61876115)



  • Figure 1

    Flow chart of inter-personal relation knowledge graph construction

  • Figure 2

    (Color online) Expressions related to “teacher”

  • Figure 3

    (Color online) Part of the inter-personal relation tree

  • Figure 4

    (Color online) Char vector representation

  • Figure 5

    (Color online) Feature extraction of Five-Stroke coding based on CNN

  • Figure 6

    Flow chart of alias recognition module

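  • Table 1   Extraction result of entity “Yao Ming”
    | Entry ID | Entry name | InfoBox: 中文名 (Chinese name) | InfoBox: 外文名 (foreign name) | $\ldots$ | InfoBox: 生肖 (Chinese zodiac) | InfoBox: 星座 (constellation) | Entry label |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | 28 | 姚明 | 姚明 | Yao Ming | $\ldots$ | | 处女座 (Virgo) | 运动员, 篮球, 体育人物 (athlete, basketball, sports figure) |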
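  • Table 2   Common attributes and labels of persons
    | Attribute | Frequency | Entry label | Frequency |
    | --- | --- | --- | --- |
    | 国籍 (nationality) | 735080 | 人物 (person) | 781199 |
    | 职业 (occupation) | 466398 | 政治人物 (political figure) | 167416 |
    | 民族 (ethnicity) | 421832 | 学者 (scholar) | 102575 |
    | 性别 (gender) | 314391 | 体育人物 (sports figure) | 97445 |
    | 毕业院校 (alma mater) | 228019 | 娱乐人物 (entertainment figure) | 71672 |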
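  • Table 3   Diversity of “date of birth”
    | Source | Frequency |
    | --- | --- |
    | 出生日期 (date of birth) | 62038 |
    | 出生时间 (time of birth) | 21462 |
    | 出生年月 (year and month of birth) | 12227 |
    | 生日 (birthday) | 8552 |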
  • Table 4   Examples of attribute value normalization
    | Type of value | Original format | Normalized format |
    | --- | --- | --- |
    | Time type | 1992.9.9/1992-9-9/1992年9月9日 | 1992-09-09 |
    | Number type | 182 cm/182 CM/182 公分 | 182 厘米 |
    | Number type | 56 kg/56 Kg/56 公斤 | 56 千克 |
    | Number type | 1.82 m/1.82 M | 1.82 米 |
    | String type | 1米82/一米八二 | 1.82 米 |
    | String type | 「红楼梦」/【红楼梦】 | 《红楼梦》 |
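The time, number, and bracket normalizations in Table 4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `UNITS` mapping, and the regular expressions are hypothetical and cover only the formats listed in the table. Mixed forms such as 1米82 or 一米八二 are not handled here.

```python
import re

# Hypothetical helpers that illustrate the Table 4 normalizations.
# They are not taken from the paper.

def normalize_date(value):
    """Map 1992.9.9, 1992-9-9 or 1992年9月9日 to 1992-09-09."""
    m = re.fullmatch(r"\s*(\d{4})[.\-年](\d{1,2})[.\-月](\d{1,2})日?\s*", value)
    if m is None:
        return None
    year, month, day = m.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

# Unit spellings mapped to the normalized Chinese unit shown in Table 4.
UNITS = {"cm": "厘米", "公分": "厘米", "kg": "千克", "公斤": "千克", "m": "米"}

def normalize_quantity(value):
    """Map '182 CM' to '182 厘米', '56 Kg' to '56 千克', '1.82 M' to '1.82 米'."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+|公分|公斤)\s*", value)
    if m is None:
        return None
    number, unit = m.groups()
    normalized_unit = UNITS.get(unit.lower())
    return f"{number} {normalized_unit}" if normalized_unit else None

def normalize_title(value):
    """Rewrite a title wrapped in 「…」 or 【…】 so that it uses 《…》."""
    m = re.fullmatch(r"\s*[「【](.+?)[」】]\s*", value)
    return f"《{m.group(1)}》" if m else None
```

Each helper returns `None` for input it does not recognize, so a caller can fall back to the original string instead of guessing.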
  • Table 5   Example of entry introduction text
    | Entry ID | Entry name | Entry introduction text |
    | --- | --- | --- |
    | 28 | 姚明 | 姚明(Yao Ming), 1980年9月12日出生于上海市徐汇区, 祖籍江苏省苏州市吴江区震泽镇, 前中国职业篮球运动员$\ldots$ |
  • Table 6   Some pattern examples for identifying the value of “height”
    | Pattern | Example |
    | --- | --- |
    | [1-2]{1}[0-9]{1,2}(cm$\mid$CM$\mid$Cm$\mid$cM$\mid$厘米$\mid$公分) | 183 cm/183 CM/183 公分 |
    | [1-2]{1}($\backslash$.[0-9]*)?(米$\mid$公尺$\mid$m)([0-9]*)? | 1米83/1.83 m/1米/1 m |
    | (([一二]$\mid$[1-2]){1})(米)([一二三四五六七八九]{1,2}$\mid$[0-9]*) | 一米八/1米八三/一米83/1米83 |
    | ([一二三四五六七八九])(尺$\mid$英尺)([一二三四五六七八九]{1}(寸$\mid$英寸)?)? | 一英尺三英寸/三尺八 |
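As a rough check, the four height patterns of Table 6 can be written as Python regular expressions and run against the listed examples. The sketch below reads the digits that follow each character class as `{1}` and `{1,2}` quantifiers, which is an assumption about the printed patterns. The names `HEIGHT_PATTERNS` and `is_height` are hypothetical, and removing whitespace before matching is also an assumption, made because the examples contain spaces that the patterns do not.

```python
import re

# Table 6 patterns as Python regexes. The {1} and {1,2} quantifiers are an
# assumed reading of the printed patterns, not confirmed by the source.
HEIGHT_PATTERNS = [
    r"[1-2]{1}[0-9]{1,2}(cm|CM|Cm|cM|厘米|公分)",
    r"[1-2]{1}(\.[0-9]*)?(米|公尺|m)([0-9]*)?",
    r"(([一二]|[1-2]){1})(米)([一二三四五六七八九]{1,2}|[0-9]*)",
    r"([一二三四五六七八九])(尺|英尺)([一二三四五六七八九]{1}(寸|英寸)?)?",
]

def is_height(value):
    """Return True if the value, with whitespace removed, fully matches a pattern."""
    compact = re.sub(r"\s+", "", value)
    return any(re.fullmatch(pattern, compact) for pattern in HEIGHT_PATTERNS)

# Examples copied from the right-hand column of Table 6.
EXAMPLES = ["183 cm", "183 CM", "183 公分", "1米83", "1.83 m", "1米", "1 m",
            "一米八", "1米八三", "一米83", "一英尺三英寸", "三尺八"]

for example in EXAMPLES:
    print(example, is_height(example))
```

Using `re.fullmatch` rather than `re.search` keeps a value such as `83 cm` from being accepted through a partial match.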
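  • Table 7   Common sources of “alias”
    | Source | Frequency | Source | Frequency |
    | --- | --- | --- | --- |
    | 中文名 (Chinese name) | 1015465 | 字号 (courtesy or art name) | 22960 |
    | 外文名 (foreign name) | 163707 | 其他名称 (other names) | 9451 |
    | 别名 (alias) | 74988 | 英文名 (English name) | 9010 |
    | 本名 (original name) | 43595 | 别称 (alternative name) | 7585 |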
  • Table 8   Some examples of alias value
    | Entry name | Alias string | Result after cleaning |
    | --- | --- | --- |
    | 亚利克斯・维加 | 阿丽夏・维加 、 Alex 、 Lex | 阿丽夏・维加, Alex, Lex |
    | 默森 | 全名: Paul Mohsen | Paul Mohsen |
    | 水莲寺璐珈 | 水莲寺璐珈 (水莲寺流歌) | 水莲寺流歌 |
    | 观月小鸟 | Mizuki Kotori (罗马音) Tori Meadows (英文名) | Mizuki Kotori, Tori Meadows |
    | 窦士镛 | 字晓湘号警凡 | 晓湘, 警凡 |
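The cleaning step illustrated in Table 8 might look like the following Python sketch. It is not the authors' procedure: the function `clean_alias` and the label lists `PREFIX_LABELS` and `ANNOTATIONS` are hypothetical and were chosen only to reproduce the five rows of the table. A real system would presumably need far larger label inventories.

```python
import re

# Hypothetical label sets, chosen only to cover the Table 8 examples.
PREFIX_LABELS = ("全名",)            # "full name"
ANNOTATIONS = ("罗马音", "英文名")    # "romanization", "English name"

def clean_alias(entry_name, raw):
    """Split a raw alias string into a list of cleaned aliases."""
    text = raw.strip()
    # Drop a leading label such as "全名:".
    text = re.sub(r"^(?:%s)\s*[::]\s*" % "|".join(PREFIX_LABELS), "", text)
    # Courtesy-name form "字X号Y": 字 and 号 each introduce one alias.
    m = re.fullmatch(r"字(.+?)号(.+)", text)
    if m:
        parts = list(m.groups())
    else:
        # A known annotation in parentheses is replaced by a separator.
        text = re.sub(r"[((]\s*(?:%s)\s*[))]" % "|".join(ANNOTATIONS), "、", text)
        # Any other parenthesis becomes a separator, so its content is kept.
        text = re.sub(r"[(())]", "、", text)
        parts = re.split(r"[、,,;;]", text)
    aliases = []
    for part in parts:
        part = part.strip()
        # Skip empty pieces, the entry name itself, and duplicates.
        if part and part != entry_name and part not in aliases:
            aliases.append(part)
    return aliases

print(clean_alias("窦士镛", "字晓湘号警凡"))
```

The entry name is passed in so that a string such as 水莲寺璐珈 (水莲寺流歌) yields only the name that differs from the entry itself.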
  • Table 9   Pattern extraction of general aliases
    | Entry name | Alias | Entry intro |
    | --- | --- | --- |
    | 李白 | 诗仙 | S S S 李白 , 唐 代 伟 大 浪 漫 主 义 诗 人 , 人 们 称 之 为 诗仙 . E E E |
  • Table 10   Pattern extraction with entities/aliases nearby
    | Entry name | Alias | Entry intro |
    | --- | --- | --- |
    | 泰颉 | 朱德其, 三友轩主 | S S S 泰颉 本 名 朱德其 , 号 三友轩主 , 男 , 1 9 4 0 年 1 月 生 . E E E |
  • Table 11   Some examples of alias pattern
    | Pattern | Frequency |
    | --- | --- |
    | , 原 名 ## , 1 9 | 1504 |
    | 清 字 ## , 江 苏 | 335 |
    | \entity , 字 ## , 生 卒 | 225 |
    | 名 \alias , ## 等 . E | 62 |
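Tables 9 to 11 suggest that an alias pattern is the character context around a known alias, with the alias replaced by `##`, the entry name by `\entity`, and any other alias by `\alias`. The Python sketch below illustrates that reading. The window of three tokens on each side is inferred from the `S S S` / `E E E` padding and the pattern lengths in Table 11, and the function names `tokenize` and `alias_patterns` are hypothetical. It is not the authors' code.

```python
def tokenize(intro, entity, aliases):
    """Split intro into characters, keeping the entity and each alias as one token."""
    names = sorted([entity] + list(aliases), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(intro):
        for name in names:
            if name and intro.startswith(name, i):
                tokens.append(name)
                i += len(name)
                break
        else:
            # Ordinary character: keep it unless it is whitespace.
            if not intro[i].isspace():
                tokens.append(intro[i])
            i += 1
    return tokens

def alias_patterns(intro, entity, aliases, window=3):
    """Return (alias, pattern) pairs; window=3 is an assumed context size."""
    tokens = ["S"] * window + tokenize(intro, entity, aliases) + ["E"] * window

    def mask(token):
        if token == entity:
            return "\\entity"
        if token in aliases:
            return "\\alias"
        return token

    results = []
    for pos, token in enumerate(tokens):
        if token not in aliases:
            continue
        left = [mask(t) for t in tokens[pos - window:pos]]
        right = [mask(t) for t in tokens[pos + 1:pos + 1 + window]]
        results.append((token, " ".join(left + ["##"] + right)))
    return results

for alias, pattern in alias_patterns("泰颉本名朱德其,号三友轩主,男,1940年1月生.",
                                     "泰颉", ["朱德其", "三友轩主"]):
    print(alias, "->", pattern)
```

Counting how often each pattern string occurs across all entries would then give frequencies of the kind listed in Table 11.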
  • Table 12   Experimental results of different models on distantly supervised data
    | Model | Precision (%) | Recall (%) | $F_1$ score (%) |
    | --- | --- | --- | --- |
    | CRF | 75.07 | 80.07 | 77.49 |
    | LSTM-CRF | 80.03 | 80.79 | 80.41 |
    | IDCNN-CRF | 80.13 | 81.40 | 80.76 |
    | FS-BiLSTM-CRF | 83.95 | 81.04 | 82.47 |
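The $F_1$ column of Table 12 is consistent with the usual harmonic mean of precision and recall, $F_1 = 2PR/(P+R)$. The short Python sketch below recomputes it from the reported precision and recall. The names `f1_score` and `TABLE_12` are hypothetical, and small differences in the last digit are possible because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from Table 12.
TABLE_12 = {
    "CRF": (75.07, 80.07, 77.49),
    "LSTM-CRF": (80.03, 80.79, 80.41),
    "IDCNN-CRF": (80.13, 81.40, 80.76),
    "FS-BiLSTM-CRF": (83.95, 81.04, 82.47),
}

for model, (precision, recall, reported) in TABLE_12.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed {computed:.2f}, reported {reported:.2f}")
```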
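  • Table 13   Experimental results of rule-based attribute recognition
    | Attribute | Range | Sampling accuracy (500) (%) |
    | --- | --- | --- |
    | 身高 (height) | Numbers, letters and Chinese | 99.20 |
    | 体重 (weight) | Numbers, letters and Chinese | 96.20 |
    | 三围 (body measurements) | Numbers, letters | 95.00 |
    | 星座 (constellation) | Chinese | 96.00 |
    | 血型 (blood type) | Letters, Chinese | 97.40 |
    | 政治面貌 (political affiliation) | Chinese | 98.60 |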
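  • Table 14   Experimental results of model-based attribute recognition
    | Attribute | Range | Category | Total number | Sampling accuracy (500) (%) | |
    | --- | --- | --- | --- | --- | --- |
    | 出生日期 (date of birth) | Numbers, Chinese | A | 627454 | 99.00 | |
    | | | B | 114467 | 99.00 | |
    | | | C | 72348 | 100.00 | |
    | | | D | 32856 | 96.20 | 93.40 |
    | 运动项目 (sport) | Numbers, Chinese | A | 64773 | 96.80 | |
    | | | B | 56743 | 95.80 | |
    | | | C | 41918 | 99.80 | |
    | | | D | 716 | 96.40 | 95.60 |
    | 出生地 (birthplace) | Chinese | A | 62474 | 97.20 | |
    | | | B | 113837 | 93.80 | |
    | | | C | 31327 | 99.00 | |
    | | | D | 20889 | 97.00 | 87.20 |
    | 学位 (degree) | Chinese | A | 51374 | 93.40 | |
    | | | B | 218215 | 91.00 | |
    | | | C | 30153 | 97.40 | |
    | | | D | 21221 | 87.40 | 84.60 |
    | 毕业院校 (alma mater) | Chinese, English | A | 234951 | 94.20 | |
    | | | B | 87901 | 96.80 | |
    | | | C | 10620 | 99.60 | |
    | | | D | 7475 | 92.80 | 89.20 |
    | 民族 (ethnicity) | Chinese | A | 396757 | 94.60 | |
    | | | B | 25389 | 93.00 | |
    | | | C | 14567 | 99.80 | |
    | | | D | 2362 | 93.20 | 88.60 |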
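  • Table 15   Coverage comparison before and after attribute completion
    | Attribute | Original coverage (%) | New coverage (%) / Δ (%) | Attribute | Original coverage (%) | New coverage (%) / Δ (%) |
    | --- | --- | --- | --- | --- | --- |
    | 学位 (degree) | 4.61 | 24.21 / +19.60 | 民族 (ethnicity) | 35.63 | 37.02 / +1.39 |
    | 出生地 (birthplace) | 5.61 | 15.83 / +10.22 | 身高 (height) | 7.93 | 8.32 / +0.39 |
    | 出生日期 (date of birth) | 56.36 | 66.05 / +9.69 | 体重 (weight) | 5.80 | 6.06 / +0.26 |
    | 毕业院校 (alma mater) | 21.10 | 28.78 / +7.68 | 三围 (body measurements) | 0.27 | 0.51 / +0.24 |
    | 运动项目 (sport) | 5.81 | 10.21 / +4.40 | 星座 (constellation) | 2.58 | 2.65 / +0.07 |
    | 政治面貌 (political affiliation) | 2.80 | 4.45 / +1.65 | 血型 (blood type) | 0.02 | 0.03 / +0.01 |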
  • Table 16   Experimental results of alias extraction
    | Model | Precision (%) | Recall (%) | $F_1$ score (%) |
    | --- | --- | --- | --- |
    | MaxEnt | 98.07 | 38.93 | 55.73 |
    | CNN | 94.59 | 44.52 | 60.55 |
    | LSTM | 95.85 | 47.07 | 63.14 |