北京市科技计划(Z171100005117002)
国家自然科学基金(61972403,61872315,61661146001)
[1] Doan A H, Halevy A, Ives Z. Principles of data integration. Waltham: Elsevier, 2012. 19--58. Google Scholar
[2] Song X Y, Wang Y H. Data Integration and Application Integration. Beijing: China Water Power Press, 2008. Google Scholar
[3] Pandey Y, Bansal S. Safety check: a semantic web application for emergency management. In: Proceedings of the International Workshop on Semantic Big Data, New York, 2017. 1--6. Google Scholar
[4] Qin Y B, Feng L, Chen Y P, et al. “Intelligent Court” data fusion analysis and integrated application. Big Data, 2019, 3: 35--46. Google Scholar
[5] Du X Y, Chen Y G, Lu W, et al. Data wrangling: a key technique of data governance. Big Data Res, 2019, 5: 13--22. Google Scholar
[6] Alwan A A, Nordin A, Alzeber M, et al. A Survey of schema matching research using database schemas and instances. Int J Adv Computer Sci Applicat, 2017, 8: 102--111. Google Scholar
[7] Bernstein P A, Madhavan J, Rahm E. Generic schema matching, ten years later. In: Proceedings of the 37th International Conference on Very Large Data Bases, Seattle, 2011. 695--701. Google Scholar
[8] Munir S, Khan F, Riaz M A. An instance-based schema matching between opaque database schemas. In: Proceedings of the 4th International Conference on Engineering Technology and Technopreneuship, Kuala Lumpur, 2014. 177--182. Google Scholar
[9] Do H H, Erhard R, COMA: a system for flexible combination of schema matching approaches. In: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, 2002. 610--621. Google Scholar
[10] Zhao H, Ram S. Combining schema and instance information for integrating heterogeneous data sources. Data Knowledge Eng, 2007, 61: 281-303 CrossRef Google Scholar
[11] Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: Proceedings of the 18th International Conference on Data Engineering, Washington, 2002. 117--128. Google Scholar
[12] Alzeber M, Alwan A A, Nordin A, et al. An empirical comparative study of instance-based schema matching. Int J Electr Electron Syst, 2018, 10: 1266-1277. Google Scholar
[13] Li Y, Liu D B, Zhang W M. A method for automatic schema matching using characteristic of data distribution. Computer science, 2005, 32: 85--87. Google Scholar
[14] Mehdi O A, Ibrahim H, Affendey L S. An approach for instance based schema matching with google similarity and regular expression. Int Arab J Inf Technol, 2017, 14(5): 755-763. Google Scholar
[15] Cilibrasi R L, Vitanyi P M B. The Google Similarity Distance. IEEE Trans Knowl Data Eng, 2007, 19: 370-383 CrossRef Google Scholar
[16] Nozaki K, Hochin T, Nomiya H. Semantic schema matching for string attribute with word vectors and its evaluation. Int J Netw Distrib Comput, 2019, 7(3): 100-106. Google Scholar
[17] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, Scottsdale, 2013. 1--12. Google Scholar
[18] Kuhn H W. The Hungarian method for the assignment problem. Naval Res Logistics, 1955, 2: 83-97 CrossRef Google Scholar
[19] Zhao C, He Y Y. Auto-EM: end-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In: Proceedings of the World Wide Web Conference, San Francisco, 2019. 2413--2424. Google Scholar
[20] Dong X L, Rekatsinas T. Data integration and machine learning: a natural synergy. In: Proceedings of the 2018 International Conference on Management of Data, New York, 2018. 1645--1650. Google Scholar
[21] Dunn H L. Record Linkage. Am J Public Health Nations Health, 1946, 36: 1412-1416 CrossRef Google Scholar
[22] Christen P. Development and user experiences of an open source data cleaning, deduplication and record linkage system. SIGKDD Explor Newsl, 2009, 11: 39-48 CrossRef Google Scholar
[23] Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, 2002. 269--278. Google Scholar
[24] Dong X, Gabrilovich E, Heitz G, et al. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2014. 601--610. Google Scholar
[25] Nickel M, Murphy K, Tresp V. A Review of Relational Machine Learning for Knowledge Graphs. Proc IEEE, 2016, 104: 11-33 CrossRef Google Scholar
[26] Xiao C, Wang W, Lin X. Ed-Join. Proc VLDB Endow, 2008, 1: 933-944 CrossRef Google Scholar
[27] Jiang Y, Li G, Feng J. String similarity joins. Proc VLDB Endow, 2014, 7: 625-636 CrossRef Google Scholar
[28] Singh R, Meduri V, Elmagarmid A, et al. Generating concise entity matching rules. In: Proceedings of the 2017 ACM International Conference on Management of Data, New York, 2017. 1635--1638. Google Scholar
[29] Fan W, Jia X, Li J. Reasoning about record matching rules. Proc VLDB Endow, 2009, 2: 407-418 CrossRef Google Scholar
[30] Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, 2010. 783--794. Google Scholar
[31] Singla P, Domingos P. Entity resolution with markov logic. In: Proceedings of the 6th International Conference on Data Mining, Hong Kong, 2006. 572--582. Google Scholar
[32] Zhuang Y, Li G, Zhong Z, et al. Hike: a hybrid human-machine method for entity alignment in large-scale knowledge bases. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 2017. 1917--1926. Google Scholar
[33] Mudgal S, Li H, Rekatsinas T, et al. Deep learning for entity matching: A design space exploration. In: Proceedings of the 2018 International Conference on Management of Data, Houston, 2018. 19--34. Google Scholar
[34] Konda P, Naughton J, Prasad S. Magellan. Proc VLDB Endow, 2016, 9: 1197-1208 CrossRef Google Scholar
[35] Pang J A. Research on web text feature extraction method and its development. Inf Stud: Theor Ap, 2006, 3: 338--340+367. Google Scholar
Figure 1
(Color online) Multi-source relational data fusion framework
Figure 2
Time comparison of schema matching
Figure 3
(Color online) Evaluation of entity alignment results. (a) Precision; (b) recall; (c) F1-score
Figure 4
(Color online) Multi-source relational data fusion platform
1: function multimatch(tbs) |
2: Feature_list $\Leftarrow$ feature_extraction(tbs);//Extract feature |
3: |
4: DisMatrix $\Leftarrow$ Combination(Feature_lis[idx1], Feature_lis[idx2]);//Generate distance matrix |
5: bestmatches $\Leftarrow$ Hungarian(DisMatrix);//Find the best matches with Hungarian |
6: bestmatches list.append(bestmatches); |
7: |
8: |
9: |
Features of string type | Features of int/float type |
ItemLetterAverage | ItemAverage |
LetterVariationCoefficient | ItemVariationCoefficient |
LetterProbabliity | ItemProbability |
NormalizedLetterEntropy | – |
Evaluating indicator | 2 tables | 3 tables | 4 tables | 5 tables | 6 tables | 7 tables |
Recall (%) | 77.2 | 77.2 | 77.2 | 77.1 | 77.1 | 77.0 |
Precision (%) | 27.2 | 27.2 | 27.1 | 27.1 | 27.1 | 27.1 |
Evaluating indicator | 8 tables | 9 tables | 10 tables | 11 tables | 12 tables | Average |
Recall (%) | 76.9 | 76.8 | 76.7 | 76.5 | 76.4 | |
Precision (%) | 27.1 | 27.1 | 27.0 | 27.0 | 27.0 |
1: |
2: data1,data2 $\Leftarrow$ ( loaddata(tb1, tb2); |
3: C $\Leftarrow$ ( overlap coefficient_join(data1, data2, schema correlations);//blocking |
4: S $\Leftarrow$ ( diversity sample(C);//sampling |
5: Label $\Leftarrow$ ( load labeled sample(labeled path); // Label candidate pairs in S with 0/1 manually, which means non-matched/matched respectively and load it; |
6: F $\Leftarrow$ ( feature extract(Labeled); |
7: Train, Test $\Leftarrow$ ( split(F); |
8: for match in matchers; //decision tree, random forest, support vector machine $\cdots$ |
9: model, result $\Leftarrow$ ( cross validation(match, Train);//training |
10: bestmodel, bestresult $\Leftarrow$ ( compare(model, result); |
11: |
12: predictions $\Leftarrow$ ( best model.predict(Test);//evaluation |
13: eval result $\Leftarrow$ ( eval matches(predictions); |
14: candset feature $\Leftarrow$ ( feature extract(C); |
15: alignment results $\Leftarrow$ ( best model.predict(candset feature);//prediction |
16: fusion results $\Leftarrow$ ( formalize(alignment results);//data collation and specification |
17: |
18: |
Fusion type | Source table1 | Source table2 | Match results | Profit | |
Single-source | l_水务局行政许可 | r_水务局行政处罚 | l_行政相对人名称 | r_行政相对人名称 | 0.7019 |
l_统一社会信用代码 | r_统一社会信用代码 | 0.5454 | |||
l_保健食品生产单位证件信息 | r_保健食品企业标准备案信息 | l_名称 | r_企业名称 | 0.8501 | |
Multi-source | l_水务局行政许可 | r_企业信息 | l_统一社会信用代码 | r_creditcode | 0.8067 |
l_行政相对人名称 | r_name | 0.7079 | |||
l_食品生产企业 | r_标准化达标信息 | l_企业名称 | r_企业名称 | 0.7169 |
水务局行政许可&水务局行政处罚 | |||||
Single-source | l_统一社会信用代码 | l_行政相对人名称 | r_统一社会信用代码 | r_行政相对人名称 | Incomplete match reasons |
91110105M****MYD5M | 北京伯佳诊所有限公司 | 91110105M****MYD5M | 北京嘉树广渠诊所有限公司 | Changed company name | |
北京市建筑工程研究院有限责任公司 | 北京市建筑工程研究院有限责任公司 | Inconsistent entries of credit code | |||
北京招商局物业管理有限公司招商局大厦 | 北京招商局物业管理有限公司 | Incorrect entries of credit code and inconsistent entries of company name | |||
保健食品生产单位证件信息&保健食品企业标准备案信息 | |||||
l_名称 | r_企业名称 | Incomplete match reasons | |||
北京中研万通科技有限责任公司保健食品厂 | 北京中研万通科技中心 | Inconsistent entries of company name | |||
华润紫竹药业有限公司 | 北京紫竹药业有限公司 | ||||
Multi-source | 水务局行政许可&企业信息 | ||||
l_统一社会信用代码 | l_行政相对人名称 | r_creditCode | r_name | Incomplete match reasons | |
911101067****5204Q | 北京八方达客运有限责任公司保修分公司 | 911101067****5204Q | 北京公共交通控股(集团)有限公司第二保修分公司 | Changed company name | |
9111 | 东方艺术大厦有限公司 | 9111 | 东方艺术大厦有限公司 | Inconsistent entries of credit code | |
911101056**** | 北京湄洲三苏大酒楼有限公司 | 911101056**** | 北京湄州三苏大酒楼有限公司 | Incorrect entries of credit code and inconsistent entries of company name | |
食品生产企业&标准化达标信息 | |||||
l_企业名称 | r_企业名称 | Incomplete match reasons | |||
北京航天农业生物科技有限公司 | 北京首诚航天农业生物科技有限公司 | Inconsistent entries of company name | |||
宝健中国有限公司 | 宝健中国日用品有限公司 |