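National Key Research and Development Program of China (2016YFB1000901)
Innovative Research Team Development Program of the Ministry of Education of China (IRT13059)
National Natural Science Foundation of China (61273297, 61673152)
China Scholarship Council (201506695019)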
[1] CNNIC. Statistical report on Internet development in China. Technical Report, China Internet Network Information Center, 2016. http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/hlwtjbg/201608/P020160803367337470363.pdf
[2] Wu X, Zhu X, Wu G Q. Data mining with big data. IEEE Trans Knowl Data Eng, 2014, 26: 97--107.
[3] Wu X, Chen H, Wu G, et al. Knowledge engineering with big data. IEEE Intell Syst, 2015, 30: 46--55.
[4] Li X L, Gong H G. A survey on big data systems. Sci Sin Inform, 2015, 45: 1--44.
[5] Glynn C J, Herbst S, Lindeman M, et al. Public Opinion. Colorado: Westview Press, 2015.
[6] Zhu C, Zhu H, Ge Y. Tracking the evolution of social emotions with topic models. Knowl Inf Syst, 2016, 47: 517--544.
[7] Etzioni O, Fader A, Christensen J, et al. Open information extraction: the second generation. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, 2011. 1: 3--10.
[8] Zhao J, Liu K, Zhou G Y, et al. Open information extraction. J Chinese Inform Proces, 2011, 25: 98--111.
[9] Parapar J, Barreiro A. An effective and efficient web news extraction technique for an operational NewsIR system. In: Proceedings of the Conferencia de la Asociación Española para la Inteligencia Artificial CAEPIA-TTIA, Salamanca, 2007. 2: 319--328.
[10] Weninger T, Palacios R, Crescenzi V. Web content extraction. SIGKDD Explor Newsl, 2016, 17: 17--23.
[11] Chang C H, Kayed M, Girgis M R. A survey of web information extraction systems. IEEE Trans Knowl Data Eng, 2006, 18: 1411--1428.
[12] Ferrara E, De Meo P, Fiumara G. Web data extraction, applications and techniques: a survey. Knowledge-Based Syst, 2014, 70: 301--323.
[13] Jiménez P, Corchuelo R. Roller: a novel approach to Web information extraction. Knowl Inf Syst, 2016, 49: 197--241.
[14] Wu G Q, Hu J, Li L, et al. Online web news extraction via tag path feature fusion. J Softw, 2016, 27: 714--735.
[15] Sun F, Song D, Liao L. DOM based content extraction via text density. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, 2011. 245--254.
[16] Wu X D, He J, Lu Y Q, et al. From big data to big knowledge: HACE+BigKE. Acta Autom Sin, 2016, 42: 965--982.
[17] Xue Y, Hu Y, Xin G. Web page title extraction and its application. Inf Processing Manage, 2007, 43: 1332--1347.
[18] Zhao X, Jin P, Yue L. Discovering topic time from web news. Inf Processing Manage, 2015, 51: 869--890.
[19] Garcia-Molina H, Hammer J, McHugh J. Semistructured data: the TSIMMIS experience. In: Proceedings of the 1st East-European Conference on Advances in Databases and Information Systems, St Petersburg, 1997. 1--8.
[20] Sahuguet A, Azavant F. Building intelligent Web applications using lightweight wrappers. Data Knowledge Eng, 2001, 36: 283--316.
[21] Liu L, Pu C, Han W. XWRAP: an XML-enabled wrapper construction system for web information sources. In: Proceedings of the 16th International Conference on Data Engineering, San Diego, 2000. 611--621.
[22] Wu G, Wu X. Extracting web news using tag path patterns. In: Proceedings of IEEE International Conference on Web Intelligence and Intelligent Agent Technology, Macau, 2012. 1: 588--595.
[23] Wu S, Liu J, Fan J. Automatic web content extraction by combination of learning and grouping. In: Proceedings of the 24th International Conference on World Wide Web, Florence, 2015. 1264--1274.
[24] Dalvi N, Bohannon P, Sha F. Robust web extraction: an approach based on a probabilistic tree-edit model. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, 2009. 335--348.
[25] Parameswaran A, Dalvi N, Garcia-Molina H, et al. Optimal schemes for robust web extraction. In: Proceedings of the VLDB Endowment, Seattle, 2011. 4: 980--991.
[26] Hogue A, Karger D. Thresher: automating the unwrapping of semantic content from the World Wide Web. In: Proceedings of the 14th International Conference on World Wide Web, Chiba, 2005. 86--95.
[27] Alarte J, Insa D, Silva J, et al. TeMex: the web template extractor. In: Proceedings of the 24th International Conference on World Wide Web, Florence, 2015. 155--158.
[28] Wu G, Li L, Hu X, et al. Web news extraction via path ratios. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, San Francisco, 2013. 2059--2068.
[29] Peters M E, Lecocq D. Content extraction using diverse feature sets. In: Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, 2013. 89--90.
[30] Otsu N. A threshold selection method from gray-level histograms. Automatica, 1975, 11: 23--27.
[31] Baroni M, Chantree F, Kilgarriff A, et al. Cleaneval: a competition for cleaning web pages. In: Proceedings of the International Conference on Language Resources and Evaluation, Marrakech, 2008. 638--643.
[32] Pan Q, Yu X, Cheng Y, et al. Essential methods and progress of information fusion theory. Acta Autom Sin, 2003, 29: 599--615.
[33] Pan Q, Wang Z F, Liang Y, et al. Basic methods and progress of information fusion (II). Control Theory Appl, 2012, 29: 1233--1244.
Figure 1. Average extraction time of each algorithm
DataSet | CETR (%) | CEPR-AT (%) | CETD-QT (%) | CETD-Jsoup (%) | CEPF (%) | CEDP-TD (%) | CEDP-CTD (%) | CEDP-DSum (%) | CEDP-NLTD (%) |
---|---|---|---|---|---|---|---|---|---|
CleanEval-En | 88.30 | 75.33 | 83.89 | | 88.39 | 88.38 | 66.10 | 66.78 | 88.24 |
CleanEval-Zh | 83.36 | 75.65 | N/A | | 86.86 | 87.04 | 74.13 | 74.07 | 86.94 |
NY Post | 58.19 | 81.02 | 82.78 | 83.97 | 90.04 | 89.00 | 89.29 | | 89.36 |
Freep | 70.36 | 86.00 | 74.41 | 74.65 | 87.79 | 88.97 | 90.18 | | 90.16 |
Suntimes | 82.20 | 85.90 | 90.37 | 90.07 | 94.08 | 94.41 | 95.03 | | 94.58 |
Techweb | 74.56 | 88.86 | 77.86 | 77.35 | 90.70 | 91.00 | 89.53 | 90.70 | |
Tribune | 89.83 | 90.32 | N/A | | 95.21 | 94.86 | 95.08 | 94.90 | 94.93 |
Nytimes | 91.14 | 86.91 | | 96.25 | 92.31 | 92.07 | 91.09 | 90.50 | 92.16 |
BBC | 72.76 | 80.13 | 89.45 | 84.85 | 89.53 | 90.26 | 90.56 | | 90.65 |
Reuters | 71.73 | 84.26 | N/A | 77.67 | | 94.24 | 78.09 | 79.22 | 94.35 |
Yahoo | 82.06 | 84.96 | 89.85 | 85.88 | 89.33 | 90.91 | 90.42 | | 91.10 |
Sina | 73.99 | 90.63 | N/A | 89.44 | 96.92 | 97.36 | 97.33 | | 97.34 |
People | 86.23 | 85.32 | 82.40 | 82.14 | 89.27 | 95.08 | 95.10 | 94.84 | |
163 | 38.28 | | N/A | 53.15 | 79.84 | 80.44 | 80.63 | 80.16 | 80.57 |
Xinhua | 83.32 | 81.24 | 91.18 | 90.57 | | 94.84 | 94.82 | 94.92 | 94.82 |
TunxunWb | 79.36 | 17.75 | 86.28 | 86.90 | 83.72 | | 86.34 | 86.71 | 87.17 |
SinaWb | 57.99 | 18.88 | N/A | 77.01 | 79.38 | 85.56 | 82.57 | 79.90 | |
SohuWb | 87.16 | 92.32 | N/A | 87.57 | | 92.22 | 92.28 | 92.38 | 92.27 |
Average | 76.16 | 77.45 | N/A | 84.01 | 89.80 | 90.81 | 87.70 | 87.84 | |
|
---|
|
|
Parse ${\rm wp}$ to obtain the parse tree $T_{{\rm wp}}$; ${\rm content} \gets ``'';$ |
${\rm nts}\gets \langle (v_1, c(v_1)),(v_2, c(v_2)),\ldots,(v_n, c(v_n)) \rangle$ // compute the canonical text node sequence ${\rm nts}$ of $T_{{\rm wp}}$ as (node, content) pairs; |
|
$f(v_i) \gets {\rm combine}(v_i,{\rm densityF}(\cdot),{\rm pathF}(\cdot))$; |
${\rm nfts} \gets \langle(f(v_1), c(v_1)),(f(v_2), c(v_2)),\ldots,(f(v_n), c(v_n)) \rangle$; |
${\rm s\_nfts} \gets {\rm smoothing}({\rm nfts})$, and denote ${\rm s\_nfts}$ by $\langle (sf(v_1), c(v_1)),(sf(v_2), c(v_2)),\ldots,(sf(v_n), c(v_n)) \rangle$; |
|
|
${\rm content} \gets {\rm content} + c(v_i);$ |
|
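The steps listed above (parse the page, build the text node sequence, fuse a density feature with a path feature, smooth, then concatenate the selected node contents) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: `density_feature`, `path_feature`, the product used for `combine`, the moving-average `smoothing`, and in particular the mean-based threshold used to select nodes are all hypothetical stand-ins, since the listing does not define them.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects (tag_path, text) pairs for every non-empty text node."""

    SKIP = {"script", "style"}
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.path = []   # currently open tags, root first
        self.nodes = []  # the sequence nts as (tag_path, content) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not (self.SKIP & set(self.path)):
            self.nodes.append(("/".join(self.path), text))


def density_feature(path, text):
    # Hypothetical stand-in for densityF: characters per tag on the path.
    return len(text) / (path.count("/") + 1)


def path_feature(path, chars_by_path):
    # Hypothetical stand-in for pathF: share of page text under this tag path.
    return chars_by_path[path] / sum(chars_by_path.values())


def extract_content(html, radius=1):
    parser = TextNodeCollector()
    parser.feed(html)
    nts = parser.nodes
    if not nts:
        return ""

    chars_by_path = {}
    for path, text in nts:
        chars_by_path[path] = chars_by_path.get(path, 0) + len(text)

    # combine(): assumed here to be a simple product of the two features.
    f = [density_feature(p, t) * path_feature(p, chars_by_path) for p, t in nts]

    # smoothing(): assumed here to be a moving average over neighbouring nodes.
    sf = []
    for i in range(len(f)):
        window = f[max(0, i - radius): i + radius + 1]
        sf.append(sum(window) / len(window))

    # Selection rule: assumed mean threshold (not taken from the listing).
    threshold = sum(sf) / len(sf)
    return " ".join(t for (_, t), s in zip(nts, sf) if s >= threshold)
```

Because smoothing averages each node with its neighbours, short nodes directly adjacent to the main text can be pulled above the threshold; a smaller `radius` or a stricter threshold reduces that effect in this sketch.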
DataSet | CETR (%) | CEPR-AT (%) | CETD-QT (%) | CETD-Jsoup (%) | CEPF (%) | CEDP-TD (%) | CEDP-CTD (%) | CEDP-DSum (%) | CEDP-NLTD (%) |
---|---|---|---|---|---|---|---|---|---|
CleanEval-En | 89.42 | | 75.16 | 89.86 | 92.93 | 93.39 | 70.80 | 70.07 | 93.70 |
CleanEval-Zh | 79.60 | 85.23 | N/A | 85.18 | 88.37 | 88.82 | 80.32 | 79.89 | |
NY Post | 42.68 | | 72.91 | 74.76 | 92.10 | 90.80 | 91.38 | 93.24 | 91.42 |
Freep | 57.38 | | 59.52 | 59.92 | 80.66 | 83.07 | 85.67 | 85.65 | 85.26 |
Suntimes | 70.82 | | 83.01 | 82.64 | 94.73 | 95.28 | 96.04 | 96.24 | 95.44 |
Techweb | 60.29 | | 63.88 | 63.36 | 86.96 | 87.22 | 87.38 | 87.22 | 87.39 |
Tribune | 84.99 | | N/A | 92.49 | 96.32 | 96.26 | 96.64 | 96.34 | 96.31 |
Nytimes | 88.98 | | 96.53 | 95.86 | 96.13 | 96.85 | 96.93 | 96.37 | 96.91 |
BBC | 60.11 | | 81.93 | 74.57 | 91.75 | 93.60 | 94.80 | 95.07 | 94.60 |
Reuters | 58.48 | | N/A | 63.65 | 95.41 | 95.05 | 79.30 | 81.04 | 95.27 |
Yahoo | 72.67 | | 83.63 | 76.62 | 93.04 | 93.29 | 93.40 | 94.70 | 93.41 |
Sina | 59.09 | | N/A | 81.93 | 96.29 | 97.64 | 97.72 | 97.74 | 97.65 |
People | 77.04 | 95.11 | 70.21 | 69.74 | 84.47 | 96.14 | | 95.02 | 96.65 |
163 | 24.40 | | N/A | 37.72 | 73.59 | 74.47 | 74.90 | 74.45 | 74.66 |
Xinhua | 72.10 | 94.72 | 84.39 | 83.28 | 96.26 | 96.51 | | 96.49 | 96.59 |
TunxunWb | 66.37 | 94.25 | 88.89 | 88.40 | | 98.87 | 98.87 | 98.87 | 98.87 |
SinaWb | 41.81 | | N/A | 64.43 | 66.90 | 77.86 | 85.34 | 67.65 | 80.74 |
SohuWb | 81.00 | 96.00 | N/A | 78.31 | | 97.51 | 97.71 | 98.00 | 97.69 |
Average | 65.96 | | N/A | 75.71 | 90.21 | 91.81 | 90.05 | 89.11 | 92.30 |
DataSet | CETR (%) | CEPR-AT (%) | CETD-QT (%) | CETD-Jsoup (%) | CEPF (%) | CEDP-TD (%) | CEDP-CTD (%) | CEDP-DSum (%) | CEDP-NLTD (%) |
---|---|---|---|---|---|---|---|---|---|
CleanEval-En | 87.20 | 68.00 | | 93.18 | 84.28 | 83.88 | 61.98 | 63.78 | 83.38 |
CleanEval-Zh | 87.48 | 72.67 | N/A | | 85.40 | 85.33 | 68.82 | 69.05 | 85.07 |
NY Post | 91.40 | 71.43 | 95.75 | | 88.07 | 87.28 | 87.30 | 87.33 | 87.39 |
Freep | 90.93 | 90.46 | | 98.98 | 96.29 | 95.76 | 95.19 | 96.38 | 95.65 |
Suntimes | 97.95 | 78.25 | | 98.95 | 93.44 | 93.55 | 94.04 | 94.03 | 93.74 |
Techweb | 97.68 | 87.31 | | 99.27 | 94.77 | 95.13 | 91.79 | 94.47 | 95.10 |
Tribune | 95.24 | 83.62 | N/A | | 94.12 | 93.51 | 93.56 | 93.50 | 93.58 |
Nytimes | 93.40 | 78.29 | 96.00 | | 88.78 | 87.74 | 85.91 | 85.30 | 87.85 |
BBC | 92.17 | 70.88 | | 98.41 | 87.41 | 87.15 | 86.67 | 87.03 | 87.01 |
Reuters | 92.75 | 77.01 | N/A | | 93.41 | 93.44 | 76.91 | 77.49 | 93.44 |
Yahoo | 94.22 | 78.86 | 97.08 | | 85.91 | 88.66 | 87.62 | 88.98 | 88.91 |
Sina | | 85.54 | N/A | 98.47 | 97.55 | 97.09 | 96.93 | 97.16 | 97.04 |
People | 97.89 | 79.12 | 99.70 | | 94.64 | 94.04 | 93.16 | 94.65 | 93.83 |
163 | 88.76 | 81.46 | N/A | | 87.24 | 87.45 | 87.31 | 86.81 | 87.49 |
Xinhua | 98.68 | 73.94 | 99.15 | | 93.93 | 93.22 | 93.12 | 93.41 | 93.11 |
TunxunWb | | 10.25 | 83.81 | 85.44 | 72.34 | 79.27 | 76.63 | 77.21 | 77.95 |
SinaWb | 94.56 | 11.10 | N/A | 95.70 | | 94.94 | 79.98 | 97.59 | 91.72 |
SohuWb | 94.33 | 89.53 | N/A | | 88.92 | 87.47 | 87.42 | 87.37 | 87.42 |
Average | 94.01 | 71.54 | N/A | | 90.23 | 90.27 | 85.80 | 87.31 | 89.98 |
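
Assuming the three result tables report F1, precision, and recall, in that order (an inference from the numbers, not something the tables state), their entries appear consistent with the usual harmonic-mean relation $F_1 = 2PR/(P+R)$. The short check below uses the CETR and CEPF entries for CleanEval-En taken from the tables; the function name `f1` is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, F1 as listed) for CleanEval-En, copied from the tables.
rows = {
    "CETR": (89.42, 87.20, 88.30),
    "CEPF": (92.93, 84.28, 88.39),
}

for name, (p, r, listed) in rows.items():
    # Print the recomputed value next to the listed one for comparison.
    print(f"{name}: recomputed {f1(p, r):.2f}, listed {listed:.2f}")
```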