
SCIENTIA SINICA Informationis, Volume 48, Issue 12: 1681-1696 (2018) https://doi.org/10.1360/N112018-00138

A personalized mail re-filtering system based on the client

  • Received May 28, 2018
  • Accepted Aug 22, 2018
  • Published Dec 4, 2018

Abstract


Funded by

National Natural Science Foundation of China (61672281)

National Natural Science Foundation of China (61472186)


References

[1] Messaging Anti-Abuse Working Group. MAAWG email metrics program. First Quarter 2006 Report. 2006. http://www.maawg.org/about/FINAL_1Q2006_Metrics_Report.pdf

[2] Teng W L, Teng W C. A personalized spam filtering approach utilizing two separately trained filters. In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Washington: IEEE Computer Society, 2008. 125--131.

[3] Lin H Z, Wang J L, Wu J P, et al. Effect of cold-rolling cladding on microstructure and properties of composite aluminum alloy foil. J Commun, 2017, 34: 121--132.

[4] Huang G W, Liu Y X, Chen Z. Personalized spam filtering method based on users' feedback. Electron Design Eng, 2014, 22: 53--56.

[5] Guzella T S, Caminhas W M. A review of machine learning approaches to spam filtering. Expert Syst Appl, 2009, 36: 10206--10222.

[6] Liu W Y, Wang T. Ensemble learning and active learning based personal spam email filtering. Comput Eng Sci, 2011, 33: 34--41.

[7] Clark J, Koprinska I, Poon J. Linger-a smart personal assistant for e-mail classification. In: Proceedings of the 13th International Conference on Artificial Neural Networks (ICANN'03), 2003. 274--277.

[8] Sahami M, Dumais S, Heckerman D, et al. A Bayesian approach to filtering junk e-mail. In: Proceedings of AAAI Workshop on Learning for Text Categorization, 1998. 62: 98--105.

[9] Graham P. Better Bayesian filtering. 2003. http://www.paulgraham.com/better.html

[10] Amayri O, Bouguila N. A study of spam filtering using support vector machines. Artif Intell Rev, 2010, 34: 73--108.

[11] Sanghani G, Kotecha K. Personalized spam filtering using incremental training of support vector machine. In: Proceedings of Conference on Computing, Analytics and Security Trends (CAST), Pune, 2016. 323--328.

[12] Yeh C Y, Wu C H, Doong S H. Effective spam classification based on meta-heuristics. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, 2005. 4: 3872--3877.

[13] Toolan F, Carthy J. Feature selection for spam and phishing detection. In: Proceedings of Conference on eCrime Researchers Summit (eCrime), Dallas, 2010. 1--12.

[14] Cheng V, Li C H. Personalized spam filtering with semi-supervised classifier ensemble. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. Washington: IEEE Computer Society, 2006. 195--201.

[15] Gomes H M, Barddal J P, Enembreck F, et al. A survey on ensemble learning for data stream classification. ACM Comput Surv, 2017, 50: 23.

[16] Wang S, Minku L L, Yao X. A systematic study of online class imbalance learning with concept drift. IEEE Trans Neural Netw Learning Syst, 2018, 29: 4802--4821.

[17] Syed N A, Liu H, Sung K K. Handling concept drifts in incremental learning with support vector machines. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 1999. 317--321.

[18] Wang Y W, Liu Y N, Feng L Z, et al. A novel online spam identification method based on user interest degree. J South China Univ Tech (Nat Sci Ed), 2014, 7: 21--27.

[19] Junejo K N, Karim A. Robust personalizable spam filtering via local and global discrimination modeling. Knowl Inf Syst, 2013, 34: 299--334.

[20] Cohen L, Avrahami-Bakish G, Last M. Real-time data mining of non-stationary data streams from sensor networks. Inf Fusion, 2008, 9: 344--353.

[21] Gama J, Medas P, Castillo G, et al. Learning with drift detection. In: Proceedings of Conference on Brazilian Symposium on Artificial Intelligence. Berlin: Springer, 2004. 286--295.

[22] Harel M, Mannor S, El-Yaniv R, et al. Concept drift detection through resampling. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, 2014. 1009--1017.

[23] Bach S H, Maloof M A. Paired learners for concept drift. In: Proceedings of the 8th IEEE International Conference on Data Mining, Pisa, 2008. 23--32.

[24] Xu Y, Xu R, Yan W, et al. Concept drift learning with alternating learners. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), Anchorage, 2017. 2104--2111.

[25] Wang J, Xu S, Duan B, et al. An ensemble classification algorithm based on information entropy for data streams. 2017. arXiv

[26] Mandelbaum A, Shalev A. Word embeddings and their use in sentence classification tasks. 2016. arXiv

[27] Sugiyama M, Nakajima S, Kashima H, et al. Direct importance estimation with model selection and its application to covariate shift adaptation. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2008. 1433--1440.

[28] Zhang K, Zheng V, Wang Q, et al. Covariate shift in Hilbert space: a solution via surrogate kernels. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, 2013. 388--395.

[29] Liu A, Ziebart B. Robust classification under sample selection bias. In: Proceedings of the Conference on Advances in Neural Information Processing Systems, Montreal, 2014. 37--45.

[30] Huang J, Gretton A, Borgwardt K M, et al. Correcting sample selection bias by unlabeled data. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2007. 601--608.

[31] Kawahara Y, Sugiyama M. Sequential change-point detection based on direct density-ratio estimation. Statistical Analy Data Min, 2012, 5: 114--127.

[32] Kanamori T, Hido S, Sugiyama M. Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection. In: Proceedings of Conference on Advances in Neural Information Processing Systems, Vancouver, 2009. 809--816.

[33] Kivinen J, Smola A J, Williamson R C. Online learning with kernels. IEEE Trans Signal Process, 2004, 52: 2165--2176.

[34] Junejo K N. Distribution shift resilient discrimination information space for SVM classification. In: Proceedings of the 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, 2017. 378--383.

[35] Han Y, He X, Yang M, et al. Chinese spam filter based on relaxed online support vector machine. In: Proceedings of Conference on Asian Language Processing (IALP), Harbin, 2010. 185--188.

[36] Sun G, Li S, Chen T, et al. Active learning method for Chinese spam filtering. Int J Performability Eng, 2017, 17: 511.

  • Algorithm 1   PRFC implementation
    Require: labeled samples $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$, unlabeled samples $\{{\rm TW}^{(i)}\}_{i=1}^{N_m}$, the parsed test email email, the start position $T_0$ and current position $T_1$ of LW, the acceptable misfiltering-rate threshold $\rho$ of model L, the confidence threshold $\xi$ for predicted labels, and the initialized filters Filter_inbox and Filter_junkbox;
    if rules based on email`From' or email`Re' can determine $y=0$ then
      return $y$;
    else
      Use the Filter_inbox filter;
      Subject-based filtering: vectorize email`Subject' as $x^s$;
      ${\rm TW}^{(N_m+1)} \leftarrow x^s$, ${\rm SW}^{(N_m+1)} \leftarrow {\rm TW}^{(1)}$;
      ${\rm SW}^{(i)} \leftarrow {\rm SW}^{(i+1)}$, ${\rm TW}^{(i)} \leftarrow {\rm TW}^{(i+1)}, i=1,\ldots,N_m$;
      if class imbalance occurs in ${\rm SW}$ then
        Learn the model parameters $w^i,w^j$ with MTFL;
        Update L;
      end if
      Incrementally train L with ${\rm SW}^{(N_m)}$;
      Update the weights $\{\alpha_i\}_{i=1}^{N_m}$ of $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$;
      if covariate shift is detected then
        Recompute $\{\alpha_i\}_{i=1}^{N_m}$;
      end if
      Update S with the weighted $\{{\rm SW}^{(i)}\}_{i=1}^{N_m}$;
      if ${\rm Err}(L) > {\rm Err}(S)$ and ${\rm Err}(L) > \rho$ then
        $L \leftarrow S$;
        $T_0 = T_1-N_m$;
      else
        $T_1 = T_1+1$;
      end if
      $L.{\rm predict}(x^s) \rightarrow [y,{\rm confidence}]$;
      if ${\rm confidence} > \xi$ then
        return $y$;
      else
        Body-based filtering: vectorize email`Body' as $x^b$;
        Similarly, repeat steps 6$\sim$24;
        return $y$;
      end if
    else
      Use the Filter_junkbox filter;
      Similarly, repeat steps 5$\sim$31;
    end if
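The window shift and the model replacement test in Algorithm 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the algorithm itself: `WindowState`, `slide_windows`, and `choose_model` are hypothetical names, the two windows are held as fixed-length deques, the error rates are passed in as plain numbers, and the way a sample acquires its true label when it moves from TW to SW is left out entirely.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WindowState:
    """Hypothetical container for the quantities named in Algorithm 1."""
    sw: deque  # labeled window SW, created with maxlen = N_m
    tw: deque  # unlabeled window TW, created with maxlen = N_m
    t0: int    # start position T_0 of LW
    t1: int    # current position T_1 of LW


def slide_windows(state, x_new):
    """Append x_new to TW and move the oldest TW sample to the end of SW.

    Both deques are bounded, so each append silently drops the oldest
    element, which plays the role of the index shift in the algorithm.
    """
    oldest_tw = state.tw[0]      # TW^(1), read before it is dropped
    state.tw.append(x_new)       # TW^(N_m+1) <- x, then shift
    state.sw.append(oldest_tw)   # SW^(N_m+1) <- TW^(1), then shift


def choose_model(model_l, model_s, err_l, err_s, rho, state, n_m):
    """Return the model to predict with, updating T_0 or T_1 in place."""
    if err_l > err_s and err_l > rho:
        # L is worse than S and worse than the tolerated rate: L <- S.
        state.t0 = state.t1 - n_m
        return model_s
    # Otherwise keep L and advance the current position.
    state.t1 += 1
    return model_l
```

With `sw = deque([1, 2, 3], maxlen=3)` and `tw = deque([4, 5, 6], maxlen=3)`, calling `slide_windows(state, 7)` leaves SW holding 2, 3, 4 and TW holding 5, 6, 7. Bounded deques are used here only because they make the shift a single append; any fixed-size buffer would serve equally well.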

  • Table 1   Experimental corpora
    Corpus       Normal   Spam    Total
    TREC 2006c   21766    42854   64620
    TREC 2007p   25220    50199   75419
  • Table 2   Multi-task vs. single task $^{\rm a)}$
    Method        G-mean ($\uparrow$)   $F1$ ($\uparrow$)   ($1-$ROCA)% ($\downarrow$)   Accuracy ($\uparrow$)
    Multi-task    0.9895                0.9921              0.0104                       0.9896
    Single task   0.9703                0.9775              0.0296                       0.9705
  • Table 3   Evaluating different algorithms on TREC 2006c and TREC 2007p $^{\rm a)}$
    Corpus: TREC 2006c
    Method       Accuracy ($\uparrow$)   FPR ($\downarrow$)   ($1-$ROCA)% ($\downarrow$)   lam% ($\downarrow$)
    DISvm [34]   0.9594                  0.0107               0.0383                       2.73
    ROSVM [35]   0.9935                  0.0036               0.0094                       0.34
    MLC [36]     0.9992                  0.0021               0.0004                       0.08
    PRFC         0.9984                  0.0013               0.0025                       0.12
    Corpus: TREC 2007p
    Method       Accuracy ($\uparrow$)   FPR ($\downarrow$)   ($1-$ROCA)% ($\downarrow$)   lam% ($\downarrow$)
    DISvm [34]   0.9658                  0.0087               0.0321                       2.10
    ROSVM [35]   0.9848                  0.0060               0.0108                       0.86
    MLC [36]     0.9855                  0.0056               0.0096                       0.64
    PRFC         0.9865                  0.0053               0.0068                       0.45
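
For readers unfamiliar with the evaluation criteria in Tables 2 and 3, the sketch below computes accuracy, FPR, G-mean, F1, and lam% from confusion counts using their commonly used definitions. These definitions are an assumption, since the tables do not spell them out: spam is taken as the positive class, G-mean as the geometric mean of the two per-class recalls, and lam% as the logistic average of the ham and spam misclassification rates as used in the TREC spam track. The function name `spam_metrics` is hypothetical, and (1-ROCA)% is omitted because it needs classifier scores rather than counts.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def spam_metrics(tp, fp, tn, fn):
    """Metrics from confusion counts, with spam as the positive class.

    Assumes every count is non-zero; lam% is undefined when either
    misclassification rate is exactly 0 or 1.
    """
    tpr = tp / (tp + fn)   # spam recall
    tnr = tn / (tn + fp)   # ham recall
    fpr = fp / (fp + tn)   # ham misclassification rate
    fnr = fn / (fn + tp)   # spam misclassification rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fpr": fpr,
        "g_mean": math.sqrt(tpr * tnr),
        "f1": 2 * precision * tpr / (precision + tpr),
        "lam_percent": 100.0 * inv_logit((logit(fpr) + logit(fnr)) / 2.0),
    }
```

One property worth noting under these definitions: when the ham and spam misclassification rates are equal, lam% equals that common rate, so it penalizes a filter that trades one error type heavily against the other.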