SCIENTIA SINICA Informationis, Volume 48, Issue 12: 1681-1696 (2018) https://doi.org/10.1360/N112018-00138

A personalized mail re-filtering system based on the client


  Received May 28, 2018
  Accepted Aug 22, 2018
  Published Dec 4, 2018







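    Algorithm 1 PRFC implementation

    Require: samples with true labels $\{{\rm SW}^{i}\}_{i=1}^{N_m}$, samples without true labels $\{{\rm TW}^{(i)}\}_{i=1}^{N_m}$, the parsed test email, the start position $T_0$ and current position $T_1$ of LW, the acceptable misfiltering-rate threshold $\rho$ of model L, the confidence threshold $\xi$ for predicted labels, and the initialized filters Filter_inbox and Filter_junkbox;

    if the rules can determine $y=0$ from email`From' or email`Re' then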

    return $y$;



    Subject-based filtering: vectorize email`Subject' as $x^s$;



    if class imbalance occurs in ${\rm SW}$ then



    end if



    if covariate shift is detected then


    end if


    if ${\rm Err}(L) > {\rm Err}(S)$ and ${\rm Err}(L) > \rho$ then





    end if


    if ${\rm~confidence}>\xi$ then

    return $y$;


    Body-based filtering: vectorize email`Body' as $x^b$;

    Similarly, repeat steps 6$\sim$24;

    return $y$;

    end if



    Similarly, repeat steps 5$\sim$31;

    end if
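
The control flow that survives in Algorithm 1 (a rule check on the sender and reply fields, then subject-based filtering, then body-based filtering when the subject prediction is not confident enough) can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions: `Email`, `cascade_filter`, and the toy stages are hypothetical names, the keyword scorers merely stand in for the trained filters, and the class-imbalance, covariate-shift, and model-switching branches are left out because their bodies are not given in the listing.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Email:
    """Hypothetical container for a parsed email."""
    sender: str
    subject: str
    body: str
    is_reply: bool = False


# A stage maps text to (predicted label, confidence in [0, 1]).
Stage = Callable[[str], Tuple[int, float]]


def cascade_filter(
    email: Email,
    rule: Callable[[Email], bool],
    subject_stage: Stage,
    body_stage: Stage,
    xi: float,
) -> Tuple[int, str]:
    """Return (label, name of the stage that made the decision)."""
    # Stage 1: rule-based shortcut that fixes y = 0.
    if rule(email):
        return 0, "rule"
    # Stage 2: subject-based prediction, accepted only if confidence > xi.
    y, confidence = subject_stage(email.subject)
    if confidence > xi:
        return y, "subject"
    # Stage 3: fall back to the body-based prediction.
    y, _ = body_stage(email.body)
    return y, "body"


# Toy stand-ins (assumptions for illustration, not the paper's models).
KNOWN_SENDERS = {"alice@example.com"}


def toy_rule(email: Email) -> bool:
    return email.is_reply or email.sender in KNOWN_SENDERS


def toy_subject_stage(text: str) -> Tuple[int, float]:
    spammy = any(word in text.lower() for word in ("free", "winner"))
    return (1, 0.9) if spammy else (0, 0.5)


def toy_body_stage(text: str) -> Tuple[int, float]:
    spammy = "click here" in text.lower()
    return (1 if spammy else 0, 0.8)
```

Calling `cascade_filter(email, toy_rule, toy_subject_stage, toy_body_stage, xi=0.7)` returns both the label and the deciding stage, which makes it easy to see how often the cheaper subject stage suffices before the body has to be vectorized.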

  • Table 1   Experimental corpora
    Corpus        Normal    Spam      Total
    TREC 2006c    21766     42854     64620
    TREC 2007p    25220     50199     75419
  • Table 2   Multi-task vs. single task $^{\rm~a)}$
    Method        G-mean ($\uparrow$)   $F1$ ($\uparrow$)   ($1-$ROCA)% ($\downarrow$)   Accuracy ($\uparrow$)
    Multi-task    0.9895                0.9921              0.0104                       0.9896
    Single task   0.9703                0.9775              0.0296                       0.9705
  • Table 3   Evaluating different algorithms on TREC 2006c and TREC 2007p $^{\rm~a)}$
    Corpus        TREC 2006c                                                      TREC 2007p
    Evaluation    Accuracy        FPR             ($1-$ROCA)%     lam%            Accuracy        FPR             ($1-$ROCA)%     lam%
    criteria      ($\uparrow$)    ($\downarrow$)  ($\downarrow$)  ($\downarrow$)  ($\uparrow$)    ($\downarrow$)  ($\downarrow$)  ($\downarrow$)
    DISvm [34]    0.9594          0.0107          0.0383          2.73            0.9658          0.0087          0.0321          2.10
    ROSVM [35]    0.9935          0.0036          0.0094          0.34            0.9848          0.0060          0.0108          0.86
    MLC [36]      0.9992          0.0021          0.0004          0.08            0.9855          0.0056          0.0096          0.64
    PRFC          0.9984          0.0013          0.0025          0.12            0.9865          0.0053          0.0068          0.45
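
Tables 2 and 3 report G-mean, $F1$, and lam% without restating their formulas here. The sketch below assumes the usual definitions: G-mean as the geometric mean of sensitivity and specificity, $F1$ as the harmonic mean of precision and recall, and lam% as the logistic average of the ham and spam misclassification rates used in the TREC Spam Track. The function names and the example counts are hypothetical, and the paper's exact conventions (for instance which class counts as positive) may differ.

```python
import math


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def logit(p: float) -> float:
    """Log-odds of a probability; requires 0 < p < 1."""
    return math.log(p / (1 - p))


def lam_percent(ham_miss_rate: float, spam_miss_rate: float) -> float:
    """Logistic average misclassification rate, in percent.

    Both rates are fractions strictly between 0 and 1; a rate of
    exactly 0 or 1 would need smoothing before taking the logit.
    """
    z = (logit(ham_miss_rate) + logit(spam_miss_rate)) / 2
    return 100 / (1 + math.exp(-z))


# Hypothetical confusion-matrix counts, for illustration only.
example_g_mean = g_mean(tp=90, fn=10, tn=80, fp=20)
example_f1 = f1_score(tp=90, fp=20, fn=10)
example_lam = lam_percent(ham_miss_rate=0.001, spam_miss_rate=0.1)
```

Because lam% averages in log-odds space, a filter cannot hide a poor spam miss rate behind a very low ham miss rate as easily as it could with an arithmetic mean, which is one reason the measure suits class-imbalanced corpora such as those in Table 1.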