
SCIENCE CHINA Information Sciences, Volume 60, Issue 8: 082104 (2017) https://doi.org/10.1007/s11432-016-0442-1

Embedding differential privacy in decision tree algorithm with different depths

  • Received: Oct 13, 2016
  • Accepted: Nov 14, 2016
  • Published: Jul 6, 2017

Abstract


Acknowledgment

This work was supported in part by National Natural Science Foundation of China (Grant Nos. 61525204, 61572322), Science and Technology Commission of Shanghai Municipality Project (Grant Nos. 14510722600, 16QA1402200), Aeronautical Science Foundation of China (Grant No. 20145557010), and NRF Singapore CREATE Program E2S2.

  • Figure 1

DP implementation skeleton of the decision tree algorithm. (a) Previous work; (b) this work.

  • Figure 2

    (Color online) Embedding DP in DT with different depths. (a) One-step computation (previous work); (b) two-step computation; (c) three-step computation; (d) whole tree computation.

  • Figure 3

    (Color online) Trace of $st.score$ for $mushroom$. (a) $\epsilon = 0.01$ or 0.1; (b) $\epsilon = 1.0$ or 10.

  • Figure 4

    (Color online) Accuracy of algorithms on $vote$ with different embedding depths and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  • Figure 5

    (Color online) Accuracy of algorithms on $mushroom$ with different embedding depths and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  • Figure 6

    (Color online) Accuracy of algorithms on $tic\text{-}tac\text{-}toe$ with different embedding depths and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  • Figure 7

    (Color online) Comparison of $Private$-$RDT$ and our algorithms on two datasets. (a) $vote$; (b) $mushroom$.

  • Figure 8

(Color online) Running time of algorithms on $mushroom$ with different numbers of instances and different privacy budgets. (a) $\epsilon=0.01$; (b) $\epsilon=0.1$; (c) $\epsilon=1.0$; (d) $\epsilon=10$.

  •   

    Algorithm 1 DP Embedding algorithm—DPEA($\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon$)

Require: $\mathcal{T}$—training dataset, $A$—attribute set, $C$—class label, $D$—maximal tree depth, $d$—steps of computation to embed DP, $\epsilon$—privacy budget;

Output: $dt$—a decision tree;

    $\epsilon' = \frac{\epsilon}{\lceil D/d \rceil + 1}$
    $dt$ = root
    $innerNodes$ = $\{\text{root}\}$, root.stopFlag = False
    while $|innerNodes| > 0$ do
        $innerNode$ = pop one node from $innerNodes$
        $left = D - \text{depth}(innerNode)$
        if $left < d$ then
            if $left \leq 2$ then
                subTree $st$ = exhaustive_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $left$, $\epsilon'$)
            else
                subTree $st$ = mcmc_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $left$, $\epsilon'$)
            end if
        else
            if $d \leq 2$ then
                subTree $st$ = exhaustive_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)
            else
                subTree $st$ = mcmc_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)
            end if
        end if
        for each leaf node $ln$ in $st$ do
            if $ln.stopFlag$ = False then
                $innerNodes = innerNodes \cup \{ln\}$
            end if
        end for
        extend $dt$ with $st$ from $innerNode$
    end while
    return $dt$
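A minimal Python sketch of Algorithm 1's control flow follows. The `Node` class, the `search_exhaustive`/`search_mcmc` callbacks, and all helper names are illustrative assumptions rather than the authors' implementation; the budget split matches the algorithm's first line.

```python
# Sketch of Algorithm 1 (DPEA). Node and the search callbacks are
# hypothetical stand-ins for the paper's data structures and routines.
import math
from collections import deque

class Node:
    def __init__(self, depth=0):
        self.depth = depth        # depth of this node in the growing tree
        self.stop_flag = False    # True once the leaf has been labelled
        self.children = []

def leaves_of(node):
    """Collect the leaves of the subtree rooted at node."""
    if not node.children:
        return [node]
    return [lf for ch in node.children for lf in leaves_of(ch)]

def dpea(T, A, C, D, d, eps, search_exhaustive, search_mcmc):
    """Grow a DP decision tree, embedding DP once every d levels."""
    # One exponential-mechanism draw per ceil(D/d) subtree searches,
    # plus one round of Laplace counts at the leaves.
    eps_prime = eps / (math.ceil(D / d) + 1)
    root = Node(depth=0)
    inner = deque([root])
    while inner:
        node = inner.popleft()
        left = D - node.depth            # depth still available below node
        step = left if left < d else d   # the last step may be shorter than d
        # Exhaustive enumeration is only feasible for shallow subtrees;
        # deeper steps fall back to the MCMC sampler (Algorithm 3).
        search = search_exhaustive if step <= 2 else search_mcmc
        # The callback is assumed to attach the chosen subtree under `node`
        # and to set each new leaf's depth and stop_flag, so no explicit
        # "extend dt with st" step appears here.
        st = search(node, T, A, C, D, step, eps_prime)
        for leaf in leaves_of(st):
            if not leaf.stop_flag:
                inner.append(leaf)       # keep expanding unfinished leaves
    return root
```

For example, with $D = 5$ and $d = 2$ the budget splits into $\lceil 5/2 \rceil + 1 = 4$ equal shares: one for each of the three subtree searches (levels 0-2, 2-4, and 4-5) and one for the leaf counts.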

  •   

    Algorithm 2 Exhaustively search the subtree space—exhaustive_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)

Require: $innerNode$—the node to split, $\mathcal{T}$—training dataset, $A$—attribute set, $C$—class label, $D$—maximal tree depth, $d$—steps of computation to embed DP, $\epsilon'$—privacy budget;

Output: $st$—a subtree;

    $ST[\,]$ = all ($d$+1)-layer subtrees rooted at $innerNode$ whose attributes do not appear in $innerNode$'s ancestors
    for each $ST[i] \in ST$ do
        calculate $ST[i].score$
        $P[i] = \exp\left(\frac{\epsilon' \cdot ST[i].score}{2\Delta q'}\right)$
    end for
    $sum = \sum_{i=0}^{|P|-1} P[i]$
    for each $P[i]$ do
        $P[i] = P[i]/sum$
    end for
    $r = random(0,1)$
    $st = ST[i]$ such that $r$ falls into the interval of $P[i]$
    for each leaf node $ln$ in $st$ do
        if $ln$ reaches the stopping condition then
            $\forall c \in C: N_c = \text{NoiseCount}(c\ \text{in}\ ln, \epsilon')$
            $ln.label = \arg\max_c(N_c)$
            $ln.stopFlag$ = True
        else
            $ln.stopFlag$ = False
        end if
    end for
    return $st$
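The core of Algorithm 2 is one draw from the exponential mechanism over the enumerated subtrees, followed by Laplace-noised class counts at the leaves. The sketch below shows both primitives under assumed names (`exp_mechanism`, `noise_count`); the score values and the sensitivity $\Delta q'$ are taken as given.

```python
# Sketch of the two DP primitives in Algorithm 2: an exponential-mechanism
# draw over candidate subtrees and a Laplace noisy count for leaf labels.
# Function names and the demo inputs are assumptions, not the paper's API.
import math
import random

def exp_mechanism(candidates, scores, eps_prime, delta_q):
    """Sample one candidate with probability proportional to
    exp(eps' * score / (2 * delta_q))."""
    # Shift by the max score before exponentiating; this avoids overflow
    # and leaves the sampling distribution unchanged after normalization.
    m = max(scores)
    weights = [math.exp(eps_prime * (s - m) / (2 * delta_q)) for s in scores]
    # "r falls into the interval of P[i]": walk the cumulative weights.
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]        # guard against floating-point round-off

def noise_count(true_count, eps_prime):
    """Laplace mechanism for a counting query (sensitivity 1): the
    difference of two i.i.d. Exp(eps') draws is Laplace(0, 1/eps')."""
    noise = random.expovariate(eps_prime) - random.expovariate(eps_prime)
    return true_count + noise

# Demo: higher-scoring subtrees are drawn more often as eps' grows.
best = exp_mechanism(["st0", "st1", "st2"], [0.4, 0.9, 0.6],
                     eps_prime=1.0, delta_q=1.0)
print(best, noise_count(42, eps_prime=1.0))
```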

  •   

    Algorithm 3 Search the subtree space with MCMC—mcmc_search_Exp($innerNode$, $\mathcal{T}$, $A$, $C$, $D$, $d$, $\epsilon'$)

Require: $innerNode$—the node to split, $\mathcal{T}$—training dataset, $A$—attribute set, $C$—class label, $D$—maximal tree depth, $d$—steps of computation to embed DP, $\epsilon'$—privacy budget;

Output: $st$—a subtree;

    $st$ = a random ($d$+1)-layer subtree rooted at $innerNode$ whose attributes do not appear in $innerNode$'s ancestors
    $t$ = $tag$ = 0
    while $buffer.variance > threshold$ and $t < stepLimit$ do
        compute $st.score$
        $buffer[(tag{+}{+})\ \%\ buffer.size] = st.score$
        $st'$ = randomly replace one inner node in $st$
        compute $st'.score$
        $st = st'$ with probability $\min\left(1, \exp\left(\frac{\epsilon'(st'.score - st.score)}{2\Delta q'}\right)\right)$
        $t$++
    end while
    for each leaf node $ln$ in $st$ do
        if $ln$ reaches the stopping condition then
            $\forall c \in C: N_c = \text{NoiseCount}(c\ \text{in}\ ln, \epsilon')$
            $ln.label = \arg\max_c(N_c)$
            $ln.stopFlag$ = True
        else
            $ln.stopFlag$ = False
        end if
    end for
    return $st$
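Algorithm 3 replaces exhaustive enumeration with a Metropolis-style walk whose stationary distribution matches the exponential mechanism's. The sketch below assumes hypothetical `score` and `propose` callbacks and, like the algorithm's loop condition, uses the variance of a sliding window of recent scores as the stopping test.

```python
# Sketch of the MCMC search in Algorithm 3: a Metropolis walk over
# subtrees targeting the exponential-mechanism distribution. The 'score'
# and 'propose' callbacks are hypothetical placeholders; 'propose' is
# assumed to return a fresh subtree with one inner node replaced.
import math
import random
from collections import deque

def mcmc_search(st, score, propose, eps_prime, delta_q,
                buffer_size=20, threshold=1e-3, step_limit=1000):
    buf = deque(maxlen=buffer_size)   # sliding window of recent scores
    s = score(st)
    for _ in range(step_limit):
        buf.append(s)
        # Stop once the recent scores have settled (low sample variance),
        # mirroring the "buffer.variance > threshold" loop condition.
        if len(buf) == buffer_size:
            mean = sum(buf) / buffer_size
            if sum((x - mean) ** 2 for x in buf) / buffer_size <= threshold:
                break
        st_new = propose(st)          # randomly replace one inner node
        s_new = score(st_new)
        # Metropolis acceptance min(1, exp(eps' * (s_new - s) / (2 * dq'))).
        # Accepting improvements outright also sidesteps exp() overflow.
        delta = eps_prime * (s_new - s) / (2 * delta_q)
        if delta >= 0 or random.random() < math.exp(delta):
            st, s = st_new, s_new
    return st
```

Because each step only evaluates the ratio of exponential-mechanism weights, the walk's stationary distribution is the same distribution that Algorithm 2 samples exactly; this is the usual argument for substituting MCMC when the candidate space is too large to enumerate.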

