
SCIENTIA SINICA Informationis, Volume 50 , Issue 12 : 1834(2020) https://doi.org/10.1360/N112019-00004

An approximately optimal disk repair algorithm for distributed storage systems

  • Received: Jan 4, 2019
  • Accepted: Aug 13, 2019
  • Published: Nov 20, 2020

Abstract


Funded by

National Key R&D Program of China (2018YFC0830900, 2018YFC0830903)



  • Figure 3

    (Color online) Broken disk repair model

  • Table 1   Symbol definition
    Symbol Definition
    $S_i$ The storage node $i$
    $n_t$ The number of access switches
    $n_{t_s}$ The number of switches associated with storage nodes participating in one repair process
    $n_{t_c}$ The number of switches associated with computational nodes participating in one repair process
    $\hat{n}_c$ The number of computational nodes
    $n_c$ The number of computational nodes participating in one repair process
    $n_s$ The number of storage nodes participating in one repair process
    $n_{s_i}$ The number of storage nodes connected with access switch $i$
    $n_{c_i}$ The number of computational nodes connected with access switch $i$
    $W$ The total amount of data waiting to be repaired
    $d_{s_i}$ The amount of data on storage node $i$ related to the broken disk
    $V_s$ The disk read rate
    $V_{l_i}$ Data transmission rate from the access switch $i$ to the core switch
    $V_{m_i}$ Data transmission rate from the core switch to the access switch $i$
    $V_h$ The maximum throughput of the core switch
    $V_{t_i}$ The maximum throughput of the access switch $i$
  •   

    Algorithm 1 The optimal self-repair ratio algorithm

    Require: $(k, W, V_{s_i}, V_{c_i}, n_s, \hat{n}_c, d_{s_i})$;
    Output: $(\alpha_1, \alpha_2, \ldots, \alpha_{n_s})$;
    $f(\beta, n_c) = \max\left\{\frac{(k-1)W\beta}{\sum_{i=1}^{n_s} V_{s_i}},\ \frac{kW(1-\beta)}{\sum_{i=n_s+1}^{n_c} V_{c_i}}\right\}$;
    $(f_{\min}, \beta_{\min}, n_{c_{\min}}, \mathrm{step}) = (\infty, 0, 0, 0.01)$;
    for each $n_{c_i} \in [n_s, \hat{n}_c]$
      for $\beta_v = 0$; $\beta_v \leq 1$; $\beta_v += \mathrm{step}$
        if $f(\beta_v, n_{c_i}) < f_{\min}$ then
          $(f_{\min}, \beta_{\min}, n_{c_{\min}}) = (f(\beta_v, n_{c_i}), \beta_v, n_{c_i})$;
        end if
      end for
    end for
    $(\beta, n_c) = (\beta_{\min}, n_{c_{\min}})$;

    for each $i \in [1, n_s]$
      if $\frac{W\beta V_{s_i}}{\sum_{j=1}^{n_s} V_{s_j}} \geqslant d_{s_i}$ then
        $\alpha_i = 1$;
      else
        $\alpha_i = \frac{W\beta V_{s_i}}{\left(\sum_{j=1}^{n_s} V_{s_j}\right) d_{s_i}}$;
      end if
    end for
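    A minimal Python sketch of this grid search is given below. The function and variable names (repair_time, optimal_self_repair_ratio, storage_rates for the $V_{s_i}$, compute_rates for the rates of the additional compute nodes, d_s for the $d_{s_i}$) are illustrative assumptions, not the paper's implementation; it assumes at least one participating storage node and follows the fixed-step search over $\beta$ and $n_c$ shown above, then derives the per-node ratios $\alpha_i$.

    def repair_time(beta, k, W, storage_rates, compute_rates):
        # Completion time of the two parallel repair paths: the self-repair path
        # reads (k-1) surviving blocks per repaired block on the storage nodes,
        # while the offloaded path ships k blocks per repaired block to the
        # compute nodes.  The overall time is the slower (max) of the two.
        t_self = (k - 1) * W * beta / sum(storage_rates)
        if beta < 1.0:
            t_offload = (k * W * (1 - beta) / sum(compute_rates)
                         if compute_rates else float("inf"))
        else:
            t_offload = 0.0
        return max(t_self, t_offload)

    def optimal_self_repair_ratio(k, W, storage_rates, compute_rates, d_s, step=0.01):
        # Grid search over beta (self-repair fraction) and the number of extra
        # compute nodes, mirroring the double loop of Algorithm 1.
        best_time, best_beta, best_extra = float("inf"), 0.0, 0
        for n_extra in range(len(compute_rates) + 1):      # n_c = n_s + n_extra
            beta = 0.0
            while beta <= 1.0 + 1e-9:
                t = repair_time(beta, k, W, storage_rates, compute_rates[:n_extra])
                if t < best_time:
                    best_time, best_beta, best_extra = t, beta, n_extra
                beta += step
        # Split the self-repair share W*beta across the storage nodes in
        # proportion to their read rates, capping each node at its own
        # broken-disk data d_s[i], so that every alpha_i lies in [0, 1].
        total_rate = sum(storage_rates)
        alphas = []
        for rate, d in zip(storage_rates, d_s):
            share = W * best_beta * rate / total_rate
            alphas.append(1.0 if share >= d else share / d)
        return best_beta, len(storage_rates) + best_extra, alphas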

  •   

    Algorithm 2 The repair task scheduling algorithm

    Find all stripe IDs (stripeId) affected by the broken disk diskId;
    Let diskWeight be a map whose keys are disk IDs and whose values are the numbers of blocks on those disks involved in the repair;
    for all stripeId in the affected stripes
      for each $i \in \{1, \ldots, n\}$
        if adding block $i$ to the map diskWeight would enlarge the map (its disk is not yet a key) then
          Choose block $i$ for stripe repair;
        else
          Skip block $i$;
        end if
      end for
      if the number $\hat{k}$ of chosen blocks satisfies $\hat{k} < k$ then
        Walk through all blocks of stripe stripeId and randomly choose $k - \hat{k}$ blocks that have not been chosen yet for stripe repair;
      end if
      Add the chosen $k$ blocks to the map diskWeight;
    end for
    According to Algorithm 1, compute the approximately optimal repair ratios $(\alpha_1, \alpha_2, \ldots, \alpha_{n_s})$;
    Generate the repair task list from the repair ratios and wait for compute nodes to acquire tasks;
    Via the socket mechanism, run the following event loop;
    repeat
      if the event is "acquire repair task" then
        Decode the request body and get the node index $i$;
        Run allocTask$(i)$ (see Algorithm 3);
      else if the event is "finish repaired task" then
        Decode the request body and get the node index $i$;
        Run finishTask$(i)$ (see the finishTask algorithm);
      else
        Handle other events;
      end if
    until the socket is closed.
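    The block-selection step of Algorithm 2 can be sketched in Python as follows. The stripe_ids list and the blocks_of(stripe_id) helper, presumed here to return the surviving (block_id, disk_id) pairs of a stripe, are illustrative assumptions; the subsequent ratio computation and the socket-based dispatch loop are not repeated.

    import random
    from collections import defaultdict

    def select_repair_blocks(stripe_ids, blocks_of, k):
        # For every stripe touched by the broken disk, choose k surviving blocks
        # to read, preferring blocks on disks that carry no repair load yet
        # (adding such a block enlarges the diskWeight map, as in Algorithm 2).
        disk_weight = defaultdict(int)     # disk id -> number of repair reads
        plan = {}                          # stripe id -> chosen block ids
        for stripe_id in stripe_ids:
            blocks = blocks_of(stripe_id)  # surviving (block_id, disk_id) pairs
            chosen, used = [], set(disk_weight)
            for block_id, disk_id in blocks:
                if len(chosen) == k:
                    break
                if disk_id not in used:
                    chosen.append((block_id, disk_id))
                    used.add(disk_id)
            if len(chosen) < k:
                # Not enough blocks on "fresh" disks: fill up at random,
                # assuming the stripe still has at least k surviving blocks.
                remaining = [b for b in blocks if b not in chosen]
                chosen += random.sample(remaining, k - len(chosen))
            for _, disk_id in chosen:
                disk_weight[disk_id] += 1
            plan[stripe_id] = [block_id for block_id, _ in chosen]
        return plan, disk_weight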

  •   

    Algorithm 3 allocTask(int $i$)

    $n_s = $ len(diskWeight);
    $c_s = n_c - n_s$; let $c_s$ be the number of compute nodes that do not handle self-repair tasks;
    if $i$ is a key of the map diskWeight then
      Let $\hat{\alpha}_i$ be the ratio of self-repair tasks already allocated to node $i$;
      if $\hat{\alpha}_i < \alpha_i$ then
        return a self-repair task for node $i$;
      else if $\alpha_i \leq \hat{\alpha}_i < 1$, and the time since the last task allocation for diskWeight$[i]$ exceeds a particular threshold then
        return a self-repair task for node $i$;
      else if $\alpha_i \leq \hat{\alpha}_i < 1$, and the time since the last task allocation for diskWeight$[i]$ is below that threshold then
        return nil;
      end if
    else if node $i$ is one of the $c_s$ compute nodes then
      return a random repair task;
    end if
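    The dispatch rule of allocTask might look as follows in Python. The bookkeeping structures (per-node self_repair_tasks lists, allocated_ratio for $\hat{\alpha}_i$, last_alloc timestamps, the other_tasks queue) and the idle_threshold value are illustrative assumptions layered on top of the rule above.

    import time

    def alloc_task(i, disk_weight, alphas, allocated_ratio, last_alloc,
                   self_repair_tasks, other_tasks, idle_threshold=5.0):
        # Storage nodes are fed their own share of self-repair tasks first;
        # once a node has reached its ratio alpha_i it only receives more if it
        # has been idle longer than idle_threshold seconds; the remaining
        # compute nodes take arbitrary pending repair tasks.
        now = time.time()
        if i in disk_weight:                        # node i holds broken-disk data
            if allocated_ratio[i] < alphas[i]:
                last_alloc[i] = now
                return self_repair_tasks[i].pop()
            if alphas[i] <= allocated_ratio[i] < 1.0:
                if now - last_alloc[i] > idle_threshold:
                    last_alloc[i] = now             # node looks idle: top it up
                    return self_repair_tasks[i].pop()
                return None                         # wait; its quota is filled
            return None                             # this node's share is complete
        # Otherwise node i is one of the c_s pure compute nodes.
        return other_tasks.pop() if other_tasks else None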